Solving hard problems

We help companies solve hard problems in mathematics, statistics, and computing. Let’s explore how we might work together.

Experiences with Nvidia

Posted on 9 July 2025 by Wayne Joubert

Our team started working within Nvidia in early 2009 at the beginning of the ORNL Titan project. Our Nvidia contacts dealt with applications, libraries, programming environment and performance optimization. First impressions were that their technical stance on issues was very reasonable. One obscure example: in a C++ CUDA kernel were you allowed to use “enums,” and the answer would be, of course, yes, we would allow that. This was unlike some other companies that might have odd and cumbersome programming restrictions in their parallel programming models (though by now this has become a harder problem for Nvidia since there are so many software products a user might want to interoperate).

Another example, with a colleague at Nvidia on the C++ standards committee, to whom I mentioned, it might be too early to lock this certain feature design into the standard since hardware designs are still rapidly changing. His response was, Oh, yes, we think exactly the same thing. So in short, their software judgments and decisions generally seem to be well thought out, reasonable and well informed. It sounds simple, but it is amazing how many companies have gotten this wrong.

Nvidia has made good strategic decisions. In the 2013 time frame, Intel was becoming a competitive threat with the Xeon Phi processor. Intel was several times larger than Nvidia with huge market dominance. In response, Nvidia formed a partnership with IBM–itself several times larger than Intel at the time. This came to fruition in the ORNL Summit system in 2018. In the meantime, the Xeon Phi’s OpenMP programming model, though standards-based, turned out to be difficult to write optimized code for, and Nvidia CUDA captured market share dominance of accelerated user software. Intel eventually dropped the Xeon Phi product line.

In the early 2000s, Nvidia went all-in on CUDA. I’ve heard some project teams say they would never use CUDA, because it is nonstandard and too low-level. Many have turned back on this decision. Of course, it is often possible to write an abstraction layer on top of CUDA to make it easier to use and maintain. Also newer programming models like Kokkos can be helpful.

Nvidia also made a prescient decision early to bet big on AI. A little later they decided to go all in on developing a huge number of software libraries is to enable access to many new markets. A huge moat. AMD is trying hard to improve their software processes and catch up.

On the downside, Nvidia high prices are upsetting to many, from gamers to buyers of the world’s largest HPC systems. Competition from AMD and others is a good thing.

And Nvidia marketing speak is sometimes confusing. A comparison was once made claiming that a four GPU system was more powerful than one of the world’s top CPU-only supercomputers on a very specific science problem. I’d like to see the details of that comparison. Also, different figures are being given on how long it took to stand up xAI’s Colossus supercomputer, from 19 days to 122 days. One has to dig a little to find out what these figures mean. Also it was widely reported last year that the GB200 NVL72 GPU was “30 times faster” than H100, but this only refers to certain operations, not key performance measures like flops per watt.

Those are my takes. For more perspectives, see Tae Kim’s excellent book, The Nvidia Way, or this interview.

Thoughts on Nvidia? Please leave in the comments.

Legendre polynomials

Posted on 7 July 2025 by John

The previous post mentioned Legendre polynomials. This post will give a brief introduction to these polynomials and a couple hints of how they are used in applications.

One way to define the Legendre polynomials is as follows.

P₀(x) = 1
P_k are orthogonal on [−1, 1].
P_k(1) = 1 for all k ≥ 0.

The middle bullet point means

$\int_{-1}^1 P_m(x) P_n(x) \, dx = 0$

if m ≠ n. The requirement that each P_k is orthogonal to each of its predecessors determines P_k up to a constant, and the condition P_k(1) = 1 determines this constant.

Here’s a plot of the first few Legendre polynomials.

There’s an interesting pattern that appears in the white space of a graph like the one above when you plot a large number of Legendre polynomials. See this post.

The Legendre polynomial P_k(x) satisfies Legendre’s differential equation; that’s what motivated them.

$(1 - x^2) y^{\prime\prime} -2x y^\prime + k(k+1)y = 0$

This differential equation comes up in the context of spherical harmonics.

Next I’ll describe a geometric motivation for the Legendre polynomials. Suppose you have a triangle with one side of unit length and two longer sides of length r and y.

You can find y in terms of r by using the law of cosines:

$y = \sqrt{1 - 2r \cos \theta + r^2}$

But suppose you want to find 1/y in terms of a series in 1/r. (This may seem like an arbitrary problem, but it comes up in applications.) Then the Legendre polynomials give you the coefficients of the series.

$\frac{1}{y} = \frac{P_0(\cos\theta)}{r} + \frac{P_1(\cos\theta)}{r^2} + \frac{P_2(\cos\theta)}{r^3} + \cdots$

Source: Keith Oldham et al. An Atlas of Functions. 2nd edition.

The biggest perturbation of satellite orbits

Posted on 7 July 2025 by John

To first approximation, a satellite orbiting the earth moves in an elliptical orbit. That’s what would get from solving the two-body problem: two point masses orbiting their common center of mass, subject to no forces other than their gravitational attraction to each other.

But the earth is not a point mass. Neither is a satellite, though that’s much less important. The fact that the earth is not exactly a sphere but rather an oblate spheroid is the root cause of the J2 effect.

The J2 effect is the largest perturbation of a satellite orbit from a simple elliptical orbit, at least for satellites in low earth orbit (LEO) and medium earth orbit (MEO). The J2 effect is significant for satellites in higher orbits, though third body effects are larger.

Legendre showed that the gravitational potential of an axially symmetric planet is given by

$V(r, \phi) = \frac{Gm}{r} \left( 1 - \sum_{k=2}^\infty J_k \left( \frac{r_{eq}}{r}\right)^k P_k(\cos \phi) \right)$

Here (r, φ) are spherical coordinates. There’s no θ term because we assume the planet, and hence its gravitational potential, is axially symmetric, i.e. independent of θ. The term r_eq is the equatorial radius of the planet. The P_k are Legendre polynomials.

For a moderately oblate planet, like the one we live on, the J₂ coefficient is much larger than the others, and neglecting the rest of the coefficients gives a good approximation [1].

Here are the first few coefficients for Earth [2].

$\begin{align*} J_2 &= \phantom{-}0.00108263 \\ J_3 &= -0.00000254 \\ J_4 &= -0.00000161 \end{align*}$

Note that J₂ is three orders of magnitude smaller than 1, and so the J2 effect is small. And yet it matters a great deal. The longitude of the point at which a satellite crosses the equatorial plane may vary a few degrees per day. The rate of precession is approximately proportional to J₂.

The value of J₂ for Mars is about twice that of Earth (0.001960454). The largest J₂ in our solar system is Neptune, which is about three times that of Earth (0.003411).

There are many factors left out of the assumptions of the two body problem. The J2 effect doesn’t account for everything that has been left out, but it’s the first refinement.

Transmission Obstacles and Ellipsoids

Posted on 5 July 2025 by John

Suppose you have a radio transmitter T and a receiver R with a clear line of sight between them. Some portion the signal received at R will come straight from T. But some portion will have bounced off some obstacle, such as the ground.

The reflected radio waves will take a longer path than the waves that traveled straight from T to R. The worst case for reception is when the waves traveling a longer path arrive half a period later, i.e. 180° out of phase, canceling out part of the signal that was received directly.

We’d like to describe the region of space that needs to be empty in order to eliminate destructive interference, i.e. signals 180° out of phase. Suppose T and R are a distance d apart and the wavelength of your signal is λ. An obstacle at a location P can cause signals to arrive exactly out of phase if the distance from T to P plus the distance from P to R is d + λ/2.

So we’re looking for the set of all points such that the sum of their distances to fixes points is a constant. This is the nails-and-string description of an ellipse, where the nails are a distance d apart and the string has length d + λ/2.

That would be a description of the region if we were limited to a plane, such as a plane perpendicular to the ground and containing the transmitter and receiver. But signals could reflect off an obstacle that’s outside this plane. So now we need to imagine being able to move the string in three dimensions. We still get all the points we’d get if we were restricted to a plane, but we also get their rotation about the axis running between T and R.

The region we’re describing is an ellipsoid, known as a Fresnel ellipsoid or Fresnel zone.

Suppose we choose our coordinates so that our transmitter T is located at (0, 0, h) and our receiver R is located at (d, 0, h). We imagine a string of length d + λ/2 with endpoints attached to T and R. We stretch the string so it consists of two straight segments. The set of all possible corners in the string traces out the Fresnel ellipsoid.

Greater delays

If reflected waves are delayed by exactly one period, they reinforce the portion of the signal that arrived directly. Signals delayed by an even multiple of a half-period cause constructive interference, but signals delayed by odd multiples of a half-period cause destructive interference. The odd multiples matter most because we’re more often looking to avoid destructive interference rather than seeking out opportunities for constructive interference.

If you repeat the exercise above with a string of length length d + λ you have another Fresnel ellipsoid. The foci remain the same, i.e. T and R, but this new ellipsoid is bigger since the string is longer. This ellipsoid represents locations where a signal reflected at that point will arrive one period later than a signal traveling straight. Obstacles on the surface of this ellipsoid cause constructive interference.

We can repeat this exercise for a string of length d + nλ/2, where odd values of n correspond to regions of destructive interference. This gives us a set of confocal ellipsoids known as the Fresnel ellipsoids.

Zooming in on a fractalish plot

Posted on 29 June 2025 by John

The exponential sum of the day page on my site draws an image every day by plugging the month, day, and year into a formula. Details here.

Today’s image looks almost solid blue in the middle.

The default plotting line width works well for most days. For example, see what the sum of the day will look line on July 1 this year. Making the line width six times thinner reveals more of the detail in the middle.

You can see even more detail in a PDF of the plot.

Most ints are not floats

Posted on 27 June 2025 by John

All integers are real numbers, but most computer representations of integers do not equal computer representations of real numbers.

To make the statement above precise, we have to be more specific about what we mean by computer integers and floating point numbers. I’ll use int32 and int64 to refer to 32-bit and 64-bit signed integers. I’ll use float32 and float64 to refer to IEEE 754 single precision and double precision numbers, what C calls float and double.

Most int32 numbers cannot be represented exactly by a float32. All int32 numbers can be represented approximately as float32 numbers, but usually not exactly. The previous statements remain true if you replace “32” everywhere by “64.”

32 bit details

The int32 data type represents integers −2³¹ through 2³¹ − 1. The float32 data type represents numbers of the form

± 1.f × 2^e

where 1 bit represents the sign, 23 bits represent f, and 8 bits represent e.

The number n = 2²⁴ can be represented by setting the fractional part f to 0 and setting the exponent e to 24. But the number n + 1 cannot be represented as a float32 because its last bit would fall off the end of f:

2²⁴ + 1 = (1 + 2⁻²⁴) 2²⁴ = 1.000000000000000000000001_two × 2²⁴

The bits in f fill up representing the 23 zeros after the decimal place. There’s no 24th bit to store the final 1.

For each value of e, there are 2²³ possible values of f. So for each of e = 24, 25, 26, …, 31 there are 2²³ representable integers, for a total of 8 × 2²³.

So of the 2³¹ non-negative integers that can be represented by an int32 data type, only 9 × 2²³ can also be represented exactly as a float32 data type. This means about 3.5% of 32-bit integers can be represented exactly by a 32-bit float.

64 bit details

The calculations for 64-bit integers and 64-bit floating point numbers are analogous. Numbers represented by float64 data types have the form

± 1.f × 2^e

where now f has 52 bits and e has 11.

Of the 2⁶³ non-negative integers that can be represented by an int64 data type, only 11 × 2⁵² can also be represented exactly as a float64 data type. This means about 0.5% of 64-bit integers can be represented exactly by a 64-bit float.

A note on Python

Python’s integers have unlimited range, while its floating point numbers correspond to float64. So there are two reasons an integer might not be representable as a float: it may be larger than the largest float, or it may require more than 53 bits of precision.

Test whether a large integer is a square

Posted on 26 June 2025 by John

Years ago I wrote about a fast way to test whether an integer n is a square. The algorithm rules out a few numbers that cannot be squares based on their last (hexadecimal) digit. If the the integer passes through this initial screening, the algorithm takes the square root of n as a floating point number, rounds to an integer, and tests whether the square of the rest equals n.

What’s wrong with the old algorithm

The algorithm described above works fine if n is not too large. However, it implicitly assumes that the integer can be represented exactly as a floating point number. This is two assumptions: (1) it’s not too large to represent, and (2) the representation is exact.

A standard 64-bit floating point number has 1 bit for the sign, 11 bits for the exponent, and 52 bits for the significand. (More on that here.) Numbers will overflow if they run out of exponent bits, and they’ll lose precision if they run out of significand bits. So, for example, 2¹⁰²⁴ will overflow [1]. And although 2⁵³ + 1 will not overflow, it cannot be represented exactly.

These are illustrated in Python below.

    >>> float(2**1024)
    OverflowError: int too large to convert to float
    >>> n = 2**53
    >>> float(n) == float(n + 1)
    True

A better algorithm

The following algorithm [2] uses only integer operations, and so it will work exactly for arbitrarily large integers in Python. It’s a discrete analog of Newton’s square root algorithm.

    def intsqrt(N):
        n = N
        while n**2 > N:
            n = (n + N//n) // 2
        return n

    def issquare(N):
        n = intsqrt(N)
        return n**2 == N

The function issquare will test whether N is a square. The function intsqrt is broken out as a separate function because it is independently useful since it returns ⌊√N⌋.

The code above correctly handles examples that the original algorithm could not.

    >>> issquare(2**1024)
    True
    >>> issquare(2**54)
    True
    >>> issquare(2**54 + 1)
    False

You could speed up issquare by quickly returning False for numbers that cannot be squared on account of their final digits (in some base—maybe you’d like to use a base larger than 16) as in the original post.

[1] The largest allowed exponent is 1023 for reasons explained here.

[2] Bob Johnson. A Route to Roots. The Mathematical Gazette, Vol. 74, No. 469 (Oct., 1990), pp. 284–285

Legendre and Ethereum

Posted on 26 June 2025 by John

What does an eighteenth century French mathematician have to do with the Ethereum cryptocurrency?

A pseudorandom number generator based on Legendre symbols, known as Legendre PRF, has been proposed as part of a zero knowledge proof mechanism to demonstrate proof of custody in Ethereum 2.0. I’ll explain what each of these terms mean and include some Python code.

The equation x² = a mod p is solvable for some values of a and not for others. The Legendre symbol

$\left(\frac{a}{p}\right)$

is defined to be 1 if a has a square root mod p, and −1 if it does not. Here a is a positive integer and p is a (large) prime [1]. Note that this is not a fraction, though it looks like one.

As a varies, the Legendre symbols vary randomly. Not literally randomly, of course, but the act random enough to be useful as a pseudorandom number generator.

Legendre PRF

Generating bits by computing Legendre symbols is a lot more work than generating bits using a typical PRNG, so what makes the Legendre PRF of interest to Ethereum?

Legendre symbols can be computed fairly efficiently. You wouldn’t want to use the Legendre PRF to generate billions of random numbers for some numerical integration computation, but for a zero knowledge proof you only need to generate a few dozen bits.

To prove that you know a key k, you can generate a string of pseudorandom bits that depend on the key. If you can do this for a few dozen bits, it’s far more likely that you know the key than that you have guessed the bits correctly. Given k, the Legendre PRF computes

$L_k(x) = \frac{1}{2}\left( 1 - \left( \frac{k+x}{p}\right) \right)$

for n consecutive values of x [2].

One reason this function is of interest is that it naturally lends itself to multiparty computation (MPC). The various parties could divide up the range of x values that each will compute.

The Legendre PRF has not been accepted to be part of Ethereum 2.o. It’s only been proposed, and there is active research into whether it is secure enough.

Python code

Here’s a little Python scrypt to demonstrate using L_k(x).

    from sympy import legendre_symbol

    def L(x, k, p):
        return (1 - legendre_symbol(x + k, p)) // 2

    k = 20250626
    p = 2**521 - 1
    print([L(x, k, p) for x in range(10)])

This produces the following.

    [1, 1, 1, 1, 0, 0, 1, 0, 0, 1]

[1] There is a third possibility: the Legendre symbol equals 0 if a is a multiple of p. We can safely ignore this possibility for this post.

[2] The Legendre symbol takes on the values ±1 and so we subtract this from 1 to get values {0, 2}, then divide by 2 to get bits {0, 1}.

Patching functions together

Posted on 25 June 2025 by John

The previous post looked at a function formed by patching together the function

f(x) = log(1 + x)

for positive x and

f(x) = x

for negative x. The functions have a mediocre fit, which may be adequate for some applications, such as plotting data points, but not for others, such as visual design. Here’s a plot.

There’s something not quite smooth about the plot. It’s continuous, even differentiable, at 0, but still there’s sort of a kink at 0.

The way to quantify smoothness is to count how many times you can differentiate a function and have a continuous result. We would say the function above is C¹ but not C² because its first derivative is continuous but not differentiable. If two functions have the same number of derivatives, you can distinguish their smoothness by looking at the size of the derivatives. Or if you really want to get sophisticated, you can use Sobolev norms.

As a rule of thumb, your vision can detect a discontinuity in the second derivative, maybe the third derivative in some particularly sensitive contexts. For physical applications, you also want continuous second derivatives (acceleration). For example, you want a roller coaster track to be more than just continuous and once differentiable.

Natural cubic splines are often used in applications because they patch cubic polynomials together so that the result is C².

You can patch together smooth functions somewhat arbitrarily, but not analytic functions. If you wanted to extend the function log(1 + x) analytically, your only choice is log(1 + x). More on that here.

If you patch together a flat line and a circle, as in the corners of a rounded rectangle, the curve is continuous but not C². The second derivative jumps from 0 to 1/r where r is the radius of the circle. A squircle is similar to a rounded rectangle but has smoothly varying curvature.

Log-ish

Posted on 24 June 2025 by John

I saw a post online this morning that recommended the transformation

$f(x) = \left\{ \begin{array}{ll} \log(1+x) & \mbox{if } x > 0 \\ x & \mbox{if } x \leq 0 \end{array} \right.$

I could see how this could be very handy. Often you want something like a logarithmic scale, not for the exact properties of the logarithm but because it brings big numbers closer in. And for big values of x there’s little difference between log(x) and log(1 + x).

The function above is linear near the origin, literally linear for negative values and approximately linear for small positive values.

I’ve occasionally needed something like a log scale, but one that would handle values that dip below zero. This transformation would be good for that. If data were equally far above and below zero, I’d use something like arctangent instead.

Notice that the function has a noticeable bend around 0. The next post addresses this.

Solving hard problems

Experiences with Nvidia

Legendre polynomials

Related posts

The biggest perturbation of satellite orbits

More orbital mechanics posts

More Legendre posts

Transmission Obstacles and Ellipsoids

Greater delays

Related posts

Zooming in on a fractalish plot

Most ints are not floats

32 bit details

64 bit details

A note on Python

Related posts

Test whether a large integer is a square

What’s wrong with the old algorithm

A better algorithm

Legendre and Ethereum

Legendre PRF

Python code

Related posts

Patching functions together

Related posts

Log-ish