Scientific papers: innovation … or imitation?

Sometimes a paper comes out that has the seeds of a great idea that could lead to a whole new line of pioneering research. But, instead, nothing much happens, except imitative works that do not push the core idea forward at all.

For example, the McCulloch-Pitts paper from 1943 showed how neural networks could represent arbitrary logical or Boolean expressions of a certain class. The paper was well-received at the time, brilliantly executed by co-authors with diverse expertise in neuroscience, logic, and computing. Had its significance been fully grasped, this paper might have, at least notionally, formed a unifying conceptual bridge between the two nascent schools of connectionism and symbolic AI (one can at least hope). But instead, the heated conflict in viewpoints in the field has persisted, even to this day.

Another example is George Miller’s 7 +/- 2 paper. This famous result showed humans are able to hold only a small number of pieces of information in mind at the same time while reasoning.  This paper was important not just for the specific result, but for the breakthrough in methodology using rigorous experimental noninvasive methods to discover how human thinking works—a topic we know so little about, even today. However, the followup papers by others, for the most part, only extended or expanded on the specific finding in very minor ways. [1] Thankfully, Miller’s approach did eventually gain influence in more subtle ways.

Of course it’s natural, given the incentive structures of publishing, that many papers would be primarily derivative rather than original. It’s not a bad thing that, when a pioneering paper comes out, others very quickly write rejoinder papers containing evaluations or minor tweaks of the original result. Not bad, but sometimes we miss the larger implications of the original result and get lost in the details.

Another challenge is stovepiping—we get stuck in our narrow swim lanes for our specific fields and camps of research. [2] We don’t see the broader implications, such as connections and commonalities across fields that could lead to fruitful new directions.

Thankfully, current research in AI shows at least some mix of innovation and imitation. Inspired in part by the accelerationist mindset, many new papers appear every day, some with significant new findings and others that are more modest riffs on previous papers.

Notes

[1] Following this line of research on human thought processes could be worthwhile for various reasons. For example, some papers in linguistics state that Chomsky’s vision for a universal grammar is misguided because the common patterns in human language are entirely explainable by the processing limitations of the human mind. But this claim is made with no justification or methodological rigor of any kind. If I claimed a CPU performs vector addition or atomic operations efficiently because of “the capabilities of the processor,” I would need to provide some supporting evidence, for example, documenting that the CPU has vector processing units or specialized hardware for atomics. The assertion that language structure is shaped by the human mental processing faculty is just an empty truism unless it is supported by some amount of scientific rigor and free of the common fallacies of statistical reasoning.

[2] I recently read a paper in linguistics with apparent promise, but the paper totally misconstrued the relationship between Shannon entropy and Kolmogorov complexity. Sadly this paper passed review in a linguistics journal; if it had had a mathematically inclined reviewer, the problem would have been caught and fixed.


Binomial number system

I just stumbled across the binomial number system in Exercise 5.38 of Concrete Mathematics. The exercise asks the reader to show that every non-negative integer n can be written as

n = \binom{a}{1} + \binom{b}{2} + \binom{c}{3}

and that the representation is unique if you require 0 ≤ a < b < c. The book calls this the binomial number system. I skimmed a paper that said this has some application in signal processing, but I haven’t looked at it closely [1].

You can find a, b, and c much as you would find the representation in many other number systems: first find the largest possible c, then the largest possible b for what’s left, and then the remainder is a.

To find c, start with the observation that the binomial coefficient C(k, 3) is less than k³/6, so any k no bigger than the cube root of 6n satisfies C(k, 3) ≤ n. The integer part of the cube root of 6n is therefore a lower bound on c, and we can search upward incrementally from there. If we wanted to be more efficient, we could do some sort of binary search.

Here’s Python code to find a, b, and c.

from math import comb, factorial

def lower(n, r):
    "Find largest k such that comb(k, r) <= n."
    k = int( (factorial(r)*n)**(1/r) ) # initial guess
    while comb(k, r) <= n: 
        k += 1 
    return k - 1 

def binomial_rep(n):
    "Write n as comb(c, 3) + comb(b, 2) + comb(a, 1) with c > b > a >= 0."
    c = lower(n, 3)
    cc = comb(c, 3)
    b = lower(n - cc, 2)
    bb = comb(b, 2)
    a = n - cc - bb
    assert c > b > a >= 0
    return (a, b, c)

For example, here’s the binomial number system representation of today’s date.

>>> binomial_rep(20250605)
(79, 269, 496)
>>> comb(496, 3) + comb(269, 2) + comb(79, 1)
20250605

You could use any number of binomial terms, not just three.
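
For instance, here’s a sketch of how the greedy approach might be generalized to m terms, reusing the lower function above. This generalization is my own extrapolation, not part of the exercise.

def binomial_rep_general(n, m):
    "Greedy: write n as comb(k_m, m) + ... + comb(k_2, 2) + comb(k_1, 1)."
    ks = []
    for r in range(m, 1, -1):
        k = lower(n, r)
        ks.append(k)
        n -= comb(k, r)
    ks.append(n)                  # the last term is comb(n, 1) = n
    return tuple(reversed(ks))    # (k_1, ..., k_m)

For m = 3 this reproduces the output of binomial_rep above.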

[1] I looked back at the paper, and it is using a different kind of binomial number system, expressing numbers as sums of fixed binomial coefficients, not varying the binomial coefficient arguments. This representation has some advantages for error correction.

Additive and multiplicative persistence

Casting out nines is a well-known way of finding the remainder when a number is divided by 9. You add all the digits of a number n. If that sum is bigger than 9, you add all of its digits, and you keep this up until you get a single-digit number. (A final result of 9 corresponds to a remainder of 0.)

This is an example of persistence. The persistence of a number, relative to some operation, is the number of times you have to apply that operation before you reach a fixed point, a point where applying the operation further makes no change.

The additive persistence of a number is the number of times you have to take the digit sum before reaching a fixed point, i.e. a single-digit number. For example, the additive persistence of 8675309 is 3 because the digits in 8675309 sum to 38, the digits in 38 sum to 11, and the digits in 11 sum to 2.
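
Here’s a short sketch of the additive version in code, analogous to the multiplicative code further down.

def digit_sum(n):
    "Sum of the digits of n."
    return sum(int(d) for d in str(n))

def additive_persistence(n):
    "Number of digit sums needed to reduce n to a single digit."
    c = 0
    while n > 9:
        n = digit_sum(n)
        c += 1
    return c

For example, additive_persistence(8675309) returns 3.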

The additive persistence of a number in base b is roughly comparable to its iterated logarithm in base b; both grow extremely slowly.

The smallest number with additive persistence n is sequence A006050 in OEIS.

Multiplicative persistence is analogous, except you multiply digits together. Curiously, multiplicative persistence appears to be bounded. This has been verified for numbers with fewer than 30,000 digits, and it is conjectured to hold for all integers.

The smallest number with multiplicative persistence n is sequence A003001 in OEIS.

Here’s Python code to compute multiplicative persistence.

def digit_prod(n):
    "Product of the digits of n."
    prod = 1
    for d in str(n):
        prod *= int(d)
    return prod

def persistence(n):
    "Multiplicative persistence: how many digit products until a single digit."
    c = 0
    while n > 9:
        n = digit_prod(n)
        c += 1
    return c

You could use this to verify that 277777788888899 has persistence 11. It is conjectured that no number has persistence larger than 11.
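
For example, using the function above:

>>> persistence(277777788888899)
11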

The persistence of a large number is very likely 1. If you pick a number with 1000 digits at random, for example, it’s very likely that at least one of these digits will be 0, making the digit product 0. So it would seem that the probability of a large number having persistence larger than 11 would be incredibly small, but I would not have expected it to be zero.

Here are plots to visualize the fixed points of the numbers less than N for increasing values of N.


It’s not surprising that zeros become more common as N increases. And a moment’s thought will explain why even numbers are more common than odd numbers. But it’s a little surprising that 4 is much less common than 2, 6, or 8.

Incidentally, we have worked in base 10 so far. In base 2, the maximum persistence is 1: when you multiply a bunch of 0s and 1s together, you either get 0 or 1. It is conjectured that the maximum persistence in base 3 is 3.

Iterated logarithm

Logarithms give a way to describe large numbers compactly. But sometimes even the logarithm is a huge number, and it would be convenient to take the log again. And maybe again.

For example, consider

googolplex = 10^googol

where

googol = 10^100.

(The name googol is older than “Google,” and in fact the former inspired the latter name.)

A googolplex is a big number. Its log base 10 is still a big number. You have to take the log 3 times to get to a small number:

log10( log10( log10 googolplex ) ) = 2.

Log star

The iterated logarithm base b of a positive number x, written log*b(x), is the number of times you have to take the log base b until you get a result less than or equal to 1. So in the example above

log*10 googolplex = 4.

As usual, a missing base implies base e. Also, just as log base 2 is often denoted lg, the iterated log base 2 is sometimes denoted lg*.

I’ve seen sources that say the base of an iterated logarithm doesn’t matter. That’s not true, because, for example,

log*2 googolplex = 6.

The function log*(x) appears occasionally in the analysis of algorithms.

Python implementation

It’s easy to implement the iterated log, though this is of limited use because the iterated log is usually applied to numbers too large to represent as floating point numbers.

from math import log, exp

def logstar(x, base=exp(1)):
    "Number of times you have to take the log of x before the result is <= 1."
    c = 0
    while x > 1:
        x = log(x, base)
        c += 1
    return c

lgstar = lambda x: logstar(x, 2)

Mersenne primes

The largest known prime, at the time of writing, is

p = 2^136279841 − 1.

We can use the code above to determine that lg* 136279841 = 5 and so lg* p = 6.
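
For example, using the code above:

>>> lgstar(136279841)
5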

Skewes’ bounds

The prime number theorem says that the number of prime numbers less than x, denoted π(x), is asymptotically li(x) where

\text{li}(x) = \int_0^x \frac{dt}{\log t}

You may have seen a different form of the prime number theorem that says π(x) is asymptotically x/log x. That’s true too, because li(x) and x/log x are asymptotically equal. But li(x) gives a better approximation to π(x) for finite x. More on that here.

For every x for which π(x) has been computed, π(x) < li(x). However, Littlewood proved in 1914 that there exists some x for which π(x) > li(x). In 1933 Stanley Skewes gave an upper bound on the smallest x such that π(x) > li(x), assuming the Riemann Hypothesis:

S1 = 10^(10^(10^34))

In 1955 he proved the upper bound

S2 = 10^(10^(10^964))

not assuming the Riemann Hypothesis.

You can calculate that

log*10 S1 = 5

and

log*10 S2 = 5.

Iterated iterated logarithm

In all the examples above, log*(x) is a small number. But for some numbers, such as Moser’s number and Graham’s number, even log*(x) is unwieldy and you have to do things like take the iterated logarithm of the iterated logarithm

log**(x) = log*( log*(x) )

before the numbers become imaginable.


Approximation of Inverse, Inverse of Approximation

“The inverse of an approximation is an approximation of the inverse.” This statement is either obvious or really clever. Maybe both.

Logs and exponents

Let’s start with an example. For moderately small x,

10^x \approx \frac{1 + x}{1 - x}

That means that for x near 1 (i.e. for x equal to 10 raised to a small exponent),

\log_{10}x \approx \frac{x - 1}{x + 1}
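
Here’s a quick numerical sanity check of both approximations (a rough sketch; the particular values of x and y are arbitrary).

from math import log10

x = 0.2
print(10**x, (1 + x)/(1 - x))      # 1.5849 vs 1.5

y = 1.5
print(log10(y), (y - 1)/(y + 1))   # 0.1761 vs 0.2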

The inverse of a good approximation is a good approximation of the inverse. But the measure of goodness might not be the same in both uses of the word “good.” The inverse of a very good approximation might be a mediocre approximation of the inverse, depending on the sensitivity of the problem. See, for example, Wilkinson’s polynomial.

Inverse function theorem

Now let’s look at derivatives and inverse functions. The derivative of a function is the best linear approximation to that function at a point. And it’s true that the best linear approximation to the inverse of a function at a point is the inverse of the best linear approximation to the function. In symbols,

\frac{dx}{dy} = \frac{1}{\dfrac{dy}{dx}}

This is a good example of a bell curve meme, or coming full circle. The novice naively treats the symbols above as fractions and says “of course.” The sophomore says “Wait a minute. These aren’t fractions. You can’t just do that.” And they would be correct, except the symbols are limits of fractions, and in this context naive symbol manipulation in fact leads to the correct result.

However, the position on the right side of the meme is nuanced. Treating derivatives as fractions, especially partial derivatives, can get you into trouble.

The more rigorous statement of the inverse function theorem is harder to parse visually, and so the version above is a good mnemonic. If f(x) = y then

\bigl(f^{-1}\bigr)^\prime (y) = \frac{1}{f^\prime(x)} = \frac{1}{f^\prime(f^{-1}(y))}
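
Here’s a quick numerical illustration using a function whose inverse we know in closed form, f(x) = x³ (an arbitrary choice).

f = lambda x: x**3
fprime = lambda x: 3*x**2
finv = lambda y: y**(1/3)          # inverse of f

x = 2.0
y = f(x)                           # y = 8
h = 1e-6

print((finv(y + h) - finv(y - h)) / (2*h))   # derivative of the inverse at y
print(1 / fprime(x))                         # 1/12 = 0.0833...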

The same is true in multiple variables. If f is a smooth function from ℝⁿ to ℝⁿ then the best linear approximation to f at a point is a linear transformation, which can be represented by a matrix M. And the best linear approximation to the inverse function at that point is the linear transformation represented by the inverse of M. In symbols,

D\bigl(f^{-1}\bigr)(y) = \left[Df\bigl(f^{-1}(y)\bigr)\right]^{-1}

There’s fine print, such as saying a function has to be differentiable in a neighborhood of x and the derivative has to be non-singular, but that’s the gist of the inverse function theorem.
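
Here’s a similar numerical sketch for the matrix version, with an arbitrary map from ℝ² to ℝ²: compare a finite-difference Jacobian of the (numerically inverted) function with the inverse of the Jacobian.

import numpy as np

def f(v):
    x, y = v
    return np.array([x + y**2, x*y + 1.0])

def Df(v):
    x, y = v
    return np.array([[1.0, 2*y],
                     [y,   x  ]])

x0 = np.array([1.0, 2.0])
y0 = f(x0)
h = 1e-6

def finv(y, guess):
    "Invert f near the guess with a few Newton steps."
    x = guess.copy()
    for _ in range(50):
        x = x - np.linalg.solve(Df(x), f(x) - y)
    return x

cols = [(finv(y0 + h*e, x0) - finv(y0 - h*e, x0)) / (2*h) for e in np.eye(2)]
print(np.column_stack(cols))       # finite-difference Jacobian of the inverse
print(np.linalg.inv(Df(x0)))       # inverse of the Jacobian; should agree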

Note that unlike in the first section, there’s no discussion of how good the approximations are. The approximations are exact, at a point. But it might still be the case that the linear approximation on one side is good over a much larger neighborhood than the linear approximation on the other side.

False witnesses


Fermat’s primality test

Fermat’s little theorem says that if p is a prime and a is not a multiple of p, then

a^(p−1) = 1 (mod p).

The contrapositive of Fermat’s little theorem says if

a^(p−1) ≠ 1 (mod p)

then either p is not prime or a is a multiple of p. The contrapositive is used to test whether a number is prime. Pick a number a less than p (and so a is clearly not a multiple of p) and compute a^(p−1) mod p. If the result is not congruent to 1 mod p, then p is not a prime.

So Fermat’s little theorem gives us a way to prove that a number is composite: if a^(p−1) mod p equals anything other than 1, p is definitely composite. But if a^(p−1) mod p does equal 1, the test is inconclusive. Such a result suggests that p is prime, but it might not be.

Examples

For example, suppose we want to test whether 57 is prime. If we apply Fermat’s primality test with a = 2 we find that 57 is not prime, as the following bit of Python shows.

    >>> pow(2, 56, 57)
    4

Now let’s test whether 91 is prime using a = 64. Then Fermat’s test is inconclusive.

    >>> pow(64, 90, 91)
    1

In fact 91 is not prime because it factors into 7 × 13.

If we had used a = 2 as our base we would have had proof that 91 is composite. In practice, Fermat’s test is applied with multiple values of a (if necessary).
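
Here’s a minimal sketch of how such a test might be packaged, trying several randomly chosen bases. (The function name and the number of trials are arbitrary choices.)

from random import randrange

def fermat_test(p, trials=20):
    "Return False if p is definitely composite, True if p is probably prime. Assumes p > 3."
    for _ in range(trials):
        a = randrange(2, p - 1)            # random base with 2 <= a <= p - 2
        if pow(a, p - 1, p) != 1:
            return False                   # a witnesses that p is composite
    return True                            # no witness found; p is probably prime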

Witnesses

If a^(p−1) = 1 (mod p), then a is a witness to the primality of p. It’s not proof, but it’s evidence. It’s still possible p is not prime, in which case a is a false witness to the primality of p, or p is a pseudoprime to the base a. In the examples above, 64 is a false witness to the primality of 91.

Until I stumbled on a paper [1] recently, I didn’t realize that on average composite numbers have a lot of false witnesses. For large N, the average number of false witnesses for composite numbers less than N is greater than N^(15/23). Although N^(15/23) is big, it’s small relative to N when N is huge. And in applications, N is huge, e.g. N = 2^2048 for RSA encryption.

However, your chance of picking one of these false witnesses, if you were to choose a base a at random, is small, and so probabilistic primality testing based on Fermat’s little theorem is practical, and in fact common.

Note that the result quoted above is a lower bound on the average number of false witnesses. You really want an upper bound if you want to say that you’re unlikely to run into a false witness by accident. The authors in [1] do give upper bounds, but they’re much more complicated expressions than the lower bound.


[1] Paul Erdős and Carl Pomerance. On the Number of False Witnesses for a Composite Number. Mathematics of Computation. Volume 46, Number 173 (January 1986), pp. 259–279.

Houston’s long term planning

When I hear American cities criticized for lack of long-term planning, my initial thought is that this is a good thing because the future cannot be predicted or directed very far in advance, and cities should be allowed to grow organically. Houston, in particular, is subject to a lot of criticism because of its lack of zoning, which I also think is a good thing. [1]

With that in mind, I was very surprised to see the following thread [2] from Aaron Renn.

Maybe the craziest thing about Houston’s third beltway, Grand Parkway, is that the idea originated in 1961, when the Houston metro area population was only 1.2 million people. [It’s now 7.8 million. — JC]

“In October 1965, plans for the Grand Parkway became public. The Houston City Planning Commission released a draft version of the 1966 Major Thoroughfare and Freeway Plan showing the new third loop on the official planning map.”

“The City Planning Commission approved the route and scheduled a public hearing for February 17, 1966. The Houston Chronicle enthusiastically endorsed the new route, saying, ‘No sensible citizen can doubt that this freeway will be needed eventually.’”

Aaron’s quotations are from Houston Freeways.

To a very crude approximation, a map of Houston looks like three concentric rings [3]. According to the sources cited above, planning for the outer ring began long before the middle ring was completed. The first section of the Grand Parkway opened in 1994, and it’s maybe three quarters finished currently.

Maybe what has made Houston successful is that it planned far ahead on a macro scale, but has not micromanaged as much as most cities.

[1] Along these lines, see James Scott’s book Seeing Like a State for examples of failures in long-range state planning. Or if you don’t have time to read the book, you might want to read Venkatesh Rao’s article A Big Little Idea Called Legibility.

[2] It’s not exactly a thread. It’s a series of posts that quote the first sentence.

[3] When the outer ring, the Grand Parkway, is completed, it will be the longest beltway in the US and encompass more area than the state of Rhode Island.

Gardener’s ellipse

There are several ways to define an ellipse. If you want to write software to draw an ellipse, it’s most convenient to have a parametric form:

\begin{align*} x(t) &= h + a \cos(\theta) \cos(t) - b \sin(\theta) \sin(t) \\ y(t) &= k + a \sin(\theta) \cos(t) + b \cos(\theta) \sin(t) \end{align*}

This gives an ellipse centered at (h, k), with semi-major axis a, semi-minor axis b, and with the major axis rotated by an angle θ from horizontal.
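
For example, here’s a sketch of a function that generates points on such an ellipse, following the parametric form above (the function name is mine).

import numpy as np

def ellipse_points(h, k, a, b, theta, npoints=200):
    "Points on the ellipse centered at (h, k) with semi-axes a, b, rotated by theta."
    t = np.linspace(0, 2*np.pi, npoints)
    x = h + a*np.cos(theta)*np.cos(t) - b*np.sin(theta)*np.sin(t)
    y = k + a*np.sin(theta)*np.cos(t) + b*np.cos(theta)*np.sin(t)
    return x, y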

But if you’re going to physically draw an ellipse, there’s a more convenient definition: an ellipse is the set of points such that the sum of the distances to two foci is a constant s. This method is most often called the gardener’s ellipse. It’s also called the nails-and-string method, which is more descriptive.

To draw an ellipse on a sheet of plywood, drive a nail into the plywood at each focus and attach one end of a string of length s to each nail. Then put a pencil in the middle of the string and pull it tight. Keep the string tight as you move the pencil. This will draw an ellipse [1].

Presumably the name gardener’s ellipse comes from the same idea on a larger scale, such as a gardener marking out an elliptical flower bed using two stakes in the ground and some rope.

Going back to drawing on a computer screen, there is a practical use for the gardener’s ellipse: to determine whether a point is inside an ellipse, add the distance from the point to each of the foci of the ellipse. If the sum is less than s then the point is inside the ellipse. If the sum is greater than s the point is outside the ellipse.
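
In code the test is a one-liner. Here’s a sketch, using math.dist for Euclidean distance.

from math import dist

def inside_ellipse(p, f1, f2, s):
    "Is the point p strictly inside the ellipse with foci f1, f2 and string length s?"
    return dist(p, f1) + dist(p, f2) < s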

From parameters to nails and string

If you have the parametric form of an ellipse, how do you find the gardener’s method representation?

To make it easier to describe points, set θ = 0 and h = k = 0 for a moment. Then the foci are at (±c, 0) where c² = a² − b².

Since (a, 0) is a point on the ellipse, you have to be able to draw it, and so the string must stretch from the focus at (−c, 0) to the point (a, 0) and back to the focus at (c, 0). Therefore the length of the string is (a + c) + (a − c) = 2a.

The length of the string depends on the major axis and not on the minor axis.

When we shift and rotate our ellipse, letting θ, h, and k take on possibly non-zero values, we apply the same transformation to the foci.

\begin{align*} F_1 &= (h - c\cos(\theta), k - c\sin(\theta)) \\ F_2 &= (h + c\cos(\theta), k +\, c\sin(\theta)) \\ \end{align*}

Rotating and translating the ellipse doesn’t change the length of the major axis, so s stays the same.

From nails and string to parameters

Draw a line segment between the two “nails” (foci). The slope of this segment is tan θ and its midpoint is (h, k). The parameter c is half the length of the segment.

The semimajor axis a is half the string length s.

Once you know a and c you can solve for b = √(a² − c²).
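
Here’s a sketch of that conversion in code, following the steps above (the function name is mine).

from math import atan2, dist, sqrt

def foci_to_params(f1, f2, s):
    "Convert foci f1, f2 and string length s to (h, k, a, b, theta)."
    h = (f1[0] + f2[0]) / 2
    k = (f1[1] + f2[1]) / 2
    theta = atan2(f2[1] - f1[1], f2[0] - f1[0])
    c = dist(f1, f2) / 2
    a = s / 2
    b = sqrt(a**2 - c**2)
    return h, k, a, b, theta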


[1] In practice, a carpenter might want to draw an ellipse a different way.

Fitting a parabola to an ellipse and vice versa

The previous post discussed fitting an ellipse and a parabola to the same data. Both fit well, but the ellipse fit a little better. This will often be the case because an ellipse has one more degree of freedom than a parabola.

There is essentially one way to fit a parabola to an ellipse at an extreme point: match the point itself and the curvature there. This uses up all the degrees of freedom in the parabola.

When you take the analogous approach to fitting an ellipse to a parabola, you have one degree of freedom left over. The curvature depends on a ratio, and so you can adjust the parameters while maintaining the ratio. You can use this freedom to fit the parabola better over an interval while still matching the curvature at the vertex.

The rest of the post will illustrate the ideas outlined above.

Fitting a parabola to an ellipse

Suppose you have an ellipse with equation

(x/a)² + (y/b)² = 1.

The curvature at (±a, 0) equals a/b² and the curvature at (0, ±b) equals b/a².

Now if you have a parabola

x = cy² + d

then its curvature at y = 0 is 2|c|.

If you want to match the parabola and the ellipse at (a, 0) then d = a.

To match the curvatures at (a, 0) we set a/b² = 2|c|. So c = −a/2b². (Without the negative sign the curvatures would match, but the parabola would turn away from the ellipse.)

Similarly, at (−a, 0) we have d = −a and c = a/2b². And at (0, ±b), with the roles of x and y swapped so the parabola has the form y = cx² + d, we have d = ±b and c = ∓b/2a².

Here’s an example with a golden ellipse.

Fitting an ellipse to a parabola

Now we fix the parabola, say

y = cx²

and find an ellipse

(x/a)² + ((y − y0)/b)² = 1

to fit at the vertex (0, 0). For the ellipse to touch the parabola at its vertex we must have

((0 − y0)/b)² = 1

and so y0 = b. To match curvature we have

b/2a² = c.

So a and b are not uniquely determined, only the ratio b/a². As long as this ratio stays fixed at 2c, every such ellipse will touch the parabola at the vertex and match its curvature there. But larger values of the parameters will match the parabola more closely over a wider interval. In the limit as b → ∞ (keeping b/a² = 2c), the ellipses converge to the parabola.
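
Here’s a small numerical sketch of that limit: fix c, let b grow while keeping b/a² = 2c, and watch the lower arc of the ellipse approach the parabola on a fixed interval.

import numpy as np

c = 1.0                                   # the parabola y = c x^2
x = np.linspace(-0.5, 0.5, 101)

for b in [1, 10, 100, 1000]:
    a = np.sqrt(b / (2*c))                # keep b/a^2 = 2c
    ellipse = b - b*np.sqrt(1 - (x/a)**2) # lower arc of the fitted ellipse
    print(b, np.max(np.abs(ellipse - c*x**2)))

The printed maximum differences decrease toward zero as b increases.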


The ellipse hidden inside Pascal’s triangle

The nth row of Pascal’s triangle contains the binomial coefficients C(n, r) for r ranging from 0 to n. For large n, if you print out the numbers in the nth row vertically in binary you can see a circular arc.

Here’s an example with n = 50.

You don’t have to use binary. Here are the numbers in the row for n = 100 written in base 10. You may be able to read the numbers if you squint, but you can see that the shape of the curve is something like a piece of a circle.

The length of the numerical representation of a number is roughly proportional to its logarithm. Changing the base only changes the proportionality constant. The examples above suggest that a plot of the logarithms of a row of Pascal’s triangle will be a portion of a circle, up to some scaling of one of the axes, so in general we have an ellipse.

Here’s a plot of log C(n, r) for n = 1000. The shape is the same for all large n, so the choice of n = 1000 doesn’t matter much at all.
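
Here’s a sketch of how such a plot could be generated (the plotting details are incidental).

from math import comb, log
import matplotlib.pyplot as plt

n = 1000
r = list(range(n + 1))
logC = [log(comb(n, k)) for k in r]

plt.plot(r, logC)
plt.xlabel("r")
plt.ylabel("log C(n, r)")
plt.show()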

The best fitting ellipse is

\left(\frac{r - n/2}{a}\right)^2 + \left(\frac{y-y_0}{b}\right)^2= 1

where a = 554.2, b = 47.12, and y0 = −19.87.

The plots of log C(1000, r) and the ellipse lie on top of each other; the error is less than the width of a plotting line. Here’s a plot of the relative error in approximating log C(1000, r) with the ellipse.

Parabolic fit

Wolfgang pointed out that the curve should be a parabola rather than an ellipse because the binomial distribution is asymptotically normal. Makes perfect sense.

So I redid my plots with the parabola that interpolates log C(n, r) at 0, n/2, and n. This also gives a very good fit, but not as good!

But that’s not a fair comparison because it’s comparing the best (least squares) elliptical fit to a convenient parabolic fit.

So I redid my plots again with the least squares parabolic fit. The fit was better, but still not as good as the elliptical fit.

Part of the reason the ellipse fits better than the parabola has to do with the limitations of the central limit theorem. First of all, it applies to CDFs, not PDFs. Second, it applies to absolute error, not relative error. In practice, the CLT gives a good approximation in the middle but not in the tails. With all the curves mentioned above, the maximum error is in the tails.

Another part of the reason the ellipse fits better is that ellipses are more flexible than parabolas, and the discussion above suggests such flexibility would be useful. You can approximate a parabola arbitrarily well with an ellipse, so in the limit, elliptical fits include parabolic fits, as demonstrated in the next post.