Can AI models reason: Just a stochastic parrot?

OpenAI has just released its full o1 model—a new kind of model that is more capable of multi-step reasoning than previous models. Anthropic, Google and others are no doubt working on similar products. At the same time, it’s hotly debated in many quarters whether AI models actually “reason” in a way similar to humans.

Emily Bender and her colleagues famously described large language models as nothing more than “stochastic parrots”—systems that simply repeat their training data blindly, based on a statistical model, with no real understanding (reminiscent of the Chinese Room thought experiment). Others have made similar comments, describing LLMs as “n-gram models on steroids” or a “fancy extrapolation algorithm.”

There is of course some truth to this. AI models sometimes generate remarkable results and yet lack certain basic aspects of understanding, which may be why they sometimes produce nonsensical results. More to the point of “parroting” the training data, recent work from Yejin Choi’s group has shown how LLMs at times will cut and paste snippets from their various training documents, almost verbatim, to formulate their outputs.

Are LLMs (just) glorified information retrieval tools?

The implication of these concerns is that an LLM can “only” repeat back what it was taught (albeit with errors). However, this view does not align with the evidence. LLM training is a compression process in which new connections between pieces of information are formed that were not present in the original data. This is evidenced both mathematically and anecdotally. In my own experience, I’ve gotten valid answers to technical questions so obscure and detailed that it is hard for me to believe they would exist in any training data in exactly that form. Whether you would call this “reasoning” might be open to debate, but whatever you call it, it is something more than the unadorned information retrieval of a “stochastic parrot.”

What is your experience? Let us know in the comments.

Interval arithmetic and fixed points

A couple days ago I analyzed the observation that repeatedly pressing the cosine key on a calculator leads to a fixed point. After about 90 iterations the number no longer changes. This post will analyze the same phenomenon a different way.

Interval arithmetic

Interval arithmetic is a way to get exact results of a sort from floating point arithmetic.

Suppose you start with a number x that cannot be represented exactly as a floating point number, and you want to compute f(x) for some function f. You can’t represent x exactly, but unless x is too large you can represent a pair of numbers a and b such that x is certainly in the interval [a, b]. Then f(x) is in the set f( [a, b] ).

Maybe you can represent f( [a, b] ) exactly. If not, you can enlarge the interval a bit to exactly represent an interval that contains f(x). After applying several calculations, you have an interval, hopefully one that’s not too big, containing the exact result.

(I said above that interval arithmetic gives you exact results of a sort because even though you don’t generally get an exact number at the end, you do get an exact interval containing the result.)

Cosine iteration

In this post we will use interval arithmetic, not to compensate for the limitations of computer arithmetic, but to illustrate the convergence of iterated cosines.

The cosine of any real number lies in the interval [−1, 1]. To put it another way,

cos( [−∞, ∞] ) = [−1, 1].

Because cosine is an even function,

cos( [−1, 1] ) = cos( [0, 1] )

and so we can limit our attention to the interval [0, 1].

Now the cosine is a monotone decreasing function from 0 to π, and so it’s monotone on [0, 1]. For any two points a and b with 0 ≤ a ≤ b ≤ π we have

cos( [a, b] ) = [cos(b), cos(a)].

Note that the order of a and b reverses on the right hand side of the equation because cosine is decreasing. When we apply cosine again we get back the original order.

cos(cos( [a, b] )) = [cos(cos(a)), cos(cos(b))].

Incidentally, this flip-flop explains why the cobweb plot from the previous post looks like a spiral rather than a staircase.

Now define a₀ = 0, b₀ = 1, and

[aₙ₊₁, bₙ₊₁] = cos( [aₙ, bₙ] ) = [cos(bₙ), cos(aₙ)].

We could implement this in Python with a pair of mutually recursive functions.

    from math import cos

    a = lambda n: 0 if n == 0 else cos(b(n-1))
    b = lambda n: 1 if n == 0 else cos(a(n-1))

Here’s a plot of the image of [0, 1] after n iterations.

Note that odd iterations increase the lower bound and even iterations decrease the upper bound.
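
To watch the convergence numerically, here’s a short iterative version of the same recurrence (plain floating point, ignoring rounding for now):

    from math import cos

    a, b = 0.0, 1.0
    for n in range(1, 11):
        # apply cosine to the interval; the endpoints swap order
        a, b = cos(b), cos(a)
        print(n, a, b, b - a)

The widths shrink toward zero, closing in on the fixed point 0.739085…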

Numerical interval arithmetic

This post introduced interval arithmetic as a numerical technique, then proceeded to do pure math. Now let’s think about computing again.

The image of [0, 1] under cosine is [cos(1), cos(0)] = [cos(1), 1]. A computer can represent 1 exactly but not cos(1). Suppose we compute

cos(1) = 0.5403023058681398

and assume each digit in the result is correct. Maybe the exact value of cos(1) was slightly smaller and was rounded to this value, but we know for sure that

cos( [0, 1] ) ⊂ [0.5403023058681397, 1]

So in this case we don’t know the image of [0, 1], but we know an interval that contains the image, hence the subset symbol.

We could iterate this process, next computing an interval that contains

cos( [0.5403023058681397, 1] )

and so forth. At each step we would round the left endpoint down to the nearest representable lower bound and round the right endpoint up to the nearest representable upper bound. In practice we’d be concerned with machine representable numbers rather than decimal representable numbers, but the principle is the same.
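
Here’s a minimal sketch of that directed rounding in Python, using math.nextafter (Python 3.9 and later) to widen each endpoint by one unit in the last place. This assumes the computed cosine is within one ulp of the true value, which is typical of math libraries but not guaranteed:

    from math import cos, inf, nextafter

    a, b = 0.0, 1.0
    for _ in range(100):
        # apply cosine, then nudge the left endpoint down and the
        # right endpoint up one ulp each, so the computed interval
        # certainly contains the true image
        a, b = nextafter(cos(b), -inf), nextafter(cos(a), inf)

    print(a, b)   # a tight interval around the fixed point 0.739085...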

The potential pitfall of interval arithmetic in practice is that intervals may grow so large that the final result is not useful. But that’s not the case here. The rounding error at each step is tiny, and contraction maps reduce errors at each step rather than magnifying them. In a more complicated calculation, we might have to resort to loose estimates and not have such tight intervals at each step.

Golden hospital gowns

Here’s something I posted on X a couple days ago:

There’s no direct connection between AI and cryptocurrency, but they have a similar vibe.

They both leave you wondering whether the emperor is sumptuously clothed, naked, or a mix of both.

Maybe he’s wearing a hospital gown with gold threads.

In case you’re unfamiliar with the story, this is an allusion to The Emperor’s New Clothes, one of the most important stories in literature.

I propose “golden hospital gown” as a metaphor for things that are a fascinating mixture of good and bad, things that have large numbers of haters and fanboys, both with valid points. There’s continual improvement and a lot of work to be done sorting out what works well and what does not.

I tried to get Grok to create an image of what I had in mind by a golden hospital gown. The results were not what I wanted, but passable, which is kinda the point of the comment above. It’s amazing that AI can produce anything remotely resembling a desired image starting from a text description. But there is a very strong temptation to settle for mediocre and vaguely creepy images that aren’t what we really want.

[Image: man in a yellow hospital gown with hairy back and legs exposed]

LLMs and regular expressions

Yesterday I needed to write a regular expression as part of a client report. Later I was curious whether an LLM could have generated an equivalent expression.

When I started writing the prompt, I realized it wasn’t trivial to tell the LLM what I wanted. I needed some way to describe the pattern that the expression should match.

“Hmm, what’s the easiest way to describe a text pattern? I know: use a regular expression! Oh wait, …”

Prompt engineering and results

I described the pattern in words, which was more difficult than writing the regular expression itself. The LLM came up with a valid regular expression, along with sample code demonstrating its use, but the expression wasn’t quite right. After a couple more nudges I managed to get it to produce a correct regex.

I had asked for a Perl regular expression, and the LLM did generate syntactically correct Perl [1], both for the regex and the sample code. When I asked it to convert the regex to POSIX form it did so, and when I asked it to convert the regex to Python it did that as well, replete with valid test code.

I repeated my experiment using three different LLMs and got similar results. In all cases, the hardest part was specifying what I wanted. Sometimes the output was correct given what I asked for, but not what I intended, a common experience since the dawn of computers. Translating a regex from one syntax flavor to another was easier than generating a correct regex in the first place, easier for both me and the computer: writing the prompt took less effort, and the LLM did a better job.

Quality assurance for LLMs

Regular expressions and LLMs are complementary. The downside of regular expressions is that you have to specify exactly what you want. The upside is that you can specify exactly what you want. We’ve had several projects lately in which we tested the output of a client’s model using regular expressions and found problems. Sometimes it takes a low-tech tool to find problems with a high-tech tool.
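
For instance (a made-up stand-in for the client pattern), suppose a model is supposed to emit dates in ISO 8601 form. A few lines of Python flag every line of output that doesn’t conform:

    import re

    # hypothetical check: each output line should be an ISO 8601 date
    pattern = re.compile(r"^\d{4}-\d{2}-\d{2}$")

    outputs = ["2024-11-30", "Nov. 30, 2024", "2024-1-05"]
    for line in outputs:
        if not pattern.match(line):
            print("unexpected format:", line)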

We’ve also tested LLMs using a different LLM. That has been useful because there’s some degree of independence. But we’ve gotten better results using regular expressions since there is a greater degree of independence.

[1] Admittedly that’s a low bar. There’s an old joke that Perl was created by banging on a keyboard then hacking on the compiler until the input compiled.

Rotating MacBook keys

Shortly after I started using a MacBook I remapped the keys so that they function the same way on Mac OS, Windows, and Linux. The key in the lower left corner, for example, behaves the same way across operating systems, as does the key to the left of the space bar.

Note that I’m going by key behavior and not key name. I go into more detail in my post on cross-platform muscle memory.

It’s been two and a half years since I remapped the keys, and it works well for me. The same keystrokes do the same thing whether I’m on my MacBook laptop or my Linux desktop.

The only drawback is that I’m confused when I read documentation telling me how to do something on a Mac. When I see something using the ⌘ key, I have to ask myself which key that corresponds to on my machine. I’m writing this post in hopes that it will help me memorize how my mappings relate to the original keys.

Here’s a diagram of how I have rotated my keys.

[Diagram: command → fn → option → command]

So, for example, if I want to make text bold, I hold down the key in the lower left corner and type b. On my MacBook that key is labeled with a globe and on my Linux box it is labeled Ctrl. But I don’t look at the keys and don’t care what they’re labeled.

I also swapped the Caps Lock and what Mac calls the control key, but I mostly use the big control key only in Emacs. As I noted in the earlier article, Mac essentially has one Windows-like control key (labeled ⌘) and one Emacs-like control key (labeled control).


Asymmetric generation / verification costs

We tend to think that the effort required to generate a solution and the effort required to verify it are roughly equal, on the assumption that you need to retrace the generation steps to check that they are correct. But sometimes verification can be far easier than generation [1].

Factoring

For example, suppose I generate two 1000-digit prime numbers, multiply them together, and ask you to recover the two primes by factoring. It takes a split second to verify whether you’ve found the solution, but finding the solution is practically impossible. The security of the internet depends on this fact. (See posts on RSA encryption here.)
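
Here’s a sketch of the asymmetry using SymPy, with primes much smaller than 1000 digits so it runs instantly:

    from sympy import randprime

    # generating two primes and multiplying them is fast
    p = randprime(10**19, 10**20)
    q = randprime(10**19, 10**20)
    n = p * q

    # verifying a claimed factorization is a single multiplication
    assert p * q == n

    # recovering p and q from n alone is the hard direction;
    # for 1000-digit primes it is utterly infeasible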

Generative AI

Generative AI is notoriously unreliable. But if it’s far easier to verify a solution than to generate it, it’s OK if an AI solution is unreliable. We’ve had several consulting projects lately that amount to evaluating how well an LLM did what it was supposed to do. The LLMs aren’t perfect—no one expected that they would be—but the task is to estimate whether the error rate is acceptable.

Differential equations

George Pólya once said “In order to solve a differential equation you look at it till a solution occurs to you.” He was only half joking. Differential equations courses present solution techniques without justification [2], but the solutions can be justified even if the techniques cannot: stick the solution into the equation and see whether it works.
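
For example, to verify that sin(x) solves y″ + y = 0 you just substitute. A SymPy sketch:

    from sympy import symbols, sin, diff, simplify

    x = symbols('x')
    y = sin(x)                    # candidate solution
    residual = diff(y, x, 2) + y  # substitute into y'' + y = 0
    print(simplify(residual))     # prints 0: the solution checks out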

Finding primes

For the last 70 years, the largest known prime has been a Mersenne prime, with one exception [3]. This is because there is a test for the primality of Mersenne numbers, the Lucas–Lehmer test, that is far more efficient than primality tests for general numbers.
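
Here’s a minimal sketch of the Lucas–Lehmer test; production implementations optimize the modular squaring, but the logic is this simple:

    def is_mersenne_prime(p):
        # for an odd prime p, M = 2^p - 1 is prime iff s ends at 0,
        # where s starts at 4 and is squared-minus-2 (mod M) p-2 times
        M = 2**p - 1
        s = 4
        for _ in range(p - 2):
            s = (s * s - 2) % M
        return s == 0

    print([p for p in [3, 5, 7, 11, 13, 17, 19] if is_mersenne_prime(p)])
    # [3, 5, 7, 13, 17, 19] -- note 2^11 - 1 = 2047 = 23 * 89 is composite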

It takes a lot of computation to prove that a large number is prime, but it is possible to produce a certificate of primality that can be verified quickly. One form of primality certificate is the Pratt certificate; another is the elliptic curve primality certificate.

Optimization

Some optimization problems, such as linear programming or more generally convex optimization, produce certificates that verify that a solution is optimal. Maybe the solution was found on a beefy computer but could be verified on a phone.
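
As a sketch of how such a certificate works: for a linear program of the form min cᵀx subject to Ax ≥ b and x ≥ 0, weak duality says that any dual-feasible y with bᵀy = cᵀx certifies that x is optimal, and checking that takes just a few dot products:

    import numpy as np

    # tiny LP: minimize x1 + 2*x2 subject to x1 + x2 >= 3, x >= 0
    c = np.array([1.0, 2.0])
    A = np.array([[1.0, 1.0]])
    b = np.array([3.0])

    x = np.array([3.0, 0.0])  # claimed optimal solution
    y = np.array([1.0])       # dual certificate

    assert np.all(A @ x >= b) and np.all(x >= 0)    # primal feasible
    assert np.all(A.T @ y <= c) and np.all(y >= 0)  # dual feasible
    assert np.isclose(c @ x, b @ y)  # objectives match, so x is optimal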

Lack of solution

It’s typically easier to prove a solution exists than to prove a solution does not exist. You can prove a solution exists by finding one. This may not be easy to do, but it is easy to verify. Certifying that a problem has no solution is more difficult—how do you know whether you’ve looked hard enough?—but sometimes you can produce a certificate of infeasibility.

Footnotes

[1] The question of whether P = NP asks whether in some sense it is possible to efficiently compute solutions to a large class of problems whose solutions can be verified efficiently [4].

[2] The solution techniques can be justified, though not at the level of mathematics students have when taking an introductory differential equations course.

[3] In 1989 the largest known prime was 391581 × 2²¹⁶¹⁹³ − 1.

[4] Can a footnote have a footnote? Sure, why not. The P = NP problem asks whether a problem that can be verified in polynomial time can be solved in polynomial time. But if you didn’t already know this, you probably won’t find this description satisfying. It’s kinda subtle, and I won’t add a footnote to a footnote of a footnote.

Confidential OCR

A client emailed me a screenshot of a table rather than pasting the table as text into an email.

I thought about using an LLM to convert it to text, but the table is confidential client information and so I shouldn’t upload it anywhere.

I searched for a command line utility to do OCR and found tesseract. I installed it with

    sudo apt install tesseract-ocr libtesseract-dev tesseract-ocr-eng

and ran it with the default settings

    tesseract screenshot.png textfile

It worked remarkably well. I had to change a C to a U and delete a few extraneous parentheses generated by the software, but otherwise I didn’t have to add or change any text.

I work locally in part out of habit; it was the only way to work when I started using a computer. It has numerous advantages, such as being able to keep working when a hurricane knocks out my internet connection, but above all it is private.

I pay more attention to privacy than is convenient because I work in data privacy. And aside from my privacy, I have to protect our clients’ privacy.

Update: According to the comments, ChatGPT uses tesseract. Assuming that’s true, using tesseract directly is better than going through ChatGPT because it does exactly what you want. No ambiguity as to what is expected. No potential for tinkering with your results before you see them.

Resolving a mysterious problem with find

Suppose you want to write a shell script that searches the current directory for files that have a keyword in the name of the file or in its contents. Here’s a first attempt.

find . -name '*.py' -type f -print0 | grep -i "$1"
find . -name '*.py' -type f -print0 | xargs -0 grep -il "$1"

This works well for searching file contents but behaves unexpectedly when searching for file names.

If I have a file named frodo.py in the directory, the script will return

grep: (standard input): binary file matches

Binary file matches?! I wasn’t searching binary files. I was searching files with names consisting entirely of ASCII characters. Where is a binary file coming from?

If we cut off the pipe at the end of the first line of the script and run

find . -name '*.py' -type f -print0

we get something like

./elwing.py./frodo.py./gandalf.py

with no apparent non-ASCII characters. But if we pipe the output through xxd to see a hex dump, we see that there are invisible null characters after each file name.

One way to fix our script would be to add a -a option to the call to grep, telling it to treat the input as ASCII. But this will return the same output as above. The output of find is treated as one long (ASCII) string, which matches the regular expression.

Another possibility would be to add a -o flag to direct grep to return just the match. But this is less than ideal as well. If you were looking for file names containing a Q, for example, you’d get Q as your output, which doesn’t tell you the full file name.

There may be better solutions [1], but my solution was to insert a call to strings in the pipeline:

find . -name '*.py' -type f -print0 | strings | grep -i "$1"

This will extract the ASCII strings out of the input it receives, which has the effect of splitting the string of file names into individual names.

By default the strings command defines an ASCII string to be a string of 4 or more consecutive ASCII characters. A file with anything before the .py extension will necessarily have at least four characters, but the analogous script to search C source files would overlook a file named x.c. You could fix this by using strings -n 3 to find sequences of three or more ASCII characters.

If you don’t have the strings command installed, you could use sed to replace the null characters with newlines.

find . -name '*.py' -type f -print0 | sed 's/\x0/\n/g' | grep -i "$1"

Note that the null character is denoted \x0 rather than simply \0.

[1] See the comments for better solutions. I really appreciate your feedback. I’ve learned a lot over the years from reader comments.

Trigonometric interpolation

Suppose you want to interpolate a set of data points with a combination of sines and cosines.

One way to approach this problem would be to set up a system of equations for the coefficients of the sines and cosines. If you have N data points, you will get a system of N equations in N unknowns. The system will have a unique solution, though this is not obvious a priori.

Another approach would be to use the discrete Fourier transform (DFT). This is the approach that would commonly be used in practice. It’s even further from obvious a priori that this would work, but it does. (The DFT is so often computed using the FFT algorithm that the transform is often referred to by the algorithm name. If you’d like, mentally substitute FFT for DFT in the rest of the post.)

There are multiple ways to motivate the DFT, and the way suggested by the name is to derive the DFT as a discrete approximation to the (continuous) Fourier transform. Why should a discrete approximation to an integral transform also solve an interpolation problem? This doesn’t sound inevitable, or even plausible, but it is the case.

Another way to motivate the DFT is as the least-squares solution to fitting a sum of sines and cosines to a set of points. Since this is phrased as an optimization problem rather than an interpolation problem, it is clear that it will have a solution. However, it is not clear that the error in the optimal fit will in fact be zero. Furthermore, the equation for the coefficients in the solution is the same as the equation for the DFT. You can find a derivation in [1].

Example

Let’s take the vector [3, 1, 4, 1, 5, 9] and find trig functions that pass through these points. We can use the FFT as implemented in Python’s SciPy library to find a set of complex exponentials that pass through the points.

    from scipy.fft import fft
    from numpy import exp, array, pi, round

    x = array([3, 1, 4, 1, 5, 9])
    y = fft(x)

    N = len(x)
    z = [sum([exp(2j*pi*k*n/N)*y[k] for k in range(N)])/N for n in range(N)]

Aside from rounding errors on the order of 10⁻¹⁵, the vector z equals the vector x.

Turning the expression for z above into a mathematical expression, we have

6 f(x) = y₀ + y₁ exp(πix/3) + y₂ exp(2πix/3) + y₃ exp(πix) + y₄ exp(4πix/3) + y₅ exp(5πix/3)

where the y’s come from the FFT above, and the factor of 6 is the division by N in the code.

To find sines and cosines we need to use Euler’s formula

exp(iθ) = cos(θ) + i sin(θ)

Because we started with real data x, there will be symmetries in the FFT components y that simplify the reduction of the complex function f to a real-valued function g using sines and cosines: the components pair up as complex conjugates (y₅ is the conjugate of y₁, and y₄ of y₂), and so the imaginary parts cancel out.

6 g(x) = y₀ + (y₁ + y₅) cos(πx/3) + (y₂ + y₄) cos(2πx/3) + y₃ cos(πx)
+ i (y₁ − y₅) sin(πx/3) + i (y₂ − y₄) sin(2πx/3)

and so

g(x) = 3.833 + 0.8333 cos(πx/3) − 1.833 cos(2πx/3) + 0.1666 cos(πx)
− 2.5981 sin(πx/3) − 2.0207 sin(2πx/3)

Here’s a plot that verifies that g(x) passes through the specified points.
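
Equivalently, we can check numerically that g passes through the points, using the rounded coefficients above (so agreement is only to about three decimal places):

    from numpy import cos, sin, pi, arange

    def g(x):
        return (3.8333 + 0.8333*cos(pi*x/3) - 1.8333*cos(2*pi*x/3)
                + 0.1667*cos(pi*x) - 2.5981*sin(pi*x/3)
                - 2.0207*sin(2*pi*x/3))

    print(g(arange(6)).round(3))   # approximately [3, 1, 4, 1, 5, 9]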

[1] William L. Briggs and Van Emden Henson. The DFT: An Owner’s Manual for the Discrete Fourier Transform. SIAM 1995.

The impossible puzzle

It’s fascinating that there’s such a thing as the World Jigsaw Puzzle Championship. The winning team of the two-person thousand-piece puzzle round can assemble a Ravensburger puzzle in less than an hour—that’s about 3½ seconds per piece.

It makes you wonder, how could you measure the hardness of a jigsaw puzzle? And what would a “hardest puzzle” look like?

You can find some puzzles that are claimed to be “hardest.” A first step to creating a hard puzzle is removing the front image entirely, eliminating visual cues. Some puzzles do this. This can be partly simulated with a normal puzzle by putting it together with the image side face down.

But you can do more: make every piece a square of exactly the same size, and make the “tab” and “blank” on each of the four sides of each square exactly the same size, shape, and alignment. One can see that this gives you 2⁴ = 16 different kinds of puzzle piece; however, due to rotational symmetry, there are actually only six distinct kinds of puzzle piece. Now you are left with a bare minimum of shape cues to help you assemble the puzzle.
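
Here’s a quick brute-force check of that count, representing a piece by its four edges read clockwise (T for tab, B for blank) and identifying pieces that differ only by a rotation:

    from itertools import product

    def canonical(piece):
        # all four rotations of the piece; take the smallest as representative
        return min(piece[i:] + piece[:i] for i in range(4))

    kinds = {canonical(p) for p in product('TB', repeat=4)}
    print(len(kinds))   # 6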

For the sake of the argument, one can ignore the “edge” pieces. One can see this from the fact that for a rectangular puzzle, as you scale up the size, the number of edge pieces becomes smaller and smaller in proportion to the interior pieces. For example, for a 36 × 28 = 1008 piece puzzle, there are 2 × 36 + 2 × 28 − 4 = 124 edge pieces, about 12 percent of the whole. The ratio shrinks as you scale up the puzzle size. And there is such a thing as a 42,000-piece jigsaw puzzle.

The problem of assembling such a puzzle is known to be NP-complete. This means that the best known algorithms require compute time that grows exponentially as the number of pieces is increased.

Of course there are simpler special cases. Consider a checkerboard configuration, in which “black square” locations have pieces with four tabs and “red square” locations have pieces with four blanks, so there are only two kinds of piece. A competent puzzler could assemble this at one piece per second, about 17 minutes for a 1000-piece puzzle.

However, as far as we know, the number of “special cases” like this is vanishingly small for large puzzles. An intuition for this can be taken from Kolmogorov complexity theory, using a counting argument to show that only very few long strings can be compressed to a short program.

Most cases are really hard: for example, those constructed by making a totally random assignment of the configuration of each piece-to-piece boundary. Solving these is hard because you may have to backtrack: you may have the puzzle largely complete, but due to ambiguities, a wrong decision made early may force you to undo much of your work.

A very weak upper bound on the assembly time can be derived by considering the brute force case. For a 1000-piece puzzle, there are 1000! (factorial) possible ways to (attempt to) assemble the puzzle. Suppose one places every piece and then fully backtracks for each trial—this gives 1000 × 1000! puzzle piece placement steps. By Stirling’s approximation, this is roughly 10²⁵⁷¹ placement steps.
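
The exponent is easy to check with the log-gamma function:

    from math import lgamma, log

    # log10(1000 * 1000!) = 3 + log10(1000!), and lgamma(1001) = ln(1000!)
    print(3 + lgamma(1001) / log(10))   # 2570.6...

so 1000 × 1000! is a number of roughly 2571 digits.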

A human, assuming generously a rate of one step per second, at 31 million seconds/year nonstop, would take on the order of 10²⁵⁶³ years. Assuming 10 billion simultaneous puzzlers—order 10²⁵⁵³ years.

Now assume the on-record world’s fastest computer, the ORNL Frontier system (though see here), at roughly 10 exaflops (10¹⁹ operations per second) mixed precision, and assume one puzzle step per operation (overly optimistic). Assume further that every atom in the known universe (est. 10⁸²) is a Frontier computer system. In this case, solving the puzzle takes on the order of 10²⁴⁶² years. That’s a hundred thousand billion billion … (repeat “billion” 273 times here) years.

As I mentioned, this is a very weak upper bound. One could solve the problem using a SAT solver with sophisticated CDCL backtracking. Or one could use special methods derived from statistical mechanics that handle random SAT instances. However, to the best of our knowledge, the runtime still grows exponentially as a function of puzzle size. The best worst-case computational complexity for SAT solvers currently known is order O(1.306995ⁿ) (see here, here, and here). So, simply by constructing puzzles in the way we’ve described with more and more pieces, you will soon reach a point of having a puzzle that is unfathomably difficult to solve.

It’s conceivable that a yet-undiscovered method could be found for which the solution cost would in fact grow polynomially instead of exponentially. But people have been trying to answer this question one way or the other for over 50 years and it’s still unsolved, with an open million dollar prize for proof or counterexample. So we just don’t know.

What about quantum computing? It can solve some exponential complexity problems like factoring in polynomial time. But how about the SAT problem? It is simply not known (see here, here) whether a polynomial-time quantum algorithm exists for arbitrary NP-complete problems. None are known, and it might not be possible at all (though research continues).

So we’re faced with the curious possibility: a jigsaw puzzle could be constructed for which, were you to spill the box of pieces you’re holding onto the floor, there is no single or joint intelligence or computational mechanism, existing or even possible, in the entire universe that could ever reassemble it.