Can AI Models Reason: Is Data All You Need?

Many are voicing concern that the world is running out of data and that this will be a blocker to progress toward smarter AI models. One paper in fact projects timelines for when we will run out.

AI researchers are looking for ways to adapt. Nvidia has trained a model specifically to generate synthetic data for training other models. Some teams use this approach, though using AI-generated data to train AI is not without risk.

Others have asked a bigger question: is something fundamentally missing in an approach that relies so heavily on data? Certainly the bitter lesson thesis, and the position long advocated by Geoffrey Hinton, argue for a data-first approach with as few prior assumptions built in as possible (though every model has some bias).

But it is simply not known at this point whether just adding more data and compute will do the trick for achieving general intelligence, or whether something else is needed. Neurosymbolic approaches are being experimented with in various forms, but it’s unclear whether these can scale up to the level needed. And the frontier labs, laser-focused on the current paradigm, may not have adequate time or resources to investigate high-risk/high-reward alternatives.

From a theoretical standpoint, sometimes more data is simply not enough. As discussed in a previous post, some problems in mathematics and engineering require exponentially large amounts of data to train neural network models. Exponentials can work in your favor, but they can also work against you (think of the Tower of Hanoi problem or the Wheat and Chessboard problem). For some problems, on certain models, no amount of data available in the entire universe would be enough.
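
To see how quickly exponentials blow up, consider the Wheat and Chessboard problem just mentioned: one grain on the first square, doubling on each subsequent square. A two-line calculation (a quick sketch) gives the total:

# Wheat and Chessboard: one grain on the first square, doubling across all 64 squares.
total_grains = sum(2**k for k in range(64))   # = 2**64 - 1
print(total_grains)                           # 18446744073709551615
print(f"about {total_grains:.2e} grains")     # about 1.84e+19 grains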

The requirements for solving these problems can grow much more quickly than expected. The strength of neural networks (their flexibility, their universal approximation property) can also be a weakness: it can take an enormous amount of data to pin down all the parameters well enough to make the model essentially error free. Thankfully, many other problems that people want to solve (such as human language modeling) are fundamentally lower dimensional and thus less vulnerable to this issue.

We just don’t know whether the current data-hungry approach will be enough—or whether we’ll need to learn another bitter lesson.

Can AI models reason like a human?

We’re awaiting the release of OpenAI’s o3 model later this month. Its performance is impressive on very hard benchmarks like SWE-bench Verified, FrontierMath and the ARC-AGI benchmark (discussed previously in this blog).

And yet at the same time some behaviors of the frontier AI models are very concerning.

Their performance on assorted math exams is outstanding, but they make mistakes doing simple arithmetic, like wrongly multiplying numbers that are a few digits long. Performance of the o1 preview model on the difficult Putnam math exam is excellent but drops precipitously under simple changes like renaming constants and variables in the problem statement.

Similarly, when o1 is applied to a planning benchmark expressed in a standardized language, it performs well, but accuracy falls apart on a mathematically equivalent planning problem posed in a different domain. Likewise, a given AI model applied to simple rotation ciphers (of which ROT13 is the best-known case) can have wildly different performance depending on the value of the cipher key, suggesting the models don’t really understand the algorithm.
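
For reference, a rotation (Caesar) cipher is a trivially simple algorithm, and it is identical for every key; only the shift value changes. Here is a minimal Python sketch:

# General rotation ("Caesar") cipher; ROT13 is the special case key = 13.
# The algorithm is the same for every key; only the shift value differs.
def rot(text, key):
    result = []
    for c in text:
        if c.isalpha():
            base = ord('A') if c.isupper() else ord('a')
            result.append(chr((ord(c) - base + key) % 26 + base))
        else:
            result.append(c)
    return ''.join(result)

print(rot("Hello, world", 13))   # Uryyb, jbeyq
print(rot("Hello, world", 2))    # Jgnnq, yqtnf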

It was the best of times, it was the worst of times, . . .

What is going on here?

For years now, some have made claims of “human-level performance” for various deep learning algorithms. And as soon as one party starts making claims like this, it’s hard for the others to resist doing the same.

The confusion is that, from a certain point of view, the claim of “human-level” is true—but the definition of “human-level” is fraught.

Here, “human-level” is taken to mean achievement of some high score on a benchmark set, ostensibly exceeding some human performance measure on the same benchmark. However, a single AI model can vary wildly in capability across behaviors—“smart” compared to humans in some ways, “dumb” in others.

For humans, test-taking is a proxy for measuring a range of skills and abilities. And even for humans it is not always an accurate proxy. A person can perform very well on academic tests and very poorly on the job, or vice versa.

And the capability ratios for AI models are very different still, in ways we don’t fully understand. So, outscoring humans on a software engineering benchmark doesn’t mean the AI has the whole panoply of coding skills, decision-making abilities, software architecture design savvy, etc., needed to be a competent software engineer.

It’s no surprise that recent articles (below) show a growing perception of the limitations of AI benchmarks as currently conceived.

Ways forward

Perhaps we should consider developing requirements like the following before claiming human-level reasoning performance of an AI model:

  • It should be able to “explain its work” at any level of detail to another human (just as a human can), in a way that the other person can understand.
  • It should be able to give answers without “hallucinating” or “confabulating” (yes, humans can hallucinate too, but most occupations would not be well-served by an employee who hallucinates on the job).
  • It should be able to reliably and consistently (100% of the time) do things that we routinely expect a human or computer to do accurately (like adding or multiplying two numbers, for tasks such as filling out tax returns or doing the engineering calculations needed to build an airplane).
  • It should be frank and honest in assessing its level of certainty about an answer it gives (no gaslighting).
  • It should be able to solve a trivial perturbation of a given problem with the same ease as the original problem (to the same extent that a human can).
  • As someone has said, it should be able to do, without specific training, what a 5-year-old can do without specific training.
  • This one sounds good, from Emmett Shear: “AGI [artificial general intelligence] is the ability to generalize [without special training by a human] to an adversarially chosen new benchmark.”

AI models are fantastic and amazing tools—and best used when one has eyes wide open about their limitations.

Have you had problems with AI model performance? If so, please share in the comments.

References

Rethinking AI benchmarks: A new paper challenges the status quo of evaluating artificial intelligence, https://venturebeat.com/ai/rethinking-ai-benchmarks-a-new-paper-challenges-the-status-quo-of-evaluating-artificial-intelligence/

Rethink reporting of evaluation results in AI, https://www.science.org/doi/10.1126/science.adf6369, https://eprints.whiterose.ac.uk/198211/

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence, https://arxiv.org/abs/2402.09880

Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless, https://themarkup.org/artificial-intelligence/2024/07/17/everyone-is-judging-ai-by-these-tests-but-experts-say-theyre-close-to-meaningless

Why we must rethink AI benchmarks, https://bdtechtalks.com/2021/12/06/ai-benchmarks-limitations/

AI and the Everything in the Whole Wide World Benchmark, https://arxiv.org/abs/2111.15366

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices, https://arxiv.org/abs/2411.12990

Goodhart’s Law states that when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy.

Can AI models reason: Just a stochastic parrot?

OpenAI has just released its full o1 model—a new kind of model that is more capable of multi-step reasoning than previous models. Anthropic, Google and others are no doubt working on similar products. At the same time, it’s hotly debated in many quarters whether AI models actually “reason” in a way similar to humans.

Emily Bender and her colleagues famously described large language models as nothing more than “stochastic parrots”—systems that simply repeat their training data blindly, based on a statistical model, with no real understanding (reminiscent of the Chinese Room thought experiment). Others have made similar comments, describing LLMs as “n-gram models on steroids” or a “fancy extrapolation algorithm.”

There is of course some truth to this. AI models sometimes generate remarkable results yet lack certain basic aspects of understanding that might otherwise keep them from producing nonsensical results. More to the point of “parroting” the training data, recent work from Yejin Choi’s group has shown how LLMs will at times cut and paste snippets from their various training documents, almost verbatim, to formulate their outputs.

Are LLMs (just) glorified information retrieval tools?

The implication of these concerns is that an LLM can “only” repeat back what it was taught (albeit with errors). However, this view does not align with the evidence. LLM training is a compression process in which new connections between pieces of information are formed that were not present in the original data. This is evidenced both mathematically and anecdotally. In my own experience, I’ve gotten valid answers to obscure and detailed technical questions that I find hard to believe exist in any training data in exactly that form. Whether you would call this “reasoning” might be open to debate, but whatever you call it, it is something more than unadorned information retrieval by a “stochastic parrot.”

What is your experience? Let us know in the comments.

The impossible puzzle

It’s fascinating that there’s such a thing as the World Jigsaw Puzzle Championship. The winning team of the two-person thousand-piece puzzle round can assemble a Ravensburger puzzle in less than an hour—that’s about 3-1/2 seconds per piece.

It makes you wonder, how could you measure the hardness of a jigsaw puzzle? And what would a “hardest puzzle” look like?

You can find some puzzles that are claimed to be “hardest.” A first step to creating a hard puzzle is removing the front image entirely, eliminating visual cues. Some puzzles do this. This can be partly simulated with a normal puzzle by putting it together with the image side face down.

But you can do more: make every piece a square of exactly the same size, with the “tab” and “blank” on each of the four sides of every square exactly the same size, shape and alignment. This gives you 2^4 = 16 different kinds of puzzle piece; due to rotational symmetry, you actually have only six distinct kinds. Now you are left with a bare minimum of shape cues to help you assemble the puzzle.
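
A quick enumeration (a sketch) confirms the counting: each of the four sides is either a tab or a blank, giving 2^4 = 16 raw configurations, which collapse to six distinct piece types once rotations are taken into account.

from itertools import product

# Each piece has 4 sides, each a tab (1) or a blank (0): 2**4 = 16 raw configurations.
configs = set(product([0, 1], repeat=4))

def canonical(c):
    # Representative of the equivalence class of c under the four rotations.
    rotations = [c[i:] + c[:i] for i in range(4)]
    return min(rotations)

distinct = {canonical(c) for c in configs}
print(len(configs), len(distinct))   # 16 6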

For the sake of the argument, one can ignore the “edge” pieces. This is justified by the fact that, for a rectangular puzzle, the proportion of edge pieces to interior pieces becomes smaller and smaller as you scale up the size. For example, a 36 × 28 = 1008 piece puzzle has 2 × 36 + 2 × 28 - 4 = 124 edge pieces, about 12 percent of the whole. The ratio shrinks as you scale up the puzzle, and there is such a thing as a 42,000-piece jigsaw puzzle.
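
A quick check (a sketch) of how the edge fraction shrinks with puzzle size; the larger layouts below, including a 250 × 168 grid standing in for a 42,000-piece puzzle, are chosen purely for illustration.

# Fraction of edge pieces for a few rectangular m x n puzzle layouts.
for m, n in [(36, 28), (100, 80), (250, 168)]:   # 1008, 8000, 42000 pieces
    edge = 2*m + 2*n - 4
    total = m * n
    print(f"{total} pieces: {edge} edge pieces, {100*edge/total:.1f}% of the whole")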

The problem of assembling such a puzzle is known to be NP-complete. This means that the best known algorithms require compute time that grows exponentially, in the worst case, as the number of pieces increases.

Of course there are simpler special cases. Consider a checkerboard configuration, in which pieces at the “black square” locations have four tabs and those at the “red square” locations have four blanks, giving only two different kinds of piece. A competent puzzler could assemble this at one piece per second: about 17 minutes for a 1000-piece puzzle.

However, as far as we know, the number of “special cases” like this is vanishingly small for large puzzles. An intuition for this can be taken from Kolmogorov complexity theory, using a counting argument to show that only very few long strings can be compressed to a short program.

Most cases are really hard, for example a puzzle constructed by making a totally random assignment of the configuration of each piece-to-piece boundary. Solving such a puzzle is hard because you may have to backtrack: the puzzle may be largely complete when, due to ambiguities, you discover that a wrong decision made early forces you to undo much of your work.

A very weak upper bound on the assembly time can be derived by considering the brute force case. For a 1000-piece puzzle, there are 1000! (factorial) possible ways to (attempt to) assemble the puzzle. Suppose one places every piece and then fully backtracks for each trial—this gives 1000 × 1000! puzzle piece placement steps. Using Stirling’s approximation, this is approximately 10^4468 placement steps.

A human, assuming generously a rate of one step per second, at 31 million seconds/year nonstop, would take on the order of 10^4460 years. Assuming 10 billion simultaneous puzzlers—on the order of 10^4450 years.

Now assume the fastest computer on record, the ORNL Frontier system (though see here), at roughly 10 exaflops (10^19 operations per second) mixed precision, and assume one puzzle step per operation (overly optimistic). Assume further that every atom in the known universe (estimated at 10^82) is a Frontier computer system. In this case, solving the puzzle takes on the order of 10^4359 years. That’s a thousand billion billion … (repeat “billion” 484 times here) years.

As I mentioned, this is a very weak upper bound. One could instead solve the problem using a SAT solver with sophisticated CDCL backtracking, or use special methods derived from statistical mechanics that handle random SAT instances. However, to the best of our knowledge, the runtime still grows exponentially as a function of puzzle size. The best worst-case computational complexity currently known for SAT algorithms is on the order of O(1.306995^n) (see here, here, and here). So, simply by constructing puzzles in the way we’ve described with more and more pieces, you will soon reach a point of having a puzzle that is unfathomably difficult to solve.

It’s conceivable that a yet-undiscovered method could be found for which the solution cost would in fact grow polynomially instead of exponentially. But people have been trying to answer this question one way or the other for over 50 years, and it’s still unsolved, with an open million-dollar prize for a proof or counterexample. So we just don’t know.

What about quantum computing? It can solve certain problems, such as factoring, in polynomial time when the best known classical algorithms take far longer. But what about the SAT problem? It is simply not known (see here, here) whether a polynomial-time quantum algorithm exists for NP-complete problems. None is known, and it might not be possible at all (though research continues).

So we’re faced with a curious possibility: a jigsaw puzzle could be constructed for which, were you to take the box of pieces in your hand and spill it on the floor, there is no single or joint intelligence or computational mechanism, existing or even possible, in the entire universe that could ever reassemble it.

How hard is constraint programming?

I’ve been writing code for the Z3 SMT solver for several months now. Here are my findings.

Python is used here as the base language. Python/Z3 feels like a two-layer programming model—declarative code for Z3, imperative code for Python. In this it seems reminiscent of C++/CUDA programming for NVIDIA GPUs—in that case, mixed CPU and GPU imperative code. Each is a clever combination of methodologies that is surprisingly fluent and versatile, albeit not a perfectly seamless conceptual blend.

Other comparisons:

  • Both have two separate memory spaces (CUDA CPU/GPU memories for one; pure Python variables and Z3 variables for the other).
  • Both can be tricky to debug. In earlier days, CUDA had no debugger, so one had to fall back to the trusty “printf” statement (for a while it didn’t even have that!). If the code crashed, you might get no output at all. To my knowledge, Z3 has no dedicated debugger. If the problem being solved comes back as satisfiable, you can print out the discovered model variables, but if satisfiability fails, you get very little information. Like some other novel platforms, something of a “black box.”
  • In both cases, programmer productivity is well served by developing custom abstractions. I developed a Python class to manage multidimensional arrays of Z3 variables, which was a huge time saver (a minimal sketch of the idea appears below).
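
Here is a minimal sketch of that kind of helper (not the actual class referred to above, just an illustration of the idea), assuming the z3 Python package:

import z3
from itertools import product

# Sketch of a helper that manages a multidimensional array of Z3 variables,
# indexed like x[i, j].
class Z3Array:
    def __init__(self, name, shape, maker=z3.Int):
        self.shape = shape
        self.vars = {}
        for idx in product(*(range(n) for n in shape)):
            # One Z3 variable per index tuple, named e.g. "x_1_2".
            self.vars[idx] = maker(name + '_' + '_'.join(map(str, idx)))

    def __getitem__(self, idx):
        if not isinstance(idx, tuple):
            idx = (idx,)
        return self.vars[idx]

# Example use: a 3 x 3 grid of integer variables, each constrained to 0..9.
solver = z3.Solver()
x = Z3Array('x', (3, 3))
for i in range(3):
    for j in range(3):
        solver.add(x[i, j] >= 0, x[i, j] <= 9)
solver.add(x[0, 0] + x[1, 1] + x[2, 2] == 15)  # an arbitrary extra constraint
print(solver.check())                          # expect: sat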

There are differences too, of course.

  • In Python, “=” is assignment, but in Z3, one only has “==”, logical or numeric equality, not assignment per se. Variables are set once and can’t be changed—sort of a “write-once variables” programming model—as is natural to logic programming.
  • Code speed optimization is challenging. Code modifications for Z3 constraints/variables can have extreme and unpredictable runtime effects, so it’s hard to optimize. Z3 is solving an NP-complete problem after all, so runtimes can theoretically increase massively. Speedups can be massive also; one round of changes I made gave 2000X speedup on a test problem. Runtime of CUDA code can be unpredictable to a lesser degree, depending on the PTX and SASS code generation phases and the aggressive code optimizations of the CUDA compiler. However, it seems easier to “see through” CUDA code, down to the metal, to understand expected performance, at least for smaller code fragments. The Z3 solver can output statistics of the solve, but these are hard to actionably interpret for a non-expert.
  • Z3 provides many, many algorithmic tuning parameters (“tactics”), though it’s hard to reason about which ones to pick. Autotuners like FastSMT might help. There have also been some efforts to develop tools to visualize the solve process, which might help as well.

It would be great to see more modern tooling support and development of community best practices to help support Z3 code developers.

Getting some (algorithmic) SAT-isfaction

How can you possibly solve a mission-critical problem with millions of variables—when the worst-case computational complexity of every known algorithm for that problem is exponential in the number of variables?

SAT (Satisfiability) solvers have seen dramatic orders-of-magnitude performance gains for many problems through algorithmic improvements over the last couple of decades or so. The SAT problem—finding an assignment of Boolean variables that makes a given Boolean expression true—represents the archetypal NP-complete problem and in the general case is intractable.
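
As a concrete toy illustration of the problem statement, here is a tiny satisfiability check written with the Python API of the Z3 solver (the same SMT solver used in the example later in this post):

import z3

# A tiny SAT instance: (a or b) and (not a or c) and (not b or not c).
a, b, c = z3.Bools('a b c')
solver = z3.Solver()
solver.add(z3.Or(a, b), z3.Or(z3.Not(a), c), z3.Or(z3.Not(b), z3.Not(c)))

if solver.check() == z3.sat:
    print(solver.model())   # a satisfying assignment, e.g. a = True, b = False, c = True
else:
    print("unsatisfiable")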

However, for many practical problems, solutions can be found very efficiently by use of modern methods. This “killer app” of computer science, as described by Donald Knuth, has applications to many areas, including software verification, electronic design automation, artificial intelligence, bioinformatics, and planning and scheduling.

Its uses are surprising and diverse, from running billion-dollar auctions to solving graph coloring problems to computing solutions to Sudoku puzzles. As an example, I’ve included a toy code below that uses SMT, a relative of SAT, to find the English-language suffix rule for regular past-tense verbs (“-ed”) from data.

When used as a machine learning method, SAT solvers are quite different from other methods such as neural networks. SAT solvers can for some problems have long or unpredictable runtimes (though MAXSAT can sometimes relax this restriction), whereas neural networks have essentially fixed inference cost (though looping agent-based models do not).

On the other hand, answers from SAT solvers are always guaranteed correct, and the process is interpretable; this is currently not so for neural network-based large language models.

To understand better how to think about this difference in method capabilities, we can take a lesson from the computational science community. There, it is common to have a well-stocked computational toolbox of both slow, accurate methods and fast, approximate methods.

In computational chemistry, ab initio methods can give highly accurate results by solving Schrödinger’s equation directly, but only scale to limited numbers of atoms. Molecular dynamics (MD), however, relies more on approximations, but scales efficiently to many more atoms. Both are useful in different contexts. In fact, the two methodologies can cross-pollenate, for example when ab initio calculations are used to devise force fields for MD simulations.

A lesson to take from this is, it is paramount to find the best tool for the given problem, using any and all means at one’s disposal.

The following are some of my favorite general references on SAT solvers:

It would seem that unless P = NP, commonly suspected to be false, the solution of these kinds of problems for any possible input is hopelessly beyond reach of even the world’s fastest computers. Thankfully, many of the problems we care about have an internal structure that makes them much more solvable (and likewise for neural networks). Continued improvement of SAT/SMT methods, in theory and implementation, will greatly benefit the effective solution of these problems.

A toy example: find the English past tense suffix rule using Z3

import csv
import z3

def char2int(c): return ord(c) - ord('a')

def int2char(i): return chr(i + ord('a'))

# Access the language data from the file.
with open('eng_cols.txt', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    table = [row for row in reader]

nrow, ncol = len(table), len(table[0])

# Identify which columns of input table have stem and targeted word form.
stem_col, form_col = 0, 1

# Calculate word lengths.
nstem = [len(table[i][stem_col]) for i in range(nrow)]
nform = [len(table[i][form_col]) for i in range(nrow)]

# Length of suffix being sought.
ns = 2

# Initialize optimizer.
solver = z3.Optimize()

# Define variables to identify the characters of suffix; add constraints.
var_suf = [z3.Int(f'var_suf_{i}') for i in range(ns)]

for i in range(ns):
    solver.add(z3.And(var_suf[i] >= 0, var_suf[i] < 26))

# Define variables to indicate whether the given word matches the rule.
var_m = [z3.Bool(f'var_m_{i}') for i in range(nrow)]

# Loop over words.
for i in range(nrow):

    # Constraint on number of characters.
    constraints = [nform[i] == nstem[i] + ns]

    # Constraint that the form contains the stem.
    for j in range(nstem[i]):
        constraints.append(
            table[i][stem_col][j] == table[i][form_col][j]
                if j < nform[i] else False)

    # Constraint that the end of the word form matches the suffix. 
    for j in range(ns):
        constraints.append(
            char2int(table[i][form_col][nform[i]-1-j]) == var_suf[j]
                if j < nform[i] else False)

    # var_m[i] is the "and" of all these constraints.
    solver.add(var_m[i] == z3.And(constraints))

# Seek suffix that maximizes number of matches.
count = z3.Sum([z3.If(var_m[i], 1, 0) for i in range(nrow)])
solver.maximize(count)

# Run solver, output results.
if solver.check() == z3.sat:
    model = solver.model()
    suf = [model[var_suf[i]] for i in range(ns)]
    print('Suffix identified: ' +
          ''.join(list([int2char(suf[i].as_long())
                        for i in range(ns)]))[::-1])
    print('Number of matches: ' + str(model.evaluate(count)) +
          ' out of ' + str(nrow) + '.')

    var_m_values = [model[var_m[i]] for i in range(nrow)]

    print('Matches:')
    for i in range(nrow):
        if var_m_values[i]:
            print(table[i][stem_col], table[i][form_col])

The search for the perfect prompt


Anyone with more than casual experience with ChatGPT knows that prompt engineering is a thing. Minor or even trivial changes in a chatbot prompt can have significant effects, sometimes even dramatic ones, on the output [1]. For simple requests it may not make much difference, but for detailed requests it could matter a lot.

Industry leaders said they thought this would be a temporary limitation. But we are now a year and a half into the GPT-4 era, and it’s still a problem. And since the number of possible prompts grows exponentially with prompt length, it can sometimes be hard to find a good prompt for a given task.

One proposed solution is to use search procedures to automate the prompt optimization / prompt refinement process. Given a base large language model (LLM) and an input (a prompt specification, commonly with a set of prompt/answer pair samples for training), a search algorithm seeks the best form of a prompt to use to elicit the desired answer from the LLM.
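
The following is a minimal sketch of the kind of search loop such tools implement: hill climbing over candidate prompt edits, scored against a small training set. The functions here, including the scoring function, are toy stand-ins of my own and not the API of any particular tool; a real system would call the LLM inside score_prompt, which is exactly where the cost discussed below comes from.

import random

def score_prompt(prompt, train_pairs):
    # Toy stand-in objective (train_pairs is ignored here). A real tool would
    # run the LLM on each training input using `prompt` and compare the
    # answers to the expected outputs.
    return -abs(len(prompt) - 60)

def mutate(prompt, rng):
    # Trivial "edit" operator, for illustration only.
    tweaks = [" Be concise.", " Think step by step.", " Answer with one word."]
    return prompt + rng.choice(tweaks)

def search(seed_prompt, train_pairs, iters=20, seed=0):
    rng = random.Random(seed)
    best, best_score = seed_prompt, score_prompt(seed_prompt, train_pairs)
    for _ in range(iters):
        candidate = mutate(best, rng)
        s = score_prompt(candidate, train_pairs)
        if s > best_score:               # hill climbing: keep only improvements
            best, best_score = candidate, s
    return best

train = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]
print(search("Answer the question.", train))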

This approach is sometimes touted [2] as a possible solution to the problem. However, it is not without limitations.

A main one is cost. With this approach, a single search for a good prompt can take many, many trial-and-error invocations of the LLM, with cost measured in dollars, compared to the fraction of a cent that a single token of a single prompt costs. I know of one report of someone who does LLM prompting with such a tool full time for his job, at a cost of about $1,000/month (though, for certain kinds of tasks, one might alternatively seek a good prompt “template” and reuse it across many near-identical queries to save costs).

This being said, it would seem that for now (depending on budget) our best option for difficult prompting problems is to use search-based prompt refinement methods. Various new tools have come out recently (for example, [3-6]). The following is a report on some of my (very preliminary) experiences with a couple of these tools.

PromptAgent

The first is PromptAgent [5]. It’s a research code available on GitHub. The method is based on Monte Carlo tree search (MCTS), which tries out multiple chains of modification of a seed prompt and pursues the most promising. MCTS can be a powerful method, being part of the AlphaGo breakthrough result in 2016.

I ran one of the PromptAgent test problems using GPT-4/GPT-3.5 and interrupted it after it rang up a couple of dollars in charges. Looking at the logs, I was somewhat amazed that it generated long detailed prompts that included instructions to the model for what to pay close attention to, what to look out for, and what mistakes to avoid—presumably based on inspecting previous trial prompts generated by the code.

Unfortunately, PromptAgent is a research code and not fully productized, so it would take some work to adapt to a specific user problem.

DSPy

DSPy [6], on the other hand, is a finished product available for general users. DSPy is getting some attention lately not only as a prompt optimizer but also more generally as a tool for orchestrating multiple LLMs as agents. There are not many simple examples of how to use the code. The website does have an AI chatbot that can generate sample code, but the code it generated for me required significant work to get it to behave properly.

I ran with the MIPRO optimizer, which is the one best suited to prompt optimization. My experience was that the code generated many random prompt variations but did not make the kind of in-depth prompt modifications PromptAgent does. PromptAgent does one thing, prompt refinement, and so must do it well, unlike DSPy, which has multiple uses. DSPy would be well served by more powerful prompt refinement algorithms.

Conclusion

I would wholeheartedly agree that it doesn’t seem right for an LLM to be so dependent on the wording of a prompt. Hopefully, future LLMs, with training on more data and other improvements, will do a better job without the need for such lengthy trial-and-error processes.

References

[1]  “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting,” https://openreview.net/forum?id=RIu5lyNXjT

[2] “AI Prompt Engineering Is Dead,” https://spectrum.ieee.org/prompt-engineering-is-dead, March 6, 2024.

[3]  “Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing,” https://openreview.net/forum?id=OXv0zQ1umU

[4] “Large Language Models as Optimizers,” https://openreview.net/forum?id=Bb4VGOWELI

[5] “PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization,” https://openreview.net/forum?id=22pyNMuIoa

[6] “DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines,” https://openreview.net/forum?id=sY5N0zY5Od

 

Thinking by playing around

Richard Feynman’s Nobel Prize-winning discoveries in quantum electrodynamics were partly inspired by his randomly observing a spinning dinner plate one day in the cafeteria. Paul Feyerabend said regarding scientific discovery, “The only principle that does not inhibit progress is: anything goes” (within relevant ethical constraints, of course).

Ideas can come from anywhere, including physical play. Various books can improve creative discovery skills, like George Pólya’s How to Solve It, Isaac Watts’ Improvement of the Mind and W. J. J. Gordon’s Synectics, as can methodologies like mind mapping and C-K theory, to name a few. Many software products present themselves as shiny new tools promising help. However, we are not just disembodied minds interacting with a computer, but whole beings whose reasoning is interwoven with memories, life history, emotions and multisensory interaction with the world. The tactile is certainly a key avenue of learning, discovering and understanding.

Fidget toys are popular. Different kinds of toys have different semiotics with respect to how they interplay with our imaginations. Like Legos. Structured, like Le Corbusier-style architecture, or multidimensional arrays or tensors, or the snapping together of many software components with well-defined interfaces, with regular scaling from the one to many. Or to take a much older example, Tinkertoys—the analogy of the graph, interconnectedness, semi-structured but composable, like DNA or protein chains, or complex interrelated biological processes, or neuronal connections, or the wild variety within order of human language.

As creative workers, we seek ideas from any and every place to help us in what we do. The tactile, the physical, is a vital place to look.

Choosing a Computer Language for a Project

Julia. Scala. Lua. TypeScript. Haskell. Go. Dart. Various computer languages new and old are sometimes proposed as better alternatives to mainstream languages. But compared to mainstream choices like Python, C, C++ and Java (cf. Tiobe Index)—are they worth using?

Certainly it depends a lot on the planned use: is it a one-off small project, or a large industrial-scale software application?

Yet even a one-off project can quickly grow to production-scale, with accompanying growing pains. Startups sometimes face a growth crisis when the nascent code base becomes unwieldy and must be refactored or fully rewritten (or you could do what Facebook/Meta did and just write a new compiler to make your existing code base run better).

The scope of different types of software projects and their requirements is so incredibly diverse that any single viewpoint from experience runs a risk of being myopic and thus inaccurate for other kinds of projects. With this caveat, I’ll share some of my own experience from observing projects for many dozens of production-scale software applications written for leadership-scale high performance computing systems. These are generally on a scale of 20,000 to 500,000 lines of code and often require support of mathematical and scientific libraries and middleware for build support, parallelism, visualization, I/O, data management and machine learning.

Here are some of the main issues regarding choice of programming languages and compilers for these codes:

1. Language and compiler sustainability. While the lifetime of computing systems is measured in years, the lifetime of an application code base can sometimes be measured in decades. Is the language it is written in likely to survive and be well-supported long into the future? For example, Fortran, though still used and frequently supported, is a less common language thus requiring special effort from vendors, with fewer developer resources than more popular languages. Is there a diversity of multiple compilers from different providers to mitigate risk? A single provider means a single point of failure, a high risk; what happens if the supplier loses funding? Are the language and compilers likely to be adaptable for future computer hardware trends (though sometimes this is hard to predict)? Is there a large customer base to help ensure future support? Similarly, is there an adequate pool of available programmers deeply skilled in the language? Does the language have a well-featured standard library ecosystem and good support for third-party libraries and frameworks? Is there good tool support (debuggers, profilers, build tools)?

2. Related to this is the question of language governance. How are decisions about future updates to the language made? Is there broad participation from the user community and responsiveness to their needs? I’ve known members of the C++ language committee; from my limited experience they seem very reasonable and thoughtful about future directions of the language. On the other hand, some standards have introduced features that scarcely anyone ever uses—a waste of time and more clutter for the standard.

3. Productivity. It is said that programmer productivity depends on the ability of a few lines of code to express high-level abstractions that can do a lot with minimal syntax. Does the language permit this? Does the language standard make sense (coherent, cohesive) and follow the principle of least surprise? At the same time, the language should not engulf what might better be handled modularly by a library. For example, a matrix-matrix product that is bound up with the language might be highly productive for simple cases but have difficulty supporting the many variants of matrix-matrix product provided, for example, by the NVIDIA CUTLASS library. Similarly, in-language support for CUDA GPU operations would make it hard for the language to keep up with the frequent new releases of CUDA.

4. Strategic advantage. The 10X improvement rule states that an innovation is only worth adopting if it offers a 10X improvement compared to existing practice according to some metric. If switching to a given new language doesn’t bring significant improvement, it may not be worth doing. This is particularly true if there is an existing code base of some size. A related question is whether the new language offers an incremental transition path for an existing code to the new language (in many cases this is difficult or impossible).

5. Performance (execution speed). Does the language allow one to get down to bare-metal performance rather than going through costly abstractions or an interpreter layer? Are the features of the underlying hardware exposed for the user to access? Is performance predictable? Can one get a sense of the performance of each line of code just by inspection, or is this obscured by abstractions or a complex compilation process? Is just-in-time compilation or garbage collection triggered unpredictably? That can be a problem for parallel computing, where one process unexpectedly performing such an operation can cause unexpected “hangs.” Do the compiler developers provide good support for effective and accurate code optimization options? Have results from standardized, non-cherry-picked benchmarks been published (kernel benchmarks, proxy apps, full applications)?

Early adopters provide a vibrant “early alert” system for new language and tool developments that are useful for small projects and may be broadly impactful. Python was recognized early in the scientific computing community for its potential complementary use with standard languages for large scientific computations. When it comes to planning large-scale software projects, however, a range of factors based on project requirements must be considered to ensure highest likelihood of success.

 

Hallucinations of AI Science Models

AlphaFold 2, FourCastNet and CorrDiff are exciting. AI-driven autonomous labs are going to be a big deal [1]. Science codes now use AI and machine learning to make scientific discoveries on the world’s most powerful computers [2].

It’s common practice for scientists to ask questions about the validity, reliability and accuracy of the mathematical and computational methods they use. And many have voiced concerns about the lack of explainability and potential pitfalls of AI models, in particular deep neural networks (DNNs) [3].

The impact of this uncertainty varies greatly by project. Science projects that can easily check AI-generated results against ground truth may not be that concerned. High-stakes projects, like the design of a multimillion-dollar spacecraft, may ask about AI model accuracy with more urgency.

Neural network accuracy

Understanding of the properties of DNNs is still in its infancy, with many as-yet unanswered questions. However, in the last few years some significant results have started to come forth.

A fruitful approach to analyzing DNNs is to see them as function approximators (which, of course, they are). One can study how accurately DNNs approximate a function representing some physical phenomenon in a domain (for example, fluid density or temperature).

The approximation error can be measured in various ways. A particularly strong measure is “sup-norm” or “max-norm” error, which requires that the DNN approximation be accurate at every point of the target function’s domain (“uniform approximation”). Some science problems may have a weaker requirement than this, such as low RMS or 2-norm error. However, it’s not unreasonable to ask about max-norm approximation behaviors of numerical methods [4,5].
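
To make the distinction concrete, here is a small sketch comparing RMS (2-norm) error with max-norm error on a sampled grid. The target function and the “approximation” (which has a localized glitch) are arbitrary stand-ins, not taken from any of the papers cited here.

import math

# f is the target; g is a stand-in "approximation" with a localized glitch.
f = lambda x: math.sin(x)
g = lambda x: math.sin(x) + (0.5 if abs(x - 2.0) < 0.01 else 0.0)

xs = [i * 0.001 for i in range(6284)]             # grid on [0, 2*pi]
errs = [abs(f(x) - g(x)) for x in xs]

rms = math.sqrt(sum(e * e for e in errs) / len(errs))
maxnorm = max(errs)
print(f"RMS error      = {rms:.4f}")              # small: the glitch barely registers
print(f"max-norm error = {maxnorm:.4f}")          # 0.5000: the glitch dominates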

An illuminating paper by Ben Adcock and Nick Dexter looks at this problem [6]. They show that standard DNN methods applied even to a simple 1-dimensional problem can result in “glitches”: the DNN as a whole matches the function well but at some points totally misapproximates the target function. For a picture that shows this, see [7].

Other mathematical papers have subsequently shed light on these behaviors. I’ll try to summarize the findings below, though the actual findings are very nuanced, and many details are left out here. The reader is advised to refer to the respective papers for exact details.

The findings address three questions: 1) how many DNN parameters are required to approximate a function well? 2) how much data is required to train to a desired accuracy? and 3) what algorithms are capable of training to the desired accuracy?

How many neural network weights are needed?

How large does the neural network need to be for accurate uniform approximation of functions? If tight max-norm approximation requires an excessively large number of weights, then use of DNNs is not computationally practical.

Some answers to this question have been found—in particular, a result (see footnote 1) is given in [8, Theorem 4.3; cf. 9, 10]. This result shows that the number of neural network weights required to approximate an arbitrary function to high max-norm accuracy grows exponentially in the dimension of the input to the function.

This dependency on dimension is no small limitation, insofar as this is not the dimension of physical space (e.g., 3-D) but the dimension of the input vector (such as the number of gridcells), which for practical problems can be in the tens [11] or even millions or more.
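
To get a feel for this growth, here is the lower bound from footnote 1, (1/ε)^(d/(2n)), evaluated for a few arbitrarily chosen values of the accuracy ε, input dimension d and smoothness order n (the constant factor is ignored):

# Lower bound on the number of weights from footnote 1: const * (1/eps)**(d/(2*n)).
eps, n = 0.01, 2                      # target accuracy and smoothness order (illustrative)
for d in [4, 20, 100]:                # input dimension
    bound = (1.0 / eps) ** (d / (2 * n))
    print(f"d = {d:4d}: at least ~{bound:.1e} weights")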

Sadly, this rules out the practical use of DNNs for some purposes. Nonetheless, for many practical applications of deep learning, the approximation behaviors are not nearly so pessimistic as this would indicate (cf. [12]). For example, results are more optimistic:

  • if the target function has a strong smoothness property;
  • if the function is not arbitrary but is a composition of simpler functions;
  • if the training and test data are restricted to a (possibly unknown) lower dimensional manifold in the high dimensional space (this is certainly the case for common image and language modeling tasks);
  • if the average case behavior for the desired problem domain is much better than the worst case behavior addressed in the theorem;
  • if the architecture is not a multilayer perceptron with ReLU activation, as the theorem assumes; other DNN architectures may perform better (though the analysis is based on the multidimensional Taylor’s theorem, which one might conjecture applies to other architectures as well);
  • if very high accuracy is not a requirement, as is true for many practical applications;
  • if only low 2-norm error (not low max-norm error) is needed;
  • for the special case of physics-informed neural networks (PINNs), where stronger results hold.

Thus, not all hope is lost from the standpoint of theory. However, certain problems for which high accuracy is required may not be suitable for DNN approximation.

How much training data is needed?

Assuming your space of neural network candidates is expressive enough to closely represent the target function—how much training data is required to actually find a good approximation?

A result (see footnote 2) is given in [13, Theorem 2.2] showing that the number of training samples required to train to high max-norm accuracy grows, again, exponentially in the dimension of the input to the function.

The authors concede, however, that “if additional problem information about [the target functions] can be incorporated into the learning problem it may be possible to overcome the barriers shown in this work.” One suspects that some of the caveats given above might also be relevant here. Additionally, if one considers 2-norm error instead of max-norm error, the data requirement grows polynomially rather than exponentially, making the training data requirement much more tractable. Nonetheless, for some problems the amount of data required is so large that attempts to “pin down” the DNN to sufficient accuracy become intractable.

What methods can train to high accuracy?

The amount of training data may be sufficient to specify a suitable neural network. But, will standard methods for finding the weights of such a DNN be effective for solving this difficult nonconvex optimization problem?

A recent paper [14] from Max Tegmark’s group empirically studies DNN training to high accuracy. They find that as the input dimension grows, training to very high accuracy with standard stochastic gradient descent methods becomes difficult or impossible.

They also find second order methods perform much better, though these are more computationally expensive and have some difficulty also when the dimension is higher. Notably, second order methods have been used effectively for DNN training for some science applications [15]. Also, various alternative training approaches have been tried to attempt to stabilize training; see, e.g., [16].

Prospects and conclusions

Application of AI methods to scientific discovery continues to deliver amazing results, in spite of lagging theory. Ilya Sutskever has commented, “Progress in AI is a game of faith. The more faith you have, the more progress you can make” [17].

Theory of deep learning methods is in its infancy. The current findings show some cases for which use of DNN methods may not be fruitful. Continued discoveries in deep learning theory can help better guide how to use the methods effectively and inform where new algorithmic advances are needed.

Footnotes

1. Suppose the function to be approximated takes d inputs and has the smoothness property that all nth partial derivatives are continuous (i.e., it is in C^n(Ω) for compact Ω). Also suppose a multilayer perceptron with ReLU activation functions must be able to approximate any such function to max-norm error no worse than ε. Then the number of weights required is at least a fixed constant times (1/ε)^(d/(2n)).

2. Let F be the space of all functions that can be approximated exactly by a broad class of ReLU neural networks. Suppose there is a training method that can recover all these functions up to max-norm accuracy bounded by ε. Then the number of training samples required is at least a fixed constant times (1/ε)^d.

References

[1] “Integrated Research Infrastructure Architecture Blueprint Activity (Final Report 2023),” https://www.osti.gov/biblio/1984466.

[2] Joubert, Wayne, Bronson Messer, Philip C. Roth, Antigoni Georgiadou, Justin Lietz, Markus Eisenbach, and Junqi Yin. “Learning to Scale the Summit: AI for Science on a Leadership Supercomputer.” In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1246-1255. IEEE, 2022, https://www.osti.gov/biblio/2076211.

[3] “Reproducibility Workshop: The Reproducibility Crisis in ML‑based Science,” Princeton University, July 28, 2022, https://sites.google.com/princeton.edu/rep-workshop.

[4] Wahlbin, L. B. (1978). Maximum norm error estimates in the finite element method with isoparametric quadratic elements and numerical integration. RAIRO. Analyse numérique, 12(2), 173-202, https://www.esaim-m2an.org/articles/m2an/pdf/1978/02/m2an1978120201731.pdf

[5] Kashiwabara, T., & Kemmochi, T. (2018). Maximum norm error estimates for the finite element approximation of parabolic problems on smooth domains. https://arxiv.org/abs/1805.01336.

[6] Adcock, Ben, and Nick Dexter. “The gap between theory and practice in function approximation with deep neural networks.” SIAM Journal on Mathematics of Data Science 3, no. 2 (2021): 624-655, https://epubs.siam.org/doi/10.1137/20M131309X.

[7] “Figure 5 from The gap between theory and practice in function approximation with deep neural networks | Semantic Scholar,” https://www.semanticscholar.org/paper/The-gap-between-theory-and-practice-in-function-Adcock-Dexter/156bbfc996985f6c65a51bc2f9522da2a1de1f5f/figure/4

[8] Gühring, I., Raslan, M., & Kutyniok, G. (2022). Expressivity of Deep Neural Networks. In P. Grohs & G. Kutyniok (Eds.), Mathematical Aspects of Deep Learning (pp. 149-199). Cambridge: Cambridge University Press. doi:10.1017/9781009025096.004, https://arxiv.org/abs/2007.04759.

[9] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw., 94:103–114, 2017, https://arxiv.org/abs/1610.01145

[10] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations with deep relu neural networks in Ws,p norms. Anal. Appl. (Singap.), pages 1–57, 2019, https://arxiv.org/abs/1902.07896

[11] Matt R. Norman, “The MiniWeather Mini App,” https://github.com/mrnorman/miniWeather

[12] Lin, H.W., Tegmark, M. & Rolnick, D. Why Does Deep and Cheap Learning Work So Well?. J Stat Phys 168, 1223–1247 (2017). https://doi.org/10.1007/s10955-017-1836-5

[13] Berner, J., Grohs, P., & Voigtlaender, F. (2022). Training ReLU networks to high uniform accuracy is intractable. ICLR 2023, https://openreview.net/forum?id=nchvKfvNeX0.

[14] Michaud, E. J., Liu, Z., & Tegmark, M. (2023). Precision machine learning. Entropy, 25(1), 175, https://www.mdpi.com/1099-4300/25/1/175.

[15] Markidis, S. (2021). The old and the new: Can physics-informed deep-learning replace traditional linear solvers?. Frontiers in big Data, 4, 669097, https://www.frontiersin.org/articles/10.3389/fdata.2021.669097/full

[16] Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19,  https://papers.nips.cc/paper_files/paper/2006/hash/5da713a690c067105aeb2fae32403405-Abstract.html

[17] “Chat with OpenAI CEO and Co-founder Sam Altman, and Chief Scientist Ilya Sutskever,” https://www.youtube.com/watch?v=mC-0XqTAeMQ&t=250s