Lessons Learned With the Z3 SAT/SMT Solver

Community best practices can help you use a software product much more effectively. I’ve just completed a small project using the Z3 solver. Here are some things I’ve learned:

  • My project involves an optimization problem: for a subset of Boolean variables, maximize the count of how many are true. My specific problem is solved much faster with Z3 by converting it to a decision problem: set up a base problem that asks whether the count can be at least a certain fixed number, and iterate with bisection search to find the highest satisfiable count. Bisection has been used for this problem before, and certain methods may further reduce the number of bisection steps (a minimal sketch of the bisection loop appears after this list).
  • Using Z3 “tactics” can greatly speed up the solve process. I found a good combination of tactics by trial and error, guided in part by the descriptions of the tactics; ChatGPT was of some help in suggesting selections to try (a short example of composing tactics also appears after this list). An interesting paper discusses using Monte Carlo tree search to construct a good chain of tactics. The branching factor here is high, perhaps around 1000, though there is some redundancy in that number. Training multi-step MCTS might be expensive, but doing it once to get a good static chain of tactics might be worthwhile.
  • The strength of Z3 is its extremely broad functionality, more so than its raw compute performance. It would be a daunting task for the Z3 team to fully optimize every possible solve option. I examined some of the SMT solver competitions to find faster codes. On one case I tried, CVC5 was about twice as fast as Z3; I’ve seen similar reports in the literature. For now I don’t find it worth the switching costs to move to CVC5. One approach might be to use Z3’s very capable tactics engine and pass the resulting simplified problem to CVC5.
  • The specific formulation of the problem can make a big difference in solver performance. I’ve seen the same thing with iterative linear solvers: diagonal matrix scaling, for example, can dramatically help (conjugate gradients) or hurt (multigrid) solver performance. Hence the huge importance of good “preprocessing” for SAT/SMT solvers. One could wish the solver handled all of this automatically, without user intervention. For now, these powerful tools must be wielded carefully for maximum effect.
  • Generally, one should move as much of the problem as possible outside of the solver, since the solver is the long pole in the tent for scalability. For example, if a Z3 integer must be limited to a certain range and some values in that range must also be blacklisted, it’s better, if possible, to compress the valid values into a single contiguous interval, which makes the validity test in the Z3 code simpler.
  • Along these lines: the Z3 tactics for propagating constants are not perfect, so it can help to propagate constants manually (though this unfortunately makes the code messier). This propagation can also sometimes allow removal of unneeded constraints, further improving performance. Relatedly, some intriguing work by Benjamin Mikek shows how the LLVM code optimizer can be used to optimize the SMT problem in a way that is complementary to Z3 tactics, achieving significant speedups (for more info see here, here and here). I haven’t tried this, but it seems promising.
  • Because of the scalability challenge of SMT solvers, various simplifying heuristics can help. For example: solve a subproblem of the main problem and hold the resulting variables fixed while solving the rest, or solve a smaller, simpler problem first to determine variable presets for the full problem. With these heuristics one does not in general find the true global optimum, but the result may be adequate.
  • CPU threading does not help for my case (Z3 Python, macOS). Perfect parallelization of SAT and SMT solving is an unsolved (and perhaps in some sense not fully solvable) problem. One can naïvely parallelize bisection search by converting it to trisection and so on, but this does not give perfect speedup (specifically, only about log(P) speedup on P threads). Improvements to parallel bisection may be possible in some cases. Recent work by Armin Biere and colleagues looks promising; as I read it, they achieve near-perfect speedup up to eight threads, at least for some problems.
  • Some of the main developers of Z3 are on Stack Overflow and have been active in the past answering questions. This seems very useful.
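
Here is a minimal sketch of the bisection approach mentioned in the first bullet, using the Z3 Python API. The function and variable names are mine and the base constraints are a placeholder; the point is the shape of the loop: a decision query (“can at least k of these Booleans be true?”) inside a bisection search.

    from z3 import Bool, Solver, AtLeast, sat

    def max_true_count(base_constraints, xs):
        """Largest k such that the base constraints plus 'at least k of xs are true' is satisfiable."""
        s = Solver()
        s.add(*base_constraints)
        lo, hi, best = 1, len(xs), 0
        while lo <= hi:
            mid = (lo + hi) // 2
            s.push()
            s.add(AtLeast(*xs, mid))      # decision version of the optimization problem
            if s.check() == sat:
                best, lo = mid, mid + 1   # count mid is achievable; look higher
            else:
                hi = mid - 1              # count mid is infeasible; look lower
            s.pop()
        return best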
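
And here is a small example of composing tactics in the Z3 Python API, as mentioned in the second bullet. The particular chain below is illustrative only, not a recommendation for any specific problem; Then() composes tactics, and .solver() turns the composite tactic into a solver.

    from z3 import Then, Ints

    # Illustrative tactic chain; the right chain depends heavily on the problem.
    tactic = Then('simplify', 'propagate-values', 'solve-eqs', 'smt')
    s = tactic.solver()                   # a solver that preprocesses with the chain above

    x, y = Ints('x y')
    s.add(x + y == 10, x > 7)
    print(s.check(), s.model())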

Resources like Handbook of Satisfiability and the proceedings of various SAT/SMT conferences seem helpful. More information on best practices for non-expert practitioners would be a great help to the community. If you know of any good resources, please share in the comments.

Colossus versus El Capitan: A Tale of Two Supercomputers

Colossus

The xAI Colossus supercomputer contains 100,000 NVIDIA H100 GPUs. Upgrades are planned, ultimately up to as many as a million GPUs. The H100 has a theoretical peak speed of at least 60 teraFLOPS (FP64 tensor core), though the actual number depends on the power and frequency cap settings on the GPUs. Admittedly, FP64 is overkill for Colossus’ intended use of AI model training, though it is required for most scientific and engineering applications on typical supercomputers. This would put Colossus nominally at a theoretical peak speed of 6 Exaflops at full FP64 precision for dense matrix multiplies.

El Capitan

El Capitan at Lawrence Livermore National Lab now ranks as the #1 fastest system in the world on the TOP500 list, having recently taken the crown from Frontier at Oak Ridge National Lab. Both Frontier and El Capitan were procured under the same collaborative CORAL-2 project by the two respective laboratories. El Capitan uses AMD Instinct MI300A GPUs, for a theoretical peak speed of 2.746 Exaflops.

Which system is fastest?

You may wonder about the discrepancy: Colossus has more raw FLOPs, while El Capitan is ranked #1. Which system is actually faster? For decades, top system performance for the TOP500 has commonly been measured using the High Performance Linpack (HPL) benchmark. Some have expressed concerns that HPL is an unrepresentative “FLOPs-only” benchmark. However, HPL actually measures more than the raw rate of floating point operations. HPL performs distributed matrix products on huge matrices that become smaller and smaller as the run proceeds, with a serial dependency between successive matrix multiplies. Near the end of the run, performance becomes limited by network latency, requiring excellent network performance. Furthermore, HPL is also a system stability test, since the system (often made up of brand new hardware from which bad parts must be weeded out) must stay up for a period of hours without crashing and yield a correct answer at the end (my colleague Phil Roth gives a description of this ordeal for Frontier). In short, a system could have lots of FLOPs yet fail these basic tests of being able to run a nontrivial application.

Some commercial system owners may choose not to submit an HPL number, for whatever reason (though Microsoft submitted one and currently has a system at #4). In some cases submitting a TOP500 number may not be a mission priority for the owner. Or the system may not have an adequate network or the requisite system stability to produce a good number, in spite of having adequate FLOPs. Companies don’t typically give reasons for not submitting, but their reasons can be entirely valid, and not submitting a number has certainly happened before.

How long to build a system?

You may also wonder how it is that Colossus was stood up in 122 days (indeed a remarkable achievement by a highly capable team) whereas the CORAL-2 Project, which delivered El Capitan and Frontier, spanned multiple years.

Put simply, a system like Colossus stands on the shoulders of many multi-year investments in vendor hardware and software under projects like CORAL-2. Years ago, Oak Ridge National Lab put NVIDIA on the map for supercomputing with Titan, an early NVIDIA-powered petascale supercomputer. Some of the core NVIDIA software in use today was developed in part under this multi-year Titan project. Similarly for AMD and CORAL-2. Many systems, including Colossus, have benefited from these long-term, multi-year investments.

Another reason has to do with intended workloads of the respective systems. Colossus is intended primarily for AI model training; even though model architecture variations have slightly different computational patterns, the requirements are similar. El Capitan on the other hand is a general purpose supercomputer, and as such must support many different kinds of science applications with highly diverse requirements (and even more so at other centers like OLCF, ALCF and NERSC) (on system requirements, application diversity and application readiness see here, here and here). It’s much harder to put together a system to meet the needs of such a variety of science projects.

Conclusion

Colossus and El Capitan are both highly capable systems that will provide millions of node-hours of compute for their respective projects. Colossus has a high flop rate to support the reduced-precision matrix multiplies (and presumably high network bandwidth for Allreduce) required for AI model training. El Capitan has a balanced architecture to support a wide variety of science applications at scale.

ADDENDUM: Colossus is now up to 200K GPUs.

Unicode surrogates

At the highest level, Unicode is simply a list of symbols. But when you look closer you find that isn’t entirely true. Some of the symbols are sorta meta symbols. And while a list of symbols is not complicated, this list is adjacent to a lot of complexity.

I’ve explored various rabbit holes of Unicode in previous posts, and this time I’d like to go down the rabbit hole of surrogates. These came up in the recent post on how LLMs tokenize Unicode characters.

Incidentally, every time I hear “surrogates” I think of the Bruce Willis movie by that name.

Historical background

Once upon a time, text was represented in ASCII, and life was simple. Letters, numbers, and a few other symbols all fit into a single eight-bit byte with one bit left over. There wasn’t any need to distinguish ASCII and its encoding because they were synonymous. The nth ASCII character was represented the same way as the number n.

The A in ASCII stands for American, and ASCII was sufficient for American English, sorta. Then people played various tricks with the leftover bit to extend ASCII by adding 128 more symbols. This was enough, for example, to do things like include the tilde on the n in jalapeño, but representing anything like Russian letters, math symbols, or Chinese characters was out of the question.

So along came Unicode. Instead of using one byte to represent characters, Unicode used two bytes. This didn’t double the possibilities; it squared them. Instead of 128 ASCII symbols, or 256 extended ASCII symbols, there was the potential to encode 2^16 = 65,536 symbols. Surely that would be enough! Except it wasn’t. Soon people wanted even more symbols. Currently Unicode defines 154,998 characters.

Encoding

The most obvious way to represent Unicode characters, back when there were fewer than 2^16 characters, was as a 16-bit number, i.e. to represent each character using two bytes rather than one. The nth Unicode character could be represented the same way as the number n, analogous to the case of ASCII.

But while text could potentially require thousands of symbols, a lot of text can be represented just fine with ASCII. Using two bytes per character would make such text take up twice as much memory. UTF-8 encoding was a very clever solution to this problem. Pure ASCII (i.e. not extended) text is represented exactly as in ASCII, taking up no more memory. And if text is nearly all ASCII with occasional non-ASCII characters sprinkled in, the size of the text in memory increases only in proportion to the number of non-ASCII characters. Nearly pure ASCII text takes up nearly the same amount of memory.
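
A quick check of this in Python (the example strings are mine):

    >>> len("hello".encode("utf-8"))     # pure ASCII: one byte per character
    5
    >>> len("jalapeño".encode("utf-8"))  # ñ takes two bytes; everything else one
    9
    >>> len("🚀".encode("utf-8"))        # a code point above U+FFFF takes four bytes
    4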

But the cost of UTF-8 is encoding. The bit representation of the nth character is no longer necessarily the bit representation of the integer n.

UTF-16, while wasteful when representing English prose, is a direct encoding. Or at least it was until Unicode spilled over its initial size limit. You need more than a single pair of bytes to represent more than 2^16 characters.

Surrogate pairs

The way to represent more than 2^16 symbols with 16-bit characters is to designate some of the “characters” as meta characters. Each of these special characters represents half of a pair of 16-bit units that together correspond to a single Unicode character.

There are 1024 characters reserved as high surrogates (U+D800 through U+DBFF) and 1024 reserved as low surrogates (U+DC00 through U+DFFF). High surrogates contain the high-order bits and low surrogates contain the low-order bits.

Let n ≥ 2^16 be the code point of a character. Then the last 10 bits of the high surrogate are the highest 10 bits of n − 2^16, and the last 10 bits of the low surrogate are the lowest 10 bits of n.

In terms of math rather than bit representations, the pair of surrogates (H, L) represents code point

2^16 + 2^10 (H − 55296) + (L − 56320)

where 55296 = D800_hex and 56320 = DC00_hex.

Example

The rocket emoji has Unicode value U+1F680. The bit pattern representing the emoji in UTF-16 is D83DDE80_hex, the combination of high surrogate U+D83D and low surrogate U+DE80. We can verify this with the following Python.

    >>> H, L = 0xD83D, 0xDE80
    >>> 2**16 + 2**10*(H - 0xD800) + L - 0xDC00 == 0x1F680
    True

When we write out the high and low surrogates in binary we can visualize how they contain the high and low bits of the rocket codepoint.

    H = D83D = 110110 0000111101
    L = DE80 = 110111 1010000000

    1F680 − 10000 = F680 = 0000111101 1010000000
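
Going the other direction, here is a small Python function (the name is mine) that splits a code point above FFFF_hex into its surrogate pair:

    >>> def to_surrogates(n):
    ...     m = n - 0x10000
    ...     return hex(0xD800 + (m >> 10)), hex(0xDC00 + (m & 0x3FF))
    ...
    >>> to_surrogates(0x1F680)
    ('0xd83d', '0xde80')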

High private use surrogates

The high surrogate characters U+D800 through U+DBFF are divided into two ranges. The range U+D800 through U+DB7F are simply called high surrogates, but the remaining characters U+DB80 through U+DBFF are called “high private use surrogates.”

These private use surrogates do not work any differently than the rest. They just correspond to code points in Unicode’s private use area.

On Making Databases Run Faster

Database technology is a mature field, and techniques for optimizing databases are well understood. However, surprises can still happen.

Certain performance optimizations you might expect to be automatic are not. I’m working with a legacy code developed some time ago, before modern notions of separation of concerns between business logic and data storage. The code runs slower than you’d expect, and some have wondered why.

Profiling of the code revealed that the slowdown was not in the core computation, but rather in the reading and writing of the backend database, which occurs frequently when this code executes.

My first thought was to run with the database on a RAM disk, which would give higher bandwidth and lower latency than spinning disk or SSD. This helped a little, but not much.

As a short-term fix I ended up writing code for in-memory hash tables as an interposer between the code and the database. These cache commonly accessed values and thus reduce database accesses (a minimal sketch of the idea is below).
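
Here is a minimal Python sketch of the caching idea, assuming a simple key-value read path. The table and function names are hypothetical, and functools.lru_cache stands in for the hand-written hash tables described above; the real code’s schema and queries are of course different.

    import sqlite3
    from functools import lru_cache

    conn = sqlite3.connect(":memory:")          # stand-in for the real backend database
    conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")
    conn.execute("INSERT INTO settings VALUES ('mode', 'fast')")

    @lru_cache(maxsize=None)                    # in-memory cache interposed before the database
    def get_setting(key):
        row = conn.execute("SELECT value FROM settings WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    print(get_setting("mode"))   # first call reads the database
    print(get_setting("mode"))   # repeat calls are served from the cache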

I would’ve thought high-speed RAM caching of values would be default behavior for a database manager. A principle of interface design is to make the defaults as useful as possible. But in this case apparently not.

Thankfully my fix gave over 10X speedup in application launch time and 6X speedup in the core operation of the code.

The project team is moving toward SQLite for database management in the near future. SQLite has perhaps a dozen or so available methods for optimizations like this. However, early experiments with SQLite for this case show that more fundamental structural code modifications will also be needed to improve database access patterns.

As with general code optimization, sometimes you’d be inclined to think the system (compiler, database manager, etc.) will “do it all for you.” But often not.

Code Profiling Without a Profiler

Making your code run faster starts with understanding where in the code the runtime is actually spent. But suppose, for whatever reason, the code profiling tools won’t work?

I recently used MS Visual Studio on a legacy C++ code. The code crashed shortly after startup when attempting to profile, though otherwise the code ran fine for both release and debug build targets. The cause of the problem was not immediately visible.

If all else fails, using manual timers can help. The idea is to find a high-accuracy system wallclock timer function and use this to read the time before and after some part of the code you want to time. One can essentially apply “bisection search” to the code base to look for the code hot spots. See below for an example.

This can be useful in various situations. Codes in complex languages (or even mixed languages in the code base) can have unusual constructs that break debuggers or profilers. Also, exotic hardware like embedded systems, GPUs or FPGAs may lack full profiler support. Additionally, brand new hardware releases often lack mature tool support, at least initially.

Furthermore, profiling tools themselves, though helpful for getting a quick snapshot of the performance breakdown of each function in the code, have their own limitations. Profilers work either by instrumenting the executable code or by sampling. Instrumenting can cause timing inaccuracies by adding overhead from calling the system timer on entry to and exit from every function called. It also breaks function inlining, often reducing performance.

Sampling on the other hand can be inaccurate if the sample rate is too low, or can distort runtime when sampling at too high a frequency. In contrast, manual timers can circumvent these problems by a very surgical application to specific parts of the code (though some profilers let you turn the profiler on and off at different parts of the code).

Resorting to manual timing of code sections is a messy business. But sometimes it’s the only thing that will work.

Visual Studio C++ Code Example

// mycode.h

#include "cstdio"
#include "cstdarg"

// Get time of day - elapsed seconds
static double gtod() {   
    LARGE_INTEGER ctr, freq;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&ctr);
    return static_cast<double>(ctr.QuadPart) / static_cast<double>(freq.QuadPart);
}   
    
// Convenience function for printing to file
static void FilePrintf(const char* format, ...) {   
    char buffer[1024];
    va_list args;
    va_start(args, format);
    vsnprintf(buffer, sizeof(buffer), format, args);
    va_end(args);
    FILE* myoutfile = fopen("mytimingsfile.txt", "a");
    fprintf(myoutfile, "%s", buffer);
    fclose(myoutfile);
}   
    
// Storage for timer
extern double g_timer;

// mycode.cpp

#include "mycode.h"

// Initialization for timer
double g_timer = 0;

int main() {

    // ...
    g_timer = 0;

    for (int i = 0; i < n; ++i) {   // n = loop count, defined elsewhere in the real code
        // ...
        const double t1 = gtod();
        my_expensive_function();
        g_timer += gtod() - t1;
        // ...
    }

    FilePrintf("my_expensive_function runtime: %.6f seconds.\n", g_timer);
    g_timer = 0;

    // ...

    return 0;
}

Rich Sutton’s bitter lesson of AI

This morning I read Alberto Romero’s article Grok 3: Another Win For The Bitter Lesson. I wasn’t clear on what exactly the bitter lesson was, so I followed the link to Romero’s previous article GPT-4: The Bitterer Lesson, which led to Rich Sutton’s original article The Bitter Lesson.

Sutton opens his article by saying

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.

Clear enough, except you have to read further to get the “compared to what.” I asked Grok 3 to read the article and tell me exactly what the bitter lesson is. It begins with this explanation:

The “Bitter Lesson” is the principle that, in the field of artificial intelligence, the most effective and lasting progress has historically come from methods that rely on leveraging ever-increasing computational power rather than depending on human ingenuity to craft domain-specific knowledge or shortcuts. …

and concludes with the following pithy summary of its summary:

In essence, the Bitter Lesson is: computation trumps human-crafted specialization in AI, and embracing this, though humbling, is key to future success.

Sutton supports his thesis with examples from chess, go, speech recognition, and computer vision.

In some sense this is the triumph of statistics over logic. All else being equal, collecting enormous amounts of data and doing statistical analysis (i.e. training an AI model) will beat out rule-based systems. But of course all else is often not equal. Logic is often the way to go, even on big problems.

When you do know what’s going on and can write down a (potentially large) set of logical requirements, exploiting that knowledge via techniques like SAT solvers and SMT is the way to go. See, for example, Wayne’s article on constraint programming. But if you want to build the Everything Machine, building software that wants to be all things to all people, statistics is the way to go.

Sometimes you have a modest amount of relevant data, and you don’t have 200,000 Nvidia H100 GPUs. In that case you’re much better off with classical statistics than AI. Classical models are more robust and easier to interpret. They are also not subject to hastily written AI regulation.

DeepSeek-R1: Do we need less compute now?

 

The reactions to the new DeepSeek-R1 AI model in recent days seem limitless. Some say it runs so much faster than existing models that we will no longer need the billions of dollars in compute hardware that big tech is preparing to buy.

Is that plausible?

To get an answer, we need only look back at the experience of the recently completed Exascale Computing Project. This large-scale multi-lab project was tasked with developing technology (primarily software) to prepare for exascale computing, which has recently been achieved by Frontier, Aurora and El Capitan.

During the course of the project, the science teams discovered various algorithm and implementation improvements, leading to as much as 60X speedup or more, over and above the speedups possible from hardware alone [1]. In response, are the teams just running the very same problems faster on older hardware? No — instead, they are now able to run much, much larger problems than previously possible, exploiting both hardware and software improvements.

Or suppose today there were no such thing as the fast Fourier transform (FFT) and scientists were computing Fourier transforms using (essentially) large dense matrix-vector products. If someone then discovered the FFT, I’d guarantee you that scientists would not only say, (1) “Wow, now I can run my existing problems much, much faster,” but also, (2) “Wow, now I can run problems much larger than I ever dreamed and solve problems larger than I could have ever imagined!”

Paradoxically, faster algorithms might even increase the demand for newer, faster hardware. For example, a new faster algorithm for designing medications to cure cancer might be judged so important that it’s worth building the largest machine possible to run it effectively.

All this is not to say whether you should buy or sell Nvidia stock right now. However, it does mean that there is no simplistic argument that faster algorithms and implementations necessarily lead to lower spend on computing hardware. History shows that sometimes this is not true at all. The smart money, on the other hand, is on research teams that are able to exploit any and every new discovery to improve what is possible with their codes, whether by hardware, data, code optimizations or algorithms.

Notes

[1] See slide 9 from Doug Kothe’s talk, “Exascale and Artificial Intelligence: A Great Marriage“. The “Figure of Merit” (FOM) number represents speedup of science output from an application compared to an earlier baseline system. Specifically, a FOM speedup of 50X is the anticipated speedup from baseline due to efficient use of hardware only, for example, on Frontier compared to the earlier OLCF Titan system.

Tricks for radix conversion by hand

The simplest trick for converting from one base to another is grouping. To convert between base b and base b^k, group digits in sets of k and convert one group at a time. To convert from binary to octal, for instance, group bits in sets of three, starting from the right end, and convert each group to octal.

11110010_two → (11)(110)(010) → 362_eight

For an example going the other direction, let’s convert 476 in base nine to base three.

476_nine → (11)(21)(20) → 112120_three
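
Here is a small Python sketch of the grouping trick (the function name is mine), going from base b to base b^k; for simplicity it assumes the target base is at most ten, so each group maps to a single digit character. It reproduces the binary-to-octal example above.

    def group_convert(digits, b, k):
        """Convert a digit string in base b to base b**k by grouping k digits at a time."""
        ds = [int(d) for d in digits]
        ds = [0] * (-len(ds) % k) + ds          # pad on the left to a multiple of k digits
        out = []
        for i in range(0, len(ds), k):          # each group of k digits becomes one digit
            value = 0
            for d in ds[i:i + k]:
                value = value * b + d
            out.append(value)
        return "".join(str(v) for v in out)

    print(group_convert("11110010", 2, 3))      # 362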

In general, conversion between bases is too tedious to do by hand, but one important case where it’s a little easier than it could be is converting between decimal and octal. In combination with the grouping trick above, this means you could, for example, convert between decimal and binary by first converting decimal to octal. Then the conversion from octal to binary is trivial.

The key to converting between decimal and octal is to exploit the fact that 10 = 8 + 2, so powers of 10 become powers of (8 + 2), or powers of 8 become powers of (10 − 2). These tricks are easier to carry out than to explain. You can find descriptions and examples in Knuth’s TAOCP, volume 2, section 4.4.

Knuth cites a note by Walter Soden from 1953 as the first description of the trick for converting octal to decimal.

The trick for moving between base 9 and base 10 (or by grouping, between base 3 and base 10) is simpler and left as an exercise by Knuth. (Problem 12 in section 4.4, with a solution given at the back of the volume.)


Double rounding

The previous post started by saying that rounding has a surprising amount of detail. An example of this is double rounding: if you round a number twice, you might not get the same result as if you rounded directly to the final precision.

For example, let’s say we’ll round numbers ending in 0, 1, 2, 3, or 4 down, and numbers ending in 5, 6, 7, 8, or 9 up. Then if we have a number like 123.45 and round it to one decimal place we have 123.5, and if we round that to an integer we have 124. But if we had rounded 123.45 directly to an integer we would have gotten 123. This is not a mere curiosity; it comes up fairly often and has been an issue in lawsuits.
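
Python’s decimal module makes it easy to see this example concretely; ROUND_HALF_UP implements the “round fives up” rule described above.

    from decimal import Decimal, ROUND_HALF_UP

    x = Decimal("123.45")
    direct = x.quantize(Decimal("1"), rounding=ROUND_HALF_UP)       # 123
    step1  = x.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)     # 123.5
    twice  = step1.quantize(Decimal("1"), rounding=ROUND_HALF_UP)   # 124
    print(direct, twice)   # 123 124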

The double rounding problem cannot happen in odd bases. So, for example, if you have some fraction represented in base 7 and you round it first from three figures past the radix point to two, then from two to one, you’ll get the same result as if you directly rounded from three figures to one. Say we start with 4231.243_seven. If we round it to two places we get 4231.24_seven, and if we round again to one place we get 4231.3_seven, the same result we would get by rounding directly from three places to one.

The reason this works is that you cannot represent ½ by a finite expression in an odd base.

A magical land where rounding equals truncation

Rounding numbers has a surprising amount of detail. It may seem trivial but, as with most things, there is a lot more to consider than is immediately obvious. I expect there have been hundreds if not thousands of pages devoted to rounding in IEEE journals.

An example of the complexity of rounding is what William Kahan called The Tablemaker’s Dilemma: there is no way in general to know in advance how accurately you’ll need to compute a number in order to round it correctly.

Rounding can be subtle in any number system, but there is an alternative number system in which it is a little simpler than in base 10. It’s base 3, but with a twist. Instead of using 0, 1, and 2 as “digits”, we use −1, 0, and 1. This is known as the balanced ternary system: ternary because of base 3, and balanced because the digits are symmetrical about 0.

We need a symbol for −1. A common and convenient choice is to use T. Think of moving the minus sign from in front of a 1 to on top of it. Now we could denote the number of hours in a day as 10T0 because

1 \times 3^3 + 0 \times 3^2 + (-1)\times 3 + 0 = 24

A more formal way of describing the balanced ternary representation of a number x is as a set of coefficients t_k such that

x = \sum_{k=-\infty}^\infty t_k 3^k

with the restriction that each t_k is in the set {−1, 0, 1}.
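
As a quick check of the 10T0 example above, here is a short Python sketch (the function name is mine) that converts an integer to its balanced ternary digits:

    def balanced_ternary(n):
        """Digits of integer n in balanced ternary, using T for -1."""
        if n == 0:
            return "0"
        digits = ""
        while n != 0:
            r = n % 3
            if r == 2:                 # a digit of 2 becomes -1 plus a carry
                r = -1
                n += 3
            digits = {1: "1", 0: "0", -1: "T"}[r] + digits
            n //= 3
        return digits

    print(balanced_ternary(24))        # 10T0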

Balanced ternary representation has many interesting properties. For example, positive and negative numbers can all be represented without a minus sign. See, for example, Brian Hayes’ excellent article Third Base. The property we’re interested in here is that to round a balanced ternary number to the nearest integer, you simply lop off the fractional part. Rounding is the same as truncation. To see this, note that the largest possible fractional part is a sequence of all 1s, which represents ½:

\frac{1}{3} + \frac{1}{3^2} + \frac{1}{3^3} + \cdots = \frac{1}{2}

Similarly, the most negative possible fractional part is a sequence of all Ts, which represents −½. So unless the fractional part is exactly equal to ½, truncating the fractional part rounds to the nearest integer. If the fractional part is exactly ½ then there is no nearest integer but two integers that are equally near.
