Colossus versus El Capitan: A Tale of Two Supercomputers

Colossus

The xAI Colossus supercomputer contains 100,000 NVIDIA H100 GPUs. Upgrades are planned, ultimately to as many as a million GPUs. The H100 has a theoretical peak speed of at least 60 teraFLOPs (FP64 tensor core), though the actual number depends on the power and frequency cap settings on the GPUs. Admittedly, FP64 is overkill for Colossus’ intended use of AI model training, though it is required for most scientific and engineering applications on typical supercomputers. Multiplying 100,000 GPUs by 60 teraFLOPs each puts Colossus nominally at a theoretical peak speed of 6 Exaflops at full FP64 precision for dense matrix multiplies.
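The arithmetic behind that nominal figure can be sketched in a few lines. This is only a back-of-the-envelope calculation using the GPU count and per-GPU rate quoted above. The function name is made up for illustration, and the result is a theoretical peak, not a measured speed.

```python
def peak_exaflops(num_gpus: int, tflops_per_gpu: float) -> float:
    """Nominal aggregate peak in exaFLOPs (hypothetical helper).

    One exaFLOP is 1e6 teraFLOPs, so the total is simply
    GPU count times per-GPU teraFLOPs, divided by 1e6.
    """
    return num_gpus * tflops_per_gpu / 1e6


# Figures quoted in the text: 100,000 H100s at 60 teraFLOPs (FP64 tensor core).
colossus_peak = peak_exaflops(100_000, 60.0)
print(f"Colossus nominal FP64 peak: {colossus_peak:.1f} exaFLOPs")
```

Because the per-GPU rate depends on power and frequency caps, the same function can be rerun with a different `tflops_per_gpu` to see how sensitive the total is to that setting.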

El Capitan

El Capitan at Lawrence Livermore National Lab now ranks as the #1 fastest system in the world on the TOP500 list, having recently taken the crown from Frontier at Oak Ridge National Lab. Both Frontier and El Capitan were procured under the same collaborative CORAL-2 project by the two respective laboratories. El Capitan uses AMD Instinct MI300A GPUs for a theoretical peak speed of 2.746 Exaflops.

Which system is fastest?

You may wonder about the discrepancy: Colossus has more raw FLOPs, while El Capitan is ranked #1. Which system is actually faster? For decades, the TOP500 list has measured top system performance using the High Performance Linpack (HPL) benchmark. Some have expressed concerns that HPL is an unrepresentative “FLOPs-only” benchmark. However, HPL actually measures more than the raw rate of floating point operations. HPL performs distributed matrix products on huge matrices that become smaller and smaller during the run, with a serial dependency between successive matrix multiplies. Near the end of the run, performance becomes heavily limited by network latency, so a good result requires excellent network performance. Furthermore, HPL is also a system stability test, since the system (often made up of brand new hardware for which bad parts must be weeded out) must stay up for a period of hours without crashing and at the end yield a correct answer (my colleague Phil Roth gives a description of this ordeal for Frontier). In short, a system could have lots of FLOPs but fail these basic tests of being able to run a nontrivial application.
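To make the “smaller and smaller” point concrete, here is a rough sketch of how the matrix-multiply work shrinks from step to step. It assumes, as background not spelled out above, that HPL solves a dense linear system by blocked LU factorization, so each step ends with one matrix multiply that updates the remaining (trailing) submatrix. The function name, the matrix size n and the block size nb are illustrative. The count covers only the trailing update and ignores panel factorization, pivoting and communication.

```python
def trailing_update_flops(n: int, nb: int) -> list[tuple[int, int]]:
    """Return (trailing_size, flops) for each step of a blocked LU sweep.

    Simplified model: at each step a panel of width b is eliminated and the
    remaining m x m trailing submatrix is updated by an (m x b) @ (b x m)
    matrix multiply, which costs about 2 * m * m * b flops.
    """
    steps = []
    for k in range(0, n, nb):
        b = min(nb, n - k)   # width of the current panel
        m = n - k - b        # size of the trailing submatrix still to update
        steps.append((m, 2 * m * m * b))
    return steps


# Toy sizes for illustration; real HPL runs use far larger matrices.
for size, flops in trailing_update_flops(n=8, nb=2):
    print(f"trailing matrix {size} x {size}: {flops} flops")
```

Each step has to wait for the previous one, and the multiplies late in the run are too small to keep the GPUs busy. Under this model, communication delays account for a growing share of the time as the run nears its end.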

Some commercial system owners may choose not to submit an HPL number, for whatever reason (though Microsoft submitted one and currently has a system at #4). In some cases submitting a TOP500 number may not be a mission priority for the owner. Or the system may not have an adequate network or the requisite system stability to produce a good number, in spite of having adequate FLOPs. Companies don’t typically give reasons for not submitting, but their reasons can be entirely valid, and not submitting a number has certainly happened before.

How long to build a system?

You may also wonder how it is that Colossus was stood up in 122 days (indeed a remarkable achievement by a highly capable team) whereas the CORAL-2 Project, which delivered El Capitan and Frontier, spanned multiple years.

Put simply, a system like Colossus stands on the shoulders of many multi-year investments in vendor hardware and software under projects like CORAL-2. Years ago, Oak Ridge National Lab put NVIDIA on the map for supercomputing with Titan, the first NVIDIA-powered petascale supercomputer. Some of the core NVIDIA software in use today was developed in part under this multi-year Titan project. The same is true of AMD and CORAL-2. Many systems, including Colossus, have benefitted from these long-term investments.

Another reason has to do with the intended workloads of the respective systems. Colossus is intended primarily for AI model training; even though different model architectures have slightly different computational patterns, their requirements are similar. El Capitan, on the other hand, is a general purpose supercomputer, and as such must support many different kinds of science applications with highly diverse requirements. This is even more true at other centers like OLCF, ALCF and NERSC (on system requirements, application diversity and application readiness see here, here and here). It’s much harder to put together a system to meet the needs of such a variety of science projects.

Conclusion

Colossus and El Capitan are both highly capable systems that will provide millions of node-hours of compute for their respective projects. Colossus has a high FLOP rate to support the reduced precision matrix multiplies (and presumably high network bandwidth for Allreduce) required for AI model training. El Capitan has a balanced architecture to support a wide variety of science applications at scale.

ADDENDUM: Colossus is now up to 200K GPUs.