Avoiding Multiprocessing Errors in Bash Shell

 

Suppose you have two Linux processes trying to modify a file at the same time and you don’t want them stepping on each other’s work and making a mess.  A common solution is to use a “lock” mechanism (a.k.a. “mutex”). One process “locks the lock” and by this action has sole ownership of a resource in order to make updates, until it unlocks the lock to allow other processes access.

Writing a custom lock in Linux bash shell is tricky. Here’s an example that DOESN’T work right:

#!/bin/bash
let is_locked=1 # helper variable to denote locked state
mylockvariable=$(cat mylockfile 2>/dev/null)  # read the lock
while [ "$mylockvariable" = "$is_locked" ]  # loop while locked, i.e., until unlocked
do
    sleep 5 # wait 5 seconds to try again 
    mylockvariable=$(cat mylockfile 2>/dev/null)  # read again
done
echo $is_locked > mylockfile  # lock the lock
# >>> do critical work safely here <<<
# >>> ERROR: NOT SAFE <<<
rm mylockfile  # unlock the lock

Here the lock value is stored in a shared resource, the file “mylockfile”. If the file exists and contains the character “1”, the lock is considered locked; otherwise, it is considered unlocked.  The code will loop until the lock is unlocked, then acquire the lock, do the required single-process work, and then release the lock.

However, this code can fail without warning: suppose two processes A and B execute this code concurrently. Initially the lock is in an unlocked state. Process A reads the lockfile. Then suppose immediately after this, Process A is temporarily interrupted, perhaps to give CPU cycles to run Process B. Then, suppose Process B begins, reads the lock, locks the lock and starts doing its critical work. Suppose now Process B is put into wait state and Process A is restarted. Process A, since it previously read the lockfile, wrongly believes the lock is unlocked, thus proceeds to also lock the lock and do the critical work—resulting in a mess.

This is an example of a classic race condition, in which the order of execution of threads or processes can affect the final outcome of execution.

A solution to this conundrum is found in the excellent book Unix Power Tools [1,2]. It is a hefty tome but very accessibly written, and for many readers well worth a read-through to pick up a slew of time-saving tips.

The problem with the example code is that it does not read and set the lock in a single, indivisible (atomic) operation. Here’s a trick to do it:

#!/bin/bash
until (umask 222; echo > mylockfile) 2>/dev/null  # check and lock
do  # keep trying if failed
    sleep 5 # wait 5 seconds to try again 
done
# >>> do critical work safely here <<<
rm -f mylockfile  # unlock the lock

Here, the existence of the lockfile itself is the indicator that the lock is set. Because of the umask, the lockfile is created read-only, so if the file already exists, the attempt to open it for writing fails, triggering the loop to keep trying. The check-and-create happens inside a single open() call, which the kernel performs atomically with respect to other processes, so only one process can succeed in creating the file. Thus, assuming the system is working correctly, this code is guaranteed to produce the desired behavior.
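For readers who prefer to see the same idea at the system call level, here is a minimal C++ sketch of my own (not from the book) that uses open() with the O_CREAT and O_EXCL flags; POSIX guarantees that the existence check and the file creation happen as one atomic step:

#include <fcntl.h>
#include <unistd.h>

// Try to acquire the lock by atomically creating the lockfile.
// open() with O_CREAT | O_EXCL fails if the file already exists,
// and the existence check plus creation are a single atomic operation.
bool try_lock(const char* path) {
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) return false;   // lockfile exists: someone else holds the lock
    close(fd);
    return true;
}

int main() {
    const char* lockfile = "mylockfile";
    while (!try_lock(lockfile))
        sleep(5);               // wait 5 seconds to try again
    // >>> do critical work safely here <<<
    unlink(lockfile);           // unlock the lock
    return 0;
}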

Race conditions can be a nuisance to find since their occurrence is nondeterministic and can be rare but devastating. Writing correct code for multiple threads of execution can be confusing to those who haven’t done it before. But with experience it becomes easier to reason about correctness and spot such errors.

References:

[1] Peek, Jerry D., Shelley Powers, Tim O’Reilly and Mike Loukides. “Unix Power Tools, Third Edition.” (2002), https://learning.oreilly.com/library/view/unix-power-tools/0596003307/

[2] https://learning.oreilly.com/library/view/unix-power-tools/0596003307/ch36.html#:-:text=Shell%20Lockfile

Is Low Precision Arithmetic Safe?

The popularity of low precision arithmetic for computing has exploded since the 2017 release of the Nvidia Volta GPU. The half precision tensor cores of Volta offered a massive 16X performance gain over double precision for key operations. The “race to the bottom” for lower precision computations continues: some have even solved significant problems using 1-bit precision arithmetic hardware ([1], [2]). And hardware performance is getting even better: the Nvidia H100’s tensor core-enabled FP16 is a full 58X faster than standard FP64, and 1-bit precision is yet another 16X faster than that, for a total speedup of over 900X for algorithms that can use it [3].

This eye-popping speedup certainly draws attention. However, in scientific computing, low precision arithmetic has typically been seen as unsafe for modeling and simulation codes. Indeed, lower precision can sometimes be used to advantage [4], commonly in a “mixed precision” setting in which only parts of the calculation are done in low precision. However, in general anything less than double precision is considered inadequate to model complex physical phenomena with fidelity (see, e.g., [5]).
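As a small, self-contained illustration of why precision matters (my example, not drawn from the cited references), consider accumulating ten million terms: a single precision running sum drifts noticeably, while a double precision sum stays accurate to many digits.

#include <cstdio>

int main() {
    const int n = 10000000;
    float  sum_f = 0.0f;   // single (32-bit) precision accumulator
    double sum_d = 0.0;    // double (64-bit) precision accumulator
    for (int i = 0; i < n; ++i) {
        sum_f += 0.1f;
        sum_d += 0.1;
    }
    // Exact answer is 1.0e6 (up to the tiny representation error of 0.1).
    // With straightforward scalar accumulation the float result is off by
    // several percent, while the double result is accurate to many digits.
    std::printf("float sum:  %.2f\n", sum_f);
    std::printf("double sum: %.2f\n", sum_d);
    return 0;
}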

In response, developers have created tools to measure the safety of reduced precision arithmetic in application codes [6]. Some tools can even identify which variables or arrays can be safely demoted to lower precision without loss of accuracy in the final result. However, use of these tools in a blind fashion, not backed by some kind of reasoning process, can be hazardous.

An example will illustrate this. The conjugate gradient method for linear system solving and optimization [7] and the closely related Lanczos method for eigenvalue problem solving [8] showed great promise following their invention in the early 1950s. However, they were considered unsafe due to catastrophic roundoff errors under floating point arithmetic—errors that become even more pronounced as floating point precision is reduced. Nonetheless, Chris Paige showed in his pioneering work in the 1970s [9] that the roundoff error, though substantial, did not preclude the usefulness of the methods when properly used. The conjugate gradient method has gone on to become a mainstay in scientific computing.
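For reference, here is a minimal, unpreconditioned conjugate gradient sketch for solving Ax = b with A symmetric positive definite. It follows the standard textbook formulation rather than any particular library; the comment in the direction update marks the property that roundoff gradually erodes.

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // dense, row-major; assumed symmetric positive definite

double dot(const Vec& x, const Vec& y) {
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

Vec matvec(const Mat& A, const Vec& x) {
    Vec y(x.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

// Solve A x = b by the conjugate gradient method (no preconditioning).
Vec conjugate_gradient(const Mat& A, const Vec& b,
                       double tol = 1e-10, int max_iter = 1000) {
    Vec x(b.size(), 0.0);           // initial guess x = 0, so r = b
    Vec r = b, p = r;
    double rs_old = dot(r, r);
    for (int it = 0; it < max_iter && std::sqrt(rs_old) > tol; ++it) {
        Vec Ap = matvec(A, p);
        double alpha = rs_old / dot(p, Ap);   // step length along p
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];             // update solution
            r[i] -= alpha * Ap[i];            // update residual
        }
        double rs_new = dot(r, r);
        double beta = rs_new / rs_old;        // in exact arithmetic this keeps the
        for (std::size_t i = 0; i < p.size(); ++i)  // directions A-conjugate; roundoff
            p[i] = r[i] + beta * p[i];        // gradually destroys that property
        rs_old = rs_new;
    }
    return x;
}

int main() {
    Mat A = {{4, 1}, {1, 3}};
    Vec b = {1, 2};
    Vec x = conjugate_gradient(A, b);
    std::printf("x = (%.6f, %.6f)\n", x[0], x[1]);  // roughly (0.090909, 0.636364)
    return 0;
}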

Notice that no tool could possibly arrive at a finding like Paige’s without a careful mathematical analysis of the methods. A tool can detect inaccuracy in the calculation but cannot certify that the errors do no harm to the final result.

Some might propose instead a purely data-driven approach: just try low precision on some test cases, and if it works, use low precision in production. This approach is fraught with peril, however: the test cases may not capture all situations that could be encountered in production.

For example, one might test an aerodynamics code only on smooth flow regimes, but production runs may encounter complex flows with steep gradients that low precision arithmetic cannot correctly model. Academic papers that test low precision methods and tools must rigorously evaluate them in challenging, real-world scenarios like this.

Sadly, computational science teams frequently don’t have the time to evaluate their codes for potential use of lower precision arithmetic. Tools could certainly help. Also, libraries that encapsulate mixed precision methods can provide benefits to many users. A great success story here is mixed precision dense linear solvers, founded on the solid theoretical work of Nick Higham and colleagues [10], which have found their way into libraries such as [11].

So the final answer is, “it depends.” Each new case must be looked at carefully, and a determination made based on some combination of analysis and testing.

References

[1] Zhang, Y., Garg, A., Cao, Y., Lew, Ł., Ghorbani, B., Zhang, Z. and Firat, O., 2023. Binarized Neural Machine Translation. arXiv preprint arXiv:2302.04907, https://openreview.net/forum?id=XAyPlfmWpu

[2] Lagergren, J., Cashman, M., Vergara, V.G.M., Eller, P.R., Gazolla, J.G.F.M., Chhetri, H.B., Streich, J., Climer, S., Thornton, P., Joubert, W. and Jacobson, D., 2023. Climatic clustering and longitudinal analysis with impacts on food, bioenergy, and pandemics. Phytobiomes Journal, 7(1), pp.65-77, https://apsjournals.apsnet.org/doi/10.1094/PBIOMES-02-22-0007-R.

[3] “NVIDIA H100 Tensor Core GPU Datasheet,” https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet.

[4] G. Alvarez et al., “New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors,” SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, Austin, TX, USA, 2008, pp. 1-10, doi: 10.1109/SC.2008.5218119.

[5] Spafford, K., Meredith, J., Vetter, J., Chen, J., Grout, R., Sankaran, R. (2010). Accelerating S3D: A GPGPU Case Study. In: Lin, HX., et al. Euro-Par 2009 – Parallel Processing Workshops. Euro-Par 2009. Lecture Notes in Computer Science, vol 6043. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14122-5_16.

[6] “Mixed precision analysis tools,” https://scholar.google.com/scholar?q=mixed+precision+analysis+tools

[7] Hestenes, M.R. and Stiefel, E., 1952. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6), pp.409-436, https://nvlpubs.nist.gov/nistpubs/jres/049/jresv49n6p409_a1b.pdf.

[8] Cornelius Lanczos, An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators, Journal of Research of the National Bureau of Standards Vol. 45, No. 4, October 1950, https://nvlpubs.nist.gov/nistpubs/jres/045/4/V45.N04.A01.pdf.

[9] Paige, Christopher C. “The computation of eigenvalues and eigenvectors of very large sparse matrices.” (1971), https://www.cs.mcgill.ca/~chris/pubClassic/PaigeThesis.pdf.

[10] Higham, N.J., Pranesh, S. and Zounon, M., 2019. Squeezing a matrix into half precision, with an application to solving linear systems. SIAM Journal on Scientific Computing, 41(4), pp.A2536-A2551, https://epubs.siam.org/doi/abs/10.1137/18M1229511.

[11] Lu, Hao; Matheson, Michael; Wang, Feiyi; Joubert, Wayne; Ellis, Austin; Oles, Vladyslav. “OpenMxP-Opensource Mixed Precision Computing,” https://www.osti.gov/biblio/1961398.

New Ways To Make Code Run Faster


The news from Meta last week is a vivid reminder of the importance of making code run faster and more power-efficiently. Meta intends to purchase 350,000 Nvidia H100 GPUs this year [1]. Assuming a 350W TDP [2] and an average US energy cost of $0.1621 per kWh [3], one expects a figure of roughly $174 million per year in electricity expenses just to power the GPUs (a rough estimate, of course). For this and many other datacenter-scale and real-time critical applications, every bit of increased performance can be impactful.
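The arithmetic behind that estimate is easy to check; a few lines of C++ using the cited figures (and ignoring utilization, cooling, and other overheads) reproduce the number:

#include <cstdio>

int main() {
    const double gpus           = 350000;   // H100 GPUs purchased [1]
    const double watts_per_gpu  = 350;      // TDP, W [2]
    const double usd_per_kwh    = 0.1621;   // average US electricity price [3]
    const double hours_per_year = 24 * 365;

    const double kwh_per_year = gpus * watts_per_gpu / 1000.0 * hours_per_year;
    const double usd_per_year = kwh_per_year * usd_per_kwh;
    std::printf("Estimated GPU electricity cost: $%.0f million per year\n",
                usd_per_year / 1.0e6);      // prints roughly 174
    return 0;
}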

Many approaches can be taken to making code run faster. The excellent book Hacker’s Delight contains many tricks for speeding up low-level code kernels [4]. Also, superoptimization techniques find the very fastest implementations of short, loop-free code kernels by exhaustive search [5].
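As one example of the flavor of such tricks (a well-known bit manipulation idiom, not necessarily the book’s exact presentation): counting the 1-bits in a word by repeatedly clearing the lowest set bit, so the loop runs once per set bit rather than once per bit position.

#include <cstdint>
#include <cstdio>

// Count the 1-bits in x. Each iteration clears the lowest set bit,
// so the loop runs once per set bit instead of once per bit position.
int popcount(uint32_t x) {
    int count = 0;
    while (x != 0) {
        x &= x - 1;   // clears the lowest set bit of x
        ++count;
    }
    return count;
}

int main() {
    std::printf("%d\n", popcount(0xF0F0u));  // prints 8
    return 0;
}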

Since exhaustive search scales exponentially with code size, it’s no surprise that other methods have been tried. Recent work with reinforcement learning reduces the number of scalar multiplications needed for matrix products, discovering new Strassen-like algorithms [6]. Other work focuses more on hardware design; for example, PrefixRL optimizes chip design for targeted problems using reinforcement learning [7].
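For context, the classic instance of this kind of saving is Strassen’s algorithm, which multiplies 2×2 blocks with 7 multiplications instead of 8; the reinforcement learning work in [6] searches for new schemes in the same spirit. A sketch of the 2×2 case:

#include <array>
#include <cstdio>

using Mat2 = std::array<double, 4>;   // row-major 2x2: {a11, a12, a21, a22}

// Strassen's scheme: 7 multiplications (m1..m7) instead of the usual 8,
// at the cost of extra additions.
Mat2 strassen2x2(const Mat2& A, const Mat2& B) {
    double a11 = A[0], a12 = A[1], a21 = A[2], a22 = A[3];
    double b11 = B[0], b12 = B[1], b21 = B[2], b22 = B[3];

    double m1 = (a11 + a22) * (b11 + b22);
    double m2 = (a21 + a22) * b11;
    double m3 = a11 * (b12 - b22);
    double m4 = a22 * (b21 - b11);
    double m5 = (a11 + a12) * b22;
    double m6 = (a21 - a11) * (b11 + b12);
    double m7 = (a12 - a22) * (b21 + b22);

    return { m1 + m4 - m5 + m7,   // c11
             m3 + m5,             // c12
             m2 + m4,             // c21
             m1 - m2 + m3 + m6 }; // c22
}

int main() {
    Mat2 A = {1, 2, 3, 4}, B = {5, 6, 7, 8};
    Mat2 C = strassen2x2(A, B);
    std::printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  // 19 22 / 43 50
    return 0;
}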

Finding the absolute fastest code for a given task is in general NP-complete [8]. This problem is out of reach for tractable solution by such methods. However, phenomenal progress in SAT/SMT solver efficiency for NP-complete problems has been made over the last 20 years (someone has even quipped, “NP is the new P” [9]). And indeed, these methods have been applied to superoptimization problems [10, 11]. Perhaps methodologies based on efficient SAT and SMT solvers will afford further opportunities for advancement.

The need for speed and power efficiency has become so acute that radically different non-von Neumann processors are being built [12], in some cases orders of magnitude more power efficient but restrictive in the problems they can solve. In the coming years we can expect to see a huge amount of activity and development on this important problem.

[1] “Meta will have 350,000 of Nvidia’s fastest AI GPUs by end of year, buying AMD’s MI300, too,” Tom’s Hardware, Jan. 19, 2024, https://www.tomshardware.com/tech-industry/meta-will-have-350000-of-nvidias-fastest-ai-gpus-by-end-of-year-buying-amds-mi300-too

[2] “NVIDIA H100 Tensor Core GPU Datasheet,” https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet

[3] “Electricity Rates for Every State in The U.S.,” https://www.energybot.com/electricity-rates/

[4] Henry Warren, “Hacker’s Delight,” 2nd Edition, 2012, https://web.archive.org/web/20190915025154/http://www.hackersdelight.org/

[5] “Superoptimization”, https://en.wikipedia.org/wiki/Superoptimization

[6] Fawzi, A., Balog, M., Huang, A. et al. “Discovering faster matrix multiplication algorithms with reinforcement learning,” Nature 610, 47–53 (2022). https://doi.org/10.1038/s41586-022-05172-4

[7] Rajarshi Roy et al., “PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning,” https://arxiv.org/abs/2205.07000

[8] Hennessy, John L., and Thomas Gross. “Postpass code optimization of pipeline constraints.” ACM Transactions on Programming Languages and Systems (TOPLAS) 5, no. 3 (1983): 422-448.

[9] Arie Gurfinkel, “Algorithms for SAT,” https://ece.uwaterloo.ca/~agurfink/ece750t29/assets/pdf/02_SAT.pdf

[10] Abhinav Jangda and Greta Yorsh. 2017. “Unbounded superoptimization,” in Proceedings of the 2017 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward! 2017). Association for Computing Machinery, New York, NY, USA, 78–88. https://doi.org/10.1145/3133850.3133856

[11] Phitchaya Mangpo Phothilimthana, Aditya Thakur, Rastislav Bodik, and Dinakar Dhurjati. 2016. “GreenThumb: superoptimizer construction framework,” in Proceedings of the 25th International Conference on Compiler Construction (CC 2016). Association for Computing Machinery, New York, NY, USA, 261–262. https://doi.org/10.1145/2892208.2892233

[12] “OpenAI Agreed to Buy $51 Million of AI Chips From a Startup Backed by CEO Sam Altman,” Wired, Dec. 3, 2023, https://www.wired.com/story/openai-buy-ai-chips-startup-sam-altman/

 

The IQ Test That AI Can’t Pass

Large language models have recently achieved remarkable test scores on well-known academic and professional exams (see, e.g., [1], p. 6). On such tests, these models are at times said to reach human-level performance. However, there is one test that humans can pass but every AI method known to have been tried has failed abysmally.

The Abstraction and Reasoning Corpus (ARC) benchmark [2] was developed by François Chollet to measure intelligence in performing tasks never or rarely seen before. We all do tasks something like this every day, like making a complicated phone call to correct a mailing address. The test is composed of image completion problems similar to Raven’s Progressive Matrices but more complex. Given images A, B and C, one must identify the image D such that the relationship “A is to B as C is to D” holds. Sometimes several examples of the A:B relationship are given.

The problem is hard because the relationship patterns between A and B that humans could easily identify (for example, image shrinking, rotating, folding, recoloring, etc.) might be many, many different things—more than can easily be trained for. By construction, every problem is qualitatively, unpredictably different, so the common approach of training on the training set doesn’t work. Instead, bona fide reasoning on a new kind of problem is required for each case.

Several competitions with prize money have encouraged progress on the ARC benchmark [3]. In these, each entrant’s algorithm must be tested against an unseen ARC holdout set. The leaderboard for the ARCathon 2023 challenge completed last month shows a top score of 30 percent [4]; this is excellent progress on a very hard problem, but far from a perfect score or anything else resembling passing.

Ilya Sutskever has famously warned that we shouldn’t bet against deep learning, and perhaps a future LLM will do much better on this benchmark. Others feel a new approach is needed, for example, from the burgeoning field of neurosymbolic methods. In any case, these results show that at the present moment in this rapidly progressing field, we don’t seem to be anywhere close to strong forms of AGI, artificial general intelligence.

[1] OpenAI, “GPT-4 Technical Report,” https://cdn.openai.com/papers/gpt-4.pdf

[2] François Chollet, “On the Measure of Intelligence,” https://arxiv.org/abs/1911.01547

[3] “Abstraction and Reasoning Challenge,” https://www.kaggle.com/c/abstraction-and-reasoning-challenge

[4] “Winners – Lab42,” https://lab42.global/winners/

When High Performance Computing Is Not High Performance

Everybody cares about codes running fast on their computers. Hardware improvements over recent decades have made this possible. But how well are we taking advantage of hardware speedups?

Consider these two C++ code examples. Assume here n = 10000000.

// Snippet 1
void sub(int* a, int* b) {
    for (int i=0; i<n; ++i)
        a[i] = i + 1;
    for (int i=0; i<n; ++i)
        b[i] = a[i];
}

// Snippet 2
void sub(int* a, int* b) {
    for (int i=0; i<n; ++i) {
        const int j = i + 1;
        a[i] = j;
        b[i] = j;
    }
}

Which runs faster? Both are simple and give identical results (assuming no aliasing). However, on modern architectures, depending on the compilation setup, one will generally run significantly faster than the other.

In particular, Snippet 2 would be expected to run faster than Snippet 1. In Snippet 1, elements of the array “a”, which is too large to be cached, must be retrieved from memory after being written, but this is not required for Snippet 2. The trend for over two decades has been for the compute speed of newly delivered systems to grow much faster than memory speed, and the disparity is extreme today. The performance of these kernels is bound almost entirely by memory bandwidth. Thus Snippet 2, a fused-loop version of Snippet 1, improves speed by reducing main memory accesses.
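A rough way to see the difference is a small timing harness like the sketch below (the actual numbers will vary widely with compiler flags, hardware, and memory bandwidth):

#include <chrono>
#include <cstdio>
#include <vector>

const int n = 10000000;

void sub1(int* a, int* b) {            // Snippet 1: two separate loops
    for (int i = 0; i < n; ++i)
        a[i] = i + 1;
    for (int i = 0; i < n; ++i)
        b[i] = a[i];
}

void sub2(int* a, int* b) {            // Snippet 2: fused loop
    for (int i = 0; i < n; ++i) {
        const int j = i + 1;
        a[i] = j;
        b[i] = j;
    }
}

template <typename F>
double time_ms(F f, int* a, int* b) {
    auto t0 = std::chrono::steady_clock::now();
    f(a, b);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::vector<int> a(n), b(n);
    sub1(a.data(), b.data());          // warm up the memory before timing
    std::printf("separate loops: %.1f ms\n", time_ms(sub1, a.data(), b.data()));
    std::printf("fused loop:     %.1f ms\n", time_ms(sub2, a.data(), b.data()));
    return 0;
}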

Libraries like the C++ STL are unlikely to help, since this operation is too specialized to expect a library to support it (especially the fused-loop version). Also, the compiler cannot safely fuse the loops automatically without an explicit guarantee that the pointers are unaliased, and even with one it is not obliged to do so.
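For example, annotating the pointers with the widely supported (though non-standard) __restrict__ qualifier promises the compiler that the arrays do not overlap, which makes fusion legal, though still not guaranteed:

// As before, assume n = 10000000 is defined.
// __restrict__ promises the compiler that a and b never overlap; with
// that guarantee an optimizer MAY fuse the two loops, but it is not
// required to, so the hand-fused version remains the safer bet.
void sub(int* __restrict__ a, int* __restrict__ b) {
    for (int i=0; i<n; ++i)
        a[i] = i + 1;
    for (int i=0; i<n; ++i)
        b[i] = a[i];
}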

Thankfully, high level computer languages since the 1950s have raised the programming abstraction level for all of us. Naturally, many of us would like to just implement the required business logic in our codes and let the compiler and the hardware do the rest. But sadly, one can’t always just throw the code on a computer and expect it to run fast. Increasingly, as hardware becomes more complex, giving attention to the underlying architecture is critical to getting high performance.

The Million Dollar Matrix Multiply

The following post is by Wayne Joubert, the newest member of our consulting team. Wayne recently retired from his position as a Senior Computational Scientist at Oak Ridge National Laboratory. — John

Training large language models like GPT-4 costs many millions of dollars in server expenses. These costs are expected to grow into the billions of dollars over the next few years [1]. One of the biggest computational expenses of LLM training is multiplying matrices. These are simple operations of the form C = AB. Matrix multiplies are common not only in AI model training but also in many high performance computing applications from diverse science domains.
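In its simplest form the operation is just three nested loops; the reference sketch below (not how production codes do it, since those call highly tuned GPU libraries) also shows why the cost is about 2mnk floating point operations:

#include <vector>

// Reference (naive) matrix multiply: C = A * B, with A m-by-k, B k-by-n,
// and C m-by-n, all stored row-major. The triple loop performs roughly
// 2*m*n*k floating point operations, which is why this kernel dominates
// LLM training budgets.
void matmul(const std::vector<float>& A, const std::vector<float>& B,
            std::vector<float>& C, int m, int n, int k) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p)
                sum += A[i * k + p] * B[p * n + j];
            C[i * n + j] = sum;
        }
}

int main() {
    const int m = 2, n = 2, k = 2;
    std::vector<float> A = {1, 2, 3, 4}, B = {5, 6, 7, 8}, C(m * n);
    matmul(A, B, C, m, n, k);   // C is now {19, 22, 43, 50}
    return 0;
}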

Eking out more speed from matrix multiplies could reduce AI model training costs by millions of dollars. More routinely, such improvements could reduce training runtime by hours on a single GPU-powered workstation or cut down cloud service provider expenses significantly.

What is less well known is that matrix multiplies run on the graphics processing units (GPUs) typically used for model training exhibit many exotic performance behaviors that can reduce matrix multiply efficiency by a wide margin.

Two recent works [2], [3] examine these phenomena in considerable depth. Factors such as matrix size, alignment of data in memory, power throttling, math library versions, chip-level manufacturing variability, and even the values of the matrix entries can significantly affect performance. At the same time, much of this variability can be modeled by machine learning methods such as decision trees and random forests [2].

Use of these methods can be the first step toward implementing autotuning techniques to minimize costs. Using such methods or carefully applying rules of thumb for performance optimization can make a huge performance difference for matrix multiply-heavy GPU software.

References

[1] What large models cost you—there is no free AI lunch

[2] Wayne Joubert, Eric Palmer and Verónica G. Melesse Vergara, “Matrix Multiply Performance of GPUs on Exascale-class HPE/Cray Systems,” Proceedings of the Cray User Group Meeting (CUG) 2022, https://www.osti.gov/biblio/2224210.

[3] P. Sinha, A. Guliani, R. Jain, B. Tran, M. D. Sinclair and S. Venkataraman, “Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems,” SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, 2022, pp. 01-15, doi: 10.1109/SC41404.2022.00070.