Archiving data on paper

I finally got to experiment a bit with archiving data on a regular A4 or US Letter page using a regular printer and a phone camera to read it. It’s been bothering me for about 10 years. What is the maximum data size that we can store on the paper and reliably retrieve?

My setup

It seems it is limited by my camera, not my printer. This image stores 30KB of data. I printed it, took a photo with my iPhone, and then decoded and unpacked it using tar and bzip2. It correctly unpacks into 360KB of C++ files, 6305 lines.

I am using the Twibright Optar software.

It uses Golay codes (used by Voyager) that can fix up to 3 bad bits in each 24-bit codeword (which contains a 12-bit payload and 12 parity bits). If there are 4 bad bits, the errors are detected but cannot be corrected. In my experiment, there were 6 codewords with 3 bad bits and no codewords with more, so there was no data loss. Most of the bad bits seem to be in the area where my phone cast a shadow on the paper, so retaking the picture in broad daylight might help.

Bad bits

Here are the stats from the optar code:

7305 bits bad from 483840, bit error rate 1.5098%.

49.1855% black dirt, 50.8145% white dirt and 0 (0%) irreparable.

Golay stats
0 bad bits      13164
1 bad bit       6693
2 bad bits      297
3 bad bits      6
4 bad bits      0
total codewords 20160
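
As a sanity check, the totals above can be recomputed from the Golay histogram. The short Python sketch below assumes 24-bit codewords with a 12-bit payload, as described earlier; the variable names are mine, not optar's.

```python
# Codeword counts by number of bad bits, copied from the optar output above.
histogram = {0: 13164, 1: 6693, 2: 297, 3: 6, 4: 0}

CODEWORD_BITS = 24   # extended Golay codeword length
PAYLOAD_BITS = 12    # data bits per codeword (the rest is parity)

codewords = sum(histogram.values())
total_bits = codewords * CODEWORD_BITS
bad_bits = sum(n_bad * count for n_bad, count in histogram.items())
bit_error_rate = bad_bits / total_bits
payload_bytes = codewords * PAYLOAD_BITS // 8

print(codewords, total_bits, bad_bits)
print(f"bit error rate {100 * bit_error_rate:.4f}%")
print(f"payload {payload_bytes} bytes")
```

The counts reproduce the 483840 bits and 7305 bad bits reported by optar, and the payload works out to 30,240 bytes, consistent with the 30KB figure.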

The original setup can store 200KB of data. I tried it, it seems to print fine. But it’s really tiny, and my iPhone camera can’t read it well enough, so nothing is recovered. So I used larger pixels. The 30KB is what I was able to store and retrieve and from the stats above, it seems this is the limit.

Other possibilities

Competing products, such as this one only store 3-4KB/page, so my experiment above is 10x that.

One idea is to use a different error correction scheme. The Golay above only uses 50% to store data. If we could use 75% to store data, that gives us 50% more capacity.

Another idea is to use colors. With 8 colors each dot carries 3 bits instead of 1, giving 3x the capacity; with 16 colors we get 4x. That would give us around 100KB/page with colors. Can we do better?
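
The arithmetic behind these estimates can be sketched in a few lines of Python. This assumes that capacity scales with the bits carried per dot and with the fraction of bits that hold data, and that colored dots would be read back as reliably as black and white ones, which I have not tested. The function name is just for illustration.

```python
import math

BASELINE_KB = 30        # what the phone could read back in the experiment above
GOLAY_RATE = 12 / 24    # fraction of printed bits that carry data

def capacity_kb(colors=2, code_rate=GOLAY_RATE):
    """Estimated capacity if only the color count and code rate change."""
    bits_per_dot = math.log2(colors)
    return BASELINE_KB * bits_per_dot * (code_rate / GOLAY_RATE)

for colors in (2, 8, 16):
    print(colors, "colors:", capacity_kb(colors), "KB")
print("rate 3/4 code:", capacity_kb(code_rate=0.75), "KB")
```

Under these assumptions 8 colors give 90 KB, 16 colors give 120 KB, and a rate 3/4 code alone gives 45 KB.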

Comparison with other storage techniques

I think the closest to compare is floppy disks. The 5¼-inch floppy disk that I used as a kid could store 180 KB single side, 360KB both sides. The 3½-inch floppy disk could store 720 KB single side or 1.44MB both sides. We can also print on both sides of the paper, so let’s just compare single side for both:

  • Optar original (requires a good scanner): 200 KB
  • My experiment above (iPhone): 30 KB
  • Estimate with 16 colors: 120 KB
  • 5¼-inch floppy disk: 180 KB
  • 3½-inch floppy disk: 720 KB

It seems that the 5¼-inch floppy disk is a good target.


One application is to store this at the end of a book, so that you don’t need to distribute CDs or floppy disks (as used to be the habit); you just put 10-20 pages in the appendix. This should be possible to decode quite reliably even 100 years from now.

Another application is just archiving any projects that I care about. I still have some floppy disks in my parents’ attic, and I am quite sure they are completely unreadable by now. While I also have some printouts on paper from the same era 30 years ago, and they are perfectly readable.

I think the requirement to use a regular iPhone is good (mine is a few years old, perhaps the newest one can do better?). If we allow scanners, then of course we can do better, but not many people have a high quality scanner at home, and there is no limit to it: GitHub’s Arctic Vault uses microfilm and a high quality scanner. I want something that can be put into a book, or article on paper, something that anyone can decode with any phone, and that doesn’t require any special treatment to store.

What’s the Best Code Editor?

Emacs, vi, TextEdit, nano, Sublime, Notepad, Wordpad, Visual Studio, Eclipse, etc., etc.—everyone’s got a favorite.

I used Visual Studio previously and liked the integrated debugger. Recently I started using VS again and found the code editing windows rather cluttered. Thankfully you can tone this down, if you can locate the right options.

Eclipse for Java has instantaneous checking for syntax errors. I have mixed feelings on this. Perhaps you could type a little more code before getting a glaring error message?

Concerning IDEs (integrated development environments) like this: I’ve met people who think that a full GUI-based IDE is the only way to go. Maybe so. However, there’s another view.

You’d think if anyone would know how to write code quickly, accurately and effectively, it would be world-class competitive programmers. They’re the best, right?

One of the very top people is Gennady Korotkevich. He’s won many international competitions.

What does he use? Far Manager, a text-based user interface tool with a mere two panels and a command prompt. It’s based on 1980s pre-GUI file manager methodologies that were implemented under DOS.

It reminds me of a conversation I had with our admin when I was in grad school. I asked, “Why do you use vi instead of MS Word for editing documents?” Answer: “I like vi because it’s faster—your fingers never need to leave the keyboard.”

Admittedly, not all developer workflows would necessarily find this approach optimal. But still it makes you think. Sometimes the conventional answer is not the best one.

Do you have a favorite code editor? Please let us know in the comments.

Natural one-liners

I learned to use Unix in college—this was before Linux—but it felt a little mysterious. Clearly it was developed by really smart people, but what were the problems that motivated their design choices?

Some of these are widely repeated. For example, commands have terse names because you may have to transmit commands over a glacial 300 baud network connection.

OK, but why are there so many tools for munging text files, for example? That’s great if your job requires munging text files, but what about everything else? What I didn’t realize at the time was that nearly everything involves munging text files, or can be turned into a problem involving munging text files.

Working with data at the command line

There’s an old joke that Unix is user friendly, it’s just picky about who its friends are. I’d rephrase to say Unix makes more sense when you’re doing the kind of work the Unix developers were doing.

I was writing programs when I learned Unix, so some things about Unix made sense at the time. But I didn’t see the motivation for many of the standard command line tools until I started analyzing datasets years later. I thought awk was cool—it was the first scripting language I encountered—but it wasn’t until years later that I realized awk is essentially a command line spreadsheet. It was written for manipulating tabular data stored in text files.

Mythological scripts

Unix one-liners are impressive, but they can seem like a rabbit out of a hat. How would anyone think to do that?

When you develop your own one liners, one piece at a time, they seem much more natural. You get a feel for how the impressive one-liners you see on display were developed incrementally. They almost certainly did not pop into the world fully formed like Athena springing from the head of Zeus.

Example: Voter registration data

Here’s an example. I was looking at Washington state voter registration data. There’s a file 20240201_VRDB_Extract.txt. What’s in there?

The first line of a data file often contains column headers. Looking at just the first few lines of a file is a perennial task, so there’s a tool for that: head. By default it shows the first 10 lines of a file. We just want to see the first line, and there’s an option for that: -n 1.

> head -n 1 20240201_VRDB_Extract.txt


Inserting line breaks

OK, those look like column headers, but they’re hard to read. It would be nice if we could replace all the pipe characters used as field separators with line breaks. There’s a command for that too. The sed tool lets you, among other things, replace one string with another. The tiny sed program

    s/|/\n/g

does just what we want. It may look cryptic, but it’s very straightforward. The “s” stands for substitute. The program s/foo/bar/ substitutes the first instance of foo with bar. If you want to replace all instances, you tack on a “g” on the end for “global.”

Eliminating temporary files

We could save our list of column headings to a file, and then run sed on the output, but that creates an unnecessary temporary file. If you do this very much, you get a lot of temporary files cluttering your working area, say with names like temp1 and temp2. Then after a while you start to forget what you named each intermediary file.

It would be nice if you could connect your processing steps together without having to create intermediary files. And that’s just what pipes do. So instead of saving our list of column headers to a file, we pipe it through to sed.

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g'


Scrolling and searching

This is much better. But it produces more output than you may be able to see in your terminal. You could see the list, one terminal window at a time, by piping the output to less.

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | less

This file only has 33 columns, but it’s not uncommon for a data file to have hundreds of columns. Suppose there were more columns than you wanted to scan through, and you wanted to know whether one of the columns contained a zip code. You could do that by piping the output through grep to look for “zip.”

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | grep zip

There are no column headings containing “zip”, but there are a couple containing “Zip.” Adding -i (for case insensitive) finds the zip code columns.

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | grep -i zip

Our modest little one-liner now has three segments separated by pipes. It might look impressive to someone new to working this way, but it’s really just stringing common commands together in a common way.
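
The same incremental approach carries over to a scripting language. Here is a minimal Python sketch of the split and search steps, assuming a pipe-delimited header line. The function names and the sample header are made up for illustration and are not the actual columns of the file.

```python
def header_columns(first_line, sep="|"):
    """Split a header line into column names (the sed step)."""
    return first_line.rstrip("\n").split(sep)

def find_columns(columns, text):
    """Case-insensitive substring search (the grep -i step)."""
    return [c for c in columns if text.lower() in c.lower()]

# Hypothetical header line, for illustration only.
sample = "ID|Name|HomeZip|MailZip|City\n"
cols = header_columns(sample)
print(find_columns(cols, "zip"))
```

For a real file, the first line would come from something like `open(path).readline()` in place of the sample string.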

A famous one-liner

When you see a more complicated one-liner like

tr -cs A-Za-z '
' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

you can imagine how it grew incrementally. Incidentally, the one-liner above is somewhat famous. You can find the story behind it here.
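
One way to see the increments is to read the pipeline one stage at a time. The Python sketch below mirrors each stage with one line; it is an illustration of what the stages do, not a replacement for the one-liner, and the function name is my own.

```python
import re
from collections import Counter

def top_words(text, k):
    """Return the k most frequent words, mirroring the pipeline stage by stage."""
    words = re.findall(r"[A-Za-z]+", text)      # tr -cs A-Za-z '\n'
    words = [w.lower() for w in words]          # tr A-Z a-z
    counts = Counter(words)                     # sort | uniq -c
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # sort -rn
    return ranked[:k]                           # sed ${1}q

print(top_words("the cat and the hat and the bat", 2))
```

Words with equal counts may come out in a different order than they would from sort -rn.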

Avoiding Multiprocessing Errors in Bash Shell


Suppose you have two Linux processes trying to modify a file at the same time and you don’t want them stepping on each other’s work and making a mess.  A common solution is to use a “lock” mechanism (a.k.a. “mutex”). One process “locks the lock” and by this action has sole ownership of a resource in order to make updates, until it unlocks the lock to allow other processes access.

Writing a custom lock in Linux bash shell is tricky. Here’s an example that DOESN’T work right:

let is_locked=1 # helper variable to denote locked state
mylockvariable=$(cat mylockfile 2>/dev/null)  # read the lock
while [ "$mylockvariable" = "$is_locked" ]  # loop until unlocked
do
    sleep 5 # wait 5 seconds to try again
    mylockvariable=$(cat mylockfile 2>/dev/null)  # read again
done
echo $is_locked > mylockfile  # lock the lock
# >>> do critical work safely here <<<
# >>> ERROR: NOT SAFE <<<
rm mylockfile  # unlock the lock

Here the lock value is stored in a shared resource, the file “mylockfile”. If the file exists and contains the character “1”, the lock is considered locked; otherwise, it is considered unlocked.  The code will loop until the lock is unlocked, then acquire the lock, do the required single-process work, and then release the lock.

However, this code can fail without warning: suppose two processes A and B execute this code concurrently. Initially the lock is in an unlocked state. Process A reads the lockfile. Then suppose immediately after this, Process A is temporarily interrupted, perhaps to give CPU cycles to run Process B. Then, suppose Process B begins, reads the lock, locks the lock and starts doing its critical work. Suppose now Process B is put into wait state and Process A is restarted. Process A, since it previously read the lockfile, wrongly believes the lock is unlocked, thus proceeds to also lock the lock and do the critical work—resulting in a mess.

This is an example of a classic race condition, in which the order of execution of threads or processes can affect the final outcome of execution.

A solution to this conundrum is found in the excellent book, Unix Power Tools [1,2]. This is a hefty tome but very accessibly written, for some people well worth a read-through to pick up a slew of time-saving tips.

The problem with the example code is the need to both read and set the lock in a single, indivisible (atomic) operation. Here’s a trick to do it:

until (umask 222; echo > mylockfile) 2>/dev/null  # check and lock
do  # keep trying if failed
    sleep 5 # wait 5 seconds to try again
done
# >>> do critical work safely here <<<
rm -f mylockfile  # unlock the lock

Here, the existence of the lockfile itself is the indicator that the lock is set. Setting the umask creates the file read-only, so a later attempt to write to it fails if the file already exists, which makes the loop keep trying (this does not hold for root, which can write to read-only files). This works because a file either exists or it doesn’t, and the OS and the filesystem guarantee that creating it is atomic. Thus, assuming the system is working correctly, this code is guaranteed to produce the desired behavior.
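
The same idea, check and set in one indivisible step, can be sketched in Python. Note that this uses the exclusive-create flags of os.open rather than the umask trick above, and the helper names try_lock and unlock are my own.

```python
import os
import tempfile

def try_lock(path):
    """Try to take the lock; return True on success, False if already held.

    O_CREAT | O_EXCL asks the OS to create the file only if it does not
    exist, as one atomic step, so there is no gap between check and set.
    """
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True

def unlock(path):
    """Release the lock by removing the lockfile."""
    os.remove(path)

lockfile = os.path.join(tempfile.mkdtemp(), "mylockfile")
print(try_lock(lockfile))   # first caller gets the lock
print(try_lock(lockfile))   # second attempt fails while it is held
unlock(lockfile)
print(try_lock(lockfile))   # available again after unlock
unlock(lockfile)
```

A caller that wants to wait would loop on try_lock with a sleep, just as the shell version does.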

Race conditions can be a nuisance to find since their occurrence is nondeterministic and can be rare but devastating. Writing correct code for multiple threads of execution can be confusing to those who haven’t done it before. But with experience it becomes easier to reason about correctness and spot such errors.


Do comments in a LaTeX file change the output?

When you add a comment to a LaTeX file, it makes no visible change to the output. The comment is ignored as far as the appearance of the file. But is that comment somehow included in the file anyway?

If you compile a LaTeX file to PDF, then edit it by throwing in a comment, and compile again, your two files will differ. As I wrote about earlier, the time that a file is created is embedded in a PDF. That time stamp is also included in two or three hashes, so the files will differ by more than just the bits in the time stamp.

But even if you compile two files at the same time (within the resolution of the time stamp, which is one second), the PDF files will still differ. Apparently some kind of hash of the source file is included in the PDF.

So suppose you have two files. The content of foo.tex is

    Hello world.

and the content of bar.tex is

    Hello world. % comment

then the output of running pdflatex on both files will look the same.

Suppose you compile the files at the same time so that the time stamps are the same.

    pdflatex foo.tex && pdflatex bar.tex

It’s possible that the two time stamps could be different, one file compiling a little before the tick of a new second and one compiling a little after. But if your computer is fast enough and you don’t get unlucky, the time stamps will be the same.

Then you can compare hex dumps of the two PDF files with

    diff  <(xxd foo.pdf) <(xxd bar.pdf)

This produces the following

    < ...  ./ID [<F12AF1442
    < ...  E03CC6B3AB64A5D9
    < ... 8DEE2FE> <F12AF1
    < ...  442E03CC6B3AB64A
    < ... 5D98DEE2FE>]./Le
    > ...  ./ID [<4FAA0E9F1
    > ...  CC6EFCC5068F481E
    > ...  0419AD6> <4FAA0E
    > ...  9F1CC6EFCC5068F4
    > ...  81E0419AD6>]./Le

You can’t recover the comment from the binary dump, but you can tell that the files differ.

I don’t know what hash is being used. My first guess was MD5, but that’s not it. It’s a 128-bit hash, so that rules out newer hashes like SHA256. I tried searching for it but didn’t find anything. If you know what hash pdflatex uses, please let me know.

LaTeX will also let you add text at the end of the file, after the \end{document} command. This also will change the hash code but will not change the appearance of the output.

If you save a file as PDF twice, you get two different files

If you save a file as a PDF twice, you won’t get exactly the same file both times. To illustrate this, I created a LibreOffice document containing “Hello world.” and saved it twice, first as humpty.pdf then as dumpty.pdf. Then I compared the two files.

    % diff humpty.pdf dumpty.pdf
    Binary files humpty.pdf and dumpty.pdf differ

That’s curious. Both files are the same size, 7260 bytes, but something is different inside.

Next I dumped both files to hexadecimal and compared the output.

    % diff  <(xxd humpty.pdf) <(xxd dumpty.pdf)

This produced two ranges of differences. Here's the first:

    < 00001a60: ... 3232 ...  064322-06'00')>>
    > 00001a60: ... 3339 ...  064339-06'00')>>

The files differ in two consecutive bytes. The ASCII representation at the end of the lines shows what these bytes mean. Apparently these two bytes are part of a time stamp. The first was produced at 6:43:22 this morning and the second was produced 17 seconds later at 6:43:39.
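
If you would rather locate the differing bytes programmatically than read a hex diff, a small Python helper is enough. This is a sketch that assumes both inputs have the same length, as these two files do. The byte strings below are stand-ins for the timestamp fragments shown above, not the full files.

```python
def differing_offsets(a: bytes, b: bytes):
    """Return offsets where two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

# Stand-ins for the timestamp fragments shown in the hex dump above.
first = b"064322-06'00')>>"
second = b"064339-06'00')>>"

offsets = differing_offsets(first, second)
print(offsets)
print(first[offsets[0]:offsets[-1] + 1], second[offsets[0]:offsets[-1] + 1])
```

For real files, read each one with `open(name, "rb").read()` and pass the results to the function.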

There's another block of differences further down the file. I'll leave out the hex representation of the bytes to save space and just include the positions and the ASCII representation.

    < 00001bc0: ...  13 0 R./ID [ <CB
    < 00001bd0: ...  4185E1FB366E0C64
    < 00001be0: ...  D65ADF317ACB6A>.
    < 00001bf0: ...  <CB4185E1FB366E0
    < 00001c00: ...  C64D65ADF317ACB6
    < 00001c10: ...  A> ]./DocChecksu
    < 00001c20: ...  m /59EF0E5B9A2CC
    < 00001c30: ...  4AEC9FD90E7BBE23
    < 00001c40: ...  0CC.>>.startxref
    > 00001bc0: ...  13 0 R./ID [ <7D
    > 00001bd0: ...  1441609E44A5446A
    > 00001be0: ...  8A0F9A4E96FF49>.
    > 00001bf0: ...  <7D1441609E44A54
    > 00001c00: ...  46A8A0F9A4E96FF4
    > 00001c10: ...  9> ]./DocChecksu
    > 00001c20: ...  m /A7A3CD305537E
    > 00001c30: ...  B3DC35BA5EB4678F
    > 00001c40: ...  EDA.>>.startxref

The text DocChecksum jumps out. The value after it is 32 hex digits long, which makes it a 128-bit checksum. That is too long to be CRC-32; the length would fit a 128-bit hash such as MD5, though that is only a guess. There's another 128-bit value before the checksum: CB4...B6A in humpty.pdf and 7D1...F49 in dumpty.pdf. This must be some sort of hash. It appears twice in each file. Maybe this is some sort of versioning information, and the hash is repeated because the initial and final versions of the file are the same.

The fact that the files were saved 17 seconds apart changed two bytes in the timestamps. But changing these two bytes caused the two 32-hex-digit hash codes to change.
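
Counting hex digits is a quick way to size these values. The Python sketch below reassembles the two strings from the ASCII column of the dump for humpty.pdf and reports their length in bits; the helper name is my own.

```python
def hex_bits(hex_string):
    """Number of bits encoded by a string of hexadecimal digits."""
    int(hex_string, 16)          # raises ValueError if not valid hex
    return 4 * len(hex_string)

# Values reassembled from the ASCII column of the dump for humpty.pdf.
doc_id = "CB4185E1FB366E0C64D65ADF317ACB6A"
checksum = "59EF0E5B9A2CC4AEC9FD90E7BBE230CC"

print(hex_bits(doc_id), hex_bits(checksum))
print(hex_bits("FFFFFFFF"))     # size of a CRC-32 value, for comparison
```

Both values come out to 128 bits, four times the size of a CRC-32 value.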

Is Low Precision Arithmetic Safe?

The popularity of low precision arithmetic for computing has exploded since the 2017 release of the Nvidia Volta GPU. The half precision tensor cores of Volta offered a massive 16X performance gain over double precision for key operations. The “race to the bottom” for lower precision computations continues: some have even solved significant problems using 1-bit precision arithmetic hardware ([1], [2]). And hardware performance is getting even better: the Nvidia H100 tensor core-enabled FP16 is a full 58X faster than standard FP64, and 1-bit precision is yet another 16X faster than this, for a total speedup of over 900X for algorithms that can use it [3].

This eye-popping speedup certainly draws attention. However, in scientific computing, low precision arithmetic has typically been seen as unsafe for modeling and simulation codes. Indeed, lower precision can sometimes be used to advantage [4], commonly in a “mixed precision” setting in which only parts of the calculation are done in low precision. However, in general anything less than double precision is considered inadequate to model complex physical phenomena with fidelity (see, e.g., [5]).

In response, developers have created tools to measure the safety of reduced precision arithmetic in application codes [6]. Some tools can even identify which variables or arrays can be safely demoted to lower precision without loss of accuracy in the final result. However, use of these tools in a blind fashion, not backed by some kind of reasoning process, can be hazardous.

An example will illustrate this. The conjugate gradient method for linear system solving and optimization [7] and the closely related Lanczos method for eigenvalue problem solving [8] showed great promise following their invention in the early 1950s. However, they were considered unsafe due to catastrophic roundoff errors under floating point arithmetic—even more pronounced as floating point precision is reduced. Nonetheless, Chris Paige showed in his pioneering work in the 1970s [9] that the roundoff error, though substantial, did not preclude the usefulness of the methods when properly used. The conjugate gradient method has gone on to become a mainstay in scientific computing.

Notice that no tool could possibly arrive at this finding, without a careful mathematical analysis of the methods. A tool would detect inaccuracy in the calculation but could not certify that these errors could cause no harm to the final result.

Some might propose instead a purely data-driven approach: just try low precision on some test cases, and if it works, use low precision in production. This approach is fraught with peril, however: the test cases may not capture all situations that could be encountered in production.

For example, one might test an aerodynamics code only on smooth flow regimes, but production runs may encounter complex flows with steep gradients—that low precision arithmetic cannot correctly model. Academic papers that test low precision methods and tools must rigorously evaluate in challenging real-world scenarios like this.
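
A toy illustration of the same trap, far simpler than an aerodynamics code: the Python sketch below emulates half precision accumulation by rounding to IEEE binary16 after every addition, using the standard library's struct format "e". The function names are my own, and this is only a sketch of the failure mode, not a model of any real application.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp16_sum_of_ones(n):
    """Add 1.0 n times, rounding to half precision after every addition."""
    total = 0.0
    for _ in range(n):
        total = to_fp16(total + 1.0)
    return total

print(fp16_sum_of_ones(1000))   # small test case: exact
print(fp16_sum_of_ones(4096))   # larger case: the sum stalls
```

A test suite that only summed a thousand terms would pass, yet the larger run stalls at 2048, because above that value adjacent half precision numbers are 2 apart and adding 1 rounds back down.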

Sadly, computational science teams frequently don’t have the time to evaluate their codes for potential use of lower precision arithmetic. Tools could certainly help. Also, libraries that encapsulate mixed precision methods can provide benefits to many users. A great success story here is mixed precision dense linear solvers, founded on the solid theoretical work of Nick Higham and colleagues [10], which has found its way into libraries such as [11].

So the final answer is, “it depends.” Each new case must be looked at carefully, and a determination made based on some combination of analysis and testing.


Regex to match SWIFT-BIC codes

A SWIFT-BIC number identifies a bank, not a particular bank account. The BIC part stands for Bank Identifier Code.

I had to look up the structure of SWIFT-BIC codes recently, and here it is:

  • Four letters to identify the bank
  • Two letters to identify the country
  • Two letters or digits to identify the location
  • Optionally, three letters or digits to identify a branch

Further details are given in the ISO 9362 standard.

We can use this as an example to illustrate several regular expression features, and how regular expressions are used in practice.

Regular expressions

If your regular expression flavor supports listing a number of repetitions in braces, you could write the above format as

    [A-Z]{6}[A-Z0-9]{2,5}

This would work, for example, with egrep but not with grep. YMMV.

That’s concise, but a little too permissive. It allows anywhere from 2 to 5 alphanumeric characters on the end. But the standard says 2 or 5 alphanumeric characters after the country code, not between 2 and 5. For example, 3 characters after the country code would not be valid. So we could reduce our false positive rate a little by changing the regex to

    [A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?$

Without the dollar sign on the end, ABCDEF12X would still match because the part of the regex up to the optional ([A-Z0-9]{3})? at the end would match at the beginning of the string. The dollar sign marks the end of the string, so it says the code has to end either after 8 or 11 characters and stop.
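
Here is a small Python sketch of the difference the dollar sign makes. The three patterns are written from the format described above and are assumptions on my part; re.match anchors each pattern at the start of the string.

```python
import re

# Patterns assumed from the format described above.
permissive = re.compile(r"[A-Z]{6}[A-Z0-9]{2,5}")
strict = re.compile(r"[A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?$")
unanchored = re.compile(r"[A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?")

for code in ["ABCDEF12", "ABCDEF12X", "ABCDEF12XYZ"]:
    print(code,
          bool(permissive.match(code)),
          bool(strict.match(code)),
          bool(unanchored.match(code)))
```

Only the pattern ending in the dollar sign rejects the 9-character string; the other two accept all three inputs.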

If your regex flavor does not support counts in braces, you could spell everything out:

    [A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z0-9][A-Z0-9]([A-Z0-9][A-Z0-9][A-Z0-9])?$

Convenience versus accuracy

If you want to match only valid SWIFT-BIC codes, you can get perfect accuracy by checking against an exhaustive list of SWIFT-BIC codes. You could even write a regular expression that matches codes on this list and only codes on the list, but what would the point be? Regular expressions usually trade accuracy for convenience.

I don’t have a list of all valid SWIFT-BIC codes. If I did, it might be out of date by the time I downloaded it. But if I’m trying to pull bank codes out of a text file, a regex that matches the format described above is likely to do a pretty good job. Regular expressions are usually used in a context where there’s some tolerance for error. Maybe you use a regular expression to do a first pass, then weed out the mismatches with a manual review.

Capturing parts

Maybe you want to do more than just find SWIFT codes. Maybe you want to look at their pieces.

For example, the fifth and sixth characters of a SWIFT code are the ISO 3166 two-letter abbreviation for the country the bank is in. (With one exception: XR represents Kosovo, which does not have an ISO 3166 code.)

You could replace

    [A-Z]{6}

at the front of the regular expression with

    [A-Z]{4}([A-Z]{2})

which will not change which strings match, but it will store the fifth and sixth characters as the first captured group. How you access captured groups varies between regular expression implementations.
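
In Python, for instance, captured groups are read from the match object. The sketch below assumes a pattern of the shape described above, and the sample code is made up for illustration.

```python
import re

# Assumed pattern, with the country code wrapped in a capturing group.
bic = re.compile(r"[A-Z]{4}([A-Z]{2})[A-Z0-9]{2}([A-Z0-9]{3})?$")

m = bic.match("ABCDUS33XXX")   # made-up code for illustration
if m:
    print(m.group(1))          # fifth and sixth characters
    print(m.group(2))          # optional branch part, or None
```

For an 8-character code the second group does not take part in the match, so group(2) returns None.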


The first proposed regular expression

    [A-Z]{6}[A-Z0-9]{2,5}

is easy to read, at least in my opinion. It has grown over the course of this post to

    [A-Z]{4}([A-Z]{2})[A-Z0-9]{2}([A-Z0-9]{3})?$

which is not as easy to read. This is typical: you start with a quick-and-dirty regular expression, then refine it until it meets your needs. Regular expressions tend to get uglier as they become more precise.

There are ways to make regular expressions more readable by using something like the /x modifier in Perl, which lets you insert white space and comments inside a regular expression.
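
Python has a similar facility in the re.VERBOSE flag. Here is a sketch of what a commented version of the pattern might look like; the pattern itself is assumed from the format described at the top of the post.

```python
import re

bic = re.compile(r"""
    [A-Z]{4}          # bank code: four letters
    ([A-Z]{2})        # country code: two letters (captured)
    [A-Z0-9]{2}       # location: two letters or digits
    ([A-Z0-9]{3})?    # optional branch: three letters or digits
    $                 # nothing may follow
""", re.VERBOSE)

print(bool(bic.match("ABCDUS33")), bool(bic.match("ABCDUS3")))
```

With the flag set, whitespace and everything after a # outside a character class is ignored, so the compiled pattern is the same as the one-line version.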

That’s nice, but it’s also a little odd. If you’re going to use a complicated regular expression in production code, then you should format it nicely and add comments. But then you have to ask why you’re using a complicated regular expression in production code. I’m not saying this is never appropriate, but it’s not the most common use case.

I could imagine using a simple regular expression when you want quick and dirty, and using an exhaustive list of SWIFT codes in production. A complex, well-commented regular expression seems to fall into a sort of no man’s land in between.

New Ways To Make Code Run Faster


The news from Meta last week is a vivid reminder of the importance of making code run faster and more power-efficiently. Meta intends to purchase 350,000 Nvidia H100 GPUs this year [1]. Assuming 350W TDP [2] and $0.1621 per kW-h [3] average US energy cost, one expects a figure of $174 million per year in electricity expenses just to power the GPUs (note this is only a rough estimate of the actual). For this and many other datacenter-scale and real-time critical applications, every bit of increased performance can be impactful.
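
The estimate is easy to reproduce. The Python sketch below uses the figures quoted above and additionally assumes that every GPU draws its full TDP around the clock for a 365-day year, which is why it is only a rough estimate.

```python
GPUS = 350_000
WATTS_PER_GPU = 350            # TDP assumed in the text
USD_PER_KWH = 0.1621           # average US rate quoted in the text
HOURS_PER_YEAR = 24 * 365      # assumes the GPUs run flat out all year

power_kw = GPUS * WATTS_PER_GPU / 1000
energy_kwh = power_kw * HOURS_PER_YEAR
cost_usd = energy_kwh * USD_PER_KWH

print(f"{power_kw / 1000:.1f} MW, ${cost_usd / 1e6:.0f} million per year")
```

That is a continuous draw of 122.5 MW, which at the quoted rate comes to about $174 million per year.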

Many approaches can be taken to making code run faster. The excellent book Hacker’s Delight contains many tricks for speeding up low-level code kernels [4]. Also, superoptimization techniques find the very fastest performing implementations of short, loop-free code kernels by exhaustive search [5].
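
To give a flavor of superoptimization by exhaustive search, here is a toy Python sketch. The five-instruction set is made up, and candidate programs are only checked against a handful of test inputs, whereas a real superoptimizer would verify equivalence rigorously, so treat it as an illustration of the search and nothing more.

```python
from itertools import product

# A made-up instruction set acting on a single integer register.
OPS = {
    "neg": lambda x: -x,
    "not": lambda x: ~x,
    "inc": lambda x: x + 1,
    "dec": lambda x: x - 1,
    "shl1": lambda x: x << 1,
}

def run(program, x):
    """Apply each instruction of the program to x in order."""
    for name in program:
        x = OPS[name](x)
    return x

def superoptimize(target, max_len=4, tests=range(-8, 9)):
    """Return a shortest program matching target on all test inputs."""
    for length in range(max_len + 1):
        for program in product(OPS, repeat=length):
            if all(run(program, x) == target(x) for x in tests):
                return program
    return None

print(superoptimize(lambda x: 2 * x + 2))
```

Because shorter programs are tried first, the first hit is a shortest one: here a two-instruction sequence, increment then shift left. The number of candidates grows as 5^n with program length n, which is why the approach is limited to short kernels.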

Since exhaustive search scales exponentially with code size, it’s no surprise that other methods have been tried. Recent work with reinforcement learning reduces the number of scalar multiplications needed for matrix products, discovering new Strassen-like algorithms [6]. Other work focuses more on hardware design; for example, PrefixRL optimizes chip design for targeted problems using reinforcement learning [7].

Finding the absolute fastest code for a given task is in general NP-complete [8]. This problem is out of reach for tractable solution by such methods. However, phenomenal progress in SAT/SMT solver efficiency for NP-complete problems has been made over the last 20 years (someone has even quipped, “NP is the new P” [9]). And indeed, these methods have been applied to superoptimization problems [10, 11]. Perhaps methodologies based on efficient SAT and SMT solvers will afford further opportunities for advancement.

The need for speed and power efficiency has become so acute that radically different non-von Neumann processors are being built [12], in some cases orders of magnitude more power efficient but restrictive in the problems they can solve. In the coming years we can expect to see a huge amount of activity and development on this important problem.

Brute force cryptanalysis

A naive view of simple substitution ciphers is that they are secure because there are 26! ways to permute the English alphabet, and so an attacker would have to try 26! ≈ 4 × 10^26 permutations. However, such brute force is not required. In practice, simple substitution ciphers are breakable by hand in a few minutes, and you can find software that automates the process.

However, for modern encryption, apparently brute force is required. If you encrypt a message using AES with a 128-bit key, for example, you can’t do much better than try 2^128 keys. You might be able to do a little better, but as far as is openly known, you can’t do orders of magnitude better.

Even for obsolete encryption methods such as DES it still takes a lot more effort to break encryption than to apply encryption. The basic problem with DES is that it used 56-bit keys, and trying 2^56 keys is feasible [1]. You won’t be able to do it on your laptop, but it can be done using many processors in parallel [2]. Still, you’d need more than a passing curiosity about a DES encrypted message before you’d go to the time and expense of breaking it.

If breaking a simple substitution cipher really did require brute force, it would offer 88-bit security. That is, 26! roughly equals 2^88. So any cipher offering b-bit security for b > 88 is more secure in practice than breaking simple substitution ciphers would be in naive theory. This would include AES, as well as many of its competitors that weren’t chosen for the standard, such as Twofish.
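
The 88-bit figure is just the base 2 logarithm of 26!, which a couple of lines of Python confirm.

```python
import math

permutations = math.factorial(26)
bits = math.log2(permutations)

print(f"26! = {permutations:.3e}")
print(f"log2(26!) = {bits:.2f}")
```

The logarithm comes out a little over 88, so a brute force search of the permutations would be comparable to searching an 88-bit key space.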

For all the block ciphers mentioned here, the number of bits of security they offer is equal to the size of the key in bits. This isn’t always the case. For example, the security level of an RSA key is much less than the size of the key, and the relation between key size and security level is nonlinear.

A 1024-bit RSA modulus is believed to offer on the order of 87 bits of security, which incidentally is comparable to 26! as mentioned above. NIST FIPS 186-5 recommends 2048 bits as the minimum RSA modulus size. This gives about 117 bits of security.

The security of RSA depends on the difficulty of factoring the product of large primes [3], and so you can compute the security level of a key based on the efficiency of the best known factoring algorithm, which is currently the General Number Field Sieve. More on this here.
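
As a rough sketch of such a calculation, the Python below evaluates the usual heuristic running time of the General Number Field Sieve, exp((64/9)^(1/3) (ln n)^(1/3) (ln ln n)^(2/3)), and converts it to bits. It ignores the o(1) term in the exponent, so it is an approximation rather than an exact security level, and the function name is my own.

```python
import math

def gnfs_security_bits(modulus_bits):
    """Heuristic GNFS work factor for an n-bit modulus, in bits.

    Uses exp((64/9)^(1/3) * (ln n)^(1/3) * (ln ln n)^(2/3)) and ignores
    the o(1) term, so treat the result as a rough estimate.
    """
    ln_n = modulus_bits * math.log(2)
    work = (64 / 9) ** (1 / 3) * ln_n ** (1 / 3) * math.log(ln_n) ** (2 / 3)
    return work / math.log(2)

for size in (1024, 2048):
    print(size, round(gnfs_security_bits(size), 1))
```

This heuristic gives values close to the 87 and 117 bits quoted above for 1024-bit and 2048-bit moduli.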

