There are special compression algorithms for genetic sequence data, but I was curious how well simply zipping a text file would work.
I downloaded a 14 MB text file containing DNA sequence data from a fruit fly and compressed it as a zip file and as a 7z file. The result was about 3.5 MB, which is basically no compression beyond the obvious.
The file contains millions of A’s, C’s, G’s, and T’s and nothing more [0]. (I prepared the file by removing comments and line breaks.)
AAAATCAATATGTTGCCATT…
There are only four possible characters, so each carries two bits of information [1], but is encoded in an 8-bit ASCII character. The most obvious encoding would pack four letters into a byte. That would compress a 14 MB text file down to a 3.5 MB binary file.
Here are the exact numbers.
|----------+----------+-------| | file | size | ratio| |----------+----------+-------| | original | 14401122 | 1.000 | | zip | 3875361 | 3.716 | | 7z | 3419294 | 4.212 | |----------+----------+-------|
So the 7z format does a little better than simply packing groups of four letters into a byte, but the zip format does not.
There is repetition in genome sequences, but apparently generic compression software is unable to exploit this repetition. You can do much better with compression algorithms designed for genetic data.
Update
The plot thickens. Imran Haque kindly let me know that I overlooked the N’s in the data. These are placeholders for unresolved bases. You can’t simply encode each base using two bits because you need five states.
The number of N’s is small—at least in this example, though I imagine they would be more common in lower quality data—and so the calculation in footnote [1] is still approximately correct [2]. There are about two bits of information in each base, but this is on average. Shannon would say you can expect to compress your text file by at least 75%. But you can’t simply represent each base with two bits as I suggested because you need to make room for the possibility of a null base.
Note that the total information of a file is not the number of symbols times the information per symbol, unless all the symbols are independent. In the case of genetic data, the symbols are not independent, and so more compression is possible.
***
[0] Or so I thought. See the Update section.
[1] The letter frequencies are not exactly equal, but close: 26.75% A, 23.67% C, 23.94% G, and 25.58% T. The Shannon entropy is 1.998, so essentially two bits.
[2] N’s account for 0.04% of the data. Accounting for the N’s increases the Shannon entropy to 2.002.
Zstd is the current industry standard, being the fatest both in compression and decompression at equal compression levels.
I ould like to see it at the highest compression level (20)
If any readers need to compress genome sequences, or are curious about how different methods compare, there’s a nice benchmark here.
http://kirr.dyndns.org/sequence-compression-benchmark/
I tried zstd on the data, and it produced a file about 20% larger than zip.
If I understand the data from the sequence compression benchmark correctly, the compression algorithm with the highest compression is cmix, which is a compression algorithm for generic (not genetic) text!
Maybe the reason an algorithm for genetic data usually needs to work on large datasets, so being slow (as cmix is!) is not an option.