> There’s another distance limit at work here, and that is the speed of light. It takes milliseconds for the signal in your phone to reach the hotel above ground and be handed over to the mobile network.
It takes roughly 100us for light to travel 30km – Can you explain how the speed of light is relevant here?
... and in 1 ms it travels 300 km. Maybe they just want to sound technical, to match the rest of the article. They certainly didn't use ChatGPT, so maybe that's a good thing.
I will take a look as soon as I get a chance. From a first look at the BAM format, the tokenization portion will be easy, which means I can focus on the compression side, which is more interesting.
Another format that might be worth looking at in the bioinformatics world is HDF5. It's a fairly generic file format, often used for storing multiple related large tables. It has some built-in compression (gzip, IIRC) but supports plugins. There may be an opportunity to integrate the self-describing nature of the HDF5 format with the self-describing decompression routines of OpenZL.
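If anyone wants to poke at that, h5py exposes the built-in compression pretty directly; a quick sketch (the dataset name, dtype, and options are just made up for illustration, and this says nothing about what an OpenZL hookup would look like):

```python
# Toy example: storing fixed-width sequence records in HDF5 with the
# built-in gzip (DEFLATE) filter via h5py. Names and levels are arbitrary.
import h5py
import numpy as np

sequences = np.array([b"ACGTACGTAC", b"TTGGCCAATT"], dtype="S10")

with h5py.File("reads.h5", "w") as f:
    f.create_dataset(
        "sequences",
        data=sequences,
        chunks=True,             # compression requires a chunked layout
        compression="gzip",      # HDF5's built-in DEFLATE filter
        compression_opts=6,      # level 1-9
    )

with h5py.File("reads.h5", "r") as f:
    print(f["sequences"][:])     # decompression is transparent on read
```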
I've only tested this when writing my own parser where I could skip the record-end checks, so I don't know if it improves perf on an existing parser. Excited to see what you find!
Yes, when doing anything intensive with lots of sequences it generally makes sense to liberate them from FASTA as early as possible and index them somehow. But as an interchange format FASTA seems quite sticky. I find the pervasiveness of fastq.gz particularly unfortunate with Gzip being as slow as it is.
> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I even confused myself about this while writing :-)
Note that BGZF solves gzip’s speed problem (libdeflate + parallel compression/decompression) without breaking compatibility, and usually the hit to compression ratio is tolerable.
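For anyone who hasn't used it from code: Biopython's Bio.bgzf is one convenient way to produce BGZF, and the output stays readable by ordinary gzip tools. A rough sketch (file name and contents made up):

```python
# Minimal sketch: BGZF is a series of small gzip members, so block
# boundaries allow random access and parallel (de)compression while plain
# gzip readers still work. Assumes Biopython's Bio.bgzf module is available.
import gzip
from Bio import bgzf

records = b">seq1\nACGTACGTACGT\n>seq2\nTTTTGGGGCCCC\n"

out = bgzf.BgzfWriter("example.fasta.gz", "wb")
out.write(records)
out.close()                       # also writes the final empty EOF block

# Compatibility check: the standard gzip module reads it back unchanged.
with gzip.open("example.fasta.gz", "rb") as fh:
    assert fh.read() == records
```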
The BAM format is widely used, but assemblies still tend to be generated and exchanged as FASTA text. BAM is quite a big spec, and I think it's fair to say that none of the simpler binary equivalents to FASTA and FASTQ have caught on yet (XKCD competing standards, etc.).
It might be worth (in some other context) introducing a pre-processing step which handles this at both ends. I'm thinking of something like PNG: the PNG compression is "just" zlib, which wouldn't do a great job on raw RGBA by itself, but there's a per-row filter step first, so e.g. we can store just the difference from the row above; now big areas of block colour or vertical stripes are mostly zeros, and those compress well.
Guessing which PNG filters to use can make a huge difference to compression with only a tiny change to write speed. Or (like Adobe 20+ years ago) you can screw it up and get worse compression and slower speeds. These days brutal "try everything" modes exist which can squeeze out those last few bytes by trying even the unlikeliest combinations.
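Roughly the idea, as a toy sketch in Python (synthetic "image" and made-up numbers; real PNG filtering is per-row with several filter types to choose between):

```python
# Toy illustration of the "Up" filter: store each row as its byte-wise
# difference from the row above, so the general-purpose compressor (zlib
# here) sees mostly constant runs instead of a smooth gradient.
import zlib

WIDTH, HEIGHT = 512, 512

def pixel(x: int, y: int) -> int:
    # Irregular texture along x, plus a constant +1 per row.
    return (x * x // 7 + y) % 256

raw = bytes(pixel(x, y) for y in range(HEIGHT) for x in range(WIDTH))

# "Up" filter: first row unchanged, every other byte is (this - above) mod 256.
# Each row here is exactly the previous row plus one, so the filtered rows
# become runs of 0x01, while the raw rows never repeat literally.
filtered = bytearray(raw[:WIDTH])
for i in range(WIDTH, len(raw)):
    filtered.append((raw[i] - raw[i - WIDTH]) % 256)

print("raw        :", len(zlib.compress(raw, 9)), "bytes compressed")
print("up-filtered:", len(zlib.compress(bytes(filtered), 9)), "bytes compressed")
```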
I can imagine a filter layer which says: this textual data comes in 78-character blocks punctuated with \n, so we strip those out and then compress; in the opposite direction we decompress and then put the newlines back.
For FASTA we can just unconditionally remove the extra newlines, but that assumption won't hold for most inputs, so the filters would help there.
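Something like this, as a rough reversible sketch (assumes a fixed 78-column wrap, which real FASTA doesn't guarantee, hence wanting it as a negotiated filter rather than hard-coded):

```python
# Sketch of a newline-stripping filter around a generic compressor (zlib
# standing in): drop the periodic newlines before compressing, put them
# back after decompressing. Only valid if the wrap width is known and fixed.
import zlib

WIDTH = 78

def encode(wrapped: bytes) -> bytes:
    return zlib.compress(wrapped.replace(b"\n", b""), 9)

def decode(blob: bytes) -> bytes:
    flat = zlib.decompress(blob)
    return b"\n".join(flat[i:i + WIDTH] for i in range(0, len(flat), WIDTH)) + b"\n"

# Build an input whose underlying sequence repeats but whose line breaks
# fall out of phase with the repeat.
seq = b"ACGTACGTTTGGCCAA" * 500
wrapped = b"\n".join(seq[i:i + WIDTH] for i in range(0, len(seq), WIDTH)) + b"\n"

assert decode(encode(wrapped)) == wrapped
print("compressed with newlines   :", len(zlib.compress(wrapped, 9)))
print("compressed without newlines:", len(encode(wrapped)))
```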
For one approach to compressing FASTQ (which is a series of 3 distinct line types), we split each of the distinct lines into its own stream and then compressed each stream independently. That way the coder for the header line, sequence line, and error line could learn the model of that specific line type and get slightly better compression (not unlike a columnar format, although in this case we simply combined "blocks" of streams together with a record separator).
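A stripped-down version of that splitting step might look like this (plain zlib standing in for the real coders, and assuming the '+' separator line is always bare so it can be regenerated; real FASTQ sometimes repeats the header there):

```python
# Sketch: route each FASTQ record's header, sequence, and quality lines into
# their own stream, then compress the streams independently so each coder
# only ever sees one kind of content.
import zlib

def split_streams(fastq_text: bytes):
    headers, seqs, quals = [], [], []
    lines = fastq_text.splitlines()
    for i in range(0, len(lines), 4):
        headers.append(lines[i])
        seqs.append(lines[i + 1])
        # lines[i + 2] is the '+' separator, assumed to carry no information
        quals.append(lines[i + 3])
    return b"\n".join(headers), b"\n".join(seqs), b"\n".join(quals)

record = b"@read%d\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIHHHHH\n"
fastq = b"".join(record % i for i in range(1000))

split = [zlib.compress(s, 9) for s in split_streams(fastq)]
print("three streams :", sum(len(b) for b in split))
print("single stream :", len(zlib.compress(fastq, 9)))
```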
I'm still pretty amazed that periodic newlines hurt compression ratios so much, given that the compressor can use both Huffman coding and a lookback dictionary.
The best rule in sequence data storage is to store as little of it as possible.
Exactly. The line breaks interrupt the runs of otherwise identical bytes in identical sequences. Unless two identical subsequences are exactly in phase with respect to their line breaks, the hashes used for long-range matching differ for otherwise identical subsequences.
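You can see a small version of the phase effect even with plain zlib (toy numbers; presumably long-range matchers show it more starkly): two copies of the same bases compress much better when their line breaks line up.

```python
# Two copies of the same random sequence: wrapped identically, the second
# copy is one long match against the first; wrapped at a different width,
# every would-be long match is interrupted at a newline.
import random
import zlib

def wrap(seq: bytes, width: int) -> bytes:
    return b"\n".join(seq[i:i + width] for i in range(0, len(seq), width)) + b"\n"

random.seed(0)
# 20 kB so cross-copy matches stay within zlib's 32 KiB window.
seq = bytes(random.choice(b"ACGT") for _ in range(20_000))

same_phase = wrap(seq, 60) + wrap(seq, 60)
out_of_phase = wrap(seq, 60) + wrap(seq, 61)

print("same phase  :", len(zlib.compress(same_phase, 9)))
print("out of phase:", len(zlib.compress(out_of_phase, 9)))
```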