> There’s another distance limit at work here, and that is the speed of light. It takes milliseconds for the signal in your phone to reach the hotel above ground and be handed over to the mobile network.
It takes roughly 100us for light to travel 30km – Can you explain how the speed of light is relevant here?
... and in 1 ms it travels 300 km. Maybe they just want to sound technical, to match the rest of the article. They certainly didn't use ChatGPT, so maybe that's a good thing.
I will take a look as soon as I get a chance. From a first look at the BAM format, the tokenization portion will be easy, which means I can focus on the compression side, which is more interesting.
Another format that might be worth looking at in the bioinformatics world is HDF5. It's a fairly generic file format, often used for storing multiple related large tables. It has some built-in compression (gzip, IIRC) but supports plugins. There may be an opportunity to integrate the self-describing nature of the HDF5 format with the self-describing decompression routines of OpenZL.
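If anyone wants to poke at that, h5py exposes the built-in compression pretty directly; a quick sketch (the dataset name, dtype, and options are just made up for illustration, and this says nothing about what an OpenZL hookup would look like):

```python
# Toy example: storing fixed-width sequence records in HDF5 with the
# built-in gzip (DEFLATE) filter via h5py. Names and levels are arbitrary.
import h5py
import numpy as np

sequences = np.array([b"ACGTACGTAC", b"TTGGCCAATT"], dtype="S10")

with h5py.File("reads.h5", "w") as f:
    f.create_dataset(
        "sequences",
        data=sequences,
        chunks=True,             # compression requires a chunked layout
        compression="gzip",      # HDF5's built-in DEFLATE filter
        compression_opts=6,      # level 1-9
    )

with h5py.File("reads.h5", "r") as f:
    print(f["sequences"][:])     # decompression is transparent on read
```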
I've only tested this when writing my own parser where I could skip the record-end checks, so I don't know if it improves perf on an existing parser. Excited to see what you find!
Yes, when doing anything intensive with lots of sequences it generally makes sense to liberate them from FASTA as early as possible and index them somehow. But as an interchange format FASTA seems quite sticky. I find the pervasiveness of fastq.gz particularly unfortunate with Gzip being as slow as it is.
> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I even confused myself about this while writing :-)
Note that BGZF solves gzip’s speed problem (libdeflate + parallel compression/decompression) without breaking compatibility, and usually the hit to compression ratio is tolerable.
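For anyone who hasn't used it from code: Biopython's Bio.bgzf is one convenient way to produce BGZF, and the output stays readable by ordinary gzip tools. A rough sketch (file name and contents made up):

```python
# Minimal sketch: BGZF is a series of small gzip members, so block
# boundaries allow random access and parallel (de)compression while plain
# gzip readers still work. Assumes Biopython's Bio.bgzf module is available.
import gzip
from Bio import bgzf

records = b">seq1\nACGTACGTACGT\n>seq2\nTTTTGGGGCCCC\n"

out = bgzf.BgzfWriter("example.fasta.gz", "wb")
out.write(records)
out.close()                       # also writes the final empty EOF block

# Compatibility check: the standard gzip module reads it back unchanged.
with gzip.open("example.fasta.gz", "rb") as fh:
    assert fh.read() == records
```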
The BAM format is widely used, but assemblies still tend to be generated and exchanged as FASTA text. BAM is quite a big spec, and I think it's fair to say that none of the simpler binary equivalents to FASTA and FASTQ have caught on yet (XKCD competing standards, etc.).
It might be worth (in some other context) introducing a pre-processing step which handles this at both ends. I'm thinking of something like PNG: the PNG compression is "just" zlib, which wouldn't do a great job on raw RGBA by itself, but there's a per-row filter step first, so e.g. we can store just the difference from the row above; now big areas of block colour or vertical stripes are mostly zeros, and those compress well.
Guessing which PNG filters to use can make a huge difference to compression with only a tiny change to write speed. Or (like Adobe 20+ years ago) you can screw it up and get worse compression and slower speeds. These days brutal "try everything" modes exist which can squeeze out those last few bytes by trying even the unlikeliest combinations.
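Roughly the idea, as a toy sketch in Python (synthetic "image" and made-up numbers; real PNG filtering is per-row with several filter types to choose between):

```python
# Toy illustration of the "Up" filter: store each row as its byte-wise
# difference from the row above, so the general-purpose compressor (zlib
# here) sees mostly constant runs instead of a smooth gradient.
import zlib

WIDTH, HEIGHT = 512, 512

def pixel(x: int, y: int) -> int:
    # Irregular texture along x, plus a constant +1 per row.
    return (x * x // 7 + y) % 256

raw = bytes(pixel(x, y) for y in range(HEIGHT) for x in range(WIDTH))

# "Up" filter: first row unchanged, every other byte is (this - above) mod 256.
# Each row here is exactly the previous row plus one, so the filtered rows
# become runs of 0x01, while the raw rows never repeat literally.
filtered = bytearray(raw[:WIDTH])
for i in range(WIDTH, len(raw)):
    filtered.append((raw[i] - raw[i - WIDTH]) % 256)

print("raw        :", len(zlib.compress(raw, 9)), "bytes compressed")
print("up-filtered:", len(zlib.compress(bytes(filtered), 9)), "bytes compressed")
```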
I can imagine a filter layer which says: this textual data comes in 78-character blocks punctuated with \n, so we strip those out and then compress; in the opposite direction we decompress and then put the newlines back.
For FASTA we can just unconditionally remove the extra newlines, but that assumption won't hold for most inputs, so the filters would help there.
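Something like this, as a rough reversible sketch (assumes a fixed 78-column wrap, which real FASTA doesn't guarantee, hence wanting it as a negotiated filter rather than hard-coded):

```python
# Sketch of a newline-stripping filter around a generic compressor (zlib
# standing in): drop the periodic newlines before compressing, put them
# back after decompressing. Only valid if the wrap width is known and fixed.
import zlib

WIDTH = 78

def encode(wrapped: bytes) -> bytes:
    return zlib.compress(wrapped.replace(b"\n", b""), 9)

def decode(blob: bytes) -> bytes:
    flat = zlib.decompress(blob)
    return b"\n".join(flat[i:i + WIDTH] for i in range(0, len(flat), WIDTH)) + b"\n"

# Build an input whose underlying sequence repeats but whose line breaks
# fall out of phase with the repeat.
seq = b"ACGTACGTTTGGCCAA" * 500
wrapped = b"\n".join(seq[i:i + WIDTH] for i in range(0, len(seq), WIDTH)) + b"\n"

assert decode(encode(wrapped)) == wrapped
print("compressed with newlines   :", len(zlib.compress(wrapped, 9)))
print("compressed without newlines:", len(encode(wrapped)))
```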
For one approach to compressing FASTQ (which is a series of 3 distinct line types), we split each of the distinct lines into its own stream and then compressed each stream independently. That way the coder for the header line, sequence line, and error line could learn the model of that specific line type and get slightly better compression (not unlike a columnar format, although in this case we simply combined "blocks" of streams together with a record separator).
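A stripped-down version of that splitting step might look like this (plain zlib standing in for the real coders, and assuming the '+' separator line is always bare so it can be regenerated; real FASTQ sometimes repeats the header there):

```python
# Sketch: route each FASTQ record's header, sequence, and quality lines into
# their own stream, then compress the streams independently so each coder
# only ever sees one kind of content.
import zlib

def split_streams(fastq_text: bytes):
    headers, seqs, quals = [], [], []
    lines = fastq_text.splitlines()
    for i in range(0, len(lines), 4):
        headers.append(lines[i])
        seqs.append(lines[i + 1])
        # lines[i + 2] is the '+' separator, assumed to carry no information
        quals.append(lines[i + 3])
    return b"\n".join(headers), b"\n".join(seqs), b"\n".join(quals)

record = b"@read%d\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIHHHHH\n"
fastq = b"".join(record % i for i in range(1000))

split = [zlib.compress(s, 9) for s in split_streams(fastq)]
print("three streams :", sum(len(b) for b in split))
print("single stream :", len(zlib.compress(fastq, 9)))
```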
I'm still pretty amazed that periodic newlines hurt compression ratios so much, given that the compressor can use both Huffman coding and a lookback dictionary.
The best rule in sequence data storage is to store as little of it as possible.
Exactly. The line breaks interrupt the runs of otherwise identical bytes in identical sequences. Unless two identical subsequences are exactly in phase with respect to their line breaks, the hashes used for long-range matching differ for otherwise identical subsequences.
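You can see a small version of the phase effect even with plain zlib (toy numbers; presumably long-range matchers show it more starkly): two copies of the same bases compress much better when their line breaks line up.

```python
# Two copies of the same random sequence: wrapped identically, the second
# copy is one long match against the first; wrapped at a different width,
# every would-be long match is interrupted at a newline.
import random
import zlib

def wrap(seq: bytes, width: int) -> bytes:
    return b"\n".join(seq[i:i + width] for i in range(0, len(seq), width)) + b"\n"

random.seed(0)
# 20 kB so cross-copy matches stay within zlib's 32 KiB window.
seq = bytes(random.choice(b"ACGT") for _ in range(20_000))

same_phase = wrap(seq, 60) + wrap(seq, 60)
out_of_phase = wrap(seq, 60) + wrap(seq, 61)

print("same phase  :", len(zlib.compress(same_phase, 9)))
print("out of phase:", len(zlib.compress(out_of_phase, 9)))
```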