Hacker News

Damn surely you stop using ASCII formats before your dataset gets to 2 TB??

Ha, it gets worse. Search engines and blacklist processors often use gigantic URL lists stored as plain ASCII, which are then fed into a perfect hash generator that accesses those URLs in no particular order. I.e. they need to build a second ordering index to access the URL list. The perfect-hashing people are mathematicians, so they don't care: their definition of an MPHF (minimal perfect hash function) is just an arbitrary ordering of unique indices, and they don't bother to store that ordering as well. So we have ASCII and no index.
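To make the "second ordering index" concrete, here is a rough Python sketch. Everything in it is made up for illustration: the three example URLs, the helper names, and especially `toy_mphf`, which is just a shuffled dict standing in for a real MPHF library. It assumes a newline-delimited list and builds a slot-to-byte-offset table so a hash value can be mapped back to its line.

```python
import io
import random


def build_line_offsets(buf):
    """Return the byte offset at which each line of a binary stream starts."""
    offsets = []
    buf.seek(0)
    pos = 0
    for line in buf:
        offsets.append(pos)
        pos += len(line)
    return offsets


def read_line_at(buf, offset):
    """Seek to a byte offset and return that line without its newline."""
    buf.seek(offset)
    return buf.readline().rstrip(b"\n")


def toy_mphf(keys, seed=0):
    """Stand-in for an MPHF (NOT a real one): each key gets a unique slot
    in [0, n), in an arbitrary order unrelated to the file order."""
    slots = list(range(len(keys)))
    random.Random(seed).shuffle(slots)
    return dict(zip(keys, slots))


# Hypothetical URL list, stored the way the comment describes: plain ASCII.
urls = [b"http://a.example/", b"http://b.example/x", b"http://c.example/y/z"]
data = io.BytesIO(b"\n".join(urls) + b"\n")

offsets = build_line_offsets(data)
mphf = toy_mphf(urls)

# The "second index": slot_to_offset[h] is the byte offset of the URL
# whose hash value is h.
slot_to_offset = [0] * len(urls)
for url, off in zip(urls, offsets):
    slot_to_offset[mphf[url]] = off


def lookup(candidate):
    """Membership test: hash, jump to the stored line, compare.

    A real MPHF returns some slot even for keys it was not built from,
    which the fallback below imitates; the comparison against the stored
    URL is what rejects those keys.
    """
    slot = mphf.get(candidate, hash(candidate) % len(slot_to_offset))
    return read_line_at(data, slot_to_offset[slot]) == candidate


print(lookup(b"http://b.example/x"))    # a URL from the list
print(lookup(b"http://nope.example/"))  # a URL not in the list
```

Without `slot_to_offset`, the only way to check a hash hit against the text file is a linear scan, which is the complaint above.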

bede 2 days ago

BAM format is widely used but assemblies still tend to be generated and exchanged in FASTA text. BAM is quite a big spec and I think it's fair to say that none of the simpler binary equivalents to FASTA and FASTQ have caught on yet (XKCD competing standards etc.)

e.g. https://github.com/ArcInstitute/binseq

hhh 2 days ago

no, I power thru indefinitely with no recourse

amelius a day ago

People rely on compression for that ;)