FASTA is a candidate for the stupidest file format ever invented and a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.

Spend a few years handling data in arcane, one-off, and proprietary file formats conceived by "brilliant" programmers with strong CS backgrounds and you might reconsider the conclusion you've come to here.

This is a presentation problem, or possibly a lack of tooling problem.

A binary format with a tool that renders it to text works the same as a text format; if the rendering is lossless, you could even consume the text format rather than the binary.

A "text" format is built to be understandable, but that's not a requirement; you could write a text format that isn't descriptive, and you'd have just as much trouble understanding what 'A' means as you would understanding what 'C0' means for a binary format.

Undocumented formats are a pain, whether they're in text or binary.

> or possibly a lack of tooling problem.

It's a lack of tooling problem. Because if you're a bioinformatics researcher, you want to devote your time, money and energy towards bioinformatics. You don't want to spend weeks getting tooling written to handle an arcane file format, nor pay for that tooling, nor hire a "brilliant" programmer. That tooling needs to be written, packaged and maintained for perhaps dozens of programming languages.

Instead, you want to use the format that can be read and written by a rank novice with a single programming course under their belt, because that's what makes the field approachable by dewy-eyed undergrads eager to get their feet wet. Giving those folks an easy on-ramp is how you grow the field.

And then you want to compress that format with bog-standard compression algorithms, and you might get side-tracked investigating how to improve that process without exploding your bioinformatics-focused codebase. Which is an interesting show of curiosity, not a reason to insult a class of scientists.

There's also a distribution problem. When people break history, by introducing new file formats, or updating old file formats, that impedes archival and replication. And once you've got a handful of file formats, now you've got the classic n+1 problem where no single format is optimal in all ways so people are always inventing new formats to see what sticks. And now an archivist needs to maintain tooling with an ever-increasing overhead. Here we see a clash of wisdom versus intellect, and if you're trying to foster a healthy field of research, wisdom wins the long game.

Other file formats that rival FASTA in stupidity include FASTQ, PDB, BED, SAM, CRAM, and VCF. Further reading: [1]

> "intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques"

1. https://madhadron.com/science/farewell_to_bioinformatics.htm...

SAM is not a bad file format. What's bad about SAM?

I don't dislike the format, and it is much, much better than what it replaced, but SAM, and its binary sister-format BAM, does have some flaws:

- The original index format could not handle large chromosomes, so now there are two index formats: .bai and .csi

- For BAM, the CIGAR (alignment description) operation count is limited to 16 bits, which means that very long alignments cannot be represented. One workaround I've seen (but thankfully not used) is saving the CIGAR as a string in a tag

- SAM cannot unambiguously represent sequences with only a single base (e.g. after trimming), since a '*' in the quality column can be interpreted either as a single Phred score (9) or as a special value meaning "no qualities". BAM can represent such sequences unambiguously, but most tools output SAM
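To make that last point concrete, here's a hypothetical single-base record (tab-separated, columns QNAME through QUAL):

    read1  0  chr1  100  60  1M  *  0  0  A  *

The trailing '*' in the QUAL column could be a single Phred score of 9 ('*' is ASCII 42, and 42 - 33 = 9) or the "no qualities" sentinel; a SAM parser has no way to tell which was meant.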

True. I'd consider these minor flaws. W.r.t. the CIGAR, the spec says you do need to store it as a tag.

> a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.

This is not really a fair statement. Literally all software bears the weight of some early poor choice that then keeps moving forward by sheer momentum. FASTA and FASTQ are exceptionally dumb formats, though.

I’ll do you the immense favor of taking the bait. What’s so bad about it?

It's a fine format for what it is.

A parser to stream FASTA can be written in like 30 lines [0], much easier than, say, CSV, where the edge cases can get hairy.
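For a rough sense of what that looks like, here's a minimal sketch in Python (not the code from the linked gist; it assumes well-formed input where every record starts with a '>' header line):

    def read_fasta(handle):
        """Yield (header, sequence) tuples from a FASTA file handle."""
        header, chunks = None, []
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []  # drop the leading '>'
            elif header is not None:
                chunks.append(line)
        if header is not None:  # don't forget the final record
            yield header, "".join(chunks)

    # Usage:
    # with open("example.fasta") as fh:
    #     for name, seq in read_fasta(fh):
    #         print(name, len(seq))

Streaming record-by-record like this keeps memory flat even on multi-gigabyte files.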

If you need something like fast random reads, use the FAIDX format [1], or, even better, just store it in an LMDB or SQLite embedded DB.
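For intuition on why faidx is fast: the .fai sidecar stores, for each sequence, its length, the byte offset of its first base, and the line layout, so a reader can seek straight to any base without scanning. A minimal sketch (the function name and error handling are mine, not from the linked tutorial; real tools like samtools handle edge cases this skips):

    def faidx_fetch(fasta_path, fai_path, name, start, end):
        """Fetch bases [start, end) (0-based) of sequence `name`."""
        # .fai columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
        with open(fai_path) as fh:
            for line in fh:
                fields = line.rstrip("\n").split("\t")
                if fields[0] == name:
                    length, offset, linebases, linewidth = map(int, fields[1:5])
                    break
            else:
                raise KeyError(name)
        end = min(end, length)  # clamp to the sequence length
        # Map base coordinates to byte offsets, accounting for newlines.
        first = offset + (start // linebases) * linewidth + (start % linebases)
        last = offset + ((end - 1) // linebases) * linewidth + ((end - 1) % linebases) + 1
        with open(fasta_path, "rb") as fh:
            fh.seek(first)
            raw = fh.read(last - first)
        return raw.decode("ascii").replace("\n", "").replace("\r", "")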

People forget FASTA is from 1985, and it sticks around because (1) it's easy to parse and write, and (2) we have mountains of sequences in that format going back four decades.

[0] https://gist.github.com/jszym/9860a2671dabb45424f2673a49e4b5...

[1] https://seqan.readthedocs.io/en/main/Tutorial/InputOutput/In...

I think how prevalent a format is, compared to more widely used alternatives, should be part of that metric.

On those grounds, the lack of pre-tokenization in HTML/CSS/JS ranks, at this point, as a planet-killing level of poor choice.

It might be the stupidest, but stupid in the sense of "the simplest thing that could possibly work."

When FASTA was invented, Sanger sequencing reads would be around a thousand bases in length. Even back then, disk space wasn't so precious that you couldn't spend several kilobytes on the results of your experiment. Plus, being able to view your results with `more` is a useful feature when you're working with data of that size.

And, despite its simplicity, it has worked for forty years.

When FASTA was invented in 1985, sequencing reads would generally be about half that.

The simplicity of FASTA seems like a dream compared to the GenBank flat file format used before then. And around the year 2000, less computationally inclined scientists were storing sequence data in Microsoft Word binary .doc files.

A lot of file formats (including bioinformatics formats!) have come and gone in that time period. I don't think many would design it this way today, but it has a lot of nice features like the ones you point out that led to its longevity.

FASTA was invented in the late 1980s. At that time, Unix tools often limited line length. Even in the early 2000s, some Unix tools (on AIX, as I remember) still had this limit.

Yes! If someone wants, they can do many analyses with just grep, e.g. `grep -c '^>' file.fasta` counts the records in a FASTA file.