In case "bases with optional newlines" wasn't obvious to anyone else, a specific example (from Wikipedia) is:
;LCBO - Prolactin precursor - Bovine
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*
where "SS...EM", HL..VT", or "ED..AR" may be common subsequences, but the plaintext file arbitrarily wraps at column 65 so it renders on a DEC VT100 terminal from the 70s nicely.Or, for an even simpler example:
; plaintext
GATTAC
AGATTA
CAGATT
ACCAGA
TTACAG
ATTACA
becomes, on disk, something like ; plaintext\r\nGATTAC\r\nAGATTA\r\nCAGATT\r\nACCAGA\r\nTTACAG\r\nATTACA\r\n
which is hard to compress, while ; plaintext\r\nGATTACAGATTACAGATTACCAGATTACAGATTACA
is just "; plaintext\r\n" + "GATTACA" * 7
and then, if you want, you can reflow the text when it's time to render to the screen.
Feels like it could be an extension to the compression lib (and would retain newlines as such) vs requiring external document tailoring. Also feels like a very specific use case but this optimization might have larger applications outside this narrow field/format.
Huh, so in other words: "If you don't arbitrarily interrupt continuous sequences of data with cosmetic noise, they compress better."
When working with data, I definitely prefer the UI to adapt to the data. I never save anything for the display back.