What is the point of using a generic compression algorithm in a file format? Does this actually get you much over turning on filesystem and transport compression, which can transparently swap the generic algorithm (e.g. my files are already all zstd compressed. HTTP can already negotiate brotli or zstd)? If it's not tuned to the application, it seems like it's better to leave it uncompressed and let the user decide what they want (e.g. people noting tradeoffs with bro vs zstd; let the person who has to live with the tradeoff decide it, not the original file author).

Few people enable filesystem compression, and even when they do it's usually with fast algorithms like lz4 or zstd -1. When authoring a document the tradeoffs are very different: you can afford the cost of the high compression levels of zstd or brotli, because you compress once and decompress many times.
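A minimal sketch of that tradeoff, using Python's stdlib zlib (levels 1 and 9 standing in for "fast filesystem setting" vs "author pays once"; the sample data is made up):

```python
import zlib

# Repetitive text, the kind of content a document body tends to contain.
data = b"The quick brown fox jumps over the lazy dog. " * 1000

fast = zlib.compress(data, 1)  # cheap, the kind of level transparent compression uses
best = zlib.compress(data, 9)  # expensive, affordable when you compress once at authoring time

# The author-time compression is at least as small, at higher one-time CPU cost.
assert len(best) <= len(fast) < len(data)
```

Every reader of the file then benefits from the smaller size without paying the compression cost.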

- inside the file, the compressor can be varied according to the file content. For example, images can use jpeg, but that isn’t useful for compressing text

- when jumping from page to page, you won’t have to decompress the entire file
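The first point can be sketched as follows. This is a hypothetical toy container, not PDF's actual filter mechanism: each named stream records which codec was applied, so text gets deflate while an already-compressed JPEG payload is stored as-is (the fake JPEG bytes here are just an illustration):

```python
import zlib

# Hypothetical per-stream codec choice inside one container file.
streams = {
    "page1/text":  (b"Lorem ipsum dolor sit amet. " * 200, "deflate"),
    "page1/image": (b"\xff\xd8\xff\xe0" + b"\x00" * 100, "store"),  # fake JPEG payload
}

def encode(data: bytes, codec: str) -> bytes:
    # Deflate helps text; re-compressing JPEG data would gain nothing,
    # so "store" keeps it byte-for-byte.
    return zlib.compress(data, 9) if codec == "deflate" else data

packed = {name: (encode(data, codec), codec) for name, (data, codec) in streams.items()}
```

A generic whole-file compressor cannot make this per-content decision; it sees one opaque byte stream.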

> inside the file, the compressor can be varied according to the file content. For example, images can use jpeg, but that isn’t useful for compressing text

Okay, so we make a compressed container format that can perform such shenanigans, for the same amount of back-compat issues as extending PDF in this way.

> when jumping from page to page, you won’t have to decompress the entire file

This is already a thing with any compression format that supports quasi-random access, which is most of them. The answers to https://stackoverflow.com/q/429987/5223757 discuss a wide variety of tools for producing (and seeking into) such files, which can be read normally by tools not familiar with the conventions in use.
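One of the conventions those answers describe can be sketched with stdlib gzip alone: compress each "page" as an independent gzip member and keep an offset index. Concatenated members are still one valid gzip stream for tools that don't know the convention, while an index-aware reader seeks straight to one page (the page contents here are made up):

```python
import gzip

pages = [f"Contents of page {i}. ".encode() * 50 for i in range(5)]

# Build one file out of independent gzip members, remembering where each starts.
blob, index = bytearray(), []
for page in pages:
    member = gzip.compress(page)
    index.append((len(blob), len(member)))
    blob.extend(member)

# A tool unaware of the convention reads the whole thing as ordinary gzip:
assert gzip.decompress(bytes(blob)) == b"".join(pages)

# An index-aware reader decompresses only page 3:
off, length = index[3]
assert gzip.decompress(blob[off:off + length]) == pages[3]
```

The cost is slightly worse compression (no sharing of the dictionary across members), which is exactly the granularity-vs-ratio tradeoff those tools expose.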

> Okay, so we make a compressed container format that can perform such shenanigans, for the same amount of back-compat issues as extending PDF in this way.

Far from the same amount:

- existing tools that split PDFs into pages will remain working

- if defensively programmed, existing PDF readers will be able to render PDFs containing JPEG XL images, except for the images themselves.

Well, if sanity had prevailed, we would likely have stuck with .ps.gz (or your favourite compression format), instead of ending up with PDF.

Though we might still want to restrict the subset of PostScript that we allow. The full language might be a bit too general to take from untrusted third parties.

Don't you end up with PDF if you start with PS and restrict it to a subset? And maybe normalize the structure of the file a little. The structure is nice when you want to take the content and draw a bit more on the page. Or when subsetting/combining files.

I suspect PDF was fairly sane in the initial incarnation, and it's the extra garbage that they've added since then that is a source of pain.

I'm not a big fan of this additional change (nor any of the javascript/etc), but I would be fine with people leaving content streams uncompressed and running the whole file through brotli or something.

> Don't you end up with PDF if you start with PS and restrict it to a subset?

PDF is also a binary format.

I thought PDFs could contain arbitrary PS.

Compression filters are in PostScript, too.