Uploading ASP as an image and having it execute server side is one thing.
But in this case, it's subtly different.
This issue relies more on a quirk of how PDF and PostScript relate (PDF is built on a subset of postscript).
Imagine you had an image format which was just C which, when compiled and run, produced the width, height, and then a stream of RGB values to form an image. And you formalised this such that it had to have a specific structure, so that anyone who didn't want to write a C compiler could just pull the key bits out of this file, which looks like ordinary C, and produce the same result.
Now imagine that your website supports uploading such image files, and you need to render them to produce a thumbnail, but instead of using a minimal implementation of the standard which doesn't need to compile the code, you go ahead and just run gcc on it and run the output.
That's kind of more or less what happened here.
It's worth noting here that it's not really common knowledge that PDF is basically just a subset of postscript. So it's actually a bit less surprising that these guys fell for this, as it's as if C had become some weird language nobody talks about, and GCC became known as "that tool to wrangle that image format" rather than a general purpose C compiler.
The attackers in this case relied on some ghostscript exploits, that's true, but even if you never ran the resulting C-image-format binaries, you could still get pwned through GCC exploits.
> it's not really common knowledge that PDF is basically just a subset of postscript.
Because that's not actually true? Check out the table in the PDF specification, Appendix A, p985, listing all the PDF operators and their totally different PostScript equivalents, when there are any: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
The PDF imaging model is mostly borrowed from PostScript, though PDF's imaging model also supports partial transparency. The actual files themselves are totally different.
In this case, no PDF files were involved at all, but a PostScript file renamed to .pdf, which was used to exploit an old insecure GhostScript's PostScript execution engine (PostScript is a programming language, unlike PDF) or maybe parser:
> According to S0I1337, it was done by exploiting a vulnerability on 4chan's outdated GhostScript version from 2012 by uploading a malformed PostScript file renamed to PDF to gain arbitrary code execution as 4chan didn't check if files with PDF extensions were actually PDF files -- https://wiki.soyjak.st/Great_Cuckset, see also the image in A_D_E_P_T's comment https://news.ycombinator.com/item?id=43699395
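For illustration, the missing check could be as small as looking at the first bytes of the upload instead of trusting the extension. This is a minimal sketch, not what 4chan runs: `sniff_upload` is a made-up name, and it assumes a strict reading where the `%PDF-` header sits at offset 0 and anything starting with `%!` is treated as PostScript.

```python
def sniff_upload(data: bytes) -> str:
    """Classify an upload by its leading bytes, ignoring the file name.

    Hypothetical helper: assumes the header must be at offset 0.
    """
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"%!"):
        # PostScript files conventionally begin with "%!" (e.g. "%!PS-Adobe-3.0").
        return "postscript"
    return "unknown"


def accept_as_pdf(filename: str, data: bytes) -> bool:
    """Accept only when both the extension and the content say PDF."""
    return filename.lower().endswith(".pdf") and sniff_upload(data) == "pdf"
```

A header check like this only rejects the lazy case of a renamed file; it does nothing for an outdated interpreter that is then handed a file which passes the check.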
Key word: "basically"
Read section 2.4 of the PDF you linked for a bit of additional information on this "basically".
GhostScript is a postscript interpreter which can handle PDF files by applying the relatively simple transformations described in that section of the PDF. Whether they embedded the ghostscript exploit within the PDF, or didn't, it's not particularly important for making my point.
That seems like saying "Python is basically a subset of C; just run the simple transformations Cython implements". PDF can be transformed into something a PostScript interpreter can understand in the same way Python can be transformed into something GCC can understand. That is not what "subset" means.
... did you read the bit of the PDF I referenced?
Yes. The section itself says PDF differs significantly from PostScript. The required changes detailed there to transform a PDF to PostScript are substantial: add PostScript implementations of the PDF operators; extract and translate the page content, changing the operator names, decompressing and recompressing text, graphics, and image data, and deleting PDF-only content; translate and insert font data; reorder the content into page order. What you end up with is very different - PDF is not basically just a subset of PostScript.
The substantial differences are restrictions on postscript that reduce it to a declarative language rather than a full-fledged programming language.
A PDF is a collection of isolated, restricted postscript programs (content streams) and the data required for rendering stuffed into one file. The overarching format is a subset of COS. But for all intents and purposes you can imagine this as a tarball containing postscript and other data.
The transformations required to go from PDF to postscript amount to:
1. Include some boilerplate
2. Pull out the content streams (postscript bits) ignoring the pdf-specific extensions
3. Search and replace the names of two procedures
4. Pull out the data required for rendering, optionally decompressing it if your postscript output doesn't support the particular compression in use
5. Concatenate all the data in the right order (on the basis of some metadata in the format)
6. It's now just normal postscript
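To make the renaming step concrete, here is a toy sketch. It assumes a content stream that contains only a handful of path and graphics-state operators, and it tokenises on whitespace, which a real content stream does not allow (strings, inline images, and compressed data all break it). The operator pairs are the ones listed in the operator summary of the PDF specification; `rename_operators` is a name made up for this example.

```python
# A few PDF operators and their PostScript equivalents (not the full table).
PDF_TO_PS = {
    "m": "moveto",
    "l": "lineto",
    "c": "curveto",
    "h": "closepath",
    "S": "stroke",
    "f": "fill",
    "q": "gsave",
    "Q": "grestore",
    "w": "setlinewidth",
}


def rename_operators(content_stream: str) -> str:
    """Swap known PDF operator names for PostScript ones.

    Both languages are postfix, so operands stay where they are.
    Tokens that are not in the table are passed through unchanged.
    """
    tokens = content_stream.split()
    return " ".join(PDF_TO_PS.get(tok, tok) for tok in tokens)


print(rename_operators("2 w 10 10 m 100 10 l 100 100 l h S"))
```

This covers only the renaming; the boilerplate, font data, decompression, and page ordering from the other steps are not shown.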
Fun fact, to top it off: The COS format which is the structure behind a PDF, itself looks a lot like postscript, that's because apparently it's originally based on postscript [0] (although it has deviated).
[0]: https://archive.is/xBd9y (search for postscript)
You basically just described the XPM format.
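For anyone who hasn't seen one: an XPM file is a valid C array of strings, and you can read it without a C compiler by pulling out the quoted strings. A rough sketch, assuming a well-formed XPM3 file whose colours all use the `c #RRGGBB` form (real files can also use colour names, `None`, and other keys); `parse_xpm` and the sample image are made up for this example.

```python
import re

XPM_SOURCE = '''/* XPM */
static char *example_xpm[] = {
"4 2 2 1",
". c #000000",
"# c #FFFFFF",
".##.",
"#..#"
};
'''


def parse_xpm(source: str):
    """Read an XPM3 image by extracting its string literals, no compiler needed."""
    # Drop C comments so nothing inside them is mistaken for data.
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.S)
    strings = re.findall(r'"([^"]*)"', source)

    # First string: width, height, number of colours, characters per pixel.
    width, height, ncolors, cpp = (int(v) for v in strings[0].split()[:4])

    # Next ncolors strings: pixel key followed by "c #RRGGBB" (assumed form).
    palette = {}
    for entry in strings[1:1 + ncolors]:
        key, spec = entry[:cpp], entry[cpp:].split()
        colour = spec[spec.index("c") + 1]
        palette[key] = tuple(int(colour[i:i + 2], 16) for i in (1, 3, 5))

    # Remaining strings: one per row of pixels.
    rows = strings[1 + ncolors:1 + ncolors + height]
    pixels = [
        [palette[row[x * cpp:(x + 1) * cpp]] for x in range(width)]
        for row in rows
    ]
    return width, height, pixels


width, height, pixels = parse_xpm(XPM_SOURCE)
print(width, height, pixels[0])
```

Which is the safe way to do it: the file looks like ordinary C, but nothing in it ever gets compiled or executed.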
Oh yeah... I completely forgot about this thing. But you're right!
There's also XBM.
I love these kinds of formats.
Your writing reminds me of a Tom Scott video.