> DoJ explicitly avoids JPEG images in the PDFs probably because they appreciate that JPEGs often contain identifiable information, such as EXIF, IPTC, or XMP metadata
Maybe I'm underestimating the full scope of the issue, but isn't this a very lightweight problem to solve? Is converting the images to lower-DPI formats/versions really any easier than just stripping the metadata? Surely the DOJ and similar justice agencies have been aware of this and doing it for decades at this point, right?
This is speculation, but rules like this generally follow some sort of incident, e.g. someone responds to a FOI request and accidentally discloses more information than intended because of metadata. So a blanket rule is instituted not to use a particular format.
Maybe they know more than we do. It may be possible to tamper with files at a deeper level. I wonder if it is also possible to use some sort of tampered compression algorithm that could mark images much like printers do with paper.
Another guess is that the step is part of a multi-step sanitization process, and the last step(s) perform the bitmap operation.
I'm not sure about computer image generation but you can (relatively) easily fingerprint images generated by digital cameras due to sensor defects. I'll bet there is a similar problem with PC image generation where even without the EXIF data there is probably still too much side channel data leakage.
Image metadata is the wild west of structured text. The developer of the foremost tool for dealing with it (exiftool) has made a 'remove metadata' feature but still warns that it cannot remove everything.
How could that be possible? Isn't JPEG a fairly straightforward container for JFIF+metadata?
"Fairly straightforward" is incorrect. I'm not enough of an authority to describe it in more detail, but the trickiest blocker I'm aware of is the proprietary "MakerNote" tags from camera manufacturers, which are (often undocumented) binary blobs. exiftool might not even know what's in there, let alone how to safely remove it without corrupting the file.
> exiftool might not even know what's in there, let alone how to safely remove it without corrupting the file.
But isn't it a contiguous sequence of data whose length is determined by the container format?
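For what it's worth, here is a rough sketch of what "length determined by the container format" looks like in practice. Each JPEG segment before the scan data is a marker (`0xFF` plus one byte) followed by a two-byte big-endian length that counts itself, so a whole APPn or COM segment can be skipped without understanding its payload. The function name `strip_jpeg_metadata` is made up for this example, and it is not how exiftool works. It also deliberately ignores anything that appears after the first SOS marker, and dropping APP2 (ICC profile) or APP14 (Adobe) can change how colors render, which is one way a "simple" strip can still damage a file.

```python
import struct

SOI = b"\xff\xd8"   # start-of-image marker
SOS = 0xDA          # start-of-scan: compressed image data follows


def strip_jpeg_metadata(data: bytes) -> bytes:
    """Drop APP1..APP15 and COM segments that precede the first scan.

    Hypothetical helper for illustration only. APP0 (JFIF header) is kept.
    Everything from the first SOS marker onward is copied verbatim.
    """
    if data[:2] != SOI:
        raise ValueError("not a JPEG: missing SOI marker")

    out = bytearray(SOI)
    pos = 2
    while pos < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]

        if marker == SOS:
            # Entropy-coded data has no length field; copy the rest as-is.
            out += data[pos:]
            return bytes(out)

        # The length field covers itself (2 bytes) plus the payload.
        (length,) = struct.unpack(">H", data[pos + 2 : pos + 4])
        segment = data[pos : pos + 2 + length]

        is_app_metadata = 0xE1 <= marker <= 0xEF  # Exif, XMP, ICC, IPTC, ...
        is_comment = marker == 0xFE
        if not (is_app_metadata or is_comment):
            out += segment
        pos += 2 + length

    raise ValueError("no SOS marker found")
```

So the blob is indeed skippable at the container level. What this does not address is anything carried in the pixel data itself.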
On the extreme end, simply decode the image and reencode it using an encoder that you have vetted to not include any metadata.
But I agree, presumably the image data part of the file is well and exhaustively defined. I would be very interested in counterexamples that have practical consequences.
Note that there will still be concerns about steganography and fingerprinting, which would warrant such a disclaimer from the creator of a tool aimed at a nontechnical audience.
Yes, I figured that steganography, watermarks etc. are the kind of "metadata" that the tool author had in mind.
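A toy illustration of why that kind of "metadata" is out of reach for a metadata stripper: the mark lives in the sample values themselves. The sketch below hides a message in the least significant bit of each byte of a raw pixel buffer. The names `embed_lsb` and `extract_lsb` are invented for this example, the buffer is assumed to be uncompressed samples (a plain LSB mark like this would generally not survive lossy JPEG re-encoding), and real watermarking schemes are far more robust than this.

```python
def embed_lsb(pixels: bytes, message: bytes) -> bytes:
    """Hide `message` in the lowest bit of successive pixel samples."""
    # Flatten the message into bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("message too long for this pixel buffer")

    out = bytearray(pixels)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the message bit.
        out[i] = (out[i] & 0xFE) | bit
    return bytes(out)


def extract_lsb(pixels: bytes, n_bytes: int) -> bytes:
    """Recover `n_bytes` of hidden message from the lowest bits."""
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for sample in pixels[i * 8 : (i + 1) * 8]:
            value = (value << 1) | (sample & 1)
        out.append(value)
    return bytes(out)


# Usage: each sample changes by at most 1, and no metadata field is involved.
marked = embed_lsb(bytes(range(256)), b"id42")
recovered = extract_lsb(marked, 4)
```

Since no sample moves by more than one step, the image looks the same, and the container carries nothing extra to strip.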