Without having read into this more deeply, it sounds like someone could take an original video that has this code embedded as small fluctuations in luminance over time, then edit it or produce a new video, simply applying the same luminance changes to the edited areas/generated video, no? It seems that for a system like this, every pixel would need to be digitally signed by the producer for it to be non-repudiable.
Exactly, that is my question too. If you can detect the lighting variations to read and verify the code, then you can also extract them, remove them, and reapply them to the edited version or the AI version... varying the level of global illumination in a video is about the easiest thing to manipulate.
Although there's a whole other problem with this, which is that it's not going to survive consumer compression codecs. Because the changes are too small to be easily perceptible, codecs will simply strip them out. The whole point of video compression is to remove perceptually insignificant differences.
As I understand it, the brilliant idea is that the small variations in the brightness of the pixels look just like standard noise. Distinguishing the actual noise from the algorithm's output is not possible, but it is still possible to verify that the 'noise' has the correct pattern.
Correct pattern for the correct time span matching random fluctuations in the electrical grid.
I think that will be handled by the AC to DC conversion in most systems.
Nope. Mains hum is picked up by microphones, not just via light intensity:
https://en.wikipedia.org/wiki/Electrical_network_frequency_a...
The code embedded into the luminosity is sampled from a distribution resembling the noise already present in the video.
Plus, the code gives information about the frame it's embedded into, so you still have more work to do.
Doesn't this just fall apart if a video is reencoded? Something fairly common on all video platforms.
Take a computer screen with a full wash of R, G, or B. Sync the RGB display with your 2FA token, but run it at 15FPS instead of one code per minute.
Point the monitor at the wall, or desk, or whatever. Notice the radiosity and diffuse light scattering on the wall (and on the desk, and on the reflection on the pen cap, and on their pupils).
Now you can take a video that was purported to be taken at 1:23pm at $LOCATION and validate/reconstruct the expected "excess" RGB data and then compare to the observed excess RGB data.
What they say they've done, additionally, is embed not just a "trace" of expected RGB values over time but also a data stream (e.g. a 1 FPS PNG) which kind of self-authenticates the previous second of video.
Obviously it's not literally RGB but "noise" in the white channels, and not a PNG but whatever other image compression they've figured out works well for the purpose.
In the R, G, B case you can imagine that it's resistant to (or durable through) most edits (e.g. cuts, reordering), and it's interesting that they're talking about detecting whether someone has photoshopped a vase full of flowers into the video (because they're also encoding a reference video/image in the "noise stream").
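To make the analogy concrete, here's a toy sketch of how a verifier might check a 2FA-synced tint: derive a small per-frame RGB bias from an HMAC over the frame index and correlate it against the observed average frame color. Everything here (the HMAC construction, the scale of the offsets, the correlation threshold) is my own invention for illustration, not anything from the paper.

```python
import hmac
import hashlib
import numpy as np

def expected_tint(secret: bytes, frame_index: int) -> np.ndarray:
    # Hypothetical: a tiny signed RGB offset derived from an HMAC of the frame index.
    digest = hmac.new(secret, frame_index.to_bytes(8, "big"), hashlib.sha256).digest()
    return (np.frombuffer(digest[:3], dtype=np.uint8).astype(float) - 128.0) / 128.0

def matches(frames: np.ndarray, secret: bytes, threshold: float = 0.5) -> bool:
    # frames: (T, H, W, 3) array. Average each frame down to one RGB value,
    # remove the scene's base color, and correlate with the expected tints.
    observed = frames.reshape(len(frames), -1, 3).mean(axis=1)
    observed = observed - observed.mean(axis=0)
    expected = np.stack([expected_tint(secret, i) for i in range(len(frames))])
    corr = (observed * expected).sum() / (
        np.linalg.norm(observed) * np.linalg.norm(expected) + 1e-9)
    return corr > threshold
```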
No, unsafe-yt
The code could be cryptographically derived from the content of the video. For simplicity, imagine there are subtitles baked into the video and the code is cryptographically derived from those.
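A minimal sketch of that content-binding, with the key and the use of subtitles both taken straight from the hypothetical above:

```python
import hmac
import hashlib

def content_bound_code(key: bytes, subtitles: str) -> bytes:
    # The embedded light code is an HMAC over the subtitle text, so lifting it
    # onto a video whose content (here, subtitles) differs fails verification.
    return hmac.new(key, subtitles.encode("utf-8"), hashlib.sha256).digest()
```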
Not if you encode a cryptographic signature in the watermark
What would that change?
The general idea is for the signature to be random each time, but verifiable. There are a bajillion approaches to this, but a simple starting point is to generate a random nonce, encrypt it with your private key, then publish it along with the public key. Only you know the private key, so only you could have produced the resulting random string that decodes into the matching nonce with the public key. Also, critically, every signature is different. (That's what the nonce is for.) If two videos appear to have the same signature, even if that signature is valid, one of them must be a replay and is therefore almost certainly fake.
(Practical systems often include a generational index or a timestamp, which further helps to detect replay attacks.)
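For the curious, here's a minimal sketch of that scheme, using an Ed25519 signature in place of "encrypt it with your private key" (signing is the standard way to get the same effect), with replay detection as a set of already-seen nonces:

```python
import os
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

key = Ed25519PrivateKey.generate()
public_key = key.public_key()
seen_nonces: set[bytes] = set()

def make_tag() -> tuple[bytes, bytes]:
    nonce = os.urandom(16)           # fresh randomness, so every signature differs
    return nonce, key.sign(nonce)

def check_tag(nonce: bytes, signature: bytes) -> bool:
    if nonce in seen_nonces:         # same nonce twice => replay, almost certainly fake
        return False
    try:
        public_key.verify(signature, nonce)
    except InvalidSignature:
        return False
    seen_nonces.add(nonce)
    return True

nonce, sig = make_tag()
assert check_tag(nonce, sig)         # first appearance verifies
assert not check_tag(nonce, sig)     # a second copy is flagged as a replay
```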
I think for the approach discussed in the paper, bandwidth is the key limiting factor, especially as video compression mangles the result, and ordinary news reporters edit the footage for pacing reasons. You want short clips to still be verifiable, so you can ask questions like "where is the rest of this footage" or "why is this played out of order" rather than just going, "there isn't enough signature left, I must assume this is entirely fake."
But the point is that you'd be extracting the nonce from someone else's existing video of the same event.
If a celebrity says something and person A films a true video, and person B films a video and then manipulates it, you'd be able to see that B's light code is different. But if B simply takes A's lighting data and applies it to their own video, now you can't tell which is real.
I am not defending the proposed method, but your criticism is not why it fails:
Let's assume the pixels have an 8-bit luminance depth, and let's say the 7 most significant bits are kept while the signature is coded into the last bit of each pixel in a frame. A hash of the full 7-bit image frame could be cryptographically signed. You could copy the 8th bit plane to a fake video, but the signature would not check out in a verifying media player, since the fake video's leading 7 bit planes won't hash to the same value that was signed.
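A rough sketch of that bit-plane scheme (the hash and signature primitives are my own picks for illustration):

```python
import hashlib
import numpy as np
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sign_frame(frame: np.ndarray, key: Ed25519PrivateKey) -> bytes:
    # Hash only the 7 most significant bit planes; the signature itself would
    # then be coded into the least significant bits.
    digest = hashlib.sha256((frame & 0xFE).tobytes()).digest()
    return key.sign(digest)

def verify_frame(frame: np.ndarray, signature: bytes, public_key) -> bool:
    digest = hashlib.sha256((frame & 0xFE).tobytes()).digest()
    try:
        public_key.verify(signature, digest)
        return True
    except InvalidSignature:
        return False

key = Ed25519PrivateKey.generate()
frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
sig = sign_frame(frame, key)
assert verify_frame(frame, sig, key.public_key())

fake = frame.copy()
fake[0, 0] ^= 0x80                   # alter the content, keep the copied LSB plane
assert not verify_frame(fake, sig, key.public_key())
```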
What does this change compared to the status quo? Nothing: you can already hash and sign a full 8-bit video, and Serious-Oath that it depicts Real imagery. Your signature would also not be transplantable to someone else's video, so others can't put fake video in your mouth.
The only difference: if the signature is generated by the image sensor, and end users are unable to extract the private key, then it decreases the number of people/entities able to credibly fake a video, but it gives manufacturers great power to sign fake videos while the masses are unable to (unless they play a fake video on a high-quality screen and film it with an image sensor containing a manufacturer private key).
The bandwidth of the encoding is too low for playing cryptographic games. This doesn't preclude faking a video by introducing the code into your faked video; it's just that this is much, much more difficult than stringing pieces together in an incorrect fashion.
This is more akin to spread-spectrum approaches: you can perfectly well know the signal is there, and yet finding it without knowing the key is difficult. That's why old GPS receivers took a long time to lock on: all the satellites transmit on top of each other, just with different keys, and the signal is way below the noise floor. You apply the key for each satellite and see if you can decode something. These days it's much faster because it's done in parallel.
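A toy despreading demo of that point (all the numbers are made up): a ±1 chip sequence derived from a seed plays the role of the key, and the embedded signal sits far below the noise floor, yet correlating with the right key recovers it.

```python
import numpy as np

def chips(seed: int, n: int) -> np.ndarray:
    # Pseudorandom +/-1 spreading sequence; only holders of the seed can rebuild it.
    return np.sign(np.random.default_rng(seed).standard_normal(n))

n = 100_000
key = chips(seed=42, n=n)
signal = 0.05 * key                                  # amplitude far below the noise
noise = np.random.default_rng(0).standard_normal(n)  # unit-variance noise
observed = signal + noise

print(observed @ chips(42, n) / n)   # ~0.05: right key, signal detected
print(observed @ chips(7, n) / n)    # ~0.00: wrong key, nothing but noise
```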
So what if your adversary relays your encrypted message along another channel?
Academics presenting the opening move in a game of white hat / black hat, thinking the game is over after one turn.