> Large language models generate text one word (token) at a time. Each word is assigned a probability score, based on how likely it is to be generated next. So for a sentence like “My favourite tropical fruits are mango and…”, the word “bananas” would have a higher probability score than the word “airplanes”.

> SynthID adjusts these probability scores to generate a watermark. It's not noticeable to the human eye, and doesn’t affect the quality of the output.
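To make the quoted description concrete, here is a minimal toy sketch of a generic "green list" style watermark. This is not SynthID's actual algorithm (the quote doesn't describe it): the idea is just that a keyed hash of the preceding context picks a subset of tokens whose probabilities get nudged up slightly before sampling. Everything here (SECRET_KEY, green_tokens, watermark_sample, the bias value, the toy vocabulary) is invented for illustration.

```python
# Toy green-list watermark sketch -- NOT SynthID's real scheme.
import hashlib
import random

VOCAB = ["mango", "bananas", "papaya", "airplanes", "pineapple"]
SECRET_KEY = "hypothetical-provider-key"  # assumption: a secret only the provider holds

def green_tokens(context: str) -> set[str]:
    """Deterministically mark roughly half the vocabulary as 'green',
    based on a keyed hash of the preceding context."""
    green = set()
    for tok in VOCAB:
        digest = hashlib.sha256(f"{SECRET_KEY}|{context}|{tok}".encode()).digest()
        if digest[0] % 2 == 0:
            green.add(tok)
    return green

def watermark_sample(context: str, probs: dict[str, float], bias: float = 1.5) -> str:
    """Multiply green-token probabilities by a small bias, renormalise, then sample."""
    green = green_tokens(context)
    adjusted = {t: p * (bias if t in green else 1.0) for t, p in probs.items()}
    total = sum(adjusted.values())
    r, acc = random.random() * total, 0.0
    for tok, weight in adjusted.items():
        acc += weight
        if r <= acc:
            return tok
    return tok  # fallback for floating-point rounding

# Made-up next-token distribution for "My favourite tropical fruits are mango and ..."
probs = {"bananas": 0.55, "papaya": 0.20, "pineapple": 0.20, "mango": 0.04, "airplanes": 0.01}
print(watermark_sample("My favourite tropical fruits are mango and", probs))
```

Detection would then count how often a suspect text lands on green tokens, which is also why very short or fully constrained answers give a detector almost nothing to work with.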

I think they need to be clearer about the constraints involved here. If I ask “What is the capital of France? Just the answer, no extra information.” then there’s no room to vary the probabilities without harming the quality of the output. So clearly there is a lower bound below which this becomes ineffective. And presumably the longer the text, the more resilient the watermark is to alterations. So what are the constraints?

I also think that this is self-interest dressed up as altruism. There’s always going to be generative AI that doesn’t include watermarks, so a watermarking scheme cannot tell you if something is genuine. It is, however, useful for determining that something came from a specific provider, which could be valuable to Google in all sorts of ways.

That lower bound might be enforced in some trivial way, e.g. by requiring the model to answer with at least a full sentence. The constraints may not be fully published, and the obscurity might make the scheme more effective, if only temporarily.

Printer tracking dots[1] are one prior solution like this: annoying, largely unknown, with known workarounds, and still surprisingly effective.

[1]: https://en.m.wikipedia.org/wiki/Printer_tracking_dots

I think those are what busted that ironically named young lady who leaked NSA information.

Yes, the Wikipedia article mentions that and includes links to more sources:

> Both journalists and security experts have suggested that The Intercept's handling of the leaks by whistleblower Reality Winner, which included publishing secret NSA documents unredacted and including the printer tracking dots, was used to identify Winner as the leaker, leading to her arrest in 2017 and conviction.


For answers like that, it probably wouldn't matter whether it was AI-generated or not. It becomes more relevant with long-form generated content.

Security and surveillance products don’t have to be perfect to be useful enough to some.

Choosing a slightly less probable output is changing the quality of the output. If it weren't, LLMs wouldn't work by processing a large amount of data to get these probabilities as accurate as possible.
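As a rough, made-up illustration of that quality cost, you can put a number on how far a bias-adjusted distribution drifts from the model's original one, e.g. with KL divergence. The probabilities, bias factor, and "green" set below are invented, not taken from any real system.

```python
# Made-up numbers: quantify how much biasing a token distribution shifts it.
import math

original = {"bananas": 0.55, "papaya": 0.20, "pineapple": 0.20, "mango": 0.04, "airplanes": 0.01}
bias = 1.5
green = {"papaya", "mango"}  # pretend these tokens landed on the green list

# Apply the bias and renormalise.
weighted = {t: p * (bias if t in green else 1.0) for t, p in original.items()}
z = sum(weighted.values())
biased = {t: w / z for t, w in weighted.items()}

# KL(original || biased): small but nonzero, i.e. the output distribution really did change.
kl = sum(p * math.log(p / biased[t]) for t, p in original.items())
print(f"KL(original || biased) = {kl:.4f} nats")
```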

> It is, however, useful for determining that something came from a specific provider, which could be valuable to Google in all sorts of ways.

Oh crap, knowing Google, that probably means they will rank articles generated with their AI higher in the search results.