Hacker News

The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of doing some verbatim recitations, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.

philipportner 3 hours ago [ - ]

This seems related, it may not be a codebase but they are able to extract "near" verbatim books out of Claude Sonnet.

https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

Aurornis 2 hours ago [ - ]

Their technique really stretched the definition of extracting text from the LLM.

They used a lot of different techniques to prompt with actual text from the book, then asked the LLM to continue the sentences. I only skimmed the paper but it looks like there was a lot of iteration and repetitive trials. If the LLM successfully guessed words that followed their seed, they counted that as "extraction". They had to put in a lot of the actual text to get any words back out, though. The LLM was following the style and clues in the text.

You can't literally get an LLM to give you books verbatim. These techniques always involve a lot of prompting and continuation games.

seba_dos1 2 hours ago [ - ]

> The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte.

The lesson here is that the Internet compresses pretty well.

mft_ 2 hours ago [ - ]

(I'm not needlessly nitpicking, as I think it matters for this discussion)

A frontier model (e.g. latest Gemini, Gpt) is likely several-to-many times larger than 500GB. Even Deepseek v3 was around 700GB.

But your overall point still stands, regardless.

uywykjdskn 38 minutes ago [ - ]

You got a source on frontier models being maybe half a terabyte. That's not passing the sniff test.