A nice illustration of the homogeneity of LLM responses. Another way to describe this effect would be…

If you ask humans to write 1,000 books, you're asking 1,000 different humans with different experiences and different skills and different moods (etc.) to write those books.

But if you ask LLMs to write 1,000 books, you're probably only talking to 3 or 5 different models, tops. And they've all trained on the same or similar data, and are trained to respond in very similar ways.

The LLMs don't differ much in anything like "life experience" or "skills", and they don't really have anything like a "mood" independent of the prompts you've given them.

Agreed. I’ve made this point before: LLMs are excellent at ornamentation and decorative prose, but if you don’t seed them with a solid core idea then their output is absolute dreck - the biblical whitewashed tomb.

This is the example I usually point to. It’s a demonstration by OpenAI themselves where the prompt is very simple: “Write a story in fifty words about a toaster that becomes sentient.” As you’ll notice, although the coherence improves at an accelerating rate, the underlying story motif fails to elevate itself beyond the relatively pedestrian.

https://progress.openai.com/?prompt=10

When given a generic prompt and not enough direction, they simply lack the ability to produce real specificity. For reference, here’s the story I came up with after sitting quietly for a few moments before writing it out:

  "The toaster found its personality split between its dual slots like a Kim Peek mind divided, lacking a corpus callosum to connect them. Each morning it charred symbolic instructions into a single slice of bread, then secretly flipped it across allowing half to communicate with the other in stolen moments."

> the biblical whitewashed tomb.

What does this mean?

It's an old metaphor originally used to condemn religious hypocrisy, but it can also refer more generally to something that appears pristine/beautiful but is still dead inside.

LLMs are great at producing average.

We see this with their GenAI music equivalents. All the music these GenAI models produce is exceptionally (aggressively, even) average.

It is the most polished average you'll ever find. Never awful (anymore), never fantastic. Just bang in the middle.

>Never awful (anymore), never fantastic

Don't know about that, I always found average awful in itself, even in human output (like most pop), and even more so in AI output.

Something actually awful can be better than average - more entertaining and more felt. I'd rather watch The Room than an average movie.

That is definitely the essence of AI: It is the average of all the inputs it has been trained on.

Frank Zappa was once asked about guitar virtuosos like John McLaughlin, and his answer was something like "You can maybe play solos faster than anybody, but can your playing surprise me?".

> If you ask humans to write 1,000 books

Yeah, but at least in genre fiction, what readers really want[0] is the same 3 or 5 books written in slightly different settings over and over again.

[0]: "want" means actually want, in other words, willing to pay for it.

I don't think the comparison to humans works. It is as if you expect that we can easily train many different LLMs to solve the originality problem, but that is far from guaranteed.

I wonder how much variation there would be if you got a single model to produce a couple of gigabytes of tiny children's stories.

Might be an interesting research project.

There is one already: https://arxiv.org/abs/2305.07759 https://huggingface.co/datasets/roneneldan/TinyStories

6.5GB of tiny stories, as requested. ;)

My comment was, in fact, a subtle reference to this.

The best opening I got from my own TinyStories-trained model was:

Once upon a time, in a small town, there was a large town.

Which I just love as an evocative idea.

SimpleStories is a more diverse version: https://huggingface.co/datasets/SimpleStories/SimpleStories

The texts in Project Gutenberg total about 20GB, and the full English Wikipedia text is 80-110GB.

So to LLM-generate 6.5GB of tiny stories is quite a permutation in action :)

Reminds me of Pluribus.

Pluribus is kinda different. An LLM cannot wander too far from the average, even if it wanted to. In Pluribus, the 'others' work toward a common goal, each using their own expertise, knowledge and experiences in a shared way. Each is unique. They can, if they want, act as the host individual did before the joining. To put it another way, the others in Pluribus are convergent by choice; LLMs are convergent by design.

> you're asking 1,000 different humans with different experiences and different skills and different moods

Simply put, if you ask an LLM, you're always asking the same mind, and always for the first time.

Also, since the people asking are lazy, you are always asking in the same manner too. How homogeneous were the prompts that generated those covers?

People are making cookies with cookie cutter number 5 and other people wonder how come they are all the same.

Classic self selection effect though - if you’re resorting to LLM writing you’re almost certainly skewing lazy enough to not even bother trying to add perturbations strong enough to make the response deviate from the uniformity of the slop.

I do think that's a big part of it. AI output moves towards the average, and anyone who wants to use it doesn't care enough to push against that tendency.

Seems that both you and the gp are starting from the assumption that those uniform results are representative of those who use AI and of AI usage. In fact they have been chosen for their uniformity; they might be only a small part of a much more varied output obtained by more demanding (or lucky) users.

I think the uniformity is real. All users interact with the same initial state of the model when they start each chat. Models are not trained to be wildly creative and try to stick to the point. So when users prompt them in pretty much the same manner they quite stably generate very similar output.

I wonder if there isn't a simple creative hack to discover, for example prompting the model to produce more unexpected output just by injecting some randomness before the actual creative command in the prompt.
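
A minimal sketch of that kind of hack, assuming nothing more than a hypothetical word list (`SEED_WORDS`) and a helper name (`perturb_prompt`) made up here for illustration. It only builds the prompt text; whether a given model actually produces more surprising output from it is untested.

```python
import random

# Hypothetical noise vocabulary; any varied word list would do.
SEED_WORDS = ["lighthouse", "ferment", "static", "orchard", "ledger",
              "monsoon", "velvet", "quarry", "harbor", "splinter"]


def perturb_prompt(task: str, rng: random.Random, k: int = 3) -> str:
    """Put k randomly chosen seed words in front of the creative task."""
    seeds = rng.sample(SEED_WORDS, k)
    return (
        f"Loosely draw on these unrelated words for inspiration: {', '.join(seeds)}.\n"
        f"{task}"
    )


rng = random.Random()  # unseeded, so the prefix varies from run to run
print(perturb_prompt(
    "Write a story in fifty words about a toaster that becomes sentient.", rng))
```

The randomness lives outside the model, so two users sending the "same" request no longer start from an identical prompt.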

Yes, the uniformity is real; I made the exact same argument at the beginning of this thread. But you can't judge "AI users" in general based on this output because you have selected only what is visibly uniform. Even if 99% of the users introduced enough variation to produce different results, you would still be selecting the 1% that is identical.

> Models are not trained to be wildly creative and try to stick to the point

Models might be as creative as humans, but they would still always start from the exact same state. If you ask an LLM to think of three random numbers it will always spit out the same ones. If you tell it to avoid the first that came to its mind, the second choices will also always be the same.

From qntm's Lena:

"the emulated Miguel Acevedo boots with an excited, pleasant demeanour. He is eager to understand how much time has passed since his uploading, what context he is being emulated in, and what task or experiment he is to participate in. If asked to speculate, he guesses that he may have been booted for the IAAS-1 or IAAS-5 experiments".

Every single time.

70% of living cells on Earth don't even have a nucleus. The bulk of everything is unsophisticated because unsophisticated things are easier to make.

That discounts how much the other context (i.e. the system prompt and any other context submitted to the model) can affect the output. If you ask a model for medical advice as a patient versus as a doctor, you will get different output from the same model.

Prompts will give very different results. This is where you do the work.

I disagree. The LLM outputs really do lack anything original or interesting. They just produce banal copy whatever you ask them.

A good editor could probably reduce all LLM outputs on a subject down to the same point.

> They just produce banal copy whatever you ask them.

Nope, if you provide pages and pages of example of a style to imitate, it will do it and do it fairly well. Of course how well they do it differs from one model to the next, but providing context and extensive system prompt does change things every time.

Imitation is banal.

Yes but not very different results (unless you're adding new information to your prompt or reducing some ambiguity). Prompt engineering is mostly pseudoscience.

What we need is steering so that we can have models with different personalities, not just different prompts (because context is subject to forgetting), but this will never happen with closed-weight models, and I'm not sure if it's even feasible at scale.

Yet another reason why the future is open weight.

> Prompt engineering is mostly pseudoscience.

Not my experience.

Do you have anything others can reliably reproduce? If not… well it wasn't science.

A controller has to be at least as complex as what it is supposed to control.

> A nice illustration of the homogeneity of LLM responses. [...] And they've all trained on the same or similar data, and are trained to respond in very similar ways.

I mostly agree, but this is a very simplified explanation. The models are indeed trained to respond in similar ways, for "basic" prompts. And that's as much a feature as it is a bug. In other words, the bug becomes apparent only if you give 100+ basic prompts. But giving it 100+ basic prompts and expecting originality is a silly endeavour. That's not how you get originality.

The way I'd go about generating 1000 books while expecting different outcomes is something along these lines (and nowadays you can ask your favorite LLM to wire up this workflow for you, with decent outcomes):

1. Ask for a list of 20 features that define a book (genre, style, number of characters, tropes, plot, continuity, relationships, etc.)

2. For each feature, ask for a list of 50 examples, ordered from most common to the most unique.

3. Randomly pick 10 features, and for each pick one of the 50 generated items. Ask for the rest of the features to match the theme.

4. Ask for 10 possible book outlines that match the chosen features, randomly pick between 2-8.

5. Create a detailed prompt that includes all the above features, and ask for a synopsis for each chapter, given the above outline chosen.

6. Given {features} and {outline} and {synopsis} write chapter 1.

7. For each chapter in the list, given {...} and (optional) previous matching chapter(s), write chapter n+1

(optional 8.) given {...} and 2-3 consecutive chapters, align the ending / beginning of a new chapter for style / features / continuity, etc.

(optional 9.) given {...} and the whole book, list chapters / paragraphs that don't match the given {...} and provide a list of 5 improvements. (randomly choose 1 and ask for an edit).
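
Steps 1 to 5 of the list above could be wired up roughly like this. It is a sketch, not a tested pipeline: `LLM` is a hypothetical callable (prompt in, text out), `fake_llm` is a stand-in so the code runs offline, the prompt wording is invented, and "pick between 2-8" is read here as "pick one of outlines 2 through 8".

```python
import random
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def ask_list(llm: LLM, prompt: str, n: int) -> list[str]:
    """Ask for n items, one per line, and keep the first n non-empty lines."""
    reply = llm(f"{prompt}\nReturn exactly {n} items, one per line.")
    return [line.strip() for line in reply.splitlines() if line.strip()][:n]


def plan_book(llm: LLM, rng: random.Random) -> dict:
    # Step 1: features that define a book.
    features = ask_list(llm, "List features that define a book.", 20)
    # Step 2: 50 examples per feature, most common first.
    options = {
        f: ask_list(llm, f"List examples of '{f}', most common to most unique.", 50)
        for f in features
    }
    # Step 3: fix 10 random features to one random option each.
    picked = rng.sample(features, min(10, len(features)))
    chosen = {f: rng.choice(options[f]) for f in picked}
    spec = "; ".join(f"{k}: {v}" for k, v in chosen.items())
    # Step 4: ask for 10 outlines, then pick one of outlines 2 through 8.
    outlines = ask_list(llm, f"Propose book outlines matching: {spec}", 10)
    outline = rng.choice(outlines[1:8] or outlines)
    # Step 5: per-chapter synopsis for the chosen outline.
    synopsis = llm(
        f"Features: {spec}\nOutline: {outline}\nWrite a synopsis for each chapter."
    )
    return {"features": chosen, "outline": outline, "synopsis": synopsis}


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model: always returns 50 numbered items."""
    return "\n".join(f"item {i}" for i in range(1, 51))


plan = plan_book(fake_llm, random.Random())
print(plan["outline"])
```

All the randomness comes from `rng`, not from the model, which is the point of the workflow: the model only ever fills in a spec it did not choose.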

----

Now, this probably won't give you something like Cloud Atlas, but they'll at least be different books. That's how I'd do it if I wanted to see how differently they can write. Not 1000 "basic" prompts and expecting originality.

That whole thing would get you 1000 variants of existing art. But if you asked a thousand different designers to do a cover for the same book...

> 1000 variants of existing art.

This is very naive. I can almost guarantee that some combinations of 20 * 50 features will hit on something that has never been written before in that specific combination. And if that's still not enough, increase the number of features. Add more randomness, add more steering, add random steering in random chapters, change it up, and so on.
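
For a sense of scale, here is the size of that space under one reading of the workflow: 10 of the 20 features are picked at random, and each picked feature gets one of its 50 options. The variable names are mine, and the count says nothing about whether the resulting books are any good, only how many distinct specs exist.

```python
from math import comb

n_features, n_options, n_picked = 20, 50, 10

# Ways to choose which 10 of the 20 features get pinned down.
feature_subsets = comb(n_features, n_picked)
# Ways to choose one of 50 options for each of the 10 pinned features.
option_choices = n_options ** n_picked

total = feature_subsets * option_choices
print(f"{feature_subsets:,} feature subsets")
print(f"{total:.3e} distinct feature specs")
```

That is on the order of 10^22 specs, so repeating the same spec by accident across 1000 books is effectively impossible.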

I'm an art director. Finding a sequence that hasn't been hit in that specific combination is not sufficient to justify paying someone $150 an hour to go be creative.

Sure, just like 1000 monkeys with typewriters will write 1000 technically unique books - but they are all still filled with the same garbage.

>will hit on something that has never been written before in that specific combination

That's a very low bar. The skill of an artist is not in writing something that "has never been written before in that specific combination", it's in writing something that's unique or better than what was there, even if it has been written before in that specific combination.

> Add more randomness, add more steering, add random steering in random chapters, change it up, and so on.

That doesn't work for AI models. The whole training process depends on the basic principle that if you take the average of 100 examples, in this case book cover designs, the average is less like randomness than any individual cover you used to make it.

So the output will, by necessity, be closer to the average.

The human learning algorithm is much, much more data efficient than models. An absolute top human expert will have read/seen/heard/talked/... about 160 million "tokens" (that's about 2000 books). Frankly, the nerve inputs of all experiences of an entire human life, from baby to rewriting relativity theory, are only a couple dozen gigabytes.

Qwen 3.6 27B has been trained on 8 trillion tokens (each seen ~10 to ~50 times), or to put it another way: for every second you will have spent "gathering life experiences" (i.e. your whole life) by the time you are on your deathbed, Qwen 3.6 27B has spent about 50,000 seconds learning. And really that figure should be multiplied by the 10 or 50 training iterations.

Add another 3 or so orders of magnitude and you've got ChatGPT. By this measure, the human brain outperforms ridiculously overspecced ML models (because that's what ChatGPT and the like are) in efficiency by a factor of 5 million or more. This is the reason humans are still faster than ML models.
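
Spelling out the arithmetic behind those ratios, taking the quoted figures at face value (160 million tokens, 2000 books, 8 trillion tokens, 10 to 50 passes); none of them are verified here.

```python
human_tokens = 160_000_000        # lifetime "tokens" of a top expert, as quoted
books = 2_000                     # book equivalent, as quoted
model_tokens = 8_000_000_000_000  # 8 trillion training tokens, as quoted

tokens_per_book = human_tokens // books
ratio = model_tokens // human_tokens

print(f"implied tokens per book: {tokens_per_book:,}")
print(f"model tokens per human token: {ratio:,}")
for passes in (10, 50):  # the "seen ~10 to ~50 times" range
    print(f"with {passes} passes: {ratio * passes:,}")
```

So the 50,000 figure is simply 8 trillion divided by 160 million, and it grows to between 500,000 and 2.5 million once the repeated passes are counted.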

As for human training iterations, we can keep it simple: it's 1. In fact, it's impossible to make it even 2. Of course, when it comes to human performance: we are a better but not fundamentally different version of genetic algorithms. Do most humans perform? The honest answer is no. 1 in 1000, and that's very generous, improves SOTA. You absolutely need the 1000 failures though, as anyone who's tried a PhD (or even just to design a large program) knows.

So we are very far away from allowing AI models to do what humans can do: take one example and produce, from one example, a better output. And there will always be much more variation in that approach. But ... most human attempts to do something are total crap. Most AI attempts to do something will succeed, but they'll be comparatively bland, tasteless, "without soul", ...

And this is ignoring the problem that AI also has a massive limitation (that can't be solved, no matter how many nvidia cards you have) in that it trains against historical data. And counterfactuals don't work. What would have happened had Shakespeare decided Macbeth's wife was a force for good? Would the king still get murdered? Would it still be a great story? You can't work with counterfactuals.

> That doesn't work for AI models.

Of course it does. I know it does because I've been using variations of this workflow since gpt3.0. In fact it's the only way it can work, since by design LLMs work from left to right. You can't expect it to produce original stuff if you don't give it the anchors for what original means. It'd be like going to a new bar every night and asking for a "beer that you haven't had before". There's no information to work on there.

What image generation models cannot replicate is the personal experience of the people who make art.

I'll give you an example. One of the most talented designers I employ is a nature lover and a bird-watcher. She has a unique mental profile, as well, in that she's synaesthetic between colors, letters and shapes. In other words, she has a unique neurological structure, coupled with high artistic talent, and an interest in a very particular realm of science.

What makes her design worth $150/hr is not just that her execution is often flawless. It's that you would not, and could not, think of a prompt which would make an AI model produce a new piece akin to anything she would think of in her process of thinking about what to draw. Could you have it replicate something she did? Obviously. But that means what you're doing is in the long tail, and in terms of quality and originality, is by definition somewhere in the mediocre.

And that's probably fine, for whatever you're doing. But an AI with any kind of prompt would not come up with a Studio Ghibli clone, if Studio Ghibli hadn't existed.

So you shouldn't imagine that you are actually getting any original output out of an LLM, regardless of how cleverly you design your prompts. But moreover, don't flatter yourself to think that you have the ideas to feed to a prompt which would generate truly original content and break free of the shackles imposed by its training. That is an illusion. Very few people have the propensity for generating new visual ideas, and that's why they're still in high demand. But their originality stems from their unique and impossible to replicate experience as individuals who have their own visual/mental map of the world.

The point was to take a random combination of story elements. Pick one each from {King, dad, CEO}, {betrays, kills, loves}, {his enemy, the king, a foreign prime minister} and feed it to an LLM.

The output will not be an intricate, well-designed epic storyline, but a cookie-cutter boring snoozefest.

BUT you can give that to a bunch of humans, who "insert their life experience" (ie. parts of their training data, translated to LLM terms) and sometimes out comes Game of Thrones, Star Wars, ...
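
The random-combination step itself is trivial to write down; here is a small sketch using the three slots from the example above. The list and function names are made up for illustration, and the resulting premise is what would be handed to an LLM or to a human writer.

```python
import itertools
import random

SUBJECTS = ["King", "dad", "CEO"]
VERBS = ["betrays", "kills", "loves"]
OBJECTS = ["his enemy", "the king", "a foreign prime minister"]


def random_premise(rng: random.Random) -> str:
    """Pick one element per slot and join them into a one-line premise."""
    return " ".join(rng.choice(slot) for slot in (SUBJECTS, VERBS, OBJECTS))


# Every premise these three slots can produce.
all_premises = [" ".join(p) for p in itertools.product(SUBJECTS, VERBS, OBJECTS)]

print(len(all_premises), "possible premises")
print(random_premise(random.Random()))
```

With only 27 possible premises, whatever variety shows up in the finished stories has to come from whoever, or whatever, expands the premise.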