"Groupthink" informed by extremely broad training sets is more conventionally called "consensus", and that's what we want the LLM to reflect.
"Groupthink", as the term is used by epistemologically isolated in-groups, actually means the opposite. The problem with the idea is that it looks symmetric, so if you yourself are stuck in groupthink, you fool yourself into thinking it's everyone else doing it instead. And, again, the solution for that is reasonable references grounded in informed consensus. (Whether that should be a curated encyclopedia or a LLM is a different argument.)
> "Groupthink" informed by extremely broad training sets is more conventionally called "consensus", and that's what we want the LLM to reflect.
Definitely not! I absolutely do not want an LLM that gives much or any truth-weight to the vast majority of writing on the vast majority of topics. Maybe, maybe if they’d existed before the Web and been trained only on published writing, but even then you have stuff like tabloids, cranks self-publishing or publishing through crank-friendly niche publishers, advertisements full of lies, very dumb letters to the editor, vanity autobiographies or narrative business books full of made-up stuff presented as true, etc.
No, that’s good for building a model of something like the probability space of human writing, but an LLM that has some kind of truth-grounding wholly based on that would be far from my ideal.
> And, again, the solution for that is reasonable references grounded in informed consensus. (Whether that should be a curated encyclopedia or a LLM is a different argument.)
“Informed” is a load-bearing word in this post, and I don’t really see how the rest holds together if we start to pick at that.
> I absolutely do not want an LLM that gives much or any truth-weight to the vast majority of writing on the vast majority of topics.
I can think of no better definition of "groupthink" than what you just gave. If you've already decided on the need to self-censor your exposure to "the vast majority of writing on the vast majority of topics", you are lost, sorry.
A spectacular amount of extant writing accessible to LLM training datasets is uninformed noise from randos online. Not my fault the internet was invented.
I have to be misunderstanding you, though, because any time we want to build knowledge and skills for specialists their training doesn’t look anything like what you seem to be suggesting.
You're the second responder here who appears to think LLMs are "averaging" machines and that they need to be "protected" from wrong info. That's exactly the opposite of the way they work. You feed them the garbage precisely so they can explain to you why it's garbage. Otherwise we'd have just fed them Wikipedia and stopped, but clearly that doesn't work as well.
I think this line is what did it:
> "Groupthink" informed by extremely broad training sets is more conventionally called "consensus", and that's what we want the LLM to reflect.
It's nothing to do with how LLMs work that I wrote what I did, but with this "ought" suggestion of how we should want them to work.
The issue is that on the open internet, the consensus is usually the one from 2000, or 2010 at best. And since the social sciences have been moving fast recently (I'm mostly thinking about modern history and linguistics here), I wouldn't trust the consensus to be at the edge of scientific knowledge (which is actually also _extremely_ true of Wikipedia).
Gotta be honest, when I go to an encyclopedia the last thing I want is what the mathematically average chronically online person knows and thinks about a topic. Because common misconceptions and the "facts" you see parroted on online forums on all sorts of niche topics look just like consensus but ya know… wrong.
I would rather have an actual audio engineer's take than the opinion of an amalgamation of hi-fi forums talking pseudoscience, and the latter is far more numerous in the training data.
> what the mathematically average chronically online person knows and thinks about a topic
Yes you do, often. Understanding ideas and consensus is part of understanding "topics". To choose a Godwinized existence proof: an LLM that didn't understand public opinion in, say, 1920s Germany is one that can't answer the question of how the war started.
You're making two mistakes here: one is that you're assuming that "facts" exist as a separate idea from "discourse". And the second is that you appear to think LLMs merely "average" the stuff they read instead of absorbing controversies and discourse on their own terms. The first I can't really help you with, but the second you can disabuse yourself of on your own just by pulling up a GPT chat and talking to it.