> Adding in a mode that doesn't just dump an answer but works to take you through the material step-by-step is magical
Except these systems will still confidently lie to you.
The other day I noticed that DuckDuckGo has an Easter egg where it will change its logo based on what you've searched for. If you search for James Bond or Indiana Jones or Darth Vader or Shrek or Jack Sparrow, the logo will change to a version based on that character.
If I ask Copilot if DuckDuckGo changes its logo based on what you've searched for, Copilot tells me that no it doesn't. If I contradict Copilot and say that DuckDuckGo does indeed change its logo, Copilot tells me I'm absolutely right and that if I search for "cat" the DuckDuckGo logo will change to look like a cat. It doesn't.
Copilot clearly doesn't know the answer to this quite straightforward question. Instead of lying to me, it should simply say it doesn't know.
This is endlessly brought up as if the human operating the tool is an idiot.
I agree that if the user is incompetent, cannot learn, and cannot learn to use a tool, then they're going to make a lot of mistakes from using GPTs.
Yes, there are limitations to using GPTs. They are pre-trained, so of course they're not going to know about some easter egg in DDG. They are not an oracle. There is indeed skill to using them.
They are not magic, so if that is the bar we expect them to hit, we will be disappointed.
But neither are they useless, and it seems we constantly talk past one another because one side insists they're magic silicon gods, while the other says they're worthless because they are far short of that bar.
The ability to say "I don't know" is not a high bar. I would say it's a basic requirement of a system that is not magic.
Based on your example, basically any answer would be "I don't know 100%".
You could ask me as a human basically any question, and I'd have answers for most things I have experience with.
But if you held a gun to my head and said "are you sure???" I'd obviously answer "well damn, no I'm not THAT sure".
It'd at least be an honest answer, one that recognizes that we shouldn't be trusting the tech wholesale yet.
>But if you held a gun to my head and said "are you sure???" I'd obviously answer "well damn, no I'm not THAT sure".
okay, who's holding a gun to Sam Altman's head?
Perhaps LLMs are magic?
I see your point
Some of the best exchanges that I participated in or witnessed involved people acknowledging their personal limits, including the limits of conclusions formed a priori.
To further the discussion, hearing the phrase you mentioned would help the listener independently assess the level of confidence or belief behind the exchange.
But then again, honesty isn't on-brand for startups.
It's something that established companies say about themselves to differentiate from competitors, or even from their own past behavior.
I mean, if someone prompted an LLM weighted for honesty, who would pay for the following conversation?
Prompt: can the plan as explained work?
Response: I don't know about that. What I do know is on average, you're FUCKED.
> The ability to say "I don't know" is not a high bar.
For you and me, it's not. But for these LLMs, maybe it's not that easy? They get their inputs, crunch their numbers, and come out with a confidence score. If they come up with an answer they're 99% confident in, by some stochastic stumbling through their weights, what are they supposed to do?
I agree it's a problem that these systems are more likely to give poor, incorrect, or even obviously contradictory answers than say "I don't know". But for me, that's part of the risk of using these systems and that's why you need to be careful how you use them.
But they're not. Often the confidence value is much lower. I should have an option to see how confident it is (maybe set the opacity of each token to its confidence?).
Logits aren't confidence about facts. You can turn on a display like this in the OpenAI Playground and you will see it doesn't do what you want.
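If you want to poke at this yourself outside the Playground, here's a minimal sketch using the logprobs option of the OpenAI Chat Completions API (the model name and the opacity mapping are just placeholders for the parent's idea). Keep in mind these are probabilities of the next token given the context, not a measure of whether the claim is true:

    import math
    from openai import OpenAI  # assumes the openai Python package, v1+

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model that returns logprobs
        messages=[{"role": "user",
                   "content": "Does DuckDuckGo change its logo based on what you searched for?"}],
        logprobs=True,
    )

    # Walk the generated tokens and map each one's probability to an "opacity".
    for tok in resp.choices[0].logprobs.content:
        p = math.exp(tok.logprob)        # log-probability -> probability
        opacity = max(0.2, min(1.0, p))  # the parent comment's opacity idea, clamped
        print(f"{tok.token!r}: p={p:.2f}, opacity={opacity:.2f}")

Run it on a question the model gets wrong and you'll usually still see most tokens near p=1.0, which is exactly the point: the number reflects how unsurprising the token is, not how accurate the sentence is.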
>If they come up with an answer they're 99% confident in, by some stochastic stumbling through their weights, what are they supposed to do?
As much as Fi from The Legend of Zelda: Skyward Sword was mocked for this, this is exactly the behavior a machine should exhibit (not that Fi is a machine, but she operated as such).
Give a confidence score the way we do in statistics, make sure to offer sources, and be ready to push back on more objective answers. Accomplish those and I'd be way more comfortable using them as a tool.
>that's part of the risk of using these systems and that's why you need to be careful how you use them.
And we know in 2025 how careful the general user is about consuming bias and propaganda, right?
The confidence score is about the likelihood of this token appearing in this context.
LLMs don't operate on facts or knowledge.
It certainly should be able to tell you it doesn't know. Until it can though, a trick that I have learned is to try to frame the question in different ways that suggest contradictory answers. For example, I'd ask something like these, in a fresh context for each:
- Why does DuckDuckGo change its logo based on what you've searched?
- Why doesn't DuckDuckGo change its logo based on what you've searched?
- When did DuckDuckGo add the current feature that will change the logo based on what you've searched?
- When did DuckDuckGo remove the feature that changes the logo based on what you've searched?
This is similar to what you did, but it feels more natural when I genuinely don't know the answer myself. By asking loaded questions like this, you can get a sense of how strongly this information is encoded in the model. If the LLM comes up with an answer without contradicting any of the questions, it simply doesn't know. If it comes up with a reason for one of them, and contradicts the other matching loaded question, you know that information is encoded fairly strongly in the model (whether it is correct is a different matter).
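If you find yourself doing this often, it's easy to script. Here's a rough sketch of the probe, assuming the openai Python package (swap in whichever client or local model you actually use); the questions are the ones from the list above:

    from openai import OpenAI  # assumes the openai Python package, v1+

    client = OpenAI()

    probes = [
        "Why does DuckDuckGo change its logo based on what you've searched?",
        "Why doesn't DuckDuckGo change its logo based on what you've searched?",
        "When did DuckDuckGo add the feature that changes the logo based on what you've searched?",
        "When did DuckDuckGo remove the feature that changes the logo based on what you've searched?",
    ]

    for question in probes:
        # A brand-new messages list per question = a fresh context,
        # so the answers can't anchor on each other.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": question}],
        )
        print(f"Q: {question}\nA: {resp.choices[0].message.content}\n")

If it happily explains both when the feature was added and when it was removed, treat that as "it doesn't know", regardless of how confident each individual answer sounds.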
I see these approaches a lot when I look over the shoulders of LLM users, and find it very funny :D you're spending the time, effort, bandwidth and energy for four carefully worded questions to try and get a sense of the likelihood of the LLM's output resembling facts, when just a single, basic query with simple terms in any traditional search engine would give you a much more reliable, more easily verifiable/falsifiable answer. People seem so transfixed by the conversational interface smokeshow that they forget we already have much better tools for all of these problems. (And yes, I understand that these were just toy examples.)
The nice thing about using a language model over a traditional search engine is being able to provide specific context (i.e. disambiguate where keyword searches would be ambiguous) and to correlate, in a single LLM query, unrelated information that would otherwise require multiple traditional searches. I use Kagi, which provides interfaces for both traditional keyword searches and LLM chats. I use whichever is more appropriate for any given query.
It really depends on the query. I'm not a Google query expert, but I'm above average. I've noticed that phrasing a query in a certain way to get better results just no longer works. Especially in the last year, I have found it returns results that aren't even relevant at all.
The problem is that people have learned to fill their articles/blogs with as many word combinations as possible so that they show up in as many Google searches as possible, even when they're not relevant to the main question. Such an article has just one subheading that is somewhat relevant to the search query, even though the information under that subheading is completely irrelevant.
LLMs have ironically made this even worse because now it's so easy to generate slop and have it be recommended by Google's SEO. I used to be very good at phrasing a search query in the right way, or quoting the right words/phrases, or having it filter by sites. Those techniques no longer work.
So I have turned to ChatGPT for most of the queries I would have typically used Google for. Especially with the introduction of annotations. Now I can verify the source from where it determined the answer. It's a far better experience in most circumstances compared to Google.
I have also found ChatGPT to be much better than other LLMs at understanding nuance. There have been numerous occasions where I have pushed back against ChatGPT's answer and it has responded with something like "You would be correct if your input/criteria is X. But in this case, since your input/criteria is Y, this is the better solution for Z reasons".