I agree, and I'd put it this way: LLMs sound so convincing, presenting their work through rose-colored glasses and promising you more if you keep going.
There's a 50/50 chance it turns out to be right or sends you off a cliff.
Either way, the trip itself stays the same beautiful five-star-plus travel.
Also, spotting an error and pointing it out to the LLM usually makes things worse, because the LLM wants to please you and rushes to apologize and change course.
The moment I find myself in such a situation, I usually save or cancel the session and start from scratch, or pivot with drastic measures.
Gemini is the most unpredictable LLM for me, while GPT works best overall.
Gemini recently gave me two different answers to the same question. This was an intentional test: I was bored and wanted to see what happens if you simply open a new chat and paste the same prompt, everything else being equal.
Reasoning doesn't help me much in the coding domain, because the explanations the LLM comes up with are very high-level and only formally correct.
I google more because of LLMs than I did before, because essentially what I witness is someone producing something I have to check before I press the button it comes with. And you only find out shortly afterwards whether the polished button actually works or gives you a warm welcome to hell.
Reusing the same prompt several times is something I've started doing too. The contrast is often illuminating.
In one case, it made a thoroughly convincing argument that an approach was justified. The second time, it made exactly the opposite argument, which was equally compelling.
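This kind of same-prompt consistency check is easy to script instead of clicking through fresh chat sessions by hand. Here is a minimal sketch; the `ask` callable stands in for whatever model client you use (the name and the flip-flopping stub are my own illustration, not any real API):

```python
from collections import Counter

def consistency_check(ask, prompt, n=3):
    """Send the same prompt n times (each call representing a
    fresh session) and tally the distinct answers that come back."""
    answers = [ask(prompt) for _ in range(n)]
    return Counter(answers)

if __name__ == "__main__":
    # Stand-in for a real model call: a client that flip-flops
    # between two opposite recommendations across "sessions".
    flip = iter(["use a queue", "avoid queues", "use a queue"])
    tally = consistency_check(lambda p: next(flip),
                              "Should I use a queue here?", n=3)
    # More than one distinct answer means the model is not
    # consistent on this prompt.
    print(len(tally) > 1)
```

With a real client you'd want each call to start a new conversation, since carried-over context would mask exactly the variance you're trying to measure.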
I now see LLMs as persuasion machines.
One thing I've been doing lately -- and I'm in a business function, not a technical one, although I have an engineering background -- is pitting LLMs against each other. For example, if I'm structuring a proposal or a contract with the assistance of Claude, I'll begin my 360 feedback review first by asking Claude how it would react if it were the counter-party receiving the proposal. After some iterative changes, mostly manual, I will then run the same output document past Gemini and ask it to adopt personas from both sides and provide reactive feedback. The result of this is almost always a stronger proposal that I can also accompany with proactive objection handling and a solid FAQ, as well as clear points of negotiation that will likely be acceptable to both parties.
For this sort of thing, using multiple LLMs is extremely helpful.
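The persona step of that workflow can be reduced to a small prompt-building helper. Everything here is hypothetical — the function names and the prompt wording are just a sketch of the idea, not a tested template:

```python
def persona_review_prompt(document, persona):
    """Wrap a draft in a request for in-character feedback
    from a named persona (e.g. the counter-party)."""
    return (
        f"Adopt the persona of {persona}.\n"
        "React to the proposal below from that perspective: "
        "list objections, risks, and points you would negotiate.\n\n"
        f"---\n{document}"
    )

def cross_review(document, personas):
    """Produce one review prompt per persona. Each prompt can
    then be sent to a different model than the one that wrote
    the draft, so no single model reviews its own work."""
    return {p: persona_review_prompt(document, p) for p in personas}
```

The point of routing each persona prompt to a different model is the same as in the comment above: a second model has no stake in defending the first model's draft.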
Before AI happened, I watched YouTube, where I occasionally encountered very convincing arguments. Often the same person made convincing arguments on many subjects.
But I noticed that the closer the domain they were discussing was to my area of competence, the less convincing their arguments became: there were more holes, errors, and wrong conclusions.
I recalibrated my bs meter thanks to that.
Since AI came along, I have successfully used this strategy of being extremely cautious about convincing arguments so as not to be misled by AI.
However, this year I'm working with AI more in the domain of software development, where I can actually judge competence. And I do see competence. This has had the opposite effect on me: after seeing what AI can do in software, I tend to trust it much more outside my domain of expertise.
One caveat, though, is that there are a lot of areas of human culture where there's very little actual knowledge but a lot of opinions: politics, economics, diet, business, health. I still don't trust AI in those domains. But then again, I don't trust humans there either.
For me, AI has basically reached the threshold of useful reliability in any domain where humans are reliable.
I don't really care about sycophancy. I may have a slight advantage in that I don't talk to AI in my native language, so its responses don't have a direct line to my emotions.
Ever since they started getting really sycophantic, I’ve been presenting my ideas as “my co-worker says this is a good approach but I disagree, can you help me convince him that it’s wrong?”
>LLM wants to please you
I was using Copilot and asked it a question about a PDF file (a concept search). It turned out the file was images of text. I was anticipating that and had the text ready to paste in.
Instead, it started writing an OCR program in Python.
I stopped it after several minutes.
Often Copilot says it can't do something (and sometimes it's even correct); that's preferable to the try-hard behaviour here.
> Gemini to me is the most unpredictable LLM while GPT works best overall for me.
This nails an important point, IMHO. I've absolutely noticed it, for better or worse. Gemini can produce surprisingly excellent things, but its unpredictability makes me go for GPT when I only want to ask once.