I wish LLMs were good at search. I've tried to evaluate them many times for their quality at answering research questions for astrophysics (specifically numerical relativity). If they were good at answering questions, I'd use them in a heartbeat.
Without exception, every technical question I've ever asked an LLM that I know the answer to has been answered substantially wrong in some fashion. This makes it just... absolutely useless for research. In some cases I've spotted it straight up plagiarising from the original sources, with random capitalisation giving it away.
The issue is that once you get even slightly into a niche, they fall apart, because the training data just doesn't exist. But they don't say "sorry, there's insufficient training data to give you an answer"; they just make shit up and state it, confidently incorrect.
LLMs got good at search last year. You need to use the right ones though - ChatGPT Thinking mode and Google AI mode (that's https://www.google.com/ai - which is NOT the same as regular Google's "AI overviews" which are still mostly trash) are both excellent.
I've been tracking advances in AI assisted search here - https://simonwillison.net/tags/ai-assisted-search/ - in particular:
- https://simonwillison.net/2025/Apr/21/ai-assisted-search/ - April is when they started getting good, with o3 and the various deep research tools
- https://simonwillison.net/2025/Sep/6/research-goblin/ - with GPT-5 it got excellent. This post includes several detailed examples, including "Starbucks in the UK don’t sell cake pops! Do a deep investigative dive".
- https://simonwillison.net/2025/Sep/7/ai-mode/ - AI mode from Google
> LLMs got good at search last year. You need to use the right ones though - ChatGPT Thinking mode and Google AI mode (that's https://www.google.com/ai - which is NOT the same as regular Google's "AI overviews" which are still mostly trash) are both excellent.
I disagree. You might have seen some improvements in the results, but all LLMs still hallucinate quite hard on simple queries where you prompt them to cite their sources. You'll see ChatGPT insist that the source of its assertions is a 404 link that it claims is working.
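For what it's worth, that particular failure mode is cheap to check mechanically. A minimal sketch (the URLs below are placeholders, not real citations) that flags cited links that don't actually resolve:

```python
# Sketch: verify that URLs an LLM cites actually resolve.
# The example URLs below are placeholders, not real citations.
import urllib.error
import urllib.request

def check_citation(url: str, timeout: float = 10.0) -> str:
    """Return a short status string for a cited URL."""
    # Some servers reject HEAD; swap to method="GET" if you see false negatives.
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "citation-checker/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return f"OK (HTTP {resp.status})"
    except urllib.error.HTTPError as e:
        return f"BROKEN (HTTP {e.code})"       # the classic hallucinated 404
    except urllib.error.URLError as e:
        return f"UNREACHABLE ({e.reason})"

if __name__ == "__main__":
    cited = [
        "https://example.com/",
        "https://example.com/page-the-model-made-up",
    ]
    for url in cited:
        print(f"{check_citation(url):<25} {url}")
```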
This is just completely the opposite of what I've experienced with Claude and Gemini. Sources are identified and, if inaccessible, are not included in the citations. I recently tried a quite specific search aimed at finding information about specific memos and essays cited within a 90s memo by Bill Gates, and it was successful at finding the vast majority of them, something Google search failed at.
I don't want to say that it's a skill issue, but you may just be using the wrong tools for the job.
Oh boy, someone's claiming that ChatGPT is actually great now, time to ask it some questions
I asked ChatGPT's Thinking mode if the ADM formalism is strictly equivalent to general relativity, and it made several strongly incorrect statements.
This is my favourite:
>3. Boundary terms matter
>To be fully equivalent:
>One must add the correct Gibbons–Hawking–York boundary term
>And handle asymptotic conditions carefully (e.g. ADM energy)
>Otherwise, the variational principle is not well-defined.
Which is borderline gibberish
>The theory still has 2 propagating DOF per spacetime point
This is pretty good too
>(lapse and shift act as Lagrange multipliers, not dynamical fields).
This is also, as far as I'm aware, just wrong, as the gauge conditions are nonphysical. In practice, lapse and shift are generally treated as dynamical fields.
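For context, here is a rough sketch of the standard setup (mine, not ChatGPT's output, and modulo sign conventions): the ADM split writes the metric in terms of a lapse and a shift, and in numerical relativity those are evolved with gauge conditions just like everything else.

```latex
% ADM (3+1) line element: lapse \alpha, shift \beta^i, spatial metric \gamma_{ij}
ds^2 = -\alpha^2\,dt^2 + \gamma_{ij}\,(dx^i + \beta^i\,dt)(dx^j + \beta^j\,dt)

% The "Lagrange multiplier" language comes from the canonical picture, where the
% action contains no time derivatives of \alpha or \beta^i. In numerical
% relativity, though, they are typically evolved dynamically via gauge
% conditions, e.g. the common 1+log slicing condition for the lapse
\partial_t \alpha = \beta^i \partial_i \alpha - 2\alpha K
% (with K the trace of the extrinsic curvature), plus a Gamma-driver
% condition for the shift.
```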
Its full answer reads like someone with minimal understanding of physics trying to bullshit you. Then I asked it if the BSSN formalism is strictly equivalent to the ADM formalism (it isn't, because it isn't covariant)
This answer is actually more wrong, surprisingly
>Yes — classically, the BSSN formalism is equivalent to ADM, but only under specific conditions. In practice, it is a reparameterization plus gauge fixing and constraint handling, not a new theory. The equivalence is more delicate than ADM ↔ GR.
The ONE thing that doesn't change in the BSSN formalism is the gauge conditions
>Rewriting the evolution equations, adding terms proportional to constraints.
This is also pretty inadequate
>Precise equivalence statement
>BSSN is strictly equivalent to ADM at the classical level if:
...
>Gauge choices are compatible
>(e.g. lapse and shift not over-constraining the system)
This is complete gibberish
It also states:
>No extra degrees of freedom are introduced
I don't think ChatGPT knows what a degree of freedom is
>Why the equivalence is more subtle than ADM ↔ GR
>1. BSSN is not a canonical transformation
>Unlike ADM ↔ GR:
>BSSN is not manifestly Hamiltonian
>The Poisson structure is not preserved automatically
>One must reconstruct ADM variables to see equivalence
This is all absolute bollocks. "Manifestly Hamiltonian" is literally gibberish. Neither of these formalisms has a "Poisson structure", whatever that means, and sure, yes, you can construct the ADM variables from the BSSN variables, whoopee.
>When equivalence can fail
>Discretized (numerical) system -> Equivalence only approximate
Nobody explain to ChatGPT that the ADM formalism is also a discretisable series of PDEs!
>BSSN and ADM describe the same classical solutions of Einstein’s equations, but BSSN reshapes the phase space and constraint handling to make the evolution well-behaved, sacrificing manifest Hamiltonian structure off-shell.
We're starting to hit timecube levels of nonsense
It also gets the original question completely wrong: the BSSN formalism isn't covariant or coordinate-free. There's an alternative BSSN-like formalism called cBSSN (covariant BSSN), which is similar to CCZ4 and Z4cc (both covariant). It's an important property that the regular BSSN formalism lacks, and it's one of the ways you can identify it as not being strictly equivalent to the ADM formalism on mathematical grounds. So in the ADM formalism you can express your equations in polar coordinates, but if you make that transformation in the BSSN formalism, it's no longer the same formalism.
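To make the covariance point concrete, here is a rough sketch of the standard definitions (mine, not from the original exchange):

```latex
% Standard BSSN conformal split of the spatial metric:
\gamma_{ij} = e^{4\phi}\,\tilde{\gamma}_{ij}, \qquad \det\tilde{\gamma}_{ij} = 1

% The unit-determinant condition is not a tensor equation: under a change of
% coordinates, \det\gamma_{ij} picks up a Jacobian factor, so
% \det\tilde{\gamma}_{ij} = 1 only survives transformations with unit Jacobian
% (Cartesian-like coordinates). Covariant variants instead impose
\det\tilde{\gamma}_{ij} = \det\hat{\gamma}_{ij}
% for a fixed background/reference metric \hat{\gamma}_{ij}, which is what
% restores covariance in cBSSN-style formulations.
```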
This has actually gotten significantly worse since the last time I asked ChatGPT about this kind of thing; it's more confidently incorrect now.
Perhaps try asking it a question that other people on HN could also answer, lol...
How did it do when you posed these arguments to it?
> Without exception, every technical question I've ever asked an LLM that I know the answer to has been answered substantially wrong in some fashion.
The other problem that I tend to hit is a tradeoff between wrongness and slowness. The fastest variants of the SOTA models are so frequently and so severely wrong that I don't find them useful for search. But the bigger, slower ones that spend more time "thinking" take so long to yield their (admittedly better) results that it's often faster for me to just do some web searching myself.
They tend to be more useful the first time I'm approaching a subject, or before I've familiarized myself with the documentation of some API or language or whatever. After I've taken some time to orient myself (even by just following the links they've given me a few times), it becomes faster for me to just search by myself.
>> at answering research questions for astrophysics
I googled for "helium 3" yesterday. Google's AI answer said that helium 3 is "primarily sourced from the moon", as if we were actively mining it there already.
There are probably thousands of scifi books where the moon has some form of helium 3 mining. Considering Google pirated and used them all for training, it makes sense that it puts it in the present tense.
On a similar note, Gemini told me that I was born in 2025 when I did a cursory search for my real name. It's rather confident.
I wonder how much memory and computing time goes into making them, vs. a typical "proper" LLM prompt. It's like the freebies you get with a Christmas cracker.
If you nudge it towards tool use, a lot of the time it can give you better answers.
Instead of "how cheese X is usually made", try "search the web and give me a summary on the ways cheese X is made".
> I wish LLMs were good at search
The entire situation of web search for LLMs is a mess. None of the existing providers return good or usable results, and Google refuses to provide general access to theirs. As a result, all LLMs (except maybe Gemini) are severely gimped until someone solves this.
I seriously believe that the only real new breakthrough for LLM research would come from a clean, trustworthy, comprehensive search index. Maybe someone will build that? Otherwise we're stuck with subpar results indefinitely.
YaCy does a pretty good job, is free, and you can run it yourself, so the quality/experience is pretty much up to you. Paired with a local GPT-OSS-120b with reasoning_effort set to high, I'm getting pretty good results. I've validated it with questions I do know the answer to, and it seems alright, although it could be better of course; I'm still getting better results out of GPT5.2 Pro, which I guess is to be expected.
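For anyone curious what the plumbing for that kind of setup looks like, here's a rough sketch. It assumes YaCy on its default port (8090) and a local model behind an OpenAI-compatible /v1/chat/completions endpoint (llama.cpp server, LM Studio, etc.); the yacysearch.json field names are from memory and may differ by version, so treat them as assumptions to verify:

```python
# Rough sketch: feed YaCy search results to a local OpenAI-compatible model.
# Assumptions: YaCy on localhost:8090, a local model server on localhost:8080
# exposing /v1/chat/completions, and YaCy's yacysearch.json response shaped as
# channels[0].items[*].{title,link,description} - verify against your version.
import json
import urllib.parse
import urllib.request

YACY = "http://localhost:8090/yacysearch.json"
LLM = "http://localhost:8080/v1/chat/completions"

def yacy_search(query: str, count: int = 10) -> list[dict]:
    """Query the local YaCy index and return raw result items."""
    params = urllib.parse.urlencode({"query": query, "maximumRecords": count})
    with urllib.request.urlopen(f"{YACY}?{params}", timeout=30) as resp:
        data = json.load(resp)
    return data["channels"][0]["items"]

def summarise(query: str) -> str:
    """Bundle YaCy hits into a prompt and ask the local model to summarise."""
    hits = yacy_search(query)
    context = "\n".join(
        f"- {h.get('title', '')} ({h.get('link', '')}): {h.get('description', '')}"
        for h in hits
    )
    payload = {
        "model": "gpt-oss-120b",  # whatever model name your local server expects
        # your server may also accept a reasoning/effort option; check its docs
        "messages": [
            {"role": "system",
             "content": "Answer using only the search results provided. Cite links."},
            {"role": "user",
             "content": f"Question: {query}\n\nSearch results:\n{context}"},
        ],
    }
    req = urllib.request.Request(
        LLM, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(summarise("how is halloumi traditionally made"))
```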
The point of my comment was that the AI/LLM is almost irrelevant in light of low-quality search engine APIs/indexes. Is there a way to validate the actual quality and comprehensiveness of YaCy beyond anecdata?
> Is there a way to validate the actual quality and comprehensiveness of YaCy beyond anecdata?
No, because it's your own index essentially, hence the "the quality/experience is pretty much up to you" part.
Yeah, that's not really reassuring, nor indicative of its usefulness or value.
Yeah, if that's how you feel about your own abilities, then I guess that's the way it is. Not sure what that has to do with YaCy or my original comment.
Respectfully, you said:
> YaCy does a pretty good job
I assume that should be qualified with some basic amount of evidence beyond “I said so”? Anyways, thanks for pointing me in the direction of YaCy, will try it out.
How to build a search engine, apparently:
1. Install YaCy
2. Draw the rest of the owl
> state it, confidently incorrect
It's funny for me to read this. They don't exhibit "confidence". You are just getting the most accurate text that they can produce. Of course, the training data doesn't contain "I don't know" for questions; that would be really bad training data! If you are getting "attitudes", it would be because you are triggering some kind of dialogue-esque data with your prompts (or the system prompt might be doing that).
Expecting the LLM to say "sorry, I don't know" would be like expecting Google search to return "we found some pages but deemed them wrong, so we won't show you any".
Did you try https://elicit.org ?
I have been impressed by its results.
I think this stems more from its initial search phase than from pure LLM processing power, but to me it seems the approach works really well.