I'm a significant genAI skeptic.

I periodically ask models questions about topics that are subtle or tricky, and somewhat niche, that I know a lot about, and find that they frequently give extremely bad answers. There have been improvements on some topics, but there's one benchmark question of mine that just about every model I've tried has gotten completely wrong.

Tried it on LMArena recently and got a comparison between Gemini 2.5 Flash and a codenamed model that people believe was a preview of Gemini 3 Flash. Gemini 2.5 Flash got it completely wrong. Gemini 3 Flash actually gave a reasonable answer; not quite up to the best human description, but it's the first model I've found that seems to mostly answer the question correctly.

So, it's just one data point, but at least for my one fairly niche benchmark problem, Gemini 3 Flash has successfully answered a question that none of the others I've tried have (I haven't actually tried Gemini 3 Pro, but I'd compared various Claude and ChatGPT models, and a few different open weights models).

So, I guess I need to put together some more benchmark problems to get a better sample than one, but it's at least now passing an "I can find the answer to this in the top 3 hits of a Google search for a niche topic" test better than any of the other models.

Still a lot of things I'm skeptical about in all the LLM hype, but at least they are making some progress in being able to accurately answer a wider range of questions.

I don't think tricky niche knowledge is the sweet spot for genAI, and it likely won't be for some time. Instead, it's a great replacement for rote tasks where less-than-perfect performance is good enough: transcription, OCR, boilerplate code generation, etc.

The thing is, I see people use it for tricky niche knowledge all the time; using it as an alternative to doing a Google search.

So I want to have a general idea of how good it is at this.

I found something that was niche, but not super niche; I could easily find a good, human written answer in the top couple of results of a Google search.

But until now, all LLM answers I've gotten for it have been complete hallucinated gibberish.

Anyhow, this is a single data point, and I need to expand my set of benchmark questions a bit now, but this is the first time that I've actually seen progress on this particular personal benchmark.

That's riding the hype machine and throwing the baby out with the bathwater.

Get an API key and try using it for classification of text or images. If you have an Excel file with 10k somewhat random-looking entries that you want to classify, or filter down to the 10 that matter to you, use an LLM.
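A minimal sketch of what I mean, assuming the openai and pandas packages (plus openpyxl for reading Excel); the file name, column name, labels and model are all just placeholders:

    import pandas as pd
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    df = pd.read_excel("entries.xlsx")  # hypothetical file with a "text" column

    labels = ["relevant", "irrelevant"]

    def classify(text: str) -> str:
        # Ask the model to pick exactly one label for the entry.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": f"Classify the user's text as one of: {', '.join(labels)}. Reply with the label only."},
                {"role": "user", "content": text},
            ],
        )
        return resp.choices[0].message.content.strip()

    df["label"] = df["text"].apply(classify)
    print(df[df["label"] == "relevant"].head(10))  # the handful of entries you actually care about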

Get it to do audio transcription. You can now just talk and it will take notes for you at a level that wasn't possible before; without training on a specific person's voice, it can handle anyone's voice.
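For example, with the open-source whisper package (assuming ffmpeg is installed and "meeting.mp3" stands in for whatever recording you have):

    import whisper

    # Load a small general-purpose model; "medium" or "large" are more accurate but slower.
    model = whisper.load_model("base")

    # Transcribe a local recording; the file name here is just an example.
    result = model.transcribe("meeting.mp3")
    print(result["text"])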

Fixing up text is of course also big.

Data classification is easy for an LLM. Data transformation is a bit harder but still works great. Creating new data is hard, so for things like answering questions where it has to generate content from thin air, it will hallucinate like a madman.

The tasks LLMs are good at are used in the background by people building genuinely useful software on top of LLMs, but those problems aren't visible to the general public, who only sees the chat box.

But people using the wrong tool for a task is nothing new. Using Excel as a database (still happening today), etc.

Maybe the scale is different with genAI and there are some painful learnings ahead of us.

And Google themselves obviously believe that too, as they happily insert AI summaries at the top of most SERPs now.

Or maybe Google knows most people search inane, obvious things?

Or, more likely, Google couldn't give a rat's arse whether those AI summaries are good or not (except to the degree that people don't flee), and what it cares about is that they keep users with Google itself, instead of clicking off to other sources.

After all, it's the same search engine team that didn't care about its search results - its main draw - actively going to shit for over a decade.

Google AI Overviews are often wrong about obvious things, so... lol

They probably use an old Flash Lite model, something super small, and just summarize the search results...

Those summaries would be far more expensive to generate than the searches themselves, so they're probably caching the top 100k most common queries or something, maybe even pre-caching them.


I also use niche questions a lot, but mostly to check how much the models tend to hallucinate. E.g. I start asking about rank badges in Star Trek, which they usually get right, and then I ask about specific (non-existent) rank badges shaped like strawberries or something like that. Or I ask about smaller German cities and what's famous about them.

I know that without the ability to search, it's very unlikely the model actually has accurate "memories" about these things. I just hope one day they will actually know that their "memory" is bad or non-existent and will tell me so, instead of hallucinating something.

I'm waiting for properly adjusted, specialized LLMs: an LLM trained on enough trustworthy generic data that it is able to understand/comprehend me and different languages, but that always talks to a fact database in the background.

I don't need an LLM to have a trillion parameters if I just need it to be a great user interface.

Someone is probably working on this somewhere, or will be, but let's see.

Second this.

Basically, making sense of unstructured data is super cool. I can get 20 people to write an answer however they feel like it, and a model can convert it to structured data - something I would otherwise have to spend time on, or I would have to build a form with mandatory fields that annoy the audience.
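A rough sketch of that free-text-to-structured-data step, assuming the openai package; the field names and model are illustrative only:

    import json
    from openai import OpenAI

    client = OpenAI()

    def to_structured(answer: str) -> dict:
        # Ask the model to map a free-form answer onto a fixed set of fields.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system",
                 "content": "Extract the fields name, team, and request from the user's answer "
                            "and return them as a JSON object. Use null for anything missing."},
                {"role": "user", "content": answer},
            ],
        )
        return json.loads(resp.choices[0].message.content)

    print(to_structured("Hi, this is Maria from logistics, we mostly need the export button back."))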

I am already building useful tools with the help of models. Asking tricky or trivia questions is fun and games. There are much more interesting ways to use AI.

Well, I used Grok to find information I'd forgotten, like product names, films, books, and various articles on different subjects. Google search didn't help, but putting the LLM to work did the trick.

So I think LLMs can be good for finding niche info.

Yeah, but tests like that deliberately prod the boundaries of its capability rather than how well it does what it’s good at.

So this is an interesting benchmark, because if the answer is actually in the top 3 Google results, then my Python script that runs a Google search, scrapes the top n results and shoves them into a crappy LLM would pass your benchmark too!
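Something like this minimal sketch, assuming the googlesearch-python, requests, beautifulsoup4, and openai packages (all names and the model choice are just placeholders):

    import requests
    from bs4 import BeautifulSoup
    from googlesearch import search
    from openai import OpenAI

    client = OpenAI()

    def answer_from_search(question: str, n: int = 3) -> str:
        pages = []
        # Grab the top n result pages and strip them down to plain text.
        for url in search(question, num_results=n):
            html = requests.get(url, timeout=10).text
            pages.append(BeautifulSoup(html, "html.parser").get_text(" ", strip=True)[:4000])

        # Shove the scraped text plus the question into a cheap model.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Answer the question using only the provided pages."},
                {"role": "user", "content": question + "\n\n" + "\n\n---\n\n".join(pages)},
            ],
        )
        return resp.choices[0].message.content

    print(answer_from_search("some niche benchmark question"))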

Which also implies that (for most tasks), most of the weights in a LLM are unnecessary, since they are spent on memorizing the long tail of Common Crawl... but maybe memorizing infinite trivia is not a bug but actually required for the generalization to work? (Humans don't have far transfer though... do transformers have it?)

I've tried doing this query with search enabled in LLMs before, which is supposed to effectively do that, and even then they didn't give very good answers. It's a very physical kind of thing, and it's easy to conflate with other similar descriptions, so they would frequently conflate various different things and give some horrible mash-up answer that wasn't about the specific thing I'd asked about.

So it's a difficult question for LLMs to answer even when given perfect context?

Kinda sounds like you're testing two things at the same time then, right? The knowledge of the thing (was it in the training data and was it memorized?) and the understanding of the thing (can they explain it properly even if you give them the answer in context).

Counterpoint about general knowledge that is documented/discussed in different spots on the internet.

Today I had to resolve performance problems with a SQL Server statement. Been doing it for years, know the regular pitfalls, and sometimes have to find the "right" words to explain to the customer why X is bad and such.

I described the issue to GPT-5.2, gave it the query and the execution plan, and asked for help.

It was spot on: high-quality responses, actionable items, and explanations of why this or that is bad, how to improve it, and why SQL Server may have generated such a query plan. I could instantly validate the response given my experience in the field. I even replied to the customer with some parts of ChatGPT's answer because of how well it explained things - though I did mention that to the customer and told them I'd vetted the answer.

Ask a high-quality question and you receive a high-quality answer. And I am happy that I found out about a SQL Server flag with which I can influence that particular decision. But the suggestion wasn't limited to that; there were multiple points given that would help.

Even the most magical wonderful auto-hammer is gonna be bad at driving in screws. And, in this analogy I can't fault you because there are people trying to sell this hammer as a screwdriver. My opinion is that it's important to not lose sight of the places where it is useful because of the places where it isn't.

Funny, I grew up using what's called a "hand impact screwdriver"... turns out a hammer can be used to drive in screws!

Can you give us an example of this niche knowledge? I highly doubt there is knowledge that isn't inside some internet training material.

Hi. I am curious, what was the benchmark question? Cheers!

The problem with publicly disclosing these is that if lots of people adopt them they will become targeted to be in the model and will no longer be a good benchmark.

Yeah, that's part of why I don't disclose.

Obviously, the fact that I've done Google searches and tested the models on these means that their systems may have picked up on them; I'm sure that Google uses its huge dataset of Google searches and its search index as inputs to its training, so Google has an advantage here. But, well, that might be why Google's new models are so much better: they're actually taking advantage of some of this massive dataset they've had for years.

This thought process is pretty baffling to me, and this is at least the second time I've encountered it on HN.

What's the value of a secret benchmark to anyone but the secret holder? Does your niche benchmark even influence which model you use for unrelated queries? If LLM authors care enough about your niche (they don't) and fake the response somehow, you will learn on the very next query that something is amiss. Now that query is your secret benchmark.

Even for niche topics it's rare that I need to provide more than 1 correction or knowledge update.

I have a bunch of private benchmarks I run against new models I'm evaluating.

The reason I don't disclose isn't generally that I think an individual person is going to read my post and update the model to include it. Instead, it's that if I write "I ask the question X and expect Y", that data ends up in the training corpus of new LLMs.

However, one set of my benchmarks is a more generalized type of test (think a parlor-game type of thing) that actually works quite well. That set is the kind of thing that could be learned very well via reinforcement learning, and just mentioning it could be enough for a training company or data-provider company to try it. You can generate thousands of verifiable tests - potentially with verifiable reasoning traces - quite easily, as in the toy sketch below.
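Purely as a hypothetical illustration of the "thousands of verifiable tests" idea (not the actual parlor game, which stays private), something like this generates question/answer pairs with a mechanical verifier:

    import random
    import string

    def make_case(rng: random.Random) -> tuple[str, str]:
        # A toy verifiable task: ask for a string reversed; the answer is checkable mechanically.
        word = "".join(rng.choices(string.ascii_lowercase, k=8))
        return f"Reverse the string '{word}' and reply with only the result.", word[::-1]

    def verify(expected: str, model_output: str) -> bool:
        return model_output.strip().lower() == expected

    rng = random.Random(0)
    cases = [make_case(rng) for _ in range(1000)]  # thousands of cases are cheap to generate
    question, expected = cases[0]
    print(question)
    print("passes with correct answer:", verify(expected, expected))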

Ok, but then your "post" isn't scientific by definition, since it cannot be verified. "Post" is in quotes because I don't know what you're trying to do, but you're implying some sort of public discourse.

For fun: https://chatgpt.com/s/t_694361c12cec819185e9850d0cf0c629

I didn't see anyone claiming any 'science'? Did I miss something?

I guess there's two things I'm still stuck on:

1. What is the purpose of the benchmark?

2. What is the purpose of publicly discussing a benchmark's results but keeping the methodology secret?

To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.

1. The purpose of the benchmark is to choose what models I use for my own system(s). This is extremely common practice in AI - I think every company I've worked with doing LLM work in the last 2 years has done this in some form.

2. I discussed that up-thread, but https://github.com/microsoft/private-benchmarking and https://arxiv.org/abs/2403.00393 discuss some further motivation for this if you are interested.

> To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.

This is an odd way of looking at it. There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.

I see the potential value of private evaluations. They aren't scientific but you can certainly beat a "vibe test".

I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.

> There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.

Then you must not be working in an environment where a better benchmark yields a competitive advantage.

> I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.

In principle, we have ways: if nl's reports consistently predict how public benchmarks will turn out later, they can build up a reputation. Of course, that requires that we follow nl around for a while.

As ChatGPT said to you:

> A secret benchmark is: Useful for internal model selection

That's what I'm doing.

My question was "What's the value of a secret benchmark to anyone but the secret holder?"

The root of this whole discussion was a post about how Gemini 3 outperformed other models on some presumably informal question benchmark (a "vibe test"?). When asked for the benchmark, the response from the op and someone else was that secrecy was needed to protect the benchmark from contamination. I'm skeptical of the need in the op's case, and I'm skeptical of the effectiveness of the secrecy in general. In a case where secrecy has actual value, why even discuss the benchmark publicly at all?

The point is that it's a litmus test for how well the models do with niche knowledge _in general_. The point isn't really to know how well the model works for that specific niche. Ideally of course you would use a few of them and aggregate the results.

I actually think "concealing the question" is not only a good idea, but a rather general and powerful idea that should be much more widely deployed (but often won't be, for what I consider "emotional reasons").

Example: You are probably already aware that almost any metric that you try to use to measure code quality can be easily gamed. One possible strategy is to choose a weighted mixture of metrics and conceal the weights. The weights can even change over time. Is it perfect? No. But it's at least correlated with code quality -- and it's not trivially gameable, which puts it above most individual public metrics.
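A toy sketch of that concealed weighted mixture; the metric names and weights are made up, and in practice the weights would live in a private config:

    # Hypothetical per-file metrics, already normalized to the 0..1 range (names are made up).
    metrics = {"test_coverage": 0.72, "lint_score": 0.9, "avg_function_length": 0.4, "churn": 0.55}

    # The weights are the secret part; they could be re-drawn periodically
    # so the mixture stays hard to game.
    hidden_weights = {"test_coverage": 0.4, "lint_score": 0.1, "avg_function_length": 0.3, "churn": 0.2}

    score = sum(hidden_weights[name] * value for name, value in metrics.items())
    print(f"code quality score: {score:.2f}")  # only this aggregate is published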

It's hard to have any certainty around concealment unless you are only testing local LLMs. As a matter of principle I assume the input and output of any query I run in a remote LLM is permanently public information (same with search queries).

Will someone (or some system) see my query and think "we ought to improve this"? I have no idea since I don't work on these systems. In some instances involving random sampling... probably yes!

This is the second reason I find the idea of publicly discussing secret benchmarks silly.

I learned in another thread there is some work being done to avoid contamination of training data during evaluation of remote models using trusted execution environments (https://arxiv.org/pdf/2403.00393). It requires participation of the model owner.

Because it encompasses the very specific way I like to do things. It's not of use to the general public.

If they told you, it would be picked up in a future model's training run.

Don't the models typically train on their input too? I.e. submitting the question also carries a risk/chance of it getting picked up?

I guess they get such a large input of queries that they can only realistically check and therefore use a small fraction? Though maybe they've come up with some clever trick to make use of it anyway?

OpenAI and Anthropic don't train on your questions if you have pressed the opt-out button and are using their UI. LMArena is a different matter.

They probably don't train on inputs from testing grounds.

You don't train on your test data, because you need that to compare whether training is improving or not.

Given they asked it on LMArena, yes.

Yeah, probably asking it on LMArena makes this an invalid benchmark going forward, especially since I think Google is particularly active in testing models on LMArena (as evidenced by the fact that I got their preview for this question).

I'll need to find a new one, or actually put together a set of questions to use instead of just a single benchmark.

Is that an issue if you now need a new question to ask?

Here's my old benchmark question and my new variant:

"When was the last time England beat Scotland at rugby union"

new variant "Without using search when was the last time England beat Scotland at rugby union"

It is amazing how bad ChatGPT is at this question, and has been for years now, across multiple models. It's not that it gets it wrong - no shade, I've told it not to search the web, so this is _hard_ for it - but how badly it reports the answer. Starting from the small stuff: it almost always reports the wrong year, wrong location, and wrong score - that's the boring facts stuff that I would expect it to stumble on. It often invents details of matches that didn't exist - cool, standard hallucinations. But even within the text it generates itself, it cannot keep things consistent with how reality works. It often reports draws as wins for England. It frequently states that the team it just said scored the most points lost the match, etc.

It is my ur-example for when people challenge my assertion that LLMs are stochastic parrots or fancy Markov chains on steroids.

I also have my own tricky benchmark that until now only DeepSeek has been able to answer. Gemini 3 Pro was the second. Every other LLM fails horribly. This is the main reason I started looking at G3 Pro more seriously.