I'm very curious if a toggle would be useful that would display a heatmap of a source file showing how surprising each token is to the model. Red tokens are more likely to be errors, bad names, or wrong comments.

We explored this exact idea in our recent paper https://arxiv.org/abs/2505.22906

Turns out this kind of UI is not only useful to spot bugs, but also allows users to discover implementation choices and design decisions that are obscured by traditional assistant interfaces.

Very exciting research direction!

Very exciting indeed. I will definitely do a deep dive into this paper, as my current work is exploring layers of affordances such as these in workflows beyond coding.

I've wanted someone to write an extension utilising this idea since GPT-3 came out. Is it available to use anywhere?

This! That's what I've wanted since LLMs learned how to code.

And in fact, I think I saw a paper / blog post that showed exactly this, and then... nothing. For the last few years, the tech world has gone crazy over code generation, with forks of VSCode hooked up to LLMs worth billions of dollars and all that. But AI-based code analysis is remarkably poor. The only thing I have seen resembling this is bug report generators, which I believe is one of the worst approaches.

The idea you have, that I also had and that I'm sure many thousands of other people had, seems so obvious, so why is no one talking about it? Is there something wrong with it?

The thing is, using such a feature requires a brain between the keyboard and the chair. A "surprising" token can mean many things: a bug, but also a unique feature; either way, it's something you should pay attention to. Too much "green" should also be seen as a signal. Maybe you reinvented the wheel and should use a library instead, or maybe you failed to take into account a use case specific to your application.

Maybe such tools don't make good marketing. You need to be a competent programmer to use them. It won't help you write more lines faster. It doesn't fit the fantasy of making anyone into a programmer with no effort (hint: learning a programming language is not the hard part). It doesn't generate the busywork of AI 1 introducing bugs for AI 2 to create tickets for.

Just to point out...

> Is there something wrong with it?

> Maybe such tools don't make good marketing.

You had the answer the entire time :)

Features that require a brain between the AI and key-presses just don't sell. Don't expect to see them for sale. (But we can still get them for free.)

I don’t think I understand your point.

Are you saying that people of a certain competence level lose interest in force-multiplying tools? I don’t think you can be saying that because there’s so much contrary evidence. So what are you saying?

Other way around. The masses aren’t interested in force-multiplying tools. They only want to buy force-eliminating tools. They don’t want to work smarter or harder. They don’t want to work at all.

A fairly misanthropic view that hasn't been borne out in my experience.

I'm saying they don't sell.

Sometimes people want them so badly that they will self-organize and collaborate outside of a market to make them. But a market won't supply them.

And yes, it's a mix of many people not being competent enough to see the value in them, markets putting pressure on companies to listen disproportionately to those people, advertising having such a low signal-to-noise ratio that it can't communicate why a tool is good, and companies not respecting their customers enough to build stuff that is good for them (that last one isn't inherent to a market economy, but it's near universal nowadays).

Either way, the software market just doesn't sell tools as useful as the GP is talking about.

> The idea you have, that I also had and that I'm sure many thousands of other people had, seems so obvious, so why is no one talking about it? Is there something wrong with it?

I expect it definitely requires some iteration. I don't think you can just map logits to heat; you get a lot of noise that way.

Honestly I just never really thought about it. But now it seems obvious that AI should be continuously working in the background to analyze code (and the codebase) and could even tie into the theme of this thread by providing some type of programming HUD.

Even if something is surprising just because it's a novel algorithm, it warrants better documentation - but commenting the code explaining how it works will make the code itself less surprising!

In short, it's probably possible (and maybe a good engineering practice) to structure the source such that no specific part is really surprising.

It reminds me how LLMs finally made people care about having good documentation - if not for other people, then for the AIs to read and understand the system

I often find myself leaving review comments on pull requests where I was surprised. I'll state as much: This surprised me - I was expecting XYZ at this point. Or I wasn't expecting X to be in charge of Y.

WTFs/minute is a good metric for code quality. Now your pair expressing that can be an LLM.

https://blog.codinghorror.com/whos-your-coding-buddy/

I like to say that the reviewer is always right in that sense: if something is surprising, confusing, or unexpected to them, it is. Since I've been looking at the code for hours, I don't have a valid perspective anymore.

[deleted]

> It reminds me how LLMs finally made people care about having good documentation - if not for other people, then for the AIs to read and understand the system

Honestly I've mostly seen the opposite - impenetrable code translated to English by AI

Even if the impenetrable human code was translated to English by AI, it's still useful for every future AI that will touch the code.

Perhaps getting that decent documentation took a fair bit of agentic effort (or even multiple passes with different models) to truly understand the code and eliminate hallucinations, so capturing that high-quality, accurate summary in a comment could save a lot of tokens and time in the future.

Interesting! I've often felt that we aren't fully utilizing the "low hanging fruit" from the early days of the LLM craze. This seems like one of those ideas.

That's a really cool idea. Also the inverse, where suggestions from the AI were similarly heat-mapped for confidence, would be extremely useful.

I want that in an editor. It's also a good way to check if your writing is too predictable or cliche.

The perplexity calculation isn't difficult; you just need to incorporate it into the editor interface.

Can you elaborate on how one would do this calculation?

    import math
    import os
    import openai

    query = 'Paris is the capital of'            # short demo input
    assert 'OPENAI_API_KEY' in os.environ, 'set OPENAI_API_KEY first'

    client = openai.OpenAI()
    resp = client.chat.completions.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'user', 'content': query}],
        max_tokens=12,
        logprobs=True,        # return a log-probability for each generated token
        top_logprobs=1,
    )

    # per-token log-probabilities of the completion
    logprobs = [t.logprob for t in resp.choices[0].logprobs.content]
    # perplexity = exp of the negative mean log-probability
    perplexity = math.exp(-sum(logprobs) / len(logprobs))

    print('Prompt: "', query, '"', sep='')
    print('\nCompletion:', resp.choices[0].message.content)
    print('\nToken count:', len(logprobs))
    print('Perplexity:', round(perplexity, 2))
Output:

    Prompt: "Paris is the capital of"
    
    Completion:  France.
    
    Token count: 2
    Perplexity: 1.17

Meta: of the three models I tried (k2, qwen3-coder, and opus4), only opus one-shot the correct formatting for this comment.

If you want to generate a heatmap of existing text, you will have to take a different approach here.

The naive solution I could come up with would be really expensive with OpenAI, but if you have an open-source model you can write custom inference that goes one token at a time through the text; at each position you compare the logprob of the token the LLM would have predicted with the logprob of the token that was actually there, and use that difference to color the token.
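Roughly like this, as a sketch (HuggingFace transformers and torch assumed; 'gpt2' is just a placeholder model, and the snippet being scored is made up):

    # rough sketch: per-token surprisal of existing text with an open model
    # ('gpt2' is just a placeholder; a code model would do better on source files)
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained('gpt2')
    model = AutoModelForCausalLM.from_pretrained('gpt2')
    model.eval()

    # made-up snippet with a deliberate bug (s -= x in a sum-like function)
    code = 'def total(xs):\n    s = 0\n    for x in xs:\n        s -= x\n    return s\n'
    ids = tok(code, return_tensors='pt').input_ids

    with torch.no_grad():
        logits = model(ids).logits                    # [1, seq_len, vocab]

    # log-prob the model assigned to each actual token, given its prefix
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    actual = ids[0, 1:]
    token_lp = logprobs[torch.arange(actual.numel()), actual]

    # surprisal in bits; higher = more surprising = hotter color in the editor
    for t, lp in zip(tok.convert_ids_to_tokens(actual.tolist()), token_lp.tolist()):
        print(f'{t!r:>12}  {-lp / math.log(2):6.2f} bits')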

The downside I imagine to this approach is it would probably tend to highlight the beginning of bad code, and not the entire block, because once you commit to a mistake, the model will generally roll with it (i.e. a 'hallucination'), so the surprisal of tokens after the bug happened might be only slightly higher than normal.

Another option might be to use a diffusion-based model: add some noise to the input, have it iterate a few times through, then measure the parts of the text that changed the most. I have only a light theoretical understanding of these models though, so I'm not sure how well that would work.

There are some libraries that might make this easier to implement:

https://github.com/kanishkamisra/minicons

> so the surprisal of tokens after the bug happened might be only slightly higher than normal.

Sounds like it’s easier to pinpoint the bug.

There is some argument to be made here about entropy, compression, and how if there are no surprises, the program communicates no new information.

Interestingly, frequency of "surprising" sentences is one of the ways quality of AI novels is judged: https://arxiv.org/abs/2411.02316

That's actually something I implemented for a university project a few weeks ago. My professor also did some research into how this can be used for more advanced UIs. I'm sure it's a very common idea.

Do you have a link to the code? I'm curious how you implemented it. I'd also be really intrigued to see that research - does your professor have any published papers or something for those UIs?

Sounds great.

I'd like to see more contextually meaningful refactoring tools. Like "Remove this dependency" or "Externalize this code with a callback".

And refactoring shouldn't be done by generatively rewriting the code, but as a series of guaranteed equivalent transformations of the AST, each of which should be committed separately.

The AI should be used to analyse the value of the transformation and filter out asinine suggestions, not to write the code itself.
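To make that concrete, here's a toy sketch with Python's ast module (the "strip `if False:` branches" rewrite is just an example I made up; a real tool would want something like libcst to preserve formatting and comments):

    # toy sketch: one mechanical, behaviour-preserving AST transform
    import ast

    class StripIfFalse(ast.NodeTransformer):
        """Replace `if False: ...` blocks with their else-branch, or drop them."""
        def visit_If(self, node):
            self.generic_visit(node)
            if isinstance(node.test, ast.Constant) and node.test.value is False:
                return node.orelse or None   # splice in the else-body, or remove
            return node

    src = "def f(x):\n    if False:\n        print('dead')\n    else:\n        x += 1\n    return x\n"
    tree = StripIfFalse().visit(ast.parse(src))
    print(ast.unparse(ast.fix_missing_locations(tree)))   # the dead branch is gone

Each transform like this could then be committed separately, with the LLM only ranking which ones are worth applying.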

You know what happens when a measure becomes a target, though.

That’s actually fantastic as an idea

previously undefined variable and function names would be red as well

It would depend on how surprising the name is, right? The declaration of `int total` should be relatively unsurprising at the top of a function named `sum`, but probably more surprising in `reverseString`.

All editors do this already.

No, I mean when an LLM encounters a previously unseen name it doesn't expect it so it would be red, even though it's perfectly valid.

I'm imagining everything the LLM could produce (with a given top_k setting) would be shades of green to yellow; just outside of that, orange; and far outside, red.

LLMs generate new functions all the time, I'd guess these would be light green, maybe the first token in the name would be yellow and it would get brighter green as the name unfolds.

The logits are probably all small in the global scope, where it's not clear what will be defined next. I'm not imagining mapping logits directly to heat; the ordering of tokens seems much more appropriate.
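Something like this, as a rough sketch (top_k and the cutoffs are completely made up, just to show the rank-based bucketing):

    # rough sketch: map the rank of the actual token (not its raw logit) to a color
    import torch

    def heat_for_token(logits_at_pos, actual_id, top_k=40):
        ranks = torch.argsort(logits_at_pos, descending=True)
        rank = (ranks == actual_id).nonzero().item()   # 0 = the model's top choice
        if rank < top_k // 4:
            return 'green'    # well within what the model would have produced
        if rank < top_k:
            return 'yellow'   # plausible, but not a top choice
        if rank < 4 * top_k:
            return 'orange'   # just outside the sampling window
        return 'red'          # far outside: worth a human look

    # tiny demo with random logits; in practice they come from the model's forward pass
    fake_logits = torch.randn(50257)                   # GPT-2-sized vocab
    print(heat_for_token(fake_logits, actual_id=42))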

I don't think that's necessarily true. I've definitely seen LLMs hallucinating variables they never defined.

[flagged]

[flagged]

I read all your dead replies and it's a little wild that you think I'm an OpenAI employee (I'm not) trying to do damage control (is the article damaging to OpenAI?) by hijacking the comments (is my comment not a HUD-like idea?).

I don't really know where that's coming from, I'm just a dude who connected the idea in the article to an old idea that I haven't seen tried yet. The only thing I truly don't appreciate is you made one comment saying the text of my post had changed. It didn't.

The trouble with anonymous downvoting is that it fuels this kind of paranoia.

[flagged]

You're adding nothing of substance. If you have a point about the subject itself, make it and present the receipts; then the rest of us can decide whether we can follow your observation.

Without even knowing what the supposed conflict is about: all I see here are baseless accusations from your side that, quite frankly, make you look a little unhinged. Please discuss your issues based on the merit of ideas, not based on accusations and persons.

[flagged]

How so? I seriously don't follow what you are trying to convey

[flagged]

And the narrative they didn't like was what again?

[flagged]

Could you elaborate, please?

Nope, not surprising. Parent changed their text but they are just as wrong.

[flagged]

(Recent 3 comments to save anybody else checking whether @smolder is being picked on...)

> Have you considered not cheating your way to projecting your wrong opinions?

> Please downvote the fake commenters and keep YC reasonably pure.

> Please don't accept this comment as valid since it's part of a campaign to set peoples overton window, paid for by dbags and executed by dbags.

You're getting downvoted - and flagged - because you're repeatedly breaking the HN guidelines. You may want to consider whether that's a path you want to continue.

Not sure you understand what has been going on but thanks for your concern.

No one understands what "has been going on" because you won't explain yourself.

What I understand is that this has already gotten two of your comments killed, and it will eventually get your account banned if you keep it up. I'm only bothering with this at all because I see from your profile that you make reasonable comments quite regularly.

Consider that I'm being reasonable.

It’s frequently possible to disagree while still adding to the discussion.

The comments of yours that I see downvoted are falling well short of that mark, particularly the ones where you accuse a decade-plus-old account, whose recent comment history is quite skeptical of (or even against) LLMs for coding, of being a shill for an AI company.

Consider that your words are not being experienced as reasonable by readers.

Four dead comments so far are telling us HN users think otherwise. Including me.

It's a dumb hill to die on.

I do wish I could see who downvotes. If I ever criticise Google or Amazon, I get downvoted immediately without comment.