In the era of LLMs, syntax might matter more than you think.

The C form `type name;` is ambiguous because it could actually be more than one thing depending on context. Even worse if you include macro shenanigans. The alternative (~Rust/Zig) form, `var/const/mut name type`, is unambiguous.

For humans, who carry a rather long memory of what is going on in the codebase, this is ~"not a problem", at least for experts. But for an LLM, whose knowledge is limited to the content currently in your context plus the conventions baked into the training corpus, it matters. Of course it is ALSO a problem for humans if they are looking at a codebase for the first time, and if the types are unusual.

I hope that someday LLMs will interact with code mostly via language servers, rather than reading the code itself (which both frequently confuses the LLM, as you've noted, and is simply a waste of tokens).

Why? I suspect that writing the code itself is extremely token-efficient (unless, like, your keywords happen to be silly, super-long alien text).

Like which do you think is more token-efficient?

1)

    <tool-call write_code "my_function(my_variable)"/>

2)

    <tool-call available_functions/>

    resp:
        <option> my_function </option>
        <option> your_function </option>
        <option> some_other_function </option>
        <option> kernel_function1 </option>
        <option> kernel_function2 </option>
        <option> imported_function1 </option>
        <option> imported_function2 </option>
        <option> ... </option>

    <tool-call write_function_call "my_function"/>

    resp:
        <option> my_variable </option>
        <option> other_variable_of_same_type </option>

    <tool-call write_variable "my_variable"/>

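A rough way to eyeball it (a sketch; tiktoken's cl100k_base is just one tokenizer, and the strings are the toy calls from above, with the option list truncated):

    # Compare token counts of the two interaction styles sketched above.
    # tiktoken's cl100k_base is one arbitrary encoding; exact counts vary.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    direct = '<tool-call write_code "my_function(my_variable)"/>'
    guided = "\n".join([
        "<tool-call available_functions/>",
        "<option> my_function </option>",
        "<option> your_function </option>",
        "<option> some_other_function </option>",
        # (remaining options truncated for brevity)
        '<tool-call write_function_call "my_function"/>',
        "<option> my_variable </option>",
        "<option> other_variable_of_same_type </option>",
        '<tool-call write_variable "my_variable"/>',
    ])

    print("direct:", len(enc.encode(direct)), "tokens")
    print("guided:", len(enc.encode(guided)), "tokens")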
Not sure I follow. You seem to have omitted the part of 1) explaining how the LLM knew that my_function even existed - presumably, it read the entire file to discover that, which is way more input tokens than your hypothetical available_functions response.

Reading files is not that input-token heavy, I suspect. But anyway, I omitted it because presumably it would have done so to gain local context in general.

LSP is meant for IDEs and very deterministic calls. Its APIs look like this: give me the definition at <file> <row> <column> <length>. This makes sense for IDEs because all of those can be deterministically captured from your cursor position.
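For concreteness, a "go to definition" request is addressed purely by position, roughly like this on the wire (a sketch; the URI and coordinates are made up):

    # Rough shape of an LSP textDocument/definition request (JSON-RPC).
    # Everything is addressed by a zero-based line/character position.
    import json

    definition_request = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "textDocument/definition",
        "params": {
            "textDocument": {"uri": "file:///project/src/main.c"},
            "position": {"line": 41, "character": 17},
        },
    }

    print(json.dumps(definition_request, indent=2))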

LLMs are notoriously bad at counting.

I think one could easily build an MCP tool wrapping LSP which smooths over those difficulties. What the LLM needs is just a structured way to say "perform this code change" and a structured way to ask things like "what's the definition of this function?" or "what functions are defined in this module?"

Not much different from what agents already do today inside of their harnesses, just without the part where they have to read entire files to find the definition of one thing.
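As a rough sketch of what such a wrapper could look like (the `LspClient` stub and `find_definition` helper are hypothetical, and the MCP plumbing is omitted; the only real LSP requests assumed are workspace/symbol and textDocument/definition):

    # Hypothetical sketch: the agent asks for a definition by *name*; the
    # wrapper resolves it to a position and talks LSP under the hood.
    # LspClient is a stub; a real one would drive a language server
    # (clangd, pyright, rust-analyzer, ...) over JSON-RPC.
    class LspClient:
        def workspace_symbols(self, query: str) -> list[dict]:
            # Real request: "workspace/symbol"
            raise NotImplementedError

        def definition(self, uri: str, line: int, character: int) -> list[dict]:
            # Real request: "textDocument/definition"
            raise NotImplementedError

    def find_definition(lsp: LspClient, symbol_name: str) -> list[dict]:
        """Answer 'what's the definition of this function?' without positions."""
        matches = [s for s in lsp.workspace_symbols(symbol_name)
                   if s["name"] == symbol_name]
        locations = []
        for sym in matches:
            loc = sym["location"]
            start = loc["range"]["start"]
            locations.extend(
                lsp.definition(loc["uri"], start["line"], start["character"]))
        return locations

The agent only ever deals in names and returned locations; the row/column counting stays inside the wrapper.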

So not using LSP directly, but rather something in the middle that uses LSP as an implementation detail.

So far, enabling LSP in Claude has only added messages like "this is old diagnostic before my edit".

I suspect that the context-dependence in C is more an issue with the implementation than the overall syntactic philosophy.

It's a thing in C.

    foo * bar;

Is that multiplication? Or a declaration of `bar` with type `foo*`?

Right; and my argument is that this isn't because the type expression `foo *` precedes the name `bar`; it's because the type "pointer to foo" is expressed in a way that could also be a prefix of a multiplication expression.

This is an underappreciated point. I work across a lot of codebases and the difference in how well AI coding tools handle Rust vs JavaScript vs Python is striking — and syntax ambiguity is a big part of it.

The `type name` vs `let name: type` distinction matters more than it seems. When the grammar is unambiguous, the LLM can parse intent from a partial file without needing the full compilation context that a human expert carries in their head. Rust and Go are notably easier for LLMs to work with than C or C++ partly because the syntax encodes more structure.

The flip side: syntax that is too terse becomes opaque to LLMs for the same reason it becomes opaque to humans. Point-free Haskell, APL-family languages, heavy operator overloading — these rely on the reader holding a lot of context that does not exist in the immediate token window.

I wonder if we will see new languages designed with LLM-parseability as an explicit goal, the way some languages were designed for easy compilation.

Fine-tuning is likely a bigger part of it.

I've worked on fine-tuning projects. There's a massive bias towards fine-tuning for Python at several model providers, for example, followed by JS.

Humans also have limited context. For LLMs it's mostly a question of pipeline engineering: packing the context and system prompt with the most relevant information, and allowing tool use to properly explore the rest of the codebase. If done well, I think they shouldn't have this particular issue. Current AI coding tools are mostly huge amounts of exactly this pipeline innovation.

I think we need an LLM equivalent of this corollary of Fitts's law: the fastest place to click is the current location of the cursor. For an LLM, the least context-expensive feedback is no feedback at all; the LLM should be able to intuit the correct code in place, at token-generation time.