Strings are a universal interface with no dependencies. You can do anything in any language across any number of files. Any other abstraction heavily restricts what you can accomplish.
Also, LLMs aren't trained on ASTs, they're trained on strings -- just like programmers.
No, it’s not really “any string.” Most strings sent to an interpreter will result in a syntax error. Many Unix commands will report an error if you pass in an unknown flag.
In theory, there is a type that describes what will parse, but it’s implicit.
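That implicit type can be made concrete: an interpreter's parser is effectively a membership test for it. A minimal sketch in Python (using only the built-in `compile`; the helper name `parses` is my own):

```python
# Most arbitrary strings are not valid programs: Python's compile()
# rejects them with a SyntaxError before anything ever runs.
def parses(src: str) -> bool:
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

print(parses("x = 1 + 2"))  # True
print(parses("x = = 1"))    # False
```

The "type of strings that parse" is just the set for which this returns True; it exists, but nothing in the string interface advertises it up front.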
Exactly. LLMs are trained on huge amounts of bash scripts. They “know” how to use grep/awk/whatever. ASTs are, I assume, not really part of that training data, so how would they know how to work well with one? LLMs are trained on what humans do to code. Yes, I assume down the road someone will train more efficient versions that work more closely with the machine. But LLMs work as well as they do because they have a large body of “sed” statements in their statistical models.
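For illustration, the sort of one-liner that shows up countless times in that training data (the file `demo.py` and the names are made up; this prints the rewrite to stdout rather than editing in place, since the in-place flag `-i` behaves differently on GNU and BSD sed):

```shell
# Create a toy file, then rewrite every occurrence of old_name
# to new_name -- the bread-and-butter sed substitution idiom.
printf 'def old_name():\n    return old_name\n' > demo.py
sed 's/old_name/new_name/g' demo.py
```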
They also know how to use modern options like fd and rg, which allow more complex operations with a single call.
Tree-sitter is more or less a universal AST parser you can run queries against. Writing queries against an AST that you incrementally rebuild is massively more powerful and precise for generating the correct context than manually writing endless shell-pipeline one-liners and correctly handling all of their edge cases.
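As a sketch of what such a query looks like: Tree-sitter's query language is an S-expression pattern matched against the tree, with `@`-captures naming the nodes you want back. Node names like `function_definition` come from the Python grammar; other grammars use their own node names.

```
;; Match each Python function definition and capture its name node,
;; e.g. to build a symbol index without any regex guesswork.
(function_definition
  name: (identifier) @function.name)
```

The same pattern survives weird formatting, nested definitions, and strings that merely contain the word `def`, which is exactly where grep-style pipelines start accumulating edge cases.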
I agree with you, but the question is more whether existing LLMs have enough training on AST queries to be more effective with that approach. It’s not like LLMs were designed to be precise in the first place.
Generating code that doesn't run is just a waste of electricity.