Yeah! I think an AST is sort of what I'm envisioning here, but with much broader metadata, including requirements and implicit assumptions and stuff.
As a concrete example, a random bit of code from the minih264 encoder:
/**
* Quantized/dequantized representation for 4x4 block
*/
typedef struct
{
    int16_t qv[16]; // quantized coefficient
    int16_t dq[16]; // dequantized
} quant_t;
Someone who's built an encoder or studied h264 probably knows what this is for (I have a very fuzzy idea). But even with the comment, there are lots of questions. Are these arrays restricted to certain values? Can they span the full int16 range, or are there limits, or are the bits packed in an interesting way? Can they be negative? Why would you want to store these two arrays together in a struct, why not separately? Do they get populated at the same time, at different phases of the pipeline, or are they built up over multiple passes? Are all of these questions ridiculous because I don't really understand enough about how h264 works (probably)?

LLMs already have a lot of this knowledge, and could probably answer if prompted, but my point is more that the code doesn't explicitly lay out any of this unless you carefully trace the execution, and even then, some of the requirements might not be evident. Maybe negative numbers aren't valid here (I don't actually know), but the reason that invariant gets upheld is an abs() call 6 levels up the call stack, or the data read from the file is always positive so we just don't have to worry about it. I dunno.
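To make this concrete: here's a sketch of what "laying out the implicit context" might look like, with the assumed invariants written as asserts right next to the data they protect. This is not how minih264 actually works; quantize_block and the flat step size are hypothetical, and the invariants themselves (step > 0, values fitting in int16_t, both arrays filled in one pass) are guesses at the kind of thing that might be true.

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as minih264's quant_t; the field semantics are my guess. */
typedef struct
{
    int16_t qv[16]; // quantized coefficients
    int16_t dq[16]; // dequantized (reconstructed) coefficients
} quant_t;

/* Hypothetical: quantize a 4x4 residual block with a flat step size,
 * filling both arrays in a single pass. The asserts state the assumed
 * invariants explicitly, instead of leaving them implied by an abs()
 * or a bounds check six levels up the call stack. */
static void quantize_block(const int16_t residual[16], int step, quant_t *out)
{
    assert(step > 0); // a zero or negative step would be meaningless here
    for (int i = 0; i < 16; i++) {
        int q = residual[i] / step;       /* signed: coefficients CAN be
                                             negative in this sketch */
        assert(q >= INT16_MIN && q <= INT16_MAX); // must fit back in int16_t
        out->qv[i] = (int16_t)q;
        out->dq[i] = (int16_t)(q * step); // dequantized in the same pass
    }
}
```

Storing qv and dq together at least hints that they're produced together; the asserts go one step further and pin down the value ranges a reader would otherwise have to infer.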
Anyway I imagine LLMs could be even more useful if they knew more about all this implicit context somehow, and I think this is the kind of stuff that just piles up as a codebase gets larger.