Very interesting and cool project.
Creating an accurate call graph is difficult, especially for dynamic languages such as JavaScript or TypeScript. Academia has spent decades on this problem, so I am wondering why your custom parser can do so much better. I am also interested in how you store dynamic typing information in Protobuf's strongly typed schema.
Given the limited context window, it is clearly impractical to provide the entire application's source code to the model. What kind of "context" information is generally helpful for bug detection? The call chain, for example?
Thanks! We use an approach similar to GitHub's stack graphs (https://github.blog/open-source/introducing-stack-graphs/) to build a graph structure with definition/reference nodes. For dynamic typing in protobuf, we use the language's compiler as an intermediary to resolve dynamic types into static relationships, then encode those relationships in protobuf.
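As a rough illustration of why resolved relationships fit a static schema, here is a minimal Python sketch. It is not the actual implementation or schema: `Node`, `NodeKind`, and `SymbolGraph` are hypothetical names, and plain dataclasses stand in for what would be protobuf messages. The assumption is that once a compiler has resolved a reference, all that needs storing is node ids and edges, none of which are dynamically typed.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    DEFINITION = 1
    REFERENCE = 2


@dataclass(frozen=True)
class Node:
    """Hypothetical graph node; every field has a fixed, static type."""
    id: int
    kind: NodeKind
    symbol: str
    file: str
    line: int


@dataclass
class SymbolGraph:
    nodes: dict[int, Node] = field(default_factory=dict)
    # Maps a reference node id to the definition node id it resolved to.
    resolved: dict[int, int] = field(default_factory=dict)

    def add(self, node: Node) -> None:
        self.nodes[node.id] = node

    def resolve(self, ref_id: int, def_id: int) -> None:
        """Record that a reference resolves to a definition."""
        ref, target = self.nodes[ref_id], self.nodes[def_id]
        if ref.kind is not NodeKind.REFERENCE or target.kind is not NodeKind.DEFINITION:
            raise ValueError("edges must go from a reference to a definition")
        self.resolved[ref_id] = def_id

    def definition_of(self, ref_id: int) -> Node | None:
        """Return the resolved definition, or None if the id was never resolved."""
        def_id = self.resolved.get(ref_id)
        return self.nodes[def_id] if def_id is not None else None


# Made-up example: a call to `sanitize` in routes.ts resolves to util.ts.
graph = SymbolGraph()
graph.add(Node(1, NodeKind.DEFINITION, "sanitize", "util.ts", 3))
graph.add(Node(2, NodeKind.REFERENCE, "sanitize", "routes.ts", 17))
graph.resolve(2, 1)
```

In this sketch the hard part (deciding that node 2 points at node 1) is assumed to have been done by the compiler before anything is stored.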
Right, we don't feed entire codebases to the LLM. Instead, the LLM queries our indexer for symbol names and code sections (exposed functions, data flow boundaries, sanitization functions) to build up the call chain and reason about the vulnerability.
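To make the query pattern concrete, here is a small hypothetical sketch in Python. The `Indexer` class, its method names, and the sample symbols are all assumptions made for illustration, not the real indexer API. It shows how a model could ask for exposed functions, look up one symbol at a time, and enumerate call chains from an entry point to a sink instead of reading the whole codebase.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """Hypothetical index entry for one function."""
    name: str
    source: str              # code section that would be handed to the model
    calls: tuple[str, ...]   # names of callees, assumed already resolved
    exposed: bool = False    # reachable from outside, e.g. a route handler
    sanitizer: bool = False  # marks a sanitization function


class Indexer:
    def __init__(self, symbols: list[Symbol]) -> None:
        self._by_name = {s.name: s for s in symbols}

    def lookup(self, name: str) -> Symbol | None:
        return self._by_name.get(name)

    def exposed_functions(self) -> list[Symbol]:
        return [s for s in self._by_name.values() if s.exposed]

    def call_chains(self, start: str, sink: str) -> list[list[str]]:
        """Return every acyclic call chain from `start` to `sink`."""
        chains: list[list[str]] = []

        def walk(name: str, path: list[str]) -> None:
            symbol = self.lookup(name)
            if symbol is None or name in path:  # unknown symbol or cycle
                return
            path = path + [name]
            if name == sink:
                chains.append(path)
                return
            for callee in symbol.calls:
                walk(callee, path)

        walk(start, [])
        return chains


# Made-up symbols: an upload handler that sanitizes the name it parses,
# but also reaches writeFile directly.
index = Indexer([
    Symbol("handleUpload", "function handleUpload(req) { ... }",
           ("parseName", "writeFile"), exposed=True),
    Symbol("parseName", "function parseName(raw) { ... }", ("sanitizePath",)),
    Symbol("sanitizePath", "function sanitizePath(p) { ... }", (), sanitizer=True),
    Symbol("writeFile", "function writeFile(path, data) { ... }", ()),
])
chains = index.call_chains("handleUpload", "writeFile")
```

Each chain is a list of symbol names, so the model only needs to fetch the `source` of the functions on a chain it wants to inspect.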