CodeQL seems to raise too many false-positives in my experience. And it seems there is no easy way to run it locally, so it's a vendor lock-in situation.
Heyo, I'm the Product Director for detection & remediation engines, including CodeQL.
I would love to hear what kind of local experience you're looking for and where CodeQL isn't working well today.
As a general overview:
The CodeQL CLI is developed as an open-source project and can run CodeQL basically anywhere. The engine is free to use for all open-source projects, and free for all security researchers.
The CLI is available as release downloads, in homebrew, and as part of many deployment frameworks: https://github.com/advanced-security/awesome-codeql?tab=read...
Results are stored in standard formats and can be viewed and processed by any SARIF-compatible tool. We provide tools to run CodeQL against thousands of open-source repos for security research.
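For example, since the output is plain SARIF JSON, even a tiny script can post-process it. Here's a minimal Python sketch (the file name and the fields it pulls out are just illustrative) that summarizes the alerts from a `codeql database analyze ... --format=sarif-latest --output=results.sarif` run:

    import json

    # Minimal sketch: summarize alerts from a CodeQL SARIF file.
    # "results.sarif" is just an example path.
    with open("results.sarif") as f:
        sarif = json.load(f)

    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            rule = result.get("ruleId", "<no rule id>")
            message = result.get("message", {}).get("text", "")
            locations = result.get("locations", [])
            if locations:
                phys = locations[0]["physicalLocation"]
                path = phys["artifactLocation"]["uri"]
                line = phys.get("region", {}).get("startLine", "?")
                print(f"{rule}  {path}:{line}  {message}")
            else:
                print(f"{rule}  {message}")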
The repo linked above points to dozens of other useful projects (both from GitHub and the community around CodeQL).
The vagaries of the dual licensing discourage a lot of teams working on commercial projects from kicking the tires on CodeQL, and they generally hinder adoption for private projects as well. Are there any plans to change the licensing in the future?
Nice, I for one didn't know about this. Thanks a bunch for chiming in!
CodeQL seems to raise too many false-positives in my experience.
I’d be interested in what kinds of false positives you’ve seen it produce. The CodeQL functionality I’ve found useful accompanies each reported vulnerability with a specific code path demonstrating how the vulnerability arises. While we might still decide there is no risk in practice for other reasons, I don’t recall ever seeing it make a claim like that which was technically incorrect. Maybe some of the other kinds of checks it performs are more susceptible to false positives and I just happen not to have run into those much in the projects I’ve worked on.
The previous company I was working at (I left about 6 months ago) had a bunch of microservices, most written in Python using FastAPI and Pydantic. At one point the security team turned on CodeQL for a bunch of them, and we just got a pile of false positives about not validating a UUID URL path parameter in a request handler. In fact the parameter was typed in the handler function signature, and FastAPI does validate that type. But in this strange case, CodeQL knew these were external inputs, yet didn't know that FastAPI would validate the path parameter's type, so it suggested adding redundant type-check and bail-out code in hundreds of places.
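For context, the pattern looked roughly like this (a simplified sketch, not the actual service code). With a handler written this way, FastAPI/Pydantic rejects a non-UUID value with a 422 before the function body ever runs, which is exactly the validation CodeQL didn't seem to know about:

    # Simplified sketch of the handler pattern (not the real code).
    from uuid import UUID
    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/items/{item_id}")
    async def get_item(item_id: UUID):
        # FastAPI has already parsed and validated item_id as a UUID here;
        # a request like GET /items/not-a-uuid is rejected with a 422
        # before this body runs.
        return {"item_id": str(item_id)}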
The patterns we had established were as simple, basic, and "safe" as practical, and we advised on and code-reviewed the mechanics of the services/apps for the other teams: using database connections/pools correctly, using async correctly, validating input correctly, etc. (while the other teams focused more on features and business logic). Low-level performance wasn't really a concern, mostly just high-level DB queries or sub-requests that were too expensive or numerous. The point is, there really wasn't much of anything for CodeQL to find; the basic blunders were mostly already prevented. So it was pretty much all false positives.
Of course, the experience would have been far different if we were more careless or working with trickier components/patterns. Compare it to the base-rate fallacy from medicine: if a test is 99% accurate but run across a population with almost nothing for it to find, the "1%" false-positive cases will dominate the results.
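To put rough numbers on that (purely illustrative figures, not data from our actual scans):

    # Illustrative base-rate arithmetic; all numbers are made up.
    locations_checked = 10_000   # places the scanner evaluates
    true_vulns = 5               # real vulnerabilities actually present
    sensitivity = 0.99           # chance a real vuln gets flagged
    false_positive_rate = 0.01   # chance a clean location gets flagged

    true_alerts = true_vulns * sensitivity                                  # ~5
    false_alerts = (locations_checked - true_vulns) * false_positive_rate   # ~100

    precision = true_alerts / (true_alerts + false_alerts)
    print(f"Share of alerts that are real: {precision:.0%}")  # roughly 5%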
I also want to mention a tendency for some security teams to decide that their role is to set these things up, turn them on, cover their eyes, and point the hose at the devs. Using these tools makes sense, but those teams don't consider it practical to look at the output and judge its quality with their own brains first. And it's all about the numbers: 80 criticals, 2000 highs! (Except they're all the same CVE, and they're all invalid for the same reason.)
Interesting, thanks. In the UUID example you mentioned, it seems the CodeQL model is missing some information about how FastAPI’s runtime validation works and so not drawing correct inferences about the types. It doesn’t seem to have a general problem with tracking request parameters coming into Python web frameworks — in fact, the first thing that really impressed me about CodeQL was how accurate its reports were with some quite old Django code — but there is a lot more emphasis on type annotations and validating input against those types at runtime in FastAPI.
I completely agree about the problem of someone deciding to turn these kinds of scanning tools on and then expecting they’ll Just Work. I do think the better tools can provide a lot of value, but they still involve trade-offs and no tool will get everything 100% right, so there will always be a need to review their output and make intelligent decisions about how to use it. Scanning tools that don’t provide a way to persistently mark a certain result as incorrect or to collect multiple instances of the same issue together tend to be particularly painful to work with.