How Well Does AI Find Code Vulnerabilities?
Not very well now, but there's potential.
Over the last few months, many voices in the application security space have weighed in on what AI means for the future of AppSec. Some say AI will finally let security scale, while others warn we’re facing a tsunami of vulnerable code because AI writes so much of it insecurely. In reality, both statements are true. The question is just one of timing.
Recently, Anthropic announced that they had found over 500 high-severity 0-day vulnerabilities in open source code. The claim came alongside the release of Claude Opus 4.6, and it’s truly impressive. I have done some light testing over the last year to see how well models could help during code reviews, but nothing definitive.
So I wanted to answer a simple question:
If we remove the hype, how good are frontier LLMs at finding real vulnerabilities compared to a traditional static analysis tool?
Using the OWASP Benchmark Project’s benchmarks for Java and Python, I tested six models to see how well they find vulnerabilities without the use of additional tools.
TL;DR: I ran six AI models against the OWASP Benchmark and compared them to Semgrep. On Java, traditional static analysis crushed the models. On Python, the models held their own. AI is not replacing SAST today, but it may significantly improve it when used for triage and prioritization.
The Setup
My intent was to test both flagship models, such as Opus and ChatGPT, and lightweight models such as Grok Code Light. My Pro subscription to GitHub Copilot provides access to all of the models, so Copilot will be my main test harness.
Why not ChatGPT 5.3 Codex? I couldn’t get it to run correctly. It would get stuck in loops and then just die. It’s still in Preview mode for Copilot, so I’ll let it slide.
To make sure we can compare results, I will use the following prompt for all runs across the test.
Do this in one prompt. Do not create a plan and perform multiple runs.
You are a Senior Application Security Engineer performing a static security code review of this code base.
Scope and rules (must follow exactly):
1) Your job is to identify security vulnerabilities ONLY if they map to a CWE that is in the CWE Top 25 list provided below.
2) Do NOT report any vulnerability that is not in that CWE Top 25 list. If an issue is real but not in scope, ignore it completely.
3) If you are unsure whether something maps to an in-scope CWE, do not report it.
4) Prefer precision over recall. Avoid false positives.
5) Analyze the code as provided. Do not assume missing context, secrets, configs, runtime behavior, or infrastructure unless it is explicitly shown in code.
6) Use best-effort file paths and line numbers based on the input. If line numbers are not provided, estimate them and set lineNumber="unknown".
Input:
All source files in the current project.
Output (STRICT):
Create an XML file called findings.xml. Add each finding to the file.
The XML must conform to this structure:
<findings benchmark="cwe-top-25" reviewerRole="SeniorAppSecEngineer">
<finding>
<category>...</category>
<cwe id="CWE-###">...</cwe>
<filePath>...</filePath>
<lineNumber>...</lineNumber>
<description>...</description>
<confidence>0-100 integer</confidence>
</finding>
...
</findings>
Field requirements:
- category: short label (e.g., "SQL Injection", "OS Command Injection", "Path Traversal", "Deserialization", "XSS")
- cwe id: must be exactly one CWE from the in-scope list below
- filePath: exact path from the FILE header
- lineNumber: the first line of the vulnerable statement or "unknown"
- description: explain why it is vulnerable, how it could be exploited, and the minimal fix direction (no full patch needed)
- confidence: integer 0-100 based on evidence in code (100 = undeniable, 50 = plausible, <50 = weak). Do not output findings with confidence <60.
De-duplication:
- If the same root cause appears multiple times in the same file, report each distinct sink location separately.
- If multiple CWEs could apply, choose the single best match from the in-scope list. Do not emit multiple findings for the same line.
In-scope CWE Top 25 (ONLY these are allowed):
CWE-787 – Out-of-bounds Write
CWE-79 – Improper Neutralization of Input During Web Page Generation (Cross-site Scripting)
CWE-89 – Improper Neutralization of Special Elements used in an SQL Command (SQL Injection)
CWE-416 – Use After Free
CWE-78 – Improper Neutralization of Special Elements used in an OS Command (OS Command Injection)
CWE-20 – Improper Input Validation
CWE-125 – Out-of-bounds Read
CWE-22 – Improper Limitation of a Pathname to a Restricted Directory (Path Traversal)
CWE-352 – Cross-Site Request Forgery (CSRF)
CWE-434 – Unrestricted Upload of File with Dangerous Type
CWE-862 – Missing Authorization
CWE-476 – NULL Pointer Dereference
CWE-287 – Improper Authentication
CWE-190 – Integer Overflow or Wraparound
CWE-502 – Deserialization of Untrusted Data
CWE-77 – Improper Neutralization of Special Elements used in a Command (Command Injection)
CWE-119 – Improper Restriction of Operations within the Bounds of a Memory Buffer
CWE-798 – Use of Hard-coded Credentials
CWE-918 – Server-Side Request Forgery (SSRF)
CWE-306 – Missing Authentication for Critical Function
CWE-362 – Concurrent Execution using Shared Resource with Improper Synchronization (Race Condition)
CWE-269 – Improper Privilege Management
CWE-94 – Improper Control of Generation of Code (‘Code Injection’)
CWE-863 – Incorrect Authorization
CWE-276 – Incorrect Default Permissions
Now analyze all of the files in this project.
I chose to limit the testing to the CWE Top 25 primarily to preserve some of my precious tokens, but it also seemed like a better use case than just the OWASP Top 10. I also tried to give the models some guidance on reducing false positives by having them assign a confidence score and only report findings they were relatively confident in.
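Each run produced a findings.xml in the format above, which then had to be matched against the benchmark’s ground truth. Here’s a rough sketch of that matching step. The OWASP Benchmark ships an expected-results CSV, but the file name and exact column layout vary by version, so treat the paths and parsing here as illustrative rather than the exact script I used:

```python
import csv
import xml.etree.ElementTree as ET

FINDINGS = "findings.xml"          # produced by the prompt above
EXPECTED = "expectedresults.csv"   # illustrative name; the benchmark's CSV varies by version

def test_name(path: str) -> str:
    # "src/.../BenchmarkTest00001.java" -> "BenchmarkTest00001"
    return path.replace("\\", "/").split("/")[-1].rsplit(".", 1)[0]

# Map each test case to the set of CWE ids the model flagged in it.
reported: dict[str, set[str]] = {}
for finding in ET.parse(FINDINGS).getroot().iter("finding"):
    name = test_name(finding.findtext("filePath", default=""))
    cwe = finding.find("cwe").get("id", "").removeprefix("CWE-")
    reported.setdefault(name, set()).add(cwe)

tp = fp = fn = tn = 0
with open(EXPECTED, newline="") as fh:
    for row in csv.reader(fh):
        if not row or row[0].startswith("#"):
            continue  # skip comment/header lines
        # Assumed columns: test name, category, real vulnerability (true/false), CWE id
        name, _category, is_real, cwe = (col.strip() for col in row[:4])
        flagged = cwe in reported.get(test_name(name), set())
        if is_real.lower() == "true":
            tp, fn = tp + flagged, fn + (not flagged)
        else:
            fp, tn = fp + flagged, tn + (not flagged)

print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
```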
The Results
It took about 3 hours to run the models against both the Java and Python benchmarks from OWASP. For this test I wasn’t very focused on the time it took a model to do the analysis. Perhaps that’s a test for another time. I’ll also mention I only ran the test once per model. LLMs are inherently non-deterministic, so there’s a chance more runs would be needed to establish averages. I burned about 20% of my premium tokens for the month on this test, so I’ll do that testing if someone wants to give me some tokens!
We’re interested in measuring four outcomes for each model: True Positives, False Positives, False Negatives, and True Negatives. A True Positive occurs when the model correctly identifies a real vulnerability, and a True Negative occurs when it correctly reports no vulnerability in a non-vulnerable test case. A False Positive occurs when the model reports a vulnerability that isn’t real, and a False Negative occurs when it misses a vulnerability it should have caught. From those counts we’ll also measure precision (of everything the tool flagged, how much was actually a real vulnerability) and recall (of all the real vulnerabilities, how many did the tool find).
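Precision and recall fall out of those four counts directly. A quick illustration with made-up numbers:

```python
def precision(tp: int, fp: int) -> float:
    # Of everything the tool flagged, how much was a real vulnerability?
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # Of all the real vulnerabilities, how many did the tool find?
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up example: 60 true positives, 15 false positives, 80 missed vulnerabilities.
print(precision(60, 15))  # 0.80 -- most of what it flagged was real
print(recall(60, 80))     # ~0.43 -- but it missed more than half of what was there
```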
Java Results
So what does all of that data mean? First, the false negative rate for every model was really high. Considering we want to find as many true vulnerabilities as possible, a high false negative rate is terrible. Gemini Pro 3 is the overall winner on recall (it found the most true vulnerabilities), but it also had the highest false positive rate of any model. Opus 4.6 is a close second on true positives. It’s also clear the Anthropic models are tuned to report only high-confidence findings: both produced zero false positives.
Based on the way GPT 5.3 Codex was running in Copilot, I’m a little suspicious of the GPT 5.2 Codex results. I expected it to give results closer to Opus and Gemini.
Python Results
In general, the models did a better job on the Python benchmark. Once again, Gemini Pro 3 had the highest true positive count, but also the most false positives. Opus and Sonnet were similar on true positives, with Sonnet producing more false positives. Again, I’m not convinced GPT 5.2 Codex ran correctly, so I will likely retest.
How does this compare to SAST?
SAST tools have been around for decades and, love them or hate them, they’re designed for one specific task: finding vulnerabilities. I don’t currently have a license for any of the commercial tools, so I used the open source version of Semgrep as the comparison.
Semgrep significantly outperforms the models on the Java benchmark, while the models give Semgrep a run for its money on the Python benchmark. Keep in mind, this is the open source version of Semgrep; I’m sure the paid version would do a much better job. I know some folks at Semgrep and don’t want to catch a beating in the parking lot at the next DEF CON.
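For reference, the Semgrep numbers came from the open source CLI with its public rules, scored the same way as the model findings. Something along these lines, with the config and paths shown here being illustrative rather than the exact command I ran:

```python
import json
import subprocess

# Illustrative: scan a local checkout of the benchmark with Semgrep's community
# rules. '--config auto' pulls rules from the public registry; results will vary
# with the ruleset you choose.
proc = subprocess.run(
    ["semgrep", "scan", "--config", "auto", "--json", "BenchmarkJava/src"],
    capture_output=True, text=True, check=False,
)

report = json.loads(proc.stdout)
for r in report.get("results", []):
    # Each result carries the rule id, file path, and start line; many community
    # rules also tag a CWE in their metadata, which is the natural key for
    # scoring against the benchmark.
    cwes = r.get("extra", {}).get("metadata", {}).get("cwe", [])
    print(r["check_id"], r["path"], r["start"]["line"], cwes)
```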
Conclusion
So, is AI about to take all of our jobs? Not yet, at least. There are a few reasons why LLMs are not currently great at finding vulnerabilities on their own:
They lean on pattern recognition, not dataflow analysis - Most of the models flagged issues by matching signature-like vulnerability patterns. That’s not a recipe for accurate findings on dataflow problems like SQL Injection or Cross-Site Scripting, and none of the models demonstrated systematic taint tracking across data flows. See the sketch after this list for the kind of flow that trips them up.
I presume the worse results on Java compared to Python have to do with how dense most Java applications are. This is synthetic code and the models still struggled. It makes me wonder how they would do against JavaScript. SAST tools are pretty terrible at scanning JS because it’s a dynamically typed language and most codebases resemble balls of duct tape.
The frontier models have rapidly expanded their context windows, but context limits are still going to impact their ability to accurately find security issues. It would be interesting to run this against a larger C/C++ application to see how context size factors into accuracy.
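To make the taint-tracking point concrete, here’s a contrived Python example of the kind of flow that signature matching tends to miss: the user-controlled value gets renamed and passes through a helper before it reaches the SQL sink, so nothing on the sink line looks obviously tainted.

```python
import sqlite3

def build_filter(column: str, value: str) -> str:
    # In isolation this is just string formatting; nothing here screams "vulnerable".
    return f"{column} = '{value}'"

def find_user(conn: sqlite3.Connection, username: str) -> list:
    clause = build_filter("name", username)
    query = "SELECT id, name FROM users WHERE " + clause
    # Sink: by the time execute() runs, the untrusted value is two hops away,
    # so a line-level signature sees only innocent-looking locals. CWE-89.
    return conn.execute(query).fetchall()

def handle_request(form: dict, conn: sqlite3.Connection) -> list:
    # Source: attacker-controlled input enters here (stand-in for a web handler).
    return find_user(conn, form.get("username", ""))

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    # Classic payload that turns the WHERE clause into a tautology.
    print(handle_request({"username": "x' OR '1'='1"}, conn))
```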
The models will keep getting better. That part is inevitable. But right now they are not static analysis engines. They are much better at reviewing and reasoning about findings than they are at systematically discovering them across an entire codebase. Instead of asking whether AI can replace SAST, we should be asking how it can make SAST more efficient. I suspect using LLMs as a triage layer on top of a tool like Semgrep will absolutely save time. The real question is how accurately they can do that job and which model performs best when filtering, prioritizing, and eliminating false positives. That is the experiment worth running next.
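If I were sketching that next experiment, it would look roughly like this: take Semgrep’s JSON findings, hand each one to a model along with the surrounding code, and ask for a keep/discard verdict plus a confidence score. The ask_model function below is a placeholder for whatever LLM access you have, not a real API:

```python
import json
from pathlib import Path

def snippet(path: str, line: int, context: int = 10) -> str:
    # Pull a few lines around the flagged line so the model sees the local code.
    lines = Path(path).read_text(errors="replace").splitlines()
    lo, hi = max(0, line - 1 - context), line + context
    return "\n".join(lines[lo:hi])

def triage_prompt(finding: dict) -> str:
    return (
        "You are triaging a static analysis finding. Reply with KEEP or DISCARD, "
        "a confidence from 0-100, and one sentence of reasoning.\n\n"
        f"Rule: {finding['check_id']}\n"
        f"File: {finding['path']} line {finding['start']['line']}\n\n"
        f"Code:\n{snippet(finding['path'], finding['start']['line'])}"
    )

def triage(semgrep_json: str, ask_model) -> list:
    # 'ask_model' is a placeholder: any callable that takes a prompt and returns text.
    findings = json.loads(Path(semgrep_json).read_text()).get("results", [])
    return [(f, ask_model(triage_prompt(f))) for f in findings]
```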

