More CPUs Won't Find More Bugs: Insights from Combining LLM Agents and Jazzer

When we were designing our cyber reasoning system (CRS) for the DARPA AI Cyber Challenge, we quickly realized that scaling Jazzer alone wouldn’t be enough for Java vulnerability discovery. The hard vulnerabilities required structured, semantically meaningful inputs that random mutation couldn’t produce. So we built Gondar, a system that combines LLM agents with coverage-guided fuzzing, and it helped us win.

After AIxCC, we wanted to put this to the test: how well does the approach hold up under rigorous, controlled evaluation? The resulting paper will be published at IEEE S&P ‘26. This post is about our journey and what we found along the way.

We Started by Throwing Compute at the Problem

The conventional assumption in fuzzing is straightforward: more compute, more bugs. So we tested it. We ran Jazzer, the state-of-the-art coverage-guided Java fuzzer, in two configurations: a standard baseline with 3 cores and 12 hours per target, and a large-scale run with 50 cores and 24 hours per target, over 7.2 CPU-years of total computation.

No existing Java vulnerability benchmark suited our needs, so we built one: 54 vulnerabilities across 22 open-source Java projects, spanning 12 CWE types. It draws from real CVEs (19), AIxCC challenges (20), and vulnerabilities we manually injected based on real-world patterns (15). Controlled access is being prepared to support reproducibility while mitigating model contamination.

Both configurations exploited exactly 8 vulnerabilities. Scaling compute by 17x and doubling the time budget found zero additional vulnerabilities. The large-scale run reached 3 more sinks (29 vs. 26), but couldn’t convert any of them into a working exploit.

Why? The failures fell into two categories: 46% of vulnerabilities were never even reached, and another 39% were reached but not exploited, a failure mode we call the “last-mile” problem. These vulnerabilities require inputs with specific structure and semantics that random mutation is unlikely to produce. A deserialization exploit needs a valid archive with correct internal structure. A path traversal needs paths that bypass sanitization logic. An XXE attack needs well-formed XML with specific entity definitions. A command injection might depend on satisfying a cryptographic check. Random mutation won’t solve these; they require understanding what the code expects and why.
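To make the last-mile gap concrete, here is a minimal, self-contained sketch of one such constraint. The code is our own illustration, not from any benchmark target: a hypothetical sanitizer strips "../" sequences once, a pattern seen in real path traversal bugs. Defeating it requires the precise "....//" nesting that survives one round of stripping, which random byte mutation is astronomically unlikely to produce.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class TraversalSketch {
    // Hypothetical sanitizer: strips "../" sequences a single time.
    static String sanitize(String input) {
        return input.replace("../", "");
    }

    // The condition an input must satisfy: does the resolved path
    // escape the intended base directory after sanitization?
    static boolean escapesBase(Path base, String userPath) {
        Path resolved = base.resolve(userPath).normalize();
        return !resolved.startsWith(base);
    }

    public static void main(String[] args) {
        Path base = Paths.get("/srv/files").normalize();
        // "....//" survives one round of stripping as "../".
        String payload = "....//....//etc/passwd";
        String sanitized = sanitize(payload);
        System.out.println(sanitized);                    // ../../etc/passwd
        System.out.println(escapesBase(base, sanitized)); // true
    }
}
```

An LLM reading the sanitizer's source can construct this bypass directly; a mutator has to stumble onto a very specific character sequence.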

Half the targets reached 95% of their final coverage within 15 minutes (Figure 1). The remaining hours and cores contributed almost nothing. This isn’t a resource gap. It’s a semantic gap.

Jazzer exploits only 8 of 54 vulnerabilities regardless of whether it runs on 3 cores for 12 hours or 50 cores for 24 hours. Coverage saturates in minutes: this is a semantic gap, not a resource gap.

Figure 1: Jazzer’s code coverage over time for three example projects. Each line is a fuzzing harness. Coverage flatlines early; the remaining hours contribute little new coverage.

So We Added Semantic Reasoning

Since more compute wasn’t the answer, we needed something that could reason about code structure. We set out to combine LLM-based agents with Jazzer, targeting sinks: security-sensitive API calls like Runtime.exec(), ProcessBuilder, or SQL query methods where vulnerabilities actually manifest. We’ve written about this sink-centered approach before in the context of Atlantis-Java.
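For readers unfamiliar with Jazzer: a harness is a static fuzzerTestOneInput method that Jazzer repeatedly invokes with generated inputs. The sketch below mirrors that entry-point shape, but the sink is replaced by a hypothetical stub check (our own code) so the example compiles without the Jazzer runtime; in a real target, the call would be Runtime.exec() or similar, and Jazzer's sanitizers would flag attacker-controlled shell metacharacters reaching it.

```java
import java.nio.charset.StandardCharsets;

public class SinkHarnessSketch {
    // Jazzer invokes a static fuzzerTestOneInput method with generated
    // bytes; this mirrors that shape without the Jazzer dependency.
    public static void fuzzerTestOneInput(byte[] input) {
        String value = new String(input, StandardCharsets.UTF_8);
        if (reachesInjectionSink(value)) {
            // Stand-in for Jazzer's sanitizer raising a finding.
            throw new IllegalStateException("command injection reached the sink");
        }
    }

    // Stub sink check: stands in for Runtime.exec()/ProcessBuilder
    // receiving attacker-controlled shell metacharacters.
    static boolean reachesInjectionSink(String cmd) {
        return cmd.contains(";") || cmd.contains("|");
    }

    public static void main(String[] args) {
        fuzzerTestOneInput("ls -l".getBytes(StandardCharsets.UTF_8)); // benign input: no finding
        System.out.println(reachesInjectionSink("ls; rm -rf /"));     // true
    }
}
```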

Key to the sink-centered design are three agents assisting in vulnerability detection (see Figure 2), one during static analysis and two during dynamic testing.

The first agent sits inside Gondar’s static sink detection module, which applies CodeQL to identify potential sinks. However, CodeQL’s built-in queries filter sinks based on predefined source definitions, which are too restrictive for our use case: attacker-controlled input comes from fuzzing harnesses, not the sources CodeQL expects. So we strip CodeQL’s filters and use only its sink definitions, extracting all call sites directly. This gives us thousands of candidates, most of which are false positives. We reduce these with our own filtering pipeline: traditional static analysis (constant-value checks, reachability checks) followed by an exploitability assessment agent, which filters out sinks it deems unexploitable based on concrete evidence in the source code. This brings the candidates down to a few hundred actionable sinks while retaining over 96% of the truly exploitable ones.
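As a rough illustration of the cheap static filters that run before the more expensive LLM assessment, consider the sketch below. The data model and field names are ours, not Gondar's actual schema: a sink call whose argument is a compile-time constant cannot be attacker-controlled, and a sink unreachable from any harness cannot be exercised, so both are dropped immediately.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SinkFilterSketch {
    // Minimal stand-in for a sink candidate extracted from CodeQL
    // output; illustrative schema, not Gondar's.
    record Candidate(String callSite, boolean argumentIsConstant, boolean reachableFromHarness) {}

    // Cheap static pre-filters applied before the LLM exploitability
    // assessment: drop constant-argument and unreachable call sites.
    static List<Candidate> staticFilter(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> !c.argumentIsConstant())
                .filter(Candidate::reachableFromHarness)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Candidate> all = List.of(
                new Candidate("Runtime.exec@Foo.java:42", false, true),
                new Candidate("Runtime.exec@Bar.java:10", true, true),        // constant command string
                new Candidate("Statement.execute@Baz.java:7", false, false)); // unreachable from harness
        System.out.println(staticFilter(all).size()); // 1
    }
}
```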

Second, a sink exploration agent analyzes call paths from the program entry point to each sink, reads the source code along the path, and generates inputs designed to satisfy the constraints needed to reach it. Third, a sink exploitation agent receives inputs that successfully reach sinks (we call these “beep seeds”) and iteratively develops proof-of-concept exploits by reasoning about the vulnerability-specific conditions needed to trigger Jazzer’s sanitizers.

Critically, these agents don’t run in isolation. They operate concurrently with Jazzer and exchange artifacts in both directions: exploration seeds flow into the fuzzer’s corpus for further mutation, while discovered beep seeds flow to the exploitation agent as concrete starting points. All exploitation outputs, even failed attempts, are fed back to the fuzzer, which may mutate a near-miss into a working exploit.
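One way to picture this bidirectional exchange is as a pair of channels between the agents and the fuzzer. The queue-based wiring below is our own sketch of the flows described above, not Gondar's actual implementation:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ArtifactExchangeSketch {
    // Exploration seeds and failed exploit attempts flow into the
    // fuzzer's corpus for further mutation.
    final BlockingQueue<byte[]> fuzzerCorpus = new LinkedBlockingQueue<>();
    // Sink-reaching inputs ("beep seeds") flow out of the fuzzer to
    // the exploitation agent as concrete starting points.
    final BlockingQueue<byte[]> beepSeeds = new LinkedBlockingQueue<>();

    void onExplorationSeed(byte[] seed) { fuzzerCorpus.add(seed); }  // agent -> fuzzer
    void onSinkReached(byte[] input) { beepSeeds.add(input); }       // fuzzer -> exploitation agent
    void onExploitAttempt(byte[] attempt, boolean succeeded) {
        if (!succeeded) fuzzerCorpus.add(attempt);                   // near-misses go back for mutation
    }

    public static void main(String[] args) {
        ArtifactExchangeSketch exchange = new ArtifactExchangeSketch();
        exchange.onExplorationSeed(new byte[]{1});
        exchange.onSinkReached(new byte[]{2});
        exchange.onExploitAttempt(new byte[]{3}, false);
        System.out.println(exchange.fuzzerCorpus.size()); // 2
        System.out.println(exchange.beepSeeds.size());    // 1
    }
}
```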

Figure 2: Gondar’s architecture. LLM agents and the fuzzer run concurrently, exchanging artifacts bidirectionally. Exploration seeds enrich the fuzzer’s corpus; discovered beep seeds ground the exploitation agent’s reasoning.

Putting Gondar to the Test

We ran Gondar on the same 54 vulnerabilities. Figure 3 shows vulnerabilities reached versus exploited across all 15 configurations: 7 Gondar model variants, 3 ablations, and 5 baselines. Upper-right is better.

Figure 3: Vulnerabilities reached vs. exploited for all configurations. Gondar configurations (blue) cluster in the upper-right; baselines (gray) sit in the lower-left. Ablations (orange) show the impact of removing individual components.

Gondar exploits 41 of 54 vulnerabilities with its best configuration (Gemini-2.5-Pro), compared to Jazzer’s 8. That’s over 5x as many on the same benchmark, at comparable or lower cost. Even the cheapest Gondar configuration (GPT-5-nano at $182 total) exploits 27 vulnerabilities, still over 3x the baseline.

The ablations confirm that each component matters: removing the exploration agent (XO) drops reached vulnerabilities from 42 to 29; removing the exploitation agent (RO) drops exploited from 37 to 18. Gondar also exploits 35 of the 46 vulnerabilities that Jazzer misses, by leveraging LLM reasoning to satisfy constraints that mutation alone cannot.

The paper digs deeper into each stage: how sink filtering balances precision and recall, how iterative refinement drives exploitation success, and how Gondar compares against static analysis tools like CodeQL and SpotBugs. But two things surprised us most.

Takeaway 1: LLMs and Fuzzers Are Complementary, Not Interchangeable

It’s tempting to think of LLMs as a replacement for fuzzers, or vice versa. Our results show the opposite: they have fundamentally different strengths, and the combination is greater than the sum of its parts.

LLMs excel at structure and intent. They can generate a well-formed ZIP archive containing a crafted payload, reason about path constraints from source code, or construct an XML document that satisfies a parser’s type system. Fuzzers excel at fast mutation, exploring millions of input variations per second. Neither capability substitutes for the other, and our data confirms it: 7 vulnerabilities in our benchmark are found only through the agent-fuzzer collaboration. Neither the agents alone nor the fuzzer alone discovers them.

Take a zip-slip vulnerability in Apache ZooKeeper: the exploitation agent understands the vulnerability pattern and generates archive inputs that are structurally close to a working exploit, but not quite right. Jazzer picks up these seeds and mutates them, eventually producing an input that triggers the sanitizer. The agent provides the semantic scaffolding; the fuzzer provides the final refinement.
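For intuition, the agent's contribution in such a case is producing a structurally valid archive at all. The sketch below is our own illustrative code, not the ZooKeeper exploit: it builds an in-memory ZIP whose entry name climbs out of the extraction directory, the kind of semantic scaffolding a mutator can then refine into a working input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipSlipSeedSketch {
    // Build a structurally valid ZIP whose single entry name escapes
    // the extraction directory; entry name and payload are illustrative.
    static byte[] buildArchive() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("../../outside.txt"));
            zos.write("payload".getBytes());
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Read back the first entry name, as a vulnerable extractor would.
    static String firstEntryName(byte[] archive) {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry = zis.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstEntryName(buildArchive())); // ../../outside.txt
    }
}
```

Random mutation of raw bytes almost never yields a parseable archive, because the ZIP headers and checksums break first; starting from a well-formed seed like this, mutation only has to vary the parts that matter.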

7 vulnerabilities are discovered only through agent-fuzzer collaboration. Neither component finds them independently; the combination outperforms the sum of its parts.

Takeaway 2: Open-Source Models Deliver Near-Flagship Performance

After seeing these results with flagship models (GPT-5, Gemini-2.5-Pro, and Claude Sonnet 4.5, which exploit 37-41 vulnerabilities at ~$2,400-$3,100 total), we asked the natural follow-up: how much does the model actually matter?

We swapped in GLM-5, an open-source model, and the result surprised us. GLM-5 exploits 35 vulnerabilities at a total cost of just $392, or $11.21 per bug, compared to $66-$74 for flagship models. It achieves 85% of flagship performance at roughly 13-16% of the cost.

Put differently: GLM-5 at $392 finds more vulnerabilities than large-scale fuzzing at $3,264, while costing less than one-eighth as much (Figure 4). The architecture amplifies the model. A well-designed system with a modest open-source model beats brute-force compute with no intelligence.

GLM-5 (open-source) exploits 35 vulnerabilities at $392 total, more than large-scale fuzzing at $3,264, and 85% of flagship performance at 13-16% of the cost.

Figure 4: Cost versus vulnerabilities exploited across all configurations. GLM-5 sits in the sweet spot: near-flagship effectiveness at a fraction of the cost. Large-scale fuzzing (Baseline-LS) costs the most while finding the fewest vulnerabilities. The GLM experiment was supported by FriendliAI.

Closing Thoughts

These results come from our controlled benchmark for a reproducible, scientific evaluation, but Gondar has found real bugs too. During AIxCC, it discovered 3 zero-day vulnerabilities in real-world projects (Hertzbeat, Healthcare-Data-Harmonization, and PDFBox), all responsibly disclosed. It’s now part of OSS-CRS, an OpenSSF Sandbox Project for continuous open-source security protection, where it has already found another zero-day path traversal in a widely used Java database.

When we started building Gondar for AIxCC, we knew fuzzing alone wasn’t enough. Now we have the numbers to back that up: adding sink-focused agents finds at least 3x as many bugs as raw fuzzing, even when the fuzzer gets many times the compute budget and the agents run on a cheap open-source model. Check the paper for full details.
