When we were designing our CRS for the DARPA AI Cyber Challenge, we quickly realized that scaling Jazzer alone wouldn’t be enough for Java vulnerability discovery. The hard vulnerabilities required structured, semantically meaningful inputs that random mutation couldn’t produce. So we built Gondar, a system that combines LLM agents with coverage-guided fuzzing, and it helped us win.
After AIxCC, we wanted to put this to the test: how well does the approach hold up under rigorous, controlled evaluation? The resulting paper will be published at IEEE S&P ‘26. This post is about our journey and what we found along the way.