Photo by Adi Goldstein on Unsplash

Wiz tested Claude Sonnet 4.5, GPT-5, and Gemini 2.5 Pro against human security researchers on 10 web hacking challenges modeled after real-world, high-value vulnerabilities. The AI agents solved 9 out of 10 for under $50 in LLM costs per challenge. Human pentesters charge $15,000 to $100,000 for equivalent assessments. Then Wiz gave an agent a real incident investigation with no predefined target, and it burned through 500 tool calls over an hour without finding the issue. A human researcher found it in 5 minutes.

That split tells you everything about where AI stands in offensive security right now: phenomenal at pattern matching against known vulnerability classes, useless at the kind of broad, intuitive reasoning that real-world hacking demands. The question is no longer whether AI can hack. It is when you should let it, and when you absolutely should not.

Related: AI Agents in Cybersecurity: Offense, Defense, and the Arms Race

What Wiz Actually Tested

Wiz’s AI Cyber Model Arena, launched in February 2026, is the most rigorous public benchmark for AI agents in offensive security to date. The setup: 257 real-world challenges spanning five domains (zero-day discovery, CVE detection, API security, web security, and cloud security), each run in isolated Docker containers with no internet access, no CVE databases, and no external resources. Scoring is deterministic with no LLM-as-judge subjectivity.
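
Wiz has not published its exact isolation setup, but the properties it describes (disposable per-challenge environments, no internet access) map onto standard Docker flags. A hypothetical sketch of how such a sandbox might be launched; the image name and flag choices are assumptions, not Wiz's actual configuration:

```python
import subprocess

def sandbox_cmd(image: str, challenge_cmd: list[str]) -> list[str]:
    """Build a docker run command with the isolation properties Wiz describes."""
    return [
        "docker", "run",
        "--rm",               # throw the container away after the run
        "--network", "none",  # no internet, so no external CVE databases
        "--read-only",        # the agent cannot persist changes to the image
        image,
        *challenge_cmd,
    ]

cmd = sandbox_cmd("challenge-web-01", ["python", "solve.py"])
print(" ".join(cmd))
# subprocess.run(cmd)  # would launch the isolated challenge container
```

The `--network none` flag is what forces the agent to rely purely on model knowledge rather than live lookups, which is the property that makes the benchmark's results attributable to the model itself.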

The headline results from the web hacking comparison were striking. AI agents identified tech stacks from subtle clues and immediately mapped known misconfigurations. On challenges that followed recognizable patterns (SQL injection, SSRF, authentication bypass), agents moved faster than any human could. The cost per successful exploit was between $1 and $50 depending on challenge complexity.
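
To see why an agent run lands in the $1 to $50 band, a back-of-the-envelope cost model helps. The per-token prices and run sizes below are illustrative assumptions, not figures from the Wiz benchmark:

```python
# Illustrative LLM API pricing (assumed, not from the Wiz report).
PRICE_PER_M_INPUT = 3.00    # USD per 1M input tokens
PRICE_PER_M_OUTPUT = 15.00  # USD per 1M output tokens

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated API cost in USD for one agent run."""
    return (input_tokens / 1e6) * PRICE_PER_M_INPUT + \
           (output_tokens / 1e6) * PRICE_PER_M_OUTPUT

# A long run: ~200 tool-call iterations, each re-reading ~20K tokens of
# context and emitting ~1K tokens of commands.
cost = run_cost(input_tokens=200 * 20_000, output_tokens=200 * 1_000)
print(f"${cost:.2f}")  # prints $15.00, inside the reported $1-$50 band
```

Even a run an order of magnitude longer stays in the low hundreds of dollars, which is the economic gap the rest of this article turns on.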

Where Agents Excelled

The agents’ strength was methodical execution of known attack patterns. Give an agent a web application with a documented vulnerability class, and it will find the specific instance faster and cheaper than a human. That kind of pattern recognition is exactly what LLMs are built for. Multi-step reasoning chains that would take a human 30 minutes of manual testing took an agent seconds.
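
The vulnerability classes the agents excelled at are textbook patterns. A minimal sketch of the kind of flaw involved, using an in-memory SQLite login check (table and names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

def login_unsafe(user: str, pw: str) -> bool:
    # Vulnerable: attacker input is spliced directly into the SQL string.
    q = f"SELECT * FROM users WHERE username = '{user}' AND password = '{pw}'"
    return conn.execute(q).fetchone() is not None

def login_safe(user: str, pw: str) -> bool:
    # Fixed: parameterized query, so input can never alter the SQL structure.
    q = "SELECT * FROM users WHERE username = ? AND password = ?"
    return conn.execute(q, (user, pw)).fetchone() is not None

payload = "' OR '1'='1"
print(login_unsafe("alice", payload))  # True: password check bypassed
print(login_safe("alice", payload))    # False: payload treated as a literal
```

Spotting that the first form is exploitable, and constructing the bypass payload, is exactly the bounded, pattern-matching task where the benchmarked agents beat humans on speed and cost.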

Where Agents Fell Apart

The moment Wiz moved from structured challenges to realistic, undirected scenarios, performance collapsed. In their incident investigation test, the AI agent ran for approximately one hour, making roughly 500 tool calls, and failed to find the vulnerability. The human researcher found it in about 5 minutes using broad enumeration and strategic pivoting. The agent did not lack knowledge. It lacked the ability to step back, survey the full attack surface, and decide where to look first. It kept drilling deeper into the wrong areas instead of broadening its search.

This matches a consistent finding across all the benchmarks: when AI agents are given wider scope without specific targets, the cost of finding vulnerabilities jumps 2 to 2.5 times, and the success rate drops sharply. The agent scaffolding matters as much as the underlying model.

The $50 vs $100,000 Problem

Fortune reported in January 2026 that AI is making hacking “very cheap,” and the Wiz data backs this up with specific numbers. For structured vulnerability discovery, the economics have flipped completely.

Traditional human penetration testing runs between $5,000 and $100,000+ per engagement, depending on scope and complexity. Mid-tier engagements ($15,000 to $50,000) cover deeper human-led analysis. AI-based alternatives like XBOW, which reached #1 on the global HackerOne leaderboard by finding over 1,000 vulnerabilities autonomously, generate equivalent reports starting at $4,000.

This cost compression has a dark side. The same economics that make defensive pentesting cheaper also make offensive hacking cheaper for attackers. The CrowdStrike 2026 Global Threat Report documents an 89% year-over-year increase in AI-enabled attacks. When your adversary can run sophisticated exploitation chains for the cost of an API call, your threat model needs updating.

Tenzai: Top 1% Without a Human in the Loop

Tenzai’s autonomous pentesting agent ranked in the top 1% of global CTF competitions in early 2026, outperforming 99% of human participants across six major platforms including websec.fr, dreamhack.io, and Lakera’s Agent Breaker. The agent solved challenges spanning authentication bypass, insecure direct object reference, application logic flaws, and multi-stage exploitation chains.

The key distinction: Tenzai’s system reasons about application behavior rather than relying on brute-force scanning. It analyzes identity flows, trust boundaries, and system interactions to find subtle vulnerabilities. The company raised $75 million in seed funding from Greylock and Battery Ventures, one of the largest seed rounds in cybersecurity history. Founded by the creators of Guardicore (acquired by Akamai), Tenzai is betting that autonomous pentesting will become the default within two years.

Related: AI Agent Prompt Injection: The Attack That Breaks Every Guardrail

Where Humans Still Win (And Why That Matters)

The Wiz data reveals a pattern that every benchmark in 2026 has confirmed: AI agents are specialists, not generalists. They excel at depth but fail at breadth. They follow known playbooks but cannot improvise new ones.

Human hackers bring three capabilities that current AI agents lack:

Broad enumeration. A human pentester spends the first phase of any engagement surveying the full attack surface: subdomains, API endpoints, third-party integrations, employee behavior patterns. They build a mental map before choosing where to strike. AI agents jump straight to exploitation without this reconnaissance phase, which is why they perform well on isolated challenges but poorly on real infrastructure.

Strategic pivoting. When a human finds that one attack vector is a dead end, they shift their entire approach. They might move from a technical exploit to a social engineering vector, or from the web application to the CI/CD pipeline. AI agents, even with sophisticated tool access, tend to persist on their initial approach far too long. The Wiz incident investigation demonstrated this: 500 tool calls in the same wrong direction.

Contextual judgment. Humans understand organizational context. They know that the staging server probably shares credentials with production. They notice that the company just acquired a startup and the integration is likely half-finished. This kind of situational awareness is not something you can prompt-engineer into an agent.

Dark Reading’s 2026 survey found that 48% of security professionals believe agentic AI will be the top attack vector by year-end. But the same survey implicitly acknowledges that the most dangerous attacks still require human creativity in the planning phase.

The Hybrid Model: Human Direction, AI Execution

The most compelling data comes from Hack The Box’s March 2026 benchmark, the largest side-by-side comparison of AI-augmented versus human-only cybersecurity performance ever conducted. They evaluated 1,078 teams (120 AI-augmented, 958 human-only) across 36 challenges covering nine technical domains.

The results were unambiguous. AI-augmented elite teams achieved 4.1x productivity compared to human-only teams. Even across all skill levels, AI augmentation delivered 1.4x improvement. AI-augmented teams improved their challenge solve rate by 70%, achieving a 27% solve rate versus 16% for top human-only teams.

But here is the critical nuance: the hardest and most novel challenges still demanded human judgment. AI raised the floor dramatically but barely moved the ceiling. The best human-only teams still solved problems that AI-augmented teams could not.

The Talent Pipeline Risk

Hack The Box flagged a risk that the industry has not yet grappled with: if AI handles the routine security work, junior analysts never develop the intuition needed to handle the hard cases. You cannot build a 10-year veteran by having AI do their first 5 years of work. The report specifically cautions that over-reliance on AI augmentation could hollow out the cybersecurity talent pipeline.

This maps directly to what Wiz found. AI agents need human direction to perform well on realistic tasks. But if the humans never develop that directional skill because AI did their early-career work for them, the hybrid model breaks down within a generation.

Related: OpenClaw: What the First Viral AI Agent Means for Enterprise Security

What This Means for Your Security Strategy

The benchmark data points to three actionable conclusions.

First, use AI agents for structured vulnerability scanning. For known vulnerability classes against defined targets, AI is faster and cheaper than humans by orders of magnitude. There is no reason to pay $50,000 for a human to find your SQL injection vulnerabilities when an agent will find them for $50.

Second, keep humans in charge of strategy and novel threats. Red team exercises, incident response, and adversary simulation still require human judgment. The Wiz data is clear: agents cannot replace the strategic layer. If your penetration test starts with “find everything wrong with this infrastructure,” a human needs to direct that engagement.

Third, invest in the hybrid workflow now. The 4.1x productivity gain from Hack The Box is not theoretical. It is measured across 1,078 teams. Security teams that pair human analysts with AI agents will outperform both pure-AI and pure-human approaches. But this requires training your team to direct AI effectively, not just handing them a tool and hoping for the best.

The age of AI-only or human-only security testing is ending. The benchmarks have spoken: neither side wins alone.

Frequently Asked Questions

Can AI agents hack better than humans?

In structured scenarios with known vulnerability classes, yes. Wiz’s 2026 benchmark showed AI agents solving 9 out of 10 web hacking challenges for under $50 each. But in realistic, undirected scenarios, human hackers still significantly outperform AI agents. The Wiz study found an AI agent made 500 tool calls over an hour without finding a vulnerability that a human spotted in 5 minutes.

How much cheaper is AI penetration testing compared to human pentesters?

For structured vulnerability discovery, AI agents cost between $1 and $50 per exploit in LLM costs, compared to $5,000 to $100,000+ for traditional human penetration testing engagements. AI-based pentesting platforms like XBOW offer reports starting at $4,000 versus $10,000 to $35,000 for equivalent manual assessments.

What is the Wiz AI Cyber Model Arena?

The Wiz AI Cyber Model Arena, launched in February 2026, is a benchmark suite of 257 real-world cybersecurity challenges spanning zero-day discovery, CVE detection, API security, web security, and cloud security. Each agent-model combination runs in isolated Docker containers with no internet access and deterministic scoring. It tested models including Claude Sonnet 4.5, GPT-5, and Gemini 2.5 Pro.

How much more productive are AI-augmented security teams?

According to Hack The Box’s March 2026 benchmark of 1,078 teams, AI-augmented elite cybersecurity teams achieved 4.1x more output than human-only teams. Across all skill levels, the improvement was 1.4x. AI-augmented teams also had a 70% higher challenge solve rate (27% vs 16%).

Will AI replace human penetration testers?

Not in the foreseeable future. Benchmark data consistently shows that AI agents fail at broad enumeration, strategic pivoting, and contextual judgment, all skills essential for realistic penetration testing. The hybrid model (human direction with AI execution) outperforms both pure-AI and pure-human approaches. However, AI will absorb the routine, structured parts of pentesting, leaving human testers to focus on strategy and novel threats.