Sequoia Capital published a bold claim in January 2026: AGI is here. Not “nearly here” or “on the horizon.” Here, now, in production. Their argument rests on a specific concept: long-horizon agents, AI systems that can work autonomously for tens of minutes or hours, making and correcting mistakes without human intervention. Coding agents like Claude Code and OpenAI Codex are the first proof. More domains will follow.
The thesis is provocative, well-argued, and funded by billions of dollars in portfolio companies that benefit from the claim being true. That does not make it wrong. But it demands scrutiny.
What “Long-Horizon” Actually Means
The term “long-horizon” describes how long an AI agent can sustain coherent, goal-directed work without a human stepping in. A chatbot has a horizon of one message. A simple automation script has a horizon of seconds. A long-horizon agent operates for minutes or hours: planning its approach, using tools, hitting walls, backtracking, and iterating toward a goal.
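The loop described above — plan, act, observe, backtrack, iterate — can be sketched in a few lines. This is a minimal illustrative sketch, not any vendor's implementation: real agents replace the placeholder functions below with LLM calls and tool invocations.

```python
# Minimal sketch of a long-horizon agent loop: plan, act, observe,
# backtrack, iterate. Every function here is an illustrative stub;
# a real agent would use LLM calls and external tools instead.

def agent_loop(goal_check, tools, max_steps=50):
    """Run tools until goal_check passes, re-queuing failed attempts."""
    history = []                 # record of (tool_name, result) pairs
    plan = list(tools)           # naive plan: try each tool in order
    while plan and len(history) < max_steps:
        name = plan.pop(0)
        result = tools[name]()   # act: invoke the tool
        history.append((name, result))
        if goal_check(result):   # observe: did we reach the goal?
            return result, history
        plan.append(name)        # backtrack: retry this step later
    return None, history

# Toy usage: a "tool" that makes incremental progress each call.
counter = {"n": 0}
def increment():
    counter["n"] += 1
    return counter["n"]

result, history = agent_loop(lambda r: r >= 3, {"increment": increment})
print(result, len(history))   # 3 3
```

The key property the sketch captures is that the loop, not the human, decides when to retry and when to stop.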
Sequoia partners Pat Grady and Sonya Huang frame this as the third pillar of AGI. The first was pre-training (the “ChatGPT moment” of 2022). The second was inference-time compute, models that reason through problems step by step (OpenAI’s o1, late 2024). The third, arriving in late 2025 and early 2026, is the agent loop: systems that can “figure things out” over extended time horizons.
Their most concrete example: a recruiting agent from Juicebox that completed a full candidate search in 31 minutes. It searched LinkedIn for developer relations candidates at competitor companies, cross-referenced YouTube for conference talks, checked Twitter activity patterns, ruled out disqualified prospects, and drafted personalized outreach. No human touched it between the initial prompt and the final output.
That is genuinely different from a chatbot that answers one question at a time. But whether it constitutes AGI depends entirely on what you mean by the word.
The Benchmark: METR’s 50% Time Horizon
The strongest quantitative evidence for the long-horizon thesis comes from METR, an AI safety research organization. Their “50% time horizon” metric measures the length of tasks (calibrated by how long a human expert takes) that an AI agent can complete autonomously with a 50% success rate.
The numbers are striking. Frontier models like Claude 3.7 Sonnet hit a 50% time horizon of roughly 50 to 60 minutes as of early 2025. That number has been doubling approximately every seven months over the past six years. In some recent stretches, it doubled every three to four months.
If you extrapolate (always a dangerous word), agents would handle full-day expert tasks by around 2028 and full-year tasks by 2034. METR themselves flag the obvious caveat: these projections are based on six years of data, primarily in software engineering tasks. Whether the trend holds across other domains, or hits physical limits, remains genuinely unknown.
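The extrapolation itself is simple doubling arithmetic. A sketch, using an illustrative one-hour starting horizon; published projection dates vary substantially because they depend on which reliability threshold (50% vs. 80%) and which doubling period you assume:

```python
import math

def months_to_reach(current_min, target_min, doubling_months=7.0):
    """Months until the time horizon reaches target, assuming steady doubling."""
    doublings = math.log2(target_min / current_min)
    return doublings * doubling_months

# From a 60-minute horizon, an 8-hour (480-minute) horizon needs
# log2(480/60) = 3 doublings, i.e. 21 months at a 7-month doubling period.
print(months_to_reach(60, 480))        # 21.0
# At the faster 3.5-month doubling seen in some recent stretches:
print(months_to_reach(60, 480, 3.5))   # 10.5
```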
What is not speculative: agents that could barely handle five-minute tasks in 2023 now routinely handle hour-long tasks. That is real, measurable progress.
Coding Agents as the Canary
If long-horizon agents have a breakout domain, it is software engineering. Coding agents are the first place where multi-hour autonomous work has become routine, and the production data backs it up.
Claude Code crossed $1.1 billion in annualized revenue by the end of 2025, and daily installs have risen from 17.7 million to 29 million since January 2026. Companies like Uber, Netflix, Spotify, and Salesforce use it in production. On SWE-Bench Verified, a standardized benchmark for real-world bug fixing, Claude Opus 4.5 hit 80.9%, the first model to break 80%.
OpenAI’s GPT-5.3-Codex, launched in February 2026, takes a slightly different approach. Tasks typically take one to 30 minutes. The Codex desktop app functions as a command center for parallel task execution, with automations that handle routine work like issue triage without being prompted.
Goldman Sachs provides a telling enterprise case study. The firm deployed Cognition’s Devin across its 12,000-person developer workforce, starting with hundreds of instances and planning to scale to thousands. CIO Marco Argenti called Claude “surprisingly capable” at tasks beyond coding, particularly where large data parsing meets rule application and judgment. Goldman expects these agents to triple or quadruple the impact of their previous AI solutions.
Two Technical Paths to Longer Horizons
How do you actually build an agent that stays on track for hours? Two approaches are converging.
Reinforcement Learning
Frontier labs are training models intrinsically to maintain coherent behavior over longer contexts. Google’s approach uses reinforcement learning at the goal level rather than the token level, dramatically reducing the search space. Instead of optimizing what word comes next, the model learns to optimize what strategy to pursue next. This produces agents that naturally stay focused on multi-step problems without external scaffolding.
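The search-space claim can be made concrete with back-of-the-envelope arithmetic. The numbers below are purely illustrative (a 50,000-entry vocabulary, a handful of strategies), not any lab's actual configuration:

```python
# Back-of-the-envelope comparison of the two search spaces.
# All numbers are illustrative, not a real model's configuration.

# Token-level search: each of ~200 generated tokens picks from a
# 50,000-entry vocabulary.
vocab, tokens = 50_000, 200
token_space_digits = len(str(vocab ** tokens))   # a ~940-digit number

# Goal-level search: each of ~5 plan steps picks from ~10 strategies.
strategies, steps = 10, 5
goal_space = strategies ** steps                 # 100,000 candidate plans

print(token_space_digits, goal_space)   # 940 100000
```

Even with generous assumptions about the strategy space, optimizing over plans rather than tokens shrinks the search by hundreds of orders of magnitude.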
Agent Harnesses
The application layer approach wraps models in scaffolding that compensates for their limitations. Anthropic’s Claude Code uses initializer agents that set up the working environment, progress files that maintain state across context windows, and git-based checkpointing. When the context window fills up, the agent can read its own progress notes and pick up where it left off.
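A minimal sketch of the progress-file pattern, assuming nothing about Claude Code's actual file format: the agent appends notes as it works, and a fresh session (or a fresh context window) rebuilds its state by reading them back.

```python
import json, os, tempfile

class ProgressFile:
    """Hypothetical harness state: append-only notes an agent re-reads
    after its context window resets. The JSONL format is illustrative."""

    def __init__(self, path):
        self.path = path

    def note(self, step, status):
        with open(self.path, "a") as f:
            f.write(json.dumps({"step": step, "status": status}) + "\n")

    def resume(self):
        """Return completed steps so a new session can skip them."""
        done = set()
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    entry = json.loads(line)
                    if entry["status"] == "done":
                        done.add(entry["step"])
        return done

# First "session" completes two steps, then the context window "resets".
path = os.path.join(tempfile.mkdtemp(), "progress.jsonl")
ProgressFile(path).note("setup_env", "done")
ProgressFile(path).note("write_tests", "done")

# A fresh session reads the notes and picks up where it left off.
remaining = [s for s in ["setup_env", "write_tests", "fix_bug"]
             if s not in ProgressFile(path).resume()]
print(remaining)   # ['fix_bug']
```

Git-based checkpointing plays the same role for code: the commit history is a progress file the agent can diff against.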
Harrison Chase of LangChain argues on the Sequoia podcast that “context engineering, not just better models” is the key. LangChain’s Deep Agents use the file system as agent state, spawn sub-agents for subtasks, and treat each session as a checkpoint in a longer workflow. This works especially well in coding, SRE, research, and advanced customer support.
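The sub-agent pattern can be sketched generically (this is not LangChain's actual Deep Agents API): a coordinator splits the goal into subtasks, hands each to a worker with its own fresh context, and merges results into shared state that later steps can read.

```python
# Generic sketch of the sub-agent pattern, not LangChain's actual API.
# In a real system each worker would be an LLM call with a fresh
# context, and shared state would live on the file system.

def spawn_subagent(task, state):
    """Hypothetical worker: returns a placeholder result for its subtask."""
    return f"result-for-{task}"

def coordinator(goal, subtasks):
    state = {"goal": goal}   # shared state, standing in for files on disk
    for task in subtasks:
        state[task] = spawn_subagent(task, state)
    return state

state = coordinator("ship feature", ["research", "implement", "review"])
print(sorted(state))   # ['goal', 'implement', 'research', 'review']
```

The design point is isolation: each sub-agent starts with a small, clean context instead of inheriting the coordinator's full history.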
The practical takeaway: longer horizons come from both smarter models and smarter infrastructure around them. Betting on one approach alone misses half the picture.
The AGI Question: Definition Games or Real Breakthrough?
Sequoia’s AGI definition is deliberately pragmatic. They define it as “the ability to figure things out,” broken into three components: baseline knowledge (pre-training), reasoning (inference-time compute), and iteration (agent loops). By this definition, yes, AGI has arrived.
The tech community is not convinced. Hacker News discussions repeatedly flag that Sequoia invests in both OpenAI and Anthropic, and that the firm has a financial interest in sustaining the AGI narrative until those companies go public. Tim Dettmers published a technical argument that scaling laws are approaching physical limits, that GPU improvements have plateaued, and that current AI addresses only knowledge work while ignoring the largest sectors of the economy.
There is also the cancellation problem. Gartner predicts that 40% of enterprise apps will embed AI agents by end of 2026, but simultaneously forecasts that over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear business value, or inadequate risk controls.
The honest assessment: long-horizon agents represent a genuine capability breakthrough. An agent that works autonomously for an hour on a complex task is qualitatively different from a chatbot. But calling it AGI stretches the term past its useful meaning. AGI has always implied general intelligence across all domains. What we have is narrow autonomy that is getting less narrow at an impressive rate.
What This Means for Enterprise Strategy
Forget the AGI label. Focus on what long-horizon agents can actually do for your organization today.
Where Long-Horizon Agents Work Now
Coding and software engineering are the proven domain. Beyond that, Sequoia’s own mapping shows production agents in medicine (OpenEvidence), law (Harvey), cybersecurity (XBOW), DevOps (Traversal), and sales (Day AI). Goldman Sachs is deploying Claude for trade accounting and client onboarding compliance.
The common thread: structured domains with clear success criteria, access to digital tools, and tolerance for iteration. If the task involves reading documents, applying rules, cross-referencing data, and producing an output that a human can review, a long-horizon agent can probably handle it.
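That workflow shape — read records, apply explicit rules, cross-reference a second source, produce a reviewable output — can be illustrated with a toy screening pass. All fields, names, and rules below are invented for illustration:

```python
# Toy version of the workflow shape described above. All records,
# fields, and rules are invented; a real agent would gather this data
# via tools and apply far richer criteria.

candidates = [
    {"name": "A. Rivera", "role": "devrel",  "talks": 3},
    {"name": "B. Chen",   "role": "backend", "talks": 0},
    {"name": "C. Okafor", "role": "devrel",  "talks": 1},
]
disqualified = {"C. Okafor"}   # cross-referenced from a second source

def screen(candidate):
    """Apply explicit rules; return (passed, reasons) for human review."""
    reasons = []
    if candidate["role"] != "devrel":
        reasons.append("wrong role")
    if candidate["name"] in disqualified:
        reasons.append("on disqualified list")
    return (not reasons, reasons)

shortlist = [c["name"] for c in candidates if screen(c)[0]]
print(shortlist)   # ['A. Rivera']
```

Returning the reasons alongside the verdict is the part that matters for enterprise use: it keeps every agent decision auditable by a human reviewer.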
Where They Fail
Tasks requiring physical interaction, real-time social judgment, or genuinely novel reasoning outside the training distribution. An agent can draft a legal brief by analyzing precedent; it cannot negotiate a settlement. An agent can triage support tickets; it cannot read the room on a sales call.
The Practical Playbook
Start with coding agents if you have a development team. The ROI is the most proven. Then look at internal workflows with high document volumes and rule-based decision-making: compliance checks, financial reconciliation, candidate screening. Goldman’s experience suggests that “tasks beyond coding where large data parsing meets rule application” are the next frontier.
Budget for a 40% failure rate on initial agent projects, in line with Gartner’s predictions. The projects that fail typically lack clear success metrics, provide insufficient human oversight, or try to automate tasks that require tacit knowledge the agent does not have.
Frequently Asked Questions
What are long-horizon AI agents?
Long-horizon AI agents are AI systems that can work autonomously for extended periods, typically tens of minutes to hours, without human intervention. They plan their approach, use tools, hit obstacles, backtrack, and iterate toward a goal. Coding agents like Claude Code and OpenAI Codex are the most prominent examples in production today.
Did Sequoia Capital really say AGI is here?
Yes. In January 2026, Sequoia partners Pat Grady and Sonya Huang published “This Is AGI,” arguing that long-horizon agents represent functional AGI defined as “the ability to figure things out.” This pragmatic definition differs from traditional AGI concepts that require human-level general intelligence across all domains.
How long can AI agents work autonomously in 2026?
According to METR’s benchmarks, frontier AI models can complete tasks that take human experts 50 to 60 minutes with 50% reliability as of early 2025. This capability has been doubling approximately every seven months over the past six years. Coding agents routinely handle multi-hour tasks in production.
Are coding agents the best example of long-horizon AI?
Yes. Coding agents are the breakout domain for long-horizon autonomy. Claude Code crossed $1.1 billion ARR by end of 2025. On SWE-Bench Verified, Claude Opus 4.5 solved 80.9% of real-world bug-fixing tasks. Goldman Sachs deployed hundreds of Devin instances across 12,000 developers. Other domains like law, medicine, and cybersecurity are following.
Should enterprises adopt long-horizon AI agents now?
Start with coding agents where ROI is most proven. Then target internal workflows with high document volumes and rule-based decisions: compliance, financial reconciliation, candidate screening. Budget for a 40% failure rate on initial projects, per Gartner’s forecast. The best candidates are structured domains with clear success criteria and digital tool access.
