Stripe’s internal coding agents, called Minions, now produce over 1,300 merged pull requests per week. Zero of those lines are human-written. All of them are human-reviewed. That makes Minions the most concrete, metrics-backed example of fully autonomous AI coding running inside a major enterprise, and the architecture behind them is more instructive than the headline number.
Unlike interactive tools like GitHub Copilot or Cursor that pair-program with a developer, Minions work unattended. An engineer describes a task in Slack, walks away, and returns to a finished pull request. The system is built on a heavily modified fork of Block’s open-source Goose agent, stripped of everything meant for interactive use and rebuilt for one-shot, end-to-end task completion.
The Four-Layer Architecture That Makes It Work
Most coverage of Stripe Minions focuses on the 1,300 PR number. The engineering decisions underneath it are more interesting. Stripe’s two-part blog series describes a four-layer architecture that separates execution, orchestration, context, and feedback into distinct concerns.
Layer 1: Isolated Devboxes
Each Minion runs in its own AWS EC2 instance, identical to the environments human engineers use. These devboxes come pre-loaded with Stripe’s full source tree, warmed Bazel and type-checking caches, and running code generation services. Spin-up time from a warm pool: roughly 10 seconds.
Stripe treats these as cattle, not pets. Standardized, disposable, easily replaceable. No production access, no real user data, no arbitrary network egress. Multiple devboxes can run simultaneously, which means Minions can parallelize across tasks without stepping on each other.
The critical insight: Stripe didn’t build special AI infrastructure. They reused the developer environments they’d already invested years in building for humans. As the Stripe engineering team puts it: “If it’s good for humans, it’s good for LLMs, too.”
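The warm-pool, cattle-not-pets lifecycle can be sketched in a few lines of Python. All names here are hypothetical; the real system manages EC2 instances, not in-process objects:

```python
import uuid
from dataclasses import dataclass

@dataclass
class Devbox:
    """A standardized, disposable agent environment (hypothetical model)."""
    box_id: str
    warm_caches: bool = True       # Bazel / type-check caches pre-warmed
    network_egress: bool = False   # no arbitrary egress, no production access

class WarmPool:
    """Keeps pre-provisioned devboxes idle so acquisition takes seconds."""
    def __init__(self, size: int) -> None:
        self.idle = [Devbox(box_id=uuid.uuid4().hex) for _ in range(size)]

    def acquire(self) -> Devbox:
        box = self.idle.pop()                              # hand out a ready box
        self.idle.append(Devbox(box_id=uuid.uuid4().hex))  # backfill the pool
        return box

    def release(self, box: Devbox) -> None:
        """Cattle, not pets: discard the used box instead of repairing it."""
        pass  # in the real system, the instance would be terminated here

pool = WarmPool(size=4)
box = pool.acquire()
```

The backfill-on-acquire pattern is what keeps spin-up near-instant: provisioning cost is paid in the background, never on the critical path of a task.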
Layer 2: Blueprint Orchestration
This is the architectural innovation that distinguishes Minions from simpler agent loops. Blueprints are sequences that alternate between two types of nodes:
Deterministic nodes run configured linters, push changes, apply PR templates. No LLM invocation, guaranteed consistency every time.
Agentic nodes implement the actual task and fix CI failures. Full LLM flexibility, but constrained to specific subtasks.
Why does this matter? Pure agentic loops (where the LLM controls everything) compound errors. Each LLM decision introduces variance, and over a long sequence of decisions, that variance accumulates. By alternating between deterministic and agentic phases, Stripe reduces token consumption and keeps error propagation in check. The LLM does the creative work; fixed code handles the mechanical parts.
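A blueprint of this kind can be sketched as an ordered list of callables, where deterministic nodes are plain functions and agentic nodes wrap an LLM call (stubbed here). All names below are hypothetical illustrations, not Stripe's implementation:

```python
from typing import Callable

def run_linters(state: dict) -> dict:
    """Deterministic node: same result on every run, no LLM involved."""
    state["lint_clean"] = True
    return state

def apply_pr_template(state: dict) -> dict:
    """Deterministic node: mechanical formatting of the PR body."""
    state["pr_body"] = f"## Summary\n{state.get('summary', '')}"
    return state

def agentic(prompt: str) -> Callable[[dict], dict]:
    """Wrap an LLM-driven step constrained to one subtask (LLM stubbed)."""
    def step(state: dict) -> dict:
        state["summary"] = f"[LLM output for: {prompt}]"
        return state
    return step

# The blueprint fixes the overall sequence; the LLM only controls
# the work inside agentic nodes, so variance cannot compound across steps.
blueprint = [
    agentic("implement the task"),  # agentic: creative work
    run_linters,                    # deterministic: guaranteed consistency
    apply_pr_template,              # deterministic: mechanical formatting
]

state: dict = {}
for node in blueprint:
    state = node(state)
```

The key property is that control flow lives outside the model: a bad LLM decision can produce a bad diff, but it can never skip the linter or the PR template.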
Layer 3: Curated Context
Stripe’s codebase spans hundreds of millions of lines, primarily Ruby with Sorbet typing, an uncommon combination for which standard LLMs have relatively little training data. Getting the right context into a limited context window is one of the hardest engineering problems in the system.
Stripe solves this with two mechanisms:
Rule files, similar in format to Cursor rules or Claude Code’s CLAUDE.md files, but scoped to specific directories and file patterns rather than applied globally. Global rules are used “very judiciously” to preserve context window space.
Toolshed, Stripe’s internal MCP server hosting roughly 500 tools for documentation retrieval, ticket lookups, code search via Sourcegraph, and build status checks. Tools are curated per agent type rather than dumping everything into every agent’s toolkit.
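Directory-scoped rules can be approximated with glob matching, so each file pulls in only the rules relevant to it. This is a sketch under invented patterns and rule text, not Stripe's implementation (note that `fnmatch`'s `*` also crosses `/`, unlike gitignore-style globs):

```python
from fnmatch import fnmatch

# Hypothetical rule registry: one rare global rule plus directory-scoped ones.
# Scoping keeps unrelated rules out of the agent's context window.
RULES = [
    ("*", "Follow the repository style guide."),             # global, used sparingly
    ("payments/*.rb", "All money amounts are integer cents."),
    ("api/*.rb", "New endpoints require a Sorbet sig."),
]

def rules_for(path: str) -> list[str]:
    """Collect only the rules whose pattern matches this file path."""
    return [text for pattern, text in RULES if fnmatch(path, pattern)]
```

A file under `payments/` gets the global rule plus the payments rule; a documentation file gets only the global rule. The same curation logic applies to tools: each agent type is handed a relevant subset rather than the full catalog.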
Layer 4: Hard Feedback Limits
Minions get local linting feedback in under 5 seconds (precomputed, cached rules) and selective CI test execution drawn from Stripe’s suite of more than 3 million tests. If tests fail, the agent gets a maximum of two CI retry rounds to fix the problem. After that, the branch transfers to a human engineer.
This is a deliberate design choice, not a limitation. As the Stripe blog explains: “There are diminishing marginal returns if an LLM is running against indefinitely many rounds of a full CI loop.” Two retries capture most fixable issues. Beyond that, the agent is likely stuck on something that requires human judgment.
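The retry budget amounts to a small bounded loop. A minimal sketch, with `run_ci` and `attempt_fix` as injected stand-ins for the real CI run and the agent's fix attempt:

```python
MAX_CI_RETRIES = 2  # hard cap before the branch transfers to a human

def run_with_ci_limit(run_ci, attempt_fix) -> str:
    """Run CI, allow a bounded number of agent fix rounds, then hand off.

    run_ci() -> bool (True means green); attempt_fix(round_num) lets the
    agent try to repair the failure. Both are hypothetical interfaces.
    """
    if run_ci():
        return "merged"
    for round_num in range(1, MAX_CI_RETRIES + 1):
        attempt_fix(round_num)   # agentic: try to repair the failure
        if run_ci():
            return "merged"
    return "handed_off"          # diminishing returns: escalate to a human
```

The cap doubles as a cost control: the most expensive failure mode for an unattended agent is an unbounded fix-and-rerun loop, and this makes that loop structurally impossible.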
What Tasks Minions Actually Handle
Stripe deliberately constrains Minions to narrow, well-defined tasks with clear inputs and explicit success criteria. This is not a system that takes “build me a payment integration” and runs with it. The task types include:
- Fixing flaky tests (often auto-triggered by CI detecting the flake)
- Configuration adjustments and dependency upgrades
- Straightforward migrations to new internal API versions
- Writing or updating unit tests for changed code
- Removing deprecated dependencies
- Minor refactoring and linter warning fixes
Every task specification includes precise file references, explicit scope boundaries (which files to modify and which to leave alone), verification criteria, and injected context snippets.
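A task specification of that shape might look like the following dataclass sketch; the fields mirror the elements listed above, but the names and example values are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical shape of a Minion task specification."""
    description: str
    files_to_modify: tuple[str, ...]        # explicit scope: touch these...
    files_off_limits: tuple[str, ...]       # ...and leave these alone
    verification: str                       # how success is checked
    context_snippets: tuple[str, ...] = ()  # injected context for the agent

    def allows(self, path: str) -> bool:
        """Scope check: may the agent edit this file?"""
        return path in self.files_to_modify and path not in self.files_off_limits

spec = TaskSpec(
    description="Deflake test_charge_refund by stubbing the clock",
    files_to_modify=("payments/test/charge_test.rb",),
    files_off_limits=("payments/charge.rb",),
    verification="CI green on selective test run",
)
```

Making scope machine-checkable, rather than leaving it in prose, is what lets the orchestrator reject out-of-bounds edits deterministically instead of trusting the model to stay in its lane.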
The productivity gains are real. Stripe reports that pan-EU local payment method integration, previously a roughly two-month effort, has been reduced to about two weeks and is trending toward completion in days.
The pattern is clear: Minions trade ambition-per-task for reliability and throughput. Instead of attempting complex, open-ended engineering challenges, they handle high volumes of well-specified, bounded work. This is the opposite strategy from tools like Devin or SWE-agent that try to solve larger problems autonomously.
How Stripe’s Numbers Stack Up Against Google, GitHub, and Amazon
Stripe isn’t the only company running AI coding at enterprise scale. But the approaches differ significantly.
Google reported that 25% of its code was AI-generated in Q3 2024, climbing to roughly 30% by early 2025. Sundar Pichai has indicated this trend is accelerating. Google’s approach is primarily AI-assisted (interactive tools) with a growing agentic component.
GitHub Copilot handles roughly 1.2 million PRs per month across its entire user base, with Copilot generating an average of 46% of code in enabled files and users reporting 55% faster task completion. It’s deployed at 90% of Fortune 100 companies. But Copilot is largely interactive: a human drives, the AI assists.
Amazon Q Developer (formerly CodeWhisperer), priced at $19/user/month, emphasizes its multiline code acceptance rate, which Amazon claims is the highest among major assistants. It’s adding agentic capabilities, but the core product remains an interactive assistant.
Stripe’s differentiation: Minions are fully unattended. No human touches the keyboard between task description and completed PR. The 1,300 PRs/week figure represents a different category of AI-generated code, one where the agent handles the entire workflow end-to-end rather than autocompleting inside an editor.
The other headline number: Stripe reports 8,500 employees using LLM tools daily, with 65-70% of engineers using AI coding assistants. Emily Glassberg Sands, Stripe’s Head of Data & AI, has noted that LLM costs for coding assistants are “pretty non-trivial,” suggesting the economics of running autonomous agents at scale remain an active concern.
What Other Companies Can (and Can’t) Learn from Stripe
The uncomfortable truth about Stripe’s Minions success: it rests on a decade of infrastructure investment that most companies haven’t made. Standardized developer environments, a 3+ million test suite, comprehensive linting, internal MCP tooling, and a culture of code review discipline. These aren’t things you bolt on to make AI agents work; they’re engineering fundamentals that happen to make AI agents effective.
Three principles that do transfer, regardless of company size:
Start with your developer environment, not the AI. If your developers struggle with flaky environments, slow builds, or inconsistent tooling, agents will struggle even more. Fix the foundation first.
Constrain tasks ruthlessly. Minions succeed because they solve small, well-specified problems. Companies that try to jump straight to “let the AI build features end-to-end” skip the step where you learn which task shapes work and which don’t.
Set hard iteration limits. Two CI retries, then hand off. This prevents cost spirals, avoids the trap of letting an LLM spend $50 of compute trying to fix a $5 problem, and keeps human engineers in the loop where they’re actually needed.
The broader signal from Stripe’s experience is that the gap between AI-assisted coding (human drives, AI helps) and autonomous coding (AI drives, human reviews) is narrowing faster than most engineering organizations expected. Stripe’s Minions already operate in the autonomous category for a specific class of tasks. The question for every other engineering team isn’t whether this becomes standard practice, but how quickly they can build the infrastructure to support it.
Frequently Asked Questions
What are Stripe Minions?
Stripe Minions are fully autonomous, unattended coding agents built internally at Stripe. They receive task descriptions (typically via Slack), work through them end-to-end without human intervention, and deliver finished pull requests for human review. They generate over 1,300 merged PRs per week with zero human-written code.
How do Stripe Minions generate pull requests?
Minions use a four-layer architecture: isolated devbox environments (AWS EC2 instances), blueprint orchestration that alternates between deterministic and AI-driven steps, curated context from rule files and an internal MCP server with 500 tools, and a feedback loop with local linting and up to two CI retry rounds before handing off to a human.
What technology are Stripe Minions built on?
Stripe Minions are built on a heavily modified fork of Block’s open-source Goose coding agent. The fork was started in late 2024; everything designed for interactive human use was stripped out, and the agent was rebuilt for fully unattended, one-shot task completion.
What types of coding tasks do Stripe Minions handle?
Minions handle narrow, well-defined tasks: fixing flaky tests, configuration adjustments, dependency upgrades, API migrations, writing unit tests, removing deprecated dependencies, and minor refactoring. All tasks have clear inputs, explicit scope boundaries, and defined success criteria. They do not handle large architectural decisions or ambiguous feature requests.
Can other companies replicate Stripe’s autonomous coding agent approach?
Partially. Stripe’s success depends on a decade of infrastructure investment: standardized developer environments, a 3+ million test suite, comprehensive linting, and internal tooling. The transferable principles are: fix your developer environment first, constrain agent tasks to small well-specified problems, and set hard iteration limits to prevent cost spirals and keep humans in the loop.
