
Sixteen Claude agents, each running inside a Docker container, built a 100,000-line C compiler from scratch in two weeks. The compiler passes 99% of the GCC torture test suite. It compiled the Linux 6.9 kernel for x86, ARM, and RISC-V. Total cost: $20,000 in API tokens. No human reviewed the code.

That project, published by Anthropic’s Nicholas Carlini in February 2026, is the most visible example of what StrongDM calls a “software factory”: a system where AI agents write, test, and iterate on code through structured loops until the output meets convergence criteria. Humans define the goal and the test harness. Agents handle everything else.

This is not a coding assistant suggesting the next line. It is a production pipeline where code flows from specification to working software without a human ever reading the source.

Related: GPT-5.3-Codex vs. Claude Opus 4.6: The Coding Agent Wars

How a Software Factory Works

The core idea borrows from dynamical systems theory. A rock tumbler polishes raw stones through repeated chaotic motion. Individual tumbles are random, but the aggregate process is convergent: rough stones become smooth. Software factories apply the same principle to code.

StrongDM’s AI team (Justin McCarthy, Jay Taylor, Navan Chauhan) formalized this in their software factory writeup and open-sourced Agate, their orchestration tool. The process follows five phases:

1. Interview. The system reads a goal (a markdown file describing what to build) and generates clarifying questions. A human answers them. This is the last point of direct human input.

2. Design. Agents produce architecture documents and technical decisions. These persist as plain markdown files, not database entries, so humans can inspect and edit the design if needed.

3. Sprint planning. The system breaks work into tasks and assigns them to specialized agent roles: planners, coders, reviewers, and recovery agents. Each role has its own skill definition.

4. Implementation loop. Coders write code. Reviewers gate each task. If a reviewer rejects the output, the task loops back. If a coder fails repeatedly, a recovery agent diagnoses the failure and a replanner rewrites the task. This inner loop runs without human oversight.

5. Assessment. After each sprint, the system evaluates whether the original goal is met. If not, it plans another sprint automatically.
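The five phases above can be sketched as a single control loop. Every function below is an illustrative stub, not Agate's actual implementation (which is written in Go), but the shape of the loop matches the process as described:

```python
# Sketch of the five-phase factory loop with stubbed agents.
# All function names are illustrative placeholders, not Agate's API.

def interview(goal):
    # Phase 1: generate clarifying questions; a human answers them.
    return {"target": "cli tool"}

def design(goal, answers):
    # Phase 2: architecture notes persisted as plain markdown.
    return f"# Design for {goal}\ntarget: {answers['target']}"

def plan_sprint(design_doc, sprint):
    # Phase 3: decompose work into tasks for specialized roles.
    return [f"sprint{sprint}-task{i}" for i in (1, 2)]

def implement(task, attempt):
    # Phase 4: a coder agent produces output for the task.
    return {"task": task, "attempt": attempt}

def review(output):
    # Reviewer gate: here, reject every first attempt.
    return output["attempt"] > 1

def goal_met(sprint):
    # Phase 5: assessment; pretend two sprints suffice.
    return sprint >= 2

def run_factory(goal):
    answers = interview(goal)
    doc = design(goal, answers)
    sprint, log = 0, []
    while not goal_met(sprint):
        sprint += 1
        for task in plan_sprint(doc, sprint):
            attempt, done = 0, False
            while not done:  # inner loop: runs without human oversight
                attempt += 1
                done = review(implement(task, attempt))
            log.append((task, attempt))
    return log

print(run_factory("build a markdown linter"))
```

The key structural point is the nested loop: tasks cycle between coder and reviewer until they pass, and sprints repeat until the assessment phase says the goal is met.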

Two principles define the StrongDM approach. First: “Code must not be written by humans.” Second: “Code must not be reviewed by humans.” The factory treats human code review as the bottleneck, not the safety mechanism.

The Convergence Problem

The hard part is not getting agents to write code. It is getting them to converge.

Anthropic’s compiler project hit this directly. When all 16 agents tried to compile the Linux kernel, they all encountered the same bugs, produced the same fixes, and overwrote each other’s changes through Git. Sixteen agents running in parallel were effectively one agent running sixteen times.

The fix was an oracle: use GCC as a known-good compiler to compile random subsets of kernel files, then test Claude’s compiler only on the remainder. This let each agent debug a different subsystem simultaneously. The agents stopped colliding and started converging.
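The partitioning idea can be sketched in a few lines: hold out a different random subset of files for each agent, compile the held-out subset with the known-good oracle, and leave the agent only its remainder to debug. This is a simplified illustration, not Anthropic's actual harness:

```python
# Simplified illustration of oracle partitioning: compile a random
# subset of kernel files with GCC (the known-good oracle) and give
# each agent a different remainder to debug.
import random

def partition(files, n_agents, oracle_fraction=0.75, seed=0):
    rng = random.Random(seed)
    assignments = []
    for agent in range(n_agents):
        shuffled = files[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * oracle_fraction)
        # shuffled[:cut] goes to GCC; the agent debugs only the rest.
        assignments.append((agent, sorted(shuffled[cut:])))
    return assignments

files = [f"kernel/file{i}.c" for i in range(8)]
assignments = partition(files, n_agents=3)
for agent, targets in assignments:
    print(agent, targets)
```

Because each agent sees a different remainder, two agents are unlikely to chase the same bug at the same time, which is what broke the original all-agents-on-everything setup.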

Agate solves this differently. Its sprint planner decomposes work into independent tasks, and its reviewer gates prevent agents from moving on until a task actually passes review. The exit code system (0 for done, 1 for more work, 2 for error, 255 for awaiting human input) creates clear convergence criteria at the orchestration level.
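The exit-code contract makes the outer loop trivial to express. A minimal sketch, where `step()` is a stand-in for one agent invocation:

```python
# Orchestration loop driven by Agate-style exit codes:
# 0 done, 1 more work, 2 error, 255 awaiting human input.
DONE, MORE_WORK, ERROR, AWAIT_HUMAN = 0, 1, 2, 255

def orchestrate(step, max_iters=50):
    for i in range(max_iters):
        code = step(i)
        if code == DONE:
            return "converged"
        if code == AWAIT_HUMAN:
            return "paused for human input"
        # ERROR would hand off to a recovery agent; MORE_WORK just loops.
    return "budget exhausted"

# Fake agent: errors once, needs two more passes, then finishes.
script = [ERROR, MORE_WORK, MORE_WORK, DONE]
print(orchestrate(lambda i: script[i]))  # converged
```

Only code 0 terminates successfully, so "done" is an explicit claim the agent must make, not an inference the orchestrator draws from silence.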

Both approaches share a key insight: convergence requires structure. Autonomous agents without orchestration just produce chaos.

Related: AI Agent Testing: How to QA Non-Deterministic Systems

What Factory.ai and Agate Look Like in Practice

StrongDM and Factory.ai represent two poles of the software factory spectrum.

Agate: Open-Source Orchestration

Agate is a Go-based CLI that runs locally. You create a GOAL.md file describing what you want to build, run agate auto, and the system takes over. All state persists in plain markdown files under an .ai/ directory: interview transcripts, architecture documents, sprint plans with checkbox-tracked tasks, skill definitions, and full invocation logs.

The system supports Claude Opus 4.5 (default), Claude 3.5 Haiku for fast iteration, and GPT-5.2 through the Codex CLI. Built-in roles include _planner, _reviewer, _recover, _replanner, _interviewer, and _retro (for sprint retrospectives). Language-specific skills auto-generate: go-coder, python-reviewer, and so on.

What makes Agate notable is its transparency. Because everything is markdown, you can pause a factory run, read exactly what happened, edit a sprint plan by hand, and resume. The system’s state is fully human-readable even though humans are not expected to read the code it produces.
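Because all of that state is plain markdown, a paused run can be inspected with ordinary tooling. A minimal sketch of reading checkbox-tracked tasks from a sprint plan; the exact file layout here is an assumption for illustration, not Agate's documented format:

```python
# Reading checkbox-tracked tasks from a markdown sprint plan.
# The plan text below is a hypothetical example, not Agate's format.
import re

plan = """\
# Sprint 3
- [x] implement parser (go-coder)
- [ ] add error recovery (go-coder)
- [x] review parser (go-reviewer)
"""

tasks = re.findall(r"- \[(x| )\] (.+)", plan)
done = [text for mark, text in tasks if mark == "x"]
todo = [text for mark, text in tasks if mark == " "]
print(f"{len(done)} done, {len(todo)} remaining")
```

The same property that makes the state greppable also makes it editable: change a checkbox or reword a task by hand, and the factory picks up the edit on resume.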

Factory.ai: Enterprise Droids

Factory.ai takes the opposite approach: a managed platform with specialized agents called “Droids” that plug into existing CI/CD pipelines. Developers delegate tasks through their IDE (VS Code, JetBrains, Vim) or terminal, and Droids handle implementation, review, and pull request creation.

Factory raised a $50M Series B from NEA, Sequoia, J.P. Morgan, and Nvidia. Customers include MongoDB, Ernst & Young, Zapier, Bayer, and Clari, with 200% quarter-over-quarter growth through 2025.

The company emphasizes “harness engineering” over raw model capability. According to Factory’s Eno Reyes in a Stack Overflow interview, the real work is “the sum of hundreds of little optimizations”: context management, environment injection, tool integration, and quality signal validation. Factory identifies hundreds of validation signals for generated code, from compilation checks to pattern-based quality analysis.

Factory also cites a Stanford research finding: code quality is the only predictor of whether AI accelerates or decelerates organizational productivity. Not adoption volume. Not agent penetration rates. Just how clean the codebase is before agents touch it.


Related: AI Agent Frameworks Compared: LangGraph, CrewAI, AutoGen

The $1,000-Per-Engineer-Per-Day Question

StrongDM’s recommendation for their factory approach: spend $1,000 per day in API tokens per human engineer. That works out to roughly $20,000 per developer per month, more than most junior developer salaries in DACH markets and on par with senior compensation in many regions.

Anthropic’s compiler project backs up that order of magnitude. Sixteen agents running for two weeks consumed 2 billion input tokens and 140 million output tokens for a total of $20,000. That bought 100,000 lines of working Rust code for a non-trivial systems project.
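From those figures, the unit economics are easy to work out:

```python
# Unit economics of the compiler project, using the figures above.
total_cost = 20_000                      # USD
lines = 100_000                          # lines of Rust produced
input_tokens = 2_000_000_000
output_tokens = 140_000_000

print(f"${total_cost / lines:.2f} per line of code")
print(f"{(input_tokens + output_tokens) / lines:,.0f} tokens per line")
```

Twenty cents per line of working systems code, at the cost of churning through more than twenty thousand tokens for each line that survives.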

But “working” needs qualification. The compiler passes 99% of the GCC torture suite, but it cannot generate 16-bit x86 code needed for Linux boot sequences. Its output is less optimized than GCC running with all optimizations disabled. For a research demonstration, these gaps are acceptable. For production infrastructure, they are not.

The economics get more interesting when you compare to human timelines. GCC took thousands of engineers over 37 years. Claude’s compiler took one researcher (who set up the orchestration and “mostly walked away”) and 16 agents for two weeks. Even accounting for the compiler’s limitations, the productivity ratio is staggering.

Digital Twin Universes

StrongDM introduced another concept that changes the economics: Digital Twin Universes (DTUs). These are behavioral clones of third-party services like Okta, Jira, Slack, Google Docs, and Google Sheets. DTUs replicate APIs, edge cases, and behaviors while eliminating rate limits and costs.

With DTUs, agents can run “thousands of scenarios per hour” against simulated environments instead of hitting real APIs. This shifts the success metric from binary (tests pass or fail) to probabilistic: “of all observed trajectories through all scenarios, what fraction likely satisfies the user?”
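That probabilistic metric is straightforward to compute once scenarios are cheap enough to run in bulk. A toy sketch, with a seeded random draw standing in for running the application against a DTU:

```python
# Toy version of the probabilistic success metric: run many simulated
# scenarios and report the fraction of trajectories that satisfy the
# user, rather than a binary pass/fail. run_scenario() is a stand-in
# for executing the app against a Digital Twin Universe.
import random

def run_scenario(seed):
    rng = random.Random(seed)
    return rng.random() < 0.9  # pretend ~90% of trajectories satisfy

results = [run_scenario(seed) for seed in range(1000)]
satisfaction = sum(results) / len(results)
print(f"{satisfaction:.1%} of trajectories satisfied the user")
```

The shift matters because a single failing test no longer halts the factory; a satisfaction rate below threshold simply triggers another sprint.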

This is where software factories diverge from ordinary agent-assisted coding. A coding assistant helps you write a function. A software factory runs your entire application against a simulated universe to find out if it works.

Where Software Factories Break Down

Three problems keep software factories from replacing conventional development today.

Quality ceilings. Agent-generated code works, but it does not optimize. Anthropic’s compiler outputs significantly less efficient code than GCC with no optimizations. Factory.ai’s Stanford-cited research confirms: agents accelerate good codebases and decelerate bad ones. If the factory starts from messy foundations, it builds messy buildings.

Coordination overhead. Anthropic’s team discovered that 16 agents solving the same problem is worse than one agent solving it unless you deliberately partition the work. Agate’s sprint planner and Factory’s Droids each address this, but neither has published benchmarks showing linear scaling with agent count. The coordination tax is real and poorly understood.

Accountability gaps. If no human reads the code, who is responsible when it fails in production? The EU AI Act classifies AI systems by risk level, and autonomous code generation for safety-critical applications almost certainly triggers high-risk obligations. StrongDM’s “no human review” principle collides directly with Article 14’s human oversight requirements for high-risk systems.

Related: What Are AI Agents? A Practical Guide for Business Leaders

What This Means for Development Teams

Software factories are not replacing developers. They are replacing a specific workflow: the spec-to-PR pipeline where a developer receives a well-defined ticket, implements it, self-reviews, and opens a pull request.

For that workflow, the factory model works today. Define the specification. Define the acceptance criteria. Let agents iterate until convergence. Review the output at the integration level (does it do what we asked?) rather than the code level (is this function well-written?).

Anthropic’s eight trends for 2026 emphasize exactly this shift: engineers move from writing code to coordinating agents, focusing their expertise on architecture, system design, and strategic decisions. Rakuten engineers used Claude Code on a task involving a 12.5-million-line codebase, completing it in seven hours with 99.9% numerical accuracy.

The practical takeaway for teams evaluating software factories:

Start with internal tools. Low-stakes, well-specified applications where code quality matters less than delivery speed. Agate is free and open-source. Try it on a weekend project before betting production workloads on it.

Invest in test harnesses, not code review. The factory model lives or dies on the quality of its convergence criteria. Carlini’s biggest lesson from the compiler project: design the test harness for the agent, not for yourself. Explicit progress tracking, clean output formats, and extremely high-quality tests matter more than reviewing what agents write.

Watch the cost curve. $20,000 per month per developer is prohibitive for most teams today. But API token costs have dropped roughly 10x per year since 2023. The StrongDM model that costs $20K/month in February 2026 may cost $2K/month by February 2027.
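Under that assumed 10x-per-year decline, the projection is simple arithmetic:

```python
# Projecting the factory's monthly cost per developer, assuming the
# article's rough 10x-per-year decline in API token prices holds.
monthly_cost = 20_000  # USD per developer, Feb 2026
projections = {}
for year in (2026, 2027, 2028):
    projections[year] = monthly_cost
    monthly_cost /= 10

for year, cost in projections.items():
    print(f"{year}: ${cost:,.0f}/month per developer")
```

At that rate the model crosses below typical tooling budgets within two years, though the 10x assumption is an extrapolation, not a guarantee.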

Frequently Asked Questions

What is an AI software factory?

An AI software factory is a system where AI agents write, test, and iterate on code through structured plan-implement-review loops until the output meets convergence criteria. Unlike coding assistants, software factories operate without human code review. Examples include StrongDM’s Agate orchestrator and Factory.ai’s Droids platform.

How much does it cost to run AI agents as a software factory?

StrongDM recommends spending around $1,000 per day in API tokens per human engineer, roughly $20,000 per month per developer. Anthropic’s C compiler project used 16 parallel agents for two weeks at a total cost of $20,000, consuming 2 billion input tokens and 140 million output tokens.

What is Agate by StrongDM?

Agate is an open-source AI orchestration tool that automates software development. Users define a project goal in a markdown file, and Agate runs multiple AI agents through iterative plan-implement-review cycles until the goal is met. It supports Claude Opus, Claude Haiku, and GPT-5.2, with all state stored as human-readable markdown files.

Can AI agents really build a working compiler without human oversight?

Yes. Anthropic demonstrated this in February 2026 when 16 parallel Claude agents built a 100,000-line C compiler in Rust that passes 99% of the GCC torture test suite and can compile the Linux 6.9 kernel. However, the compiler has limitations: it cannot generate 16-bit x86 code and produces less optimized output than GCC with no optimizations enabled.

How do software factories handle code quality without human review?

Software factories replace human code review with automated convergence criteria: test suites, scenario-based validation, compilation checks, and probabilistic satisfaction metrics. StrongDM uses Digital Twin Universes to simulate third-party service behavior, running thousands of scenarios per hour. Factory.ai tracks hundreds of validation signals from compilation to pattern-based quality analysis.