
AutoResearchClaw takes a single research idea typed into a terminal and, without any further human input, produces a conference-formatted LaTeX paper with real citations, generated experiments, statistical analysis, and multi-agent peer review. The open-source project from UNC Chapel Hill’s AIMING Lab hit 8,000 GitHub stars within nine days of its March 15, 2026 release. It is the most ambitious attempt yet to automate not just parts of research, but the entire pipeline from hypothesis to submission-ready manuscript.

Where Karpathy’s AutoResearch is a 630-line loop that optimizes a single metric on a single GPU, AutoResearchClaw is a full production system: 23 stages, 8 phases, specialized sub-agents for code generation, benchmarking, figure creation, and a self-learning memory system called MetaClaw that gets better with every run. The tagline is “Chat an Idea. Get a Paper.” Whether that paper is actually any good is the more interesting question.

Related: Karpathy's AutoResearch: 630 Lines of Python That Run 100 Experiments While You Sleep

The 23-Stage Pipeline: What Actually Happens After You Press Enter

The pipeline is organized into 8 phases. Three of them include “gate” stages where AutoResearchClaw can pause for human approval, though most users bypass these with the --auto-approve flag.

Phase A: Research Scoping parses your idea and decomposes it into sub-problems. You type “file-based vs. vector-based memory for LLM agents” and the system frames that into specific research questions with measurable outcomes.

Phase B: Literature Discovery generates search queries, pulls real papers from OpenAlex, Semantic Scholar, and arXiv, screens them for relevance, and extracts key findings. This is not keyword matching. The agent reads abstracts and conclusions, scores relevance, and builds a structured knowledge base from the top hits.

Phase C: Knowledge Synthesis is where it gets interesting. Three agents debate each other to generate testable hypotheses. One proposes, one critiques, one synthesizes. In a real run by Menon Lab, this debate produced a “quantum-inspired memory compression” hypothesis and a neuroplasticity-based dynamic switching architecture. Both were novel framings not present in the input literature.
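The propose/critique/synthesize loop can be sketched in a few lines. This is a hypothetical skeleton, not AutoResearchClaw's actual code; `call_llm` stands in for whatever model client the pipeline uses.

```python
def call_llm(role: str, prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM API here.
    return f"[{role} output for: {prompt[:40]}...]"

def debate_round(topic: str, rounds: int = 2) -> str:
    """One proposes, one critiques, one synthesizes -- repeated for a few rounds."""
    hypothesis = call_llm("proposer", f"Propose a testable hypothesis about: {topic}")
    for _ in range(rounds):
        critique = call_llm("critic", f"Critique this hypothesis: {hypothesis}")
        hypothesis = call_llm(
            "synthesizer",
            f"Revise the hypothesis.\nHypothesis: {hypothesis}\nCritique: {critique}",
        )
    return hypothesis
```

The design point is that the critic and synthesizer see each other's output, which is what pushes the debate toward framings absent from the input literature.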

Phase D: Experiment Design generates a multi-file Python project (main.py, setup.py, requirements.txt), detects available hardware (NVIDIA CUDA, Apple MPS, or CPU-only), and allocates resources accordingly. The default experiment budget is 300 seconds, which is often too short for complex topics.
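The CUDA-then-MPS-then-CPU fallback order can be approximated with the standard library alone. This is a minimal sketch of the detection logic described above, not the project's actual implementation:

```python
import platform
import shutil

def detect_device() -> str:
    """Pick an execution backend in CUDA -> MPS -> CPU order."""
    if shutil.which("nvidia-smi"):  # NVIDIA driver tooling present
        return "cuda"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mps"                # Apple Silicon supports Metal Performance Shaders
    return "cpu"
```

A real system would confirm the framework can actually use the device (e.g. via `torch.cuda.is_available()`), since a visible driver does not guarantee a usable runtime.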

Phase E: Experiment Execution runs the code in a sandboxed environment (Docker or local) with a self-healing loop: up to 10 repair rounds with AST validation, NaN/Inf detection, and automatic error correction. If the code crashes, AutoResearchClaw diagnoses the failure and rewrites the broken section.
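The shape of that self-healing loop is straightforward to sketch. Assuming the 10-round budget from above, with `execute` and `repair` as stand-ins for the sandbox runner and the LLM rewrite step:

```python
import ast
import math

MAX_REPAIR_ROUNDS = 10  # matches the repair budget described above

def syntax_ok(source: str) -> bool:
    """AST validation: does the generated code at least parse?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def results_ok(metrics: dict) -> bool:
    """NaN/Inf detection on numeric experiment outputs."""
    return all(math.isfinite(v) for v in metrics.values())

def run_with_repair(source, execute, repair):
    """Run generated code; on any failure, ask `repair` for a rewrite, up to the budget."""
    for _ in range(MAX_REPAIR_ROUNDS):
        if not syntax_ok(source):
            source = repair(source, "syntax error")
            continue
        try:
            metrics = execute(source)
        except Exception as exc:
            source = repair(source, str(exc))
            continue
        if results_ok(metrics):
            return metrics
        source = repair(source, "non-finite metrics")
    raise RuntimeError("experiment failed after all repair rounds")
```

Passing the failure reason back into `repair` is what lets the rewriting agent target the broken section instead of regenerating everything.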

Phase F: Analysis & Decision runs multi-agent statistical analysis and makes an autonomous decision: PROCEED to writing, REFINE the experiment parameters, or PIVOT to an entirely new direction. Each path maintains full artifact versioning so nothing is lost.
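The actual decision is made by multiple agents, but the three-way branch reduces to something like the following rule table (thresholds here are invented for illustration):

```python
def decide(p_value: float, effect_size: float) -> str:
    """Toy PROCEED/REFINE/PIVOT rule; the real system weighs multi-agent analysis."""
    if p_value < 0.05 and effect_size > 0.2:
        return "PROCEED"   # results support the hypothesis: write the paper
    if p_value < 0.2:
        return "REFINE"    # promising but weak: adjust experiment parameters
    return "PIVOT"         # dead end: pick a new direction
```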

Phase G: Paper Writing drafts the full paper (5,000-6,500 words target), runs a 7-dimension peer review scoring system, and revises based on its own critique. The review checks evidence consistency, methodology rigor, novelty, and whether the conclusions actually follow from the results.

Phase H: Finalization runs a quality audit that includes AI-slop detection, archives lessons learned for MetaClaw, generates LaTeX in NeurIPS/ICML/ICLR templates with proper BibTeX, and runs a 4-layer citation verification check against arXiv, CrossRef, DataCite, and Semantic Scholar.
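Layered verification of this kind amounts to "a reference survives if any resolver confirms it." A minimal sketch, with local-field checks standing in for the real arXiv/CrossRef/DataCite/Semantic Scholar API lookups:

```python
def verify_citation(ref: dict, resolvers) -> bool:
    """A reference passes as soon as one verification layer confirms it."""
    return any(resolver(ref) for resolver in resolvers)

# Stand-in resolvers that only inspect local fields, not live APIs:
def has_arxiv_id(ref): return bool(ref.get("arxiv_id"))
def has_doi(ref):      return bool(ref.get("doi"))

refs = [
    {"title": "Attention Is All You Need", "arxiv_id": "1706.03762"},
    {"title": "No identifier available"},  # fails every layer: auto-removed
]
verified = [r for r in refs if verify_citation(r, [has_arxiv_id, has_doi])]
```

In the real pipeline the final layer is an LLM relevance score, so a citation can be real yet still get flagged for not supporting the claim it is attached to.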

The output folder contains the draft, the LaTeX file, verified references, experiment code, charts with colorblind-safe palettes, peer review notes, and a citation integrity report.

CodeAgent, BenchmarkAgent, FigureAgent: The Specialized Sub-Agents

Version 0.2.0, released just one day after launch, introduced three specialized agent subsystems that handle the technical heavy lifting.

CodeAgent operates in four phases: generation, validation, review, and repair. It writes experiment code, runs static analysis and AST-based verification on its own output, does a deep validation pass, and enters an iterative fix loop (up to 3 rounds) when things break. The important detail: it validates that classes and methods actually exist in the generated code before trying to run them. This catches the single most common failure mode in LLM-generated code: calling functions that were never defined.
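That check is cheap to do with Python's `ast` module. A minimal sketch (the function name and exact scope rules are illustrative, not CodeAgent's API):

```python
import ast
import builtins

def undefined_calls(source: str) -> set:
    """Names that are called but never defined or imported in the module itself."""
    tree = ast.parse(source)
    defined = {node.name for node in ast.walk(tree)
               if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))}
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            defined |= {(a.asname or a.name).split(".")[0] for a in node.names}
    called = {node.func.id for node in ast.walk(tree)
              if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}
    return called - defined - set(dir(builtins))

snippet = "def train():\n    return evaluate()\n\ntrain()"
# `evaluate` is called but never defined anywhere in the module
```

Catching this before execution is the difference between a targeted repair prompt ("define `evaluate`") and a generic stack-trace round trip.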

BenchmarkAgent uses four sub-agents to select appropriate datasets and baselines from a 13-domain knowledge base. It is domain-aware: it picks different benchmarks for computer vision vs. NLP vs. reinforcement learning topics. It handles import validation and pretrained model resizing, which matters because half the “my experiment crashed” errors in early versions came from mismatched tensor dimensions.
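Domain-aware selection is essentially a lookup against a curated table. A hypothetical slice (the real knowledge base spans 13 domains; these entries are illustrative):

```python
BENCHMARKS = {
    "computer_vision": {"datasets": ["CIFAR-10", "CIFAR-100"], "baseline": "ResNet-18"},
    "nlp": {"datasets": ["SST-2", "SQuAD"], "baseline": "BERT-base"},
    "reinforcement_learning": {"datasets": ["CartPole-v1"], "baseline": "DQN"},
}

def select_benchmark(domain: str) -> dict:
    """Pick datasets and a baseline for the detected research domain."""
    return BENCHMARKS.get(domain, BENCHMARKS["computer_vision"])  # assumed fallback
```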

FigureAgent produces academic-quality visualizations through five sub-agents covering comparison plots, heatmaps, ablation studies, and more. It enforces LLM output type safety (no “the chart data is approximately…” when it needs actual numbers) and uses Paul Tol’s colorblind-safe palette, which is increasingly required for conference submissions.
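Both guarantees are simple to enforce. Below, the hex codes are Paul Tol's "bright" qualitative palette, and `coerce_numeric` is a hypothetical stand-in for the type-safety step:

```python
# Paul Tol's "bright" colorblind-safe qualitative palette.
TOL_BRIGHT = ["#4477AA", "#EE6677", "#228833", "#CCBB44",
              "#66CCEE", "#AA3377", "#BBBBBB"]

def coerce_numeric(values):
    """Reject non-numeric chart data instead of plotting garbage."""
    out = []
    for v in values:
        try:
            out.append(float(v))
        except (TypeError, ValueError):
            raise ValueError(f"chart data must be actual numbers, got {v!r}")
    return out
```

So a model answering "approximately 3" for a data point fails fast at the validation layer rather than silently corrupting a figure.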

Related: Multi-Agent Orchestration Platforms Compared: What Actually Works in 2026

Does It Actually Work? Real-World Results and Honest Limitations

The AIMING Lab ran 6 end-to-end test runs during development. All completed successfully (124 out of 124 pipeline steps), with 94.3% citation integrity and a mean quality score of 6.2 out of 10 on a simulated conference review scale. That 6.2 is below typical NeurIPS/ICML acceptance thresholds, which generally start around 6.5-7.0.

Menon Lab’s independent test on “file-based vs. vector-based memory for LLM agents” produced more granular insights:

  • Literature collection: 5,153 lines of real BibTeX from arXiv and Semantic Scholar, all verified
  • Code quality: Properly structured Python with baseline conditions and ablation studies, but “simplified simulations rather than production-grade implementations”
  • Self-awareness: The pipeline correctly identified methodological weaknesses in its own experiments and triggered REFINE loops
  • Cost: $5-15 in API calls per run with GPT-4o, though complex topics with multiple refinement cycles cost more
  • Runtime: 20 minutes to 2+ hours depending on complexity
  • Verdict: “Effective research assistant for initial exploration rather than final publication-ready output”

Where It Breaks

The experiments are simulated, not real. AutoResearchClaw generates and runs Python code, but that code runs simplified simulations, not actual GPU training jobs at scale. It works with pre-cached datasets (CIFAR-10/100, MNIST) and does not access external compute resources or novel datasets.

The 300-second default experiment budget is too short for anything complex. Users have to manually increase it, and the documentation does not make this obvious enough.

Configuration friction is real. The first run commonly fails because of incorrect Python paths, missing Docker access, or API key issues. Menon Lab’s first attempt crashed due to a wrong Python path in sandbox settings.

And math-heavy or purely theoretical topics are a known weakness. Version 0.3.2 explicitly listed “math/theoretical topic handling” as a bug fix, confirming that earlier versions failed on non-empirical research.

MetaClaw: The Part That Matters Most Long-Term

Version 0.3.0 introduced MetaClaw, a cross-run learning system that captures lessons from failures and warnings, converts them into reusable skills, and injects those skills into all 23 pipeline stages on subsequent runs. It uses a 30-day time-decay memory model where recent lessons carry more weight.
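The source describes the decay only as a "30-day time-decay memory model"; one plausible reading is an exponential curve with a 30-day half-life. A sketch under that assumption:

```python
HALF_LIFE_DAYS = 30.0  # assumption: exact decay curve is not documented

def lesson_weight(age_days: float) -> float:
    """Exponential decay: a 30-day-old lesson counts half as much as a fresh one."""
    return 0.5 ** (age_days / HALF_LIFE_DAYS)
```

Whatever the exact curve, the effect is the same: recently learned failure patterns dominate the skill injection, and stale heuristics fade instead of accumulating forever.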

The controlled experiment results: 24.8% reduction in stage retry rate, 40% fewer refinement cycles, and 18.3% overall robustness improvement. This is the architecture pattern that separates AutoResearchClaw from a one-shot prompt chain. Each run makes the next one better. After 50 runs, the system has accumulated enough domain-specific heuristics to avoid the failure modes it encountered in runs 1 through 49.

Related: OpenClaw: What the First Viral AI Agent Means for Enterprise Security

The Research Integrity Problem

AutoResearchClaw exists in a context where AI-generated academic content is already a crisis. Pangram Labs’ analysis of ICLR 2026 submissions found that 21% of peer reviews (15,899 reviews) were fully AI-generated, and over 50% showed some AI involvement. GPTZero’s separate analysis found 50+ hallucinated citations in a 300-paper sample; those papers had already been reviewed by 3-5 human experts, none of whom caught a single fabricated reference.

The Bulletin of the Atomic Scientists frames the deeper problem: “When hiring committees prioritize publication counts over research quality, we create powerful incentives for gaming the system.” AutoResearchClaw makes that gaming dramatically cheaper. A $10 API call replaces months of work.

To its credit, AutoResearchClaw’s 4-layer citation verification (arXiv IDs, CrossRef/DataCite DOIs, Semantic Scholar title matching, LLM relevance scoring) is a genuine differentiator. Most LLMs hallucinate citations at alarming rates. AutoResearchClaw’s 94.3% citation integrity rate means roughly 1 in 18 citations might still be fabricated, but that is orders of magnitude better than raw GPT-4 output, where hallucinated references are the norm rather than the exception.

The tool’s v0.3.2 VerifiedRegistry system categorizes 13 types of citation deficiency and auto-removes references that fail verification. It is transparent about what it is and recommends human review before any submission. The question is whether users will exercise that restraint when a conference deadline is 48 hours away and the tool just handed them a formatted paper.

Frequently Asked Questions

Can AutoResearchClaw generate papers good enough to submit to NeurIPS or ICML?

AutoResearchClaw generates LaTeX papers in NeurIPS, ICML, and ICLR templates, but initial test runs scored 6.2 out of 10 on a conference review scale, below the typical acceptance threshold of 6.5-7.0. The tool itself recommends human expert review before any real submission. It is best used as a research assistant for initial exploration and drafting, not as a final publication pipeline.

How much does a full AutoResearchClaw run cost?

A typical run costs $5-15 in API calls using GPT-4o. Complex topics that trigger multiple REFINE or PIVOT cycles can cost significantly more. The system supports OpenAI, OpenRouter, DeepSeek, and MiniMax APIs directly, plus Claude Code, Codex CLI, Copilot CLI, and Gemini CLI through the Agent Client Protocol (ACP), which requires no separate API key.

Does AutoResearchClaw use real citations or hallucinate them?

AutoResearchClaw uses a 4-layer citation verification system that checks references against arXiv, CrossRef, DataCite, and Semantic Scholar databases. In testing, it achieved 94.3% citation integrity. The v0.3.2 VerifiedRegistry system categorizes 13 types of citation deficiency and auto-removes references that fail verification. While significantly better than raw LLM output, roughly 1 in 18 citations may still be unverified.

What is MetaClaw and how does it improve AutoResearchClaw over time?

MetaClaw is AutoResearchClaw’s cross-run learning system introduced in v0.3.0. It captures lessons from failures and warnings during each run, converts them into reusable skills, and injects those skills into all 23 pipeline stages on subsequent runs. It uses a 30-day time-decay memory model. In controlled experiments, MetaClaw reduced stage retry rates by 24.8%, refinement cycles by 40%, and improved overall pipeline robustness by 18.3%.

Is it ethical to use AutoResearchClaw for academic research?

This is actively debated. Using AutoResearchClaw as an ideation and exploration tool while disclosing AI involvement is generally considered defensible. Submitting its unreviewed output to academic conferences without disclosure is not, and many venues now have explicit policies against undisclosed AI-generated content. The tool’s creators recommend human expert review before any submission.