The production-grade open-source AI agent stack has five layers: local inference, agent orchestration, vector storage for RAG, workflow automation, and observability. Every layer now has at least one tool mature enough for production traffic, and a hot thread on r/LocalLLaMA from February 2026 shows that hundreds of teams are running this exact combination: Ollama or vLLM serving local models, LangGraph or CrewAI for agent logic, Qdrant for retrieval, n8n for integration workflows, and Langfuse for tracing.
This post walks through each layer, names the tools that are actually shipping in production (not just demo-ready), and explains when to pick one over the other. If you have read framework comparison posts and tool roundups but still do not know what to install first, this is the guide you need.
Layer 1: Local LLM Inference with Ollama and vLLM
Everything starts with inference. If your agents call a cloud API for every request, you are paying per token and sending data off-premises. The open-source stack replaces that with local model serving.
Ollama is the starting point for 90% of teams. One command (ollama run llama3.2) pulls a quantized model and starts a local API server compatible with the OpenAI chat completions format. It handles GPU detection automatically, runs on Mac, Linux, and Windows, and wraps model management into something that feels like a package manager. Docker support makes it deployable anywhere. With over 130,000 GitHub stars in early 2026, it is the most popular local LLM tool by a wide margin.
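Because the server speaks the OpenAI chat-completions format, any OpenAI-style client works against it. Here is a minimal Python sketch of building such a request; the model name and prompt are illustrative, and the actual HTTP POST is left as a comment so the snippet runs without a live server:

```python
import json

# Ollama's default server exposes an OpenAI-compatible endpoint on port 11434.
OLLAMA_CHAT_URL = "http://localhost:11434/v1/chat/completions"

def chat_request(model: str, user_msg: str) -> dict:
    """Build a request body in the OpenAI chat-completions format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "stream": False,
    }

# POST this body (e.g. with requests or urllib) to OLLAMA_CHAT_URL
# once `ollama run llama3.2` has the server running.
body = chat_request("llama3.2", "Name one use for a local LLM.")
print(json.dumps(body, indent=2))
```

Swapping the base URL is usually all it takes to point existing OpenAI-client code at a local model.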
The catch: Ollama processes a maximum of four parallel requests by default and uses GGUF-quantized models (typically 4-bit or 8-bit). For a single developer or a small team prototyping agents, that is fine. For a production service handling 50+ concurrent users, it is not.
vLLM solves the throughput problem. Its PagedAttention memory management reduces GPU memory fragmentation by 50% or more, and Red Hat benchmarks show vLLM handling up to 3.23x more throughput than Ollama at 128 concurrent requests. vLLM uses BF16 safetensors models rather than quantized GGUFs, which means higher memory requirements but better output quality. It needs modern GPUs (A100, H100, or RTX 4090 class) with high VRAM.
Which to Pick
Use Ollama for development, prototyping, and small-scale deployments. Use vLLM when you need to serve more than a handful of concurrent agent sessions or when output quality matters more than hardware cost. Many teams run both: Ollama on developer laptops, vLLM on shared GPU servers.
A third option gaining traction is SGLang, which matches vLLM on throughput while offering a structured generation API that works well for tool-calling agents. The inference serving landscape analysis from January 2026 ranks it alongside vLLM for production use.
Layer 2: Agent Orchestration Frameworks
The orchestration layer defines how your agents think, act, and recover from failures. Three open-source frameworks dominate production deployments right now, each with a fundamentally different architecture.
LangGraph models agents as state graphs where nodes are actions and edges are transitions. This gives you explicit control over every decision path, built-in checkpointing for state persistence, and the ability to replay any agent run from any point. Langfuse benchmarks show it achieving the lowest latency and token usage across standardized tasks. With 24,000+ GitHub stars and 4.2 million monthly PyPI downloads, it is the framework most production teams choose when control and auditability matter.
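To make the nodes-and-edges idea concrete, here is a plain-Python sketch of a three-node graph over a shared state dict. This mirrors the concept only; it is not the LangGraph API, and the node names and routing table are made up:

```python
# Nodes are functions that transform state; edges name the next node.
def plan(state):
    state["steps"] = ["retrieve", "answer"]
    return state

def retrieve(state):
    state["docs"] = [f"doc about {state['question']}"]
    return state

def answer(state):
    state["answer"] = f"Based on {len(state['docs'])} doc(s)"
    return state

NODES = {"plan": plan, "retrieve": retrieve, "answer": answer}
EDGES = {"plan": "retrieve", "retrieve": "answer", "answer": None}

def run_graph(state, start="plan"):
    node = start
    while node is not None:        # follow edges until a terminal node
        state = NODES[node](state)
        node = EDGES[node]
    return state

final = run_graph({"question": "vLLM throughput"})
print(final["answer"])  # Based on 1 doc(s)
```

Because every transition is an explicit entry in the edge table, you can checkpoint the state dict at any node and resume or replay from there, which is the property LangGraph builds on.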
CrewAI takes a role-based approach. You define agents with specific roles, backstories, and goals, then assign them tasks in a sequential or hierarchical process. It ships with layered memory (ChromaDB for short-term, SQLite for task results and long-term) and YAML-based task configuration. Teams that think in terms of “I need a researcher agent, a writer agent, and a reviewer agent” will find CrewAI gets from concept to working prototype faster than anything else. The trade-off is less granular control over execution flow.
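The YAML configuration follows CrewAI's role/goal/backstory pattern; the sketch below is illustrative, with made-up agent names and wording:

```yaml
# agents.yaml — illustrative sketch, not a copy of any real project
researcher:
  role: Research Analyst
  goal: Gather sources on the assigned topic
  backstory: A meticulous analyst who cites everything.

writer:
  role: Technical Writer
  goal: Turn research notes into a draft post
  backstory: Prefers short sentences and concrete examples.
```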
smolagents from Hugging Face is the minimalist choice. Agents write and execute Python code directly rather than calling predefined tools through a framework abstraction. There is no graph, no YAML, no role system. Just a loop where the LLM generates code, the runtime executes it, and the result feeds back into the next step. It is ideal for self-hosted setups running smaller Hugging Face models, and it requires the least boilerplate of any framework.
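The entire pattern fits in a few lines. This sketch stubs out the model with a function that returns a fixed code string; a real setup would call an LLM, and the looping, execution, and feedback are the whole framework:

```python
# Minimal sketch of the code-agent loop. fake_llm stands in for a model;
# everything else is the actual shape of the pattern.
def fake_llm(task: str, history: list) -> str:
    # A real model would generate code from the task and prior results.
    return "result = sum(range(10))"

def run_code_agent(task: str, max_steps: int = 1):
    history = []
    for _ in range(max_steps):
        code = fake_llm(task, history)
        scope = {}
        exec(code, scope)              # execute the generated code
        history.append(scope.get("result"))
    return history[-1]

print(run_code_agent("add the numbers 0..9"))  # 45
```

In production you would sandbox the `exec` step (smolagents offers isolated executors for exactly this reason), since the model is writing arbitrary code.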
Production Readiness Comparison
| Feature | LangGraph | CrewAI | smolagents |
|---|---|---|---|
| State persistence | Built-in checkpoints | ChromaDB + SQLite | In-memory only |
| Human-in-the-loop | Native breakpoints | Supported | Manual |
| Multi-agent patterns | Graph composition | Role hierarchies | Code delegation |
| Observability | LangSmith / Langfuse | Built-in tracing | Basic logging |
| Learning curve | Steep | Moderate | Low |
For most production teams, the answer is LangGraph for complex, mission-critical workflows and CrewAI for everything else. smolagents fills a niche for teams already deep in the Hugging Face ecosystem running local models.
Layer 3: RAG, Vector Storage, and Agent Memory
Agents without memory are stateless chatbots with extra steps. The retrieval layer gives agents access to your documents, knowledge bases, and conversation history.
Qdrant is the production choice for vector storage. Written in Rust, it supports HNSW indexing, payload filtering, and vector quantization out of the box. Horizontal scaling works via built-in sharding across multiple nodes. Self-hosting is straightforward with Docker, and the Qdrant documentation covers production deployment patterns including replication and backup strategies. Performance at scale is where Qdrant separates from alternatives: it handles millions of vectors with sub-millisecond query times.
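To see what payload-filtered vector search means, here is a brute-force Python sketch of the operation that Qdrant's HNSW index accelerates. The points, payload fields, and filter are invented for illustration:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Each point pairs a vector with a payload of arbitrary metadata.
points = [
    {"vec": [1.0, 0.0], "payload": {"lang": "en", "doc": "intro"}},
    {"vec": [0.9, 0.1], "payload": {"lang": "de", "doc": "einleitung"}},
    {"vec": [0.0, 1.0], "payload": {"lang": "en", "doc": "appendix"}},
]

def search(query, flt, top_k=1):
    # Filter on payload first, then rank the survivors by similarity.
    candidates = [p for p in points
                  if all(p["payload"].get(k) == v for k, v in flt.items())]
    return sorted(candidates, key=lambda p: cosine(query, p["vec"]),
                  reverse=True)[:top_k]

print(search([1.0, 0.0], {"lang": "en"})[0]["payload"]["doc"])  # intro
```

The point of a real vector database is that this filter-then-rank query stays fast when `points` holds millions of entries instead of three.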
ChromaDB is simpler to start with. Its Python-native API means you can embed it directly into a LangGraph or CrewAI agent with three lines of code. For prototyping RAG pipelines on a laptop, nothing is faster to set up. But it lacks the distributed deployment features that Qdrant offers, and performance degrades with large collections. Think of ChromaDB as the SQLite of vector databases: perfect for development, insufficient for heavy production loads.
Supabase deserves mention as the “Swiss army knife” option. It combines PostgreSQL (with pgvector for vector search), authentication, real-time subscriptions, and a REST API in a single self-hosted Docker stack. Teams that need a vector database AND a relational database AND user auth often choose Supabase to avoid running three separate services. The n8n self-hosted AI starter kit bundles it by default.
Memory Architecture for Agents
Production agent memory typically splits into three tiers:
- Working memory: Current conversation context, held in the orchestration framework’s state (LangGraph checkpoints or CrewAI’s short-term ChromaDB store).
- Episodic memory: Past conversation summaries and task results, stored in a relational database (PostgreSQL via Supabase or SQLite).
- Semantic memory: Documents, knowledge bases, and embeddings, stored in a vector database (Qdrant or ChromaDB) for retrieval.
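The three tiers can be sketched in one small class. This is a toy illustration of the split, not any framework's API; the class and method names are invented, with a dict standing in for framework state and a list standing in for the vector store:

```python
import sqlite3

class AgentMemory:
    def __init__(self):
        self.working = []                      # working: current turns
        self.db = sqlite3.connect(":memory:")  # episodic: relational store
        self.db.execute("CREATE TABLE episodes (summary TEXT)")
        self.semantic = []                     # semantic: vector-DB stand-in

    def remember_turn(self, msg):
        self.working.append(msg)

    def archive(self, summary):
        # Summarize the finished conversation into episodic memory,
        # then clear working memory for the next task.
        self.db.execute("INSERT INTO episodes VALUES (?)", (summary,))
        self.working.clear()

    def recall_episodes(self):
        return [r[0] for r in self.db.execute("SELECT summary FROM episodes")]

mem = AgentMemory()
mem.remember_turn("user: how do I shard Qdrant?")
mem.archive("discussed Qdrant sharding")
print(mem.recall_episodes())  # ['discussed Qdrant sharding']
```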
Layer 4: Workflow Automation and Integration
Agent orchestration frameworks handle the thinking. Workflow automation handles the plumbing: triggering agents from external events, connecting to SaaS tools, scheduling batch jobs, and routing results to the right destination.
n8n dominates this layer for self-hosted teams. With 70,000+ GitHub stars, a $2.5 billion valuation after its Series C, and nearly 70 AI-specific nodes built on LangChain, n8n bridges the gap between “agent that can reason” and “agent that can actually do things in the real world.”
What makes n8n essential in the open-source stack:
- AI Agent node connects to OpenAI, Anthropic, Google, or Ollama-served local models. Your LangGraph agent handles the complex reasoning; n8n handles the triggers and integrations.
- 400+ pre-built integrations cover CRMs, databases, email, Slack, Google Workspace, and practically every SaaS API. Building these from scratch for each agent would take months.
- Sub-workflow orchestration enables multi-agent patterns where a router workflow delegates to specialized agent workflows.
- Self-hosted by default. Docker Compose, no external dependencies, your data stays on your infrastructure.
The n8n self-hosted AI starter kit bundles n8n + Ollama + Qdrant + PostgreSQL in a single docker-compose.yml that starts with one command. It is the fastest way to get a complete local AI workflow environment running.
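Structurally, such a stack compose file looks like the trimmed sketch below. The image names are the official ones on Docker Hub, but versions, volumes, credentials, and networking details are placeholders, not the starter kit's actual file:

```yaml
# docker-compose.yml — illustrative sketch, not the starter kit verbatim
services:
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
  qdrant:
    image: qdrant/qdrant
    ports: ["6333:6333"]
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: change-me
  n8n:
    image: n8nio/n8n
    ports: ["5678:5678"]
    depends_on: [postgres, ollama, qdrant]
```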
For teams that prefer a code-first approach over visual workflows, Temporal (open source, durable workflow execution) is the alternative. It handles long-running agent tasks with built-in retry logic and state persistence, but it requires significantly more engineering effort to set up.
Layer 5: Observability and Evaluation
Agents in production without observability are black boxes. You cannot debug what you cannot trace, and you cannot improve what you cannot measure.
Langfuse is the open-source standard for LLM observability. Self-hostable via Docker Compose (or Helm for Kubernetes), it gives you:
- Trace visualization: See every LLM call, tool invocation, and retrieval step in a single timeline view. When an agent goes off the rails, you can pinpoint exactly which step produced the wrong output.
- Cost tracking: Monitor token usage and compute costs per agent, per workflow, per user. Essential for teams running on GPU budgets.
- Prompt management: Version and A/B test prompts without redeploying your agent.
- Evaluation frameworks: Score agent outputs against ground truth, compare model versions, and track quality metrics over time.
- OpenTelemetry integration: Native OTEL support means you can pipe Langfuse data into your existing monitoring infrastructure (Grafana, Datadog, etc.).
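Conceptually, a trace is just a tree of spans with timings and token counts attached. The sketch below shows that data shape in plain Python; the field names are illustrative, not Langfuse's actual schema:

```python
import itertools

_ids = itertools.count(1)
spans = []

def record_span(name, parent=None, tokens=0):
    # Each span records what ran, under which parent, and what it cost.
    s = {"id": next(_ids), "parent": parent, "name": name, "tokens": tokens}
    spans.append(s)
    return s

run = record_span("agent-run")                       # root span
record_span("llm.plan", parent=run["id"], tokens=350)
record_span("retrieval.qdrant", parent=run["id"])
record_span("llm.answer", parent=run["id"], tokens=900)

# Cost tracking falls out of the same data: aggregate tokens per trace.
total_tokens = sum(s["tokens"] for s in spans)
print(total_tokens)  # 1250
```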
Langfuse integrates natively with LangGraph, CrewAI, and most LLM SDKs. The Langfuse GitHub repository has over 10,000 stars, and the project is backed by Y Combinator.
For teams already using LangGraph, LangSmith offers tighter integration, but it is a hosted service, not self-hostable. If staying fully on-premises matters, Langfuse is the answer.
The Reference Stack: Putting It All Together
Here is what a complete self-hosted agentic AI stack looks like in practice, from bottom to top:
┌─────────────────────────────────────────────┐
│ Layer 5: Observability │
│ Langfuse (tracing, evals, cost tracking) │
├─────────────────────────────────────────────┤
│ Layer 4: Workflow & Integration │
│ n8n (triggers, SaaS connectors, routing) │
├─────────────────────────────────────────────┤
│ Layer 3: RAG & Memory │
│ Qdrant (vectors) + PostgreSQL (relational) │
├─────────────────────────────────────────────┤
│ Layer 2: Agent Orchestration │
│ LangGraph (complex) or CrewAI (rapid) │
├─────────────────────────────────────────────┤
│ Layer 1: Inference │
│ Ollama (dev) or vLLM (production) │
└─────────────────────────────────────────────┘
Minimum Hardware for Getting Started
A single machine with 16GB RAM and an NVIDIA GPU with 8GB+ VRAM (like an RTX 3070 or 4060) can run the entire stack for development. Ollama serving a 7B parameter model, n8n, Qdrant, PostgreSQL, and Langfuse all fit comfortably in Docker Compose on that hardware.
For production, plan on:
- Inference server: 1-2 GPUs with 24GB+ VRAM each (RTX 4090 or A100) running vLLM
- Application server: 32GB RAM, 8+ CPU cores for n8n, Qdrant, PostgreSQL, Langfuse
- Storage: SSD with 500GB+ for model weights, vector indices, and logs
The Docker Compose Starting Point
The n8n self-hosted AI starter kit gets you layers 1, 3, and 4 running with a single command. Add Langfuse for layer 5 and wire in LangGraph or CrewAI for layer 2. The local-ai-packaged project by Cole Medin takes this further, bundling Ollama, Supabase, n8n, Open WebUI, and Flowise into one package.
Going from scratch to a working multi-agent system with RAG and observability takes an experienced developer about a weekend. That is the real promise of the 2026 open-source stack: not just that the tools are free, but that they compose cleanly.

Frequently Asked Questions
What is the best open-source LLM for running AI agents locally in 2026?
For local AI agent deployment in 2026, the Llama 3 family (the 8B and 70B instruction-tuned variants) and Qwen 3 are the most popular choices. Ollama makes running them trivial with a single command. For agent tasks that require tool calling, Llama's native function-calling support performs closest to commercial models like GPT-4o and Claude.
Can I run a production AI agent stack without a GPU?
Technically yes, but practically no. Ollama supports CPU-only inference, but a 7B model will generate about 2-5 tokens per second on CPU compared to 50-100+ tokens per second on a modern GPU. For production agent workflows where response latency matters, GPU inference is essential. An entry-level NVIDIA RTX 4060 with 8GB VRAM can serve a 7B model at acceptable speeds for 1-3 concurrent users.
How much does a self-hosted AI agent stack cost compared to using cloud APIs?
Hardware costs for a production self-hosted stack start around $3,000-5,000 for a single GPU server (RTX 4090). Ongoing costs are electricity (roughly $50-100/month for a single GPU running 24/7). Compare that to cloud API costs: a busy agent making 10,000 GPT-4o calls per day costs approximately $300-600/month in API fees alone. Most teams break even within 6-12 months of self-hosting, with the additional benefits of data privacy and no rate limits.
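The break-even arithmetic is straightforward; the sketch below uses the midpoints of the ranges above, which are assumptions rather than measured figures:

```python
# Back-of-envelope break-even, using midpoints of the ranges cited above.
hardware = 4000          # one-time GPU server cost ($)
power_per_month = 75     # electricity ($/month)
api_per_month = 450      # avoided cloud API spend ($/month)

saved_per_month = api_per_month - power_per_month   # 375
breakeven_months = hardware / saved_per_month
print(round(breakeven_months, 1))  # 10.7
```

Shift any input toward the cheap end of its range and the payback period moves with it, which is why the realistic answer is a 6-12 month window rather than a single number.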
What is the difference between n8n and LangGraph for AI agents?
LangGraph handles agent reasoning: deciding which tools to call, processing results, managing conversation state, and implementing multi-step logic. n8n handles workflow automation: triggering agents from external events (webhooks, schedules, emails), connecting to SaaS tools (CRMs, databases, Slack), and routing agent outputs to downstream systems. Most production stacks use both: LangGraph for the brain, n8n for the nervous system.
Is the open-source AI agent stack production-ready in 2026?
Yes, with caveats. Individual components like Ollama, LangGraph, Qdrant, n8n, and Langfuse are each used in production by thousands of teams. The challenge is integration: making all five layers work together reliably requires DevOps expertise and ongoing maintenance. Starter kits like n8n’s self-hosted AI kit and local-ai-packaged reduce setup time to hours, but production hardening (monitoring, backups, security, scaling) is still your responsibility.
