A chatbot answers questions. An agent does things. If your “AI agent” cannot call a tool, track progress across steps, or decide on its own what to do next, you have built a chatbot with a fancy system prompt. That distinction matters because it determines whether your creation can actually automate a workflow or just simulate the experience of automating one.

This tutorial builds a real AI agent in Python. Not a toy demo that calls one API and prints the result, but an agent that reasons about what tool to use, executes it, evaluates the output, and decides what to do next. You will have working code by the end, plus a clear understanding of why your first agent will still break in production and what to do about it.

Related: What Are AI Agents? A Practical Guide for Business Leaders

What Makes Something an Agent, Not a Chatbot

Three properties separate agents from chatbots. First, tool use: the agent can call external functions like web searches, database queries, or API calls. Second, reasoning loops: the agent decides which tool to call based on the current state, not a hardcoded sequence. Third, state management: the agent tracks what it has done and what remains, across multiple steps.

IBM’s architecture overview describes this as the “perceive, reason, act” loop. The LLM is the reasoning engine. Tools are the hands. State is the memory. Strip any one of these away and you are back to a chatbot.

The minimum viable agent looks like this: a user gives it a task, the LLM decides which tool to call, the tool returns a result, the LLM evaluates whether the task is complete, and if not, it picks the next tool. That loop, simple as it sounds, is what separates an agent from a chain of predefined steps.
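That loop can be sketched in a few lines of plain Python. Everything here is a stub — decide_next_action stands in for the LLM, and the tools dict holds ordinary functions — but the control flow is the same one every framework implements under the hood:

```python
# A stub of the core agent loop: the "LLM" is a canned decision
# function and the tools are plain Python functions. All names here
# are illustrative, not from any framework.

def decide_next_action(task, history):
    """Stand-in for the LLM: search once, then declare the task done."""
    if not history:
        return {"tool": "search", "args": {"query": task}}
    return {"tool": None, "answer": f"Summary of {len(history)} result(s)"}

tools = {"search": lambda query: f"results for {query!r}"}

def run_agent(task, max_steps=5):
    history = []
    for _ in range(max_steps):          # hard cap so the loop always ends
        action = decide_next_action(task, history)
        if action["tool"] is None:      # the "LLM" decided it is finished
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        history.append(result)          # state: what the agent has done
    return "Stopped: step limit reached"

print(run_agent("AI agents in enterprise"))
# → Summary of 1 result(s)
```

Swap the stub for a real LLM call and the lambda for a real search API, and this is structurally the agent we build below.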

Pick Your Stack: Three Frameworks, Three Philosophies

You have three serious options for building agents in Python in 2026. Each reflects a different philosophy about how much control you want.

Related: AI Agent Frameworks Compared: LangGraph, CrewAI, AutoGen

LangGraph: Maximum Control, Maximum Complexity

LangGraph models your agent as a state machine. You define nodes (functions), edges (transitions), and a state object that flows through the graph. Every decision point is explicit. Every tool call is a node. Every routing decision is an edge.

This sounds like overhead until you need to debug why your agent called a tool three times in a row. With LangGraph, you can see exactly which node fired, what state it received, and why the conditional edge routed to that node instead of another. That visibility is why LangGraph dominates production deployments: the 2026 State of Agent Engineering survey found it is the most widely adopted framework for agents in production.

Best for: developers who want full control over the agent’s decision flow and need debuggable, testable agent logic.

OpenAI Agents SDK: Minimal Boilerplate, Fast Start

The OpenAI Agents SDK takes the opposite approach. Define an agent with a name, instructions, and tools. Call Runner.run(). The SDK handles the reasoning loop, tool dispatch, and conversation management internally.

It is fast to prototype with. You can have a working agent in 15 lines of code. But you trade control for convenience: the reasoning loop is a black box, routing logic between agents uses handoffs that are harder to unit test, and you are tightly coupled to OpenAI's API design even though the SDK technically supports other providers.

Best for: prototypes, hackathons, and teams that want something working before lunch.

CrewAI: Role-Based Multi-Agent

CrewAI structures agents as team members with roles, goals, and backstories. You define a “crew” of agents, assign tasks, and let them collaborate. The framework handles delegation and communication between agents.

The role metaphor makes it intuitive for non-technical stakeholders to understand what the system does. But the abstraction can fight you when you need fine-grained control over which agent handles which subtask.

Best for: multi-agent workflows where different agents specialize in different tasks, especially when you need to explain the system to a business audience.

Build a Research Agent Step by Step

Let us build something concrete: a research agent that takes a topic, searches the web, extracts key facts, and produces a structured summary. We will use LangGraph because it gives you the most visibility into what is happening at each step.

Step 1: Install Dependencies

pip install langgraph langchain-openai langchain-community tavily-python

You need an OpenAI API key and a Tavily API key for web search. Set them as environment variables:

export OPENAI_API_KEY="your-key-here"
export TAVILY_API_KEY="your-key-here"

Step 2: Define the State

The state is a TypedDict that carries information between nodes. Every node reads from and writes to this shared state.

from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages

class ResearchState(TypedDict):
    messages: Annotated[list, add_messages]
    topic: str
    search_results: list[str]
    summary: str

messages holds the conversation history with the LLM. topic is the input. search_results and summary are extension points: the minimal graph below never writes to them, and the finished summary arrives as the last message instead. They are in the state so you can later store raw findings and the final output as structured fields.

Step 3: Define the Tools

from langchain_community.tools.tavily_search import TavilySearchResults

search_tool = TavilySearchResults(max_results=3)

Tavily returns structured search results with titles, URLs, and content snippets. Three results is enough for a first agent; more results means more tokens, higher cost, and diminishing returns.

Step 4: Build the Agent Node

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
llm_with_tools = llm.bind_tools([search_tool])

def research_node(state: ResearchState):
    """The agent reasons about what to do next."""
    system_msg = (
        f"You are a research agent. Your topic is: {state['topic']}. "
        "Search for key facts, statistics, and expert opinions. "
        "When you have enough information, summarize your findings."
    )
    messages = [{"role": "system", "content": system_msg}] + state["messages"]
    response = llm_with_tools.invoke(messages)
    return {"messages": [response]}

Step 5: Build the Tool Execution Node

from langgraph.prebuilt import ToolNode

tool_node = ToolNode([search_tool])

LangGraph’s ToolNode handles tool execution automatically. When the LLM’s response includes a tool call, this node executes it and adds the result back to state.

Step 6: Wire the Graph

from langgraph.graph import StateGraph, END

def should_continue(state: ResearchState):
    """Route to tools if the LLM wants to call one, otherwise end."""
    last_message = state["messages"][-1]
    if last_message.tool_calls:
        return "tools"
    return END

graph = StateGraph(ResearchState)
graph.add_node("agent", research_node)
graph.add_node("tools", tool_node)
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
graph.add_edge("tools", "agent")

app = graph.compile()

The flow: the agent node reasons and optionally calls a tool. If it calls a tool, the graph routes to the tool node, which executes the call and returns to the agent. If the agent has no more tool calls, the graph ends.

Step 7: Run It

result = app.invoke({
    "messages": [{"role": "user", "content": "Research the current state of AI agents in enterprise"}],
    "topic": "AI agents in enterprise",
    "search_results": [],
    "summary": ""
})

print(result["messages"][-1].content)

That is roughly 60 lines of code for a functional research agent. It searches the web, synthesizes what it finds, and produces a structured answer. It is not production-ready, but it is a real agent: it decides when to search, evaluates results, and determines when it has enough information to stop.

The Five Mistakes Every First Agent Has

Building the agent is the easy part. Making it reliable is where teams spend months. Here are the five problems you will hit first, drawn from real-world failure patterns and production experience.

Related: Context Engineering: The Architecture Pattern Replacing Prompt Engineering

1. The Infinite Loop

Your agent calls the same tool repeatedly because the LLM cannot determine that it already has the answer. Fix this by adding a step counter to your state and a conditional edge that forces termination after N iterations:

class ResearchState(TypedDict):
    messages: Annotated[list, add_messages]
    topic: str
    step_count: int  # Add this

def research_node(state: ResearchState):
    step = state.get("step_count", 0) + 1
    # ... rest of logic
    return {"messages": [response], "step_count": step}

def should_continue(state: ResearchState):
    if state.get("step_count", 0) >= 5:
        return END  # Force stop after 5 steps
    last_message = state["messages"][-1]
    if last_message.tool_calls:
        return "tools"
    return END

2. Context Window Overflow

Every tool call adds tokens to the message history. After ten searches, your context window is full of search results the agent no longer needs. Production agents use context compression and memory management to prune stale information. The simplest approach: summarize tool results before adding them to state, and cap message history length.
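Both halves of that fix can be sketched without any framework. The following is a minimal, hedged version assuming plain dict messages; real agents would summarize with an LLM rather than truncate, and the limits are illustrative:

```python
# Sketch: cap message history and shrink bulky tool output before it
# enters state. Limits and names are illustrative assumptions.

MAX_TOOL_CHARS = 500
MAX_TURNS = 6

def compress_tool_result(text: str) -> str:
    """Truncate oversized tool output; production code might summarize it."""
    if len(text) <= MAX_TOOL_CHARS:
        return text
    return text[:MAX_TOOL_CHARS] + " ...[truncated]"

def prune_history(messages: list[dict]) -> list[dict]:
    """Keep system messages plus only the last MAX_TURNS other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-MAX_TURNS:]

history = [{"role": "system", "content": "You are a research agent."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(10)]
print(len(prune_history(history)))  # → 7
```

Call compress_tool_result on each tool output before appending it to state, and prune_history on the message list before each LLM call.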

3. The God Agent Anti-Pattern

You build one agent that handles everything: searching, analyzing, writing, formatting. It works for your demo. Then you add a fourth task and the agent starts confusing its instructions. Split responsibilities into specialized agents. A search agent finds information. An analysis agent evaluates it. A writer agent produces the output. CrewAI’s role-based architecture makes this pattern explicit, but you can implement it in any framework.
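Framework aside, the pattern is just routing: a thin coordinator hands each subtask to a narrow specialist instead of one prompt doing everything. A hedged stdlib sketch, with stub functions standing in for real agents:

```python
# Sketch of splitting a god agent into specialists. Each handler would
# be its own agent with a narrow prompt; here they are stubs.

def search_agent(task):   return f"findings on {task}"
def analysis_agent(text): return f"analysis of ({text})"
def writer_agent(text):   return f"report: {text}"

def run_pipeline(topic):
    """Coordinator: each specialist sees only its own subtask."""
    found = search_agent(topic)
    judged = analysis_agent(found)
    return writer_agent(judged)

print(run_pipeline("AI agents"))
# → report: analysis of (findings on AI agents)
```

Each stub becomes its own LangGraph node (or CrewAI agent) with a short, single-purpose instruction, which is exactly what keeps the instructions from interfering with each other.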

4. No Error Handling for Tool Failures

APIs time out. Search engines return empty results. Your database connection drops. If your agent has no fallback for tool failures, it either crashes or hallucinates an answer based on what it “expected” the tool to return. Wrap tool calls in try/except blocks and return structured error messages the LLM can reason about:

def safe_search(query: str) -> str:
    try:
        results = search_tool.invoke(query)
        return str(results)
    except Exception as e:
        return f"Search failed: {str(e)}. Try a different query."

5. Skipping Human-in-the-Loop

Your agent works perfectly on ten test queries. On the eleventh, it sends an email to a customer that says something unhinged. Every agent that touches external systems needs a checkpoint where a human can review high-stakes actions before they execute.
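The checkpoint does not need to be elaborate. One hedged sketch: maintain a set of high-stakes tool names and require an approval callback before any of them runs. All names here are illustrative:

```python
# Sketch of a human-in-the-loop gate: high-stakes tools require an
# approval callback before they execute. Names are illustrative.

HIGH_STAKES = {"send_email", "issue_refund"}

def execute_action(name, payload, tools, approve):
    """Run a tool, but pause for human sign-off on risky actions."""
    if name in HIGH_STAKES and not approve(name, payload):
        return {"status": "blocked", "reason": "awaiting human review"}
    return {"status": "ok", "result": tools[name](payload)}

tools = {"send_email": lambda p: f"sent to {p['to']}"}
reviewer = lambda name, payload: False  # stand-in for a real review UI

out = execute_action("send_email", {"to": "customer@example.com"},
                     tools, approve=reviewer)
print(out["status"])  # → blocked
```

In a real deployment, approve would enqueue the action for review and the agent would pause; LangGraph's interrupt-based checkpointing supports this pattern natively.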

Related: Human-in-the-Loop AI Agents: When to Let Agents Act and When to Hit Pause

From Tutorial to Production: What Changes

The gap between a tutorial agent and a production agent is infrastructure. Here is what you add before deploying anything real.

Observability: You need to see every step the agent takes, every tool call, every LLM response, every routing decision. LangSmith traces LangGraph agents natively. For other frameworks, OpenTelemetry with the GenAI semantic conventions gives you framework-agnostic tracing.
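Even before adopting a tracing platform, a logging decorator around each tool captures the essentials: what was called, with what arguments, and how long it took. A minimal stdlib sketch:

```python
# Sketch: trace every tool call with arguments and duration using
# only the standard library. A platform like LangSmith replaces this.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.trace")

def traced(fn):
    """Log each call's name, arguments, and wall-clock duration."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            ms = (time.perf_counter() - start) * 1000
            log.info("%s args=%r took %.1fms", fn.__name__, args, ms)
    return wrapper

@traced
def search(query: str) -> str:
    return f"results for {query}"

search("AI agents")
```

Decorate every tool the same way and you get a crude but complete trace of the agent's actions in your logs.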

Persistence: If your agent crashes mid-task, it should resume where it left off, not start over. LangGraph’s checkpointing handles this. For other frameworks, you need to build your own state persistence layer.

Cost controls: A runaway agent can burn through hundreds of dollars in API calls in minutes. Set per-run token budgets, enforce maximum iteration counts, and log cost per execution. The State of Agent Engineering 2026 survey found that cost management is the second most common challenge after evaluation.
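A per-run token budget is a few lines of code. A hedged sketch (the charge amounts are illustrative; real code would read usage figures from each API response):

```python
# Sketch of a per-run token budget that aborts a runaway agent.
# Token counts are illustrative assumptions.

class BudgetExceeded(Exception):
    pass

class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record usage; abort the run once the budget is spent."""
        self.used += tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"{self.used} > {self.max_tokens} tokens")

budget = TokenBudget(max_tokens=10_000)
budget.charge(4_000)      # first LLM call
budget.charge(4_000)      # second call
try:
    budget.charge(4_000)  # third call exceeds the budget
except BudgetExceeded as e:
    print("run aborted:", e)
```

Call charge after every LLM response, and let the exception terminate the loop the same way the step counter does.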

Evaluation: “I chatted with it and it seemed fine” is not evaluation. Build automated test suites that run your agent against a set of tasks and score the outputs. Braintrust and LangSmith both offer evaluation frameworks designed for non-deterministic agent outputs.
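The shape of such a suite is simple even without a platform: fixed tasks, a scoring function, aggregate metrics. A hedged sketch with a stubbed agent and a deliberately crude keyword scorer:

```python
# Sketch of an automated eval harness. run_agent is a stub; a real
# suite would call your compiled graph (app.invoke) and use richer
# scoring such as LLM-as-judge.

def run_agent(task: str) -> str:
    return f"summary of {task}"  # stand-in for the real agent

def score(output: str, must_contain: list[str]) -> float:
    """Crude keyword check; swap in exact-match or LLM-based scoring."""
    hits = sum(kw.lower() in output.lower() for kw in must_contain)
    return hits / len(must_contain)

test_cases = [
    {"task": "AI agents in enterprise", "must_contain": ["agents", "enterprise"]},
    {"task": "LangGraph checkpointing", "must_contain": ["checkpointing"]},
]

scores = [score(run_agent(c["task"]), c["must_contain"]) for c in test_cases]
print(f"avg score: {sum(scores) / len(scores):.2f}")  # → avg score: 1.00
```

Run this in CI, track the average over time, and alert when a prompt or model change drops it.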

Security: If your agent calls tools with real credentials, those credentials need proper scoping. The agent should have the minimum permissions required and nothing more. Read the OWASP Top 10 for Agentic Applications before deploying anything that touches production data.

The tutorial agent you built today is a starting point. Production agents add layers of safety, observability, and resilience around the same core loop. The reasoning is identical. The infrastructure around it is what makes the difference between a demo and a system you can trust.

Frequently Asked Questions

What programming language is best for building AI agents?

Python dominates AI agent development in 2026. All three major frameworks (LangGraph, OpenAI Agents SDK, CrewAI) are Python-native, and the ecosystem of LLM libraries, tool integrations, and evaluation frameworks is deepest in Python. TypeScript is a viable alternative if you are building browser-based or Node.js agents.

How much does it cost to run an AI agent?

Cost depends on the model and how many steps the agent takes. GPT-4o input tokens run roughly $0.005 per 1,000, and each step sends the full message history again. A research agent that makes five search queries and processes the results might cost $0.05-0.15 per run. Runaway agents with no iteration limits can cost hundreds of dollars in a single session, which is why cost controls are essential.

Which AI agent framework should I start with?

Start with LangGraph if you want to understand how agents actually work under the hood. Its explicit state machine approach forces you to think about every decision point. Use OpenAI Agents SDK if you want something running fast with minimal code. Use CrewAI if you are building multi-agent workflows where different agents have specialized roles.

Can I build an AI agent without coding?

Yes. Platforms like n8n, Make, and Zapier now support AI agent workflows with visual drag-and-drop builders. Botpress and Voiceflow offer no-code agent builders with tool integration. These work well for straightforward workflows but become limiting when you need custom tool logic, complex state management, or multi-agent orchestration.

How do I test an AI agent before deploying it?

Build a set of test cases with known expected outcomes and run your agent against them automatically. Track success rate, cost per run, and time to completion. Tools like LangSmith and Braintrust provide evaluation frameworks designed for non-deterministic agent outputs. Never rely solely on manual testing by chatting with the agent.
