TL;DR: The open-source AI agent ecosystem has exploded. There are now dozens of projects that look promising on a GitHub README and fail in production within two hours. This guide cuts through the noise. We tested, deployed, and rated the best open-source agent projects across five categories — browser agents, coding agents, research agents, orchestration frameworks, and infrastructure. For each, we give you: what it actually does, production readiness (1-5), who it is for, and honest caveats. If you are building with agents in 2026 and do not want to waste two weeks on a repo that was abandoned in January, start here.
What you will learn
- The current state of the open-source agent ecosystem
- Agent project landscape map
- Browser agents: what actually works
- Coding agents: the SWE-bench reality check
- Research agents: from prototype to pipeline
- Orchestration frameworks: how to wire it all together
- Infrastructure layer: sandboxing, observability, tool integrations
- How to pick the right stack for your use case
- Frequently asked questions
The current state of the open-source agent ecosystem
Twelve months ago, building with open-source AI agents meant choosing between three half-finished frameworks and hoping the maintainer would respond to your GitHub issue. Today, that calculus has completely reversed.
The open-source agent ecosystem in 2026 is genuinely mature in some areas and still dangerously over-hyped in others. The distinction matters enormously if you are shipping production software.
Here is what changed:
Model quality crossed the production threshold. In 2024, even the best models were unreliable for multi-step agentic tasks — they hallucinated tool calls, forgot context mid-chain, and required constant human correction. By early 2026, GPT-5, Claude Opus 4, and Gemini Ultra all clear the bar where a well-designed agent can complete multi-step workflows reliably enough to trust with production data. DeepSeek V3.2 and Llama 4 Scout get you 70-80% of that capability for local/private deployments.
Tooling infrastructure matured. The Model Context Protocol (MCP) created a standardized interface for tool connections. E2B and Modal made sandboxed execution accessible. Composio solved the 200-integration problem that was blocking enterprise deployments. The plumbing is no longer the hard part.
The GitHub star inflation problem. Every week, a new agent framework appears, gets featured on Hacker News, accumulates 3,000 stars in 48 hours, and then goes quiet. We are going to be explicit about distinguishing "viral repository" from "production-ready tool." Our production readiness ratings reflect actual sustained use, maintenance cadence, breaking change frequency, and real-world reliability — not star counts.
If you want to understand why the AI agent startup opportunity is real and why open-source is central to it, the short version is: open-source projects are where the best practitioners are sharing their architectures. Using them means standing on the shoulders of the people who are three years ahead of the enterprise vendors.
Agent project landscape map
Before diving into individual projects, here is how the full ecosystem fits together:
graph TB
subgraph MODELS["Foundation Models"]
GPT5["GPT-5 / GPT-5.4"]
CLAUDE["Claude Opus 4 / Sonnet 4.5"]
DEEPSEEK["DeepSeek V3.2"]
LLAMA["Llama 4 Scout"]
end
subgraph ORCHESTRATION["Orchestration Frameworks"]
CREWAI["CrewAI"]
LANGGRAPH["LangGraph"]
AUTOGEN["AutoGen / AG2"]
OPENAI_SDK["OpenAI Agents SDK"]
CLAUDE_SDK["Claude Agent SDK"]
end
subgraph BROWSER["Browser Agents"]
BROWSER_USE["Browser Use"]
STAGEHAND["Stagehand v3"]
PLAYWRIGHT_MCP["Playwright MCP"]
COMPUTER_USE["Computer Use APIs"]
end
subgraph CODING["Coding Agents"]
CLAUDE_CODE["Claude Code"]
CURSOR["Cursor + Automations"]
AIDER["Aider"]
OPENHANDS["OpenHands / OpenDevin"]
CLINE["Cline"]
CONTINUE["Continue"]
end
subgraph RESEARCH["Research Agents"]
STORM["STORM"]
GPT_RESEARCHER["GPT-Researcher"]
EXA["Exa AI"]
PERPLEXITY_API["Perplexity API"]
end
subgraph INFRA["Infrastructure"]
E2B["E2B Sandboxing"]
MODAL["Modal Serverless"]
COMPOSIO["Composio Tools"]
AGENTOPS["AgentOps / LangSmith"]
end
MODELS --> ORCHESTRATION
MODELS --> BROWSER
MODELS --> CODING
MODELS --> RESEARCH
ORCHESTRATION --> BROWSER
ORCHESTRATION --> CODING
ORCHESTRATION --> RESEARCH
INFRA --> ORCHESTRATION
INFRA --> CODING
INFRA --> BROWSER
Think of the stack in four layers:
- Foundation models — the reasoning engine that powers everything
- Orchestration frameworks — the control plane that sequences steps, manages state, and routes between agents
- Domain agents — specialized tools optimized for browser, code, or research tasks
- Infrastructure — execution sandboxes, observability, and tool integrations
Most production systems use at least two or three layers from this diagram in combination. A coding agent (OpenHands) might use LangGraph for orchestration, E2B for sandboxed execution, and LangSmith for observability. The layers compose.
Browser agents: what actually works
Browser agents are the most visible category right now — partly because the demos are compelling and partly because the use case is obvious: "AI that can use the web like a human." Reality is more nuanced.
The architecture of a browser agent
flowchart TD
USER_INPUT["User Instruction\n(natural language)"] --> PLANNER["Planning Module\n(LLM - task decomposition)"]
PLANNER --> ACTION_LOOP["Action Loop"]
ACTION_LOOP --> OBSERVE["Observe Browser State\n(screenshot + DOM)"]
OBSERVE --> PERCEIVE["Perceive\n(extract relevant elements)"]
PERCEIVE --> DECIDE["Decide\n(LLM - next action)"]
DECIDE --> CLICK["Click / Type / Scroll"]
DECIDE --> NAVIGATE["Navigate URL"]
DECIDE --> EXTRACT["Extract Data"]
DECIDE --> DONE{"Task\nComplete?"}
CLICK --> OBSERVE
NAVIGATE --> OBSERVE
EXTRACT --> OUTPUT["Structured Output"]
DONE -- No --> OBSERVE
DONE -- Yes --> OUTPUT
OUTPUT --> USER_OUTPUT["Return to User"]
style USER_INPUT fill:#f0f9ff,stroke:#0ea5e9
style USER_OUTPUT fill:#f0fdf4,stroke:#22c55e
style DONE fill:#fef9c3,stroke:#eab308
Every production browser agent implements some version of this loop: observe the current state, extract meaning from what is visible, decide the next action, execute it, and repeat until the task is complete or a maximum step count is hit.
Where they differ is in the observe-perceive layer. Some agents work purely from screenshots (visual agents, computationally expensive). Others parse the DOM directly (fast but breaks on canvas-heavy apps). The best production implementations do both and let the model decide which signal to trust.
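The loop in the diagram above reduces to a small skeleton. This is an illustrative sketch, not any framework's actual API — `observe`, `decide`, and `execute` are hypothetical stand-ins for the screenshot/DOM capture, the LLM call, and the Playwright action respectively:

```python
from dataclasses import dataclass, field

@dataclass
class BrowserAgent:
    """Illustrative observe-decide-act loop; the real perception and
    LLM layers are stubbed out as plain callables."""
    observe: callable    # returns current page state (screenshot + DOM)
    decide: callable     # LLM call: (task, state, history) -> action dict
    execute: callable    # applies a click / type / navigate action
    max_steps: int = 20  # hard cap so a confused model cannot loop forever
    history: list = field(default_factory=list)

    def run(self, task: str):
        for _ in range(self.max_steps):
            state = self.observe()
            action = self.decide(task, state, self.history)
            self.history.append(action)
            if action["type"] == "done":
                return action.get("output")
            self.execute(action)
        raise RuntimeError("max_steps exceeded without completing the task")
```

The `max_steps` cap is the detail demos omit and production systems cannot: without it, a model that misreads the page will click in circles indefinitely.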
Browser Use
GitHub: browser-use/browser-use — 50K+ stars
Browser Use is the runaway leader in this category. It hit 50,000 GitHub stars faster than almost any developer tool in 2025, and unlike most viral repositories, it actually deserves the attention.
The core design is clean: Browser Use wraps Playwright with an LLM layer that can interpret any web page and take actions on it. It uses a dual-mode perception system — DOM extraction for structured pages and visual capture for everything else. The result is a browser automation tool that works on pages that would defeat traditional scraping, including dynamically loaded content, login walls, and multi-step forms.
What it is actually good at:
- Web scraping at scale (structured extraction from arbitrary sites)
- Form automation (filling and submitting multi-step forms, including authenticated workflows)
- Research pipelines (open-ended web research, not just a list of URLs)
- E-commerce automation (price monitoring, checkout flows, catalog extraction)
Production readiness: 4/5
The caveats: Browser Use is fast in demos and slower in production. Each step requires a model call, and on complex pages, the DOM extraction step adds meaningful latency. At scale (thousands of tasks per day), cost management becomes a real engineering challenge. The team has been shipping improvements to caching and batching, but this is still an area to benchmark before committing.
Bot-detection handling is a known gap. Browser Use navigates human-designed pages well, but aggressive bot protection (Cloudflare, hCaptcha) stops it. You will need a separate layer for sites with heavy protection.
import asyncio
from browser_use import Agent
from langchain_anthropic import ChatAnthropic

async def main():
    agent = Agent(
        task="Go to Hacker News, find the top 3 posts about AI agents, extract title and URL",
        llm=ChatAnthropic(model="claude-sonnet-4-5"),
    )
    result = await agent.run()  # agent.run() is a coroutine, so it needs an event loop
    print(result)

asyncio.run(main())
Install: pip install browser-use
Docs: docs.browser-use.com
Stagehand v3
GitHub: browserbase/stagehand — 12K+ stars
Stagehand takes a different philosophy from Browser Use. Where Browser Use is a general-purpose agent loop, Stagehand is a typed SDK — you write TypeScript and call methods like page.act(), page.extract(), and page.observe(). The LLM handles the messy web interpretation, but your code drives the high-level logic.
This turns out to be a better fit for production software engineering than the pure-agent approach. Your code is deterministic at the orchestration level. The non-determinism is isolated to the web interaction layer where it is actually needed. You get proper TypeScript types, error handling, and unit-testable business logic around an AI-powered browsing core.
What it is actually good at:
- Production web automation where you need code-level control
- Scraping pipelines inside TypeScript/Node.js backends
- SaaS integrations where the target site lacks an API
- Test automation for visually complex UIs
Production readiness: 4/5
Stagehand runs on Browserbase, which provides headless browser infrastructure at scale. The tight integration means you are not managing your own browser fleet, but it also means vendor dependency. Self-hosted deployment is possible but less polished.
Docs: docs.stagehand.dev
Playwright MCP
GitHub: microsoft/playwright-mcp — 5K+ stars
Playwright MCP is Microsoft's bridge between the Model Context Protocol and Playwright's browser automation library. It exposes Playwright's full API as MCP tools, which means any MCP-compatible agent (Claude, GPT-5 via function calling, etc.) can control a browser natively without a separate browser agent framework.
This is the lowest-level option but also the most composable. If you are already using MCP for other tool connections, adding Playwright MCP means your agent can browse the web with the same pattern as any other tool call.
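Wiring it up is mostly configuration. A minimal client config might look like the following — this follows the common `mcpServers` convention used by MCP clients, but check the repository README for the current package name and flags:

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
```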
Production readiness: 3/5
Playwright MCP is newer and the tooling around error recovery and session management is less mature than Browser Use or Stagehand. Use it when you are already deep in the MCP ecosystem and want browser access without adding another framework.
Computer Use APIs (Claude, GPT-5.4)
Both Anthropic and OpenAI now offer computer use APIs — agents that can take control of a full desktop environment, not just a browser tab. Claude's computer use API lets you spin up a virtual desktop and have the model interact with any application via screenshot observation and keyboard/mouse control.
This is qualitatively different from browser agents. A browser agent can only interact with web pages. Computer use can interact with any installed application: spreadsheets, legacy desktop software, developer tools, anything with a GUI.
Production readiness: 3/5
Computer use is genuinely powerful but latency is high and cost is significant. Each step requires a vision model call on a full screenshot. For tasks that can be accomplished with a browser-only agent, the overhead is not worth it. Computer use earns its keep for workflows involving desktop applications that have no browser interface and no API. For the security model behind safe execution, Alibaba's OpenSandbox gives a good picture of what production-safe agent execution looks like.
Perplexity Computer launched in early 2026 as a turnkey computer use agent with an opinionated interface. Less flexible than the raw APIs but faster to get started for common tasks like web research + document creation workflows.
Coding agents: the SWE-bench reality check
Coding agents are the category that has attracted the most investment, the most hype, and the most honest reckoning with what LLMs can and cannot do reliably. They are also what has made vibe coding — shipping entire products through AI — a credible workflow for founders rather than a party trick.
The standard benchmark is SWE-bench Verified — a dataset of real GitHub issues from popular open-source repositories. The task: given an issue description and the codebase, produce a patch that makes the failing tests pass, without breaking existing tests. It is hard. Real-world hard.
Here is where the leading models sit as of March 2026:
These numbers are impressive and should be interpreted carefully. SWE-bench uses curated issues from well-maintained codebases with clear test suites. Real enterprise codebases have legacy code, sparse tests, ambiguous requirements, and implicit conventions. Benchmark performance is a ceiling, not a floor, for what you will see in production.
The coding agent workflow architecture
flowchart LR
ISSUE["Issue / Feature Request"] --> UNDERSTAND["Understand Codebase\n(index + search)"]
UNDERSTAND --> LOCATE["Locate Relevant Files\n(grep, semantic search)"]
LOCATE --> READ["Read & Analyze Code\n(context loading)"]
READ --> PLAN["Plan Changes\n(edit strategy)"]
PLAN --> EDIT["Apply Edits\n(file modifications)"]
EDIT --> VERIFY["Run Tests\n(sandbox execution)"]
VERIFY -- Tests Pass --> PR["Open Pull Request"]
VERIFY -- Tests Fail --> DIAGNOSE["Diagnose Failure"]
DIAGNOSE --> EDIT
PR --> REVIEW["Human Review\n(optional)"]
REVIEW -- Approved --> MERGE["Merge"]
REVIEW -- Changes Requested --> EDIT
style ISSUE fill:#f0f9ff,stroke:#0ea5e9
style MERGE fill:#f0fdf4,stroke:#22c55e
style VERIFY fill:#fef9c3,stroke:#eab308
Every serious coding agent implements this loop. The differences are in: how well they index and search your codebase, how they manage large context windows across multiple files, how they handle test execution feedback, and how they know when to stop trying.
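The edit-verify portion of that loop is a small retry skeleton. Again, this is an illustrative sketch rather than any agent's real implementation — `propose_patch` stands in for the LLM edit step and `run_tests` for sandboxed test execution:

```python
def fix_issue(issue: str, propose_patch, run_tests, max_attempts: int = 5):
    """Iterate edit -> test -> diagnose until tests pass or we give up."""
    feedback = None  # test output from the previous failed attempt
    for _ in range(max_attempts):
        patch = propose_patch(issue, feedback)  # LLM generates a candidate patch
        passed, feedback = run_tests(patch)     # sandboxed execution of the test suite
        if passed:
            return patch                        # ready for a PR and human review
    raise RuntimeError(f"no passing patch after {max_attempts} attempts")
```

The "how they know when to stop trying" question from above lives in `max_attempts`: the agents that burn the most tokens in production are the ones that retry indefinitely on ambiguous failures.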
Claude Code
GitHub: anthropics/claude-code — CLI + VS Code extension
Claude Code is Anthropic's official coding agent and the one we use most heavily in our own development. It is not purely open-source in the model layer — it calls the Claude API — but the CLI and tooling are open-source and the architecture is transparent.
What sets Claude Code apart is how it handles codebase understanding. Rather than relying solely on semantic search, it uses a tree-sitter based syntax analysis layer to understand file structure before diving into content. Combined with Claude's 200K context window, it can hold the relevant parts of a large codebase in context across a multi-step editing session.
The integration with Cursor Automations is notable — you can use Claude Code as the underlying model in Cursor's agent pipeline, giving you a continuous deployment-aware coding agent that triggers on commits, not just manual prompts.
What it is actually good at:
- Multi-file refactoring (best-in-class for understanding cross-file dependencies)
- Bug diagnosis with stack trace context
- Writing tests for existing code
- Documentation generation from code
- Explaining unfamiliar codebases quickly
Production readiness: 5/5
The polish, the maintenance cadence, and the community support make this the most reliable coding agent available. It is the benchmark everything else is measured against.
Install: npm install -g @anthropic-ai/claude-code
Docs: docs.anthropic.com/claude-code
Cursor + Automations
Website: cursor.com — not open-source, but ecosystem anchor
Cursor earns a place on this list because it has become the development environment for the agent-native engineering team. The product itself is a VS Code fork with deeply integrated AI capabilities. The recent Automations launch changed the category — Cursor can now run coding agents continuously in response to triggers (git commits, Slack messages, timers, PagerDuty alerts) rather than only when a developer prompts it.
We cover the architecture of Cursor Automations in depth in our dedicated article. The short version: Cursor treats coding agents as infrastructure that runs alongside your CI/CD pipeline, not as a developer tool you invoke manually.
Production readiness: 5/5 (for teams)
The caveat is pricing. Cursor is not free at scale. For solo developers or small teams, the economics are fine. For large engineering organizations, the cost-per-seat math requires a build-vs-buy analysis that depends heavily on how much agent usage your team generates.
Aider
GitHub: paul-gauthier/aider — 25K+ stars
Aider is the battle-tested open-source coding agent. It has been in active development since before the current agent boom, which means it has worked through problems that newer projects have not yet encountered.
The core interface is a terminal REPL that works with any Git repository. You invoke Aider, it maps your codebase, and you have a conversation with it about your code. The key capability is its unified diff approach — it generates well-structured diffs that apply cleanly rather than rewriting entire files, which dramatically reduces the risk of breaking changes in areas you did not intend to touch.
What it is actually good at:
- Solo developer productivity (best terminal-native experience)
- Incremental changes to existing codebases (its diff approach is excellent)
- Working with any model (supports Claude, GPT-5, Gemini, local models via Ollama)
- Fully offline-capable with local model backend
Production readiness: 4/5
Aider is the right choice if you want a terminal-native, model-agnostic coding agent without a subscription. Its SWE-bench numbers are strong relative to any project without a closed-source API as the default backend. The primary gap vs. Claude Code is context management on very large codebases — Aider's repo map heuristic is excellent but occasionally misses distant dependencies.
Install: pip install aider-install && aider-install
Docs: aider.chat
OpenHands
GitHub: All-Hands-AI/OpenHands — 45K+ stars
OpenHands is the fully autonomous coding agent — the open-source answer to Devin. Where Aider and Claude Code are tools that assist a developer, OpenHands is designed to complete software engineering tasks end-to-end with minimal human intervention.
The architecture is more ambitious: OpenHands runs a full development environment (terminal, browser, file system) in a Docker container, and the agent operates inside that environment. It can write code, run tests, browse the web for documentation, install packages, and iterate until the task is complete.
This is both its strength and its risk profile. Because OpenHands has a full environment, it can handle tasks that require multi-step, multi-tool workflows. Because it has a full environment, a hallucinating agent can also install unwanted packages, modify files outside its intended scope, or exhaust resources.
What it is actually good at:
- End-to-end feature implementation from a clear spec
- Greenfield project scaffolding
- Automated bug fixes for well-specified issues
- Research and implementation tasks where browsing docs is part of the workflow
Production readiness: 3/5
OpenHands is impressive in demos and requires care in production. The Docker-sandboxed execution model is the right approach for safety — and it connects naturally with projects like OpenSandbox for stronger isolation guarantees. But the agent can get stuck in loops, consume excessive tokens on ambiguous tasks, and produce over-engineered solutions when the problem was simple. Use it for bounded, well-specified tasks with a human review step before any output goes to production.
Docs: docs.all-hands.dev
Cline
GitHub: clinebot/cline — 20K+ stars
Cline (formerly Claude Dev) is a VS Code extension that brings agentic coding into the editor without the subscription model of Cursor. It uses the VS Code API directly and your own model API keys, making it the best choice for developers who want editor-integrated AI without per-seat pricing.
The user experience is closer to Claude Code than to Aider — you get a chat interface inside VS Code that can read, write, and execute code in your workspace. Cline's standout feature is its approval workflow: by default, it asks permission before writing files or running commands, which makes it safe to use in codebases where you cannot afford surprises.
Production readiness: 4/5
Cline is the best free, editor-integrated option. The approval workflow adds friction but is the right default for production codebases. Advanced users can configure auto-approval for lower-risk actions (reads, test runs) and keep manual approval for writes.
Continue
GitHub: continuedev/continue — 22K+ stars
Continue is less of a coding agent and more of an AI coding copilot that you can configure to behave like one. It supports VS Code and JetBrains, connects to any model provider, and lets you define custom commands, context providers, and slash commands.
Think of Continue as the open-source Cursor alternative that maximizes configurability at the cost of out-of-the-box polish. If you work in a JetBrains IDE, Continue is the only serious option. If you work in VS Code and want full control over your AI tooling without a subscription, Continue is the alternative to evaluate before Cursor.
Production readiness: 4/5
Docs: docs.continue.dev
Windsurf and Devin
Windsurf (Codeium) and Devin (Cognition AI) deserve mention even though neither is open-source.
Windsurf is a VS Code fork like Cursor but with a flow-based collaboration model — the "Flow" interface tries to show you what the agent is thinking while it works. For teams already sold on AI-native IDEs but not on Cursor, Windsurf is the credible alternative.
Devin positions itself as the first "AI software engineer" — a fully autonomous agent with its own development environment, browser, and persistent memory. In our testing, Devin is better at autonomous task completion than OpenHands for the kinds of tasks Cognition has optimized for, but significantly more expensive and not self-hostable. It earns its keep for large engineering teams that want to delegate well-specified issues entirely.
Research agents: from prototype to pipeline
Research agents automate the process of finding, reading, synthesizing, and presenting information from the web and from documents. The category ranges from "smarter Google" to "automated analyst."
STORM
GitHub: stanford-oval/storm — 18K+ stars
STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking) is the most academically rigorous research agent available. Developed at Stanford, it mimics the Wikipedia editing process: it generates multiple perspectives on a topic, researches each perspective separately using web search, then synthesizes a comprehensive article with citations.
The output quality is genuinely impressive for structured knowledge synthesis. STORM does not just retrieve and paste — it reasons about what different perspectives would say about a topic, finds sources that represent each perspective, and writes a coherent synthesis with proper attribution.
What it is actually good at:
- Deep dives on specific topics where you need multiple perspectives
- Competitive analysis and market research
- Technical topic synthesis for teams entering a new domain
- Automated first draft of documentation or briefings
Production readiness: 3/5
STORM is research-grade software that works well but requires care in deployment. The multi-perspective research loop is expensive in API calls. For one-off deep research it is hard to beat. For continuous pipelines at scale, cost management requires engineering work.
Docs: storm.genie.stanford.edu
GPT-Researcher
GitHub: assafelovic/gpt-researcher — 18K+ stars
GPT-Researcher is the production-friendly research agent. It runs a structured research loop: generate search queries → search the web → scrape relevant pages → extract information → synthesize a report with citations. The whole loop takes 2-5 minutes and produces a structured markdown report.
The architecture is designed for reliability over academic rigor. It runs multiple parallel searches, aggregates results, and handles failures gracefully. The output is not as nuanced as STORM's multi-perspective synthesis, but it is far more consistent and significantly cheaper to run.
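That fan-out-and-tolerate-failure pattern is worth internalizing even if you use GPT-Researcher off the shelf. A stdlib-only sketch with a stubbed `search` backend (not GPT-Researcher's internals — just the shape of the pattern):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def gather_results(queries, search, timeout=30):
    """Run searches in parallel; a failed query drops out of the
    aggregate instead of failing the whole research run."""
    results = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(search, q): q for q in queries}
        for fut in as_completed(futures, timeout=timeout):
            try:
                results.extend(fut.result())
            except Exception:
                pass  # log and continue; one dead backend should not kill the report
    return results
```

In a real pipeline you would log the dropped query rather than silently swallowing it, but the principle stands: research synthesis degrades gracefully with fewer sources, while a single uncaught search failure costs you the entire report.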
What it is actually good at:
- Automated competitive intelligence pipelines
- Market research reports (given a company or topic, produce a briefing)
- News monitoring and synthesis
- Due diligence support
Production readiness: 4/5
GPT-Researcher supports multiple search backend options (Tavily, Bing, Google, SerpAPI, DuckDuckGo) and multiple model providers. For teams building research pipelines, the flexibility and reliability make it the default choice.
Install: pip install gpt-researcher
Docs: docs.gptr.dev
Exa AI
Website: exa.ai — API service with open SDK
Exa AI deserves a mention here because it solves the search problem that plagues most research agents. Standard web search APIs (Bing, Google) return SEO-optimized pages that are often not the most informative sources. Exa's embeddings-based search is specifically designed to find the most semantically relevant pages for a given query — including academic papers, forum discussions, and technical documentation that standard search buries.
For research agents that need high-quality sources (not just high-ranking pages), Exa is worth the additional API cost. The Python and TypeScript SDKs integrate cleanly with LangChain, LangGraph, and direct OpenAI/Anthropic function calling.
Production readiness: 5/5 (as an API component)
Perplexity API
Perplexity's API exposes their search-augmented generation directly. For research tasks where you want a single, well-sourced answer rather than a full research report, the Perplexity API is the lowest-friction option. It handles the retrieve-and-synthesize loop internally, returning answers with citations in one API call.
The limitation is control — you cannot customize the retrieval strategy, filter sources, or inspect intermediate results. For production research pipelines with quality requirements, GPT-Researcher or a custom LangGraph chain over Exa gives you more control. For quick research lookups inside an agent workflow, the Perplexity API is often the fastest path.
Orchestration frameworks: how to wire it all together
Orchestration frameworks handle the control plane — how you sequence steps, manage state, coordinate between multiple agents, and handle failures. Choosing the wrong orchestration layer is the most expensive mistake in building an agent system.
LangGraph
GitHub: langchain-ai/langgraph — 10K+ stars
LangGraph is the orchestration choice we recommend for most production deployments. The core abstraction is a directed graph where nodes are functions (including LLM calls, tool calls, and regular Python) and edges are transitions with optional conditional logic. State flows through the graph and is persisted at each step.
What LangGraph gets right that most alternatives get wrong is state management with interruption recovery. Long-running agent workflows fail. Models hallucinate. APIs return errors. A production-grade agent system needs to be able to pause, inspect state, recover, and resume. LangGraph's checkpoint system gives you this out of the box.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    messages: list
    research_results: list
    draft: str

def research_node(state: AgentState) -> dict:
    # call a search API here; return a partial state update
    return {"research_results": ["..."]}

def write_node(state: AgentState) -> dict:
    # use an LLM here to write the draft from the research results
    return {"draft": "..."}

graph = StateGraph(AgentState)
graph.add_node("research", research_node)
graph.add_node("write", write_node)
graph.set_entry_point("research")
graph.add_edge("research", "write")
graph.add_edge("write", END)
app = graph.compile()  # compile() returns the runnable; app.invoke({...}) executes it
Production readiness: 5/5
The learning curve is real — LangGraph requires you to think explicitly about state shape and graph topology, which feels verbose for simple use cases. The payoff is that complex multi-agent systems become maintainable, debuggable, and testable. For anything beyond a single-loop agent, LangGraph is our default recommendation.
Docs: langchain-ai.github.io/langgraph
CrewAI
GitHub: crewAIInc/crewAI — 26K+ stars
CrewAI takes a role-based abstraction rather than a graph abstraction. You define a crew of agents, each with a role, goal, and backstory, and assign them tasks. CrewAI handles the coordination — agents can collaborate, hand off to each other, and use tools. The interface feels more like defining a team than wiring a graph.
The appeal is accessibility. CrewAI code reads like a description of what you want to happen, not like a state machine specification. For teams new to multi-agent systems, the mental model is lower friction.
The limitation is that the accessibility comes at the cost of control. When something goes wrong in a CrewAI pipeline, understanding exactly why is harder than in LangGraph where every node and edge is explicit. For complex, production-critical systems, the debugging overhead is a real cost.
Production readiness: 4/5
CrewAI is the right choice for rapid prototyping, business logic agents where the tasks are relatively simple, and teams without deep LangGraph experience. For production systems with complex state, LangGraph is worth the learning investment.
Docs: docs.crewai.com
AutoGen / AG2
GitHub: microsoft/autogen — 38K+ stars (AG2 fork: ag2ai/ag2)
AutoGen pioneered the conversational multi-agent pattern — agents that coordinate by sending messages to each other, rather than through a centralized orchestrator. For problems that map naturally to a team conversation (one agent writes code, another reviews it, a third tests it), AutoGen's communication model is intuitive.
The AG2 fork emerged from a community disagreement about the direction of the project and has taken a more opinionated stance on production features: better async support, improved observability, and cleaner agent lifecycle management.
Production readiness: 3/5 (AutoGen) / 4/5 (AG2)
The conversational model is more expensive in tokens than a tight graph (agents repeat context to each other in messages) and harder to reason about in production. Use AutoGen/AG2 when the conversational collaboration pattern genuinely fits your domain — code review loops, debate-and-synthesis workflows, adversarial evaluation pipelines.
OpenAI Agents SDK
GitHub: openai/openai-agents-python — 7K+ stars
OpenAI's Agents SDK is the newest entrant in this category and the most opinionated. It provides a clean Python API for defining agents with handoffs, guardrails, and tracing built in. The design philosophy is "batteries included" — you get observability, token tracking, and agent communication protocols without additional setup.
The limitation is obvious: it is optimized for OpenAI's own models. While it technically supports other providers, the tight integration with GPT-5's function calling and the Responses API means you get the best experience when you stay in the OpenAI ecosystem.
Production readiness: 4/5
For teams committed to OpenAI models and building production multi-agent systems, the SDK's built-in tracing and guardrails are genuinely valuable. For teams that need model flexibility, LangGraph is the better choice.
Docs: openai.github.io/openai-agents-python
Claude Agent SDK
Anthropic's Claude Agent SDK (the framework powering this very platform) provides similar primitives for Claude-based agent systems — tool definitions, agent handoffs, and conversation management — with first-class support for the MCP ecosystem. For systems where you want Claude's extended context and reasoning capabilities as the orchestration layer, it is the natural foundation.
Infrastructure layer: sandboxing, observability, tool integrations
Building agents without production infrastructure is like running a web application without monitoring. The infrastructure layer is what separates a demo from a deployed product.
E2B — Code Execution Sandboxing
GitHub: e2b-dev/e2b — 7K+ stars
Website: e2b.dev
E2B is the production standard for sandboxed code execution in agent systems. When your coding agent generates and runs Python, you cannot let it execute on your host machine. E2B spins up isolated microVMs in milliseconds, executes the code, returns the output, and disposes of the environment — with full isolation guarantees.
The Python and TypeScript SDKs are clean, the latency is low (sub-500ms cold start), and the pricing is reasonable for typical coding agent workloads. E2B has become a de facto dependency for any agent that generates and executes code.
Production readiness: 5/5
from e2b_code_interpreter import Sandbox

sbx = Sandbox()  # provisions a fresh isolated microVM
execution = sbx.run_code("import pandas as pd; print(pd.__version__)")
print(execution.text)
sbx.kill()  # tear the sandbox down when finished
Install: pip install e2b-code-interpreter
Docs: e2b.dev/docs
Modal — Serverless GPU/CPU for Agents
Website: modal.com
Modal solves a different problem: when your agent needs to run computationally expensive operations — inference on open-weight models, vector index construction, heavy data processing — you want serverless execution that scales to zero and back up in seconds.
Modal's Python decorator model is particularly clean for agent tasks:
import modal

app = modal.App("research-agent")

@app.function(gpu="T4", timeout=300)
def run_embedding_pipeline(docs: list[str]) -> list[list[float]]:
    # heavy embedding work runs on GPU, isolated from your agent host
    ...
For production agent systems where you want to run open-weight models as tool calls inside your pipeline, Modal is the most developer-friendly option. For teams exploring the build vs. buy decision on the model layer, Modal makes running your own models economically viable.
Production readiness: 5/5
Composio — Unified Tool Integrations
GitHub: ComposioHQ/composio — 12K+ stars
Website: composio.dev
Composio addresses the integration problem: your agent needs to call 30 different external services (GitHub, Slack, Linear, Salesforce, Google Workspace, etc.) and each one has different authentication, rate limits, and API shape.
Composio provides a unified tool interface where each external service becomes a standardized action with automatic auth handling, rate limit management, and schema generation for LLM function calling. The difference between "I need to build 30 integrations" and "I need to add Composio to my requirements.txt" is weeks of engineering time.
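The core idea behind a unified tool layer can be sketched in a few lines. The names below are hypothetical for illustration, not Composio's actual API: each external action is normalized into one shape that the LLM function-calling layer can consume, with auth and rate limiting handled behind the `run` callable.

```python
# Sketch of the unified-tool-interface pattern (hypothetical names, not
# Composio's actual API): every external service action is normalized into
# one shape the LLM function-calling layer can consume.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Action:
    name: str                # e.g. "slack_send_message"
    description: str         # surfaced to the LLM
    parameters: dict         # JSON schema for function calling
    run: Callable[..., Any]  # handles auth + rate limits internally

REGISTRY: dict[str, Action] = {}

def register(action: Action) -> None:
    REGISTRY[action.name] = action

def to_llm_tools() -> list[dict]:
    """Export every registered action as a function-calling schema."""
    return [
        {"type": "function",
         "function": {"name": a.name, "description": a.description,
                      "parameters": a.parameters}}
        for a in REGISTRY.values()
    ]

register(Action(
    name="slack_send_message",
    description="Send a message to a Slack channel",
    parameters={"type": "object",
                "properties": {"channel": {"type": "string"},
                               "text": {"type": "string"}},
                "required": ["channel", "text"]},
    run=lambda channel, text: {"ok": True},  # real version calls the Slack API
))
```

The value of a service like Composio is everything hidden inside that `run` callable: OAuth flows, token refresh, retries, and rate-limit backoff for each of the 30 services.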
Production readiness: 4/5
The primary caveat is vendor dependency. Your agent's tool capabilities are gated on Composio's service being available. For most teams building agent products (as opposed to infrastructure), this is an acceptable tradeoff. For teams building agent infrastructure for others, you may want to build a subset of integrations natively.
Docs: docs.composio.dev
Observability: AgentOps, LangSmith, Helicone, Langfuse
Agent observability is a separate discipline from traditional application monitoring. You need to see: token usage per step, latency breakdown by agent node, which tool calls are failing, which prompts are producing unexpected outputs, and how cost scales with usage.
AgentOps (agentops.ai) is framework-agnostic and installs with two lines of code. It captures agent session replays — you can watch, step by step, exactly what your agent did on any given run. This is invaluable for debugging.
LangSmith (smith.langchain.com) is the natural choice if you are already using LangGraph or LangChain. The tight integration with LangGraph's tracing layer means you get detailed run trees with minimal configuration.
Helicone (helicone.ai) is the right choice for cost visibility. It routes all your LLM API calls through a proxy that tracks cost by model, by prompt template, and by user — essential for teams building multi-tenant agent products where you need to understand cost per customer.
Langfuse (langfuse.com) is the open-source alternative that you can self-host. If your data cannot leave your infrastructure, Langfuse gives you production-grade observability without sending traces to a third-party service.
Production readiness: All four are production-ready at 4/5 or 5/5. The choice depends on your stack and data residency requirements.
How to pick the right stack for your use case
The taxonomy above covers 20+ projects. Here is a decision framework to narrow it down quickly.
If you are building a browser automation product:
- Core automation: Browser Use (Python) or Stagehand (TypeScript)
- Orchestration: LangGraph or direct API calls for simple workflows
- Infrastructure: E2B for any code generated, Helicone for cost tracking
If you are building a coding agent product:
- Core agent: OpenHands for autonomous tasks, Cline/Continue for developer-in-the-loop
- Orchestration: Build on Claude Code's SDK or integrate with Cursor Automations
- Infrastructure: E2B is non-negotiable for code execution, LangSmith for debugging
If you are building a research pipeline:
- Core research: GPT-Researcher for structured reports, STORM for deep synthesis
- Search layer: Exa AI for quality, Perplexity API for convenience
- Orchestration: LangGraph for complex pipelines, direct API calls for simple ones
If you are building a multi-agent product:
- Orchestration: LangGraph for complex state, CrewAI for simpler role-based systems
- Communication: OpenAI Agents SDK if OpenAI-only, Claude Agent SDK for Anthropic-first
- Observability: AgentOps from day one, not as an afterthought
If you are building agent infrastructure:
- Sandboxing: E2B (managed) or OpenSandbox (self-hosted)
- Tool integrations: Composio for quick time-to-market, custom for critical integrations
- Compute: Modal for GPU workloads, standard cloud for CPU-bound tasks
The pattern you should notice: most production agent systems combine two or three layers. A good framework choice is worthless without good infrastructure. Great infrastructure does not fix a poorly designed orchestration graph. The stack compounds.
For teams who want to understand why this tooling ecosystem is commercially significant, the SaaS replacement angle explains why companies are racing to build on these frameworks rather than just using existing software. And the multi-agent orchestration product architecture post covers how to design systems that actually hold up at scale.
Summary table: production readiness ratings
Project | Category | Production readiness
AutoGen | Orchestration | 3/5
AG2 | Orchestration | 4/5
OpenAI Agents SDK | Orchestration | 4/5
OpenHands | Coding | 3/5
E2B | Infrastructure | 5/5
Modal | Infrastructure | 5/5
Composio | Infrastructure | 4/5
AgentOps / LangSmith / Helicone / Langfuse | Observability | 4-5/5
Frequently asked questions
Q: Which open-source AI agent project has the most GitHub stars?
As of March 2026: Browser Use leads the browser category at 50K+ stars, OpenHands leads coding at 45K+, and AutoGen/AG2 leads orchestration at 38K+. Star counts are a popularity signal, not a quality signal. OpenHands at 45K stars requires more production engineering than Aider at 25K stars.
Q: Can I run these agents with local models instead of API models?
Yes, with caveats. Aider, Continue, and Cline all support local models via Ollama or direct GGUF loading. LangGraph and CrewAI are model-agnostic and work with any OpenAI-compatible endpoint. The practical limit is model quality: most local models (Llama 4 Scout, Mistral, Phi-3 Medium) produce noticeably worse results on complex multi-step agent tasks than frontier API models. For tasks where quality matters, plan your local-model strategy carefully. For tasks where privacy or cost is the primary constraint, local models are viable.
Q: How do I handle the cost of running browser agents at scale?
Browser agents that call a vision model at each step are expensive. A 20-step browser task with a screenshot at each step can cost $0.50-2.00 per run depending on the model and image resolution. At scale, this adds up fast.
Cost optimization strategies in order of impact:
- Use DOM extraction (cheap) as the first-pass perception, only fall back to screenshot (expensive) when DOM fails
- Cache observation results for pages that have not changed between steps
- Use a smaller model (Claude Haiku, GPT-4o mini) for simple action steps, reserve large models for planning
- Set hard token budgets per task and fail explicitly rather than letting runaway agents generate unbounded costs
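The last strategy deserves a sketch, because it is the one teams skip until the first runaway bill. A hypothetical guard, not a library API — adapt the numbers to your stack:

```python
# Sketch of a hard per-task token budget (hypothetical guard, not a library
# API): the agent loop fails loudly instead of generating unbounded costs.
class TokenBudgetExceeded(RuntimeError):
    pass

class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.spent = 0

    def charge(self, tokens: int) -> None:
        self.spent += tokens
        if self.spent > self.limit:
            raise TokenBudgetExceeded(
                f"spent {self.spent} of {self.limit} tokens; aborting task")

budget = TokenBudget(limit=50_000)
steps_completed = 0
try:
    for step in range(100):       # the agent loop
        budget.charge(1_200)      # stand-in for real per-step usage numbers
        steps_completed += 1
except TokenBudgetExceeded as exc:
    print(f"aborted after {steps_completed} steps: {exc}")
```

An explicit `TokenBudgetExceeded` in your logs is debuggable; a surprise line item on next month's invoice is not.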
Q: What is the difference between Browser Use, Stagehand, and a direct Playwright MCP integration?
Browser Use gives you a Python-first, agent-centric interface where you describe a goal and the agent figures out how to navigate to it. Stagehand gives you a TypeScript SDK with explicit methods (page.act(), page.extract()) that use AI underneath but are called from your code like a library. Playwright MCP gives you Playwright's full API exposed as MCP tools — maximum control, minimum abstraction.
For rapid prototyping: Browser Use. For production TypeScript code: Stagehand. For MCP-native architectures: Playwright MCP.
Q: Is OpenHands actually production-ready?
With strong caveats. OpenHands can complete software engineering tasks that no other open-source tool can match in scope. But deployed without guardrails, it shows a higher failure rate, higher token cost, and more unpredictable behavior than its 3/5 rating implies.
The safe deployment model: use OpenHands for well-specified tasks in a sandboxed environment (E2B or Docker), always require human review before merging any output, set hard token budgets, and log every action for audit. With those guardrails, OpenHands is a genuinely valuable tool. Without them, it is a source of expensive surprises.
Q: What is the SWE-bench benchmark and why does it matter for choosing a coding agent?
SWE-bench Verified is a benchmark of 500 real GitHub issues from 12 popular open-source Python repositories. Each issue comes with a failing test. The task: produce a code patch that makes the failing test pass without breaking existing tests. It measures the core coding agent capability: understanding a codebase and making targeted changes to fix a real problem.
It matters because it is the closest publicly available measure to "can this agent actually fix bugs in real code." GPT-5's 88% and Claude Opus 4's 72% represent meaningful capability differences in production. DeepSeek V3.2 at 70.2% is the best open-weight alternative.
The benchmark's limitation: it uses well-maintained codebases with clear test suites. Production benchmarking on your own codebase will reveal gaps. Use SWE-bench scores as a starting point, not a final answer.
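The pass criterion is worth stating precisely, since "88% on SWE-bench" is often quoted without it. A sketch of the scoring rule — not the official harness — is:

```python
# Sketch of SWE-bench's resolution criterion (not the official evaluation
# harness): a patch counts as resolved only if the previously failing
# tests now pass AND none of the previously passing tests regress.
def resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """Each dict maps a test name to whether it passed after the patch."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

print(resolved({"test_fix": True}, {"test_a": True, "test_b": True}))   # True
print(resolved({"test_fix": True}, {"test_a": True, "test_b": False}))  # False: a regression
```

Replicating this on your own repository — take real closed issues with known-good fixes, and score candidate agents the same way — is the cheapest meaningful evaluation you can run before committing to a coding agent.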
Q: How do I choose between LangGraph and CrewAI for my use case?
Use LangGraph if: your workflow has complex state that must be explicitly managed, you need checkpoint-and-resume for long-running tasks, you want fine-grained control over every transition, or you are building something that will run in production at scale where debugging matters.
Use CrewAI if: your workflow maps naturally to a team of roles with distinct responsibilities, you want to ship a prototype quickly and see if the concept works, your team is new to agent orchestration and prefers readable code over explicit state machines.
In practice, many teams prototype in CrewAI and migrate to LangGraph when they hit the first serious production debugging session.
Q: What AI agent projects are worth watching that are not covered here?
A few that we considered but left out for scope:
- Magentic-One (Microsoft Research) — a multi-agent system for complex web and file tasks, technically impressive, not yet production-ready
- SWE-agent (Princeton NLP) — the research project that established many of the techniques used in production coding agents
- Smolagents (HuggingFace) — a minimal framework that prioritizes simplicity; worth watching for the open-model ecosystem
- AgentBench — a benchmark suite, not a framework, but useful for evaluating your own agent systems against standardized tasks
The common thread: the projects not on our main list are either genuinely research-grade (impressive but not production-ready) or too new to have established reliability in production. Watch them. Do not build your product on them yet.
What comes next
The open-source agent ecosystem in 2026 is the best it has ever been. It is also the noisiest it has ever been.
The signal through the noise: a handful of projects in each category have earned production credibility through sustained maintenance, large-scale deployment, and honest public discussion of their limitations. Those are the projects worth your time. Everything else is a bet, and the odds are worse than the star counts suggest.
For the commercial implications of this tooling landscape — how it creates new startup opportunities and threatens existing SaaS — we have written about agents replacing SaaS categories and why the agent startup window is real. The technical foundation described in this guide is what makes those business opportunities possible.
Build on proven infrastructure. Ship early. Measure real production performance against your actual workloads. The agents that create value are the ones that work reliably, not the ones that work impressively on a demo.
All star counts are approximate as of March 2026. Production readiness ratings reflect our assessment based on maintenance cadence, community size, breaking change history, and real-world deployment reports. Your mileage will vary based on use case and operational requirements.