TL;DR: AI agents are black boxes by default. You deploy a system that makes API calls, uses tools, loops through reasoning steps, and coordinates across sub-agents — and without observability tooling, you have no idea why it succeeded, why it failed, or how much it cost. Traditional APM tools don't cut it: they weren't built for non-deterministic LLM calls, multi-hop tool chains, or agents that spawn other agents. This guide covers the observability stack you actually need — tracing, cost tracking, debugging multi-agent failures, alerting on quality degradation, and the seven metrics every production agent system must track. We compare six platforms (LangSmith, Helicone, Langfuse, AgentOps, Braintrust, Phoenix/Arize), show you how to instrument with OpenTelemetry, and explain why nobody has fully solved the "Datadog for agents" problem yet.
Table of Contents
- Why agent observability is different
- What you actually need to observe
- The agent observability stack
- Platform comparison: the six main players
- OpenTelemetry for agents
- Code examples: instrumentation patterns
- Tracing multi-agent workflows
- Cost tracking per agent run
- Debugging multi-agent coordination failures
- Alerting on quality degradation
- Audit logging for compliance
- Replay and time-travel debugging
- The 7 production metrics you must track
- The "Datadog for agents" problem
- Frequently asked questions
Why agent observability is different
If you have run production web services, you know what APM looks like. A request comes in, it hits your API, maybe calls a database, returns a response. Latency, error rate, throughput — three numbers tell the whole story. Tools like Datadog, New Relic, and Honeycomb were built for this model, and they do it well.
AI agents break every assumption that model is built on.
Here is what an agent actually does when you call it:
- Receives a natural-language instruction
- Sends a prompt to an LLM (nondeterministic, variable latency, token-based cost)
- Receives a tool call request from the model
- Executes the tool (external API, database, file system)
- Returns tool output to the model
- Model reasons again — potentially calling more tools
- Model decides to spawn a sub-agent for part of the task
- Sub-agent runs its own loop (go back to step 2)
- Sub-agent returns result to orchestrator
- Orchestrator synthesizes, loops again, or terminates
A single user request might produce 50 LLM calls, 200 tool invocations, 8 sub-agent spawns, and 15 minutes of wall-clock execution time. The "request" is not a single thing — it is a tree.
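The loop above can be sketched in miniature. Everything here — the `FakeModel`, the tool registry, the step limit — is hypothetical scaffolding to show the shape of the loop, not any framework's API:

```python
# Minimal sketch of the agent loop described above (hypothetical names,
# no real LLM): each step either requests a tool or returns a final answer.
from dataclasses import dataclass

@dataclass
class Step:
    kind: str      # "tool_call" or "final"
    payload: str   # tool name, or the final answer text

class FakeModel:
    """Stands in for the nondeterministic LLM with scripted responses."""
    def __init__(self, script: list[Step]):
        self.script = list(script)
    def next_step(self, context: list[str]) -> Step:
        return self.script.pop(0)

def run_agent(model: FakeModel, tools: dict, max_steps: int = 10) -> tuple[str, list[str]]:
    """Run the loop; return the final answer plus a flat event log."""
    context, events = [], []
    for _ in range(max_steps):
        step = model.next_step(context)
        if step.kind == "final":
            events.append(f"final:{step.payload}")
            return step.payload, events
        # Tool call: execute, record it, feed output back into the context
        output = tools[step.payload](step.payload)
        events.append(f"tool:{step.payload}")
        context.append(output)
    return "max_steps_exceeded", events

tools = {"search": lambda q: f"results for {q}"}
model = FakeModel([Step("tool_call", "search"), Step("final", "done")])
answer, events = run_agent(model, tools)
print(answer)   # done
print(events)   # ['tool:search', 'final:done']
```

The event log is flat here for brevity; with sub-agents each spawn adds a subtree, which is why traces must be trees rather than request/response pairs.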
Traditional APM instruments request/response cycles. It has no concept of "agent reasoning step" or "tool call context" or "why did the agent loop 12 times before failing." You cannot instrument an agent with a Datadog APM agent the same way you instrument a Rails app, because the unit of work is completely different.
The specific problems you face:
Non-determinism. The same input might produce different outputs. You can't test this in the traditional sense — you need to evaluate whether outputs are "good enough" across a distribution. That requires LLM-as-judge or human eval pipelines, which are separate from your APM.
Variable execution paths. A linear service has predictable call trees. An agent might follow a 3-step path or a 30-step path depending on what it encounters. Your visualization needs to handle tree structures with arbitrary depth, not flat request spans.
LLM-specific costs. Costs are input tokens + output tokens + embeddings + tool calls, aggregated across potentially dozens of model invocations per user request. No traditional APM tracks this.
Quality vs uptime. Traditional APM tells you if your service is up. For agents, being "up" is not enough — you need to know if the output quality is degrading. A hallucinating agent running at 99.9% uptime is worse than a failed request that returns a clear error.
Handoff context. When Agent A hands off to Agent B, you need the trace context to propagate. If it doesn't, you get disconnected traces that look like separate requests — you lose the causal chain.
All of this means you need purpose-built observability tooling for agents. The good news is that the ecosystem has moved fast. By early 2026, there are at least six serious platforms for this problem. The bad news is that none of them has fully solved it yet — we'll get to that.
If you're building on the agent startup opportunity (covered in detail at /blog/ai-agent-startup-opportunity), observability is not optional. It's the difference between a product you can iterate on and a product that randomly fails in ways you can't diagnose.
What you actually need to observe
Before picking a platform, be precise about what you need to observe. There are four layers:
Layer 1: LLM calls. Every request to an LLM — the prompt, the completion, the model used, token counts, latency, cost. This is the foundation. Every platform covers this.
Layer 2: Tool calls. When the agent calls a function (search, database query, API call), you need to record what was called, the arguments, the output, and the latency. The correlation between LLM call → tool call → LLM call is critical for understanding agent behavior.
Layer 3: Agent steps. The higher-level abstraction of "reasoning step." An agent might make 3 LLM calls and 5 tool calls in a single reasoning step. You want to group these logically so you can see the agent's decision process, not just a flat list of calls.
Layer 4: Session/run level. The full trace from user input to final output, spanning all steps, all sub-agents, and all tool invocations. This is what you correlate to business outcomes (did the task succeed? did the user accept the result? what was the total cost?).
Most teams start with Layer 1, think they're done, and then discover 6 months later that they can't debug Layer 3 and 4 failures. Build the full stack from day one — retrofitting observability into agents is painful.
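The four layers map naturally onto a nested record structure. A sketch — the field names are illustrative, not any platform's schema:

```python
# The four observation layers as nested records (illustrative field names).
from dataclasses import dataclass, field

@dataclass
class LLMCall:                      # Layer 1: one request to a model
    model: str
    input_tokens: int
    output_tokens: int

@dataclass
class ToolCall:                     # Layer 2: one function/tool invocation
    name: str
    latency_ms: float

@dataclass
class AgentStep:                    # Layer 3: groups the calls of one reasoning step
    name: str
    llm_calls: list[LLMCall] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)

@dataclass
class Session:                      # Layer 4: the full run, across all steps
    session_id: str
    steps: list[AgentStep] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(c.input_tokens + c.output_tokens
                   for s in self.steps for c in s.llm_calls)

session = Session("run-1", steps=[
    AgentStep("plan", llm_calls=[LLMCall("gpt-4o", 900, 100)]),
    AgentStep("search", llm_calls=[LLMCall("gpt-4o", 1200, 300)],
              tool_calls=[ToolCall("web_search", 840.0)]),
])
print(session.total_tokens())  # 2500
```

Aggregations like cost per task live at Layer 4, but only exist if Layers 1-3 were captured and nested correctly — which is the retrofitting pain mentioned above.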
The agent observability stack
Here is the reference architecture we use for production agent systems:
graph TB
subgraph "Agent System"
U[User Request] --> O[Orchestrator Agent]
O --> SA[Sub-Agent A]
O --> SB[Sub-Agent B]
SA --> T1[Tool: Search]
SA --> T2[Tool: Database]
SB --> T3[Tool: Code Exec]
SB --> T4[Tool: File Write]
end
subgraph "Instrumentation Layer"
OT[OpenTelemetry SDK]
OT --> SPAN1[Spans: LLM Calls]
OT --> SPAN2[Spans: Tool Calls]
OT --> SPAN3[Spans: Agent Steps]
OT --> SPAN4[Spans: Sessions]
end
subgraph "Observability Platform"
COL[OTel Collector]
COL --> TRACE[Trace Storage]
COL --> METRICS[Metrics Store]
COL --> LOGS[Log Store]
end
subgraph "Dashboards & Alerts"
DASH[Production Dashboard]
ALERT[Alert Manager]
EVAL[Eval Pipeline]
REPLAY[Replay Debugger]
end
O -.->|instrument| OT
SA -.->|instrument| OT
SB -.->|instrument| OT
OT --> COL
TRACE --> DASH
METRICS --> ALERT
LOGS --> EVAL
TRACE --> REPLAY
The key architectural decision is where to put the instrumentation layer. You have three choices:
SDK instrumentation (invasive). You add observability calls directly into your agent code. Maximum control, maximum effort. This is what LangSmith and Langfuse require in the base case.
Proxy instrumentation (minimal). Traffic routes through an observability proxy that intercepts LLM calls transparently. This is Helicone's model — swap your base URL, get instant logging. Low effort, lower visibility (you see LLM calls but not the agent logic between them).
Framework instrumentation (integrated). If you use LangChain, LangGraph, CrewAI, or OpenAI's Agents SDK, there are native integrations that instrument at the framework level. You get agent-level traces without writing instrumentation code yourself.
In practice, you combine these. Use framework-level integration to get the agent structure, add manual spans for the business logic that matters most, and route through a proxy for cost tracking.
Platform comparison: the six main players
Here is an honest comparison of the platforms available in 2026:
LangSmith
What it is: Observability and evaluation platform from the LangChain team. Native integration with LangChain, LangGraph, and OpenAI's Agents SDK.
Best for: Teams already using LangChain/LangGraph. If your agent is built on LangGraph, LangSmith integration is one environment variable.
Strengths:
- Near-zero integration effort for LangChain users (the "zero overhead" claim refers to setup friction, not literal CPU overhead)
- Rich trace visualization with the agent reasoning tree laid out visually
- Eval framework built in — you can run evaluators directly from the traces you capture
- Dataset management — capture production traces as ground truth for evals
- Human feedback annotation workflows
- Prompt management and versioning
Weaknesses:
- Vendor lock-in if you use LangChain-specific features
- Cost scales with volume — can get expensive at high throughput
- The eval UX is good but opinionated about evaluation approaches
Integration: Add LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY to your env. That's it for LangChain. For other frameworks, use the Python/TypeScript SDK.
Docs: smith.langchain.com
Helicone
What it is: Proxy-based LLM observability. Route your OpenAI/Anthropic/Gemini calls through Helicone's proxy, get instant logging, cost tracking, and rate limiting.
Best for: Teams that want minimal integration friction. Particularly good if you're not using a framework and just making raw API calls.
Strengths:
- Truly zero-code integration (change one base URL)
- Cross-provider support — OpenAI, Anthropic, Gemini, Mistral, Cohere from one dashboard
- Cost tracking is best-in-class — accurate, real-time, per-request
- Rate limiting and caching built in
- Custom properties — tag requests with user IDs, session IDs, environment
- Self-hostable (open source version on GitHub)
Weaknesses:
- Proxy adds ~20-50ms latency (small but real)
- Agent-level tracing requires additional SDK work — proxy alone gives you LLM call level, not agent step level
- Less sophisticated eval tooling than LangSmith or Braintrust
Integration: Set base URL to https://oai.helicone.ai/v1 and add Helicone-Auth header. Done.
Docs: docs.helicone.ai
Langfuse
What it is: Open-source LLM observability platform. Self-hostable, with a generous cloud tier.
Best for: Teams with compliance requirements (SOC2, GDPR) that need data sovereignty. Also great for cost-conscious teams — self-hosted is free.
Strengths:
- Fully open source (GitHub: github.com/langfuse/langfuse)
- Self-hostable on your own infra — data never leaves your environment
- Strong tracing model with nested spans that maps well to agent execution trees
- Prompt management, dataset tracking, eval pipelines
- Integration with most agent frameworks
- Active community, fast release cadence
Weaknesses:
- Self-hosting requires operational overhead
- Cloud version is generous but enterprise tiers can get pricey
- UI is functional but less polished than commercial competitors
Integration: pip install langfuse, then use the Langfuse SDK decorator or callback handler.
Docs: langfuse.com/docs
AgentOps
What it is: Observability platform built specifically for AI agents, not adapted from LLM observability. Purpose-built for the multi-agent use case.
Best for: Teams running complex multi-agent systems who want agent-native tooling rather than LLM observability adapted for agents.
Strengths:
- Session replay — watch exactly what your agent did, step by step
- Agent-native concepts: sessions, events, actions, errors
- LLM cost tracking with per-session breakdown
- Works with most agent frameworks (CrewAI, AutoGen, LangChain, OpenAI SDK)
- Compliance tooling built in — audit trails, PII detection
Weaknesses:
- Smaller community and ecosystem than LangSmith/Langfuse
- Eval framework less mature than Braintrust
- Documentation lags behind feature development
Integration: pip install agentops, then agentops.init(api_key="...") at startup.
Docs: docs.agentops.ai
Braintrust
What it is: Eval + observability combined into one platform. Strong emphasis on running evals at scale and connecting them to production traces.
Best for: Teams where evaluation quality is the primary concern — you want to know not just what happened, but whether it was good. Product teams doing A/B testing of prompt changes, new models, or agent architectures.
Strengths:
- Best-in-class eval infrastructure — scored datasets, eval runs, regression detection
- Production tracing + eval in the same platform (traces feed directly into eval datasets)
- Human annotation UI is excellent
- Experiment management — compare different agent configurations side by side
- Good TypeScript/Python SDK parity
Weaknesses:
- More expensive than Langfuse at scale
- Infrastructure/DevOps focused teams might find the eval-first UX unfamiliar
- Less native multi-agent tracing support than AgentOps
Integration: Use Braintrust SDK or the OTEL exporter.
Docs: braintrustdata.com/docs
Phoenix (Arize)
What it is: ML observability platform (Arize) extended to cover LLMs and agents. Phoenix is the open-source version; Arize AI is the commercial cloud.
Best for: Teams that already use Arize for ML model monitoring and want to extend observability to their AI agent layer. Strong for teams with ML engineers on staff.
Strengths:
- Deep ML observability roots — strong on drift detection, data quality monitoring
- Phoenix is fully open source (good for local development)
- OTEL-native — the best OTEL support of any platform in this list
- Embedding visualization (useful for RAG debugging)
- Extends naturally to RAG pipeline observability
Weaknesses:
- UI is more complex than alternatives — ML observability background shows
- Less agent-specific tooling than AgentOps
- The split between open-source Phoenix and commercial Arize creates confusion
Integration: pip install arize-phoenix openinference-instrumentation, then use auto-instrumentation for LangChain/OpenAI/LlamaIndex.
Docs: docs.arize.com/phoenix
OpenTelemetry for agents
The problem with six different platforms is fragmentation. Every platform has its own SDK, its own data model, its own concept of what a "trace" means for an agent. If you instrument for LangSmith today and want to switch to Langfuse tomorrow, you're rewriting instrumentation code.
OpenTelemetry (OTEL) solves this with a vendor-neutral standard for traces, metrics, and logs. The OpenTelemetry AI SIG is working on semantic conventions specifically for LLM calls and agent workflows.
The current draft semantic conventions define attributes like:
gen_ai.system — the LLM provider (openai, anthropic, etc.)
gen_ai.request.model — the model used
gen_ai.request.max_tokens
gen_ai.response.finish_reasons
gen_ai.usage.input_tokens
gen_ai.usage.output_tokens
For agents specifically, the conventions extend to:
gen_ai.agent.id — unique agent identifier
gen_ai.agent.name — human-readable agent name
gen_ai.tool.name — tool being called
gen_ai.tool.call.id — tool call correlation ID
Phoenix/Arize is currently the most OTEL-native of the platforms above. LangSmith has an OTEL exporter. Langfuse has OTEL support via their backend. The ecosystem is converging on OTEL as the instrumentation layer, with platforms differentiating on storage, visualization, and eval capabilities.
Our recommendation: instrument with OTEL semantic conventions from day one, then route to whichever platform fits your team. This keeps your instrumentation code stable even as the platform ecosystem consolidates.
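One way to follow that recommendation is to centralize the convention names so instrumentation code doesn't scatter string literals. The helper below is a sketch — only the attribute keys come from the draft conventions listed above:

```python
# Build a gen_ai.* attribute dict per the draft semantic conventions above.
# The helper function itself is illustrative, not part of any SDK.
def gen_ai_attributes(system: str, model: str, input_tokens: int,
                      output_tokens: int, finish_reasons: list[str]) -> dict:
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": finish_reasons,
    }

attrs = gen_ai_attributes("openai", "gpt-4o", 1200, 300, ["stop"])
print(attrs["gen_ai.usage.input_tokens"])  # 1200
```

A dict like this can be applied to an OTEL span in one call with `span.set_attributes(attrs)`, keeping the convention strings in exactly one place when you switch platforms.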
Code examples: instrumentation patterns
Basic OpenAI call with OTEL tracing
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
import openai
import time
# Initialize OTEL provider
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://localhost:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-agent", "1.0.0")
client = openai.OpenAI()
def call_llm_with_tracing(messages: list, model: str = "gpt-4o") -> str:
with tracer.start_as_current_span("llm.call") as span:
span.set_attribute("gen_ai.system", "openai")
span.set_attribute("gen_ai.request.model", model)
span.set_attribute("gen_ai.request.messages_count", len(messages))
start = time.time()
response = client.chat.completions.create(
model=model,
messages=messages
)
latency_ms = (time.time() - start) * 1000
span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
span.set_attribute("gen_ai.response.finish_reasons",
[response.choices[0].finish_reason])
span.set_attribute("llm.latency_ms", latency_ms)
return response.choices[0].message.content
Decorator-based tracing for agent steps and tool calls
from opentelemetry import trace
from typing import Callable, Any
import json
import time
tracer = trace.get_tracer("agent")
def traced_agent_step(
step_name: str,
agent_id: str,
input_data: dict
) -> Callable:
"""Decorator for tracing individual agent reasoning steps."""
def decorator(func):
def wrapper(*args, **kwargs):
with tracer.start_as_current_span(f"agent.step.{step_name}") as span:
span.set_attribute("gen_ai.agent.id", agent_id)
span.set_attribute("agent.step.name", step_name)
span.set_attribute("agent.step.input", json.dumps(input_data)[:1000])
try:
result = func(*args, **kwargs)
span.set_attribute("agent.step.status", "success")
span.set_attribute("agent.step.output_type", type(result).__name__)
return result
except Exception as e:
span.set_attribute("agent.step.status", "error")
span.set_attribute("agent.step.error", str(e))
span.record_exception(e)
raise
return wrapper
return decorator
def traced_tool_call(tool_name: str, tool_fn: Callable, **kwargs) -> Any:
"""Execute a tool call with full tracing."""
with tracer.start_as_current_span("agent.tool_call") as span:
span.set_attribute("gen_ai.tool.name", tool_name)
span.set_attribute("agent.tool.input", json.dumps(kwargs)[:500])
start = time.time()
try:
result = tool_fn(**kwargs)
latency_ms = (time.time() - start) * 1000
span.set_attribute("agent.tool.latency_ms", latency_ms)
span.set_attribute("agent.tool.status", "success")
span.set_attribute("agent.tool.output_size", len(str(result)))
return result
except Exception as e:
span.set_attribute("agent.tool.status", "error")
span.set_attribute("agent.tool.error", str(e))
span.record_exception(e)
raise
Langfuse integration for multi-agent tracing
from langfuse import Langfuse
from langfuse.decorators import langfuse_context, observe
import uuid
langfuse = Langfuse(
public_key="pk-...",
secret_key="sk-...",
host="https://cloud.langfuse.com"
)
@observe(name="orchestrator-agent")
def orchestrator(task: str, session_id: str) -> str:
"""Top-level agent that coordinates sub-agents."""
langfuse_context.update_current_trace(
name="agent-session",
session_id=session_id,
user_id="user-123",
tags=["production", "orchestrator"],
metadata={"task_type": "research", "env": "prod"}
)
# Route sub-tasks to specialized agents
research_result = researcher_agent(query=task)
synthesis = synthesizer_agent(
research=research_result,
original_task=task
)
return synthesis
@observe(name="researcher-agent")
def researcher_agent(query: str) -> dict:
"""Sub-agent that handles web research."""
# LLM call to plan searches
search_plan = call_llm_with_tracing([
{"role": "user", "content": f"Plan web searches for: {query}"}
])
# Tool calls tracked as child spans
results = []
for search_query in parse_search_plan(search_plan):
result = traced_tool_call(
"web_search",
web_search_fn,
query=search_query
)
results.append(result)
return {"query": query, "results": results}
@observe(name="synthesizer-agent")
def synthesizer_agent(research: dict, original_task: str) -> str:
"""Sub-agent that synthesizes research into final output."""
return call_llm_with_tracing([
{"role": "system", "content": "You synthesize research into clear answers."},
{"role": "user", "content": f"Task: {original_task}\n\nResearch: {research}"}
])
TypeScript: AgentOps integration
import AgentOps from 'agentops';
import OpenAI from 'openai';
// Initialize AgentOps at startup
AgentOps.init({
apiKey: process.env.AGENTOPS_API_KEY!,
tags: ['production', 'v2.1.0'],
});
const client = new OpenAI();
interface AgentSession {
sessionId: string;
startTime: Date;
taskType: string;
}
async function runAgentWithTracking(
task: string,
userId: string
): Promise<{ result: string; session: AgentSession }> {
const sessionId = AgentOps.startSession({
tags: ['task-execution'],
inherited_session_id: userId,
});
const session: AgentSession = {
sessionId,
startTime: new Date(),
taskType: classifyTask(task),
};
try {
// AgentOps auto-instruments OpenAI calls when initialized
const completion = await client.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: task },
],
});
const result = completion.choices[0].message.content ?? '';
AgentOps.endSession('Success', { output: result.slice(0, 200) });
return { result, session };
} catch (error) {
AgentOps.endSession('Fail', { error: String(error) });
throw error;
}
}
function classifyTask(task: string): string {
// Simple classification for session tagging
if (task.includes('search') || task.includes('find')) return 'research';
if (task.includes('write') || task.includes('draft')) return 'generation';
return 'general';
}
Tracing multi-agent workflows
The hardest tracing problem in multi-agent systems is context propagation. When your orchestrator spawns a sub-agent, the sub-agent needs to know it's part of the same parent trace. Without this, you get disconnected traces that look like separate independent requests.
sequenceDiagram
participant U as User
participant O as Orchestrator
participant SA as Sub-Agent A
participant SB as Sub-Agent B
participant T1 as Tool: Search
participant T2 as Tool: Code
note over U,T2: Trace ID: abc-123 propagates through all spans
U->>O: "Research and summarize topic X"
activate O
note right of O: Span: orchestrator.run<br/>trace_id=abc-123<br/>span_id=span-001
O->>SA: delegate(query="research X")
activate SA
note right of SA: Span: sub-agent.research<br/>trace_id=abc-123<br/>parent_span=span-001<br/>span_id=span-002
SA->>T1: search(query="X overview")
activate T1
note right of T1: Span: tool.web_search<br/>trace_id=abc-123<br/>parent_span=span-002
T1-->>SA: search results
deactivate T1
SA->>T1: search(query="X recent developments")
activate T1
T1-->>SA: more results
deactivate T1
SA-->>O: research_data
deactivate SA
O->>SB: delegate(task="summarize", data=research_data)
activate SB
note right of SB: Span: sub-agent.synthesizer<br/>trace_id=abc-123<br/>parent_span=span-001<br/>span_id=span-003
SB->>T2: run_code(script="summarize.py")
activate T2
note right of T2: Span: tool.code_exec<br/>trace_id=abc-123<br/>parent_span=span-003
T2-->>SB: output
deactivate T2
SB-->>O: summary
deactivate SB
O-->>U: Final response
deactivate O
note over U,T2: Full trace: 1 session → 2 sub-agents → 4 tool calls
In OpenTelemetry, this is handled via context propagation headers. When you spawn a sub-agent via HTTP, pass the traceparent header. For in-process sub-agents, the OTEL context manager handles propagation automatically.
For async or queue-based agent coordination, propagate the trace context explicitly:
from opentelemetry.propagate import inject, extract
from opentelemetry import trace
import json
def dispatch_sub_agent(agent_name: str, task: dict) -> str:
"""Dispatch sub-agent task with trace context propagation."""
carrier = {}
inject(carrier) # Injects current trace context into carrier dict
# Include trace context in the task payload
task_with_context = {
**task,
"_trace_context": carrier
}
return queue.push(agent_name, json.dumps(task_with_context))
def receive_sub_agent_task(raw_task: str) -> None:
"""Sub-agent receives task with parent trace context restored."""
task = json.loads(raw_task)
# Restore parent trace context
ctx = extract(task.get("_trace_context", {}))
with trace.get_tracer("sub-agent").start_as_current_span(
"sub-agent.execute",
context=ctx # Parent context attached here
) as span:
span.set_attribute("agent.task_type", task.get("type"))
# Now all child spans will be nested under the parent trace
execute_task(task)
For multi-agent systems that communicate via the Google A2A protocol, trace context propagation is built into the protocol spec. If you're using A2A, instrument the x-trace-context extension header.
Cost tracking per agent run
Cost tracking is where most teams have massive blind spots. An individual agent call might cost $0.02 in LLM tokens. Running that agent 10,000 times per day at $0.02 is $200/day, $6,000/month — manageable. But agents that loop unexpectedly, call expensive models when cheap ones would suffice, or spawn sub-agents unnecessarily can blow up costs 10x overnight.
Track costs at three levels:
Per LLM call: Input tokens × input price + output tokens × output price. Simple, but you need to get the pricing right per model version and account for batching discounts.
Per agent session: Sum all LLM calls + tool calls (if tools have cost, e.g., Serper for web search) in a single agent run. This is the unit that maps to your product's cost per task.
Per user/tenant: Aggregate session costs by user for billing, quota management, and anomaly detection.
import dataclasses
from typing import Optional
# Current pricing (March 2026) — update when models change
MODEL_COSTS = {
"gpt-4o": {"input": 2.50, "output": 10.00}, # per 1M tokens
"gpt-4o-mini": {"input": 0.15, "output": 0.60},
"claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
"claude-3-5-haiku": {"input": 0.80, "output": 4.00},
"gemini-2.0-flash": {"input": 0.10, "output": 0.40},
}
@dataclasses.dataclass
class TokenUsage:
model: str
input_tokens: int
output_tokens: int
@property
def cost_usd(self) -> float:
pricing = MODEL_COSTS.get(self.model, {"input": 0.0, "output": 0.0})
return (
(self.input_tokens / 1_000_000) * pricing["input"] +
(self.output_tokens / 1_000_000) * pricing["output"]
)
@dataclasses.dataclass
class AgentSessionCost:
session_id: str
user_id: str
llm_calls: list[TokenUsage]
tool_costs: dict[str, float] # tool_name -> cost
@property
def total_llm_cost(self) -> float:
return sum(call.cost_usd for call in self.llm_calls)
@property
def total_tool_cost(self) -> float:
return sum(self.tool_costs.values())
@property
def total_cost(self) -> float:
return self.total_llm_cost + self.total_tool_cost
def to_span_attributes(self) -> dict:
return {
"cost.total_usd": self.total_cost,
"cost.llm_usd": self.total_llm_cost,
"cost.tool_usd": self.total_tool_cost,
"cost.llm_calls_count": len(self.llm_calls),
}
Set cost alerts at three thresholds:
- Per-session warning: >$0.50/run for what should be a cheap task
- Per-session critical: >$2.00/run (investigate before it scales)
- Daily spend anomaly: >2x rolling average (something is looping unexpectedly)
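The third threshold — daily spend above 2x the rolling average — needs nothing more than a list of recent daily totals. A sketch (function name, window size, and the sample figures are illustrative):

```python
import statistics

# Sketch: flag today's spend if it exceeds 2x the rolling average
# of recent days (threshold per the alert list above).
def spend_anomaly(daily_totals_usd: list[float], today_usd: float,
                  multiplier: float = 2.0, min_days: int = 7) -> bool:
    """True if today's spend exceeds multiplier x the rolling average."""
    if len(daily_totals_usd) < min_days:
        return False  # not enough history to establish a baseline
    baseline = statistics.mean(daily_totals_usd[-min_days:])
    return today_usd > baseline * multiplier

history = [180.0, 210.0, 195.0, 205.0, 190.0, 200.0, 220.0]  # ~$200/day
print(spend_anomaly(history, 230.0))  # False — within normal range
print(spend_anomaly(history, 850.0))  # True — likely a looping agent
```

Run this on a schedule against your aggregated session costs (the `AgentSessionCost` records above roll up naturally into daily totals per user or tenant).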
Helicone has the best out-of-box cost tracking — it handles model pricing automatically and provides cost dashboards without custom instrumentation. For teams that don't want to build this, start with Helicone for cost visibility and layer in more detailed agent tracing with another tool.
Debugging multi-agent coordination failures
Multi-agent systems fail in ways that are difficult to diagnose without traces. The most common failure modes we see in production:
Circular task delegation. Agent A delegates to Agent B. Agent B, because of ambiguous task definition, delegates back to Agent A. Without tracing showing the delegation chain, this looks like high latency followed by a timeout.
Context window overflow in handoffs. Agent A passes a 50,000-token context to Agent B. Agent B's context window is 32,000 tokens. The handoff fails silently — Agent B processes a truncated context and returns a nonsensical result. The orchestrator accepts it because it's syntactically valid.
Tool output hallucination. The LLM "remembers" a tool call from earlier in the session and fabricates a result for a second call to the same tool. The tool was never called. Without tracing tool calls explicitly, this is invisible.
Race conditions in parallel agents. Two agents write to the same resource (database row, file) simultaneously. One overwrites the other's work. The final result is missing half the expected output.
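The context-overflow failure in particular is cheap to guard against before the handoff rather than debug after it. A sketch — the 4-chars-per-token heuristic, agent names, and window sizes are assumptions, not real model specs:

```python
# Sketch: refuse a handoff whose payload likely exceeds the receiving
# agent's context window, instead of letting it truncate silently.
# The heuristic and the limits below are illustrative.
AGENT_CONTEXT_LIMITS = {"synthesizer": 32_000, "researcher": 128_000}

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic: ~4 characters per token

def safe_handoff(target_agent: str, payload: str) -> str:
    limit = AGENT_CONTEXT_LIMITS[target_agent]
    tokens = estimate_tokens(payload)
    if tokens > limit:
        raise ValueError(
            f"handoff to {target_agent} would overflow: "
            f"~{tokens} tokens > {limit} limit"
        )
    return payload  # safe to hand off

print(estimate_tokens("x" * 200_000))  # 50000 — would overflow a 32k agent
```

Failing loudly here turns a silent wrong-answer bug into an explicit error span in your trace.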
For each of these, the debugging pattern is the same: find the span where expected behavior diverges from actual behavior.
flowchart TD
START([Agent Run Failed]) --> Q1{Is there a complete trace?}
Q1 -->|No| FIX1[Add trace propagation to agent handoffs]
Q1 -->|Yes| Q2{Where does the trace end?}
Q2 --> OPTS[Check last recorded span]
OPTS --> OPT1{LLM call span}
OPTS --> OPT2{Tool call span}
OPTS --> OPT3{Agent handoff span}
OPTS --> OPT4{No span - silent failure}
OPT1 --> Q3{Finish reason?}
Q3 -->|length truncated| FIX2[Context too long — reduce input or switch model]
Q3 -->|content_filter| FIX3[Guardrail blocked output — review prompt]
Q3 -->|stop normal| Q4{Check tool calls requested}
Q4 -->|Tool called but no tool span| FIX4[Tool call not being executed — check dispatch logic]
OPT2 --> Q5{Tool error?}
Q5 -->|Auth/permission error| FIX5[Tool credential issue]
Q5 -->|Timeout| FIX6[Tool latency exceeds agent timeout]
Q5 -->|Malformed output| FIX7[Parse tool output before passing to LLM]
OPT3 --> Q6{Handoff target reachable?}
Q6 -->|No| FIX8[Sub-agent registration/routing issue]
Q6 -->|Yes but no response| FIX9[Context not propagated — check trace context headers]
OPT4 --> FIX10[Add spans around agent initialization and task dispatch]
FIX1 & FIX2 & FIX3 & FIX4 & FIX5 & FIX6 & FIX7 & FIX8 & FIX9 & FIX10 --> RERUN[Reproduce with detailed logging enabled]
The practical debugging workflow is:
- Filter traces by error status in your observability platform
- Find the last successful span before failure
- Check inputs and outputs of that span for anomalies
- Look at the parent span context — what was the agent's state when it made that decision?
- Check for unexpectedly high loop counts (agent looped 20 times when it should have looped 3)
- Compare against a successful trace for the same task type — find the divergence point
AgentOps has the best UX for this workflow because it surfaces agent-specific events (loops, handoffs, tool failures) as first-class objects, not just generic spans.
Alerting on quality degradation
Uptime alerts are easy — your agent either errors or it doesn't. Quality alerts are harder because your agent might return an output that is syntactically valid but semantically wrong, and you won't know without evaluation.
The approaches, in order of sophistication:
Latency proxy. If your agent's P95 latency suddenly increases from 5 seconds to 45 seconds, something changed — probably the agent is looping more, which often correlates with lower quality output. Latency is a leading indicator of quality problems.
Loop count monitoring. Track how many times your agent loops per session. Baseline your typical loop count (e.g., mean 4.2 loops, P95 8 loops). Alert if P95 exceeds 3x baseline — the agent is struggling.
Tool error rate. Track tool call failures as a percentage of total tool calls. Rising tool error rate often means the agent is attempting tool calls with malformed arguments — a sign the reasoning is degrading.
LLM-as-judge sampling. On 5-10% of production sessions, run an automated evaluator that scores the output quality. Alert if quality score drops below your threshold. This is expensive but gives you a real quality signal.
User feedback signals. If your product has any user feedback mechanism (thumbs up/down, regenerate button), instrument it. A rising regeneration rate is a direct quality signal.
from dataclasses import dataclass
from typing import Callable
import statistics
@dataclass
class AgentQualityMetrics:
session_id: str
loop_count: int
tool_error_rate: float
total_latency_ms: float
output_length: int
class QualityAlerter:
def __init__(self, baseline_window: int = 1000):
self.baseline_window = baseline_window
self.loop_count_history: list[int] = []
self.latency_history: list[float] = []
self.tool_error_history: list[float] = []
# Alert thresholds
self.loop_count_multiplier = 3.0
self.latency_multiplier = 2.0
self.tool_error_abs_threshold = 0.15 # 15% error rate
def record_session(self, metrics: AgentQualityMetrics) -> list[str]:
"""Record session metrics and return list of triggered alerts."""
alerts = []
if len(self.loop_count_history) >= 50:
baseline_loops = statistics.mean(self.loop_count_history[-50:])
if metrics.loop_count > baseline_loops * self.loop_count_multiplier:
alerts.append(
f"HIGH_LOOP_COUNT: {metrics.loop_count} loops "
f"(baseline {baseline_loops:.1f})"
)
if len(self.latency_history) >= 50:
baseline_latency = statistics.median(self.latency_history[-50:])
if metrics.total_latency_ms > baseline_latency * self.latency_multiplier:
alerts.append(
f"HIGH_LATENCY: {metrics.total_latency_ms:.0f}ms "
f"(baseline median {baseline_latency:.0f}ms)"
)
if metrics.tool_error_rate > self.tool_error_abs_threshold:
alerts.append(
f"HIGH_TOOL_ERROR_RATE: {metrics.tool_error_rate:.1%}"
)
# Update history
self.loop_count_history.append(metrics.loop_count)
self.latency_history.append(metrics.total_latency_ms)
self.tool_error_history.append(metrics.tool_error_rate)
return alerts
Connect your alerter to PagerDuty, Slack, or whatever your on-call tooling is. Quality alerts should be lower severity than uptime alerts by default — you want them in a Slack channel, not waking someone up at 3am, unless quality degradation is complete (agent is producing 100% garbage).
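That severity split can be encoded in a small routing function. A sketch, where the channel names and the near-zero success-rate heuristic for "complete degradation" are assumptions:

```python
def route_alerts(alerts: list[str], recent_success_rate: float) -> str:
    """Route quality alerts: page on-call only for total degradation,
    otherwise post to a low-severity monitoring channel.

    recent_success_rate is the fraction of recent sessions that
    succeeded; near-zero means the agent is producing garbage.
    """
    if not alerts:
        return "none"
    # Complete quality collapse: treat it like an uptime incident.
    if recent_success_rate < 0.05:
        return "pagerduty"
    # Ordinary quality degradation: Slack channel, no 3am page.
    return "slack"
```

With a healthy success rate, `route_alerts(["HIGH_LATENCY: 45000ms"], recent_success_rate=0.82)` returns `"slack"`; the same alert at a 1% success rate routes to `"pagerduty"`.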
Audit logging for compliance
If you're building agents that act on user data or take external actions, you need audit logs for compliance. This matters for SOC 2, GDPR, HIPAA, and increasingly for AI-specific regulatory requirements.
An audit log for agents is different from a trace:
- Traces are for debugging — they capture everything including internal reasoning
- Audit logs are for compliance — they capture who did what, to what data, when, and with what authorization
The audit log events that matter: tool executions (who invoked which tool, with what inputs, and the outcome), data access operations (which resources were read, written, or deleted), and authorization events. For security in agentic systems, audit logs also need to capture when agents were granted elevated permissions, when guardrails fired, and when an agent's action was blocked.
import hashlib
import json
from datetime import datetime, timezone
class AgentAuditLogger:
"""Compliance-grade audit logging for agent actions."""
PII_FIELDS = {"email", "phone", "ssn", "credit_card", "password", "token"}
def __init__(self, audit_store):
self.store = audit_store
def log_tool_execution(
self,
session_id: str,
user_id: str,
tool_name: str,
tool_input: dict,
outcome: str
) -> None:
event = {
"event_type": "agent.tool.execute",
"timestamp": datetime.now(timezone.utc).isoformat(),
"session_id": session_id,
"user_id": user_id,
"tool_name": tool_name,
"tool_input_hash": self._hash_params(tool_input),
"tool_input_sanitized": self._scrub_pii(tool_input),
"outcome": outcome,
}
self.store.write(event)
def log_data_access(
self,
session_id: str,
user_id: str,
resource_type: str,
resource_ids: list[str],
operation: str # "read" | "write" | "delete"
) -> None:
event = {
"event_type": f"agent.data.{operation}",
"timestamp": datetime.now(timezone.utc).isoformat(),
"session_id": session_id,
"user_id": user_id,
"resource_type": resource_type,
"resource_ids": resource_ids,
"resource_count": len(resource_ids),
}
self.store.write(event)
def _scrub_pii(self, params: dict) -> dict:
return {
k: "[REDACTED]" if k.lower() in self.PII_FIELDS else v
for k, v in params.items()
}
def _hash_params(self, params: dict) -> str:
canonical = json.dumps(params, sort_keys=True)
return hashlib.sha256(canonical.encode()).hexdigest()[:16]
Audit logs must be immutable, append-only, and retained per your compliance requirements (typically 1 year for SOC 2, 7 years for some financial regulations). Store them separately from your trace data — traces can be deleted after their debugging value expires, but audit logs are legal records.
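One way to make the audit_store backing AgentAuditLogger append-only and tamper-evident is hash chaining: each record embeds the hash of the previous record, so any retroactive edit breaks the chain. A minimal sketch, not a production implementation (a real deployment would also need write-once storage and key-managed signing):

```python
import hashlib
import json
from pathlib import Path

class HashChainedAuditStore:
    """Append-only audit store; each record is chained to the previous
    record's hash, making retroactive edits detectable."""

    def __init__(self, path: Path):
        self.path = path
        self.prev_hash = "0" * 64  # genesis value for an empty log

    def write(self, event: dict) -> None:
        record = {"event": event, "prev_hash": self.prev_hash}
        canonical = json.dumps(record, sort_keys=True)
        self.prev_hash = hashlib.sha256(canonical.encode()).hexdigest()
        with open(self.path, "a") as f:
            f.write(canonical + "\n")

    def verify(self) -> bool:
        """Recompute the chain; False means the log was altered."""
        prev = "0" * 64
        for line in self.path.read_text().splitlines():
            record = json.loads(line)
            if record["prev_hash"] != prev:
                return False
            prev = hashlib.sha256(
                json.dumps(record, sort_keys=True).encode()
            ).hexdigest()
        return True
```

The `write(event)` signature matches what AgentAuditLogger expects from its store, so it can be dropped in directly.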
Replay and time-travel debugging
The most powerful debugging capability for agents is session replay — the ability to replay a failed or anomalous session step by step, seeing exactly what the agent saw at each decision point.
LangSmith has session replay built in for LangChain/LangGraph runs. AgentOps has a session replay UI that shows events on a timeline. Both let you step through the agent's execution and inspect the state at each step.
For custom agent frameworks, you can implement basic replay by logging enough state:
import dataclasses
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any
from datetime import datetime, timezone
@dataclass
class AgentStateSnapshot:
step_index: int
timestamp: str
agent_id: str
step_type: str # "llm_call", "tool_call", "handoff", "decision"
input_state: dict
output_state: dict
llm_messages: list | None # Full message history for LLM calls
tool_name: str | None
tool_args: dict | None
tool_result: Any | None
class AgentReplayRecorder:
"""Records agent execution for replay debugging."""
def __init__(self, session_id: str, replay_dir: Path):
self.session_id = session_id
self.replay_dir = replay_dir
self.replay_dir.mkdir(parents=True, exist_ok=True)
self.snapshots: list[AgentStateSnapshot] = []
self.step_counter = 0
def record_llm_call(
self,
agent_id: str,
messages: list,
response: dict,
agent_state: dict
) -> None:
snapshot = AgentStateSnapshot(
step_index=self.step_counter,
timestamp=datetime.now(timezone.utc).isoformat(),
agent_id=agent_id,
step_type="llm_call",
input_state=agent_state.copy(),
output_state={**agent_state, "last_llm_response": response},
llm_messages=messages,
tool_name=None,
tool_args=None,
tool_result=None,
)
self.snapshots.append(snapshot)
self.step_counter += 1
def record_tool_call(
self,
agent_id: str,
tool_name: str,
tool_args: dict,
tool_result: Any,
agent_state: dict
) -> None:
snapshot = AgentStateSnapshot(
step_index=self.step_counter,
timestamp=datetime.now(timezone.utc).isoformat(),
agent_id=agent_id,
step_type="tool_call",
input_state=agent_state.copy(),
output_state={**agent_state, "last_tool_result": str(tool_result)[:500]},
llm_messages=None,
tool_name=tool_name,
tool_args=tool_args,
tool_result=tool_result,
)
self.snapshots.append(snapshot)
self.step_counter += 1
def flush(self) -> Path:
"""Write all snapshots to disk for replay."""
output_path = self.replay_dir / f"{self.session_id}.replay.json"
with open(output_path, "w") as f:
json.dump(
[dataclasses.asdict(s) for s in self.snapshots],
f,
indent=2,
default=str
)
return output_path
Time-travel debugging extends replay by letting you re-inject different inputs at any step — "what would have happened if the web search returned different results at step 3?" This is valuable for reproducing rare failure modes that are hard to trigger in staging.
Implementing true time-travel debugging requires your agent to be stateless and deterministic given the same input state — which most LLM-based agents are not (temperature > 0). You can approximate it by fixing temperature to 0 in replay mode and injecting known tool outputs.
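A replay harness over the recorder's output can stub tool execution with the recorded results, making the tool environment deterministic on replay. The sketch below assumes the .replay.json files written by AgentReplayRecorder; the stub interface is an illustration, not a fixed API:

```python
import json
from pathlib import Path

class RecordedToolStub:
    """Replays recorded tool results instead of executing real tools.
    Feed your agent this stub (plus temperature=0) in replay mode."""

    def __init__(self, replay_path: Path):
        snapshots = json.loads(replay_path.read_text())
        # Queue tool results in the order they originally occurred.
        self.results = [
            (s["tool_name"], s["tool_result"])
            for s in snapshots
            if s["step_type"] == "tool_call"
        ]
        self.cursor = 0

    def call(self, tool_name: str, **kwargs):
        recorded_name, result = self.results[self.cursor]
        if recorded_name != tool_name:
            # The replayed agent diverged from the original run; this is
            # itself a useful signal about non-determinism.
            raise RuntimeError(
                f"Replay divergence at step {self.cursor}: "
                f"agent called {tool_name!r}, recording has {recorded_name!r}"
            )
        self.cursor += 1
        return result
```

Time-travel debugging then becomes editing a recorded tool_result in the replay file and re-running from that point.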
The 7 production metrics you must track
These are the minimum metrics for any production agent system. Build your dashboard around these first:
graph LR
subgraph "Session Metrics"
M1["1. Session Success Rate<br/>(task completion %)"]
M2["2. Session P50/P95/P99 Latency<br/>(wall-clock duration)"]
M3["3. Cost per Session<br/>(total tokens + tools)"]
end
subgraph "Step Metrics"
M4["4. Loop Count Distribution<br/>(mean + P95 per task type)"]
M5["5. Tool Error Rate<br/>(% of tool calls that fail)"]
M6["6. LLM Error Rate<br/>(% of LLM calls that fail or truncate)"]
end
subgraph "Quality Metrics"
M7["7. Quality Score<br/>(LLM-as-judge on sampled sessions)"]
end
subgraph "Derived Alerts"
A1["Cost anomaly > 2x baseline"]
A2["Success rate < threshold"]
A3["P95 latency > SLA"]
A4["Tool error rate > 15%"]
A5["Quality score drop > 10pts"]
end
M1 --> A2
M2 --> A3
M3 --> A1
M5 --> A4
M7 --> A5
Metric 1: Session success rate. What percentage of agent sessions complete the intended task? You need to define "success" explicitly: it might be task_completed=true in your agent's final output, the user accepting the result, or no exception raised. Start with programmatic success (no exception) and layer in semantic success over time.
Metric 2: Session latency (P50/P95/P99). Wall-clock time from user input to final output. P50 tells you typical performance. P95 tells you what slow sessions look like. P99 tells you your worst cases. For most agent use cases, P95 < 30 seconds is a reasonable starting target.
Metric 3: Cost per session. Total LLM tokens + tool costs per completed session. Track this by task type (a "summarize document" task should cost less than a "research and write report" task). This is the unit economics of your product.
Metric 4: Loop count distribution. How many reasoning cycles does your agent take per session? Track mean and P95 per task type. Rising P95 loop counts are a leading indicator of quality problems or prompt regressions.
Metric 5: Tool error rate. Percentage of tool calls that return an error. Baseline this metric — some tool errors are normal (network timeouts, rate limits). Alert if it exceeds 15-20%.
Metric 6: LLM error rate. Percentage of LLM calls that fail, hit rate limits, or return truncated responses (finish_reason=length). This should be close to 0% in a well-configured system. If it's rising, check your token budgets and model availability.
Metric 7: Quality score. Periodic automated evaluation of output quality. Run an LLM-as-judge prompt on a sample of completed sessions (5-10%), score 1-10, track the rolling average. This is the only metric that catches "technically succeeded but output was garbage" failures.
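The sampling and rolling-score bookkeeping can be sketched as follows; judge_fn is a placeholder for whatever calls your judge prompt, and the 7% rate and window size are assumptions:

```python
import hashlib
from collections import deque
from typing import Callable

class QualityScoreTracker:
    """Samples a deterministic ~7% of sessions for LLM-as-judge scoring
    and tracks a rolling average quality score."""

    def __init__(self, judge_fn: Callable[[str], float],
                 sample_rate: float = 0.07, window: int = 200):
        self.judge_fn = judge_fn  # placeholder: runs your judge prompt, returns 1-10
        self.sample_rate = sample_rate
        self.scores: deque = deque(maxlen=window)

    def should_sample(self, session_id: str) -> bool:
        # Hash-based sampling: stable per session, no RNG state to manage,
        # and the same session is always in or out of the sample.
        h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
        return (h % 10_000) < self.sample_rate * 10_000

    def observe(self, session_id: str, output: str):
        """Score the sampled sessions; returns None for unsampled ones."""
        if not self.should_sample(session_id):
            return None
        score = self.judge_fn(output)
        self.scores.append(score)
        return score

    def rolling_average(self):
        return sum(self.scores) / len(self.scores) if self.scores else None
```

Alert when `rolling_average()` drops below your threshold; the rolling window keeps the signal responsive without reacting to a single bad session.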
The "Datadog for agents" problem
Here is the uncomfortable truth about agent observability in 2026: no single tool has solved the full problem. Every platform has meaningful gaps.
The fragmentation problem. Six platforms, none clearly dominant. LangSmith has the deepest LangChain integration. Helicone has the best cost visibility. AgentOps has the best multi-agent replay. Braintrust has the best eval tooling. Phoenix has the best OTEL support. Teams end up combining two or three platforms, which means managing multiple vendor relationships, multiple data pipelines, and multiple UIs.
The framework lock-in problem. The best observability experience you can have is with native integrations (LangSmith for LangChain, AgentOps for CrewAI). But those integrations only work well with specific frameworks. If you build a custom agent without a framework, you're writing instrumentation code from scratch.
The non-determinism problem. Every observability tool is built on the assumption that two requests with the same input produce comparable outputs. For agents, this is not true — temperature, reasoning paths, and tool results introduce variance. You can't do simple request diffing to detect regressions.
The semantic gap problem. A Datadog trace tells you a service was slow or errored. An agent trace tells you... an LLM returned a response. Whether that response was good or useful requires evaluation, which is a fundamentally different problem from telemetry. None of the platforms has fully bridged this gap between "what happened" and "was it good."
The cost problem. Comprehensive agent observability — logging every prompt, every completion, every tool call — is itself expensive. High-volume systems can spend $2,000-5,000/month on observability costs alone. Some teams end up sampling heavily, which creates blind spots.
The market will consolidate. The likely winner is either:
- A company that solves the OTEL standardization problem and becomes the "infrastructure layer" that all platforms sit on top of
- A platform that nails the eval + observability combination (Braintrust is closest to this today)
- One of the existing APM giants (Datadog, Honeycomb, Grafana) that adds comprehensive LLM/agent support
For now, our recommended stack is:
- Instrument with OTEL from day one (don't couple to a vendor)
- Use Langfuse (self-hosted) for trace storage and visualization if you need data sovereignty
- Add Helicone for cost tracking (minimal integration, great visibility)
- Use Braintrust for evals and quality measurement
- Roll your own audit logging for compliance — don't trust this to a third party
As you scale your AI agents platform beyond initial deployment, the observability investment pays back in faster debugging cycles and the ability to make confident changes to your agent prompts and architectures.
If you're tracking product data observability alongside agent observability, note that they have different requirements — product analytics is about user behavior patterns, agent observability is about individual execution correctness. Keep them in separate systems.
The agents category is moving toward compound workflow platforms where observability is increasingly built into the orchestration layer. Google's A2A protocol, LangGraph's Cloud platform, and Anthropic's Claude Agent SDK are all building observability as a first-class feature. In 18-24 months, the current state of "pick 3 vendors and glue them together" will likely give way to integrated observability within the major orchestration frameworks.
Until then, build your instrumentation layer on OTEL, own your data, and invest in the eval infrastructure that actually tells you if your agents are working.
Frequently asked questions
Q: Do I need observability for a simple single-agent system, or is it only for multi-agent workflows?
You need basic observability for any production agent, even a single-agent system. At minimum: cost tracking, latency monitoring, and error rate. Multi-agent systems need the full stack because coordination failures are harder to debug without traces.
Q: How much will agent observability tooling cost?
Rough estimates: Helicone cloud free tier handles ~1,000 requests/month; paid starts at $50/month. LangSmith charges by trace volume — roughly $0.005-0.01 per trace at volume. Langfuse self-hosted is free. Braintrust charges by eval runs + trace volume. For a product doing 10,000 agent sessions/month, budget $200-500/month for observability tooling.
Q: Should I build my own observability or use a platform?
Build your own audit logging (compliance is too important to outsource). Buy observability tooling for traces, cost tracking, and evals — the value-to-build ratio is not there. The platforms are improving fast enough that building custom tooling today is likely to be replaced by mature SaaS solutions within 12 months.
Q: How do I trace agents that run asynchronously or in background jobs?
Use OTEL context propagation headers. When you dispatch an async task, serialize the current trace context into the job payload. When the job executes, deserialize and restore the trace context before creating any spans. Both Python's opentelemetry-sdk and the Node.js @opentelemetry/api SDK support this pattern.
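The wire format OTEL uses for this is the W3C traceparent header. The sketch below hand-rolls it to show exactly what gets serialized into the job payload; in practice you would call opentelemetry.propagate inject/extract rather than building the header yourself:

```python
import re

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str):
    """Recover trace context from a job payload on the worker side."""
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        return None
    return {"trace_id": m.group(1), "parent_span_id": m.group(2),
            "sampled": m.group(3) == "01"}

# Dispatcher side: attach the current context to the async job payload.
job = {"task": "summarize", "traceparent": make_traceparent(
    "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331")}

# Worker side: restore context before creating any child spans,
# so the background work shows up under the original trace.
ctx = parse_traceparent(job["traceparent"])
```

The example IDs are the ones from the W3C Trace Context spec; any 32- and 16-character lowercase hex strings are valid.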
Q: What's the difference between a trace and a session in agent observability?
A trace is a technical unit: a tree of spans tracking a specific execution path. A session is a business unit: a complete agent run representing one user task, which may span multiple traces (e.g., if the agent retries). Some platforms (AgentOps) surface sessions as the primary unit. Others (LangSmith) surface traces. Map your business events (task started, task completed, task failed) to sessions, and let traces be the implementation detail underneath.
Q: How do I handle PII in agent traces?
Before logging, scrub fields that contain personal data (names, emails, phone numbers, IDs). For LLM prompts and completions, use a PII detection model (AWS Comprehend, Microsoft Presidio, or similar) to redact before logging. Never log raw user content unless you have explicit consent and appropriate data handling agreements. For GDPR compliance, ensure trace data is stored in your user's region and can be deleted on request.
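Field-level scrubbing (like the PII_FIELDS set earlier) misses PII embedded inside free text, so pair it with a pattern pass over prompts and completions before logging. A crude regex fallback, with the caveat above that a dedicated detector like Presidio or Comprehend catches far more:

```python
import re

# Crude fallback patterns; these miss plenty of PII shapes and
# should supplement, not replace, a real PII detection model.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"), "[PHONE]"),
]

def scrub_free_text(text: str) -> str:
    """Redact common PII shapes from prompts/completions before logging."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Run this on every string that enters your trace pipeline, including tool arguments and tool results, not just the user's initial message.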
Q: How do I know if my agent observability tooling is actually working?
Deliberately break your agent in staging and verify that the failure shows up in your observability platform within 60 seconds. Check that cost tracking captures the failure scenario correctly. Verify that the trace shows the step where the failure occurred. Run this "observability smoke test" after every major configuration change.
Q: What model costs should I account for in cost tracking?
Input tokens, output tokens, and (if used) embedding tokens. Some models also charge for image/vision tokens. Additionally, account for tool costs: web search APIs (Serper ~$0.001/query), code execution environments (E2B ~$0.10/hour), and any external APIs your agent calls. The LLM token cost is usually 60-80% of total agent cost, but tool costs matter at scale.
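As a sketch of that arithmetic (the per-token prices and tool rates here are illustrative assumptions, not current list prices):

```python
# Illustrative per-million-token prices; real prices vary by model and
# change often, so load these from config rather than hardcoding them.
PRICE_PER_MTOK = {
    "input": 3.00,
    "output": 15.00,
    "embedding": 0.10,
}
TOOL_COSTS = {"web_search": 0.001}  # per call; assumed Serper-like pricing

def session_cost(input_tokens: int, output_tokens: int,
                 embedding_tokens: int = 0,
                 tool_calls: dict = None) -> float:
    """Total session cost in dollars: LLM tokens plus tool usage."""
    cost = (
        input_tokens / 1_000_000 * PRICE_PER_MTOK["input"]
        + output_tokens / 1_000_000 * PRICE_PER_MTOK["output"]
        + embedding_tokens / 1_000_000 * PRICE_PER_MTOK["embedding"]
    )
    for tool, count in (tool_calls or {}).items():
        cost += TOOL_COSTS.get(tool, 0.0) * count
    return round(cost, 6)
```

Under these assumed prices, a session with 50k input tokens, 10k output tokens, and 5 web searches works out to $0.15 + $0.15 + $0.005 = $0.305, with tokens dominating as the text above describes.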
If you're evaluating the full agent development stack — including multi-agent orchestration patterns and security for agentic systems — observability is the layer that makes everything else debuggable. Build it early.