# AI Agent Observability: Monitoring, Tracing, and Debugging Multi-Agent Systems in Production

**TL;DR:** AI agents are black boxes by default. You deploy a system that makes API calls, uses tools, loops through reasoning steps, and coordinates across sub-agents — and without observability tooling, you have no idea why it succeeded, why it failed, or how much it cost. Traditional APM tools don't cut it: they weren't built for non-deterministic LLM calls, multi-hop tool chains, or agents that spawn other agents. This guide covers the observability stack you actually need — tracing, cost tracking, debugging multi-agent failures, alerting on quality degradation, and the seven metrics every production agent system must track. We compare six platforms (LangSmith, Helicone, Langfuse, AgentOps, Braintrust, Phoenix/Arize), show you how to instrument with OpenTelemetry, and explain why nobody has fully solved the "Datadog for agents" problem yet.

---

## Table of Contents

1. [Why agent observability is different](#why-different)
2. [What you actually need to observe](#what-to-observe)
3. [The agent observability stack](#observability-stack)
4. [Platform comparison: the six main players](#platform-comparison)
5. [OpenTelemetry for agents](#opentelemetry-agents)
6. [Code examples: instrumentation patterns](#code-examples)
7. [Tracing multi-agent workflows](#tracing-multi-agent)
8. [Cost tracking per agent run](#cost-tracking)
9. [Debugging coordination failures](#debugging-failures)
10. [Alerting on quality degradation](#alerting-quality)
11. [Audit logging for compliance](#audit-logging)
12. [Replay and time-travel debugging](#replay-debugging)
13. [The 7 production metrics you must track](#7-metrics)
14. [The "Datadog for agents" problem](#datadog-problem)
15. [Frequently asked questions](#faq)

---

## Why agent observability is different {#why-different}

If you have run production web services, you know what APM looks like. A request comes in, it hits your API, maybe calls a database, returns a response. Latency, error rate, throughput — three numbers tell the whole story. Tools like Datadog, New Relic, and Honeycomb were built for this model, and they do it well.

AI agents break every assumption that model is built on.

Here is what an agent actually does when you call it:

1. Receives a natural-language instruction
2. Sends a prompt to an LLM (nondeterministic, variable latency, token-based cost)
3. Receives a tool call request from the model
4. Executes the tool (external API, database, file system)
5. Returns tool output to the model
6. Model reasons again — potentially calling more tools
7. Model decides to spawn a sub-agent for part of the task
8. Sub-agent runs its own loop (go back to step 2)
9. Sub-agent returns result to orchestrator
10. Orchestrator synthesizes, loops again, or terminates

A single user request might produce 50 LLM calls, 200 tool invocations, 8 sub-agent spawns, and 15 minutes of wall-clock execution time. The "request" is not a single thing — it is a tree.
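The loop above can be sketched in a few lines. Everything here is a hypothetical stand-in for a real LLM API, but it shows why the unit of work is a tree, not a request: sub-agents are recursive calls, and each iteration may fan out into tool calls.

```python
# Minimal sketch of the agent loop above. call_llm, run_tool, and the
# action dict format are all hypothetical stand-ins, not a real API.
def call_llm(messages: list[dict]) -> dict:
    # Stub: ask for one tool call, then finish.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "name": "search",
                "args": {"q": messages[0]["content"]}}
    return {"type": "final", "content": f"done after {len(messages)} messages"}

def run_tool(name: str, args: dict) -> str:
    return f"{name} results for {args}"

def run_agent(instruction: str, depth: int = 0, max_steps: int = 10) -> str:
    """One reasoning loop; sub-agents (steps 7-9) are just recursive calls."""
    messages = [{"role": "user", "content": instruction}]
    for _ in range(max_steps):
        action = call_llm(messages)                      # step 2
        if action["type"] == "tool_call":                # steps 3-5
            output = run_tool(action["name"], action["args"])
            messages.append({"role": "tool", "content": output})
        elif action["type"] == "spawn":                  # steps 7-9
            messages.append({"role": "tool",
                             "content": run_agent(action["task"], depth + 1)})
        else:                                            # step 10
            return action["content"]
    return "max steps exceeded"
```

Every branch in this loop is a span you will eventually need to see in a trace.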

Traditional APM instruments request/response cycles. It has no concept of "agent reasoning step" or "tool call context" or "why did the agent loop 12 times before failing." You cannot instrument an agent with a Datadog APM agent the same way you instrument a Rails app, because the unit of work is completely different.

The specific problems you face:

**Non-determinism.** The same input might produce different outputs. You can't test this in the traditional sense — you need to evaluate whether outputs are "good enough" across a distribution. That requires LLM-as-judge or human eval pipelines, which are separate from your APM.

**Variable execution paths.** A linear service has predictable call trees. An agent might follow a 3-step path or a 30-step path depending on what it encounters. Your visualization needs to handle tree structures with arbitrary depth, not flat request spans.

**LLM-specific costs.** Costs are input tokens + output tokens + embeddings + tool calls, aggregated across potentially dozens of model invocations per user request. No traditional APM tracks this.

**Quality vs uptime.** Traditional APM tells you if your service is up. For agents, being "up" is not enough — you need to know if the output quality is degrading. A hallucinating agent running at 99.9% uptime is worse than a failed request that returns a clear error.

**Handoff context.** When Agent A hands off to Agent B, you need the trace context to propagate. If it doesn't, you get disconnected traces that look like separate requests — you lose the causal chain.
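The non-determinism point above changes what "testing" means: instead of asserting on one output, you run the agent repeatedly and score the distribution. A minimal sketch, where `agent_fn` and `judge_fn` are whatever your system provides (an LLM-as-judge, a heuristic, or a human label):

```python
def eval_pass_rate(agent_fn, judge_fn, task: str, n: int = 10) -> float:
    """Run the agent n times on the same task, score each output with a
    judge (LLM-as-judge, heuristic, or human), return the pass rate."""
    verdicts = [judge_fn(agent_fn(task)) for _ in range(n)]
    return sum(bool(v) for v in verdicts) / n
```

In practice you gate deploys on pass rate across a task distribution, not on per-case assertions.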

All of this means you need purpose-built observability tooling for agents. The good news is that the ecosystem has moved fast. By early 2026, there are at least six serious platforms for this problem. The bad news is that none of them has fully solved it yet — we'll get to that.

If you're building on the agent startup opportunity (covered in detail at [/blog/ai-agent-startup-opportunity](/blog/ai-agent-startup-opportunity)), observability is not optional. It's the difference between a product you can iterate on and a product that randomly fails in ways you can't diagnose.

---

## What you actually need to observe {#what-to-observe}

Before picking a platform, be precise about what you need to observe. There are four layers:

**Layer 1: LLM calls.** Every request to an LLM — the prompt, the completion, the model used, token counts, latency, cost. This is the foundation. Every platform covers this.

**Layer 2: Tool calls.** When the agent calls a function (search, database query, API call), you need to record what was called, the arguments, the output, and the latency. The correlation between LLM call → tool call → LLM call is critical for understanding agent behavior.

**Layer 3: Agent steps.** The higher-level abstraction of "reasoning step." An agent might make 3 LLM calls and 5 tool calls in a single reasoning step. You want to group these logically so you can see the agent's decision process, not just a flat list of calls.

**Layer 4: Session/run level.** The full trace from user input to final output, spanning all steps, all sub-agents, and all tool invocations. This is what you correlate to business outcomes (did the task succeed? did the user accept the result? what was the total cost?).

Most teams start with Layer 1, think they're done, and then discover 6 months later that they can't debug Layer 3 and 4 failures. Build the full stack from day one — retrofitting observability into agents is painful.
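Layers 3 and 4 are tree-shaped: a session span contains step spans, which contain LLM and tool spans. A sketch of rebuilding that tree from a flat span export (the span dict shape here is an assumption, not any platform's schema):

```python
from collections import defaultdict

def build_trace_tree(spans: list[dict]) -> list[dict]:
    """Group flat spans (Layers 1-3) back into the session tree (Layer 4).
    Each span dict is assumed to carry 'id', 'parent_id', and 'name'."""
    children = defaultdict(list)
    roots = []
    for s in spans:
        if s["parent_id"] is None:
            roots.append(s)
        else:
            children[s["parent_id"]].append(s)

    def attach(span: dict) -> dict:
        return {"name": span["name"],
                "children": [attach(c) for c in children[span["id"]]]}

    return [attach(r) for r in roots]
```

If a span arrives with a `parent_id` that no other span has, that is your disconnected-trace symptom: context propagation broke somewhere.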

---

## The agent observability stack {#observability-stack}

Here is the reference architecture we use for production agent systems:

```mermaid
graph TB
    subgraph "Agent System"
        U[User Request] --> O[Orchestrator Agent]
        O --> SA[Sub-Agent A]
        O --> SB[Sub-Agent B]
        SA --> T1[Tool: Search]
        SA --> T2[Tool: Database]
        SB --> T3[Tool: Code Exec]
        SB --> T4[Tool: File Write]
    end

    subgraph "Instrumentation Layer"
        OT[OpenTelemetry SDK]
        OT --> SPAN1[Spans: LLM Calls]
        OT --> SPAN2[Spans: Tool Calls]
        OT --> SPAN3[Spans: Agent Steps]
        OT --> SPAN4[Spans: Sessions]
    end

    subgraph "Observability Platform"
        COL[OTel Collector]
        COL --> TRACE[Trace Storage]
        COL --> METRICS[Metrics Store]
        COL --> LOGS[Log Store]
    end

    subgraph "Dashboards & Alerts"
        DASH[Production Dashboard]
        ALERT[Alert Manager]
        EVAL[Eval Pipeline]
        REPLAY[Replay Debugger]
    end

    O -.->|instrument| OT
    SA -.->|instrument| OT
    SB -.->|instrument| OT
    OT --> COL
    TRACE --> DASH
    METRICS --> ALERT
    LOGS --> EVAL
    TRACE --> REPLAY
```

The key architectural decision is where to put the instrumentation layer. You have three choices:

**SDK instrumentation (invasive).** You add observability calls directly into your agent code. Maximum control, maximum effort. This is what LangSmith and Langfuse require in the base case.

**Proxy instrumentation (minimal).** Traffic routes through an observability proxy that intercepts LLM calls transparently. This is Helicone's model — swap your base URL, get instant logging. Low effort, lower visibility (you see LLM calls but not the agent logic between them).

**Framework instrumentation (integrated).** If you use LangChain, LangGraph, CrewAI, or OpenAI's Agents SDK, there are native integrations that instrument at the framework level. You get agent-level traces without writing instrumentation code yourself.

In practice, you combine these. Use framework-level integration to get the agent structure, add manual spans for the business logic that matters most, and route through a proxy for cost tracking.

---

## Platform comparison: the six main players {#platform-comparison}

Here is an honest comparison of the platforms available in 2026:

### LangSmith

**What it is:** Observability and evaluation platform from the LangChain team. Native integration with LangChain, LangGraph, and OpenAI's Agents SDK.

**Best for:** Teams already using LangChain/LangGraph. If your agent is built on LangGraph, LangSmith integration is one environment variable.

**Strengths:**
- Near-zero setup for LangChain users (the "zero overhead" claim is about integration friction, not literal CPU cost)
- Rich trace visualization with the agent reasoning tree laid out visually
- Eval framework built in — you can run evaluators directly from the traces you capture
- Dataset management — capture production traces as ground truth for evals
- Human feedback annotation workflows
- Prompt management and versioning

**Weaknesses:**
- Vendor lock-in if you use LangChain-specific features
- Cost scales with volume — can get expensive at high throughput
- The eval UX is good but opinionated about evaluation approaches

**Integration:** Add `LANGCHAIN_TRACING_V2=true` and `LANGCHAIN_API_KEY` to your env. That's it for LangChain. For other frameworks, use the Python/TypeScript SDK.

**Docs:** [smith.langchain.com](https://smith.langchain.com/docs)

---

### Helicone

**What it is:** Proxy-based LLM observability. Route your OpenAI/Anthropic/Gemini calls through Helicone's proxy, get instant logging, cost tracking, and rate limiting.

**Best for:** Teams that want minimal integration friction. Particularly good if you're not using a framework and just making raw API calls.

**Strengths:**
- Truly zero-code integration (change one base URL)
- Cross-provider support — OpenAI, Anthropic, Gemini, Mistral, Cohere from one dashboard
- Cost tracking is best-in-class — accurate, real-time, per-request
- Rate limiting and caching built in
- Custom properties — tag requests with user IDs, session IDs, environment
- Self-hostable (open source version on GitHub)

**Weaknesses:**
- Proxy adds ~20-50ms latency (small but real)
- Agent-level tracing requires additional SDK work — proxy alone gives you LLM call level, not agent step level
- Less sophisticated eval tooling than LangSmith or Braintrust

**Integration:** Set base URL to `https://oai.helicone.ai/v1` and add `Helicone-Auth` header. Done.

**Docs:** [docs.helicone.ai](https://docs.helicone.ai)

---

### Langfuse

**What it is:** Open-source LLM observability platform. Self-hostable, with a generous cloud tier.

**Best for:** Teams with compliance requirements (SOC2, GDPR) that need data sovereignty. Also great for cost-conscious teams — self-hosted is free.

**Strengths:**
- Fully open source (GitHub: [github.com/langfuse/langfuse](https://github.com/langfuse/langfuse))
- Self-hostable on your own infra — data never leaves your environment
- Strong tracing model with nested spans that maps well to agent execution trees
- Prompt management, dataset tracking, eval pipelines
- Integration with most agent frameworks
- Active community, fast release cadence

**Weaknesses:**
- Self-hosting requires operational overhead
- Cloud version is generous but enterprise tiers can get pricey
- UI is functional but less polished than commercial competitors

**Integration:** `pip install langfuse`, then use the Langfuse SDK decorator or callback handler.

**Docs:** [langfuse.com/docs](https://langfuse.com/docs)

---

### AgentOps

**What it is:** Observability platform built specifically for AI agents, not adapted from LLM observability. Purpose-built for the multi-agent use case.

**Best for:** Teams running complex multi-agent systems who want agent-native tooling rather than LLM observability adapted for agents.

**Strengths:**
- Session replay — watch exactly what your agent did, step by step
- Agent-native concepts: sessions, events, actions, errors
- LLM cost tracking with per-session breakdown
- Works with most agent frameworks (CrewAI, AutoGen, LangChain, OpenAI SDK)
- Compliance tooling built in — audit trails, PII detection

**Weaknesses:**
- Smaller community and ecosystem than LangSmith/Langfuse
- Eval framework less mature than Braintrust
- Documentation lags behind feature development

**Integration:** `pip install agentops`, then `agentops.init(api_key="...")` at startup.

**Docs:** [docs.agentops.ai](https://docs.agentops.ai)

---

### Braintrust

**What it is:** Eval + observability combined into one platform. Strong emphasis on running evals at scale and connecting them to production traces.

**Best for:** Teams where evaluation quality is the primary concern — you want to know not just what happened, but whether it was good. Product teams doing A/B testing of prompt changes, new models, or agent architectures.

**Strengths:**
- Best-in-class eval infrastructure — scored datasets, eval runs, regression detection
- Production tracing + eval in the same platform (traces feed directly into eval datasets)
- Human annotation UI is excellent
- Experiment management — compare different agent configurations side by side
- Good TypeScript/Python SDK parity

**Weaknesses:**
- More expensive than Langfuse at scale
- Infrastructure/DevOps-focused teams might find the eval-first UX unfamiliar
- Less native multi-agent tracing support than AgentOps

**Integration:** Use Braintrust SDK or the OTEL exporter.

**Docs:** [braintrustdata.com/docs](https://www.braintrustdata.com/docs)

---

### Phoenix (Arize)

**What it is:** ML observability platform (Arize) extended to cover LLMs and agents. Phoenix is the open-source version; Arize AI is the commercial cloud.

**Best for:** Teams that already use Arize for ML model monitoring and want to extend observability to their AI agent layer. Strong for teams with ML engineers on staff.

**Strengths:**
- Deep ML observability roots — strong on drift detection, data quality monitoring
- Phoenix is fully open source (good for local development)
- OTEL-native — the best OTEL support of any platform in this list
- Embedding visualization (useful for RAG debugging)
- Extends naturally to RAG pipeline observability

**Weaknesses:**
- UI is more complex than alternatives — ML observability background shows
- Less agent-specific tooling than AgentOps
- The split between open-source Phoenix and commercial Arize creates confusion

**Integration:** `pip install arize-phoenix openinference-instrumentation`, then use auto-instrumentation for LangChain/OpenAI/LlamaIndex.

**Docs:** [docs.arize.com/phoenix](https://docs.arize.com/phoenix)

---

## OpenTelemetry for agents {#opentelemetry-agents}

The problem with six different platforms is fragmentation. Every platform has its own SDK, its own data model, its own concept of what a "trace" means for an agent. If you instrument for LangSmith today and want to switch to Langfuse tomorrow, you're rewriting instrumentation code.

OpenTelemetry (OTEL) solves this with a vendor-neutral standard for traces, metrics, and logs. The [OpenTelemetry AI SIG](https://github.com/open-telemetry/community/blob/main/projects/llm-semconv.md) is working on semantic conventions specifically for LLM calls and agent workflows.

The current draft semantic conventions define span attributes like:

- `gen_ai.system` — the LLM provider (openai, anthropic, etc.)
- `gen_ai.request.model` — the model used
- `gen_ai.request.max_tokens`
- `gen_ai.response.finish_reasons`
- `gen_ai.usage.input_tokens`
- `gen_ai.usage.output_tokens`

For agents specifically, the conventions extend to:

- `gen_ai.agent.id` — unique agent identifier
- `gen_ai.agent.name` — human-readable agent name
- `gen_ai.tool.name` — tool being called
- `gen_ai.tool.call.id` — tool call correlation ID

Phoenix/Arize is currently the most OTEL-native of the platforms above. LangSmith has an OTEL exporter. Langfuse has OTEL support via their backend. The ecosystem is converging on OTEL as the instrumentation layer, with platforms differentiating on storage, visualization, and eval capabilities.

Our recommendation: instrument with OTEL semantic conventions from day one, then route to whichever platform fits your team. This keeps your instrumentation code stable even as the platform ecosystem consolidates.

---

## Code examples: instrumentation patterns {#code-examples}

### Basic OpenAI call with OTEL tracing

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
import openai
import time

# Initialize OTEL provider
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://localhost:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-agent", "1.0.0")

client = openai.OpenAI()

def call_llm_with_tracing(messages: list, model: str = "gpt-4o") -> str:
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.messages_count", len(messages))

        start = time.time()
        response = client.chat.completions.create(
            model=model,
            messages=messages
        )
        latency_ms = (time.time() - start) * 1000

        span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
        span.set_attribute("gen_ai.response.finish_reasons",
                          [response.choices[0].finish_reason])
        span.set_attribute("llm.latency_ms", latency_ms)

        return response.choices[0].message.content
```

### Agent step tracing with tool calls

```python
from opentelemetry import trace
from typing import Callable, Any
import json
import time

tracer = trace.get_tracer("agent")

def traced_agent_step(
    step_name: str,
    agent_id: str,
    input_data: dict
) -> Callable:
    """Decorator for tracing individual agent reasoning steps."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(f"agent.step.{step_name}") as span:
                span.set_attribute("gen_ai.agent.id", agent_id)
                span.set_attribute("agent.step.name", step_name)
                span.set_attribute("agent.step.input", json.dumps(input_data)[:1000])

                try:
                    result = func(*args, **kwargs)
                    span.set_attribute("agent.step.status", "success")
                    span.set_attribute("agent.step.output_type", type(result).__name__)
                    return result
                except Exception as e:
                    span.set_attribute("agent.step.status", "error")
                    span.set_attribute("agent.step.error", str(e))
                    span.record_exception(e)
                    raise
        return wrapper
    return decorator


def traced_tool_call(tool_name: str, tool_fn: Callable, **kwargs) -> Any:
    """Execute a tool call with full tracing."""
    with tracer.start_as_current_span("agent.tool_call") as span:
        span.set_attribute("gen_ai.tool.name", tool_name)
        span.set_attribute("agent.tool.input", json.dumps(kwargs)[:500])

        start = time.time()
        try:
            result = tool_fn(**kwargs)
            latency_ms = (time.time() - start) * 1000

            span.set_attribute("agent.tool.latency_ms", latency_ms)
            span.set_attribute("agent.tool.status", "success")
            span.set_attribute("agent.tool.output_size", len(str(result)))
            return result
        except Exception as e:
            span.set_attribute("agent.tool.status", "error")
            span.set_attribute("agent.tool.error", str(e))
            span.record_exception(e)
            raise
```

### Langfuse integration for multi-agent tracing

```python
from langfuse import Langfuse
from langfuse.decorators import langfuse_context, observe
import uuid

langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"
)

@observe(name="orchestrator-agent")
def orchestrator(task: str, session_id: str) -> str:
    """Top-level agent that coordinates sub-agents."""
    langfuse_context.update_current_trace(
        name="agent-session",
        session_id=session_id,
        user_id="user-123",
        tags=["production", "orchestrator"],
        metadata={"task_type": "research", "env": "prod"}
    )

    # Route sub-tasks to specialized agents
    research_result = researcher_agent(query=task)
    synthesis = synthesizer_agent(
        research=research_result,
        original_task=task
    )
    return synthesis


@observe(name="researcher-agent")
def researcher_agent(query: str) -> dict:
    """Sub-agent that handles web research."""
    # LLM call to plan searches
    search_plan = call_llm_with_tracing([
        {"role": "user", "content": f"Plan web searches for: {query}"}
    ])

    # Tool calls tracked as child spans
    # (parse_search_plan and web_search_fn are app-specific helpers, not shown)
    results = []
    for search_query in parse_search_plan(search_plan):
        result = traced_tool_call(
            "web_search",
            web_search_fn,
            query=search_query
        )
        results.append(result)

    return {"query": query, "results": results}


@observe(name="synthesizer-agent")
def synthesizer_agent(research: dict, original_task: str) -> str:
    """Sub-agent that synthesizes research into final output."""
    return call_llm_with_tracing([
        {"role": "system", "content": "You synthesize research into clear answers."},
        {"role": "user", "content": f"Task: {original_task}\n\nResearch: {research}"}
    ])
```

### TypeScript: AgentOps integration

```typescript
import AgentOps from 'agentops';
import OpenAI from 'openai';

// Initialize AgentOps at startup
AgentOps.init({
  apiKey: process.env.AGENTOPS_API_KEY!,
  tags: ['production', 'v2.1.0'],
});

const client = new OpenAI();

interface AgentSession {
  sessionId: string;
  startTime: Date;
  taskType: string;
}

async function runAgentWithTracking(
  task: string,
  userId: string
): Promise<{ result: string; session: AgentSession }> {
  const sessionId = AgentOps.startSession({
    tags: ['task-execution'],
    inherited_session_id: userId,
  });

  const session: AgentSession = {
    sessionId,
    startTime: new Date(),
    taskType: classifyTask(task),
  };

  try {
    // AgentOps auto-instruments OpenAI calls when initialized
    const completion = await client.chat.completions.create({
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: task },
      ],
    });

    const result = completion.choices[0].message.content ?? '';

    AgentOps.endSession('Success', { output: result.slice(0, 200) });
    return { result, session };
  } catch (error) {
    AgentOps.endSession('Fail', { error: String(error) });
    throw error;
  }
}

function classifyTask(task: string): string {
  // Simple classification for session tagging
  if (task.includes('search') || task.includes('find')) return 'research';
  if (task.includes('write') || task.includes('draft')) return 'generation';
  return 'general';
}
```

---

## Tracing multi-agent workflows {#tracing-multi-agent}

The hardest tracing problem in multi-agent systems is context propagation. When your orchestrator spawns a sub-agent, the sub-agent needs to know it's part of the same parent trace. Without this, you get disconnected traces that look like separate independent requests.

```mermaid
sequenceDiagram
    participant U as User
    participant O as Orchestrator
    participant SA as Sub-Agent A
    participant SB as Sub-Agent B
    participant T1 as Tool: Search
    participant T2 as Tool: Code

    note over U,T2: Trace ID: abc-123 propagates through all spans

    U->>O: "Research and summarize topic X"
    activate O
    note right of O: Span: orchestrator.run<br/>trace_id=abc-123<br/>span_id=span-001

    O->>SA: delegate(query="research X")
    activate SA
    note right of SA: Span: sub-agent.research<br/>trace_id=abc-123<br/>parent_span=span-001<br/>span_id=span-002

    SA->>T1: search(query="X overview")
    activate T1
    note right of T1: Span: tool.web_search<br/>trace_id=abc-123<br/>parent_span=span-002
    T1-->>SA: search results
    deactivate T1

    SA->>T1: search(query="X recent developments")
    activate T1
    T1-->>SA: more results
    deactivate T1

    SA-->>O: research_data
    deactivate SA

    O->>SB: delegate(task="summarize", data=research_data)
    activate SB
    note right of SB: Span: sub-agent.synthesizer<br/>trace_id=abc-123<br/>parent_span=span-001<br/>span_id=span-003

    SB->>T2: run_code(script="summarize.py")
    activate T2
    note right of T2: Span: tool.code_exec<br/>trace_id=abc-123<br/>parent_span=span-003
    T2-->>SB: output
    deactivate T2

    SB-->>O: summary
    deactivate SB

    O-->>U: Final response
    deactivate O
    note over U,T2: Full trace: 1 session → 2 sub-agents → 4 tool calls
```

In OpenTelemetry, this is handled via context propagation headers. When you spawn a sub-agent via HTTP, pass the `traceparent` header. For in-process sub-agents, the OTEL context manager handles propagation automatically.

For async or queue-based agent coordination, propagate the trace context explicitly:

```python
from opentelemetry.propagate import inject, extract
from opentelemetry import trace
import json

def dispatch_sub_agent(agent_name: str, task: dict) -> str:
    """Dispatch sub-agent task with trace context propagation."""
    carrier = {}
    inject(carrier)  # Injects current trace context into carrier dict

    # Include trace context in the task payload
    task_with_context = {
        **task,
        "_trace_context": carrier
    }

    return queue.push(agent_name, json.dumps(task_with_context))


def receive_sub_agent_task(raw_task: str) -> None:
    """Sub-agent receives task with parent trace context restored."""
    task = json.loads(raw_task)

    # Restore parent trace context
    ctx = extract(task.get("_trace_context", {}))

    with trace.get_tracer("sub-agent").start_as_current_span(
        "sub-agent.execute",
        context=ctx  # Parent context attached here
    ) as span:
        span.set_attribute("agent.task_type", task.get("type"))
        # Now all child spans will be nested under the parent trace
        execute_task(task)
```

For multi-agent systems that communicate via the [Google A2A protocol](/blog/multi-agent-orchestration-product-architecture), trace context propagation is built into the protocol spec. If you're using A2A, instrument the `x-trace-context` extension header.

---

## Cost tracking per agent run {#cost-tracking}

Cost tracking is where most teams have massive blind spots. An individual agent call might cost $0.02 in LLM tokens. Running that agent 10,000 times per day at $0.02 is $200/day, $6,000/month — manageable. But agents that loop unexpectedly, call expensive models when cheap ones would suffice, or spawn sub-agents unnecessarily can blow up costs 10x overnight.

Track costs at three levels:

**Per LLM call:** Input tokens × input price + output tokens × output price. Simple, but you need to get the pricing right per model version and account for batching discounts.

**Per agent session:** Sum all LLM calls + tool calls (if tools have cost, e.g., Serper for web search) in a single agent run. This is the unit that maps to your product's cost per task.

**Per user/tenant:** Aggregate session costs by user for billing, quota management, and anomaly detection.

```python
import dataclasses
from typing import Optional

# Current pricing (March 2026) — update when models change
MODEL_COSTS = {
    "gpt-4o": {"input": 2.50, "output": 10.00},       # per 1M tokens
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-5-haiku": {"input": 0.80, "output": 4.00},
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
}

@dataclasses.dataclass
class TokenUsage:
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        # Unknown models silently fall back to zero cost; log or raise in prod
        pricing = MODEL_COSTS.get(self.model, {"input": 0.0, "output": 0.0})
        return (
            (self.input_tokens / 1_000_000) * pricing["input"] +
            (self.output_tokens / 1_000_000) * pricing["output"]
        )


@dataclasses.dataclass
class AgentSessionCost:
    session_id: str
    user_id: str
    llm_calls: list[TokenUsage]
    tool_costs: dict[str, float]  # tool_name -> cost

    @property
    def total_llm_cost(self) -> float:
        return sum(call.cost_usd for call in self.llm_calls)

    @property
    def total_tool_cost(self) -> float:
        return sum(self.tool_costs.values())

    @property
    def total_cost(self) -> float:
        return self.total_llm_cost + self.total_tool_cost

    def to_span_attributes(self) -> dict:
        return {
            "cost.total_usd": self.total_cost,
            "cost.llm_usd": self.total_llm_cost,
            "cost.tool_usd": self.total_tool_cost,
            "cost.llm_calls_count": len(self.llm_calls),
        }
```

Set cost alerts at three thresholds:
- **Per-session warning:** >$0.50/run for what should be a cheap task
- **Per-session critical:** >$2.00/run (investigate before it scales)
- **Daily spend anomaly:** >2x rolling average (something is looping unexpectedly)

Helicone has the best out-of-box cost tracking — it handles model pricing automatically and provides cost dashboards without custom instrumentation. For teams that don't want to build this, start with Helicone for cost visibility and layer in more detailed agent tracing with another tool.

---

## Debugging multi-agent coordination failures {#debugging-failures}

Multi-agent systems fail in ways that are difficult to diagnose without traces. The most common failure modes we see in production:

**Circular task delegation.** Agent A delegates to Agent B. Agent B, because of ambiguous task definition, delegates back to Agent A. Without tracing showing the delegation chain, this looks like high latency followed by a timeout.

**Context window overflow in handoffs.** Agent A passes a 50,000-token context to Agent B. Agent B's context window is 32,000 tokens. The handoff fails silently — Agent B processes a truncated context and returns a nonsensical result. The orchestrator accepts it because it's syntactically valid.

**Tool output hallucination.** The LLM "remembers" a tool call from earlier in the session and fabricates a result for a second call to the same tool. The tool was never called. Without tracing tool calls explicitly, this is invisible.

**Race conditions in parallel agents.** Two agents write to the same resource (database row, file) simultaneously. One overwrites the other's work. The final result is missing half the expected output.
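Circular delegation in particular is cheap to prevent rather than debug: thread the delegation chain through every handoff and refuse cycles at dispatch time. A sketch, with all names illustrative:

```python
class CircularDelegationError(RuntimeError):
    pass

def extend_delegation_chain(chain: list[str], target: str,
                            max_depth: int = 5) -> list[str]:
    """Refuse handoffs that revisit an agent or exceed a depth budget.
    Pass the returned chain along with the delegated task payload."""
    if target in chain:
        raise CircularDelegationError("cycle: " + " -> ".join(chain + [target]))
    if len(chain) >= max_depth:
        raise CircularDelegationError(f"depth {len(chain)} exceeds {max_depth}")
    return chain + [target]
```

The raised error also gives you a readable delegation path for the trace, instead of a bare timeout.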

For each of these, the debugging pattern is the same: find the span where expected behavior diverges from actual behavior.

```mermaid
flowchart TD
    START([Agent Run Failed]) --> Q1{Is there a complete trace?}

    Q1 -->|No| FIX1[Add trace propagation to agent handoffs]
    Q1 -->|Yes| Q2{Where does the trace end?}

    Q2 --> OPTS[Check last recorded span]
    OPTS --> OPT1{LLM call span}
    OPTS --> OPT2{Tool call span}
    OPTS --> OPT3{Agent handoff span}
    OPTS --> OPT4{No span - silent failure}

    OPT1 --> Q3{Finish reason?}
    Q3 -->|length truncated| FIX2[Context too long — reduce input or switch model]
    Q3 -->|content_filter| FIX3[Guardrail blocked output — review prompt]
    Q3 -->|stop normal| Q4{Check tool calls requested}
    Q4 -->|Tool called but no tool span| FIX4[Tool call not being executed — check dispatch logic]

    OPT2 --> Q5{Tool error?}
    Q5 -->|Auth/permission error| FIX5[Tool credential issue]
    Q5 -->|Timeout| FIX6[Tool latency exceeds agent timeout]
    Q5 -->|Malformed output| FIX7[Parse tool output before passing to LLM]

    OPT3 --> Q6{Handoff target reachable?}
    Q6 -->|No| FIX8[Sub-agent registration/routing issue]
    Q6 -->|Yes but no response| FIX9[Context not propagated — check trace context headers]

    OPT4 --> FIX10[Add spans around agent initialization and task dispatch]

    FIX1 & FIX2 & FIX3 & FIX4 & FIX5 & FIX6 & FIX7 & FIX8 & FIX9 & FIX10 --> RERUN[Reproduce with detailed logging enabled]
```

The practical debugging workflow is:

1. Filter traces by error status in your observability platform
2. Find the last successful span before failure
3. Check inputs and outputs of that span for anomalies
4. Look at the parent span context — what was the agent's state when it made that decision?
5. Check for unexpectedly high loop counts (agent looped 20 times when it should have looped 3)
6. Compare against a successful trace for the same task type — find the divergence point

AgentOps has the best UX for this workflow because it surfaces agent-specific events (loops, handoffs, tool failures) as first-class objects, not just generic spans.

---

## Alerting on quality degradation {#alerting-quality}

Uptime alerts are easy — your agent either errors or it doesn't. Quality alerts are harder because your agent might return an output that is syntactically valid but semantically wrong, and you won't know without evaluation.

The approaches, in order of sophistication:

**Latency proxy.** If your agent's P95 latency suddenly increases from 5 seconds to 45 seconds, something changed — probably the agent is looping more, which often correlates with lower quality output. Latency is a leading indicator of quality problems.

**Loop count monitoring.** Track how many times your agent loops per session. Baseline your typical loop count (e.g., mean 4.2 loops, P95 8 loops). Alert if P95 exceeds 3x baseline — the agent is struggling.

**Tool error rate.** Track tool call failures as a percentage of total tool calls. Rising tool error rate often means the agent is attempting tool calls with malformed arguments — a sign the reasoning is degrading.

**LLM-as-judge sampling.** On 5-10% of production sessions, run an automated evaluator that scores the output quality. Alert if quality score drops below your threshold. This is expensive but gives you a real quality signal.

**User feedback signals.** If your product has any user feedback mechanism (thumbs up/down, regenerate button), instrument it. A rising regeneration rate is a direct quality signal.

```python
import statistics
from dataclasses import dataclass

@dataclass
class AgentQualityMetrics:
    session_id: str
    loop_count: int
    tool_error_rate: float
    total_latency_ms: float
    output_length: int

class QualityAlerter:
    def __init__(self, baseline_window: int = 50):
        self.baseline_window = baseline_window
        self.loop_count_history: list[int] = []
        self.latency_history: list[float] = []
        self.tool_error_history: list[float] = []

        # Alert thresholds
        self.loop_count_multiplier = 3.0
        self.latency_multiplier = 2.0
        self.tool_error_abs_threshold = 0.15  # 15% error rate

    def record_session(self, metrics: AgentQualityMetrics) -> list[str]:
        """Record session metrics and return list of triggered alerts."""
        alerts = []
        window = self.baseline_window

        if len(self.loop_count_history) >= window:
            baseline_loops = statistics.mean(self.loop_count_history[-window:])
            if metrics.loop_count > baseline_loops * self.loop_count_multiplier:
                alerts.append(
                    f"HIGH_LOOP_COUNT: {metrics.loop_count} loops "
                    f"(baseline {baseline_loops:.1f})"
                )

        if len(self.latency_history) >= window:
            baseline_latency = statistics.median(self.latency_history[-window:])
            if metrics.total_latency_ms > baseline_latency * self.latency_multiplier:
                alerts.append(
                    f"HIGH_LATENCY: {metrics.total_latency_ms:.0f}ms "
                    f"(baseline median {baseline_latency:.0f}ms)"
                )

        if metrics.tool_error_rate > self.tool_error_abs_threshold:
            alerts.append(
                f"HIGH_TOOL_ERROR_RATE: {metrics.tool_error_rate:.1%}"
            )

        # Update history after checking, so the current session
        # does not shift its own baseline
        self.loop_count_history.append(metrics.loop_count)
        self.latency_history.append(metrics.total_latency_ms)
        self.tool_error_history.append(metrics.tool_error_rate)

        return alerts
```

Connect your alerter to PagerDuty, Slack, or whatever your on-call tooling is. Quality alerts should be lower severity than uptime alerts by default — you want them in a Slack channel, not waking someone up at 3am, unless quality degradation is complete (agent is producing 100% garbage).

---

## Audit logging for compliance {#audit-logging}

If you're building agents that act on user data or take external actions, you need audit logs for compliance. This matters for SOC 2, GDPR, HIPAA, and increasingly for AI-specific regulatory requirements.

An audit log for agents is different from a trace:
- **Traces** are for debugging — they capture everything including internal reasoning
- **Audit logs** are for compliance — they capture who did what, to what data, when, and with what authorization

The audit log events that matter:

| Event Type | What to Record |
|-----------|---------------|
| `agent.session.start` | User ID, session ID, task description hash, timestamp |
| `agent.tool.execute` | Tool name, parameters (PII-scrubbed), user context, auth token used |
| `agent.data.read` | Data source, record IDs accessed, sensitivity classification |
| `agent.data.write` | Data source, record IDs modified, before/after hash |
| `agent.external.call` | External service, endpoint, request hash |
| `agent.session.end` | Session ID, outcome, total duration, data accessed summary |

For [security considerations in agentic systems](/blog/saas-security-agentic-threats), audit logs also need to capture when agents were granted elevated permissions, when guardrails fired, and when the agent's action was blocked.

```python
import hashlib
import json
from datetime import datetime, timezone

class AgentAuditLogger:
    """Compliance-grade audit logging for agent actions."""

    PII_FIELDS = {"email", "phone", "ssn", "credit_card", "password", "token"}

    def __init__(self, audit_store):
        self.store = audit_store

    def log_tool_execution(
        self,
        session_id: str,
        user_id: str,
        tool_name: str,
        tool_input: dict,
        outcome: str
    ) -> None:
        event = {
            "event_type": "agent.tool.execute",
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "session_id": session_id,
            "user_id": user_id,
            "tool_name": tool_name,
            "tool_input_hash": self._hash_params(tool_input),
            "tool_input_sanitized": self._scrub_pii(tool_input),
            "outcome": outcome,
        }
        self.store.write(event)

    def log_data_access(
        self,
        session_id: str,
        user_id: str,
        resource_type: str,
        resource_ids: list[str],
        operation: str  # "read" | "write" | "delete"
    ) -> None:
        event = {
            "event_type": f"agent.data.{operation}",
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "session_id": session_id,
            "user_id": user_id,
            "resource_type": resource_type,
            "resource_ids": resource_ids,
            "resource_count": len(resource_ids),
        }
        self.store.write(event)

    def _scrub_pii(self, params: dict) -> dict:
        return {
            k: "[REDACTED]" if k.lower() in self.PII_FIELDS else v
            for k, v in params.items()
        }

    def _hash_params(self, params: dict) -> str:
        canonical = json.dumps(params, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

Audit logs must be immutable, append-only, and retained per your compliance requirements (typically 1 year for SOC 2, 7 years for some financial regulations). Store them separately from your trace data — traces can be deleted after their debugging value expires, but audit logs are legal records.

---

## Replay and time-travel debugging {#replay-debugging}

The most powerful debugging capability for agents is session replay — the ability to replay a failed or anomalous session step by step, seeing exactly what the agent saw at each decision point.

LangSmith has session replay built in for LangChain/LangGraph runs. AgentOps has a session replay UI that shows events on a timeline. Both let you step through the agent's execution and inspect the state at each step.

For custom agent frameworks, you can implement basic replay by logging enough state:

```python
import dataclasses
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

@dataclass
class AgentStateSnapshot:
    step_index: int
    timestamp: str
    agent_id: str
    step_type: str  # "llm_call", "tool_call", "handoff", "decision"
    input_state: dict
    output_state: dict
    llm_messages: list | None  # Full message history for LLM calls
    tool_name: str | None
    tool_args: dict | None
    tool_result: Any | None

class AgentReplayRecorder:
    """Records agent execution for replay debugging."""

    def __init__(self, session_id: str, replay_dir: Path):
        self.session_id = session_id
        self.replay_dir = replay_dir
        self.replay_dir.mkdir(parents=True, exist_ok=True)
        self.snapshots: list[AgentStateSnapshot] = []
        self.step_counter = 0

    def record_llm_call(
        self,
        agent_id: str,
        messages: list,
        response: dict,
        agent_state: dict
    ) -> None:
        snapshot = AgentStateSnapshot(
            step_index=self.step_counter,
            timestamp=datetime.now(timezone.utc).isoformat(),
            agent_id=agent_id,
            step_type="llm_call",
            input_state=agent_state.copy(),
            output_state={**agent_state, "last_llm_response": response},
            llm_messages=messages,
            tool_name=None,
            tool_args=None,
            tool_result=None,
        )
        self.snapshots.append(snapshot)
        self.step_counter += 1

    def record_tool_call(
        self,
        agent_id: str,
        tool_name: str,
        tool_args: dict,
        tool_result: Any,
        agent_state: dict
    ) -> None:
        snapshot = AgentStateSnapshot(
            step_index=self.step_counter,
            timestamp=datetime.now(timezone.utc).isoformat(),
            agent_id=agent_id,
            step_type="tool_call",
            input_state=agent_state.copy(),
            output_state={**agent_state, "last_tool_result": str(tool_result)[:500]},
            llm_messages=None,
            tool_name=tool_name,
            tool_args=tool_args,
            tool_result=tool_result,
        )
        self.snapshots.append(snapshot)
        self.step_counter += 1

    def flush(self) -> Path:
        """Write all snapshots to disk for replay."""
        output_path = self.replay_dir / f"{self.session_id}.replay.json"
        with open(output_path, "w") as f:
            json.dump(
                [dataclasses.asdict(s) for s in self.snapshots],
                f,
                indent=2,
                default=str
            )
        return output_path
```

Time-travel debugging extends replay by letting you re-inject different inputs at any step — "what would have happened if the web search returned different results at step 3?" This is valuable for reproducing rare failure modes that are hard to trigger in staging.

Implementing true time-travel debugging requires your agent to be stateless and deterministic given the same input state — which most LLM-based agents are not (temperature > 0). You can approximate it by fixing temperature to 0 in replay mode and injecting known tool outputs.
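One way to sketch that approximation: in replay mode, the tool dispatcher serves recorded results instead of executing tools, with an override map for the "what if step 3 returned something else" case. The dispatcher itself is hypothetical; the snapshot fields match the `AgentReplayRecorder` format above.

```python
from typing import Any, Callable

class ReplayToolDispatcher:
    """Serves recorded tool results during replay instead of executing tools.

    `snapshots` is a list of recorded steps in execution order, each with
    "step_type", "tool_name", and "tool_result" keys. `overrides` lets you
    re-inject a different result at a given tool-call index for what-if
    debugging.
    """

    def __init__(self, snapshots: list[dict], live_tools: dict[str, Callable]):
        self.recorded = [s for s in snapshots if s["step_type"] == "tool_call"]
        self.live_tools = live_tools
        self.cursor = 0
        self.overrides: dict[int, Any] = {}  # tool-call index -> injected result

    def call(self, tool_name: str, **kwargs) -> Any:
        if self.cursor < len(self.recorded):
            step = self.recorded[self.cursor]
            result = self.overrides.get(self.cursor, step["tool_result"])
            self.cursor += 1
            return result
        # Past the end of the recording: fall through to the live tool.
        return self.live_tools[tool_name](**kwargs)
```

Combined with temperature 0, this gets you deterministic-enough replays to reproduce most failure modes, with the caveat that LLM outputs may still vary slightly between runs.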

---

## The 7 production metrics you must track {#7-metrics}

These are the minimum metrics for any production agent system. Build your dashboard around these first:

```mermaid
graph LR
    subgraph "Session Metrics"
        M1["1. Session Success Rate<br/>(task completion %)"]
        M2["2. Session P50/P95/P99 Latency<br/>(wall-clock duration)"]
        M3["3. Cost per Session<br/>(total tokens + tools)"]
    end

    subgraph "Step Metrics"
        M4["4. Loop Count Distribution<br/>(mean + P95 per task type)"]
        M5["5. Tool Error Rate<br/>(% of tool calls that fail)"]
        M6["6. LLM Error Rate<br/>(% of LLM calls that fail or truncate)"]
    end

    subgraph "Quality Metrics"
        M7["7. Quality Score<br/>(LLM-as-judge on sampled sessions)"]
    end

    subgraph "Derived Alerts"
        A1["Cost anomaly > 2x baseline"]
        A2["Success rate < threshold"]
        A3["P95 latency > SLA"]
        A4["Tool error rate > 15%"]
        A5["Quality score drop > 10pts"]
    end

    M1 --> A2
    M2 --> A3
    M3 --> A1
    M5 --> A4
    M7 --> A5
```

**Metric 1: Session success rate.** What percentage of agent sessions complete the intended task? You need to define "success" explicitly — it might be `task_completed=true` in your agent's final output, the user accepting the result, or no exception raised. Start with programmatic success (no exception) and layer in semantic success over time.

**Metric 2: Session latency (P50/P95/P99).** Wall-clock time from user input to final output. P50 tells you typical performance. P95 tells you what slow sessions look like. P99 tells you your worst cases. For most agent use cases, P95 < 30 seconds is a reasonable starting target.

**Metric 3: Cost per session.** Total LLM tokens + tool costs per completed session. Track this by task type (a "summarize document" task should cost less than a "research and write report" task). These are your product's unit economics.

**Metric 4: Loop count distribution.** How many reasoning cycles does your agent take per session? Track mean and P95 per task type. Rising P95 loop counts are a leading indicator of quality problems or prompt regressions.

**Metric 5: Tool error rate.** Percentage of tool calls that return an error. Baseline this metric — some tool errors are normal (network timeouts, rate limits). Alert if it exceeds 15-20%.

**Metric 6: LLM error rate.** Percentage of LLM calls that fail, hit rate limits, or return truncated responses (finish_reason=length). This should be close to 0% in a well-configured system. If it's rising, check your token budgets and model availability.

**Metric 7: Quality score.** Periodic automated evaluation of output quality. Run an LLM-as-judge prompt on a sample of completed sessions (5-10%), score 1-10, track the rolling average. This is the only metric that catches "technically succeeded but output was garbage" failures.
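Sampling 5-10% of sessions can be done deterministically by hashing the session ID, so a given session either is or isn't judged regardless of which worker handles it — no shared random state needed. A minimal sketch (the judge itself is out of scope here):

```python
import hashlib

def should_judge(session_id: str, sample_pct: float = 5.0) -> bool:
    """Deterministically select ~sample_pct% of sessions for evaluation.

    Hashing the session ID makes the decision stable across workers and
    retries: the same session always lands in the same bucket.
    """
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 10000  # 0..9999
    return bucket < sample_pct * 100
```

Gate your LLM-as-judge call on `should_judge(session_id)` and feed the scores into a rolling average per task type.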

---

## The "Datadog for agents" problem {#datadog-problem}

Here is the uncomfortable truth about agent observability in 2026: no single tool has solved the full problem. Every platform has meaningful gaps.

**The fragmentation problem.** Six platforms, none of which is clearly dominant. LangSmith has the deepest LangChain integration. Helicone has the best cost visibility. AgentOps has the best multi-agent replay. Braintrust has the best evals. Phoenix has the best OTEL support. Teams end up combining two or three platforms, which means managing multiple vendor relationships, multiple data pipelines, and multiple UIs.

**The framework lock-in problem.** The best observability experience you can have is with native integrations (LangSmith for LangChain, AgentOps for CrewAI). But those integrations only work well with specific frameworks. If you build a custom agent without a framework, you're writing instrumentation code from scratch.

**The non-determinism problem.** Every observability tool is built on the assumption that two requests with the same input produce comparable outputs. For agents, this is not true — temperature, reasoning paths, and tool results introduce variance. You can't do simple request diffing to detect regressions.

**The semantic gap problem.** A Datadog trace tells you a service was slow or errored. An agent trace tells you... an LLM returned a response. Whether that response was good or useful requires evaluation, which is a fundamentally different problem from telemetry. None of the platforms has fully bridged this gap between "what happened" and "was it good."

**The cost problem.** Comprehensive agent observability — logging every prompt, every completion, every tool call — is itself expensive. High-volume systems can spend $2,000-5,000/month on observability costs alone. Some teams end up sampling heavily, which creates blind spots.

The market will consolidate. The likely winner is either:
- A company that solves the OTEL standardization problem and becomes the "infrastructure layer" that all platforms sit on top of
- A platform that nails the eval + observability combination (Braintrust is closest to this today)
- One of the existing APM giants (Datadog, Honeycomb, Grafana) that adds comprehensive LLM/agent support

For now, our recommended stack is:
1. **Instrument with OTEL** from day one (don't couple to a vendor)
2. **Use Langfuse** (self-hosted) for trace storage and visualization if you need data sovereignty
3. **Add Helicone** for cost tracking (minimal integration, great visibility)
4. **Use Braintrust** for evals and quality measurement
5. **Roll your own audit logging** for compliance — don't trust this to a third party

As you scale your [AI agents platform](/blog/ai-agents-replacing-saas) beyond initial deployment, the observability investment pays back in faster debugging cycles and the ability to make confident changes to your agent prompts and architectures.

If you're tracking [product data observability](/blog/product-data-observability) alongside agent observability, note that they have different requirements — product analytics is about user behavior patterns, agent observability is about individual execution correctness. Keep them in separate systems.

The agents category is moving toward [compound workflow platforms](/blog/multi-agent-orchestration-product-architecture) where observability is increasingly built into the orchestration layer. Google's A2A protocol, LangGraph's Cloud platform, and Anthropic's Claude Agent SDK are all building observability as a first-class feature. In 18-24 months, the current state of "pick 3 vendors and glue them together" will likely give way to integrated observability within the major orchestration frameworks.

Until then, build your instrumentation layer on OTEL, own your data, and invest in the eval infrastructure that actually tells you if your agents are working.

---

## Frequently asked questions {#faq}

**Q: Do I need observability for a simple single-agent system, or is it only for multi-agent workflows?**

You need basic observability for any production agent, even a single-agent system. At minimum: cost tracking, latency monitoring, and error rate. Multi-agent systems need the full stack because coordination failures are harder to debug without traces.

**Q: How much will agent observability tooling cost?**

Rough estimates: Helicone cloud free tier handles ~1,000 requests/month; paid starts at $50/month. LangSmith charges by trace volume — roughly $0.005-0.01 per trace at volume. Langfuse self-hosted is free. Braintrust charges by eval runs + trace volume. For a product doing 10,000 agent sessions/month, budget $200-500/month for observability tooling.

**Q: Should I build my own observability or use a platform?**

Build your own audit logging (compliance is too important to outsource). Buy observability tooling for traces, cost tracking, and evals — the value-to-build ratio is not there. The platforms are improving fast enough that building custom tooling today is likely to be replaced by mature SaaS solutions within 12 months.

**Q: How do I trace agents that run asynchronously or in background jobs?**

Use OTEL context propagation headers. When you dispatch an async task, serialize the current trace context into the job payload. When the job executes, deserialize and restore the trace context before creating any spans. Both Python's `opentelemetry-sdk` and the Node.js `@opentelemetry/api` SDK support this pattern.
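A library-free sketch of the shape of that pattern — in real code you would use `opentelemetry.propagate.inject`/`extract` rather than building the carrier by hand, but the W3C `traceparent` format shown is the actual wire format:

```python
import json

def enqueue_with_context(task: dict, trace_id: str, span_id: str) -> str:
    """Dispatcher side: serialize the current trace context into the payload.

    With OpenTelemetry: carrier = {}; propagate.inject(carrier).
    """
    carrier = {"traceparent": f"00-{trace_id}-{span_id}-01"}
    return json.dumps({"task": task, "otel_context": carrier})

def run_job(payload: str) -> dict:
    """Worker side: restore the trace context before creating any spans.

    With OpenTelemetry: ctx = propagate.extract(job["otel_context"]), then
    tracer.start_as_current_span("job", context=ctx).
    """
    job = json.loads(payload)
    _, trace_id, parent_span_id, _ = job["otel_context"]["traceparent"].split("-")
    return {"trace_id": trace_id, "parent_span_id": parent_span_id}
```

The key point is that the context rides inside the job payload itself, so the worker's spans attach to the dispatching trace even across queues and processes.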

**Q: What's the difference between a trace and a session in agent observability?**

A trace is a technical unit: a tree of spans tracking a specific execution path. A session is a business unit: a complete agent run representing one user task, which may span multiple traces (e.g., if the agent retries). Some platforms (AgentOps) surface sessions as the primary unit. Others (LangSmith) surface traces. Map your business events (task started, task completed, task failed) to sessions, and let traces be the implementation detail underneath.

**Q: How do I handle PII in agent traces?**

Before logging, scrub fields that contain personal data (names, emails, phone numbers, IDs). For LLM prompts and completions, use a PII detection model (AWS Comprehend, Microsoft Presidio, or similar) to redact before logging. Never log raw user content unless you have explicit consent and appropriate data handling agreements. For GDPR compliance, ensure trace data is stored in your user's region and can be deleted on request.
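Field-name scrubbing (as in the audit logger earlier) misses PII embedded in free text, which is why prompts and completions need pattern- or model-based redaction. A minimal regex sketch — these patterns are illustrative and will miss many real-world formats, which is exactly why Presidio or a managed service is the better production choice:

```python
import re

# Illustrative patterns only — production systems should use a PII
# detection service; regexes like these have both misses and false hits.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
]

def redact_text(text: str) -> str:
    """Redact obvious PII patterns from free text before logging it."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Run this over every prompt and completion in your logging pipeline, before the payload leaves your process.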

**Q: How do I know if my agent observability tooling is actually working?**

Deliberately break your agent in staging and verify that the failure shows up in your observability platform within 60 seconds. Check that cost tracking captures the failure scenario correctly. Verify that the trace shows the step where the failure occurred. Run this "observability smoke test" after every major configuration change.

**Q: What model costs should I account for in cost tracking?**

Input tokens, output tokens, and (if used) embedding tokens. Some tools also have image/vision token costs. Additionally, account for tool costs: web search APIs (Serper ~$0.001/query), code execution environments (E2B ~$0.10/hour), and any external APIs your agent calls. The LLM token cost is usually 60-80% of total agent cost, but tool costs matter at scale.

---

*If you're evaluating the full agent development stack — including [multi-agent orchestration patterns](/blog/multi-agent-orchestration-product-architecture) and [security for agentic systems](/blog/saas-security-agentic-threats) — observability is the layer that makes everything else debuggable. Build it early.*