Observability for AI Agent Systems
Build observability for AI agents with structured logging, OpenTelemetry tracing, metrics collection, and dashboards in Node.js.
AI agents are notoriously difficult to debug. They make autonomous decisions, call tools in unpredictable sequences, and fail in ways that traditional application monitoring never anticipated. This article covers how to build real observability into your agent systems using structured logging, distributed tracing with OpenTelemetry, metrics collection, and dashboards — all in Node.js with practical, production-tested patterns.
Prerequisites
- Node.js v18 or later
- Familiarity with Express.js
- Basic understanding of AI agent patterns (tool calling, multi-step reasoning)
- An OpenTelemetry-compatible backend (Jaeger, Grafana Tempo, or similar) for tracing
- npm packages:
@opentelemetry/sdk-node, @opentelemetry/api, winston, prom-client, express
Install the dependencies:
npm install @opentelemetry/sdk-node @opentelemetry/api @opentelemetry/sdk-trace-node @opentelemetry/exporter-trace-otlp-http @opentelemetry/resources @opentelemetry/semantic-conventions winston prom-client express
The Three Pillars of Observability Applied to Agents
Traditional observability revolves around three pillars — logs, metrics, and traces. Agent systems need all three, but each one takes on a different character than what you are used to in a typical web application.
Logs in agent systems are not just request/response pairs. Every reasoning step, every tool call decision, every prompt sent to the LLM, and every response parsed back — these all need structured log entries. An agent might take 15 steps to answer a single user question. Without structured logs correlating those steps, you are flying blind.
Metrics go beyond latency and throughput. You need to track completion rates, average step counts per task, token consumption per step, cost accumulation, tool call frequency distributions, and error rates broken down by step type. An agent that suddenly starts averaging 30 steps instead of 8 is a problem even if it is not throwing errors.
Traces are where agent observability gets genuinely interesting. A single agent run is a tree of spans: the root span is the overall task, child spans are individual steps, and each step might have sub-spans for LLM calls, tool executions, and result parsing. Distributed tracing lets you see the entire execution flow, spot bottlenecks, and understand why an agent took the path it did.
Structured Logging for Agent Steps
The first thing to get right is structured logging. Every log entry from your agent should be a JSON object with a correlation ID that ties it back to the agent run. Here is a logging module built on Winston that does this properly:
var winston = require("winston");
var crypto = require("crypto");
function createAgentLogger(options) {
var serviceName = options.serviceName || "agent-service";
var logger = winston.createLogger({
level: options.level || "info",
format: winston.format.combine(
winston.format.timestamp(), // defaults to ISO 8601
winston.format.json()
),
defaultMeta: { service: serviceName },
transports: [
new winston.transports.Console(),
new winston.transports.File({
filename: "logs/agent-errors.log",
level: "error",
maxsize: 50 * 1024 * 1024,
maxFiles: 10
}),
new winston.transports.File({
filename: "logs/agent-combined.log",
maxsize: 100 * 1024 * 1024,
maxFiles: 20
})
]
});
return logger;
}
function generateRunId() {
return "run_" + crypto.randomBytes(8).toString("hex");
}
function generateStepId() {
return "step_" + crypto.randomBytes(6).toString("hex");
}
function logAgentStep(logger, context, stepType, data) {
var entry = {
runId: context.runId,
stepId: generateStepId(),
stepNumber: context.stepCount,
stepType: stepType,
agentName: context.agentName,
taskId: context.taskId,
elapsedMs: Date.now() - context.startTime,
timestamp: new Date().toISOString()
};
Object.keys(data).forEach(function (key) {
entry[key] = data[key];
});
logger.info("agent_step", entry);
return entry;
}
module.exports = {
createAgentLogger: createAgentLogger,
generateRunId: generateRunId,
generateStepId: generateStepId,
logAgentStep: logAgentStep
};
This gives you log entries like:
{
"level": "info",
"message": "agent_step",
"service": "agent-service",
"timestamp": "2026-02-11T14:32:01.445Z",
"runId": "run_a3f8c91d2e4b7106",
"stepId": "step_c8e1f3a92b01",
"stepNumber": 3,
"stepType": "tool_call",
"agentName": "research-agent",
"taskId": "task_001",
"elapsedMs": 2341,
"toolName": "web_search",
"toolInput": { "query": "Node.js memory leak detection" },
"durationMs": 812
}
The runId is your correlation ID. Every log entry from a single agent run shares it. When something goes wrong, you filter by runId and see the entire sequence of decisions the agent made.
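Because the File transport above writes newline-delimited JSON, pulling a single run back out is only a few lines of code. Here is a minimal sketch; the log path matches the combined transport above, and printRun is a helper introduced here for illustration:
var fs = require("fs");
var readline = require("readline");

// Stream the combined log and print only the entries for one run.
function printRun(logFile, runId) {
  var rl = readline.createInterface({
    input: fs.createReadStream(logFile),
    crlfDelay: Infinity
  });
  rl.on("line", function (line) {
    var entry;
    try {
      entry = JSON.parse(line);
    } catch (e) {
      return; // skip any non-JSON lines
    }
    if (entry.runId === runId) {
      console.log(
        "#" + entry.stepNumber + " " + entry.stepType +
        " +" + entry.elapsedMs + "ms" +
        (entry.toolName ? " -> " + entry.toolName : "")
      );
    }
  });
}

// Usage: node filter-run.js run_a3f8c91d2e4b7106
printRun("logs/agent-combined.log", process.argv[2]);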
Distributed Tracing for Multi-Step Agent Workflows
Structured logs tell you what happened. Traces tell you how it all fits together in time. OpenTelemetry is the standard here, and it maps beautifully onto agent execution. Here is how to set up tracing for an agent system:
var opentelemetry = require("@opentelemetry/api");
var { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
var { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
var { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-node");
var { Resource } = require("@opentelemetry/resources");
var { ATTR_SERVICE_NAME } = require("@opentelemetry/semantic-conventions");
function initTracing(serviceName, exporterUrl) {
var attributes = {};
attributes[ATTR_SERVICE_NAME] = serviceName;
var resource = new Resource(attributes);
var provider = new NodeTracerProvider({ resource: resource });
var exporter = new OTLPTraceExporter({
url: exporterUrl || "http://localhost:4318/v1/traces"
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
return opentelemetry.trace.getTracer(serviceName);
}
function startAgentRunSpan(tracer, agentName, taskDescription) {
var span = tracer.startSpan("agent.run", {
attributes: {
"agent.name": agentName,
"agent.task": taskDescription,
"agent.start_time": Date.now()
}
});
var ctx = opentelemetry.trace.setSpan(
opentelemetry.context.active(),
span
);
return { span: span, context: ctx };
}
function startStepSpan(tracer, parentContext, stepType, stepNumber, attributes) {
var spanName = "agent.step." + stepType;
var span = tracer.startSpan(
spanName,
{
attributes: Object.assign(
{
"agent.step.type": stepType,
"agent.step.number": stepNumber
},
attributes || {}
)
},
parentContext
);
var ctx = opentelemetry.trace.setSpan(parentContext, span);
return { span: span, context: ctx };
}
module.exports = {
initTracing: initTracing,
startAgentRunSpan: startAgentRunSpan,
startStepSpan: startStepSpan
};
When you visualize these traces in Jaeger or Grafana Tempo, you get a waterfall view of your agent's execution. The root span covers the entire run, child spans show each step, and you can instantly see which step consumed the most time, where errors occurred, and how the agent branched between reasoning and tool use.
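To make the span lifecycle concrete, here is a sketch of how these helpers might be wired into a single run. The ./agent-tracing path and the callTool function are assumptions standing in for your own module layout and tool layer; the important part is ending every span and recording exceptions on failures:
var { SpanStatusCode } = require("@opentelemetry/api");
var tracing = require("./agent-tracing"); // assumed filename for the module above

var tracer = tracing.initTracing("research-agent", "http://localhost:4318/v1/traces");

async function runTracedTask(taskDescription) {
  var run = tracing.startAgentRunSpan(tracer, "research-agent", taskDescription);
  try {
    var step = tracing.startStepSpan(tracer, run.context, "tool_call", 1, {
      "agent.tool.name": "web_search"
    });
    try {
      await callTool("web_search", { query: taskDescription }); // placeholder
      step.span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      step.span.recordException(err);
      step.span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      step.span.end(); // always end child spans, even on failure
    }
  } finally {
    run.span.end(); // the root span closes when the whole run finishes
  }
}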
Key Metrics for Agent Systems
Standard web application metrics are not sufficient for agents. Here are the metrics that actually matter, implemented with prom-client:
var promClient = require("prom-client");
function createAgentMetrics(prefix) {
var p = prefix || "agent";
var metrics = {
runDuration: new promClient.Histogram({
name: p + "_run_duration_seconds",
help: "Total duration of agent runs in seconds",
labelNames: ["agent_name", "status"],
buckets: [1, 5, 10, 30, 60, 120, 300, 600]
}),
stepCount: new promClient.Histogram({
name: p + "_step_count",
help: "Number of steps per agent run",
labelNames: ["agent_name", "status"],
buckets: [1, 3, 5, 10, 15, 20, 30, 50]
}),
stepDuration: new promClient.Histogram({
name: p + "_step_duration_seconds",
help: "Duration of individual agent steps",
labelNames: ["agent_name", "step_type"],
buckets: [0.1, 0.5, 1, 2, 5, 10, 30]
}),
tokenUsage: new promClient.Counter({
name: p + "_token_usage_total",
help: "Total tokens consumed by agent LLM calls",
labelNames: ["agent_name", "token_type"]
}),
costAccumulated: new promClient.Counter({
name: p + "_cost_dollars_total",
help: "Accumulated cost of agent runs in dollars",
labelNames: ["agent_name", "model"]
}),
toolCalls: new promClient.Counter({
name: p + "_tool_calls_total",
help: "Total number of tool calls made by agents",
labelNames: ["agent_name", "tool_name", "status"]
}),
toolCallDuration: new promClient.Histogram({
name: p + "_tool_call_duration_seconds",
help: "Duration of individual tool calls",
labelNames: ["agent_name", "tool_name"],
buckets: [0.05, 0.1, 0.5, 1, 2, 5, 10, 30]
}),
completionRate: new promClient.Counter({
name: p + "_runs_total",
help: "Total agent runs by completion status",
labelNames: ["agent_name", "status"]
}),
activeRuns: new promClient.Gauge({
name: p + "_active_runs",
help: "Currently active agent runs",
labelNames: ["agent_name"]
}),
errorsByType: new promClient.Counter({
name: p + "_errors_total",
help: "Agent errors by type",
labelNames: ["agent_name", "error_type"]
})
};
return metrics;
}
module.exports = { createAgentMetrics: createAgentMetrics };
The stepCount histogram is particularly valuable. If your agent normally completes tasks in 5-8 steps and suddenly starts averaging 20, you have a regression in your prompt or a change in the LLM's behavior. This metric catches problems that error rates alone will miss.
Implementing a Trace Context That Follows the Agent
The trace context is the object that travels with every agent step, carrying correlation IDs, accumulated metrics, and timing data. This is the backbone of your observability:
var crypto = require("crypto");
function createTraceContext(agentName, taskId, taskDescription) {
return {
runId: "run_" + crypto.randomBytes(8).toString("hex"),
agentName: agentName,
taskId: taskId || "task_" + crypto.randomBytes(4).toString("hex"),
taskDescription: taskDescription,
startTime: Date.now(),
stepCount: 0,
totalTokens: { input: 0, output: 0 },
totalCost: 0,
steps: [],
errors: [],
toolCalls: {},
status: "running"
};
}
function recordStep(ctx, stepType, data) {
ctx.stepCount++;
var step = {
stepNumber: ctx.stepCount,
stepType: stepType,
timestamp: Date.now(),
elapsedMs: Date.now() - ctx.startTime,
data: data
};
if (data.tokens) {
ctx.totalTokens.input += data.tokens.input || 0;
ctx.totalTokens.output += data.tokens.output || 0;
}
if (data.cost) {
ctx.totalCost += data.cost;
}
if (data.toolName) {
if (!ctx.toolCalls[data.toolName]) {
ctx.toolCalls[data.toolName] = { count: 0, totalDurationMs: 0 };
}
ctx.toolCalls[data.toolName].count++;
ctx.toolCalls[data.toolName].totalDurationMs += data.durationMs || 0;
}
if (data.error) {
ctx.errors.push({
stepNumber: ctx.stepCount,
error: data.error,
timestamp: step.timestamp
});
}
ctx.steps.push(step);
return step;
}
function finalizeContext(ctx, status) {
ctx.status = status || "completed";
ctx.endTime = Date.now();
ctx.totalDurationMs = ctx.endTime - ctx.startTime;
return {
runId: ctx.runId,
agentName: ctx.agentName,
taskId: ctx.taskId,
status: ctx.status,
stepCount: ctx.stepCount,
totalDurationMs: ctx.totalDurationMs,
totalTokens: ctx.totalTokens,
totalCost: ctx.totalCost,
toolCalls: ctx.toolCalls,
errorCount: ctx.errors.length,
steps: ctx.steps
};
}
module.exports = {
createTraceContext: createTraceContext,
recordStep: recordStep,
finalizeContext: finalizeContext
};
Pass this context object through every function call in your agent loop. It accumulates the full execution history, and when the run finishes — whether successfully or not — you have a complete record of everything that happened.
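Here is a sketch of what that threading looks like in practice. The ./trace-context path is an assumed filename for the module above, and callLLM and runTool are placeholders for your own LLM client and tool layer:
var tracing = require("./trace-context"); // assumed filename for the module above

async function runResearchTask(question) {
  var ctx = tracing.createTraceContext("research-agent", null, question);
  try {
    // Reasoning step: record tokens and cost from the LLM response.
    var plan = await callLLM("Plan the steps for: " + question); // placeholder
    tracing.recordStep(ctx, "llm_call", { tokens: plan.tokens, cost: plan.cost });

    // Tool step: record the tool name and how long it took.
    var started = Date.now();
    var results = await runTool("web_search", { query: question }); // placeholder
    tracing.recordStep(ctx, "tool_call", {
      toolName: "web_search",
      toolInput: { query: question },
      durationMs: Date.now() - started
    });

    return tracing.finalizeContext(ctx, "completed");
  } catch (err) {
    tracing.recordStep(ctx, "error", { error: err.message });
    return tracing.finalizeContext(ctx, "failed");
  }
}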
Visualizing Agent Execution Flows
Beyond the waterfall view in Jaeger, you will want to generate visual summaries of agent runs for debugging. Here is a function that builds a text-based execution flow from a trace context:
function renderExecutionFlow(ctx) {
var lines = [];
lines.push("=== Agent Run: " + ctx.runId + " ===");
lines.push("Agent: " + ctx.agentName);
lines.push("Task: " + ctx.taskDescription);
lines.push("Status: " + ctx.status);
lines.push("Duration: " + ctx.totalDurationMs + "ms");
lines.push("Steps: " + ctx.stepCount);
lines.push("Tokens: " + ctx.totalTokens.input + " in / " + ctx.totalTokens.output + " out");
lines.push("Cost: $" + ctx.totalCost.toFixed(4));
lines.push("");
lines.push("--- Execution Timeline ---");
ctx.steps.forEach(function (step) {
var prefix = step.data.error ? "[ERROR]" : "[OK] ";
var icon = "";
switch (step.stepType) {
case "llm_call":
icon = "LLM";
break;
case "tool_call":
icon = "TOOL";
break;
case "reasoning":
icon = "THINK";
break;
case "output":
icon = "OUT";
break;
default:
icon = "STEP";
}
var detail = "";
if (step.data.toolName) {
detail = " -> " + step.data.toolName;
}
if (step.data.durationMs) {
detail += " (" + step.data.durationMs + "ms)";
}
lines.push(
" " + prefix + " #" + step.stepNumber +
" [" + icon + "]" + detail +
" @ +" + step.elapsedMs + "ms"
);
});
if (ctx.errors.length > 0) {
lines.push("");
lines.push("--- Errors ---");
ctx.errors.forEach(function (err) {
lines.push(" Step #" + err.stepNumber + ": " + err.error);
});
}
lines.push("");
lines.push("--- Tool Usage Summary ---");
Object.keys(ctx.toolCalls).forEach(function (tool) {
var info = ctx.toolCalls[tool];
var avg = info.count > 0 ? Math.round(info.totalDurationMs / info.count) : 0;
lines.push(" " + tool + ": " + info.count + " calls, avg " + avg + "ms");
});
return lines.join("\n");
}
module.exports = { renderExecutionFlow: renderExecutionFlow };
This produces output like:
=== Agent Run: run_a3f8c91d2e4b7106 ===
Agent: research-agent
Task: Find the top 3 Node.js ORMs by GitHub stars
Status: completed
Duration: 14532ms
Steps: 7
Tokens: 3200 in / 1850 out
Cost: $0.0341
--- Execution Timeline ---
[OK] #1 [THINK] @ +12ms
[OK] #2 [TOOL] -> web_search (812ms) @ +845ms
[OK] #3 [LLM] @ +3201ms
[OK] #4 [TOOL] -> web_search (650ms) @ +4102ms
[OK] #5 [TOOL] -> web_search (723ms) @ +5580ms
[OK] #6 [LLM] @ +9321ms
[OK] #7 [OUT] @ +14520ms
--- Tool Usage Summary ---
web_search: 3 calls, avg 728ms
Alerting on Agent Anomalies
You need alerts for three classes of agent problems: stuck agents, cost spikes, and error rate increases. Here is a monitoring module that checks for these conditions:
function createAnomalyDetector(options) {
var maxRunDurationMs = options.maxRunDurationMs || 300000; // 5 minutes
var maxStepCount = options.maxStepCount || 30;
var maxCostPerRun = options.maxCostPerRun || 1.00;
var errorRateThreshold = options.errorRateThreshold || 0.2;
var activeRuns = {};
var recentResults = [];
var maxRecentResults = options.windowSize || 100;
function checkRunHealth(ctx) {
var alerts = [];
var elapsed = Date.now() - ctx.startTime;
if (elapsed > maxRunDurationMs) {
alerts.push({
type: "stuck_agent",
severity: "critical",
message: "Agent run " + ctx.runId + " has been running for " +
Math.round(elapsed / 1000) + "s (threshold: " +
Math.round(maxRunDurationMs / 1000) + "s)",
runId: ctx.runId,
agentName: ctx.agentName
});
}
if (ctx.stepCount > maxStepCount) {
alerts.push({
type: "excessive_steps",
severity: "warning",
message: "Agent run " + ctx.runId + " has taken " +
ctx.stepCount + " steps (threshold: " + maxStepCount + ")",
runId: ctx.runId,
agentName: ctx.agentName
});
}
if (ctx.totalCost > maxCostPerRun) {
alerts.push({
type: "cost_spike",
severity: "critical",
message: "Agent run " + ctx.runId + " has accumulated $" +
ctx.totalCost.toFixed(4) + " in cost (threshold: $" +
maxCostPerRun.toFixed(2) + ")",
runId: ctx.runId,
agentName: ctx.agentName
});
}
return alerts;
}
function recordResult(status) {
recentResults.push({ status: status, timestamp: Date.now() });
if (recentResults.length > maxRecentResults) {
recentResults.shift();
}
}
function checkErrorRate() {
if (recentResults.length < 10) return null;
var errorCount = recentResults.filter(function (r) {
return r.status === "failed";
}).length;
var rate = errorCount / recentResults.length;
if (rate > errorRateThreshold) {
return {
type: "high_error_rate",
severity: "critical",
message: "Agent error rate is " + (rate * 100).toFixed(1) +
"% over last " + recentResults.length + " runs (threshold: " +
(errorRateThreshold * 100) + "%)",
errorRate: rate,
windowSize: recentResults.length
};
}
return null;
}
return {
checkRunHealth: checkRunHealth,
recordResult: recordResult,
checkErrorRate: checkErrorRate
};
}
module.exports = { createAnomalyDetector: createAnomalyDetector };
Wire these alerts into your notification system — Slack webhooks, PagerDuty, email, whatever your team uses. The stuck agent alert is especially important. An agent stuck in a loop will quietly burn through your LLM budget if you do not catch it.
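As a concrete example, here is a sketch that forwards alerts to a Slack incoming webhook. SLACK_WEBHOOK_URL is a placeholder environment variable, and the global fetch available in Node.js 18+ is assumed:
// Forward anomaly alerts to a Slack incoming webhook.
function sendAlerts(alerts) {
  var url = process.env.SLACK_WEBHOOK_URL; // placeholder; set in your environment
  if (!url || alerts.length === 0) return Promise.resolve();
  var text = alerts.map(function (a) {
    return "[" + a.severity.toUpperCase() + "] " + a.type + ": " + a.message;
  }).join("\n");
  return fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: text })
  }).catch(function (err) {
    // Alert delivery failures should never crash the agent process.
    console.error("Failed to deliver alert:", err.message);
  });
}

// Example: check active runs periodically and forward anything the detector flags.
// setInterval(function () { sendAlerts(detector.checkRunHealth(ctx)); }, 30000);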
Debugging Failed Agent Runs with Trace Replay
When an agent run fails, the trace context gives you everything you need to understand why. But raw JSON is hard to read. Here is a replay function that reconstructs the conversation flow:
function replayAgentRun(traceData) {
var output = [];
output.push("REPLAY: " + traceData.runId);
output.push("Started: " + new Date(traceData.steps[0].timestamp).toISOString());
output.push("Failed at step: " + traceData.errorCount + " error(s)");
output.push("");
traceData.steps.forEach(function (step) {
output.push("--- Step " + step.stepNumber + " (" + step.stepType + ") ---");
if (step.stepType === "llm_call" && step.data.prompt) {
var promptPreview = step.data.prompt.substring(0, 200);
output.push("Prompt: " + promptPreview + (step.data.prompt.length > 200 ? "..." : ""));
}
if (step.stepType === "llm_call" && step.data.response) {
var responsePreview = step.data.response.substring(0, 200);
output.push("Response: " + responsePreview + (step.data.response.length > 200 ? "..." : ""));
}
if (step.stepType === "tool_call") {
output.push("Tool: " + step.data.toolName);
output.push("Input: " + JSON.stringify(step.data.toolInput));
if (step.data.toolOutput) {
var toolPreview = JSON.stringify(step.data.toolOutput).substring(0, 300);
output.push("Output: " + toolPreview);
}
}
if (step.data.error) {
output.push("*** ERROR: " + step.data.error + " ***");
if (step.data.errorStack) {
output.push("Stack: " + step.data.errorStack.split("\n").slice(0, 3).join("\n"));
}
}
if (step.data.tokens) {
output.push("Tokens: " + step.data.tokens.input + " in / " + step.data.tokens.output + " out");
}
output.push("");
});
return output.join("\n");
}
module.exports = { replayAgentRun: replayAgentRun };
Store your finalized trace contexts in a database or log aggregation system so you can pull them up later. When a user reports a bad agent response, grab the runId from the logs, pull the trace, and replay it. You will see exactly what the agent was thinking at every step.
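A database is the right home for these traces, but even a newline-delimited JSON file gets you replayability. Here is a sketch; the traces/ path and the findTrace helper are illustrative, not part of the modules above:
var fs = require("fs");
var path = require("path");

// Append a finalized trace context as one JSON line.
function persistTrace(ctx, filePath) {
  var file = filePath || "traces/agent-traces.jsonl";
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.appendFileSync(file, JSON.stringify(ctx) + "\n");
}

// Scan the archive for a specific runId.
function findTrace(runId, filePath) {
  var file = filePath || "traces/agent-traces.jsonl";
  var lines = fs.readFileSync(file, "utf8").split("\n");
  for (var i = 0; i < lines.length; i++) {
    if (!lines[i]) continue;
    var trace = JSON.parse(lines[i]);
    if (trace.runId === runId) return trace;
  }
  return null;
}

// Usage: console.log(replayAgentRun(findTrace("run_a3f8c91d2e4b7106")));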
Monitoring Tool Call Patterns and Durations
Tool call patterns reveal a lot about agent behavior. If your agent is supposed to call a database query tool once per run but is suddenly calling it 15 times, something is wrong with the prompt or the tool's responses. Track tool call distributions:
function analyzeToolPatterns(traceContexts) {
var patterns = {};
traceContexts.forEach(function (ctx) {
var signature = ctx.steps
.filter(function (s) { return s.stepType === "tool_call"; })
.map(function (s) { return s.data.toolName; })
.join(" -> ");
if (!patterns[signature]) {
patterns[signature] = { count: 0, avgDuration: 0, totalDuration: 0 };
}
patterns[signature].count++;
patterns[signature].totalDuration += ctx.totalDurationMs;
patterns[signature].avgDuration =
patterns[signature].totalDuration / patterns[signature].count;
});
return patterns;
}
module.exports = { analyzeToolPatterns: analyzeToolPatterns };
This gives you a view like: "The pattern web_search -> web_search -> database_query happens 45% of the time and takes an average of 8.2 seconds." If a new pattern appears that you have never seen before, that is worth investigating.
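A small helper makes those percentages easy to read. This sketch assumes you already have an array of finalized trace contexts, for example from the trace store shown later in this article:
// Print the most common tool-call patterns across recent runs.
function printTopPatterns(recentContexts, topN) {
  var patterns = analyzeToolPatterns(recentContexts);
  var total = recentContexts.length || 1;
  Object.keys(patterns)
    .sort(function (a, b) { return patterns[b].count - patterns[a].count; })
    .slice(0, topN || 5)
    .forEach(function (signature) {
      var p = patterns[signature];
      console.log(
        (signature || "(no tool calls)") + ": " +
        ((p.count / total) * 100).toFixed(1) + "% of runs, avg run duration " +
        Math.round(p.avgDuration) + "ms"
      );
    });
}

// printTopPatterns(traceStore.getRecentRuns(100), 5);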
Tracking LLM Token Usage Per Agent Step
Token tracking is essential for cost management. Every LLM call in your agent should record input and output tokens:
function trackTokenUsage(metrics, ctx, stepData) {
var inputTokens = stepData.tokens ? stepData.tokens.input : 0;
var outputTokens = stepData.tokens ? stepData.tokens.output : 0;
metrics.tokenUsage.inc(
{ agent_name: ctx.agentName, token_type: "input" },
inputTokens
);
metrics.tokenUsage.inc(
{ agent_name: ctx.agentName, token_type: "output" },
outputTokens
);
var model = stepData.model || "unknown";
var inputCostPer1k = getCostPerThousandTokens(model, "input");
var outputCostPer1k = getCostPerThousandTokens(model, "output");
var stepCost =
(inputTokens / 1000) * inputCostPer1k +
(outputTokens / 1000) * outputCostPer1k;
metrics.costAccumulated.inc(
{ agent_name: ctx.agentName, model: model },
stepCost
);
return stepCost;
}
function getCostPerThousandTokens(model, type) {
var costs = {
"gpt-4o": { input: 0.0025, output: 0.01 },
"gpt-4o-mini": { input: 0.00015, output: 0.0006 },
"claude-3-5-sonnet": { input: 0.003, output: 0.015 },
"claude-3-haiku": { input: 0.00025, output: 0.00125 }
};
if (costs[model] && costs[model][type]) {
return costs[model][type];
}
return 0.001; // default fallback
}
module.exports = {
trackTokenUsage: trackTokenUsage,
getCostPerThousandTokens: getCostPerThousandTokens
};
When you plot token usage over time, you can spot prompt regressions immediately. A prompt change that adds 500 tokens per step, multiplied across 10 steps and thousands of runs, adds up fast.
Building Agent Dashboards
You need two types of dashboards: a Prometheus/Grafana dashboard for aggregate metrics, and a custom Express endpoint for inspecting individual runs. Here is the Express dashboard:
var express = require("express");
function createDashboardRouter(traceStore, metrics) {
var router = express.Router();
// Overview endpoint - aggregate stats
router.get("/overview", function (req, res) {
var recentRuns = traceStore.getRecentRuns(100);
var completed = recentRuns.filter(function (r) { return r.status === "completed"; });
var failed = recentRuns.filter(function (r) { return r.status === "failed"; });
var totalSteps = recentRuns.reduce(function (sum, r) {
return sum + r.stepCount;
}, 0);
var totalCost = recentRuns.reduce(function (sum, r) {
return sum + r.totalCost;
}, 0);
var avgDuration = recentRuns.length > 0
? recentRuns.reduce(function (sum, r) { return sum + r.totalDurationMs; }, 0) / recentRuns.length
: 0;
res.json({
totalRuns: recentRuns.length,
completedRuns: completed.length,
failedRuns: failed.length,
completionRate: recentRuns.length > 0
? (completed.length / recentRuns.length * 100).toFixed(1) + "%"
: "N/A",
avgStepsPerRun: recentRuns.length > 0
? (totalSteps / recentRuns.length).toFixed(1)
: 0,
avgDurationMs: Math.round(avgDuration),
totalCost: "$" + totalCost.toFixed(4),
periodStart: recentRuns.length > 0
? new Date(recentRuns[recentRuns.length - 1].steps[0].timestamp).toISOString()
: null
});
});
// Individual run detail
router.get("/runs/:runId", function (req, res) {
var trace = traceStore.getByRunId(req.params.runId);
if (!trace) {
return res.status(404).json({ error: "Run not found" });
}
res.json(trace);
});
// List recent runs with filtering
router.get("/runs", function (req, res) {
var limit = parseInt(req.query.limit) || 50;
var status = req.query.status;
var agentName = req.query.agent;
var runs = traceStore.getRecentRuns(limit);
if (status) {
runs = runs.filter(function (r) { return r.status === status; });
}
if (agentName) {
runs = runs.filter(function (r) { return r.agentName === agentName; });
}
var summaries = runs.map(function (r) {
return {
runId: r.runId,
agentName: r.agentName,
status: r.status,
stepCount: r.stepCount,
durationMs: r.totalDurationMs,
cost: "$" + r.totalCost.toFixed(4),
errorCount: r.errorCount,
startTime: new Date(r.steps[0].timestamp).toISOString()
};
});
res.json({ runs: summaries, total: summaries.length });
});
// Prometheus metrics endpoint
router.get("/metrics", function (req, res) {
var promClient = require("prom-client");
res.set("Content-Type", promClient.register.contentType);
promClient.register.metrics().then(function (data) {
res.end(data);
});
});
return router;
}
module.exports = { createDashboardRouter: createDashboardRouter };
Mount this at /agent-dashboard in your Express app and you have a live API for inspecting agent behavior. Have Prometheus scrape the /agent-dashboard/metrics endpoint and build Grafana time-series dashboards on top of it, and use the /runs endpoints for on-demand debugging.
Log Aggregation Strategies for High-Volume Agent Systems
When you are running hundreds or thousands of agent executions per hour, log volume becomes a real concern. Here are strategies that work:
Sampling — Not every successful run needs full trace logging. Log full traces for all failures and a configurable percentage of successes. Keep aggregate metrics for everything:
function shouldLogFullTrace(ctx, sampleRate) {
if (ctx.status === "failed") return true;
if (ctx.errors.length > 0) return true;
if (ctx.stepCount > 20) return true; // unusually long runs
if (ctx.totalCost > 0.50) return true; // expensive runs
return Math.random() < (sampleRate || 0.1);
}
Tiered storage — Keep recent full traces in a fast store (Redis, in-memory) for the dashboard. Archive older traces to cheaper storage (S3, database) for historical analysis. Delete traces older than your retention window.
Structured log routing — Use log levels strategically. Agent step logs at info, tool call details at debug, errors at error. In production, run at info level and flip to debug when you need to investigate a specific agent.
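Flipping the level should not require a redeploy. Here is a sketch of an admin route that changes the Winston level at runtime; the /admin/log-level path is an assumption, and you should protect it with whatever authentication your service already uses:
var express = require("express");

function addLogLevelRoute(app, logger) {
  app.post("/admin/log-level", express.json(), function (req, res) {
    var level = req.body.level;
    if (["error", "warn", "info", "debug"].indexOf(level) === -1) {
      return res.status(400).json({ error: "Unsupported level: " + level });
    }
    // Transports that do not set their own level follow the logger's level.
    logger.level = level;
    res.json({ level: level });
  });
}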
Implementing Health Checks for Agent Readiness
Your agent system should expose health check endpoints that go beyond simple liveness probes. Test that the agent can actually function:
function createHealthChecker(dependencies) {
var checks = {
llm: function (callback) {
var start = Date.now();
dependencies.llmClient.complete({
model: dependencies.healthCheckModel || "gpt-4o-mini",
messages: [{ role: "user", content: "Reply with OK" }],
max_tokens: 5
}).then(function (response) {
callback(null, {
status: "healthy",
latencyMs: Date.now() - start,
model: dependencies.healthCheckModel || "gpt-4o-mini"
});
}).catch(function (err) {
callback(null, {
status: "unhealthy",
error: err.message,
latencyMs: Date.now() - start
});
});
},
tools: function (callback) {
var results = {};
var toolNames = Object.keys(dependencies.tools || {});
var pending = toolNames.length;
if (pending === 0) {
return callback(null, { status: "healthy", tools: {} });
}
toolNames.forEach(function (name) {
var tool = dependencies.tools[name];
if (typeof tool.healthCheck === "function") {
tool.healthCheck(function (err, result) {
results[name] = err ? { status: "unhealthy", error: err.message } : result;
pending--;
if (pending === 0) {
var allHealthy = Object.keys(results).every(function (k) {
return results[k].status === "healthy";
});
callback(null, { status: allHealthy ? "healthy" : "degraded", tools: results });
}
});
} else {
results[name] = { status: "unknown", message: "No health check defined" };
pending--;
if (pending === 0) {
callback(null, { status: "degraded", tools: results });
}
}
});
},
traceStore: function (callback) {
try {
// Touch the store to confirm it is reachable.
dependencies.traceStore.getRecentRuns(1);
callback(null, { status: "healthy", reachable: true });
} catch (err) {
callback(null, { status: "unhealthy", error: err.message });
}
}
};
return {
runAll: function (callback) {
var results = {};
var checkNames = Object.keys(checks);
var pending = checkNames.length;
checkNames.forEach(function (name) {
checks[name](function (err, result) {
results[name] = result;
pending--;
if (pending === 0) {
var overallHealthy = Object.keys(results).every(function (k) {
return results[k].status === "healthy";
});
callback(null, {
status: overallHealthy ? "healthy" : "degraded",
checks: results,
timestamp: new Date().toISOString()
});
}
});
});
}
};
}
module.exports = { createHealthChecker: createHealthChecker };
Complete Working Example
Here is a complete agent observability module that ties everything together — structured logging, OpenTelemetry tracing, metrics collection, and an Express dashboard. This is a production-ready starting point:
// agent-observability.js
var crypto = require("crypto");
var winston = require("winston");
var promClient = require("prom-client");
var opentelemetry = require("@opentelemetry/api");
// ============================================================
// TRACE CONTEXT
// ============================================================
function createTraceContext(agentName, taskId, taskDescription) {
return {
runId: "run_" + crypto.randomBytes(8).toString("hex"),
agentName: agentName,
taskId: taskId || "task_" + crypto.randomBytes(4).toString("hex"),
taskDescription: taskDescription || "",
startTime: Date.now(),
stepCount: 0,
totalTokens: { input: 0, output: 0 },
totalCost: 0,
steps: [],
errors: [],
toolCalls: {},
status: "running"
};
}
function recordStep(ctx, stepType, data) {
ctx.stepCount++;
var step = {
stepNumber: ctx.stepCount,
stepType: stepType,
timestamp: Date.now(),
elapsedMs: Date.now() - ctx.startTime,
data: data || {}
};
if (data.tokens) {
ctx.totalTokens.input += data.tokens.input || 0;
ctx.totalTokens.output += data.tokens.output || 0;
}
if (data.cost) {
ctx.totalCost += data.cost;
}
if (data.toolName) {
if (!ctx.toolCalls[data.toolName]) {
ctx.toolCalls[data.toolName] = { count: 0, totalDurationMs: 0 };
}
ctx.toolCalls[data.toolName].count++;
ctx.toolCalls[data.toolName].totalDurationMs += data.durationMs || 0;
}
if (data.error) {
ctx.errors.push({
stepNumber: ctx.stepCount,
error: data.error,
timestamp: step.timestamp
});
}
ctx.steps.push(step);
return step;
}
function finalizeContext(ctx, status) {
ctx.status = status || "completed";
ctx.endTime = Date.now();
ctx.totalDurationMs = ctx.endTime - ctx.startTime;
return ctx;
}
// ============================================================
// LOGGER
// ============================================================
function createLogger(serviceName) {
return winston.createLogger({
level: process.env.LOG_LEVEL || "info",
format: winston.format.combine(
winston.format.timestamp(), // defaults to ISO 8601
winston.format.json()
),
defaultMeta: { service: serviceName || "agent-service" },
transports: [
new winston.transports.Console(),
new winston.transports.File({
filename: "logs/agent.log",
maxsize: 100 * 1024 * 1024,
maxFiles: 10
})
]
});
}
function logStep(logger, ctx, stepType, data) {
logger.info("agent_step", {
runId: ctx.runId,
stepNumber: ctx.stepCount,
stepType: stepType,
agentName: ctx.agentName,
taskId: ctx.taskId,
elapsedMs: Date.now() - ctx.startTime,
data: data
});
}
// ============================================================
// METRICS
// ============================================================
function createMetrics(prefix) {
var p = prefix || "agent";
return {
runDuration: new promClient.Histogram({
name: p + "_run_duration_seconds",
help: "Duration of agent runs",
labelNames: ["agent_name", "status"],
buckets: [1, 5, 10, 30, 60, 120, 300]
}),
stepCount: new promClient.Histogram({
name: p + "_steps_per_run",
help: "Steps per agent run",
labelNames: ["agent_name"],
buckets: [1, 3, 5, 10, 15, 20, 30, 50]
}),
tokenUsage: new promClient.Counter({
name: p + "_tokens_total",
help: "Token usage",
labelNames: ["agent_name", "token_type"]
}),
costTotal: new promClient.Counter({
name: p + "_cost_dollars",
help: "Cost in dollars",
labelNames: ["agent_name"]
}),
toolCalls: new promClient.Counter({
name: p + "_tool_calls",
help: "Tool call count",
labelNames: ["agent_name", "tool_name"]
}),
activeRuns: new promClient.Gauge({
name: p + "_active_runs",
help: "Active runs",
labelNames: ["agent_name"]
}),
errors: new promClient.Counter({
name: p + "_errors",
help: "Error count",
labelNames: ["agent_name", "error_type"]
})
};
}
function recordMetrics(metrics, ctx) {
metrics.runDuration.observe(
{ agent_name: ctx.agentName, status: ctx.status },
ctx.totalDurationMs / 1000
);
metrics.stepCount.observe(
{ agent_name: ctx.agentName },
ctx.stepCount
);
metrics.tokenUsage.inc(
{ agent_name: ctx.agentName, token_type: "input" },
ctx.totalTokens.input
);
metrics.tokenUsage.inc(
{ agent_name: ctx.agentName, token_type: "output" },
ctx.totalTokens.output
);
metrics.costTotal.inc(
{ agent_name: ctx.agentName },
ctx.totalCost
);
Object.keys(ctx.toolCalls).forEach(function (tool) {
metrics.toolCalls.inc(
{ agent_name: ctx.agentName, tool_name: tool },
ctx.toolCalls[tool].count
);
});
ctx.errors.forEach(function (err) {
metrics.errors.inc({
agent_name: ctx.agentName,
// Keep label cardinality low: use a coarse category, never the raw message.
error_type: String(err.error).split(":")[0].substring(0, 40)
});
});
}
// ============================================================
// TRACE STORE (in-memory, replace with database for production)
// ============================================================
function createTraceStore(maxSize) {
var runs = [];
var runIndex = {};
var max = maxSize || 1000;
return {
store: function (ctx) {
runs.unshift(ctx);
runIndex[ctx.runId] = ctx;
while (runs.length > max) {
var removed = runs.pop();
delete runIndex[removed.runId];
}
},
getByRunId: function (runId) {
return runIndex[runId] || null;
},
getRecentRuns: function (limit) {
return runs.slice(0, limit || 50);
}
};
}
// ============================================================
// DASHBOARD ROUTES
// ============================================================
function createDashboardRoutes(traceStore) {
var express = require("express");
var router = express.Router();
router.get("/overview", function (req, res) {
var runs = traceStore.getRecentRuns(100);
var completed = runs.filter(function (r) { return r.status === "completed"; });
var failed = runs.filter(function (r) { return r.status === "failed"; });
var totalCost = runs.reduce(function (s, r) { return s + r.totalCost; }, 0);
res.json({
totalRuns: runs.length,
completed: completed.length,
failed: failed.length,
completionRate: runs.length > 0
? (completed.length / runs.length * 100).toFixed(1) + "%"
: "N/A",
totalCost: "$" + totalCost.toFixed(4)
});
});
router.get("/runs", function (req, res) {
var limit = parseInt(req.query.limit) || 50;
var runs = traceStore.getRecentRuns(limit).map(function (r) {
return {
runId: r.runId,
agentName: r.agentName,
status: r.status,
steps: r.stepCount,
duration: r.totalDurationMs + "ms",
cost: "$" + r.totalCost.toFixed(4)
};
});
res.json({ runs: runs });
});
router.get("/runs/:runId", function (req, res) {
var trace = traceStore.getByRunId(req.params.runId);
if (!trace) return res.status(404).json({ error: "Not found" });
res.json(trace);
});
router.get("/metrics", function (req, res) {
res.set("Content-Type", promClient.register.contentType);
promClient.register.metrics().then(function (data) {
res.end(data);
});
});
return router;
}
// ============================================================
// MAIN EXPORT - OBSERVABLE AGENT WRAPPER
// ============================================================
function createObservableAgent(agentName, options) {
var opts = options || {};
var logger = opts.logger || createLogger(agentName);
var metrics = opts.metrics || createMetrics();
var traceStore = opts.traceStore || createTraceStore();
return {
logger: logger,
metrics: metrics,
traceStore: traceStore,
startRun: function (taskId, taskDescription) {
var ctx = createTraceContext(agentName, taskId, taskDescription);
metrics.activeRuns.inc({ agent_name: agentName });
logger.info("agent_run_start", {
runId: ctx.runId,
agentName: agentName,
taskId: ctx.taskId,
task: taskDescription
});
return ctx;
},
recordStep: function (ctx, stepType, data) {
var step = recordStep(ctx, stepType, data);
logStep(logger, ctx, stepType, data);
return step;
},
endRun: function (ctx, status) {
finalizeContext(ctx, status);
metrics.activeRuns.dec({ agent_name: agentName });
recordMetrics(metrics, ctx);
traceStore.store(ctx);
logger.info("agent_run_end", {
runId: ctx.runId,
status: ctx.status,
steps: ctx.stepCount,
durationMs: ctx.totalDurationMs,
cost: ctx.totalCost,
tokens: ctx.totalTokens
});
return ctx;
},
getDashboardRouter: function () {
return createDashboardRoutes(traceStore);
}
};
}
module.exports = {
createObservableAgent: createObservableAgent,
createTraceContext: createTraceContext,
createLogger: createLogger,
createMetrics: createMetrics,
createTraceStore: createTraceStore,
createDashboardRoutes: createDashboardRoutes,
recordStep: recordStep,
finalizeContext: finalizeContext
};
Usage in your application:
var express = require("express");
var observability = require("./agent-observability");
var app = express();
var agent = observability.createObservableAgent("research-agent");
// Mount the dashboard
app.use("/agent-dashboard", agent.getDashboardRouter());
// Example agent run (performWebSearch and synthesizeAnswer stand in for your own logic)
app.post("/api/research", function (req, res) {
var ctx = agent.startRun(req.body.taskId, req.body.question);
try {
// Step 1: LLM reasoning
agent.recordStep(ctx, "reasoning", {
prompt: "Analyze the question: " + req.body.question,
tokens: { input: 150, output: 80 },
cost: 0.0012
});
// Step 2: Tool call
var toolStart = Date.now();
var searchResults = performWebSearch(req.body.question);
agent.recordStep(ctx, "tool_call", {
toolName: "web_search",
toolInput: { query: req.body.question },
toolOutput: searchResults,
durationMs: Date.now() - toolStart
});
// Step 3: Final LLM synthesis
var answer = synthesizeAnswer(req.body.question, searchResults);
agent.recordStep(ctx, "llm_call", {
tokens: { input: 800, output: 400 },
cost: 0.0058,
response: answer
});
agent.endRun(ctx, "completed");
res.json({ answer: answer, runId: ctx.runId });
} catch (err) {
agent.recordStep(ctx, "error", { error: err.message, errorStack: err.stack });
agent.endRun(ctx, "failed");
res.status(500).json({ error: err.message, runId: ctx.runId });
}
});
app.listen(3000, function () {
console.log("Agent service running on port 3000");
console.log("Dashboard at http://localhost:3000/agent-dashboard/overview");
});
Common Issues and Troubleshooting
1. OpenTelemetry spans not appearing in Jaeger
Error: connect ECONNREFUSED 127.0.0.1:4318
at TCPConnectWrap.afterConnect [as oncomplete]
This means the OTLP exporter cannot reach the collector. Verify Jaeger is running with OTLP HTTP ingestion enabled. If using Docker: docker run -d -p 4318:4318 -p 16686:16686 jaegertracing/all-in-one:latest. The port 4318 is the OTLP HTTP receiver, and 16686 is the Jaeger UI.
2. Prometheus metric name collisions
Error: A metric with the name agent_run_duration_seconds has already been registered.
This happens when you create metrics more than once — typically when your module is require()-ed from multiple files. Use a singleton pattern: check promClient.register.getSingleMetric(name) before creating a new metric, or call promClient.register.clear() in tests.
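A sketch of that guard, using the registry lookup mentioned above:
var promClient = require("prom-client");

// Reuse an already-registered metric instead of creating a duplicate.
function getOrCreateCounter(name, help, labelNames) {
  var existing = promClient.register.getSingleMetric(name);
  if (existing) return existing;
  return new promClient.Counter({ name: name, help: help, labelNames: labelNames });
}

var toolCalls = getOrCreateCounter(
  "agent_tool_calls_total",
  "Total number of tool calls made by agents",
  ["agent_name", "tool_name", "status"]
);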
3. Memory leak from unbounded trace storage
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
The in-memory trace store grows without limit if you forget to set a max size. Always cap it. In production, move traces to an external store (PostgreSQL, Elasticsearch) and keep only the last N minutes in memory. Set --max-old-space-size=4096 as a safety net, but fix the root cause.
4. Winston transport errors crashing the process
Error: ENOENT: no such file or directory, open 'logs/agent.log'
Winston does not create directories for you. Ensure the logs/ directory exists before starting the application, or add a startup check:
var fs = require("fs");
var path = require("path");
var logDir = path.join(__dirname, "logs");
if (!fs.existsSync(logDir)) {
fs.mkdirSync(logDir, { recursive: true });
}
5. High cardinality labels blowing up Prometheus
Error: Too many time series. Consider reducing the number of label values.
Never use unbounded values (like runId or full error messages) as Prometheus label values. Use broad categories like error_type: "llm_timeout" rather than the full error string. If you need per-run detail, put it in logs, not metrics.
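One way to do that is a small classifier that collapses raw errors into a handful of label-safe categories before they reach Prometheus. The category names below are examples; pick ones that match your own failure modes:
// Collapse raw error messages into a small, bounded set of label values.
function classifyError(err) {
  var message = (err && err.message ? err.message : String(err)).toLowerCase();
  if (message.indexOf("timeout") !== -1) return "timeout";
  if (message.indexOf("rate limit") !== -1 || message.indexOf("429") !== -1) return "rate_limited";
  if (message.indexOf("econnrefused") !== -1 || message.indexOf("enotfound") !== -1) return "connection";
  if (message.indexOf("context length") !== -1) return "context_overflow";
  return "unknown";
}

// metrics.errorsByType.inc({ agent_name: ctx.agentName, error_type: classifyError(err) });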
Best Practices
Always generate a correlation ID at the start of every agent run. Thread it through every log entry, span, and metric label. Without this, debugging a single run across thousands of log lines is nearly impossible.
Log the full prompt and response at debug level, not info level. In production at info level, log a summary (model name, token count, latency). Flip to debug only when actively investigating. Full prompts in info logs will overwhelm your log storage.
Set hard limits on agent step count and cost per run. A runaway agent will burn through your LLM budget in minutes. Kill the run if it exceeds your thresholds, and alert immediately. This is not optional — it is a production safety requirement.
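A sketch of that guard, built on the trace context fields from earlier; the thresholds are examples to tune for your own workloads:
// Hard limits checked before every step; a breach aborts the run.
var RUN_LIMITS = { maxSteps: 30, maxCostDollars: 1.0, maxDurationMs: 5 * 60 * 1000 };

function enforceRunLimits(ctx) {
  if (ctx.stepCount >= RUN_LIMITS.maxSteps) {
    throw new Error("Run aborted: step limit exceeded (" + ctx.stepCount + " steps)");
  }
  if (ctx.totalCost >= RUN_LIMITS.maxCostDollars) {
    throw new Error("Run aborted: cost limit exceeded ($" + ctx.totalCost.toFixed(4) + ")");
  }
  if (Date.now() - ctx.startTime >= RUN_LIMITS.maxDurationMs) {
    throw new Error("Run aborted: duration limit exceeded");
  }
}

// Call enforceRunLimits(ctx) at the top of each loop iteration, catch the error,
// end the run as "failed", and let the anomaly detector raise the alert.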
Record tool call duration separately from LLM call duration. They have completely different performance characteristics and failure modes. An agent that is slow because of a database tool is a different problem than an agent that is slow because the LLM is taking 15 seconds per response.
Store finalized trace contexts for at least 30 days. Users will report problems days after they happened. If you have already deleted the trace, you are debugging in the dark. Compress and archive to cheap storage if cost is a concern.
Use structured log formats from day one. Converting from unstructured console.log() output to structured JSON logging after you already have a system in production is painful. Start with Winston or Pino and JSON format from the first line of code.
Build your dashboard endpoints early, even if the UI comes later. JSON API endpoints for agent run inspection cost almost nothing to implement and save hours of debugging. You can always add a proper Grafana dashboard later, but having the raw data available via HTTP is invaluable from day one.
Sample traces in production, but never sample errors. Log 100% of failed runs with full trace detail. Sample successful runs at 10-20%. This gives you complete error visibility without drowning in log volume.
Test your observability in staging before production incidents. Intentionally trigger agent failures, cost spikes, and stuck agents in staging. Verify your alerts fire, your dashboards update, and your trace replay works. The worst time to discover your monitoring is broken is during a production incident.