
LLM API Monitoring and Cost Tracking

Build comprehensive monitoring and cost tracking for LLM APIs with token tracking, dashboards, alerts, and budget management in Node.js.

Overview

Running LLM APIs in production without proper monitoring is like driving at night with the headlights off. You will crash into a surprise invoice, miss degraded response quality, or burn through your budget before anyone notices. This article walks through building a comprehensive monitoring and cost tracking system for LLM API calls in Node.js, covering everything from per-request token logging to budget alerts that automatically shut off expensive features before they bankrupt your project.

Prerequisites

  • Node.js v18 or later installed
  • Basic familiarity with Express.js and middleware patterns
  • A PostgreSQL database for storing metrics (any version 13+)
  • An active API key for at least one LLM provider (OpenAI, Anthropic, etc.)
  • Working knowledge of HTTP APIs and JSON

What to Monitor in LLM Applications

Before writing any code, you need a clear picture of what matters. LLM APIs are unlike traditional REST services. A single request can cost anywhere from fractions of a cent to several dollars, latency varies wildly depending on output length, and the quality of responses is subjective and hard to quantify. Here is what you should be tracking from day one.

Token Usage

Every LLM provider charges by token count. You need to capture input tokens, output tokens, and total tokens for every single request. This is your primary cost driver and the foundation of everything else. Do not rely on estimates. Use the actual token counts returned in the API response.

Latency

LLM calls are slow compared to traditional APIs. A database query takes 5-50ms. An LLM call takes 500ms to 30 seconds. You need to track not just averages but percentiles: p50 (median), p95, and p99. A p95 of 8 seconds means one in every twenty requests takes at least that long. That matters.

Error Rates

Rate limits, timeouts, malformed responses, content filter rejections, and outright API failures. Each category tells you something different. Rate limit errors mean you are pushing too hard. Timeouts mean you need to tune your parameters or switch models. Content filter rejections might indicate a prompt injection issue.

Cost Per Request

Multiply token counts by the per-token price for the model you are using. This sounds simple, but prices differ between input and output tokens, vary by model, and change when providers update their pricing. You need a configurable pricing table that is easy to update.

Per-Feature and Per-User Attribution

Knowing your total spend is useful. Knowing that your summarization feature costs three times more than your search feature is actionable. Tagging every request with a feature identifier and user identifier lets you answer the questions that actually drive decisions.

Quality Metrics

The hardest thing to measure, but arguably the most important. User feedback (thumbs up/down), automated output scoring (length, format compliance, relevance heuristics), and retry rates all give you signal about whether the LLM is actually doing its job.

Building a Middleware-Based Logging Layer

The cleanest architecture wraps every LLM call through a single middleware layer. Every call goes through the same function, which handles timing, token extraction, cost calculation, error capture, and storage. No developer on your team should ever call the LLM provider directly.

var EventEmitter = require("events");

var llmEvents = new EventEmitter();

function createLLMClient(options) {
  var provider = options.provider || "openai";
  var defaultModel = options.model || "gpt-4o";
  var apiKey = options.apiKey;
  var featureTag = options.feature || "unknown";
  var userTag = options.user || "anonymous";

  function callLLM(params, callback) {
    var startTime = Date.now();
    var model = params.model || defaultModel;
    var messages = params.messages;
    var maxTokens = params.maxTokens || 4096;
    var feature = params.feature || featureTag;
    var user = params.user || userTag;

    var requestMeta = {
      provider: provider,
      model: model,
      feature: feature,
      user: user,
      timestamp: new Date().toISOString(),
      inputMessages: messages.length
    };

    llmEvents.emit("request:start", requestMeta);

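    // makeProviderCall stands in for the provider-specific HTTP request; the
    // complete example at the end of this article implements it with the https
    // module. extractUsage, calculateCost, and classifyError are defined in the
    // sections below.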
    makeProviderCall(provider, apiKey, {
      model: model,
      messages: messages,
      max_tokens: maxTokens
    }, function(err, response) {
      var duration = Date.now() - startTime;

      if (err) {
        var errorMeta = Object.assign({}, requestMeta, {
          duration: duration,
          error: err.message,
          errorType: classifyError(err),
          statusCode: err.statusCode || null
        });
        llmEvents.emit("request:error", errorMeta);
        return callback(err);
      }

      var usage = extractUsage(provider, response);
      var cost = calculateCost(provider, model, usage);

      var successMeta = Object.assign({}, requestMeta, {
        duration: duration,
        inputTokens: usage.inputTokens,
        outputTokens: usage.outputTokens,
        totalTokens: usage.totalTokens,
        cost: cost,
        finishReason: (response.choices && response.choices[0] && response.choices[0].finish_reason) ||
          response.stop_reason || "unknown"
      });

      llmEvents.emit("request:success", successMeta);
      callback(null, response, successMeta);
    });
  }

  return {
    call: callLLM,
    events: llmEvents
  };
}

The EventEmitter pattern is deliberate. It decouples the monitoring logic from the calling logic. Your metrics storage, alerting, and logging all subscribe to events independently. If your metrics database goes down, your LLM calls still work.
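
For example, a console logger and a persistence listener can subscribe side by side without knowing about each other. A minimal sketch; storeMetric here is a hypothetical placeholder for whatever persistence layer you use:

var client = createLLMClient({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY,
  feature: "summarization"
});

// Lightweight console logging for local development.
client.events.on("request:success", function(meta) {
  console.log("[llm]", meta.model, meta.totalTokens + " tokens, $" + meta.cost + ", " + meta.duration + "ms");
});

// Independent persistence listener; a failure here never reaches the caller.
client.events.on("request:success", function(meta) {
  storeMetric(meta, function(err) {
    if (err) console.error("metric write failed:", err.message);
  });
});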

Tracking Input and Output Tokens

Every major provider returns token counts in the response body, but the structure differs. You need an extraction function that normalizes this.

function extractUsage(provider, response) {
  var usage = { inputTokens: 0, outputTokens: 0, totalTokens: 0 };

  if (provider === "openai") {
    if (response.usage) {
      usage.inputTokens = response.usage.prompt_tokens || 0;
      usage.outputTokens = response.usage.completion_tokens || 0;
      usage.totalTokens = response.usage.total_tokens || 0;
    }
  } else if (provider === "anthropic") {
    if (response.usage) {
      usage.inputTokens = response.usage.input_tokens || 0;
      usage.outputTokens = response.usage.output_tokens || 0;
      usage.totalTokens = usage.inputTokens + usage.outputTokens;
    }
  } else if (provider === "google") {
    if (response.usageMetadata) {
      usage.inputTokens = response.usageMetadata.promptTokenCount || 0;
      usage.outputTokens = response.usageMetadata.candidatesTokenCount || 0;
      usage.totalTokens = response.usageMetadata.totalTokenCount || 0;
    }
  }

  return usage;
}

Store these raw numbers. Do not try to aggregate them before writing to the database. You will always need the per-request granularity later for debugging, auditing, and anomaly detection.

Calculating Real-Time Costs by Model and Provider

Prices change. Models get deprecated. New models launch. Your pricing table needs to be a simple data structure that you can update without redeploying.

var PRICING = {
  openai: {
    "gpt-4o": { input: 0.0025, output: 0.01 },
    "gpt-4o-mini": { input: 0.00015, output: 0.0006 },
    "gpt-4-turbo": { input: 0.01, output: 0.03 },
    "o1": { input: 0.015, output: 0.06 },
    "o3-mini": { input: 0.00115, output: 0.0044 }
  },
  anthropic: {
    "claude-sonnet-4-20250514": { input: 0.003, output: 0.015 },
    "claude-haiku-35-20241022": { input: 0.0008, output: 0.004 },
    "claude-opus-4-20250514": { input: 0.015, output: 0.075 }
  },
  google: {
    "gemini-2.0-flash": { input: 0.0001, output: 0.0004 },
    "gemini-2.0-pro": { input: 0.00125, output: 0.005 }
  }
};

function calculateCost(provider, model, usage) {
  var providerPricing = PRICING[provider];
  if (!providerPricing) return 0;

  var modelPricing = providerPricing[model];
  if (!modelPricing) return 0;

  var inputCost = (usage.inputTokens / 1000) * modelPricing.input;
  var outputCost = (usage.outputTokens / 1000) * modelPricing.output;

  return Math.round((inputCost + outputCost) * 1000000) / 1000000;
}

Rounding to six decimal places keeps stored costs at micro-dollar precision and limits floating point drift. Left unchecked, those tiny errors accumulate at scale into real discrepancies between your tracking and your provider's invoice.
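
As a quick sanity check, using the gpt-4o-mini rates from the table above:

// 1,200 input tokens and 300 output tokens on gpt-4o-mini:
// (1200 / 1000) * 0.00015 + (300 / 1000) * 0.0006 = 0.00018 + 0.00018
var example = calculateCost("openai", "gpt-4o-mini", {
  inputTokens: 1200,
  outputTokens: 300,
  totalTokens: 1500
});
console.log(example); // 0.00036 dollars, roughly 0.04 cents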

Per-Feature and Per-User Cost Attribution

This is where monitoring becomes genuinely useful for product decisions. Tag every request and aggregate by feature and user.

function getFeatureCostReport(db, startDate, endDate, callback) {
  var query = [
    "SELECT feature,",
    "  COUNT(*) as request_count,",
    "  SUM(input_tokens) as total_input_tokens,",
    "  SUM(output_tokens) as total_output_tokens,",
    "  SUM(cost) as total_cost,",
    "  AVG(cost) as avg_cost_per_request,",
    "  AVG(duration) as avg_duration_ms",
    "FROM llm_metrics",
    "WHERE timestamp >= $1 AND timestamp < $2",
    "  AND error IS NULL",
    "GROUP BY feature",
    "ORDER BY total_cost DESC"
  ].join("\n");

  db.query(query, [startDate, endDate], callback);
}

function getUserCostReport(db, startDate, endDate, limit, callback) {
  var query = [
    "SELECT user_id,",
    "  COUNT(*) as request_count,",
    "  SUM(cost) as total_cost,",
    "  MAX(cost) as max_single_request_cost",
    "FROM llm_metrics",
    "WHERE timestamp >= $1 AND timestamp < $2",
    "  AND error IS NULL",
    "GROUP BY user_id",
    "ORDER BY total_cost DESC",
    "LIMIT $3"
  ].join("\n");

  db.query(query, [startDate, endDate, limit || 50], callback);
}

With this data you can answer questions like: "Is the AI summarization feature worth its cost?" or "Is one user responsible for 40% of our LLM spend?" Both of those are real scenarios I have encountered in production.

Latency Percentiles

Averages lie. If your average latency is 2 seconds but your p99 is 25 seconds, you have a problem that the average hides. PostgreSQL makes percentile calculations straightforward.

function getLatencyPercentiles(db, feature, hours, callback) {
  var query = [
    "SELECT",
    "  feature,",
    "  model,",
    "  COUNT(*) as request_count,",
    "  ROUND(AVG(duration)) as avg_ms,",
    "  ROUND(PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY duration)) as p50_ms,",
    "  ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration)) as p95_ms,",
    "  ROUND(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration)) as p99_ms,",
    "  ROUND(MIN(duration)) as min_ms,",
    "  ROUND(MAX(duration)) as max_ms",
    "FROM llm_metrics",
    "WHERE timestamp >= NOW() - ($1 || ' hours')::INTERVAL",
    "  AND error IS NULL",
    "  AND ($2::TEXT IS NULL OR feature = $2)",
    "GROUP BY feature, model",
    "ORDER BY feature, model"
  ].join("\n");

  db.query(query, [hours || 24, feature || null], callback);
}

In my experience, LLM p95 latency is usually 3-5x the p50. If your p95 is more than 8x your p50, something is wrong — either your prompts vary wildly in complexity or the provider is having intermittent issues.

Error Rate Monitoring and Alerting Thresholds

Not all errors are equal. Rate limits are a capacity issue. Timeouts might be a prompt issue. Authentication failures are a configuration emergency. Classify errors and set different alert thresholds for each.

function classifyError(err) {
  var status = err.statusCode || err.status || 0;
  var message = (err.message || "").toLowerCase();

  if (status === 429) return "rate_limit";
  if (status === 401 || status === 403) return "auth_error";
  if (status === 400) return "bad_request";
  if (status === 500 || status === 502 || status === 503) return "provider_error";
  if (message.indexOf("timeout") !== -1) return "timeout";
  if (message.indexOf("econnrefused") !== -1) return "connection_error";
  if (message.indexOf("content_filter") !== -1) return "content_filter";
  return "unknown";
}

var ALERT_THRESHOLDS = {
  rate_limit: { windowMinutes: 5, maxCount: 10, severity: "warning" },
  auth_error: { windowMinutes: 1, maxCount: 1, severity: "critical" },
  provider_error: { windowMinutes: 10, maxCount: 5, severity: "warning" },
  timeout: { windowMinutes: 15, maxCount: 20, severity: "warning" },
  content_filter: { windowMinutes: 60, maxCount: 50, severity: "info" }
};

function checkAlertThresholds(db, callback) {
  var checks = Object.keys(ALERT_THRESHOLDS);
  var alerts = [];
  var pending = checks.length;

  checks.forEach(function(errorType) {
    var threshold = ALERT_THRESHOLDS[errorType];
    var query = [
      "SELECT COUNT(*) as error_count",
      "FROM llm_metrics",
      "WHERE error_type = $1",
      "  AND timestamp >= NOW() - ($2 || ' minutes')::INTERVAL"
    ].join("\n");

    db.query(query, [errorType, threshold.windowMinutes], function(err, result) {
      if (!err && result.rows[0]) {
        var count = parseInt(result.rows[0].error_count, 10);
        if (count >= threshold.maxCount) {
          alerts.push({
            errorType: errorType,
            count: count,
            threshold: threshold.maxCount,
            window: threshold.windowMinutes + " minutes",
            severity: threshold.severity
          });
        }
      }
      pending--;
      if (pending === 0) callback(null, alerts);
    });
  });
}

For critical alerts like auth errors, you want immediate notification. For rate limits, a five-minute window prevents alert fatigue. Tune these thresholds based on your traffic patterns.
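
One way to run these checks is on a timer, forwarding anything that trips a threshold to your notification channel. A minimal sketch, assuming a sendAlert helper (Slack, PagerDuty, email) exists elsewhere in your codebase:

var ALERT_CHECK_INTERVAL_MS = 60 * 1000;

setInterval(function() {
  checkAlertThresholds(db, function(err, alerts) {
    if (err) return console.error("alert check failed:", err.message);

    alerts.forEach(function(alert) {
      // sendAlert is assumed; route by severity so critical pages someone
      // and warning just posts to chat.
      sendAlert(alert.severity, "LLM " + alert.errorType + ": " + alert.count +
        " errors in " + alert.window + " (threshold " + alert.threshold + ")");
    });
  });
}, ALERT_CHECK_INTERVAL_MS);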

Quality Metrics

Quality is the metric everyone skips and then regrets skipping. At minimum, capture user feedback and basic output validation.

function recordFeedback(db, requestId, feedback, callback) {
  var query = [
    "UPDATE llm_metrics",
    "SET user_feedback = $2,",
    "  feedback_timestamp = NOW()",
    "WHERE request_id = $1",
    "RETURNING request_id"
  ].join("\n");

  db.query(query, [requestId, feedback], function(err, result) {
    if (err) return callback(err);
    if (result.rowCount === 0) {
      return callback(new Error("Request not found: " + requestId));
    }
    callback(null, { recorded: true });
  });
}

function getQualityReport(db, feature, days, callback) {
  var query = [
    "SELECT",
    "  feature,",
    "  COUNT(*) FILTER (WHERE user_feedback = 'positive') as positive_count,",
    "  COUNT(*) FILTER (WHERE user_feedback = 'negative') as negative_count,",
    "  COUNT(*) FILTER (WHERE user_feedback IS NOT NULL) as total_feedback,",
    "  ROUND(",
    "    100.0 * COUNT(*) FILTER (WHERE user_feedback = 'positive') /",
    "    NULLIF(COUNT(*) FILTER (WHERE user_feedback IS NOT NULL), 0),",
    "    1",
    "  ) as approval_rate",
    "FROM llm_metrics",
    "WHERE timestamp >= NOW() - ($1 || ' days')::INTERVAL",
    "  AND ($2::TEXT IS NULL OR feature = $2)",
    "GROUP BY feature",
    "ORDER BY total_feedback DESC"
  ].join("\n");

  db.query(query, [days || 30, feature || null], callback);
}

An approval rate below 80% is a red flag. Anything below 60% means the feature is actively making users' lives worse and you should rethink your prompts or model choice.
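
Wiring recordFeedback up to an endpoint is straightforward. A minimal sketch, assuming an Express router with JSON body parsing and a pg pool attached as req.db, and assuming the requestId from the success metadata is returned to the client alongside the LLM output (as the complete example at the end of this article does):

router.post("/feedback/:requestId", function(req, res) {
  var feedback = req.body.feedback; // expected: "positive" or "negative"
  if (feedback !== "positive" && feedback !== "negative") {
    return res.status(400).json({ error: "feedback must be 'positive' or 'negative'" });
  }

  recordFeedback(req.db, req.params.requestId, feedback, function(err) {
    if (err) return res.status(404).json({ error: err.message });
    res.json({ recorded: true });
  });
});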

Building Dashboards

You have two practical options: expose JSON endpoints from your Express app for a custom frontend, or export metrics to Grafana. Here is the Express endpoint approach, which works well for internal tools.

var express = require("express");
var router = express.Router();
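// These handlers assume earlier middleware attached the pg pool as req.db,
// e.g. app.use(function(req, res, next) { req.db = pool; next(); });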

router.get("/dashboard/summary", function(req, res) {
  var hours = parseInt(req.query.hours, 10) || 24;

  var query = [
    "SELECT",
    "  COUNT(*) as total_requests,",
    "  COUNT(*) FILTER (WHERE error IS NOT NULL) as total_errors,",
    "  ROUND(100.0 * COUNT(*) FILTER (WHERE error IS NOT NULL) / COUNT(*), 2) as error_rate,",
    "  SUM(cost) as total_cost,",
    "  SUM(input_tokens) as total_input_tokens,",
    "  SUM(output_tokens) as total_output_tokens,",
    "  ROUND(AVG(duration)) as avg_latency_ms,",
    "  COUNT(DISTINCT user_id) as unique_users,",
    "  COUNT(DISTINCT feature) as active_features",
    "FROM llm_metrics",
    "WHERE timestamp >= NOW() - ($1 || ' hours')::INTERVAL"
  ].join("\n");

  req.db.query(query, [hours], function(err, result) {
    if (err) return res.status(500).json({ error: err.message });
    res.json({
      period: hours + " hours",
      generated: new Date().toISOString(),
      metrics: result.rows[0]
    });
  });
});

router.get("/dashboard/hourly", function(req, res) {
  var hours = parseInt(req.query.hours, 10) || 24;

  var query = [
    "SELECT",
    "  DATE_TRUNC('hour', timestamp) as hour,",
    "  COUNT(*) as requests,",
    "  SUM(cost) as cost,",
    "  ROUND(AVG(duration)) as avg_latency,",
    "  COUNT(*) FILTER (WHERE error IS NOT NULL) as errors",
    "FROM llm_metrics",
    "WHERE timestamp >= NOW() - ($1 || ' hours')::INTERVAL",
    "GROUP BY DATE_TRUNC('hour', timestamp)",
    "ORDER BY hour"
  ].join("\n");

  req.db.query(query, [hours], function(err, result) {
    if (err) return res.status(500).json({ error: err.message });
    res.json({ hourly: result.rows });
  });
});

For Grafana integration, the PostgreSQL data source connects directly to your metrics table. Create panels for cost over time, latency percentiles, error rates by type, and top features by spend. Grafana's alerting can then replace or supplement your application-level alerts.

Cost Anomaly Detection

A spike in LLM cost usually means a bug — a loop that keeps calling the API, a prompt that accidentally includes a huge document, or a feature that went viral without anyone planning for the cost. Detect these automatically.

function detectCostAnomalies(db, callback) {
  var query = [
    "WITH hourly_costs AS (",
    "  SELECT",
    "    DATE_TRUNC('hour', timestamp) as hour,",
    "    feature,",
    "    SUM(cost) as hourly_cost",
    "  FROM llm_metrics",
    "  WHERE timestamp >= NOW() - INTERVAL '7 days'",
    "  GROUP BY DATE_TRUNC('hour', timestamp), feature",
    "),",
    "feature_stats AS (",
    "  SELECT",
    "    feature,",
    "    AVG(hourly_cost) as avg_hourly_cost,",
    "    STDDEV(hourly_cost) as stddev_hourly_cost",
    "  FROM hourly_costs",
    "  WHERE hour < NOW() - INTERVAL '1 hour'",
    "  GROUP BY feature",
    ")",
    "SELECT",
    "  h.feature,",
    "  h.hourly_cost as current_cost,",
    "  s.avg_hourly_cost as typical_cost,",
    "  ROUND((h.hourly_cost - s.avg_hourly_cost) / NULLIF(s.stddev_hourly_cost, 0), 2) as z_score",
    "FROM hourly_costs h",
    "JOIN feature_stats s ON h.feature = s.feature",
    "WHERE h.hour = DATE_TRUNC('hour', NOW())",
    "  AND h.hourly_cost > s.avg_hourly_cost + 3 * s.stddev_hourly_cost",
    "ORDER BY z_score DESC"
  ].join("\n");

  db.query(query, function(err, result) {
    if (err) return callback(err);
    callback(null, result.rows);
  });
}

A z-score above 3 means the current hour's cost is more than three standard deviations above the weekly average for that feature. That is almost always worth investigating. In practice, I set alerts at a z-score of 2.5 for warnings and 3.5 for critical alerts, which means lowering the 3 * stddev filter in the query above to 2.5 and classifying severity from the returned z_score.

Budget Alerts and Automatic Circuit Breaking

The ultimate safety net. When spending exceeds a threshold, stop making LLM calls for non-critical features automatically. This requires an in-memory budget tracker that gets refreshed periodically from the database.

var budgetState = {
  daily: { limit: 100, spent: 0, lastRefresh: null },
  monthly: { limit: 2000, spent: 0, lastRefresh: null },
  circuitOpen: false
};

function refreshBudgetState(db, callback) {
  var dailyQuery = [
    "SELECT COALESCE(SUM(cost), 0) as total",
    "FROM llm_metrics",
    "WHERE timestamp >= CURRENT_DATE"
  ].join("\n");

  var monthlyQuery = [
    "SELECT COALESCE(SUM(cost), 0) as total",
    "FROM llm_metrics",
    "WHERE timestamp >= DATE_TRUNC('month', CURRENT_DATE)"
  ].join("\n");

  db.query(dailyQuery, function(err, dailyResult) {
    if (err) return callback(err);

    db.query(monthlyQuery, function(err2, monthlyResult) {
      if (err2) return callback(err2);

      budgetState.daily.spent = parseFloat(dailyResult.rows[0].total);
      budgetState.monthly.spent = parseFloat(monthlyResult.rows[0].total);
      budgetState.daily.lastRefresh = new Date();
      budgetState.monthly.lastRefresh = new Date();

      var dailyPct = budgetState.daily.spent / budgetState.daily.limit;
      var monthlyPct = budgetState.monthly.spent / budgetState.monthly.limit;

      budgetState.circuitOpen = dailyPct >= 1.0 || monthlyPct >= 1.0;

      if (dailyPct >= 0.8) {
        llmEvents.emit("budget:warning", {
          type: "daily",
          spent: budgetState.daily.spent,
          limit: budgetState.daily.limit,
          percentage: Math.round(dailyPct * 100)
        });
      }

      if (monthlyPct >= 0.8) {
        llmEvents.emit("budget:warning", {
          type: "monthly",
          spent: budgetState.monthly.spent,
          limit: budgetState.monthly.limit,
          percentage: Math.round(monthlyPct * 100)
        });
      }

      callback(null, budgetState);
    });
  });
}

function budgetGuard(feature, callback) {
  if (!budgetState.circuitOpen) {
    return callback(null, true);
  }

  var criticalFeatures = ["auth", "safety", "moderation"];
  if (criticalFeatures.indexOf(feature) !== -1) {
    return callback(null, true);
  }

  callback(new Error("Budget exceeded. LLM calls disabled for non-critical feature: " + feature));
}

Refresh the budget state every 60 seconds. The circuit breaker allows critical features like content moderation to keep running even when the budget is blown, but shuts off nice-to-have features like AI-generated summaries or suggestions.
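
A minimal sketch of the wiring: refresh on an interval and run budgetGuard before every call. The monitoredCall wrapper from the complete example at the end of this article stands in for whatever your wrapped call function is:

setInterval(function() {
  refreshBudgetState(db, function(err) {
    if (err) console.error("budget refresh failed:", err.message);
  });
}, 60 * 1000);

function guardedCall(options, callback) {
  budgetGuard(options.feature, function(err) {
    if (err) return callback(err); // budget exceeded and feature is non-critical
    monitoredCall(options, callback);
  });
}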

Exporting Metrics to External Platforms

For teams already using Datadog, Prometheus, or CloudWatch, you can push metrics from the event emitter directly.

Prometheus Export

var promClient = require("prom-client");

var llmRequestDuration = new promClient.Histogram({
  name: "llm_request_duration_ms",
  help: "LLM API request duration in milliseconds",
  labelNames: ["provider", "model", "feature"],
  buckets: [100, 500, 1000, 2000, 5000, 10000, 20000, 30000]
});

var llmRequestCost = new promClient.Counter({
  name: "llm_request_cost_dollars",
  help: "Cumulative LLM API cost in dollars",
  labelNames: ["provider", "model", "feature"]
});

var llmTokensUsed = new promClient.Counter({
  name: "llm_tokens_used_total",
  help: "Total tokens consumed",
  labelNames: ["provider", "model", "feature", "direction"]
});

var llmErrors = new promClient.Counter({
  name: "llm_errors_total",
  help: "Total LLM API errors",
  labelNames: ["provider", "model", "error_type"]
});

llmEvents.on("request:success", function(meta) {
  var labels = { provider: meta.provider, model: meta.model, feature: meta.feature };
  llmRequestDuration.observe(labels, meta.duration);
  llmRequestCost.inc(labels, meta.cost);
  llmTokensUsed.inc(Object.assign({}, labels, { direction: "input" }), meta.inputTokens);
  llmTokensUsed.inc(Object.assign({}, labels, { direction: "output" }), meta.outputTokens);
});

llmEvents.on("request:error", function(meta) {
  llmErrors.inc({
    provider: meta.provider,
    model: meta.model,
    error_type: meta.errorType
  });
});

Datadog via StatsD

var StatsD = require("hot-shots");
var dogstatsd = new StatsD({ host: "localhost", port: 8125, prefix: "llm." });

llmEvents.on("request:success", function(meta) {
  var tags = [
    "provider:" + meta.provider,
    "model:" + meta.model,
    "feature:" + meta.feature
  ];
  dogstatsd.histogram("request.duration", meta.duration, tags);
  dogstatsd.increment("request.count", 1, tags);
  dogstatsd.increment("tokens.input", meta.inputTokens, tags);
  dogstatsd.increment("tokens.output", meta.outputTokens, tags);
  dogstatsd.histogram("request.cost", meta.cost, tags);
});

llmEvents.on("request:error", function(meta) {
  var tags = [
    "provider:" + meta.provider,
    "error_type:" + meta.errorType
  ];
  dogstatsd.increment("error.count", 1, tags);
});

The event-based architecture pays off here. You can add or remove metric exporters without touching any of the core LLM calling logic.

Monthly Cost Reporting and Forecasting

At the end of each month, generate a report. More usefully, project forward based on current trends.

function generateMonthlyReport(db, year, month, callback) {
  var query = [
    "SELECT",
    "  provider,",
    "  model,",
    "  feature,",
    "  COUNT(*) as requests,",
    "  SUM(input_tokens) as input_tokens,",
    "  SUM(output_tokens) as output_tokens,",
    "  ROUND(SUM(cost)::numeric, 4) as total_cost,",
    "  ROUND(AVG(cost)::numeric, 6) as avg_cost_per_request,",
    "  ROUND(AVG(duration)::numeric, 0) as avg_duration_ms",
    "FROM llm_metrics",
    "WHERE EXTRACT(YEAR FROM timestamp) = $1",
    "  AND EXTRACT(MONTH FROM timestamp) = $2",
    "GROUP BY provider, model, feature",
    "ORDER BY total_cost DESC"
  ].join("\n");

  db.query(query, [year, month], function(err, result) {
    if (err) return callback(err);

    var totalCost = 0;
    var totalRequests = 0;
    result.rows.forEach(function(row) {
      totalCost += parseFloat(row.total_cost);
      totalRequests += parseInt(row.requests, 10);
    });

    callback(null, {
      period: year + "-" + String(month).padStart(2, "0"),
      totalCost: Math.round(totalCost * 100) / 100,
      totalRequests: totalRequests,
      breakdown: result.rows
    });
  });
}

function forecastMonthlyCost(db, callback) {
  var now = new Date();
  var dayOfMonth = now.getDate();
  var daysInMonth = new Date(now.getFullYear(), now.getMonth() + 1, 0).getDate();

  var query = [
    "SELECT COALESCE(SUM(cost), 0) as month_to_date",
    "FROM llm_metrics",
    "WHERE timestamp >= DATE_TRUNC('month', CURRENT_DATE)"
  ].join("\n");

  db.query(query, function(err, result) {
    if (err) return callback(err);

    var spent = parseFloat(result.rows[0].month_to_date);
    var dailyRate = spent / dayOfMonth;
    var projected = dailyRate * daysInMonth;

    callback(null, {
      monthToDate: Math.round(spent * 100) / 100,
      dailyRate: Math.round(dailyRate * 100) / 100,
      projectedMonthEnd: Math.round(projected * 100) / 100,
      daysRemaining: daysInMonth - dayOfMonth
    });
  });
}

Run forecasting daily. If the projected month-end cost exceeds your budget by more than 10%, send an alert. This gives you days or weeks to react instead of getting surprised at the invoice.
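
A simple daily check might look like this. MONTHLY_BUDGET is an assumed constant; the check reuses the budget:warning event so existing alert listeners pick it up:

var MONTHLY_BUDGET = 2000; // dollars; set to your real budget

setInterval(function() {
  forecastMonthlyCost(db, function(err, forecast) {
    if (err) return console.error("forecast failed:", err.message);

    // Alert when the projection overshoots the budget by more than 10%.
    if (forecast.projectedMonthEnd > MONTHLY_BUDGET * 1.1) {
      llmEvents.emit("budget:warning", {
        type: "forecast",
        spent: forecast.monthToDate,
        limit: MONTHLY_BUDGET,
        projected: forecast.projectedMonthEnd
      });
    }
  });
}, 24 * 60 * 60 * 1000);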

Complete Working Example

Here is a full, self-contained monitoring module that ties everything together. It wraps LLM API calls, stores metrics in PostgreSQL, and exposes dashboard endpoints.

// llm-monitor.js
var https = require("https");
var EventEmitter = require("events");
var pg = require("pg");

var llmEvents = new EventEmitter();
var pool = null;

// ===== PRICING TABLE =====

var PRICING = {
  openai: {
    "gpt-4o": { input: 0.0025, output: 0.01 },
    "gpt-4o-mini": { input: 0.00015, output: 0.0006 }
  },
  anthropic: {
    "claude-sonnet-4-20250514": { input: 0.003, output: 0.015 },
    "claude-haiku-35-20241022": { input: 0.0008, output: 0.004 }
  }
};

// ===== DATABASE SETUP =====

function initialize(connectionString, callback) {
  pool = new pg.Pool({ connectionString: connectionString });

  var schema = [
    "CREATE TABLE IF NOT EXISTS llm_metrics (",
    "  id SERIAL PRIMARY KEY,",
    "  request_id TEXT UNIQUE NOT NULL,",
    "  provider TEXT NOT NULL,",
    "  model TEXT NOT NULL,",
    "  feature TEXT NOT NULL DEFAULT 'unknown',",
    "  user_id TEXT NOT NULL DEFAULT 'anonymous',",
    "  timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),",
    "  duration INTEGER NOT NULL,",
    "  input_tokens INTEGER NOT NULL DEFAULT 0,",
    "  output_tokens INTEGER NOT NULL DEFAULT 0,",
    "  total_tokens INTEGER NOT NULL DEFAULT 0,",
    "  cost NUMERIC(12,8) NOT NULL DEFAULT 0,",
    "  error TEXT,",
    "  error_type TEXT,",
    "  status_code INTEGER,",
    "  finish_reason TEXT,",
    "  user_feedback TEXT,",
    "  feedback_timestamp TIMESTAMPTZ",
    ");",
    "",
    "CREATE INDEX IF NOT EXISTS idx_llm_metrics_timestamp ON llm_metrics (timestamp);",
    "CREATE INDEX IF NOT EXISTS idx_llm_metrics_feature ON llm_metrics (feature);",
    "CREATE INDEX IF NOT EXISTS idx_llm_metrics_user ON llm_metrics (user_id);",
    "CREATE INDEX IF NOT EXISTS idx_llm_metrics_error_type ON llm_metrics (error_type);"
  ].join("\n");

  pool.query(schema, function(err) {
    if (err) return callback(err);
    setupEventListeners();
    callback(null, { status: "initialized" });
  });
}

// ===== EVENT LISTENERS =====

function setupEventListeners() {
  llmEvents.on("request:success", function(meta) {
    var query = [
      "INSERT INTO llm_metrics",
      "(request_id, provider, model, feature, user_id, duration,",
      " input_tokens, output_tokens, total_tokens, cost, finish_reason)",
      "VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)"
    ].join("\n");

    var params = [
      meta.requestId, meta.provider, meta.model, meta.feature,
      meta.user, meta.duration, meta.inputTokens, meta.outputTokens,
      meta.totalTokens, meta.cost, meta.finishReason
    ];

    pool.query(query, params, function(err) {
      if (err) console.error("Failed to store LLM metric:", err.message);
    });
  });

  llmEvents.on("request:error", function(meta) {
    var query = [
      "INSERT INTO llm_metrics",
      "(request_id, provider, model, feature, user_id, duration,",
      " error, error_type, status_code)",
      "VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)"
    ].join("\n");

    var params = [
      meta.requestId, meta.provider, meta.model, meta.feature,
      meta.user, meta.duration, meta.error, meta.errorType,
      meta.statusCode
    ];

    pool.query(query, params, function(err) {
      if (err) console.error("Failed to store LLM error metric:", err.message);
    });
  });
}

// ===== CORE WRAPPER =====

function generateRequestId() {
  return "llm_" + Date.now() + "_" + Math.random().toString(36).substring(2, 10);
}

function monitoredCall(options, callback) {
  var requestId = generateRequestId();
  var startTime = Date.now();
  var provider = options.provider || "openai";
  var model = options.model || "gpt-4o";
  var feature = options.feature || "unknown";
  var user = options.user || "anonymous";
  var messages = options.messages;
  var maxTokens = options.maxTokens || 4096;

  var requestBody = JSON.stringify({
    model: model,
    messages: messages,
    max_tokens: maxTokens
  });

  var hostMap = {
    openai: "api.openai.com",
    anthropic: "api.anthropic.com"
  };

  var pathMap = {
    openai: "/v1/chat/completions",
    anthropic: "/v1/messages"
  };

  var headers = { "Content-Type": "application/json" };

  if (provider === "openai") {
    headers["Authorization"] = "Bearer " + options.apiKey;
  } else if (provider === "anthropic") {
    headers["x-api-key"] = options.apiKey;
    headers["anthropic-version"] = "2023-06-01";
  }

  var reqOptions = {
    hostname: hostMap[provider],
    path: pathMap[provider],
    method: "POST",
    headers: headers
  };

  var req = https.request(reqOptions, function(res) {
    var body = "";
    res.on("data", function(chunk) { body += chunk; });
    res.on("end", function() {
      var duration = Date.now() - startTime;

      if (res.statusCode >= 400) {
        var errMeta = {
          requestId: requestId,
          provider: provider,
          model: model,
          feature: feature,
          user: user,
          duration: duration,
          error: body.substring(0, 500),
          errorType: classifyHTTPError(res.statusCode, body),
          statusCode: res.statusCode
        };
        llmEvents.emit("request:error", errMeta);
        return callback(new Error("LLM API error " + res.statusCode + ": " + body.substring(0, 200)));
      }

      var parsed;
      try { parsed = JSON.parse(body); }
      catch (e) { return callback(new Error("Invalid JSON response from LLM API")); }

      var usage = extractUsage(provider, parsed);
      var cost = calculateCost(provider, model, usage);

      var successMeta = {
        requestId: requestId,
        provider: provider,
        model: model,
        feature: feature,
        user: user,
        duration: duration,
        inputTokens: usage.inputTokens,
        outputTokens: usage.outputTokens,
        totalTokens: usage.totalTokens,
        cost: cost,
        finishReason: extractFinishReason(provider, parsed)
      };

      llmEvents.emit("request:success", successMeta);
      callback(null, parsed, successMeta);
    });
  });

  req.on("error", function(err) {
    var duration = Date.now() - startTime;
    var errMeta = {
      requestId: requestId,
      provider: provider,
      model: model,
      feature: feature,
      user: user,
      duration: duration,
      error: err.message,
      errorType: "connection_error",
      statusCode: null
    };
    llmEvents.emit("request:error", errMeta);
    callback(err);
  });

  req.setTimeout(30000, function() {
    req.destroy(new Error("Request timeout after 30000ms"));
  });

  req.write(requestBody);
  req.end();
}

function extractUsage(provider, response) {
  var usage = { inputTokens: 0, outputTokens: 0, totalTokens: 0 };

  if (provider === "openai" && response.usage) {
    usage.inputTokens = response.usage.prompt_tokens || 0;
    usage.outputTokens = response.usage.completion_tokens || 0;
    usage.totalTokens = response.usage.total_tokens || 0;
  } else if (provider === "anthropic" && response.usage) {
    usage.inputTokens = response.usage.input_tokens || 0;
    usage.outputTokens = response.usage.output_tokens || 0;
    usage.totalTokens = usage.inputTokens + usage.outputTokens;
  }

  return usage;
}

function calculateCost(provider, model, usage) {
  var p = PRICING[provider];
  if (!p) return 0;
  var m = p[model];
  if (!m) return 0;
  var inputCost = (usage.inputTokens / 1000) * m.input;
  var outputCost = (usage.outputTokens / 1000) * m.output;
  return Math.round((inputCost + outputCost) * 1000000) / 1000000;
}

function classifyHTTPError(statusCode, body) {
  if (statusCode === 429) return "rate_limit";
  if (statusCode === 401 || statusCode === 403) return "auth_error";
  if (statusCode === 400) return "bad_request";
  if (statusCode >= 500) return "provider_error";
  return "unknown";
}

function extractFinishReason(provider, response) {
  if (provider === "openai") {
    return response.choices && response.choices[0]
      ? response.choices[0].finish_reason
      : "unknown";
  }
  if (provider === "anthropic") {
    return response.stop_reason || "unknown";
  }
  return "unknown";
}

// ===== DASHBOARD ROUTES =====

function dashboardRoutes() {
  var express = require("express");
  var router = express.Router();

  router.get("/llm/dashboard", function(req, res) {
    var hours = parseInt(req.query.hours, 10) || 24;

    var summaryQuery = [
      "SELECT",
      "  COUNT(*) as total_requests,",
      "  COUNT(*) FILTER (WHERE error IS NOT NULL) as total_errors,",
      "  ROUND(100.0 * COUNT(*) FILTER (WHERE error IS NOT NULL) / NULLIF(COUNT(*), 0), 2) as error_rate,",
      "  ROUND(SUM(cost)::numeric, 4) as total_cost,",
      "  SUM(input_tokens) as total_input_tokens,",
      "  SUM(output_tokens) as total_output_tokens,",
      "  ROUND(AVG(duration)) as avg_latency_ms,",
      "  ROUND(PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY duration)) as p50_latency,",
      "  ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration)) as p95_latency,",
      "  ROUND(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration)) as p99_latency",
      "FROM llm_metrics",
      "WHERE timestamp >= NOW() - ($1 || ' hours')::INTERVAL"
    ].join("\n");

    var featureQuery = [
      "SELECT feature, COUNT(*) as requests,",
      "  ROUND(SUM(cost)::numeric, 4) as cost,",
      "  ROUND(AVG(duration)) as avg_latency",
      "FROM llm_metrics",
      "WHERE timestamp >= NOW() - ($1 || ' hours')::INTERVAL",
      "  AND error IS NULL",
      "GROUP BY feature ORDER BY cost DESC"
    ].join("\n");

    pool.query(summaryQuery, [hours], function(err, summaryResult) {
      if (err) return res.status(500).json({ error: err.message });

      pool.query(featureQuery, [hours], function(err2, featureResult) {
        if (err2) return res.status(500).json({ error: err2.message });

        res.json({
          period: hours + " hours",
          generated: new Date().toISOString(),
          summary: summaryResult.rows[0],
          byFeature: featureResult.rows
        });
      });
    });
  });

  router.get("/llm/dashboard/forecast", function(req, res) {
    forecastMonthlyCost(pool, function(err, forecast) {
      if (err) return res.status(500).json({ error: err.message });
      res.json(forecast);
    });
  });

  return router;
}

function forecastMonthlyCost(dbPool, callback) {
  var now = new Date();
  var dayOfMonth = now.getDate();
  var daysInMonth = new Date(now.getFullYear(), now.getMonth() + 1, 0).getDate();

  var query = [
    "SELECT COALESCE(SUM(cost), 0) as month_to_date",
    "FROM llm_metrics",
    "WHERE timestamp >= DATE_TRUNC('month', CURRENT_DATE)"
  ].join("\n");

  dbPool.query(query, function(err, result) {
    if (err) return callback(err);
    var spent = parseFloat(result.rows[0].month_to_date);
    var dailyRate = spent / Math.max(dayOfMonth, 1);
    var projected = dailyRate * daysInMonth;
    callback(null, {
      monthToDate: Math.round(spent * 100) / 100,
      dailyRate: Math.round(dailyRate * 100) / 100,
      projectedMonthEnd: Math.round(projected * 100) / 100,
      daysRemaining: daysInMonth - dayOfMonth
    });
  });
}

// ===== EXPORTS =====

module.exports = {
  initialize: initialize,
  call: monitoredCall,
  events: llmEvents,
  dashboardRoutes: dashboardRoutes
};

Usage

var express = require("express");
var llmMonitor = require("./llm-monitor");

var app = express();
app.use(express.json());

llmMonitor.initialize(process.env.POSTGRES_CONNECTION_STRING, function(err) {
  if (err) {
    console.error("Failed to initialize LLM monitor:", err.message);
    process.exit(1);
  }

  app.use(llmMonitor.dashboardRoutes());

  app.post("/api/summarize", function(req, res) {
    llmMonitor.call({
      provider: "openai",
      apiKey: process.env.OPENAI_API_KEY,
      model: "gpt-4o-mini",
      feature: "summarization",
      user: req.user ? req.user.id : "anonymous",
      messages: [
        { role: "system", content: "Summarize the following text concisely." },
        { role: "user", content: req.body.text }
      ],
      maxTokens: 500
    }, function(err, response, meta) {
      if (err) return res.status(500).json({ error: err.message });
      res.json({
        summary: response.choices[0].message.content,
        metrics: {
          requestId: meta.requestId,
          tokens: meta.totalTokens,
          cost: meta.cost,
          latency: meta.duration
        }
      });
    });
  });

  app.listen(3000, function() {
    console.log("Server running on port 3000");
  });
});

Hit GET /llm/dashboard?hours=24 for a full summary of your LLM usage. Hit GET /llm/dashboard/forecast to see projected month-end cost.

Common Issues and Troubleshooting

1. Token Counts Are Zero or Missing

Error: usage.prompt_tokens is undefined

Some providers do not return token counts for streaming responses. If you are using stream: true, the token counts only appear in the final chunk or not at all. Disable streaming for monitoring accuracy, or use a tokenizer library like tiktoken to estimate counts from the prompt and response text. Also check that you are accessing the correct field name — OpenAI uses prompt_tokens while Anthropic uses input_tokens.
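
If you do need an estimate, a tokenizer gets you close. A sketch using the tiktoken npm package; treat the exact API as an assumption and check the package documentation for your version:

var tiktoken = require("tiktoken");

function estimateTokens(model, text) {
  var enc = tiktoken.encoding_for_model(model);
  var count = enc.encode(text).length;
  enc.free(); // the WASM encoder must be released explicitly
  return count;
}

var approxInputTokens = estimateTokens("gpt-4o", "Summarize the following text concisely.");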

2. Cost Calculations Do Not Match Provider Invoice

Warning: Monthly tracked cost $847.23 vs. provider invoice $912.56 (7.7% discrepancy)

This usually happens for three reasons. First, your pricing table is out of date — providers change prices without much notice. Second, you are missing some calls. If any code path bypasses your monitoring wrapper and calls the API directly, those costs are invisible. Third, floating point rounding accumulates error at scale. Always compare your tracked totals with the invoice monthly and adjust. A discrepancy under 2% is normal; over 5% means you have a leak.

3. PostgreSQL Connection Pool Exhaustion

Error: Cannot use a pool after calling end on the pool
Error: timeout exceeded when trying to connect

If your LLM monitoring generates many concurrent writes (high traffic application), the default connection pool size of 10 may not be enough. Increase max in the Pool configuration and make sure you are not creating new pools on every request. The pool should be a module-level singleton. Also verify that failed queries are not holding connections open — always handle errors in query callbacks.
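
A pool configuration along these lines is a reasonable starting point for a higher-traffic service; the exact numbers depend on your database and host limits:

var pool = new pg.Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING,
  max: 20,                       // default is 10; raise for heavy concurrent metric writes
  idleTimeoutMillis: 30000,      // release idle clients back to the pool
  connectionTimeoutMillis: 5000  // fail fast instead of queueing forever
});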

4. Rate Limit Alerts Firing During Normal Operation

Alert: rate_limit threshold exceeded - 12 errors in 5 minutes (threshold: 10)

If you are consistently hitting rate limit alerts, your threshold is too low for your traffic or your concurrent request count is too high. First, check your provider's actual rate limits and compare to your request volume. Implement client-side request queuing with a concurrency limiter before adjusting alert thresholds upward. Adding exponential backoff with jitter to your retry logic will also help smooth out bursts. Do not just silence the alert — rate limit errors mean you are wasting time and compute on failed requests.
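
A retry helper with exponential backoff and full jitter is small enough to sketch here. It assumes the wrapper attaches a statusCode to the errors it returns (the monitoredCall sketch above would need a one-line change, err.statusCode = res.statusCode, before calling back):

function callWithRetry(options, attempt, callback) {
  monitoredCall(options, function(err, response, meta) {
    if (!err) return callback(null, response, meta);

    // Only retry rate limits and transient provider errors, and only a few times.
    var status = err.statusCode || 0;
    var retriable = status === 429 || (status >= 500 && status < 600);
    if (!retriable || attempt >= 3) return callback(err);

    // Full jitter: random delay up to 1s, then 2s, then 4s.
    var delay = Math.random() * Math.pow(2, attempt) * 1000;
    setTimeout(function() {
      callWithRetry(options, attempt + 1, callback);
    }, delay);
  });
}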

5. Dashboard Queries Slow on Large Tables

Query execution time: 12847ms (expected < 500ms)

The llm_metrics table grows fast. After a few months with moderate traffic, you could have millions of rows. Make sure you have indexes on timestamp, feature, and error_type columns. For dashboard queries that aggregate over long time periods, consider creating a materialized view that pre-aggregates hourly data. Partition the table by month if you are retaining more than 90 days of data.
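
A pre-aggregated hourly rollup, in the same string-array style used above, might look like this (a sketch; refresh it from a timer or a cron job):

var createRollup = [
  "CREATE MATERIALIZED VIEW IF NOT EXISTS llm_metrics_hourly AS",
  "SELECT DATE_TRUNC('hour', timestamp) AS hour,",
  "  feature, model,",
  "  COUNT(*) AS requests,",
  "  SUM(cost) AS cost,",
  "  SUM(input_tokens) AS input_tokens,",
  "  SUM(output_tokens) AS output_tokens,",
  "  AVG(duration) AS avg_duration",
  "FROM llm_metrics",
  "GROUP BY 1, 2, 3"
].join("\n");

pool.query(createRollup, function(err) {
  if (err) console.error("rollup creation failed:", err.message);
});

// Refresh hourly; dashboards read from llm_metrics_hourly instead of the raw table.
setInterval(function() {
  pool.query("REFRESH MATERIALIZED VIEW llm_metrics_hourly", function(err) {
    if (err) console.error("rollup refresh failed:", err.message);
  });
}, 60 * 60 * 1000);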

Best Practices

  • Wrap every LLM call through a single monitoring layer. No exceptions. The moment someone makes a "quick" direct API call that bypasses monitoring, you lose visibility. Code review for this. Lint for this. Make the direct API client private to the monitoring module.

  • Store raw per-request metrics, not just aggregates. You will always need to drill down. An aggregate that says "cost spiked 300% on Tuesday" is useless without per-request data that shows which feature, user, and prompt caused it. Disk is cheap; regret is expensive.

  • Update your pricing table proactively. Subscribe to your provider's changelog or pricing page. Set a calendar reminder to check monthly. A stale pricing table means your dashboards lie to you, and you will not know until the invoice arrives.

  • Set budget alerts at 80%, not 100%. If you alert at 100%, the money is already spent. An 80% warning gives you time to investigate and react. Set a hard circuit breaker at 100% for non-critical features, and make sure critical features have their own dedicated budget.

  • Track cost per user from day one, even if you do not charge for it yet. When you eventually need to implement usage tiers, rate limiting by user, or cost-based pricing, you will be glad you have the historical data. Retrofitting user attribution is painful.

  • Separate monitoring storage failures from LLM call failures. If your metrics database goes down, your LLM calls should still work. The event emitter pattern handles this naturally — a failed database write is logged but does not propagate to the caller. Never let observability infrastructure bring down production features.

  • Implement request-level cost caps. A single malformed prompt with a 200,000-token context can cost $5+ in one call. Set a maximum token count per request at the middleware level, and reject prompts that exceed it before they reach the API (see the sketch after this list).

  • Review the dashboard weekly. The best monitoring system in the world is useless if nobody looks at it. Build a weekly habit of checking cost trends, latency percentiles, and error rates. Five minutes per week prevents five-figure surprises.
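
A minimal sketch of such a cap, using a rough characters-per-token estimate to reject oversized prompts before they reach the provider (the 4-characters-per-token ratio is a heuristic, not an exact count):

var MAX_ESTIMATED_INPUT_TOKENS = 8000;

function enforceCostCap(messages) {
  var totalChars = messages.reduce(function(sum, m) {
    return sum + (m.content ? m.content.length : 0);
  }, 0);
  var estimatedTokens = Math.ceil(totalChars / 4); // rough heuristic
  if (estimatedTokens > MAX_ESTIMATED_INPUT_TOKENS) {
    throw new Error("Prompt too large: ~" + estimatedTokens +
      " tokens exceeds the per-request cap of " + MAX_ESTIMATED_INPUT_TOKENS);
  }
}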
