LLM Application Monitoring: Metrics That Matter

Monitor LLM applications with specialized metrics for performance, cost, quality, and reliability, plus dashboards and alerts built in Node.js.

Traditional application monitoring tracks request rates, error codes, and response times. LLM applications need all of that plus an entirely different layer of observability: token economics, output quality, model behavior drift, and cost attribution. If you are running LLM features in production without specialized monitoring, you are flying blind over expensive terrain.

This article covers the four categories of metrics every LLM application needs and shows how to implement monitoring middleware in Node.js with Express, store everything in PostgreSQL for analysis, and build dashboards and alerts that keep your team informed and your budget intact.

Prerequisites

  • Node.js v18+ and npm
  • PostgreSQL 14+ running locally or in the cloud
  • An Express.js application that calls an LLM API (OpenAI, Anthropic, etc.)
  • Basic familiarity with SQL and middleware patterns
  • A working understanding of LLM API concepts (tokens, completions, streaming)

Why LLM Applications Need Specialized Monitoring

Standard APM tools were designed for deterministic systems. You send a request, you get a response, and the resource cost is roughly predictable. LLM applications break every one of those assumptions.

Non-deterministic outputs. The same prompt can produce different results every time. A response that was excellent yesterday might be mediocre today after a model update. You cannot rely on functional tests alone to catch quality regressions.

Cost sensitivity. A single misguided loop that sends GPT-4 requests can burn through hundreds of dollars in minutes. I have seen a staging environment rack up $2,400 overnight because a retry mechanism had no backoff and no budget cap. Traditional monitoring would have shown healthy 200 responses the entire time.

Variable latency. LLM responses range from 500ms to 60 seconds depending on prompt complexity, output length, and provider load. Fixed timeout thresholds from your standard HTTP monitoring will either fire constantly or miss real problems.

Quality drift. Model providers update their models regularly. These updates can subtly change behavior in ways that break your application's assumptions. Without quality metrics, you will not know until users start complaining.

The Four Categories of LLM Metrics

Every LLM monitoring system should capture metrics across four categories:

  1. Performance — How fast are responses? What do the latency distributions look like?
  2. Cost — How much is each request, user, and feature costing you?
  3. Quality — Are the outputs actually good? Are users satisfied?
  4. Reliability — Are requests succeeding? How often do you hit rate limits or timeouts?

Let us break each one down.

Performance Metrics

Latency in LLM applications is more nuanced than a single response time number. You need to track:

  • Time to First Token (TTFT): How long before the user sees the first token in a streaming response. This is the perceived responsiveness metric. Users will tolerate a 10-second total response if tokens start appearing in 300ms.
  • Total Response Time: End-to-end duration from request initiation to final token received. This is what matters for non-streaming use cases and background processing.
  • Percentile distributions (p50, p95, p99): Averages are meaningless for LLM latency. A p50 of 1.2s with a p99 of 45s tells a completely different story than an average of 3.8s. Always track percentiles.
  • Tokens per second: The throughput rate of the model. This helps you identify provider-side degradation before it shows up in your error rates.
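
A minimal timer helper can capture the per-request numbers that feed these metrics:
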
var LLMTimer = function() {
  this.startTime = null;
  this.firstTokenTime = null;
  this.endTime = null;
  this.tokenCount = 0;
};

LLMTimer.prototype.start = function() {
  this.startTime = Date.now();
};

LLMTimer.prototype.markFirstToken = function() {
  if (!this.firstTokenTime) {
    this.firstTokenTime = Date.now();
  }
};

LLMTimer.prototype.markToken = function() {
  this.tokenCount++;
};

LLMTimer.prototype.end = function() {
  this.endTime = Date.now();
  var totalMs = this.endTime - this.startTime;
  var ttftMs = this.firstTokenTime ? this.firstTokenTime - this.startTime : null;
  var tokensPerSecond = totalMs > 0 ? (this.tokenCount / totalMs) * 1000 : 0;

  return {
    total_ms: totalMs,
    ttft_ms: ttftMs,
    tokens_per_second: Math.round(tokensPerSecond * 100) / 100,
    token_count: this.tokenCount
  };
};
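
To show where the timer calls go, here is a usage sketch around a streaming call. The streamCompletion helper, userPrompt, and the onToken/onDone callbacks are hypothetical placeholders for whatever streaming wrapper your provider SDK exposes; only the timer calls are the point.

var timer = new LLMTimer();
timer.start();

// streamCompletion is a hypothetical stand-in for your provider's streaming wrapper.
streamCompletion({ model: "gpt-4o-mini", prompt: userPrompt }, {
  onToken: function(token) {
    timer.markFirstToken(); // no-op after the first token
    timer.markToken();
  },
  onDone: function() {
    var perf = timer.end();
    console.log("TTFT:", perf.ttft_ms, "ms | total:", perf.total_ms, "ms |", perf.tokens_per_second, "tok/s");
  }
});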

Token Usage Metrics

Tokens are the currency of LLM APIs. Every request has an input token count (your prompt) and an output token count (the model's response). Track both separately because they are priced differently and they tell you different things.

  • Input tokens per request: Measures prompt complexity. A sudden spike means someone changed a system prompt or your context injection is pulling in too much data.
  • Output tokens per request: Measures response verbosity. If this creeps up, the model may be generating more verbose answers than needed.
  • Total tokens per request: The billing-relevant number.
  • Token efficiency ratio: Output tokens divided by input tokens. A ratio below 0.1 might mean you are sending massive prompts for tiny answers — a design smell worth investigating.
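
The efficiency ratio is cheap to compute from the same counts the monitor already records. A minimal sketch; the 0.1 cutoff simply mirrors the design-smell threshold mentioned above and should be tuned per feature:

function tokenEfficiency(inputTokens, outputTokens) {
  if (!inputTokens) return null;
  var ratio = outputTokens / inputTokens;
  return {
    ratio: Math.round(ratio * 1000) / 1000,
    // The 0.1 cutoff is an assumption worth tuning per feature.
    suspicious: ratio < 0.1
  };
}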

Cost Metrics

Cost monitoring is not optional for LLM applications. It is a first-class operational concern.

  • Cost per request: Calculated from token counts and model pricing. Track this per model since pricing varies dramatically.
  • Cost per user: Attribute LLM costs to individual users. Some users will consume 100x more than others. You need to know who they are.
  • Cost per feature: If your app has multiple LLM-powered features (chat, summarization, code generation), track costs separately. This informs product decisions about which features to optimize or gate.
  • Daily and monthly burn rate: The single most important number for your finance team. Set alerts on this aggressively.
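
With a per-model pricing table, cost per request is a straightforward calculation:
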
var MODEL_PRICING = {
  "gpt-4o": { input_per_1k: 0.0025, output_per_1k: 0.01 },
  "gpt-4o-mini": { input_per_1k: 0.00015, output_per_1k: 0.0006 },
  "claude-sonnet-4-20250514": { input_per_1k: 0.003, output_per_1k: 0.015 },
  "claude-haiku-4-20250414": { input_per_1k: 0.0008, output_per_1k: 0.004 }
};

function calculateCost(model, inputTokens, outputTokens) {
  var pricing = MODEL_PRICING[model];
  if (!pricing) {
    return { input_cost: 0, output_cost: 0, total_cost: 0, model: model, error: "unknown_model" };
  }
  var inputCost = (inputTokens / 1000) * pricing.input_per_1k;
  var outputCost = (outputTokens / 1000) * pricing.output_per_1k;
  return {
    input_cost: Math.round(inputCost * 1000000) / 1000000,
    output_cost: Math.round(outputCost * 1000000) / 1000000,
    total_cost: Math.round((inputCost + outputCost) * 1000000) / 1000000,
    model: model
  };
}
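
For example, at the rates above, a gpt-4o request with 1,200 input tokens and 300 output tokens costs (1.2 × 0.0025) + (0.3 × 0.01) = $0.006.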

Quality Metrics

Quality is the hardest category to measure, but it is arguably the most important. A fast, cheap response that is wrong is worse than no response at all.

  • User satisfaction signals: Thumbs up/down, star ratings, or explicit feedback. This is your ground truth.
  • Regeneration rate: How often users click "regenerate" or "try again." A high regeneration rate is a strong signal that output quality is poor.
  • Edit rate: If users can edit LLM output, track how much they change. Heavy editing means the model is getting close but not close enough.
  • Abandonment rate: Users who start an LLM interaction but leave before completion. High abandonment often correlates with latency problems or irrelevant outputs.
  • Automated quality scores: Use a smaller, cheaper model to evaluate outputs against criteria. Not perfect, but useful for trend detection.
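
For the last bullet, a minimal LLM-as-judge sketch is shown below. It assumes the same openai client wrapper used in the route examples later in this article; the rubric wording and the numeric-only reply format are illustrative, not prescriptive.

var openai = require("./services/openai"); // same client wrapper as the route examples below

// Ask a cheap model to rate an output from 1 to 5 against a rubric.
// Treat the scores as trend data, not absolute truth.
function scoreOutput(prompt, output, callback) {
  openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Rate the assistant response for relevance and correctness on a scale of 1 to 5. Reply with only the number." },
      { role: "user", content: "Prompt:\n" + prompt + "\n\nResponse:\n" + output }
    ]
  }).then(function(response) {
    var score = parseInt(response.choices[0].message.content.trim(), 10);
    callback(null, isNaN(score) ? null : score);
  }).catch(function(err) {
    callback(err);
  });
}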

Reliability Metrics

Reliability metrics tell you whether the system is functioning at a basic level.

  • Error rate: Percentage of LLM requests that fail (HTTP 500, API errors, malformed responses).
  • Timeout rate: Percentage of requests that exceed your timeout threshold. Track this separately from errors because timeouts usually indicate provider load, not bugs.
  • Rate limit hits: How often you hit the provider's rate limits. If this is non-zero, you need better queuing or a higher tier.
  • Retry rate: How often requests need retries. High retry rates mask latency and cost problems.
  • Fallback rate: If you have a fallback model (e.g., fall back from GPT-4o to GPT-4o-mini), track how often the fallback activates.
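
Fallback rate is only measurable if the fallback path reports itself. Here is a rough sketch of a wrapper that tries a primary model, retries once on a cheaper one, and surfaces is_fallback for the metric record; callModel is a hypothetical stand-in for your provider call:

function callWithFallback(primaryModel, fallbackModel, params, callback) {
  callModel(primaryModel, params, function(err, result) {
    if (!err) {
      return callback(null, { result: result, model: primaryModel, is_fallback: false });
    }
    // Primary failed (rate limit, timeout, 5xx): try the cheaper fallback once.
    callModel(fallbackModel, params, function(fallbackErr, fallbackResult) {
      if (fallbackErr) return callback(fallbackErr);
      callback(null, { result: fallbackResult, model: fallbackModel, is_fallback: true });
    });
  });
}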

Database Schema for LLM Metrics

Before building the middleware, set up a PostgreSQL schema to store everything.

CREATE TABLE llm_metrics (
  id SERIAL PRIMARY KEY,
  request_id UUID NOT NULL UNIQUE DEFAULT gen_random_uuid(),
  timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),

  -- Request context
  user_id VARCHAR(255),
  feature VARCHAR(100) NOT NULL,
  model VARCHAR(100) NOT NULL,
  endpoint VARCHAR(255),

  -- Performance
  total_ms INTEGER NOT NULL,
  ttft_ms INTEGER,
  tokens_per_second NUMERIC(10, 2),

  -- Tokens
  input_tokens INTEGER NOT NULL DEFAULT 0,
  output_tokens INTEGER NOT NULL DEFAULT 0,
  total_tokens INTEGER GENERATED ALWAYS AS (input_tokens + output_tokens) STORED,

  -- Cost
  input_cost NUMERIC(12, 6) NOT NULL DEFAULT 0,
  output_cost NUMERIC(12, 6) NOT NULL DEFAULT 0,
  total_cost NUMERIC(12, 6) NOT NULL DEFAULT 0,

  -- Reliability
  status VARCHAR(20) NOT NULL DEFAULT 'success',
  error_type VARCHAR(100),
  error_message TEXT,
  retry_count INTEGER NOT NULL DEFAULT 0,
  is_fallback BOOLEAN NOT NULL DEFAULT FALSE,

  -- Metadata
  prompt_hash VARCHAR(64),
  metadata JSONB DEFAULT '{}'
);

CREATE INDEX idx_llm_metrics_timestamp ON llm_metrics (timestamp);
CREATE INDEX idx_llm_metrics_user_id ON llm_metrics (user_id);
CREATE INDEX idx_llm_metrics_feature ON llm_metrics (feature);
CREATE INDEX idx_llm_metrics_model ON llm_metrics (model);
CREATE INDEX idx_llm_metrics_status ON llm_metrics (status);

CREATE TABLE llm_quality_signals (
  id SERIAL PRIMARY KEY,
  request_id UUID NOT NULL,
  timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  user_id VARCHAR(255),
  signal_type VARCHAR(50) NOT NULL,
  signal_value INTEGER NOT NULL,
  comment TEXT,
  CONSTRAINT fk_request FOREIGN KEY (request_id) REFERENCES llm_metrics(request_id)
);

CREATE INDEX idx_quality_signals_request ON llm_quality_signals (request_id);
CREATE INDEX idx_quality_signals_type ON llm_quality_signals (signal_type);

Implementing the Monitoring Middleware

Here is the core monitoring module. It wraps LLM API calls, captures all four metric categories, and stores them in PostgreSQL.

var crypto = require("crypto");
var pg = require("pg");

var pool = new pg.Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING
});

var MODEL_PRICING = {
  "gpt-4o": { input_per_1k: 0.0025, output_per_1k: 0.01 },
  "gpt-4o-mini": { input_per_1k: 0.00015, output_per_1k: 0.0006 },
  "claude-sonnet-4-20250514": { input_per_1k: 0.003, output_per_1k: 0.015 },
  "claude-haiku-4-20250414": { input_per_1k: 0.0008, output_per_1k: 0.004 }
};

function LLMMonitor(options) {
  options = options || {};
  this.pool = options.pool || pool;
  this.alertThresholds = options.alertThresholds || {
    cost_per_request: 0.50,
    latency_p95_ms: 30000,
    error_rate_percent: 5,
    daily_cost_limit: 100
  };
  this.alertCallback = options.onAlert || function(alert) {
    console.error("[LLM ALERT]", JSON.stringify(alert));
  };
}

LLMMonitor.prototype.recordMetric = function(metric, callback) {
  var self = this;
  var cost = calculateCost(metric.model, metric.input_tokens, metric.output_tokens);

  var query = [
    "INSERT INTO llm_metrics",
    "(request_id, user_id, feature, model, endpoint,",
    " total_ms, ttft_ms, tokens_per_second,",
    " input_tokens, output_tokens,",
    " input_cost, output_cost, total_cost,",
    " status, error_type, error_message,",
    " retry_count, is_fallback, prompt_hash, metadata)",
    "VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20)",
    "RETURNING id, request_id"
  ].join(" ");

  var values = [
    metric.request_id || crypto.randomUUID(),
    metric.user_id || null,
    metric.feature || "unknown",
    metric.model,
    metric.endpoint || null,
    metric.total_ms,
    metric.ttft_ms || null,
    metric.tokens_per_second || null,
    metric.input_tokens || 0,
    metric.output_tokens || 0,
    cost.input_cost,
    cost.output_cost,
    cost.total_cost,
    metric.status || "success",
    metric.error_type || null,
    metric.error_message || null,
    metric.retry_count || 0,
    metric.is_fallback || false,
    metric.prompt_hash || null,
    JSON.stringify(metric.metadata || {})
  ];

  self.pool.query(query, values, function(err, result) {
    if (err) {
      console.error("[LLMMonitor] Failed to record metric:", err.message);
      if (callback) callback(err);
      return;
    }

    // Check alert thresholds
    if (cost.total_cost > self.alertThresholds.cost_per_request) {
      self.alertCallback({
        type: "high_cost_request",
        request_id: values[0],
        cost: cost.total_cost,
        threshold: self.alertThresholds.cost_per_request,
        model: metric.model,
        feature: metric.feature
      });
    }

    if (metric.total_ms > self.alertThresholds.latency_p95_ms) {
      self.alertCallback({
        type: "high_latency",
        request_id: values[0],
        latency_ms: metric.total_ms,
        threshold: self.alertThresholds.latency_p95_ms,
        model: metric.model
      });
    }

    if (callback) callback(null, result.rows[0]);
  });
};

LLMMonitor.prototype.recordQualitySignal = function(requestId, userId, signalType, signalValue, comment, callback) {
  var query = [
    "INSERT INTO llm_quality_signals (request_id, user_id, signal_type, signal_value, comment)",
    "VALUES ($1, $2, $3, $4, $5)"
  ].join(" ");

  this.pool.query(query, [requestId, userId, signalType, signalValue, comment || null], function(err) {
    if (err) {
      console.error("[LLMMonitor] Failed to record quality signal:", err.message);
    }
    if (callback) callback(err);
  });
};

function calculateCost(model, inputTokens, outputTokens) {
  var pricing = MODEL_PRICING[model];
  if (!pricing) {
    return { input_cost: 0, output_cost: 0, total_cost: 0 };
  }
  var inputCost = (inputTokens / 1000) * pricing.input_per_1k;
  var outputCost = (outputTokens / 1000) * pricing.output_per_1k;
  return {
    input_cost: Math.round(inputCost * 1000000) / 1000000,
    output_cost: Math.round(outputCost * 1000000) / 1000000,
    total_cost: Math.round((inputCost + outputCost) * 1000000) / 1000000
  };
}

module.exports = LLMMonitor;

Express.js Monitoring Middleware

Now wrap the monitor into an Express middleware that automatically instruments any route that calls an LLM.

var crypto = require("crypto");
var LLMMonitor = require("./llm-monitor");

var monitor = new LLMMonitor({
  alertThresholds: {
    cost_per_request: 0.25,
    latency_p95_ms: 20000,
    error_rate_percent: 5,
    daily_cost_limit: 50
  },
  onAlert: function(alert) {
    console.error("[ALERT]", alert.type, JSON.stringify(alert));
    // Send to Slack, PagerDuty, etc.
  }
});

function llmMetricsMiddleware(feature) {
  return function(req, res, next) {
    var requestId = crypto.randomUUID();
    var startTime = Date.now();
    var firstTokenTime = null;

    req.llmMetrics = {
      requestId: requestId,
      feature: feature,
      startTime: startTime,

      markFirstToken: function() {
        if (!firstTokenTime) {
          firstTokenTime = Date.now();
        }
      },

      record: function(data, callback) {
        var endTime = Date.now();
        var metric = {
          request_id: requestId,
          user_id: req.user ? req.user.id : req.ip,
          feature: feature,
          model: data.model,
          endpoint: req.originalUrl,
          total_ms: endTime - startTime,
          ttft_ms: firstTokenTime ? firstTokenTime - startTime : null,
          tokens_per_second: data.tokens_per_second || null,
          input_tokens: data.input_tokens || 0,
          output_tokens: data.output_tokens || 0,
          status: data.status || "success",
          error_type: data.error_type || null,
          error_message: data.error_message || null,
          retry_count: data.retry_count || 0,
          is_fallback: data.is_fallback || false,
          prompt_hash: data.prompt_hash || null,
          metadata: data.metadata || {}
        };

        monitor.recordMetric(metric, callback);
      }
    };

    next();
  };
}

module.exports = llmMetricsMiddleware;

Usage in a route:

var express = require("express");
var router = express.Router();
var llmMetrics = require("./middleware/llm-metrics");
var openai = require("./services/openai");

router.post("/api/summarize", llmMetrics("summarization"), function(req, res) {
  var startTime = Date.now();

  openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Summarize the following text concisely." },
      { role: "user", content: req.body.text }
    ]
  }).then(function(response) {
    var usage = response.usage || {};

    req.llmMetrics.record({
      model: "gpt-4o-mini",
      input_tokens: usage.prompt_tokens || 0,
      output_tokens: usage.completion_tokens || 0,
      status: "success"
    });

    res.json({
      summary: response.choices[0].message.content,
      request_id: req.llmMetrics.requestId
    });
  }).catch(function(err) {
    req.llmMetrics.record({
      model: "gpt-4o-mini",
      status: "error",
      error_type: err.code || "unknown",
      error_message: err.message
    });

    res.status(500).json({ error: "Summarization failed" });
  });
});
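
The markFirstToken hook in the middleware only matters for streaming routes. Here is a rough streaming variant in the same router file as the example above. It assumes the official OpenAI Node SDK, whose streaming response is an async iterable (hence the async handler), and uses the stream_options flag discussed in the troubleshooting section so token counts still arrive on the final chunk:

router.post("/api/chat-stream", llmMetrics("chat"), async function(req, res) {
  res.setHeader("Content-Type", "text/plain; charset=utf-8");
  var usage = null;

  try {
    var stream = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: req.body.message }],
      stream: true,
      stream_options: { include_usage: true }
    });

    for await (var chunk of stream) {
      var delta = chunk.choices && chunk.choices[0] && chunk.choices[0].delta;
      if (delta && delta.content) {
        req.llmMetrics.markFirstToken(); // records TTFT on the first content chunk only
        res.write(delta.content);
      }
      if (chunk.usage) usage = chunk.usage; // final chunk carries usage when include_usage is set
    }

    req.llmMetrics.record({
      model: "gpt-4o-mini",
      input_tokens: usage ? usage.prompt_tokens : 0,
      output_tokens: usage ? usage.completion_tokens : 0,
      status: "success"
    });
    res.end();
  } catch (err) {
    req.llmMetrics.record({
      model: "gpt-4o-mini",
      status: "error",
      error_type: err.code || "unknown",
      error_message: err.message
    });
    if (!res.headersSent) res.status(500);
    res.end();
  }
});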

Building Real-Time Dashboards

The dashboard endpoints aggregate metrics from PostgreSQL and return data suitable for any frontend charting library.

var express = require("express");
var router = express.Router();
var pg = require("pg");

var pool = new pg.Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING
});

// Overview dashboard - last 24 hours
router.get("/dashboard/overview", function(req, res) {
  var hours = parseInt(req.query.hours) || 24;

  var query = [
    "SELECT",
    "  COUNT(*) AS total_requests,",
    "  COUNT(*) FILTER (WHERE status = 'success') AS successful,",
    "  COUNT(*) FILTER (WHERE status = 'error') AS failed,",
    "  COUNT(*) FILTER (WHERE status = 'timeout') AS timeouts,",
    "  ROUND(AVG(total_ms)) AS avg_latency_ms,",
    "  PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY total_ms) AS p50_ms,",
    "  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_ms) AS p95_ms,",
    "  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY total_ms) AS p99_ms,",
    "  SUM(total_cost) AS total_cost,",
    "  ROUND(AVG(total_cost)::NUMERIC, 6) AS avg_cost_per_request,",
    "  SUM(input_tokens) AS total_input_tokens,",
    "  SUM(output_tokens) AS total_output_tokens,",
    "  COUNT(DISTINCT user_id) AS unique_users",
    "FROM llm_metrics",
    "WHERE timestamp > NOW() - INTERVAL '" + hours + " hours'"
  ].join("\n");

  pool.query(query, function(err, result) {
    if (err) {
      return res.status(500).json({ error: err.message });
    }
    res.json(result.rows[0]);
  });
});

// Cost breakdown by feature
router.get("/dashboard/cost-by-feature", function(req, res) {
  var days = parseInt(req.query.days) || 7;

  var query = [
    "SELECT",
    "  feature,",
    "  model,",
    "  COUNT(*) AS request_count,",
    "  SUM(total_cost) AS total_cost,",
    "  ROUND(AVG(total_cost)::NUMERIC, 6) AS avg_cost,",
    "  SUM(total_tokens) AS total_tokens",
    "FROM llm_metrics",
    "WHERE timestamp > NOW() - INTERVAL '" + days + " days'",
    "GROUP BY feature, model",
    "ORDER BY total_cost DESC"
  ].join("\n");

  pool.query(query, function(err, result) {
    if (err) {
      return res.status(500).json({ error: err.message });
    }
    res.json(result.rows);
  });
});

// Latency trends - hourly buckets
router.get("/dashboard/latency-trends", function(req, res) {
  var hours = parseInt(req.query.hours) || 24;

  var query = [
    "SELECT",
    "  DATE_TRUNC('hour', timestamp) AS hour,",
    "  model,",
    "  COUNT(*) AS requests,",
    "  ROUND(AVG(total_ms)) AS avg_ms,",
    "  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_ms) AS p95_ms,",
    "  ROUND(AVG(tokens_per_second)::NUMERIC, 2) AS avg_tps",
    "FROM llm_metrics",
    "WHERE timestamp > NOW() - INTERVAL '" + hours + " hours'",
    "AND status = 'success'",
    "GROUP BY DATE_TRUNC('hour', timestamp), model",
    "ORDER BY hour DESC"
  ].join("\n");

  pool.query(query, function(err, result) {
    if (err) {
      return res.status(500).json({ error: err.message });
    }
    res.json(result.rows);
  });
});

// Quality signals summary
router.get("/dashboard/quality", function(req, res) {
  var days = parseInt(req.query.days) || 7;

  var query = [
    "SELECT",
    "  m.feature,",
    "  m.model,",
    "  COUNT(q.id) AS total_signals,",
    "  COUNT(q.id) FILTER (WHERE q.signal_type = 'thumbs_up') AS thumbs_up,",
    "  COUNT(q.id) FILTER (WHERE q.signal_type = 'thumbs_down') AS thumbs_down,",
    "  COUNT(q.id) FILTER (WHERE q.signal_type = 'regenerate') AS regenerations,",
    "  ROUND(",
    "    COUNT(q.id) FILTER (WHERE q.signal_type = 'thumbs_up')::NUMERIC /",
    "    NULLIF(COUNT(q.id) FILTER (WHERE q.signal_type IN ('thumbs_up','thumbs_down')), 0) * 100",
    "  , 1) AS satisfaction_pct",
    "FROM llm_metrics m",
    "LEFT JOIN llm_quality_signals q ON m.request_id = q.request_id",
    "WHERE m.timestamp > NOW() - INTERVAL '" + days + " days'",
    "GROUP BY m.feature, m.model",
    "ORDER BY total_signals DESC"
  ].join("\n");

  pool.query(query, function(err, result) {
    if (err) {
      return res.status(500).json({ error: err.message });
    }
    res.json(result.rows);
  });
});

module.exports = router;

Setting Up Alerts

Alerts need to catch three categories of problems: cost spikes, latency degradation, and error rate increases. The alert checker runs on a schedule and compares recent metrics against thresholds and historical baselines.

var pg = require("pg");

var pool = new pg.Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING
});

function AlertChecker(options) {
  this.pool = options.pool || pool;
  this.notify = options.notify || function(alert) {
    console.error("[ALERT]", JSON.stringify(alert));
  };
  this.thresholds = options.thresholds || {
    error_rate_pct: 5,
    daily_cost_usd: 100,
    p95_latency_ms: 30000,
    cost_spike_multiplier: 3
  };
}

AlertChecker.prototype.checkErrorRate = function(callback) {
  var self = this;
  var query = [
    "SELECT",
    "  COUNT(*) AS total,",
    "  COUNT(*) FILTER (WHERE status IN ('error', 'timeout')) AS errors,",
    "  ROUND(COUNT(*) FILTER (WHERE status IN ('error', 'timeout'))::NUMERIC / NULLIF(COUNT(*), 0) * 100, 2) AS error_rate",
    "FROM llm_metrics",
    "WHERE timestamp > NOW() - INTERVAL '15 minutes'"
  ].join(" ");

  self.pool.query(query, function(err, result) {
    if (err) return callback(err);
    var row = result.rows[0];
    if (parseFloat(row.error_rate) > self.thresholds.error_rate_pct && parseInt(row.total) > 10) {
      self.notify({
        type: "high_error_rate",
        severity: "critical",
        error_rate: row.error_rate + "%",
        errors: row.errors,
        total: row.total,
        window: "15 minutes"
      });
    }
    callback(null);
  });
};

AlertChecker.prototype.checkDailyCost = function(callback) {
  var self = this;
  var query = [
    "SELECT COALESCE(SUM(total_cost), 0) AS daily_cost",
    "FROM llm_metrics",
    "WHERE timestamp > DATE_TRUNC('day', NOW())"
  ].join(" ");

  self.pool.query(query, function(err, result) {
    if (err) return callback(err);
    var dailyCost = parseFloat(result.rows[0].daily_cost);
    if (dailyCost > self.thresholds.daily_cost_usd) {
      self.notify({
        type: "daily_cost_exceeded",
        severity: "critical",
        daily_cost: "$" + dailyCost.toFixed(2),
        threshold: "$" + self.thresholds.daily_cost_usd
      });
    }
    callback(null);
  });
};

AlertChecker.prototype.checkCostSpike = function(callback) {
  var self = this;
  var query = [
    "WITH hourly AS (",
    "  SELECT",
    "    DATE_TRUNC('hour', timestamp) AS hour,",
    "    SUM(total_cost) AS cost",
    "  FROM llm_metrics",
    "  WHERE timestamp > NOW() - INTERVAL '25 hours'",
    "  GROUP BY DATE_TRUNC('hour', timestamp)",
    "  ORDER BY hour DESC",
    ")",
    "SELECT",
    "  (SELECT cost FROM hourly ORDER BY hour DESC LIMIT 1) AS current_hour_cost,",
    "  (SELECT AVG(cost) FROM hourly ORDER BY hour DESC OFFSET 1 LIMIT 24) AS avg_hourly_cost"
  ].join(" ");

  self.pool.query(query, function(err, result) {
    if (err) return callback(err);
    var row = result.rows[0];
    var current = parseFloat(row.current_hour_cost) || 0;
    var avg = parseFloat(row.avg_hourly_cost) || 0;
    if (avg > 0 && (current / avg) > self.thresholds.cost_spike_multiplier) {
      self.notify({
        type: "cost_spike",
        severity: "warning",
        current_hour: "$" + current.toFixed(4),
        avg_hourly: "$" + avg.toFixed(4),
        multiplier: (current / avg).toFixed(1) + "x"
      });
    }
    callback(null);
  });
};

AlertChecker.prototype.runAllChecks = function(callback) {
  var self = this;
  var checks = ["checkErrorRate", "checkDailyCost", "checkCostSpike"];
  var completed = 0;
  var errors = [];

  checks.forEach(function(check) {
    self[check](function(err) {
      if (err) errors.push(err);
      completed++;
      if (completed === checks.length) {
        callback(errors.length > 0 ? errors : null);
      }
    });
  });
};

module.exports = AlertChecker;

Run the alert checker every five minutes with a simple interval:

var AlertChecker = require("./alert-checker");

var checker = new AlertChecker({
  thresholds: {
    error_rate_pct: 5,
    daily_cost_usd: 50,
    p95_latency_ms: 25000,
    cost_spike_multiplier: 3
  },
  notify: function(alert) {
    console.error("[LLM ALERT]", alert.type, alert.severity);
    // Send to Slack webhook, PagerDuty, email, etc.
    sendSlackAlert(alert);
  }
});

// Run every 5 minutes
setInterval(function() {
  checker.runAllChecks(function(err) {
    if (err) {
      console.error("[AlertChecker] Some checks failed:", err);
    }
  });
}, 5 * 60 * 1000);
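
The sendSlackAlert call above is left undefined; a minimal sketch using the fetch API built into Node.js 18+ and a Slack incoming-webhook URL (the SLACK_WEBHOOK_URL environment variable is an assumption) could look like this:

function sendSlackAlert(alert) {
  if (!process.env.SLACK_WEBHOOK_URL) return;

  fetch(process.env.SLACK_WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: ":rotating_light: LLM alert: *" + alert.type + "* (" + alert.severity + ")\n```" + JSON.stringify(alert, null, 2) + "```"
    })
  }).catch(function(err) {
    // Never let alert delivery failures crash the checker.
    console.error("[Slack] Failed to send alert:", err.message);
  });
}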

Comparing Metrics Across Model Versions

When you upgrade models or switch providers, you need before-and-after comparisons. Tag your metrics with a model version or deployment identifier, then query for side-by-side analysis.

router.get("/dashboard/model-comparison", function(req, res) {
  var modelA = req.query.model_a;
  var modelB = req.query.model_b;
  var days = parseInt(req.query.days) || 7;

  if (!modelA || !modelB) {
    return res.status(400).json({ error: "Provide model_a and model_b query parameters" });
  }

  var query = [
    "SELECT",
    "  model,",
    "  COUNT(*) AS requests,",
    "  ROUND(AVG(total_ms)) AS avg_latency_ms,",
    "  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_ms) AS p95_latency_ms,",
    "  ROUND(AVG(total_cost)::NUMERIC, 6) AS avg_cost,",
    "  SUM(total_cost) AS total_cost,",
    "  ROUND(AVG(input_tokens)) AS avg_input_tokens,",
    "  ROUND(AVG(output_tokens)) AS avg_output_tokens,",
    "  ROUND(COUNT(*) FILTER (WHERE status = 'error')::NUMERIC / NULLIF(COUNT(*), 0) * 100, 2) AS error_rate_pct",
    "FROM llm_metrics",
    "WHERE model IN ($1, $2)",
    "AND timestamp > NOW() - INTERVAL '" + days + " days'",
    "GROUP BY model",
    "ORDER BY model"
  ].join("\n");

  pool.query(query, [modelA, modelB], function(err, result) {
    if (err) {
      return res.status(500).json({ error: err.message });
    }
    res.json({
      period_days: days,
      models: result.rows
    });
  });
});

This endpoint makes it trivial to evaluate whether a model migration is worth it. Run both models in parallel for a week, then compare cost, latency, and error rates.
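
One way to generate that parallel data without a feature-flag system is a simple randomized traffic split; the 10% fraction and model names below are placeholders:

// Send roughly 10% of traffic to the candidate model; everything else stays on the incumbent.
function pickModel() {
  return Math.random() < 0.10 ? "gpt-4o" : "gpt-4o-mini";
}

// In the route handler, use the chosen model for both the API call and the
// metric record, so /dashboard/model-comparison can split results by model.
var model = pickModel();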

Weekly and Monthly Reporting

Stakeholders do not want dashboards. They want a summary they can read in two minutes. Build a reporting endpoint that generates the numbers they care about.

router.get("/reports/weekly", function(req, res) {
  var query = [
    "WITH current_week AS (",
    "  SELECT * FROM llm_metrics WHERE timestamp > NOW() - INTERVAL '7 days'",
    "),",
    "previous_week AS (",
    "  SELECT * FROM llm_metrics",
    "  WHERE timestamp > NOW() - INTERVAL '14 days'",
    "  AND timestamp <= NOW() - INTERVAL '7 days'",
    ")",
    "SELECT",
    "  (SELECT COUNT(*) FROM current_week) AS requests_this_week,",
    "  (SELECT COUNT(*) FROM previous_week) AS requests_last_week,",
    "  (SELECT ROUND(SUM(total_cost)::NUMERIC, 2) FROM current_week) AS cost_this_week,",
    "  (SELECT ROUND(SUM(total_cost)::NUMERIC, 2) FROM previous_week) AS cost_last_week,",
    "  (SELECT ROUND(AVG(total_ms)) FROM current_week) AS avg_latency_this_week,",
    "  (SELECT ROUND(AVG(total_ms)) FROM previous_week) AS avg_latency_last_week,",
    "  (SELECT ROUND(",
    "    COUNT(*) FILTER (WHERE status = 'error')::NUMERIC / NULLIF(COUNT(*), 0) * 100, 2)",
    "    FROM current_week) AS error_rate_this_week,",
    "  (SELECT ROUND(",
    "    COUNT(*) FILTER (WHERE status = 'error')::NUMERIC / NULLIF(COUNT(*), 0) * 100, 2)",
    "    FROM previous_week) AS error_rate_last_week,",
    "  (SELECT COUNT(DISTINCT user_id) FROM current_week) AS active_users_this_week,",
    "  (SELECT COUNT(DISTINCT user_id) FROM previous_week) AS active_users_last_week"
  ].join("\n");

  pool.query(query, function(err, result) {
    if (err) {
      return res.status(500).json({ error: err.message });
    }
    var r = result.rows[0];
    var costChange = r.cost_last_week > 0
      ? (((r.cost_this_week - r.cost_last_week) / r.cost_last_week) * 100).toFixed(1)
      : "N/A";

    res.json({
      period: "Last 7 days",
      requests: {
        current: parseInt(r.requests_this_week),
        previous: parseInt(r.requests_last_week),
        change_pct: r.requests_last_week > 0
          ? (((r.requests_this_week - r.requests_last_week) / r.requests_last_week) * 100).toFixed(1) + "%"
          : "N/A"
      },
      cost: {
        current: "$" + r.cost_this_week,
        previous: "$" + r.cost_last_week,
        change_pct: costChange + "%"
      },
      latency: {
        current_avg_ms: parseInt(r.avg_latency_this_week),
        previous_avg_ms: parseInt(r.avg_latency_last_week)
      },
      error_rate: {
        current: r.error_rate_this_week + "%",
        previous: r.error_rate_last_week + "%"
      },
      active_users: {
        current: parseInt(r.active_users_this_week),
        previous: parseInt(r.active_users_last_week)
      }
    });
  });
});

Integrating with External Monitoring

You should not build everything from scratch. Use your existing observability stack for alerting and visualization, and push LLM metrics into it.

Datadog Integration

var StatsD = require("hot-shots");
var dogstatsd = new StatsD({ host: "localhost", port: 8125 });

function emitDatadogMetrics(metric) {
  var tags = [
    "model:" + metric.model,
    "feature:" + metric.feature,
    "status:" + metric.status
  ];

  dogstatsd.histogram("llm.latency_ms", metric.total_ms, tags);
  dogstatsd.increment("llm.requests", 1, tags);
  dogstatsd.histogram("llm.cost", metric.total_cost, tags);
  dogstatsd.histogram("llm.input_tokens", metric.input_tokens, tags);
  dogstatsd.histogram("llm.output_tokens", metric.output_tokens, tags);

  if (metric.status === "error") {
    dogstatsd.increment("llm.errors", 1, tags);
  }
  if (metric.ttft_ms) {
    dogstatsd.histogram("llm.ttft_ms", metric.ttft_ms, tags);
  }
}

Grafana via Prometheus

var promClient = require("prom-client");

var llmRequestDuration = new promClient.Histogram({
  name: "llm_request_duration_ms",
  help: "LLM request duration in milliseconds",
  labelNames: ["model", "feature", "status"],
  buckets: [100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000]
});

var llmRequestCost = new promClient.Histogram({
  name: "llm_request_cost_usd",
  help: "LLM request cost in USD",
  labelNames: ["model", "feature"],
  buckets: [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0]
});

var llmRequestTotal = new promClient.Counter({
  name: "llm_requests_total",
  help: "Total LLM requests",
  labelNames: ["model", "feature", "status"]
});

function emitPrometheusMetrics(metric) {
  var labels = {
    model: metric.model,
    feature: metric.feature,
    status: metric.status
  };

  llmRequestDuration.observe(labels, metric.total_ms);
  llmRequestCost.observe({ model: metric.model, feature: metric.feature }, metric.total_cost);
  llmRequestTotal.inc(labels);
}
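
Prometheus still needs an endpoint to scrape. With prom-client that is conventionally /metrics; note that register.metrics() returns a promise in recent prom-client versions (older versions return a string directly):

// Assumes `app` is your Express application and promClient is the module required above.
app.get("/metrics", function(req, res) {
  promClient.register.metrics().then(function(output) {
    res.set("Content-Type", promClient.register.contentType);
    res.end(output);
  }).catch(function(err) {
    res.status(500).end(err.message);
  });
});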

CloudWatch

var AWS = require("aws-sdk");
var cloudwatch = new AWS.CloudWatch({ region: "us-east-1" });

function emitCloudWatchMetrics(metric) {
  var params = {
    Namespace: "LLM/Application",
    MetricData: [
      {
        MetricName: "RequestLatency",
        Value: metric.total_ms,
        Unit: "Milliseconds",
        Dimensions: [
          { Name: "Model", Value: metric.model },
          { Name: "Feature", Value: metric.feature }
        ]
      },
      {
        MetricName: "RequestCost",
        Value: metric.total_cost,
        Unit: "None",
        Dimensions: [
          { Name: "Model", Value: metric.model },
          { Name: "Feature", Value: metric.feature }
        ]
      },
      {
        MetricName: "TokensUsed",
        Value: metric.input_tokens + metric.output_tokens,
        Unit: "Count",
        Dimensions: [
          { Name: "Model", Value: metric.model }
        ]
      }
    ]
  };

  cloudwatch.putMetricData(params, function(err) {
    if (err) {
      console.error("[CloudWatch] Failed to emit metrics:", err.message);
    }
  });
}

Complete Working Example

Here is the full monitoring module that ties everything together. Drop this into your Express application and you have all four metric categories, PostgreSQL storage, dashboard endpoints, and alert checking.

// llm-monitoring.js — Complete LLM Monitoring Module
var crypto = require("crypto");
var express = require("express");
var pg = require("pg");

var router = express.Router();

var pool = new pg.Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING,
  max: 10
});

var MODEL_PRICING = {
  "gpt-4o": { input_per_1k: 0.0025, output_per_1k: 0.01 },
  "gpt-4o-mini": { input_per_1k: 0.00015, output_per_1k: 0.0006 },
  "claude-sonnet-4-20250514": { input_per_1k: 0.003, output_per_1k: 0.015 },
  "claude-haiku-4-20250414": { input_per_1k: 0.0008, output_per_1k: 0.004 }
};

// ---- Cost Calculation ----
function calcCost(model, inTokens, outTokens) {
  var p = MODEL_PRICING[model];
  if (!p) return { input_cost: 0, output_cost: 0, total_cost: 0 };
  var ic = (inTokens / 1000) * p.input_per_1k;
  var oc = (outTokens / 1000) * p.output_per_1k;
  return {
    input_cost: Math.round(ic * 1e6) / 1e6,
    output_cost: Math.round(oc * 1e6) / 1e6,
    total_cost: Math.round((ic + oc) * 1e6) / 1e6
  };
}

// ---- Record a Metric ----
function recordMetric(data, callback) {
  var cost = calcCost(data.model, data.input_tokens || 0, data.output_tokens || 0);
  var sql = [
    "INSERT INTO llm_metrics",
    "(request_id, user_id, feature, model, endpoint,",
    " total_ms, ttft_ms, tokens_per_second,",
    " input_tokens, output_tokens,",
    " input_cost, output_cost, total_cost,",
    " status, error_type, error_message,",
    " retry_count, is_fallback, prompt_hash, metadata)",
    "VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20)",
    "RETURNING request_id"
  ].join(" ");

  var vals = [
    data.request_id || crypto.randomUUID(),
    data.user_id || null,
    data.feature || "unknown",
    data.model,
    data.endpoint || null,
    data.total_ms,
    data.ttft_ms || null,
    data.tokens_per_second || null,
    data.input_tokens || 0,
    data.output_tokens || 0,
    cost.input_cost,
    cost.output_cost,
    cost.total_cost,
    data.status || "success",
    data.error_type || null,
    data.error_message || null,
    data.retry_count || 0,
    data.is_fallback || false,
    data.prompt_hash || null,
    JSON.stringify(data.metadata || {})
  ];

  pool.query(sql, vals, function(err, result) {
    if (err) console.error("[LLMMonitor] Insert failed:", err.message);
    if (callback) callback(err, result ? result.rows[0] : null);
  });
}

// ---- Middleware ----
function metricsMiddleware(feature) {
  return function(req, res, next) {
    var rid = crypto.randomUUID();
    var t0 = Date.now();
    var ttft = null;

    req.llm = {
      requestId: rid,
      markFirstToken: function() { if (!ttft) ttft = Date.now(); },
      done: function(data, cb) {
        data.request_id = rid;
        data.user_id = (req.user && req.user.id) || req.ip;
        data.feature = feature;
        data.endpoint = req.originalUrl;
        data.total_ms = Date.now() - t0;
        data.ttft_ms = ttft ? ttft - t0 : null;
        recordMetric(data, cb);
      }
    };
    next();
  };
}

// ---- Quality Signal ----
router.post("/quality-signal", function(req, res) {
  var b = req.body;
  if (!b.request_id || !b.signal_type) {
    return res.status(400).json({ error: "request_id and signal_type required" });
  }
  var sql = "INSERT INTO llm_quality_signals (request_id, user_id, signal_type, signal_value, comment) VALUES ($1,$2,$3,$4,$5)";
  pool.query(sql, [b.request_id, b.user_id || null, b.signal_type, b.signal_value || 1, b.comment || null], function(err) {
    if (err) return res.status(500).json({ error: err.message });
    res.json({ recorded: true });
  });
});

// ---- Dashboard: Overview ----
router.get("/dashboard/overview", function(req, res) {
  var h = parseInt(req.query.hours) || 24;
  var sql = [
    "SELECT COUNT(*) AS total,",
    "COUNT(*) FILTER (WHERE status='success') AS ok,",
    "COUNT(*) FILTER (WHERE status='error') AS errors,",
    "ROUND(AVG(total_ms)) AS avg_ms,",
    "PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY total_ms) AS p50,",
    "PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_ms) AS p95,",
    "PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY total_ms) AS p99,",
    "SUM(total_cost) AS cost,",
    "SUM(input_tokens) AS in_tokens,",
    "SUM(output_tokens) AS out_tokens",
    "FROM llm_metrics WHERE timestamp > NOW() - INTERVAL '" + h + " hours'"
  ].join(" ");
  pool.query(sql, function(err, r) {
    if (err) return res.status(500).json({ error: err.message });
    res.json(r.rows[0]);
  });
});

// ---- Dashboard: Hourly Costs ----
router.get("/dashboard/hourly-costs", function(req, res) {
  var h = parseInt(req.query.hours) || 48;
  var sql = [
    "SELECT DATE_TRUNC('hour', timestamp) AS hour,",
    "model, COUNT(*) AS reqs, SUM(total_cost) AS cost",
    "FROM llm_metrics WHERE timestamp > NOW() - INTERVAL '" + h + " hours'",
    "GROUP BY 1, 2 ORDER BY 1 DESC"
  ].join(" ");
  pool.query(sql, function(err, r) {
    if (err) return res.status(500).json({ error: err.message });
    res.json(r.rows);
  });
});

// ---- Exports ----
module.exports = {
  router: router,
  middleware: metricsMiddleware,
  recordMetric: recordMetric
};

Register it in your main Express app:

var app = require("express")();
var bodyParser = require("body-parser");
var llmMonitoring = require("./llm-monitoring");

app.use(bodyParser.json());
app.use("/llm", llmMonitoring.router);

// Use the middleware on LLM-powered routes
app.post("/api/chat", llmMonitoring.middleware("chat"), function(req, res) {
  // Your LLM call here
  callLLM(req.body.message, function(err, result) {
    if (err) {
      // result may be undefined on error, so record what we know and bail out.
      req.llm.done({ model: "unknown", status: "error", error_message: err.message });
      return res.status(500).json({ error: "Chat failed" });
    }
    req.llm.done({
      model: result.model,
      input_tokens: result.usage.input_tokens,
      output_tokens: result.usage.output_tokens,
      status: "success"
    });
    res.json({ response: result.text, request_id: req.llm.requestId });
  });
});

app.listen(8080);

Common Issues and Troubleshooting

1. Connection pool exhaustion during high LLM traffic

Error: Cannot acquire connection from pool - timeout
    at Pool._pendingQueue.push

LLM requests are slow, and if your monitoring inserts share a connection pool with your main database queries, long-running LLM calls can hold connections while the monitoring insert waits. Use a dedicated pool for monitoring with a smaller max setting and a connectionTimeoutMillis of around 5000. If inserts are backing up, switch to batched inserts using pg-copy-streams or buffer metrics in memory and flush every 10 seconds.
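
A dedicated monitoring pool along those lines might look like this; the max and timeout values are starting points to tune, not recommendations:

var pg = require("pg");
var LLMMonitor = require("./llm-monitor");

// A separate, smaller pool used only for metric writes, so slow application
// queries and monitoring inserts never compete for the same connections.
var monitoringPool = new pg.Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING,
  max: 5,
  connectionTimeoutMillis: 5000,
  idleTimeoutMillis: 30000
});

var monitor = new LLMMonitor({ pool: monitoringPool });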

2. Token counts missing from API responses

TypeError: Cannot read properties of undefined (reading 'prompt_tokens')

Some LLM providers do not return usage data when streaming is enabled. OpenAI requires stream_options: { include_usage: true } in the request body for streaming completions. Anthropic reports usage on the streaming events themselves (input tokens on message_start, output tokens on message_delta) rather than in a single final object. Always guard against missing usage data with fallback defaults and log when token counts are unavailable so you can track the gap.

3. Cost calculation drift from actual billing

[WARN] Calculated monthly cost: $847.32, Actual invoice: $912.18 (7.6% drift)

Your local pricing table will drift from the provider's actual pricing. Model pricing changes, cached prompt discounts apply, and batch API pricing differs from real-time pricing. Reconcile your calculated costs against actual invoices monthly. Store the pricing version in your metadata so you can retroactively recalculate when pricing changes. Add a pricing_version field to your metrics table.

4. Dashboard queries timing out on large datasets

ERROR: canceling statement due to statement timeout
CONTEXT: SQL statement "SELECT ... FROM llm_metrics WHERE timestamp > ..."

Once you have millions of metric rows, aggregate queries over 30-day windows will get slow. Add a materialized view that pre-aggregates hourly metrics and refresh it on a schedule. For real-time dashboards, query only the last few hours from the raw table and use the materialized view for historical trends.

CREATE MATERIALIZED VIEW llm_metrics_hourly AS
SELECT
  DATE_TRUNC('hour', timestamp) AS hour,
  model,
  feature,
  COUNT(*) AS requests,
  ROUND(AVG(total_ms)) AS avg_ms,
  SUM(total_cost) AS cost,
  SUM(input_tokens) AS in_tokens,
  SUM(output_tokens) AS out_tokens,
  COUNT(*) FILTER (WHERE status = 'error') AS errors
FROM llm_metrics
GROUP BY 1, 2, 3;

CREATE UNIQUE INDEX idx_metrics_hourly ON llm_metrics_hourly (hour, model, feature);

Refresh it every hour: REFRESH MATERIALIZED VIEW CONCURRENTLY llm_metrics_hourly;
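
If you do not have a job scheduler handy, a plain interval in the Node process works, assuming the same pg pool used by the dashboard routes:

// Refresh the hourly rollup once an hour. CONCURRENTLY requires the unique
// index created above and keeps dashboard reads available during the refresh.
setInterval(function() {
  pool.query("REFRESH MATERIALIZED VIEW CONCURRENTLY llm_metrics_hourly", function(err) {
    if (err) console.error("[Rollup] Refresh failed:", err.message);
  });
}, 60 * 60 * 1000);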

5. Alert fatigue from noisy thresholds

When you first deploy alerting, you will get flooded with notifications because your thresholds are based on guesses, not data. Run the monitoring for two weeks in observation mode (log alerts but do not notify). Then set your thresholds at 2x the observed p95 for latency and 3x the average hourly cost for cost spikes. Revisit thresholds quarterly as usage patterns change.

Best Practices

  • Always record metrics asynchronously. Never let a monitoring insert block your API response. Use fire-and-forget patterns or buffer metrics in memory and flush periodically. A monitoring failure should never cause a user-facing error.

  • Include request IDs in every LLM response. Return the request_id to the client so users can reference it in bug reports and you can correlate quality signals back to specific LLM calls. This is your single most valuable debugging tool.

  • Separate monitoring pools from application pools. LLM calls tie up database connections for seconds at a time. If monitoring and application queries share a pool, a burst of slow LLM calls can starve your regular database operations. Two pools, two connection limits.

  • Hash your prompts, do not store them. Full prompts can contain PII, proprietary data, and sensitive context. Store a SHA-256 hash of the prompt for deduplication and analysis. If you need full prompt logging for debugging, store it in a separate encrypted table with strict access controls and automatic expiration. A hashing sketch follows this list.

  • Set cost circuit breakers at multiple levels. Implement per-request cost limits (reject prompts that will obviously cost too much), per-user hourly limits (prevent abuse), per-feature daily limits (contain runaway features), and a global daily kill switch. The cost of building these controls is trivial compared to a single $5,000 overnight surprise.

  • Track token efficiency, not just token volume. A feature that uses 10,000 tokens to produce a one-sentence answer has a problem. Monitor the ratio of output tokens to input tokens and the ratio of useful output to total output. These efficiency metrics often reveal optimization opportunities that raw volume numbers miss.

  • Version your monitoring schema from day one. Add a schema_version field to your metrics table. When you add new columns or change calculations, increment the version. This lets you filter dashboards to compare like with like and avoids misleading trend lines when your measurement methodology changes.

  • Retain raw metrics for at least 90 days. You will need historical data for model comparison, cost forecasting, and incident investigation. After 90 days, roll up to hourly aggregates. After a year, roll up to daily. Storage is cheap compared to the insights you lose by deleting data too early.
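
For the prompt-hashing practice above, a small helper is all it takes; the light normalization shown here is one option, not a requirement:

var crypto = require("crypto");

function hashPrompt(prompt) {
  // Trim so trivial whitespace differences do not defeat deduplication.
  return crypto.createHash("sha256").update(String(prompt).trim(), "utf8").digest("hex");
}

// Pass the result as prompt_hash when recording a metric, for example:
// req.llmMetrics.record({ model: "gpt-4o-mini", prompt_hash: hashPrompt(fullPrompt), ... });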
