Logging and Observability for LLM Calls
Build comprehensive logging for LLM calls with structured output, PII redaction, tracing, and searchable log storage in Node.js.
When you ship LLM-powered features to production, you quickly discover that traditional application logging is not enough. LLM calls are expensive, non-deterministic, and slow compared to typical API calls — which means you need specialized logging that captures token usage, latency, cost, prompt content, and model behavior in a structured, searchable format. This article walks through building a production-grade logging and observability layer for LLM calls in Node.js, covering everything from structured JSON output and PII redaction to OpenTelemetry tracing and log-based alerting.
Prerequisites
- Node.js v18 or later
- Working knowledge of Express.js and middleware patterns
- Basic familiarity with the OpenAI SDK or similar LLM client
- PostgreSQL 14+ (for queryable log storage)
- A general understanding of structured logging concepts
Install the dependencies you will need:
npm install winston winston-daily-rotate-file node-cron openai uuid pg express @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/sdk-trace-base @opentelemetry/exporter-trace-otlp-http
What to Log for LLM Calls
Standard application logs capture request/response cycles, but LLM calls generate metadata that is critical for debugging, cost management, and compliance. Here is the minimum set of fields I log for every LLM interaction:
| Field | Why It Matters |
|---|---|
| correlation_id | Ties the LLM call to the originating user request |
| model | Model drift detection; cost differs per model |
| prompt_tokens | Input cost tracking |
| completion_tokens | Output cost tracking |
| total_tokens | Budget enforcement |
| latency_ms | SLA monitoring, timeout tuning |
| estimated_cost | Real-time spend visibility |
| temperature | Reproducibility; explains output variance |
| max_tokens | Helps debug truncated responses |
| status | success, error, timeout, rate_limited |
| error_message | The actual failure reason from the provider |
| prompt_hash | Deduplication without storing raw prompts |
| response_length | Quick indicator of response quality |
Missing any of these fields means you are flying blind. I have debugged production issues where the root cause was a model parameter change that only showed up because we logged temperature and max_tokens consistently.
Structured Logging Format
Unstructured log lines like "Called GPT-4, got response in 2.3s" are useless at scale. You need JSON with consistent field names that tools can parse, index, and query.
Here is the log schema I use:
var llmLogEntry = {
timestamp: new Date().toISOString(),
level: "info",
service: "llm-gateway",
correlation_id: "req-abc123",
trace_id: "trace-xyz789",
span_id: "span-001",
event: "llm.completion",
model: "gpt-4o",
provider: "openai",
prompt_tokens: 1240,
completion_tokens: 356,
total_tokens: 1596,
latency_ms: 2340,
estimated_cost_usd: 0.0247,
temperature: 0.7,
max_tokens: 1024,
status: "success",
prompt_hash: "sha256:a1b2c3d4...",
response_length: 1423,
user_id: "user-redacted",
endpoint: "/api/summarize",
error: null
};
Every field is present on every log entry — even if null. This makes downstream parsing predictable and prevents schema mismatches in your log aggregator.
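A small helper keeps that guarantee in one place: merge whatever you captured into a template that carries every field. This is a sketch; the field list mirrors the call-specific portion of the schema above (Winston adds timestamp, level, and service itself):
var LLM_LOG_TEMPLATE = {
  correlation_id: null, trace_id: null, span_id: null, event: "llm.completion",
  model: null, provider: null, prompt_tokens: null, completion_tokens: null,
  total_tokens: null, latency_ms: null, estimated_cost_usd: null, temperature: null,
  max_tokens: null, status: null, prompt_hash: null, response_length: null,
  user_id: null, endpoint: null, error: null
};
// Copy only defined values so missing fields come through as null, never undefined
function normalizeLogEntry(partial) {
  var entry = Object.assign({}, LLM_LOG_TEMPLATE);
  Object.keys(partial || {}).forEach(function(key) {
    if (partial[key] !== undefined) entry[key] = partial[key];
  });
  return entry;
}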
Implementing a Logging Wrapper
The core idea is to wrap your LLM client so that every call automatically captures metadata without the caller needing to think about it.
var { createHash } = require("crypto");
var { v4: uuidv4 } = require("uuid");
var winston = require("winston");
var logger = winston.createLogger({
level: process.env.LOG_LEVEL || "info",
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
defaultMeta: { service: "llm-gateway" },
transports: [
new winston.transports.Console(),
new winston.transports.File({ filename: "logs/llm-calls.log", maxsize: 52428800, maxFiles: 10 })
]
});
// Cost per 1K tokens (update as pricing changes)
var COST_TABLE = {
"gpt-4o": { input: 0.005, output: 0.015 },
"gpt-4o-mini": { input: 0.00015, output: 0.0006 },
"gpt-4-turbo": { input: 0.01, output: 0.03 },
"claude-3-5-sonnet": { input: 0.003, output: 0.015 }
};
function estimateCost(model, promptTokens, completionTokens) {
var pricing = COST_TABLE[model];
if (!pricing) return null;
return ((promptTokens / 1000) * pricing.input) + ((completionTokens / 1000) * pricing.output);
}
function hashPrompt(prompt) {
return "sha256:" + createHash("sha256").update(prompt).digest("hex").substring(0, 16);
}
function createLLMLogger(options) {
var openai = options.client;
var defaultModel = options.model || "gpt-4o";
function chatCompletion(params, context) {
var correlationId = (context && context.correlationId) || uuidv4();
var startTime = Date.now();
var model = params.model || defaultModel;
var promptText = JSON.stringify(params.messages);
var promptHash = hashPrompt(promptText);
logger.debug("llm.request.start", {
correlation_id: correlationId,
model: model,
prompt_hash: promptHash,
prompt_tokens_estimate: Math.ceil(promptText.length / 4),
temperature: params.temperature,
max_tokens: params.max_tokens
});
return openai.chat.completions.create(params)
.then(function(response) {
var latencyMs = Date.now() - startTime;
var usage = response.usage || {};
var cost = estimateCost(model, usage.prompt_tokens, usage.completion_tokens);
var responseText = response.choices[0].message.content || "";
logger.info("llm.completion.success", {
correlation_id: correlationId,
event: "llm.completion",
model: model,
provider: "openai",
prompt_tokens: usage.prompt_tokens,
completion_tokens: usage.completion_tokens,
total_tokens: usage.total_tokens,
latency_ms: latencyMs,
estimated_cost_usd: cost,
temperature: params.temperature,
max_tokens: params.max_tokens,
status: "success",
prompt_hash: promptHash,
response_length: responseText.length,
finish_reason: response.choices[0].finish_reason
});
return response;
})
.catch(function(error) {
var latencyMs = Date.now() - startTime;
var status = "error";
if (error.status === 429) status = "rate_limited";
if (error.code === "ETIMEDOUT" || error.code === "ECONNABORTED") status = "timeout";
logger.error("llm.completion.error", {
correlation_id: correlationId,
event: "llm.completion",
model: model,
provider: "openai",
latency_ms: latencyMs,
status: status,
error_message: error.message,
error_code: error.status || error.code,
prompt_hash: promptHash
});
throw error;
});
}
return {
chatCompletion: chatCompletion
};
}
module.exports = { createLLMLogger: createLLMLogger, logger: logger };
This wrapper gives you consistent, structured logs for every LLM call without polluting your application code.
Log Levels for AI Operations
I use a specific log level strategy for LLM operations that differs from typical web application logging:
- DEBUG: Full prompt and response content. Never enable in production unless actively debugging. These logs are enormous and may contain sensitive data.
- INFO: Every completed LLM call with metadata (tokens, latency, cost, model). This is your primary operational log.
- WARN: Degraded behavior — retries triggered, fallback models used, rate limit approached (80% of quota), responses truncated by max_tokens.
- ERROR: Failed calls — API errors, timeouts, invalid responses, content filter blocks.
// Debug level: full prompt (development only)
logger.debug("llm.prompt.full", {
correlation_id: correlationId,
messages: params.messages // NEVER in production
});
// Info level: standard completion log
logger.info("llm.completion.success", {
correlation_id: correlationId,
model: model,
total_tokens: usage.total_tokens,
latency_ms: latencyMs
});
// Warn level: degradation signals
logger.warn("llm.fallback.triggered", {
correlation_id: correlationId,
original_model: "gpt-4o",
fallback_model: "gpt-4o-mini",
reason: "rate_limit_approaching",
quota_usage_pct: 85
});
// Error level: failures
logger.error("llm.completion.error", {
correlation_id: correlationId,
error_message: "Request too large for gpt-4o-mini",
error_code: 400,
prompt_tokens_estimate: 142000
});
PII Handling in LLM Logs
This is the part most teams get wrong. Logging raw prompts means logging user data — names, emails, addresses, medical information, whatever your users type into your product. You need a redaction layer between the raw input and the log output.
var PII_PATTERNS = [
{ pattern: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g, replacement: "[EMAIL_REDACTED]" },
{ pattern: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, replacement: "[PHONE_REDACTED]" },
{ pattern: /\b\d{3}-\d{2}-\d{4}\b/g, replacement: "[SSN_REDACTED]" },
{ pattern: /\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b/g, replacement: "[CARD_REDACTED]" },
{ pattern: /\b\d{1,5}\s+\w+\s+(street|st|avenue|ave|road|rd|boulevard|blvd|drive|dr|lane|ln)\b/gi, replacement: "[ADDRESS_REDACTED]" }
];
function redactPII(text) {
if (typeof text !== "string") return text;
var redacted = text;
PII_PATTERNS.forEach(function(rule) {
redacted = redacted.replace(rule.pattern, rule.replacement);
});
return redacted;
}
function redactMessages(messages) {
return messages.map(function(msg) {
return {
role: msg.role,
content: redactPII(msg.content)
};
});
}
// Usage: log redacted prompt at debug level
logger.debug("llm.prompt.redacted", {
correlation_id: correlationId,
messages: redactMessages(params.messages)
});
This is a baseline. For regulated industries (healthcare, finance), you should also consider tokenizing user identifiers and storing the mapping in a separate, access-controlled system.
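As a sketch of that approach, user identifiers can be replaced with a keyed HMAC before they ever reach the logger; the PII_TOKEN_SECRET environment variable here is an assumption, not part of the schema above:
var { createHmac } = require("crypto");
// Deterministic pseudonym: the same user always maps to the same token,
// but the raw identifier never appears in log output. Keep the
// token-to-identifier mapping in a separate, access-controlled table.
function tokenizeUserId(userId) {
  if (!userId) return null;
  return "usr_" + createHmac("sha256", process.env.PII_TOKEN_SECRET)
    .update(String(userId))
    .digest("hex")
    .substring(0, 20);
}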
Request Correlation IDs
Correlation IDs tie a single user request to every downstream operation — database queries, cache lookups, LLM calls, and external API calls. Without them, debugging a production issue involving an LLM call is nearly impossible.
var { v4: uuidv4 } = require("uuid");
function correlationMiddleware(req, res, next) {
req.correlationId = req.headers["x-correlation-id"] || uuidv4();
res.setHeader("x-correlation-id", req.correlationId);
next();
}
// Pass correlation ID to every LLM call
app.post("/api/summarize", function(req, res) {
var context = { correlationId: req.correlationId };
llmClient.chatCompletion({
model: "gpt-4o",
messages: [
{ role: "system", content: "Summarize the following text." },
{ role: "user", content: req.body.text }
],
temperature: 0.3,
max_tokens: 512
}, context)
.then(function(response) {
res.json({ summary: response.choices[0].message.content });
})
.catch(function(error) {
res.status(500).json({ error: "Summarization failed" });
});
});
When a user reports an issue, you look up the correlation ID from the response headers, and you can trace the entire lifecycle of that request across every service and log entry.
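Once the logs land in PostgreSQL (covered in the next section), that lookup is a single query, for example using the correlation ID from the schema sample earlier:
SELECT timestamp, event, model, status, latency_ms, total_tokens, estimated_cost_usd
FROM llm_logs
WHERE correlation_id = 'req-abc123'
ORDER BY timestamp;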
Log Storage Strategies
You have two main options, and I recommend using both:
PostgreSQL for Queryable Logs
Store structured LLM logs in PostgreSQL when you need to query, aggregate, and report on them. This is essential for cost tracking, model performance analysis, and compliance audits.
CREATE TABLE llm_logs (
id SERIAL PRIMARY KEY,
correlation_id VARCHAR(64) NOT NULL,
trace_id VARCHAR(64),
timestamp TIMESTAMPTZ DEFAULT NOW(),
level VARCHAR(10) NOT NULL,
event VARCHAR(64) NOT NULL,
model VARCHAR(64),
provider VARCHAR(32),
prompt_tokens INTEGER,
completion_tokens INTEGER,
total_tokens INTEGER,
latency_ms INTEGER,
estimated_cost_usd NUMERIC(10, 6),
temperature NUMERIC(3, 2),
max_tokens INTEGER,
status VARCHAR(20),
error_message TEXT,
prompt_hash VARCHAR(80),
response_length INTEGER,
finish_reason VARCHAR(20),
user_id VARCHAR(64),
endpoint VARCHAR(128),
metadata JSONB DEFAULT '{}'
);
CREATE INDEX idx_llm_logs_correlation ON llm_logs(correlation_id);
CREATE INDEX idx_llm_logs_timestamp ON llm_logs(timestamp);
CREATE INDEX idx_llm_logs_model ON llm_logs(model);
CREATE INDEX idx_llm_logs_status ON llm_logs(status);
CREATE INDEX idx_llm_logs_user ON llm_logs(user_id);
CREATE INDEX idx_llm_logs_cost ON llm_logs(estimated_cost_usd);
var { Pool } = require("pg");
var pool = new Pool({
connectionString: process.env.POSTGRES_CONNECTION_STRING
});
function storeLLMLog(entry) {
var sql = "INSERT INTO llm_logs (correlation_id, trace_id, level, event, model, provider, " +
"prompt_tokens, completion_tokens, total_tokens, latency_ms, estimated_cost_usd, " +
"temperature, max_tokens, status, error_message, prompt_hash, response_length, " +
"finish_reason, user_id, endpoint, metadata) " +
"VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21)";
var values = [
entry.correlation_id, entry.trace_id, entry.level, entry.event,
entry.model, entry.provider, entry.prompt_tokens, entry.completion_tokens,
entry.total_tokens, entry.latency_ms, entry.estimated_cost_usd,
entry.temperature, entry.max_tokens, entry.status, entry.error_message,
entry.prompt_hash, entry.response_length, entry.finish_reason,
entry.user_id, entry.endpoint, JSON.stringify(entry.metadata || {})
];
return pool.query(sql, values).catch(function(err) {
// Log storage should never crash the application
console.error("Failed to store LLM log:", err.message);
});
}
File Rotation for Volume
For high-volume environments, also write to rotating log files. Winston handles this natively:
var winston = require("winston");
require("winston-daily-rotate-file");
var rotatingTransport = new winston.transports.DailyRotateFile({
filename: "logs/llm-%DATE%.log",
datePattern: "YYYY-MM-DD",
maxSize: "100m",
maxFiles: "30d",
zippedArchive: true
});
Use file logs as your safety net. If PostgreSQL goes down, you still have the file logs. If a query is too expensive to run on the database, you can grep the files.
Distributed Tracing with OpenTelemetry
Structured logs tell you what happened. Traces tell you how long each step took and how they relate to each other. For LLM calls that involve prompt construction, retrieval-augmented generation, and post-processing, tracing is invaluable.
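The tracer calls below assume the OpenTelemetry SDK was initialized once at process startup. Here is a minimal bootstrap sketch using the packages from the install step; the option names reflect recent SDK versions and the OTLP endpoint is an assumption, so verify both against the versions you install:
// tracing.js: require this before the rest of the app loads
var { NodeSDK } = require("@opentelemetry/sdk-node");
var { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
var sdk = new NodeSDK({
  serviceName: "llm-gateway",
  traceExporter: new OTLPTraceExporter({
    // Assumed collector endpoint; point this at your OTLP receiver
    url: process.env.OTLP_TRACES_URL || "http://localhost:4318/v1/traces"
  })
});
sdk.start();
// Flush any buffered spans on shutdown
process.on("SIGTERM", function() {
  sdk.shutdown().finally(function() { process.exit(0); });
});
With the SDK running, wrap each LLM call in a span: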
var opentelemetry = require("@opentelemetry/api");
var tracer = opentelemetry.trace.getTracer("llm-gateway");
function tracedChatCompletion(llmClient, params, context) {
return tracer.startActiveSpan("llm.chat.completion", function(span) {
var correlationId = (context && context.correlationId) || "unknown";
span.setAttribute("llm.model", params.model);
span.setAttribute("llm.temperature", params.temperature || 1.0);
span.setAttribute("llm.max_tokens", params.max_tokens || 0);
span.setAttribute("correlation.id", correlationId);
return llmClient.chatCompletion(params, context)
.then(function(response) {
var usage = response.usage || {};
span.setAttribute("llm.prompt_tokens", usage.prompt_tokens);
span.setAttribute("llm.completion_tokens", usage.completion_tokens);
span.setAttribute("llm.total_tokens", usage.total_tokens);
span.setAttribute("llm.finish_reason", response.choices[0].finish_reason);
span.setStatus({ code: opentelemetry.SpanStatusCode.OK });
span.end();
return response;
})
.catch(function(error) {
span.setStatus({
code: opentelemetry.SpanStatusCode.ERROR,
message: error.message
});
span.recordException(error);
span.end();
throw error;
});
});
}
For a RAG pipeline, create child spans for each stage:
function ragPipeline(query, context) {
return tracer.startActiveSpan("rag.pipeline", function(parentSpan) {
return tracer.startActiveSpan("rag.embed_query", function(embedSpan) {
return embedQuery(query).then(function(embedding) {
embedSpan.end();
return tracer.startActiveSpan("rag.vector_search", function(searchSpan) {
return vectorSearch(embedding).then(function(documents) {
searchSpan.setAttribute("rag.documents_found", documents.length);
searchSpan.end();
return tracer.startActiveSpan("rag.llm_completion", function(llmSpan) {
var prompt = buildPromptWithContext(query, documents);
return tracedChatCompletion(llmClient, {
model: "gpt-4o",
messages: prompt,
temperature: 0.3
}, context).then(function(response) {
llmSpan.end();
parentSpan.end();
return response;
});
});
});
});
});
});
});
}
This gives you a waterfall view in Jaeger or your tracing backend: you can see that the embedding took 120ms, the vector search took 45ms, and the LLM call took 2.8 seconds — and the LLM call is the bottleneck you need to optimize.
Building Log-Based Dashboards
With structured logs in PostgreSQL, building operational dashboards is straightforward SQL:
-- Daily cost by model
SELECT
DATE(timestamp) AS day,
model,
COUNT(*) AS call_count,
SUM(estimated_cost_usd) AS total_cost,
AVG(latency_ms) AS avg_latency,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_latency,
SUM(total_tokens) AS total_tokens
FROM llm_logs
WHERE timestamp > NOW() - INTERVAL '30 days'
GROUP BY DATE(timestamp), model
ORDER BY day DESC, total_cost DESC;
-- Error rate by endpoint over the last 24 hours
SELECT
endpoint,
COUNT(*) AS total_calls,
COUNT(*) FILTER (WHERE status = 'error') AS errors,
ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'error') / COUNT(*), 2) AS error_rate_pct
FROM llm_logs
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY endpoint
ORDER BY error_rate_pct DESC;
-- Hourly token consumption
SELECT
DATE_TRUNC('hour', timestamp) AS hour,
SUM(prompt_tokens) AS total_prompt_tokens,
SUM(completion_tokens) AS total_completion_tokens,
SUM(estimated_cost_usd) AS hourly_cost
FROM llm_logs
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY DATE_TRUNC('hour', timestamp)
ORDER BY hour DESC;
Feed these queries into Grafana, Metabase, or even a simple Express endpoint that returns the data as JSON. The important thing is that someone on your team is looking at these numbers daily.
Log-Based Alerting
Set up alerts for the conditions that actually matter in LLM operations:
// Check for anomalies on a schedule (e.g., every 5 minutes via node-cron)
var cron = require("node-cron");
function checkAlerts() {
// Error rate alert
var errorQuery = "SELECT COUNT(*) FILTER (WHERE status = 'error') AS errors, " +
"COUNT(*) AS total FROM llm_logs WHERE timestamp > NOW() - INTERVAL '5 minutes'";
pool.query(errorQuery).then(function(result) {
var row = result.rows[0];
if (row.total > 10 && (row.errors / row.total) > 0.1) {
sendAlert("LLM Error Rate Alert", "Error rate is " +
Math.round((row.errors / row.total) * 100) + "% in the last 5 minutes. " +
row.errors + " errors out of " + row.total + " calls.");
}
});
// Latency spike alert
var latencyQuery = "SELECT AVG(latency_ms) AS avg_latency, " +
"PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95 " +
"FROM llm_logs WHERE timestamp > NOW() - INTERVAL '5 minutes'";
pool.query(latencyQuery).then(function(result) {
var row = result.rows[0];
if (row.p95 > 10000) {
sendAlert("LLM Latency Alert", "P95 latency is " + row.p95 + "ms in the last 5 minutes.");
}
});
// Cost anomaly alert
var costQuery = "SELECT SUM(estimated_cost_usd) AS cost_5min FROM llm_logs " +
"WHERE timestamp > NOW() - INTERVAL '5 minutes'";
pool.query(costQuery).then(function(result) {
var cost = parseFloat(result.rows[0].cost_5min) || 0; // SUM() comes back as a string from node-postgres
var threshold = parseFloat(process.env.LLM_COST_ALERT_THRESHOLD || "5.00");
if (cost > threshold) {
sendAlert("LLM Cost Alert", "Spent $" + cost.toFixed(2) + " in the last 5 minutes. " +
"Threshold: $" + threshold.toFixed(2));
}
});
}
function sendAlert(subject, message) {
// Replace with your alerting mechanism: Slack webhook, PagerDuty, email, etc.
logger.error("ALERT: " + subject, { alert: true, message: message });
}
cron.schedule("*/5 * * * *", checkAlerts);
The three alerts I never skip: error rate above 10%, P95 latency above 10 seconds, and cost exceeding the 5-minute budget. Everything else is secondary.
Searching and Filtering LLM Logs
Build a search API so your team can debug LLM issues without direct database access:
function searchLogs(filters) {
var conditions = ["1=1"];
var values = [];
var paramIndex = 1;
if (filters.correlation_id) {
conditions.push("correlation_id = $" + paramIndex++);
values.push(filters.correlation_id);
}
if (filters.model) {
conditions.push("model = $" + paramIndex++);
values.push(filters.model);
}
if (filters.status) {
conditions.push("status = $" + paramIndex++);
values.push(filters.status);
}
if (filters.min_latency) {
conditions.push("latency_ms >= $" + paramIndex++);
values.push(parseInt(filters.min_latency));
}
if (filters.min_cost) {
conditions.push("estimated_cost_usd >= $" + paramIndex++);
values.push(parseFloat(filters.min_cost));
}
if (filters.start_date) {
conditions.push("timestamp >= $" + paramIndex++);
values.push(filters.start_date);
}
if (filters.end_date) {
conditions.push("timestamp <= $" + paramIndex++);
values.push(filters.end_date);
}
if (filters.endpoint) {
conditions.push("endpoint = $" + paramIndex++);
values.push(filters.endpoint);
}
var sql = "SELECT * FROM llm_logs WHERE " + conditions.join(" AND ") +
" ORDER BY timestamp DESC LIMIT " + (parseInt(filters.limit) || 100);
return pool.query(sql, values);
}
Log Retention Policies and Cost Management
LLM logs are large. A single entry with full metadata runs 500-1000 bytes. At 10,000 LLM calls per day, that is 5-10 MB daily. Manageable. At 1 million calls per day, it is 500 MB to 1 GB daily. You need a retention strategy.
-- Partition by month for easy retention management.
-- Note: a primary key on a partitioned table must include the partition
-- column, so the key becomes (id, timestamp) instead of id alone.
CREATE TABLE llm_logs_partitioned (
    LIKE llm_logs INCLUDING DEFAULTS,
    PRIMARY KEY (id, timestamp)
) PARTITION BY RANGE (timestamp);
CREATE TABLE llm_logs_2026_01 PARTITION OF llm_logs_partitioned
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE llm_logs_2026_02 PARTITION OF llm_logs_partitioned
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
-- Drop old partitions instead of DELETE (instant, no vacuum)
-- DROP TABLE llm_logs_2025_10;
My retention tiers:
- Hot (0-7 days): Full detail in PostgreSQL. Used for active debugging and real-time dashboards.
- Warm (7-90 days): Aggregated daily summaries in PostgreSQL. Individual logs moved to compressed files.
- Cold (90-365 days): Compressed log files on object storage (S3/DigitalOcean Spaces). Required for compliance.
- Archive (1+ year): Monthly cost summaries only. Delete raw logs per your data retention policy.
// Automated aggregation job — run daily
function aggregateDailyLogs() {
var sql = "INSERT INTO llm_logs_daily_summary " +
"(day, model, endpoint, call_count, error_count, total_tokens, total_cost, " +
"avg_latency, p95_latency, p99_latency) " +
"SELECT DATE(timestamp), model, endpoint, COUNT(*), " +
"COUNT(*) FILTER (WHERE status = 'error'), " +
"SUM(total_tokens), SUM(estimated_cost_usd), " +
"AVG(latency_ms), " +
"PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms), " +
"PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) " +
"FROM llm_logs WHERE DATE(timestamp) = CURRENT_DATE - INTERVAL '1 day' " +
"GROUP BY DATE(timestamp), model, endpoint " +
"ON CONFLICT (day, model, endpoint) DO NOTHING";
return pool.query(sql);
}
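The job above assumes a summary table roughly like this; it is a sketch with types inferred from the SELECT list, and the unique constraint is what makes ON CONFLICT work:
CREATE TABLE llm_logs_daily_summary (
    day DATE NOT NULL,
    model VARCHAR(64),
    endpoint VARCHAR(128),
    call_count INTEGER,
    error_count INTEGER,
    total_tokens BIGINT,
    total_cost NUMERIC(12, 6),
    avg_latency NUMERIC(10, 2),
    p95_latency NUMERIC(10, 2),
    p99_latency NUMERIC(10, 2),
    UNIQUE (day, model, endpoint)
);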
Compliance Logging and Audit Trails
If your LLM makes decisions that affect users — content moderation, loan approvals, hiring recommendations — you need an audit trail that proves what the model saw, what it decided, and why.
function logAuditableDecision(params) {
var auditEntry = {
correlation_id: params.correlationId,
decision_type: params.decisionType, // "content_moderation", "risk_assessment"
model: params.model,
model_version: params.modelVersion, // Snapshot which version was used
input_hash: hashPrompt(JSON.stringify(params.input)),
decision: params.decision, // The actual output/decision
confidence: params.confidence,
prompt_template_version: params.templateVersion, // Track prompt changes
human_override: false,
timestamp: new Date().toISOString(),
retention_days: 2555 // 7 years for financial compliance
};
// Store in a separate, immutable audit table
var sql = "INSERT INTO llm_audit_log (correlation_id, decision_type, model, " +
"model_version, input_hash, decision, confidence, prompt_template_version, " +
"human_override, retention_until) VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9, " +
"NOW() + ($10 || ' days')::INTERVAL)";
return pool.query(sql, [
auditEntry.correlation_id, auditEntry.decision_type, auditEntry.model,
auditEntry.model_version, auditEntry.input_hash, JSON.stringify(auditEntry.decision),
auditEntry.confidence, auditEntry.prompt_template_version,
auditEntry.human_override, auditEntry.retention_days
]);
}
The key principle: log enough to reproduce the decision, but not so much that you violate data minimization requirements. Hash the inputs, store the outputs, track the model and prompt versions.
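For completeness, here is a schema sketch for the audit table the insert above targets; the column names mirror the INSERT, and the types are assumptions to adapt to your own compliance regime:
CREATE TABLE llm_audit_log (
    id BIGSERIAL PRIMARY KEY,
    correlation_id VARCHAR(64) NOT NULL,
    decision_type VARCHAR(64) NOT NULL,
    model VARCHAR(64),
    model_version VARCHAR(64),
    input_hash VARCHAR(80),
    decision JSONB,
    confidence NUMERIC(5, 4),
    prompt_template_version VARCHAR(32),
    human_override BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    retention_until TIMESTAMPTZ
);
-- Keep this table append-only: grant INSERT and SELECT to the application role, never UPDATE or DELETE.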
Exporting Logs to External Systems
Not every team runs their own ELK stack. Here are the most common export patterns:
// Export to Elasticsearch / OpenSearch
var { Client } = require("@elastic/elasticsearch");
var esClient = new Client({ node: process.env.ELASTICSEARCH_URL });
function exportToElasticsearch(logEntry) {
return esClient.index({
index: "llm-logs-" + new Date().toISOString().substring(0, 7),
body: logEntry
});
}
// Export to Datadog via HTTP API
var https = require("https");
function exportToDatadog(logEntry) {
var payload = JSON.stringify({
ddsource: "llm-gateway",
ddtags: "model:" + logEntry.model + ",status:" + logEntry.status,
hostname: require("os").hostname(),
message: JSON.stringify(logEntry),
service: "llm-gateway"
});
var options = {
hostname: "http-intake.logs.datadoghq.com",
path: "/api/v2/logs",
method: "POST",
headers: {
"Content-Type": "application/json",
"DD-API-KEY": process.env.DD_API_KEY
}
};
var req = https.request(options);
req.write(payload);
req.end();
}
// Winston transport for CloudWatch (via winston-cloudwatch)
var WinstonCloudWatch = require("winston-cloudwatch");
logger.add(new WinstonCloudWatch({
logGroupName: "llm-gateway",
logStreamName: function() {
var date = new Date().toISOString().split("T")[0];
return "llm-calls-" + date;
},
awsRegion: process.env.AWS_REGION || "us-east-1",
jsonMessage: true
}));
Complete Working Example
Here is a self-contained Node.js module that ties together everything discussed above — structured logging, PII redaction, correlation IDs, PostgreSQL storage, and a search API.
// llm-observability.js
var express = require("express");
var { Pool } = require("pg");
var { createHash } = require("crypto");
var { v4: uuidv4 } = require("uuid");
var winston = require("winston");
var OpenAI = require("openai");
// ---- Configuration ----
var pool = new Pool({ connectionString: process.env.POSTGRES_CONNECTION_STRING });
var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
var COST_TABLE = {
"gpt-4o": { input: 0.005, output: 0.015 },
"gpt-4o-mini": { input: 0.00015, output: 0.0006 }
};
// ---- Logger Setup ----
var logger = winston.createLogger({
level: process.env.LOG_LEVEL || "info",
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
defaultMeta: { service: "llm-gateway" },
transports: [
new winston.transports.Console(),
new winston.transports.File({
filename: "logs/llm-calls.log",
maxsize: 52428800,
maxFiles: 10
})
]
});
// ---- PII Redaction ----
var PII_PATTERNS = [
{ pattern: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g, replacement: "[EMAIL]" },
{ pattern: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, replacement: "[PHONE]" },
{ pattern: /\b\d{3}-\d{2}-\d{4}\b/g, replacement: "[SSN]" },
{ pattern: /\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b/g, replacement: "[CARD]" }
];
function redactPII(text) {
if (typeof text !== "string") return text;
var result = text;
PII_PATTERNS.forEach(function(rule) {
result = result.replace(rule.pattern, rule.replacement);
});
return result;
}
// ---- Utilities ----
function hashPrompt(text) {
return "sha256:" + createHash("sha256").update(text).digest("hex").substring(0, 16);
}
function estimateCost(model, promptTokens, completionTokens) {
var pricing = COST_TABLE[model];
if (!pricing) return null;
return ((promptTokens / 1000) * pricing.input) + ((completionTokens / 1000) * pricing.output);
}
// ---- PostgreSQL Storage ----
function storeLLMLog(entry) {
var sql = "INSERT INTO llm_logs (correlation_id, trace_id, level, event, model, " +
"provider, prompt_tokens, completion_tokens, total_tokens, latency_ms, " +
"estimated_cost_usd, temperature, max_tokens, status, error_message, " +
"prompt_hash, response_length, finish_reason, user_id, endpoint, metadata) " +
"VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21)";
var values = [
entry.correlation_id, entry.trace_id, entry.level, entry.event,
entry.model, entry.provider, entry.prompt_tokens, entry.completion_tokens,
entry.total_tokens, entry.latency_ms, entry.estimated_cost_usd,
entry.temperature, entry.max_tokens, entry.status, entry.error_message,
entry.prompt_hash, entry.response_length, entry.finish_reason,
entry.user_id, entry.endpoint, JSON.stringify(entry.metadata || {})
];
return pool.query(sql, values).catch(function(err) {
logger.error("log_storage_failure", { error: err.message });
});
}
// ---- LLM Wrapper ----
function createLLMClient(options) {
var defaultModel = (options && options.model) || "gpt-4o";
function chatCompletion(params, context) {
var correlationId = (context && context.correlationId) || uuidv4();
var endpoint = (context && context.endpoint) || "unknown";
var userId = (context && context.userId) || null;
var model = params.model || defaultModel;
var startTime = Date.now();
var promptText = JSON.stringify(params.messages);
var pHash = hashPrompt(promptText);
logger.debug("llm.request.start", {
correlation_id: correlationId,
model: model,
prompt_hash: pHash,
messages_redacted: params.messages.map(function(m) {
return { role: m.role, content: redactPII(m.content) };
})
});
return openai.chat.completions.create(params)
.then(function(response) {
var latencyMs = Date.now() - startTime;
var usage = response.usage || {};
var cost = estimateCost(model, usage.prompt_tokens, usage.completion_tokens);
var responseText = response.choices[0].message.content || "";
var logEntry = {
correlation_id: correlationId,
trace_id: (context && context.traceId) || null,
level: "info",
event: "llm.completion",
model: model,
provider: "openai",
prompt_tokens: usage.prompt_tokens,
completion_tokens: usage.completion_tokens,
total_tokens: usage.total_tokens,
latency_ms: latencyMs,
estimated_cost_usd: cost,
temperature: params.temperature || null,
max_tokens: params.max_tokens || null,
status: "success",
error_message: null,
prompt_hash: pHash,
response_length: responseText.length,
finish_reason: response.choices[0].finish_reason,
user_id: userId,
endpoint: endpoint,
metadata: {}
};
logger.info("llm.completion.success", logEntry);
storeLLMLog(logEntry);
return response;
})
.catch(function(error) {
var latencyMs = Date.now() - startTime;
var status = "error";
if (error.status === 429) status = "rate_limited";
if (error.code === "ETIMEDOUT") status = "timeout";
var logEntry = {
correlation_id: correlationId,
trace_id: (context && context.traceId) || null,
level: "error",
event: "llm.completion",
model: model,
provider: "openai",
prompt_tokens: null,
completion_tokens: null,
total_tokens: null,
latency_ms: latencyMs,
estimated_cost_usd: null,
temperature: params.temperature || null,
max_tokens: params.max_tokens || null,
status: status,
error_message: error.message,
prompt_hash: pHash,
response_length: null,
finish_reason: null,
user_id: userId,
endpoint: endpoint,
metadata: { error_code: error.status || error.code }
};
logger.error("llm.completion.error", logEntry);
storeLLMLog(logEntry);
throw error;
});
}
return { chatCompletion: chatCompletion };
}
// ---- Search API ----
function searchLogs(filters) {
var conditions = ["1=1"];
var values = [];
var idx = 1;
if (filters.correlation_id) { conditions.push("correlation_id = $" + idx++); values.push(filters.correlation_id); }
if (filters.model) { conditions.push("model = $" + idx++); values.push(filters.model); }
if (filters.status) { conditions.push("status = $" + idx++); values.push(filters.status); }
if (filters.min_latency) { conditions.push("latency_ms >= $" + idx++); values.push(parseInt(filters.min_latency)); }
if (filters.min_cost) { conditions.push("estimated_cost_usd >= $" + idx++); values.push(parseFloat(filters.min_cost)); }
if (filters.start_date) { conditions.push("timestamp >= $" + idx++); values.push(filters.start_date); }
if (filters.end_date) { conditions.push("timestamp <= $" + idx++); values.push(filters.end_date); }
if (filters.endpoint) { conditions.push("endpoint = $" + idx++); values.push(filters.endpoint); }
if (filters.user_id) { conditions.push("user_id = $" + idx++); values.push(filters.user_id); }
var limit = Math.min(parseInt(filters.limit) || 100, 1000);
var sql = "SELECT * FROM llm_logs WHERE " + conditions.join(" AND ") +
" ORDER BY timestamp DESC LIMIT " + limit;
return pool.query(sql, values);
}
// ---- Express Routes ----
var router = express.Router();
router.use(function(req, res, next) {
req.correlationId = req.headers["x-correlation-id"] || uuidv4();
res.setHeader("x-correlation-id", req.correlationId);
next();
});
// Log search endpoint
router.get("/logs/search", function(req, res) {
searchLogs(req.query)
.then(function(result) {
res.json({ count: result.rows.length, logs: result.rows });
})
.catch(function(err) {
res.status(500).json({ error: "Search failed", message: err.message });
});
});
// Log summary/dashboard endpoint
router.get("/logs/summary", function(req, res) {
var days = parseInt(req.query.days) || 7;
var sql = "SELECT DATE(timestamp) AS day, model, COUNT(*) AS calls, " +
"COUNT(*) FILTER (WHERE status = 'error') AS errors, " +
"SUM(total_tokens) AS tokens, " +
"ROUND(SUM(estimated_cost_usd)::numeric, 4) AS cost, " +
"ROUND(AVG(latency_ms)::numeric, 0) AS avg_latency_ms " +
"FROM llm_logs WHERE timestamp > NOW() - ($1 || ' days')::INTERVAL " +
"GROUP BY DATE(timestamp), model ORDER BY day DESC, cost DESC";
pool.query(sql, [days.toString()])
.then(function(result) {
res.json({ days: days, summary: result.rows });
})
.catch(function(err) {
res.status(500).json({ error: "Summary failed", message: err.message });
});
});
module.exports = {
createLLMClient: createLLMClient,
searchLogs: searchLogs,
storeLLMLog: storeLLMLog,
redactPII: redactPII,
router: router,
logger: logger
};
Usage in your application:
// app.js
var express = require("express");
var { v4: uuidv4 } = require("uuid");
var observability = require("./llm-observability");
var app = express();
app.use(express.json());
// Attach a correlation ID to every request. The middleware inside
// observability.router only covers routes mounted under /llm, but
// /api/summarize below relies on req.correlationId as well.
app.use(function(req, res, next) {
  req.correlationId = req.headers["x-correlation-id"] || uuidv4();
  res.setHeader("x-correlation-id", req.correlationId);
  next();
});
// Mount the log search/dashboard API
app.use("/llm", observability.router);
// Create an instrumented LLM client
var llm = observability.createLLMClient({ model: "gpt-4o" });
// Use it in your routes
app.post("/api/summarize", function(req, res) {
llm.chatCompletion({
model: "gpt-4o-mini",
messages: [
{ role: "system", content: "Summarize the following text concisely." },
{ role: "user", content: req.body.text }
],
temperature: 0.3,
max_tokens: 512
}, {
correlationId: req.correlationId,
endpoint: "/api/summarize",
userId: req.user && req.user.id
})
.then(function(response) {
res.json({ summary: response.choices[0].message.content });
})
.catch(function(error) {
res.status(500).json({ error: "Summarization failed" });
});
});
app.listen(process.env.PORT || 3000);
Every call to llm.chatCompletion now automatically logs structured JSON, stores to PostgreSQL, redacts PII in debug logs, and includes the correlation ID for end-to-end tracing.
Common Issues and Troubleshooting
1. Log Storage Fails Silently
Error: connect ECONNREFUSED 127.0.0.1:5432
If your PostgreSQL connection drops, the storeLLMLog function catches the error and logs it to the console, but the LLM call still succeeds. This is by design — log storage should never break your application. However, you will lose log data during the outage. Solution: add a memory buffer that retries failed inserts when the connection recovers, or write to a local file as a fallback.
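A minimal sketch of that buffer, assuming an insertLog(entry) helper that performs the INSERT and rejects on failure (that is, storeLLMLog without its internal catch); the cap and interval are arbitrary:
var pendingLogs = [];
var MAX_BUFFERED_LOGS = 5000; // bound memory during long outages
function storeLLMLogSafe(entry) {
  return insertLog(entry).catch(function(err) {
    logger.warn("llm.log.buffered", { error: err.message, buffered: pendingLogs.length });
    if (pendingLogs.length >= MAX_BUFFERED_LOGS) pendingLogs.shift(); // drop the oldest
    pendingLogs.push(entry);
  });
}
// Drain the buffer until it is empty or the database fails again
function drainLogBuffer() {
  if (pendingLogs.length === 0) return Promise.resolve();
  return insertLog(pendingLogs[0]).then(function() {
    pendingLogs.shift();
    return drainLogBuffer();
  }).catch(function() {
    // Still down; leave the buffer alone and try again on the next tick
  });
}
setInterval(drainLogBuffer, 30000);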
2. Winston JSON Circular Reference Error
TypeError: Converting circular structure to JSON
This happens when you accidentally log the full OpenAI response object, which contains circular references. Always extract the specific fields you need before logging. Never pass response directly to logger.info().
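For example, copy just the fields you need into a plain object first:
// Safe: a flat object with only the fields you care about
var usage = response.usage || {};
logger.info("llm.completion.success", {
  model: response.model,
  total_tokens: usage.total_tokens,
  finish_reason: response.choices[0].finish_reason
});
// Unsafe: the raw SDK response can carry non-serializable internals
// logger.info("llm.completion.success", response);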
3. PII Redaction Misses in Unstructured Input
// User input: "My name is John Smith and my email is john.smith@example.com"
// After redaction: "My name is John Smith and my email is [EMAIL]"
Regex-based PII redaction catches structured patterns (emails, phone numbers, SSNs) but misses names, free-form addresses, and context-dependent sensitive data. For higher-confidence PII detection, integrate a dedicated PII detection service like AWS Comprehend or Microsoft Presidio as an additional redaction layer.
4. Log Volume Causes Disk Pressure
Error: ENOSPC: no space left on device, write
LLM logs at high volume can fill disks fast, especially if you are logging at DEBUG level with full prompt content. Set maxsize and maxFiles on your Winston file transport, use log rotation, and never run DEBUG level in production for more than a few hours during active debugging sessions.
5. Correlation ID Missing in Async Chains
{ "correlation_id": "unknown", "model": "gpt-4o", "status": "success" }
If your correlation ID shows up as "unknown", the context object is not being threaded through async calls properly. Always pass the context explicitly rather than relying on thread-local or global state. In Node.js, use AsyncLocalStorage if you want automatic propagation without passing context manually.
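A minimal sketch of that pattern with the built-in AsyncLocalStorage; the store shape and helper names are illustrative:
var { AsyncLocalStorage } = require("async_hooks");
var { v4: uuidv4 } = require("uuid");
var requestContext = new AsyncLocalStorage();
// Express middleware: everything downstream of next() shares this store,
// including promise chains and timers started inside the request.
function contextMiddleware(req, res, next) {
  var store = { correlationId: req.headers["x-correlation-id"] || uuidv4() };
  res.setHeader("x-correlation-id", store.correlationId);
  requestContext.run(store, next);
}
// Callable anywhere in the request's call chain, no context parameter needed
function currentCorrelationId() {
  var store = requestContext.getStore();
  return store ? store.correlationId : "unknown";
}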
Best Practices
Log every LLM call, no exceptions. Even cache hits should be logged so you can track cache effectiveness and total request volume accurately.
Never log raw prompts in production. Use prompt hashing for deduplication and redacted versions for debugging. Store raw prompts only in development environments with test data.
Include cost estimates on every log entry. When your monthly LLM bill spikes, you need to identify the endpoint, user, or feature responsible within minutes, not days.
Set up P95 latency alerts, not just average latency. LLM call latency has a long tail. Average latency can look fine while 5% of your users experience 15-second waits.
Use separate tables for operational logs and audit logs. Operational logs have short retention and high volume. Audit logs have long retention and legal requirements. Mixing them makes retention policies impossible to enforce cleanly.
Version your prompt templates and log the version. When a prompt change causes a regression in output quality, you need to know which version of the prompt was active for each logged call.
Buffer log writes and batch insert to PostgreSQL. Individual INSERT statements for every LLM call add unnecessary database load. Buffer entries and flush every few seconds or every N entries.
Test your PII redaction regularly. Add unit tests that verify redaction patterns against real-world examples. PII patterns evolve — international phone numbers, new email TLDs, and local address formats all require pattern updates.
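A starting point with Node's built-in test runner (node:test, available from Node 18), assuming redactPII is split into its own small module (here called redact.js) so tests need no database or API credentials; the placeholders match the patterns from the complete example:
// test/redaction.test.js (run with: node --test)
var test = require("node:test");
var assert = require("node:assert");
var { redactPII } = require("../redact");
test("redacts email addresses", function() {
  assert.strictEqual(
    redactPII("Contact me at jane.doe@example.com please"),
    "Contact me at [EMAIL] please"
  );
});
test("redacts US phone numbers", function() {
  assert.strictEqual(redactPII("Call 555-867-5309 today"), "Call [PHONE] today");
});
test("leaves ordinary text untouched", function() {
  assert.strictEqual(redactPII("Summarize this meeting"), "Summarize this meeting");
});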
Make your log search API read-only and authenticated. LLM logs contain business-critical information about your AI operations. Restrict access to your engineering and ops teams.
References
- Winston Logging Library — Structured logging for Node.js
- OpenTelemetry JavaScript SDK — Distributed tracing instrumentation
- OpenAI Node.js SDK — Official OpenAI client
- PostgreSQL JSONB Documentation — Flexible metadata storage
- OWASP Logging Cheat Sheet — Security considerations for logging
- node-cron — Scheduled task execution for alert checks
- winston-daily-rotate-file — Log rotation transport