
Performance Profiling LLM-Powered Features

Profile LLM-powered features with granular timing, memory tracking, bottleneck identification, and performance dashboards in Node.js.

Overview

LLM-powered features introduce performance characteristics that traditional profiling tools were never designed to handle: variable latency ranging from 500ms to 30 seconds on the same endpoint, token-dependent execution times that change based on input and output length, and streaming responses that blur the line between "started" and "finished." If you are shipping LLM features in production, you need a profiling strategy purpose-built for these constraints. This article walks through a complete Node.js approach to instrumenting, measuring, and optimizing every phase of the LLM request lifecycle.

Prerequisites

  • Node.js v18+ (for stable perf_hooks and PerformanceObserver APIs)
  • Express.js application with at least one LLM-powered endpoint
  • Familiarity with async/await patterns in Node.js
  • An OpenAI, Anthropic, or similar LLM API key for testing
  • Basic understanding of HTTP request lifecycles

Why LLM Features Have Unique Performance Characteristics

Traditional web endpoints are predictable. A database query returns in 2-15ms. A REST API call to an internal service takes 20-100ms. You can set a timeout at 5 seconds and know something is broken if you hit it.

LLM endpoints are fundamentally different. I have seen the same prompt go from 800ms to 12 seconds depending on the model's response length. A summarization endpoint that processes short text returns in under a second, but hand it a 4000-word document and you are waiting 8-15 seconds. The same input can produce different-length outputs on different calls, which means latency is inherently non-deterministic.

This creates three profiling challenges:

  1. Variable latency - p50 and p99 can differ by 10x or more on the same endpoint
  2. Token-dependent timing - execution time correlates with input and output token count, not request complexity
  3. Streaming ambiguity - time to first token (TTFT) matters more than total time for user experience, but most profiling tools only measure total duration

You cannot treat these like regular endpoints. You need dedicated instrumentation.

Profiling the Full Request Lifecycle

Every LLM-powered request follows a lifecycle with distinct phases, and each phase has different performance characteristics:

Client Request → Prompt Construction → API Call → Stream/Wait →
Post-Processing → Response Serialization → Client Response

Here is a basic timing wrapper that captures each phase:

var perf_hooks = require("perf_hooks");
var performance = perf_hooks.performance;

function createLLMProfiler(requestId) {
    var marks = {};
    var measurements = {};

    return {
        mark: function (name) {
            var key = requestId + ":" + name;
            marks[name] = performance.now();
            performance.mark(key);
        },

        measure: function (name, startMark, endMark) {
            var startKey = requestId + ":" + startMark;
            var endKey = requestId + ":" + endMark;
            performance.measure(requestId + ":" + name, startKey, endKey);
            measurements[name] = marks[endMark] - marks[startMark];
        },

        getReport: function () {
            return {
                requestId: requestId,
                measurements: Object.assign({}, measurements),
                marks: Object.assign({}, marks)
            };
        },

        cleanup: function () {
            Object.keys(marks).forEach(function (name) {
                try {
                    performance.clearMarks(requestId + ":" + name);
                } catch (e) { /* already cleared */ }
            });
            Object.keys(measurements).forEach(function (name) {
                try {
                    performance.clearMeasures(requestId + ":" + name);
                } catch (e) { /* already cleared */ }
            });
        }
    };
}

Use it like this in an Express handler:

var crypto = require("crypto");

function handleSummarize(req, res) {
    var requestId = crypto.randomUUID();
    var profiler = createLLMProfiler(requestId);

    profiler.mark("request_start");

    // Phase 1: Prompt Construction
    profiler.mark("prompt_start");
    var prompt = buildPrompt(req.body.text, req.body.options);
    profiler.mark("prompt_end");
    profiler.measure("prompt_construction", "prompt_start", "prompt_end");

    // Phase 2: LLM API Call
    profiler.mark("api_start");
    callLLMAPI(prompt).then(function (response) {
        profiler.mark("api_end");
        profiler.measure("api_call", "api_start", "api_end");

        // Phase 3: Post-Processing
        profiler.mark("postprocess_start");
        var result = parseAndValidate(response);
        profiler.mark("postprocess_end");
        profiler.measure("post_processing", "postprocess_start", "postprocess_end");

        profiler.mark("request_end");
        profiler.measure("total", "request_start", "request_end");

        var report = profiler.getReport();
        logPerformanceReport(report);
        profiler.cleanup();

        res.json({ result: result, timing: report.measurements });
    }).catch(function (err) {
        profiler.cleanup();
        res.status(500).json({ error: err.message });
    });
}
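
The logPerformanceReport call above is left undefined; a minimal sketch, assuming you just want one structured log line per request that downstream tooling can aggregate (the field names are illustrative, not a fixed schema):

function logPerformanceReport(report) {
    // One structured line per request; aggregate percentiles downstream.
    console.log(JSON.stringify({
        type: "llm_perf",
        requestId: report.requestId,
        promptMs: report.measurements.prompt_construction,
        apiMs: report.measurements.api_call,
        postprocessMs: report.measurements.post_processing,
        totalMs: report.measurements.total
    }));
}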

Measuring Time to First Token vs Total Completion Time

For streaming LLM responses, total completion time is the wrong metric to optimize for. Users perceive responsiveness based on when they see the first token appear. I have measured endpoints where TTFT is 400ms but total time is 9 seconds, and users report the feature as "fast" because text starts flowing immediately.

function measureStreamingTiming() {
    var timing = {
        firstTokenAt: null,
        lastTokenAt: null,
        totalTokens: 0,
        startedAt: performance.now()
    };

    return {
        onToken: function (token) {
            if (!timing.firstTokenAt) {
                timing.firstTokenAt = performance.now();
            }
            timing.lastTokenAt = performance.now();
            timing.totalTokens++;
        },

        getMetrics: function () {
            var now = performance.now();
            return {
                timeToFirstToken: timing.firstTokenAt
                    ? timing.firstTokenAt - timing.startedAt
                    : null,
                totalDuration: (timing.lastTokenAt || now) - timing.startedAt,
                tokenCount: timing.totalTokens,
                tokensPerSecond: timing.totalTokens > 1
                    ? (timing.totalTokens / ((timing.lastTokenAt - timing.firstTokenAt) / 1000))
                    : 0,
                interTokenLatency: timing.totalTokens > 1
                    ? ((timing.lastTokenAt - timing.firstTokenAt) / (timing.totalTokens - 1))
                    : 0
            };
        }
    };
}

The interTokenLatency metric is valuable for detecting API throttling. If your average inter-token latency suddenly spikes from 15ms to 80ms, the LLM provider is likely rate-limiting your account.
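
As a rough sketch of how that check could be wired to the getMetrics() output above (the 50ms threshold and the 10-token minimum are assumptions to tune per provider and model):

var INTER_TOKEN_ALERT_MS = 50; // assumed threshold, not a provider guarantee

function checkForThrottling(streamTiming) {
    var metrics = streamTiming.getMetrics();
    // Ignore tiny responses where the average is statistically meaningless.
    if (metrics.tokenCount > 10 && metrics.interTokenLatency > INTER_TOKEN_ALERT_MS) {
        console.warn(
            "Possible provider throttling: inter-token latency " +
            metrics.interTokenLatency.toFixed(1) + "ms over " +
            metrics.tokenCount + " tokens"
        );
    }
    return metrics;
}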

Identifying Bottlenecks

After instrumenting hundreds of LLM endpoints in production, I see this bottleneck distribution most often:

Phase                  | Typical % of Total Time | Common Bottleneck
Prompt Construction    | 1-5%                    | Template rendering with large context windows
API Call (TTFT)        | 15-30%                  | Network latency, queue depth at provider
Token Generation       | 50-70%                  | Model speed, output length
Post-Processing        | 2-10%                   | JSON parsing, validation, sanitization
Response Serialization | <1%                     | Rarely an issue

Most teams waste time optimizing prompt construction (which is already fast) while ignoring the real win: reducing output token count through better prompt engineering. Cutting your average response from 500 tokens to 300 tokens saves more time than any code optimization.
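
The usual levers are a hard output cap at the API level plus an explicit length instruction in the prompt. A sketch using the OpenAI-style chat completions API (other providers expose an equivalent parameter under a different name; documentText is a placeholder):

openai.chat.completions.create({
    model: "gpt-4o-mini",
    max_tokens: 300, // hard cap on output length
    messages: [{
        role: "user",
        content: "Summarize the document below in at most 150 words.\n\n" + documentText
    }]
}).then(function (completion) {
    return completion.choices[0].message.content;
});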

Using Node.js Performance Hooks for Granular Timing

The perf_hooks module gives you sub-millisecond, monotonic timestamps, which Date.now() cannot provide. Here is a more sophisticated setup using PerformanceObserver:

var perf_hooks = require("perf_hooks");
var PerformanceObserver = perf_hooks.PerformanceObserver;
var performance = perf_hooks.performance;

var metricsBuffer = [];

var observer = new PerformanceObserver(function (list) {
    var entries = list.getEntries();
    entries.forEach(function (entry) {
        if (entry.name.indexOf("llm:") === 0) {
            metricsBuffer.push({
                name: entry.name,
                duration: entry.duration,
                startTime: entry.startTime,
                timestamp: Date.now()
            });
        }
    });
});

observer.observe({ entryTypes: ["measure"], buffered: true });

function instrumentLLMCall(name, asyncFn) {
    return function () {
        var args = Array.prototype.slice.call(arguments);
        var startMark = "llm:" + name + ":start:" + Date.now();
        var endMark = "llm:" + name + ":end:" + Date.now();
        var measureName = "llm:" + name;

        performance.mark(startMark);

        return asyncFn.apply(null, args).then(function (result) {
            performance.mark(endMark);
            performance.measure(measureName, startMark, endMark);
            performance.clearMarks(startMark);
            performance.clearMarks(endMark);
            return result;
        }).catch(function (err) {
            performance.mark(endMark);
            performance.measure(measureName, startMark, endMark);
            performance.clearMarks(startMark);
            performance.clearMarks(endMark);
            throw err;
        });
    };
}

Now you can wrap any async function:

var profiledSummarize = instrumentLLMCall("summarize", callLLMForSummary);
var profiledClassify = instrumentLLMCall("classify", callLLMForClassification);

// All timings flow through the observer automatically
profiledSummarize(text).then(function (summary) {
    console.log("Summary complete, check metrics buffer");
});
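
Note that metricsBuffer grows without bound unless something drains it. A minimal sketch of a periodic flush, assuming a shipMetrics function that forwards batches to whatever metrics backend you use:

// Flush buffered measurements every 10 seconds, then clear the buffer.
setInterval(function () {
    if (metricsBuffer.length === 0) return;
    var batch = metricsBuffer.splice(0, metricsBuffer.length);
    shipMetrics(batch); // placeholder: StatsD, Prometheus, a log pipeline, etc.
}, 10000).unref(); // unref so the flush timer never keeps the process alive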

Profiling Memory Usage During Streaming Responses

Streaming LLM responses can accumulate significant memory, especially when you buffer the full response for post-processing while simultaneously streaming to the client. I have seen memory spikes of 50-100MB on endpoints that process large documents.

function trackMemoryDuringStream(streamHandler) {
    var memSnapshots = [];
    var intervalId;

    function captureSnapshot(label) {
        var usage = process.memoryUsage();
        memSnapshots.push({
            label: label,
            timestamp: performance.now(),
            heapUsed: usage.heapUsed,
            heapTotal: usage.heapTotal,
            rss: usage.rss,
            external: usage.external,
            arrayBuffers: usage.arrayBuffers
        });
    }

    captureSnapshot("before_stream");

    // Sample memory every 100ms during streaming
    intervalId = setInterval(function () {
        captureSnapshot("during_stream");
    }, 100);

    return streamHandler()
        .then(function (result) {
            clearInterval(intervalId);
            captureSnapshot("after_stream");

            var peakHeap = 0;
            var baselineHeap = memSnapshots[0].heapUsed;
            memSnapshots.forEach(function (snap) {
                if (snap.heapUsed > peakHeap) {
                    peakHeap = snap.heapUsed;
                }
            });

            return {
                result: result,
                memory: {
                    baselineHeapMB: (baselineHeap / 1024 / 1024).toFixed(2),
                    peakHeapMB: (peakHeap / 1024 / 1024).toFixed(2),
                    deltaHeapMB: ((peakHeap - baselineHeap) / 1024 / 1024).toFixed(2),
                    snapshots: memSnapshots.length
                }
            };
        })
        .catch(function (err) {
            clearInterval(intervalId);
            throw err;
        });
}

When deltaHeapMB consistently exceeds 20MB on a single request, you have a buffering problem. The fix is usually to process the stream in chunks rather than accumulating the entire response.
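
A minimal sketch of the chunked approach, assuming the LLM client exposes a Node readable stream and that handleChunk (a placeholder) does whatever per-chunk work you need, such as forwarding to the client or incremental parsing:

function processStreamInChunks(llmStream, handleChunk) {
    // Handle each chunk as it arrives instead of concatenating the full
    // response; heap usage stays proportional to chunk size, not output size.
    return new Promise(function (resolve, reject) {
        var chunkCount = 0;
        llmStream.on("data", function (chunk) {
            chunkCount++;
            handleChunk(chunk);
        });
        llmStream.on("end", function () { resolve({ chunks: chunkCount }); });
        llmStream.on("error", reject);
    });
}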

CPU Profiling for Prompt Template Rendering and Response Parsing

Prompt construction seems trivial, but when you are injecting thousands of tokens of context, template rendering becomes CPU-bound. Use the built-in V8 profiler:

node --prof app.js

After generating load, process the output:

node --prof-process isolate-0x*.log > profile.txt

Look for functions related to string concatenation, template rendering, and JSON parsing in the "Bottom up (heavy) profile" section.

For programmatic CPU profiling of specific code paths:

var v8Profiler;
try {
    v8Profiler = require("v8-profiler-next");
} catch (e) {
    v8Profiler = null;
}

function cpuProfileLLMHandler(handlerName, handlerFn) {
    return function (req, res) {
        if (!v8Profiler || !req.query._profile) {
            return handlerFn(req, res);
        }

        var profileTitle = handlerName + "-" + Date.now();
        v8Profiler.startProfiling(profileTitle, true);

        var originalEnd = res.end;
        res.end = function () {
            var profile = v8Profiler.stopProfiling(profileTitle);
            profile.export(function (error, result) {
                if (!error) {
                    var fs = require("fs");
                    var path = require("path");
                    var filename = path.join(__dirname, "profiles", profileTitle + ".cpuprofile");
                    fs.writeFileSync(filename, result);
                    console.log("CPU profile saved: " + filename);
                }
                profile.delete();
            });
            return originalEnd.apply(res, arguments);
        };

        return handlerFn(req, res);
    };
}

Load the .cpuprofile files in Chrome DevTools (Performance tab) to visualize flame graphs.

Network Profiling for LLM API Calls

LLM API calls involve DNS resolution, TLS handshake, and potentially connection pool management. Profile the network layer separately:

var http = require("http");
var https = require("https");

function createTimedAgent() {
    var timedAgent = new https.Agent({
        keepAlive: true,
        maxSockets: 10,
        maxFreeSockets: 5
    });

    var originalCreateConnection = timedAgent.createConnection;
    timedAgent.createConnection = function (options, callback) {
        var start = performance.now();
        var socket = originalCreateConnection.call(this, options, function () {
            var connectTime = performance.now() - start;
            if (connectTime > 100) {
                console.warn(
                    "Slow TLS connection to " + options.host +
                    ": " + connectTime.toFixed(1) + "ms"
                );
            }
            if (callback) callback.apply(null, arguments);
        });
        return socket;
    };

    return timedAgent;
}

var llmAgent = createTimedAgent();

// Use with your HTTP client
var axios = require("axios");
var llmClient = axios.create({
    httpsAgent: llmAgent,
    timeout: 60000
});

Connection reuse via keepAlive: true typically saves 80-150ms per request by avoiding repeated TLS handshakes to the LLM provider.

Implementing Performance Budgets Per Feature

Every LLM feature should have an explicit performance budget. Without one, latency creeps up silently until users complain.

var PERFORMANCE_BUDGETS = {
    "summarize": { p50: 2000, p95: 5000, p99: 10000, ttft: 800 },
    "classify": { p50: 800, p95: 2000, p99: 4000, ttft: 400 },
    "chat": { p50: 1500, p95: 4000, p99: 8000, ttft: 500 },
    "extract": { p50: 3000, p95: 8000, p99: 15000, ttft: 1000 }
};

function checkBudget(featureName, metrics) {
    var budget = PERFORMANCE_BUDGETS[featureName];
    if (!budget) return { withinBudget: true, violations: [] };

    var violations = [];

    if (metrics.duration > budget.p99) {
        violations.push({
            metric: "duration",
            value: metrics.duration,
            budget: budget.p99,
            severity: "critical"
        });
    } else if (metrics.duration > budget.p95) {
        violations.push({
            metric: "duration",
            value: metrics.duration,
            budget: budget.p95,
            severity: "warning"
        });
    }

    if (metrics.ttft && metrics.ttft > budget.ttft) {
        violations.push({
            metric: "ttft",
            value: metrics.ttft,
            budget: budget.ttft,
            severity: "warning"
        });
    }

    return {
        withinBudget: violations.length === 0,
        violations: violations
    };
}

I set budgets based on the first two weeks of production data. Collect metrics first, then set budgets at the p95 of your observed values. Tighten them over time as you optimize.
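
A small helper along those lines, assuming you have arrays of observed durations and TTFTs per feature; the 10% headroom is an arbitrary starting point, not a rule:

function deriveBudget(observedDurations, observedTTFTs) {
    function pct(arr, q) {
        var sorted = arr.slice().sort(function (a, b) { return a - b; });
        return sorted[Math.max(0, Math.ceil(sorted.length * q) - 1)];
    }
    // Start at observed percentiles plus ~10% headroom, then tighten over time.
    return {
        p50: Math.round(pct(observedDurations, 0.5) * 1.1),
        p95: Math.round(pct(observedDurations, 0.95) * 1.1),
        p99: Math.round(pct(observedDurations, 0.99) * 1.1),
        ttft: Math.round(pct(observedTTFTs, 0.95) * 1.1)
    };
}

// e.g. PERFORMANCE_BUDGETS["summarize"] = deriveBudget(durationsMs, ttftsMs);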

Flame Graph Analysis for LLM Request Handlers

Generate flame graphs using 0x for quick visual analysis:

npm install -g 0x
0x -- node app.js

Send traffic to your LLM endpoints, then stop the process. 0x generates an interactive flame graph HTML file.

What to look for in LLM handler flame graphs:

  • Wide bars in JSON.parse/stringify - You are serializing large payloads. Consider streaming JSON parsing with stream-json
  • Deep stacks in template rendering - Your prompt templates have too many nested conditionals or loops
  • Long bars in Buffer operations - You are doing unnecessary buffer conversions on the response stream
  • Wide bars in regex operations - Your post-processing uses expensive regex patterns on large LLM outputs

Load Testing LLM Features Under Realistic Conditions

Standard load testing tools generate uniform traffic, but LLM features see wildly variable request sizes. Build a realistic load profile:

var http = require("http");

var TEST_SCENARIOS = [
    { weight: 40, name: "short_text", bodySize: 200 },
    { weight: 35, name: "medium_text", bodySize: 1000 },
    { weight: 20, name: "long_text", bodySize: 4000 },
    { weight: 5, name: "very_long_text", bodySize: 10000 }
];

function pickScenario() {
    var roll = Math.random() * 100;
    var cumulative = 0;
    for (var i = 0; i < TEST_SCENARIOS.length; i++) {
        cumulative += TEST_SCENARIOS[i].weight;
        if (roll <= cumulative) return TEST_SCENARIOS[i];
    }
    return TEST_SCENARIOS[0];
}

function generateLoadTest(targetRPS, durationSeconds) {
    var intervalMs = 1000 / targetRPS;
    var results = [];
    var startTime = Date.now();

    var intervalId = setInterval(function () {
        if (Date.now() - startTime > durationSeconds * 1000) {
            clearInterval(intervalId);
            printLoadTestResults(results);
            return;
        }

        var scenario = pickScenario();
        var reqStart = performance.now();

        var reqData = JSON.stringify({
            text: "x".repeat(scenario.bodySize),
            scenario: scenario.name
        });

        var options = {
            hostname: "localhost",
            port: 8080,
            path: "/api/summarize",
            method: "POST",
            headers: {
                "Content-Type": "application/json",
                "Content-Length": Buffer.byteLength(reqData)
            }
        };

        var req = http.request(options, function (res) {
            var body = "";
            res.on("data", function (chunk) { body += chunk; });
            res.on("end", function () {
                results.push({
                    scenario: scenario.name,
                    duration: performance.now() - reqStart,
                    status: res.statusCode,
                    bodySize: scenario.bodySize
                });
            });
        });

        req.on("error", function (err) {
            results.push({
                scenario: scenario.name,
                duration: performance.now() - reqStart,
                status: 0,
                error: err.message
            });
        });

        req.write(reqData);
        req.end();
    }, intervalMs);
}

function printLoadTestResults(results) {
    var byScenario = {};
    results.forEach(function (r) {
        if (!byScenario[r.scenario]) byScenario[r.scenario] = [];
        byScenario[r.scenario].push(r.duration);
    });

    Object.keys(byScenario).forEach(function (scenario) {
        var durations = byScenario[scenario].sort(function (a, b) { return a - b; });
        var p50 = durations[Math.floor(durations.length * 0.5)];
        var p95 = durations[Math.floor(durations.length * 0.95)];
        var p99 = durations[Math.floor(durations.length * 0.99)];
        console.log(
            scenario + ": p50=" + p50.toFixed(0) + "ms" +
            " p95=" + p95.toFixed(0) + "ms" +
            " p99=" + p99.toFixed(0) + "ms" +
            " n=" + durations.length
        );
    });
}

Run this against a staging environment with real LLM calls, not mocks. Mocked LLM responses return in 1ms and tell you nothing about real-world behavior.
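
For example, a ten-minute run at 2 requests per second (both numbers are arbitrary; size them to your staging capacity and your provider's rate limits):

// 2 RPS for 600 seconds against the /api/summarize endpoint defined above
generateLoadTest(2, 600);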

Performance Regression Detection in CI

Catch performance regressions before they reach production by running a baseline comparison in your CI pipeline:

var fs = require("fs");
var path = require("path");

var BASELINE_FILE = path.join(__dirname, "perf-baseline.json");
var REGRESSION_THRESHOLD = 0.20; // 20% regression triggers failure

function loadBaseline() {
    try {
        var data = fs.readFileSync(BASELINE_FILE, "utf8");
        return JSON.parse(data);
    } catch (e) {
        return null;
    }
}

function saveBaseline(metrics) {
    fs.writeFileSync(BASELINE_FILE, JSON.stringify(metrics, null, 2));
}

function checkForRegressions(currentMetrics) {
    var baseline = loadBaseline();
    if (!baseline) {
        console.log("No baseline found. Saving current metrics as baseline.");
        saveBaseline(currentMetrics);
        return { passed: true, regressions: [] };
    }

    var regressions = [];

    Object.keys(currentMetrics).forEach(function (feature) {
        var current = currentMetrics[feature];
        var base = baseline[feature];
        if (!base) return;

        ["p50", "p95", "ttft"].forEach(function (metric) {
            if (!current[metric] || !base[metric]) return;

            var ratio = current[metric] / base[metric];
            if (ratio > 1 + REGRESSION_THRESHOLD) {
                regressions.push({
                    feature: feature,
                    metric: metric,
                    baseline: base[metric],
                    current: current[metric],
                    regression: ((ratio - 1) * 100).toFixed(1) + "%"
                });
            }
        });
    });

    return {
        passed: regressions.length === 0,
        regressions: regressions
    };
}

In your CI script:

# Run performance test suite
node perf-tests/run.js --output perf-results.json

# Check for regressions
node perf-tests/check-regression.js perf-results.json

If the check fails, the CI pipeline should block the merge and report which features regressed.
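
The check-regression.js entry point is not shown above; a minimal sketch, assuming checkForRegressions from the previous snippet is in the same file or required from a shared module, and that a non-zero exit code is what blocks the merge:

// perf-tests/check-regression.js (sketch)
var fs = require("fs");

var resultsFile = process.argv[2];
var currentMetrics = JSON.parse(fs.readFileSync(resultsFile, "utf8"));

var outcome = checkForRegressions(currentMetrics);
if (!outcome.passed) {
    console.error("Performance regressions detected:");
    outcome.regressions.forEach(function (r) {
        console.error(
            "  " + r.feature + " " + r.metric + ": " + r.baseline + "ms -> " +
            r.current + "ms (+" + r.regression + ")"
        );
    });
    process.exit(1); // non-zero exit fails the CI step
}
console.log("No performance regressions detected.");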

Optimizing Identified Bottlenecks

Once you have profiling data, these are the highest-impact optimizations, in order:

1. Parallel LLM Calls

When a feature requires multiple LLM calls, run them in parallel:

function processDocumentParallel(document) {
    var chunks = splitIntoChunks(document, 2000);

    var promises = chunks.map(function (chunk, index) {
        return callLLM({
            prompt: "Summarize this section:\n\n" + chunk,
            maxTokens: 200
        }).then(function (result) {
            return { index: index, summary: result };
        });
    });

    return Promise.all(promises).then(function (results) {
        results.sort(function (a, b) { return a.index - b.index; });
        return results.map(function (r) { return r.summary; }).join("\n\n");
    });
}

I have seen this cut total latency by 60-70% on document processing endpoints that were making 4-5 serial LLM calls.

2. Response Streaming

Stream tokens to the client as they arrive instead of waiting for the full response:

function streamLLMResponse(req, res, prompt) {
    res.setHeader("Content-Type", "text/event-stream");
    res.setHeader("Cache-Control", "no-cache");
    res.setHeader("Connection", "keep-alive");

    var stream = openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
        stream: true
    });

    stream.then(function (completion) {
        var reader = completion[Symbol.asyncIterator]();

        function processNext() {
            reader.next().then(function (result) {
                if (result.done) {
                    res.write("data: [DONE]\n\n");
                    res.end();
                    return;
                }

                var chunk = result.value;
                var content = chunk.choices[0].delta.content;
                if (content) {
                    res.write("data: " + JSON.stringify({ text: content }) + "\n\n");
                }
                processNext();
            }).catch(function (err) {
                res.write("data: " + JSON.stringify({ error: err.message }) + "\n\n");
                res.end();
            });
        }

        processNext();
    }).catch(function (err) {
        res.write("data: " + JSON.stringify({ error: err.message }) + "\n\n");
        res.end();
    });
}

3. Semantic Caching

Cache responses so repeated prompts skip the API call entirely. The example below is an exact-match cache keyed on a hash of the prompt and options; true semantic caching would add an embedding-similarity lookup on top of it:

var crypto = require("crypto");
var NodeCache = require("node-cache");

var llmCache = new NodeCache({ stdTTL: 3600, maxKeys: 1000 });

function hashPrompt(prompt) {
    return crypto.createHash("sha256").update(prompt).digest("hex");
}

function cachedLLMCall(prompt, options) {
    var cacheKey = hashPrompt(prompt + JSON.stringify(options || {}));
    var cached = llmCache.get(cacheKey);

    if (cached) {
        return Promise.resolve({
            result: cached,
            fromCache: true,
            timing: { duration: 0 }
        });
    }

    var start = performance.now();
    return callLLM(prompt, options).then(function (result) {
        var duration = performance.now() - start;
        llmCache.set(cacheKey, result);
        return {
            result: result,
            fromCache: false,
            timing: { duration: duration }
        };
    });
}

For classification and extraction endpoints where identical inputs produce identical outputs, caching eliminates the API call entirely. I have seen cache hit rates of 30-45% on classification endpoints in production.

Building a Performance Dashboard for LLM Features

Aggregate your profiling data into a real-time dashboard endpoint:

function createPerformanceDashboard() {
    var metrics = {};
    var WINDOW_SIZE = 1000; // Keep last 1000 measurements per feature

    function record(feature, measurement) {
        if (!metrics[feature]) {
            metrics[feature] = {
                measurements: [],
                totalRequests: 0,
                totalErrors: 0,
                budgetViolations: 0
            };
        }

        var m = metrics[feature];
        m.measurements.push(measurement);
        m.totalRequests++;

        if (measurement.error) m.totalErrors++;
        if (measurement.budgetViolation) m.budgetViolations++;

        // Trim to window size
        if (m.measurements.length > WINDOW_SIZE) {
            m.measurements = m.measurements.slice(-WINDOW_SIZE);
        }
    }

    function getStats(feature) {
        var m = metrics[feature];
        if (!m || m.measurements.length === 0) return null;

        var durations = m.measurements
            .filter(function (d) { return !d.error; })
            .map(function (d) { return d.duration; })
            .sort(function (a, b) { return a - b; });

        var ttfts = m.measurements
            .filter(function (d) { return d.ttft != null; })
            .map(function (d) { return d.ttft; })
            .sort(function (a, b) { return a - b; });

        function percentile(arr, p) {
            if (arr.length === 0) return 0;
            var idx = Math.ceil(arr.length * p) - 1;
            return arr[Math.max(0, idx)];
        }

        return {
            feature: feature,
            totalRequests: m.totalRequests,
            errorRate: ((m.totalErrors / m.totalRequests) * 100).toFixed(2) + "%",
            budgetViolationRate: ((m.budgetViolations / m.totalRequests) * 100).toFixed(2) + "%",
            duration: {
                p50: percentile(durations, 0.5).toFixed(0) + "ms",
                p95: percentile(durations, 0.95).toFixed(0) + "ms",
                p99: percentile(durations, 0.99).toFixed(0) + "ms"
            },
            ttft: {
                p50: percentile(ttfts, 0.5).toFixed(0) + "ms",
                p95: percentile(ttfts, 0.95).toFixed(0) + "ms"
            },
            tokensPerSecond: calculateAvgTokensPerSecond(m.measurements)
        };
    }

    function calculateAvgTokensPerSecond(measurements) {
        var rates = measurements
            .filter(function (d) { return d.tokensPerSecond > 0; })
            .map(function (d) { return d.tokensPerSecond; });
        if (rates.length === 0) return "N/A";
        var sum = rates.reduce(function (a, b) { return a + b; }, 0);
        return (sum / rates.length).toFixed(1);
    }

    function getAllStats() {
        var result = {};
        Object.keys(metrics).forEach(function (feature) {
            result[feature] = getStats(feature);
        });
        return result;
    }

    return {
        record: record,
        getStats: getStats,
        getAllStats: getAllStats
    };
}
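
Wiring this into an app is a matter of recording a measurement when each request finishes and exposing getAllStats() on a route. A brief sketch, assuming an existing Express app instance named app (the route path and measurement fields are illustrative):

var dashboard = createPerformanceDashboard();

// Record one measurement per finished request
dashboard.record("summarize", {
    duration: 2340,
    ttft: 410,
    tokensPerSecond: 42.5,
    error: false,
    budgetViolation: false
});

// Expose aggregated stats for an internal dashboard or scraper
app.get("/internal/llm-stats", function (req, res) {
    res.json(dashboard.getAllStats());
});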

Complete Working Example

Here is a fully working Express.js module that instruments LLM feature handlers with granular timing and memory tracking, and exposes a performance dashboard via API endpoints.

// llm-profiler.js — LLM Performance Profiling Module
var perf_hooks = require("perf_hooks");
var performance = perf_hooks.performance;
var crypto = require("crypto");
var express = require("express");

// ============================================================
// Metrics Store
// ============================================================
var metricsStore = {};
var WINDOW_SIZE = 2000;

function recordMetric(feature, data) {
    if (!metricsStore[feature]) {
        metricsStore[feature] = {
            requests: [],
            total: 0,
            errors: 0,
            budgetViolations: 0
        };
    }
    var store = metricsStore[feature];
    store.requests.push(data);
    store.total++;
    if (data.error) store.errors++;
    if (data.budgetViolated) store.budgetViolations++;
    if (store.requests.length > WINDOW_SIZE) {
        store.requests = store.requests.slice(-WINDOW_SIZE);
    }
}

// ============================================================
// Performance Budgets
// ============================================================
var budgets = {};

function setBudget(feature, budget) {
    budgets[feature] = budget;
}

function checkBudget(feature, metrics) {
    var budget = budgets[feature];
    if (!budget) return { violated: false, details: [] };

    var details = [];
    if (budget.maxDuration && metrics.totalDuration > budget.maxDuration) {
        details.push("duration " + metrics.totalDuration.toFixed(0) +
            "ms exceeded budget " + budget.maxDuration + "ms");
    }
    if (budget.maxTTFT && metrics.ttft && metrics.ttft > budget.maxTTFT) {
        details.push("TTFT " + metrics.ttft.toFixed(0) +
            "ms exceeded budget " + budget.maxTTFT + "ms");
    }
    if (budget.maxMemoryMB && metrics.peakMemoryMB > budget.maxMemoryMB) {
        details.push("memory " + metrics.peakMemoryMB.toFixed(1) +
            "MB exceeded budget " + budget.maxMemoryMB + "MB");
    }
    return { violated: details.length > 0, details: details };
}

// ============================================================
// Core Profiler
// ============================================================
function profileLLMHandler(featureName, handlerFn) {
    return function (req, res, next) {
        var requestId = crypto.randomUUID();
        var phases = {};
        var currentPhase = null;
        var phaseStart = null;
        var overallStart = performance.now();
        var memBaseline = process.memoryUsage().heapUsed;
        var peakHeap = memBaseline;
        var ttft = null;
        var tokenCount = 0;
        var firstTokenTime = null;
        var lastTokenTime = null;

        // Memory sampling interval
        var memInterval = setInterval(function () {
            var heap = process.memoryUsage().heapUsed;
            if (heap > peakHeap) peakHeap = heap;
        }, 50);

        // Profiling context injected into req
        req.llmProfile = {
            requestId: requestId,

            startPhase: function (name) {
                if (currentPhase) {
                    phases[currentPhase] = performance.now() - phaseStart;
                }
                currentPhase = name;
                phaseStart = performance.now();
            },

            endPhase: function () {
                if (currentPhase) {
                    phases[currentPhase] = performance.now() - phaseStart;
                    currentPhase = null;
                    phaseStart = null;
                }
            },

            onToken: function () {
                tokenCount++;
                var now = performance.now();
                if (!firstTokenTime) {
                    firstTokenTime = now;
                    ttft = now - overallStart;
                }
                lastTokenTime = now;
            },

            onTokenBatch: function (count) {
                tokenCount += count;
                var now = performance.now();
                if (!firstTokenTime) {
                    firstTokenTime = now;
                    ttft = now - overallStart;
                }
                lastTokenTime = now;
            }
        };

        // Intercept response to capture final timing
        var originalEnd = res.end;
        res.end = function () {
            clearInterval(memInterval);

            if (currentPhase) {
                phases[currentPhase] = performance.now() - phaseStart;
            }

            var totalDuration = performance.now() - overallStart;
            var peakMemoryMB = (peakHeap - memBaseline) / (1024 * 1024);
            var tokensPerSecond = 0;
            if (tokenCount > 1 && lastTokenTime && firstTokenTime) {
                tokensPerSecond = tokenCount /
                    ((lastTokenTime - firstTokenTime) / 1000);
            }

            var metrics = {
                requestId: requestId,
                feature: featureName,
                timestamp: new Date().toISOString(),
                totalDuration: totalDuration,
                ttft: ttft,
                tokenCount: tokenCount,
                tokensPerSecond: tokensPerSecond,
                peakMemoryMB: peakMemoryMB,
                phases: phases,
                statusCode: res.statusCode,
                error: res.statusCode >= 400
            };

            var budgetResult = checkBudget(featureName, metrics);
            metrics.budgetViolated = budgetResult.violated;
            metrics.budgetDetails = budgetResult.details;

            if (budgetResult.violated) {
                console.warn(
                    "[LLM-PERF] Budget violation on " + featureName +
                    " (request " + requestId + "): " +
                    budgetResult.details.join(", ")
                );
            }

            recordMetric(featureName, metrics);

            // Add timing header for client-side observability
            try {
                if (!res.headersSent) {
                    res.setHeader("Server-Timing",
                        "total;dur=" + totalDuration.toFixed(1) +
                        (ttft ? ",ttft;dur=" + ttft.toFixed(1) : "")
                    );
                }
            } catch (e) { /* headers already sent for streaming */ }

            return originalEnd.apply(res, arguments);
        };

        handlerFn(req, res, next);
    };
}

// ============================================================
// Dashboard Router
// ============================================================
function createDashboardRouter() {
    var router = express.Router();

    // Summary of all features
    router.get("/", function (req, res) {
        var summary = {};
        Object.keys(metricsStore).forEach(function (feature) {
            var store = metricsStore[feature];
            var durations = store.requests
                .filter(function (r) { return !r.error; })
                .map(function (r) { return r.totalDuration; })
                .sort(function (a, b) { return a - b; });

            var ttfts = store.requests
                .filter(function (r) { return r.ttft != null; })
                .map(function (r) { return r.ttft; })
                .sort(function (a, b) { return a - b; });

            function pct(arr, p) {
                if (arr.length === 0) return null;
                return arr[Math.max(0, Math.ceil(arr.length * p) - 1)];
            }

            summary[feature] = {
                totalRequests: store.total,
                errorRate: store.total > 0
                    ? ((store.errors / store.total) * 100).toFixed(2) + "%"
                    : "0%",
                budgetViolationRate: store.total > 0
                    ? ((store.budgetViolations / store.total) * 100).toFixed(2) + "%"
                    : "0%",
                duration: {
                    p50: pct(durations, 0.5),
                    p95: pct(durations, 0.95),
                    p99: pct(durations, 0.99)
                },
                ttft: {
                    p50: pct(ttfts, 0.5),
                    p95: pct(ttfts, 0.95)
                },
                sampleSize: store.requests.length
            };
        });
        res.json(summary);
    });

    // Detailed metrics for a specific feature
    router.get("/:feature", function (req, res) {
        var feature = req.params.feature;
        var store = metricsStore[feature];
        if (!store) {
            return res.status(404).json({ error: "Feature not found" });
        }

        var recentRequests = store.requests.slice(-50).map(function (r) {
            return {
                requestId: r.requestId,
                timestamp: r.timestamp,
                totalDuration: r.totalDuration ? r.totalDuration.toFixed(1) + "ms" : null,
                ttft: r.ttft ? r.ttft.toFixed(1) + "ms" : null,
                tokenCount: r.tokenCount,
                tokensPerSecond: r.tokensPerSecond
                    ? r.tokensPerSecond.toFixed(1)
                    : null,
                peakMemoryMB: r.peakMemoryMB
                    ? r.peakMemoryMB.toFixed(2) + "MB"
                    : null,
                phases: r.phases,
                budgetViolated: r.budgetViolated,
                error: r.error
            };
        });

        res.json({
            feature: feature,
            totalRequests: store.total,
            recentRequests: recentRequests
        });
    });

    // Phase breakdown for a specific feature
    router.get("/:feature/phases", function (req, res) {
        var feature = req.params.feature;
        var store = metricsStore[feature];
        if (!store) {
            return res.status(404).json({ error: "Feature not found" });
        }

        var phaseAggregates = {};
        store.requests.forEach(function (r) {
            if (!r.phases) return;
            Object.keys(r.phases).forEach(function (phase) {
                if (!phaseAggregates[phase]) {
                    phaseAggregates[phase] = { values: [], sum: 0 };
                }
                phaseAggregates[phase].values.push(r.phases[phase]);
                phaseAggregates[phase].sum += r.phases[phase];
            });
        });

        var breakdown = {};
        Object.keys(phaseAggregates).forEach(function (phase) {
            var vals = phaseAggregates[phase].values.sort(
                function (a, b) { return a - b; }
            );
            breakdown[phase] = {
                avg: (phaseAggregates[phase].sum / vals.length).toFixed(1) + "ms",
                p50: vals[Math.floor(vals.length * 0.5)].toFixed(1) + "ms",
                p95: vals[Math.floor(vals.length * 0.95)].toFixed(1) + "ms",
                count: vals.length
            };
        });

        res.json({ feature: feature, phases: breakdown });
    });

    // Reset metrics
    router.post("/reset", function (req, res) {
        var feature = req.body && req.body.feature; // requires express.json() when a body is sent
        if (feature) {
            delete metricsStore[feature];
        } else {
            Object.keys(metricsStore).forEach(function (key) {
                delete metricsStore[key];
            });
        }
        res.json({ status: "ok" });
    });

    return router;
}

// ============================================================
// Usage Example
// ============================================================
//
//   var profiler = require("./llm-profiler");
//   var app = express();
//
//   // Set performance budgets
//   profiler.setBudget("summarize", {
//       maxDuration: 10000,
//       maxTTFT: 800,
//       maxMemoryMB: 50
//   });
//
//   // Wrap your LLM handler
//   app.post("/api/summarize",
//       profiler.profileLLMHandler("summarize", function (req, res) {
//           req.llmProfile.startPhase("prompt_construction");
//           var prompt = buildPrompt(req.body.text);
//           req.llmProfile.endPhase();
//
//           req.llmProfile.startPhase("api_call");
//           callLLM(prompt).then(function (result) {
//               req.llmProfile.onTokenBatch(result.usage.completion_tokens);
//               req.llmProfile.endPhase();
//
//               req.llmProfile.startPhase("post_processing");
//               var output = formatResult(result);
//               req.llmProfile.endPhase();
//
//               res.json({ summary: output });
//           });
//       })
//   );
//
//   // Mount the dashboard
//   app.use("/perf/llm", profiler.createDashboardRouter());
//
// ============================================================

module.exports = {
    profileLLMHandler: profileLLMHandler,
    setBudget: setBudget,
    createDashboardRouter: createDashboardRouter,
    recordMetric: recordMetric
};

Save this as llm-profiler.js and mount it in your Express app. The dashboard endpoints provide:

  • GET /perf/llm/ — Summary statistics for all instrumented features
  • GET /perf/llm/:feature — Detailed recent request history for a specific feature
  • GET /perf/llm/:feature/phases — Phase-by-phase breakdown showing where time is spent
  • POST /perf/llm/reset — Clear collected metrics

Example dashboard output from GET /perf/llm/:

{
  "summarize": {
    "totalRequests": 1847,
    "errorRate": "2.11%",
    "budgetViolationRate": "4.33%",
    "duration": { "p50": 2340, "p95": 6120, "p99": 11430 },
    "ttft": { "p50": 412, "p95": 890 },
    "sampleSize": 1847
  },
  "classify": {
    "totalRequests": 5621,
    "errorRate": "0.89%",
    "budgetViolationRate": "1.07%",
    "duration": { "p50": 680, "p95": 1840, "p99": 3200 },
    "ttft": { "p50": 245, "p95": 520 },
    "sampleSize": 2000
  }
}

Common Issues and Troubleshooting

1. Performance Marks Accumulate and Leak Memory

RangeError: Too many performance marks (more than 1000000)

The performance timeline keeps all marks in memory unless you explicitly clear them. In a high-traffic application, this fills up fast.

Fix: Always call performance.clearMarks() and performance.clearMeasures() after capturing your data. The createLLMProfiler cleanup() method earlier does this for its own marks and measures, but if you are using raw perf_hooks, clean up manually:

performance.mark("start");
// ... work ...
performance.mark("end");
performance.measure("work", "start", "end");
performance.clearMarks("start");
performance.clearMarks("end");
performance.clearMeasures("work");

2. Memory Sampling Interval Blocks Event Loop

Warning: Possible EventEmitter memory leak detected. 11 'timeout' listeners added.

If you set setInterval to sample memory too frequently (every 1ms), the callbacks pile up and block the event loop, which distorts your timing measurements.

Fix: Use a sampling interval of 50-100ms. Sub-50ms sampling adds overhead that affects the measurements you are trying to capture. For high-precision memory tracking, use process.memoryUsage.rss() (the fast version available in Node 18+) instead of the full process.memoryUsage().
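
A lighter-weight sampler using the standalone rss getter, at the 50ms interval discussed above:

var peakRss = process.memoryUsage.rss();
var rssSampler = setInterval(function () {
    // process.memoryUsage.rss() avoids building the full memoryUsage() object,
    // so frequent sampling stays cheap.
    var rss = process.memoryUsage.rss();
    if (rss > peakRss) peakRss = rss;
}, 50);

// When the request finishes:
// clearInterval(rssSampler);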

3. Server-Timing Header Set After Streaming Starts

Error [ERR_HTTP_HEADERS_SENT]: Cannot set headers after they are sent to the client

When streaming responses, headers are sent with the first chunk. Attempting to set Server-Timing after streaming starts throws this error.

Fix: Set the Server-Timing header before streaming begins with estimated values, or skip it entirely for streaming endpoints and rely on the metrics store instead:

try {
    if (!res.headersSent) {
        res.setHeader("Server-Timing", "total;dur=" + duration.toFixed(1));
    }
} catch (e) {
    // Streaming response, headers already sent - metrics stored internally
}

4. PerformanceObserver Entries Not Firing

// observer callback never executes
observer.observe({ entryTypes: ["measure"] });

A PerformanceObserver only receives entries created after observe() is called. With buffered: false (the default), anything already sitting in the performance timeline when the observer is registered is never delivered, so measures recorded before your observer setup silently disappear.

Fix: Always pass buffered: true:

observer.observe({ entryTypes: ["measure"], buffered: true });

And ensure you are reading entries in the callback, not polling externally. The observer is GC-sensitive; keep a reference to it at the module level so it does not get collected.

5. Token Count Mismatch Between Profiler and LLM Provider

Profiler reports 342 tokens, OpenAI usage reports 287 completion_tokens

If you count tokens by splitting on whitespace or counting onToken callbacks in a stream, your count will differ from the provider's BPE token count. Streaming chunks do not always correspond 1:1 with tokens.

Fix: Use the provider's reported usage.completion_tokens as the source of truth for token counts. Use stream-based counting only for TTFT and inter-token latency calculations, not for cost or throughput metrics.
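
One way to keep both views side by side, assuming an OpenAI-style usage object on the final response and the streaming timing helper from earlier (the recordMetric payload shape is illustrative):

var streamMetrics = streamTiming.getMetrics();
recordMetric("summarize", {
    // Latency metrics come from the stream...
    ttft: streamMetrics.timeToFirstToken,
    interTokenLatency: streamMetrics.interTokenLatency,
    // ...but token counts come from the provider's usage report.
    completionTokens: response.usage.completion_tokens,
    promptTokens: response.usage.prompt_tokens
});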

Best Practices

  • Measure TTFT separately from total duration. For user-facing features, time to first token is the metric that determines perceived performance. A 10-second endpoint that starts streaming in 400ms feels faster than a 3-second endpoint that returns nothing until complete.

  • Set explicit performance budgets before launch. Define p95 latency, TTFT, and memory limits for each LLM feature during development. Without budgets, latency creeps up and nobody notices until users complain.

  • Profile with realistic input distributions. Do not test exclusively with short prompts. Build a load test profile that mirrors your production traffic distribution: 40% short, 35% medium, 20% long, 5% very long inputs.

  • Track tokens per second as a provider health indicator. A sudden drop in tokens-per-second across all features usually indicates provider-side throttling or degradation, not a problem in your code.

  • Clean up performance marks aggressively. The perf_hooks timeline is a memory leak waiting to happen in long-running processes. Clear marks and measures immediately after capturing them.

  • Cache deterministic LLM calls. Classification, extraction, and structured output endpoints with identical inputs produce identical outputs. Cache these aggressively. Even a 30% cache hit rate saves significant cost and latency.

  • Profile in staging with real LLM calls, not mocks. Mocked LLM responses return in microseconds and tell you nothing about real performance. Your staging profiling should hit the actual LLM API.

  • Log phase timings on every request, not just sampled ones. LLM latency variance is so high that sampling misses the tail. Log all requests and aggregate server-side. Storage is cheap; missing a 30-second outlier is expensive.

  • Use connection pooling for LLM API calls. A single TLS handshake to api.openai.com costs 80-150ms. Use keepAlive: true on your HTTPS agent to reuse connections across requests.
