Scaling LLM Applications: Architecture Patterns

Scale LLM applications with queue-based architecture, worker pools, caching layers, and auto-scaling patterns in Node.js.

Overview

LLM-powered applications introduce scaling challenges that traditional web architectures were never designed to handle. Single API calls that take 5-30 seconds, cost real money per request, and produce wildly variable response times will break any naive deployment the moment real traffic arrives. This article walks through the architecture patterns, infrastructure decisions, and Node.js implementation strategies that let you scale LLM applications from proof-of-concept to production traffic without burning through your budget or your users' patience.

Prerequisites

  • Solid understanding of Node.js and Express.js
  • Familiarity with Redis, PostgreSQL, and message queues
  • Experience deploying and operating web applications at scale
  • Basic understanding of LLM APIs (OpenAI, Anthropic, etc.)
  • Knowledge of Docker and container orchestration concepts

Why LLM Applications Break Traditional Scaling

Most web applications follow a predictable pattern: a request comes in, the server does some computation or database queries, and a response goes back in under 200 milliseconds. You scale horizontally by adding more servers behind a load balancer. The math is straightforward.

LLM applications destroy every assumption in that model.

High latency per request. A single LLM call can take anywhere from 2 to 60 seconds depending on the model, prompt length, and output token count. That means each request ties up a connection, a worker thread, and memory for orders of magnitude longer than a typical web request. A Node.js server that handles 1,000 requests per second for a REST API might handle 20 concurrent LLM requests before it starts dropping connections.

Variable response times. A short summarization might return in 3 seconds. A complex code generation request might take 45 seconds. You cannot predict capacity because you cannot predict how long each request will hold resources. Standard timeout configurations and health check intervals become meaningless.

Expensive compute. Every LLM call costs money. A poorly designed system that retries aggressively, lacks caching, or allows unbounded concurrency can burn through thousands of dollars in API costs during a traffic spike. Unlike CPU or memory, LLM capacity cannot be over-provisioned once and amortized: every additional call adds to the bill in direct proportion.

Token-based rate limits. LLM providers impose rate limits measured in tokens per minute and requests per minute. These limits are per-account, not per-server, so scaling horizontally does not buy you more LLM capacity. You need architecture-level solutions to manage this constraint.

Horizontal Scaling with Load Balancers

The first layer of scaling is still the load balancer, but the configuration needs to account for LLM-specific behavior.

// health-check.js - LLM-aware health check endpoint
var express = require("express");
var router = express.Router();

var activeRequests = 0;
var maxConcurrentRequests = 50;

function incrementActive() {
  activeRequests++;
}

function decrementActive() {
  activeRequests--;
}

router.get("/health", function(req, res) {
  var healthy = activeRequests < maxConcurrentRequests;
  var status = healthy ? 200 : 503;

  res.status(status).json({
    status: healthy ? "healthy" : "overloaded",
    activeRequests: activeRequests,
    maxConcurrent: maxConcurrentRequests,
    uptime: process.uptime(),
    memoryUsage: process.memoryUsage().heapUsed
  });
});

module.exports = {
  router: router,
  incrementActive: incrementActive,
  decrementActive: decrementActive
};

Standard round-robin load balancing fails for LLM workloads because it does not account for how many long-running requests each server is already handling. Use least-connections balancing instead. If your load balancer supports it, configure weighted balancing based on the health check response so that servers already saturated with LLM calls stop receiving new ones.

Set your load balancer timeouts to at least 120 seconds. The default 30-second timeout on most load balancers will kill LLM requests mid-stream and leave users with nothing.
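
In this architecture the worker role updates those counters around actual LLM calls (see the complete example later in this article). If the API process itself holds connections open, for example for SSE streams or any synchronous endpoints, a small piece of middleware can feed the same counters. A minimal sketch, with illustrative file and function names:

// track-active.js - sketch: count in-flight requests so /health reflects real saturation
var healthCheck = require("./health-check");

function trackActiveRequests(req, res, next) {
  healthCheck.incrementActive();

  var released = false;
  function release() {
    // Decrement exactly once, whether the response finishes or the client disconnects
    if (!released) {
      released = true;
      healthCheck.decrementActive();
    }
  }

  res.on("finish", release);
  res.on("close", release);
  next();
}

module.exports = { trackActiveRequests: trackActiveRequests };

Mounted with app.use("/api", trackActiveRequests), the /health endpoint then reports load the load balancer can actually act on.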

Queue-Based Architecture

The single most important architectural pattern for LLM applications is decoupling request handling from LLM processing using a message queue. The API server's only job is to accept a request, validate it, drop it onto a queue, and return a job ID. Separate worker processes pull from the queue and make the actual LLM calls.

This pattern solves multiple problems simultaneously. The API server stays responsive because it never blocks on LLM calls. Workers can be scaled independently based on queue depth. Rate limits become manageable because you control exactly how many workers are calling the LLM provider at any given time. And if a worker crashes mid-request, the job stays in the queue and another worker picks it up.

// queue-setup.js - Bull queue configuration for LLM jobs
var Queue = require("bull");
var Redis = require("ioredis");

var redisConfig = {
  host: process.env.REDIS_HOST || "localhost",
  port: parseInt(process.env.REDIS_PORT) || 6379,
  password: process.env.REDIS_PASSWORD || undefined,
  maxRetriesPerRequest: null,
  enableReadyCheck: false
};

var llmQueue = new Queue("llm-processing", {
  createClient: function(type) {
    return new Redis(redisConfig);
  },
  defaultJobOptions: {
    attempts: 3,
    backoff: {
      type: "exponential",
      delay: 5000
    },
    timeout: 120000,
    removeOnComplete: 1000,
    removeOnFail: 5000
  }
});

llmQueue.on("error", function(error) {
  console.error("Queue error:", error.message);
});

llmQueue.on("failed", function(job, error) {
  console.error("Job " + job.id + " failed:", error.message);
});

module.exports = { llmQueue: llmQueue };

// api-server.js - Express server that enqueues LLM requests
var express = require("express");
var { v4: uuidv4 } = require("uuid");
var { llmQueue } = require("./queue-setup");

var app = express();
app.use(express.json({ limit: "1mb" }));

app.post("/api/generate", function(req, res) {
  var jobId = uuidv4();
  var prompt = req.body.prompt;
  var model = req.body.model || "gpt-4o";
  var userId = req.user ? req.user.id : "anonymous";

  if (!prompt || prompt.length > 10000) {
    return res.status(400).json({ error: "Invalid prompt" });
  }

  llmQueue.add({
    jobId: jobId,
    prompt: prompt,
    model: model,
    userId: userId,
    createdAt: Date.now()
  }, {
    jobId: jobId,
    priority: req.body.priority === "high" ? 1 : 10
  }).then(function() {
    res.status(202).json({
      jobId: jobId,
      status: "queued",
      pollUrl: "/api/jobs/" + jobId
    });
  }).catch(function(error) {
    console.error("Failed to enqueue:", error.message);
    res.status(500).json({ error: "Failed to process request" });
  });
});

app.get("/api/jobs/:jobId", function(req, res) {
  var jobId = req.params.jobId;

  llmQueue.getJob(jobId).then(function(job) {
    if (!job) {
      return res.status(404).json({ error: "Job not found" });
    }

    return job.getState().then(function(state) {
      var response = {
        jobId: jobId,
        status: state,
        createdAt: job.data.createdAt
      };

      if (state === "completed") {
        response.result = job.returnvalue;
      } else if (state === "failed") {
        response.error = job.failedReason;
      }

      res.json(response);
    });
  }).catch(function(error) {
    res.status(500).json({ error: "Failed to retrieve job" });
  });
});

var PORT = process.env.PORT || 3000;
app.listen(PORT, function() {
  console.log("API server running on port " + PORT);
});

Implementing Worker Pools for LLM Calls

Workers are the processes that actually make LLM calls. The key design decision is concurrency per worker. You want enough concurrency to keep the LLM provider busy, but not so much that you exhaust rate limits or memory.

// llm-worker.js - Worker process with controlled concurrency
var { llmQueue } = require("./queue-setup");
var { callLLM } = require("./llm-client");
var Redis = require("ioredis");

var CONCURRENCY = parseInt(process.env.WORKER_CONCURRENCY) || 5;
var redis = new Redis(process.env.REDIS_HOST || "localhost");

llmQueue.process(CONCURRENCY, function(job) {
  var data = job.data;

  console.log("Processing job " + job.id + " for model " + data.model);

  return job.progress(10).then(function() {
    return checkCache(data.prompt, data.model);
  }).then(function(cached) {
    if (cached) {
      console.log("Cache hit for job " + job.id);
      return JSON.parse(cached);
    }

    return job.progress(30).then(function() {
      return callLLM(data.prompt, data.model);
    }).then(function(result) {
      return job.progress(80).then(function() {
        return cacheResult(data.prompt, data.model, result).then(function() {
          return result;
        });
      });
    });
  }).then(function(result) {
    return job.progress(100).then(function() {
      return result;
    });
  });
});

function checkCache(prompt, model) {
  var crypto = require("crypto");
  var key = "llm:" + model + ":" + crypto.createHash("sha256").update(prompt).digest("hex");
  return redis.get(key);
}

function cacheResult(prompt, model, result) {
  var crypto = require("crypto");
  var key = "llm:" + model + ":" + crypto.createHash("sha256").update(prompt).digest("hex");
  var ttl = 3600; // 1 hour
  return redis.setex(key, ttl, JSON.stringify(result));
}

console.log("Worker started with concurrency " + CONCURRENCY);

// llm-client.js - LLM API client with retry and circuit breaker
var https = require("https");

var circuitState = "closed";
var failureCount = 0;
var failureThreshold = 5;
var resetTimeout = 30000;
var lastFailureTime = 0;

function callLLM(prompt, model) {
  return new Promise(function(resolve, reject) {
    if (circuitState === "open") {
      if (Date.now() - lastFailureTime > resetTimeout) {
        circuitState = "half-open";
      } else {
        return reject(new Error("Circuit breaker is open. LLM calls temporarily disabled."));
      }
    }

    var payload = JSON.stringify({
      model: model,
      messages: [{ role: "user", content: prompt }],
      max_tokens: 4096,
      temperature: 0.7
    });

    var options = {
      hostname: "api.openai.com",
      path: "/v1/chat/completions",
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
        "Content-Length": Buffer.byteLength(payload)
      },
      timeout: 90000
    };

    var req = https.request(options, function(res) {
      var body = "";
      res.on("data", function(chunk) { body += chunk; });
      res.on("end", function() {
        if (res.statusCode === 429) {
          var retryAfter = parseInt(res.headers["retry-after"]) || 10;
          failureCount++;
          lastFailureTime = Date.now();

          if (failureCount >= failureThreshold) {
            circuitState = "open";
          }

          return reject(new Error("Rate limited. Retry after " + retryAfter + "s"));
        }

        if (res.statusCode !== 200) {
          failureCount++;
          lastFailureTime = Date.now();

          if (failureCount >= failureThreshold) {
            circuitState = "open";
          }

          return reject(new Error("LLM API error: " + res.statusCode + " - " + body));
        }

        // Success - reset circuit breaker
        failureCount = 0;
        circuitState = "closed";

        var parsed;
        try {
          parsed = JSON.parse(body);
        } catch (parseError) {
          // A throw inside this async callback would escape the Promise and crash the worker; reject instead
          return reject(new Error("Failed to parse LLM response: " + parseError.message));
        }

        resolve({
          content: parsed.choices[0].message.content,
          usage: parsed.usage,
          model: parsed.model,
          completedAt: Date.now()
        });
      });
    });

    req.on("error", function(error) {
      failureCount++;
      lastFailureTime = Date.now();
      reject(error);
    });

    req.on("timeout", function() {
      req.destroy();
      reject(new Error("LLM request timed out after 90s"));
    });

    req.write(payload);
    req.end();
  });
}

module.exports = { callLLM: callLLM };

Run multiple worker processes rather than increasing concurrency in a single process. Each worker process gets its own event loop and memory space. If one worker crashes or leaks memory, the others keep running. A typical production deployment runs 3-10 worker processes per server, each with a concurrency of 3-8.
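
One way to supervise those processes from a single entry point is Node's built-in cluster module; a process manager such as pm2 or your container orchestrator can fill the same role. A minimal sketch, where the WORKER_PROCESSES variable and file name are illustrative:

// worker-supervisor.js - sketch: fork and supervise multiple queue worker processes
var cluster = require("cluster");

var WORKER_PROCESSES = parseInt(process.env.WORKER_PROCESSES) || 4;

if (cluster.isPrimary || cluster.isMaster) {
  for (var i = 0; i < WORKER_PROCESSES; i++) {
    cluster.fork();
  }

  // Replace crashed workers so queue processing never stops
  cluster.on("exit", function(worker, code) {
    console.log("Worker " + worker.process.pid + " exited with code " + code + ", forking replacement");
    cluster.fork();
  });
} else {
  // Each forked process runs the queue consumer defined in llm-worker.js
  require("./llm-worker");
}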

Connection Pooling for Database-Backed LLM Features

LLM applications frequently need to fetch context from a database before calling the model and store results afterward. With requests taking 10-30 seconds each, database connections stay open far longer than normal. Without pooling, you will exhaust your database connection limit quickly.

// db-pool.js - PostgreSQL connection pool with LLM-appropriate settings
var { Pool } = require("pg");

var pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 30,                          // more connections than typical
  min: 5,                           // keep warm connections ready
  idleTimeoutMillis: 60000,         // longer idle timeout for LLM workflows
  connectionTimeoutMillis: 10000,
  statement_timeout: 30000,
  query_timeout: 30000
});

pool.on("error", function(err) {
  console.error("Unexpected pool error:", err.message);
});

pool.on("connect", function() {
  console.log("New database connection established");
});

function getContextForPrompt(userId, contextType) {
  var query = "SELECT content, metadata FROM user_contexts WHERE user_id = $1 AND type = $2 ORDER BY created_at DESC LIMIT 10";
  return pool.query(query, [userId, contextType]).then(function(result) {
    return result.rows;
  });
}

function storeResult(jobId, userId, result) {
  var query = "INSERT INTO llm_results (job_id, user_id, content, usage_tokens, model, created_at) VALUES ($1, $2, $3, $4, $5, NOW())";
  var params = [jobId, userId, result.content, result.usage.total_tokens, result.model];
  return pool.query(query, params);
}

module.exports = {
  pool: pool,
  getContextForPrompt: getContextForPrompt,
  storeResult: storeResult
};

Multi-Layer Caching Strategy

Caching is not optional for LLM applications. It is the single biggest lever you have for reducing costs and improving response times. Implement caching at every layer.

CDN/Edge cache: For responses that are identical across users (e.g., FAQ answers, product descriptions), cache at the CDN level. Set appropriate Cache-Control headers and use surrogate keys for targeted invalidation.
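
The origin only has to emit the right headers for this layer to work. A minimal Express-style sketch, assuming a CDN that honors s-maxage and surrogate keys (the Surrogate-Key header name follows Fastly's convention and may differ on your CDN):

// edge-headers.js - sketch: CDN cache headers for responses shared across users
function setEdgeCacheHeaders(res, options) {
  var ttl = options.ttlSeconds || 3600;

  // Cache at the CDN (s-maxage) but not in the browser, so purges take effect immediately
  res.set("Cache-Control", "public, s-maxage=" + ttl + ", max-age=0");

  if (options.surrogateKeys) {
    // Surrogate keys allow targeted invalidation, e.g. purging every cached answer for one FAQ topic
    res.set("Surrogate-Key", options.surrogateKeys.join(" "));
  }
}

// Example: a generic FAQ answer cached for a day, tagged for later purging
// setEdgeCacheHeaders(res, { ttlSeconds: 86400, surrogateKeys: ["faq", "faq:billing"] });

module.exports = { setEdgeCacheHeaders: setEdgeCacheHeaders };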

Application cache (Redis): Cache LLM responses keyed by a hash of the prompt and model. Even partial cache hits (where the prompt is similar but not identical) can be valuable if you implement semantic caching.

Database cache: Store all LLM results in the database. Before making an LLM call, check if an identical request was processed recently. This survives Redis restarts and provides an audit trail.

// cache-layer.js - Multi-layer caching for LLM responses
var Redis = require("ioredis");
var crypto = require("crypto");
var { pool } = require("./db-pool");

var redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");

var CACHE_TTL_REDIS = 3600;       // 1 hour in Redis
var CACHE_TTL_DB = 86400 * 7;     // 7 days in database

function generateCacheKey(prompt, model, params) {
  var normalized = prompt.trim().toLowerCase();
  var input = model + ":" + JSON.stringify(params || {}) + ":" + normalized;
  return "llm:cache:" + crypto.createHash("sha256").update(input).digest("hex");
}

function getFromCache(prompt, model, params) {
  var key = generateCacheKey(prompt, model, params);

  // Layer 1: Redis
  return redis.get(key).then(function(cached) {
    if (cached) {
      console.log("Redis cache hit");
      return { source: "redis", data: JSON.parse(cached) };
    }

    // Layer 2: Database
    var query = "SELECT content, usage_tokens, model FROM llm_cache WHERE cache_key = $1 AND created_at > NOW() - INTERVAL '" + CACHE_TTL_DB + " seconds'";
    return pool.query(query, [key]).then(function(result) {
      if (result.rows.length > 0) {
        console.log("Database cache hit");
        var data = result.rows[0];

        // Backfill Redis
        redis.setex(key, CACHE_TTL_REDIS, JSON.stringify(data));
        return { source: "database", data: data };
      }

      return null;
    });
  });
}

function setCache(prompt, model, params, result) {
  var key = generateCacheKey(prompt, model, params);
  var serialized = JSON.stringify(result);

  // Write to both layers
  var redisPromise = redis.setex(key, CACHE_TTL_REDIS, serialized);
  var dbPromise = pool.query(
    "INSERT INTO llm_cache (cache_key, prompt_hash, model, content, usage_tokens, created_at) VALUES ($1, $2, $3, $4, $5, NOW()) ON CONFLICT (cache_key) DO UPDATE SET content = $4, usage_tokens = $5, created_at = NOW()",
    [key, crypto.createHash("md5").update(prompt).digest("hex"), model, result.content, result.usage.total_tokens]
  );

  return Promise.all([redisPromise, dbPromise]);
}

module.exports = {
  getFromCache: getFromCache,
  setCache: setCache,
  generateCacheKey: generateCacheKey
};

Async Processing: Submit Job, Poll for Result

The submit-and-poll pattern is the standard approach for LLM applications where response times exceed what users or clients will tolerate as a synchronous call. The client submits a request, gets back a job ID immediately, and polls a status endpoint until the result is ready.

For a better user experience, combine polling with Server-Sent Events (SSE) so clients get notified immediately when their job completes.

// sse-notifications.js - Server-Sent Events for job completion
var express = require("express");
var router = express.Router();

var clients = {};

router.get("/api/jobs/:jobId/stream", function(req, res) {
  var jobId = req.params.jobId;

  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive"
  });

  res.write("data: " + JSON.stringify({ status: "connected" }) + "\n\n");

  if (!clients[jobId]) {
    clients[jobId] = [];
  }
  clients[jobId].push(res);

  req.on("close", function() {
    if (clients[jobId]) {
      clients[jobId] = clients[jobId].filter(function(client) {
        return client !== res;
      });
      if (clients[jobId].length === 0) {
        delete clients[jobId];
      }
    }
  });
});

function notifyJobComplete(jobId, result) {
  if (clients[jobId]) {
    var message = "data: " + JSON.stringify({
      status: "completed",
      result: result
    }) + "\n\n";

    clients[jobId].forEach(function(client) {
      client.write(message);
      client.end();
    });

    delete clients[jobId];
  }
}

function notifyJobFailed(jobId, error) {
  if (clients[jobId]) {
    var message = "data: " + JSON.stringify({
      status: "failed",
      error: error
    }) + "\n\n";

    clients[jobId].forEach(function(client) {
      client.write(message);
      client.end();
    });

    delete clients[jobId];
  }
}

module.exports = {
  router: router,
  notifyJobComplete: notifyJobComplete,
  notifyJobFailed: notifyJobFailed
};

Read Replica Strategies for Embedding Search

If your LLM application uses vector embeddings for retrieval-augmented generation (RAG), the embedding search query is often the performance bottleneck. Vector similarity searches are CPU-intensive and can block other database operations.

Deploy dedicated read replicas for embedding search. Write new embeddings to the primary, but route all similarity searches to replicas. This isolates the heavy computation from your transactional workload.

// db-replicas.js - Read replica routing for embedding search
var { Pool } = require("pg");

var primaryPool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20
});

var replicaPool = new Pool({
  connectionString: process.env.DATABASE_REPLICA_URL || process.env.DATABASE_URL,
  max: 40  // more connections for read-heavy search
});

function searchEmbeddings(embedding, limit, threshold) {
  var query = "SELECT id, content, metadata, 1 - (embedding <=> $1::vector) AS similarity FROM documents WHERE 1 - (embedding <=> $1::vector) > $2 ORDER BY embedding <=> $1::vector LIMIT $3";
  return replicaPool.query(query, [
    "[" + embedding.join(",") + "]",
    threshold || 0.7,
    limit || 10
  ]).then(function(result) {
    return result.rows;
  });
}

function insertEmbedding(content, embedding, metadata) {
  var query = "INSERT INTO documents (content, embedding, metadata, created_at) VALUES ($1, $2::vector, $3, NOW()) RETURNING id";
  return primaryPool.query(query, [
    content,
    "[" + embedding.join(",") + "]",
    JSON.stringify(metadata)
  ]).then(function(result) {
    return result.rows[0];
  });
}

module.exports = {
  searchEmbeddings: searchEmbeddings,
  insertEmbedding: insertEmbedding,
  primaryPool: primaryPool,
  replicaPool: replicaPool
};

Partitioning Strategies for Large Vector Stores

Once your vector store exceeds a few million rows, single-table similarity search becomes too slow regardless of indexing. Partition your vectors by tenant, category, or time range so each search only scans a relevant subset.

Use PostgreSQL table partitioning combined with pgvector indexes on each partition. This keeps index sizes manageable and allows you to drop old partitions without expensive delete operations.

For multi-tenant applications, partition by tenant ID so each customer's searches only hit their own data. This also provides natural data isolation.
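
A sketch of what list-partitioning by tenant can look like, assuming a partitioned variant of the documents table and pgvector 0.5+ for HNSW indexes (use ivfflat on older versions); the table and helper names are illustrative:

// vector-partitions.js - sketch: per-tenant partitions with their own pgvector indexes
var { primaryPool } = require("./db-replicas");

// Parent table, created once (shown here as a comment):
// CREATE TABLE documents_partitioned (
//   id bigserial, tenant_id int NOT NULL, content text,
//   embedding vector(1536), metadata jsonb, created_at timestamptz DEFAULT now()
// ) PARTITION BY LIST (tenant_id);

function createTenantPartition(tenantId) {
  var id = parseInt(tenantId, 10); // integer-only so the identifier below stays safe
  var table = "documents_tenant_" + id;

  var createTable =
    "CREATE TABLE IF NOT EXISTS " + table +
    " PARTITION OF documents_partitioned FOR VALUES IN (" + id + ")";
  var createIndex =
    "CREATE INDEX IF NOT EXISTS " + table + "_embedding_idx ON " + table +
    " USING hnsw (embedding vector_cosine_ops)";

  return primaryPool.query(createTable).then(function() {
    return primaryPool.query(createIndex);
  });
}

module.exports = { createTenantPartition: createTenantPartition };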

Auto-Scaling Based on Queue Depth and Latency

Traditional CPU-based auto-scaling does not work for LLM applications because workers spend most of their time waiting on network I/O, not burning CPU. Instead, scale based on queue depth and job latency.

// autoscaler-metrics.js - Custom metrics for auto-scaling decisions
var { llmQueue } = require("./queue-setup");
var express = require("express");
var router = express.Router();

function getScalingMetrics() {
  return Promise.all([
    llmQueue.getWaitingCount(),
    llmQueue.getActiveCount(),
    llmQueue.getDelayedCount(),
    llmQueue.getCompletedCount(),
    llmQueue.getFailedCount()
  ]).then(function(counts) {
    var waiting = counts[0];
    var active = counts[1];
    var delayed = counts[2];
    var completed = counts[3];
    var failed = counts[4];

    var queuePressure = waiting / Math.max(active, 1);
    var failureRate = failed / Math.max(completed + failed, 1);

    return {
      queue: {
        waiting: waiting,
        active: active,
        delayed: delayed,
        completed: completed,
        failed: failed
      },
      scaling: {
        queuePressure: queuePressure,
        failureRate: failureRate,
        shouldScaleUp: queuePressure > 5 || waiting > 100,
        shouldScaleDown: waiting === 0 && active < 2
      }
    };
  });
}

// Prometheus-compatible metrics endpoint
router.get("/metrics", function(req, res) {
  getScalingMetrics().then(function(metrics) {
    var output = "";
    output += "llm_queue_waiting " + metrics.queue.waiting + "\n";
    output += "llm_queue_active " + metrics.queue.active + "\n";
    output += "llm_queue_delayed " + metrics.queue.delayed + "\n";
    output += "llm_queue_completed_total " + metrics.queue.completed + "\n";
    output += "llm_queue_failed_total " + metrics.queue.failed + "\n";
    output += "llm_queue_pressure " + metrics.scaling.queuePressure + "\n";
    output += "llm_failure_rate " + metrics.scaling.failureRate + "\n";

    res.set("Content-Type", "text/plain");
    res.send(output);
  }).catch(function(error) {
    res.status(500).send("Error collecting metrics");
  });
});

module.exports = {
  router: router,
  getScalingMetrics: getScalingMetrics
};

Configure your container orchestrator (Kubernetes, ECS, etc.) to use these custom metrics. Scale workers up when queue pressure exceeds 5 (five waiting jobs for every job currently being processed) and scale down when the queue is empty and fewer than two jobs are active. Set a minimum of 2 workers for availability and a maximum based on your LLM provider's rate limits.

Rate Limiting Per User

Without per-user rate limiting, a single user can exhaust your entire LLM budget or consume all your provider rate limits. Implement token-bucket rate limiting that accounts for both request count and estimated token consumption.

// rate-limiter.js - Token-bucket rate limiter for LLM requests
var Redis = require("ioredis");
var redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");

var REQUESTS_PER_MINUTE = parseInt(process.env.RATE_LIMIT_RPM) || 10;
var TOKENS_PER_HOUR = parseInt(process.env.RATE_LIMIT_TPH) || 100000;

function checkRateLimit(userId) {
  var now = Date.now();
  var minuteKey = "ratelimit:rpm:" + userId + ":" + Math.floor(now / 60000);
  var hourKey = "ratelimit:tph:" + userId + ":" + Math.floor(now / 3600000);

  return Promise.all([
    redis.get(minuteKey),
    redis.get(hourKey)
  ]).then(function(results) {
    var currentRPM = parseInt(results[0]) || 0;
    var currentTPH = parseInt(results[1]) || 0;

    if (currentRPM >= REQUESTS_PER_MINUTE) {
      return {
        allowed: false,
        reason: "Request rate limit exceeded. Max " + REQUESTS_PER_MINUTE + " requests per minute.",
        retryAfter: 60 - (Math.floor(now / 1000) % 60)
      };
    }

    if (currentTPH >= TOKENS_PER_HOUR) {
      return {
        allowed: false,
        reason: "Token rate limit exceeded. Max " + TOKENS_PER_HOUR + " tokens per hour.",
        retryAfter: 3600 - (Math.floor(now / 1000) % 3600)
      };
    }

    return { allowed: true };
  });
}

function recordUsage(userId, tokenCount) {
  var now = Date.now();
  var minuteKey = "ratelimit:rpm:" + userId + ":" + Math.floor(now / 60000);
  var hourKey = "ratelimit:tph:" + userId + ":" + Math.floor(now / 3600000);

  var pipeline = redis.pipeline();
  pipeline.incr(minuteKey);
  pipeline.expire(minuteKey, 120);
  pipeline.incrby(hourKey, tokenCount);
  pipeline.expire(hourKey, 7200);

  return pipeline.exec();
}

function rateLimitMiddleware(req, res, next) {
  var userId = req.user ? req.user.id : req.ip;

  checkRateLimit(userId).then(function(result) {
    if (!result.allowed) {
      res.set("Retry-After", String(result.retryAfter));
      return res.status(429).json({
        error: result.reason,
        retryAfter: result.retryAfter
      });
    }
    next();
  }).catch(function(error) {
    console.error("Rate limit check failed:", error.message);
    next(); // fail open
  });
}

module.exports = {
  rateLimitMiddleware: rateLimitMiddleware,
  recordUsage: recordUsage,
  checkRateLimit: checkRateLimit
};

Microservice Decomposition for LLM Features

As your application grows, decompose LLM features into separate services. Each service owns its queue, workers, cache, and scaling configuration. A text generation service has different scaling characteristics than an embedding search service or an image analysis service.

A typical decomposition looks like this:

  • Gateway Service: API routing, authentication, rate limiting
  • Generation Service: Text generation with LLM calls (high latency, low throughput)
  • Embedding Service: Vector embedding creation and search (CPU-intensive, cacheable)
  • Orchestration Service: Multi-step workflows that chain multiple LLM calls

Each service gets its own Bull queue, worker pool, and auto-scaling rules. The orchestration service coordinates between them using job dependencies in the queue.
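
A sketch of how that coordination can look with Bull, chaining a retrieval step into a generation step (the queue names here are illustrative, not the ones used earlier):

// orchestration-worker.js - sketch: chain jobs across per-service queues
var Queue = require("bull");

var redisUrl = process.env.REDIS_URL || "redis://localhost:6379";
var embeddingQueue = new Queue("embedding-service", redisUrl);
var generationQueue = new Queue("generation-service", redisUrl);
var orchestrationQueue = new Queue("orchestration-service", redisUrl);

// Two-step workflow: embed and retrieve context first, then generate with that context
orchestrationQueue.process(function(job) {
  return embeddingQueue.add({ text: job.data.query })
    .then(function(embedJob) {
      return embedJob.finished(); // resolves with the embedding worker's return value
    })
    .then(function(retrievedContext) {
      return generationQueue.add({
        prompt: job.data.query,
        context: retrievedContext
      });
    })
    .then(function(genJob) {
      return genJob.finished(); // the orchestration job completes with the generated output
    });
});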

Edge Caching for Common LLM Responses

Many LLM applications have a long tail of unique requests and a fat head of common ones. FAQ bots, product description generators, and content summarizers see the same inputs repeatedly. Cache these at the edge.

Use a CDN with programmable edge workers. Hash the request body, check the edge cache, and only forward cache misses to your origin. For deterministic prompts (temperature 0, same system prompt), the response is identical every time, making them perfect cache candidates.
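
A sketch of the edge-side logic, assuming a Cloudflare Workers-style runtime with the Cache API (caches.default); other CDNs expose similar primitives under different names:

// llm-edge-cache.js - sketch: hash the request body and serve repeat prompts from the edge
addEventListener("fetch", function(event) {
  event.respondWith(handleRequest(event.request));
});

async function handleRequest(request) {
  if (request.method !== "POST") {
    return fetch(request);
  }

  // The Cache API only stores GET requests, so derive a synthetic cache key from the body hash
  var body = await request.clone().text();
  var digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(body));
  var hash = Array.from(new Uint8Array(digest))
    .map(function(b) { return b.toString(16).padStart(2, "0"); })
    .join("");
  var cacheKey = new Request("https://edge-cache.internal/llm/" + hash);

  var cache = caches.default;
  var cached = await cache.match(cacheKey);
  if (cached) {
    return cached;
  }

  var response = await fetch(request);
  if (response.status === 200) {
    // Only cache completed responses, never 202 "queued" acknowledgements
    await cache.put(cacheKey, response.clone());
  }
  return response;
}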

Set Vary headers appropriately if responses differ by user context. Use short TTLs (5-15 minutes) for personalized responses and longer TTLs (1-24 hours) for generic ones.

Capacity Planning and Load Testing

LLM application capacity planning requires different math than traditional web applications. Start by calculating your theoretical maximum throughput.

Throughput formula: If your LLM provider allows R requests per minute, and each request takes an average of T seconds, then your maximum concurrent requests are R * T / 60. With a rate limit of 500 RPM and an average latency of 10 seconds, you can have at most 500 * 10 / 60 = 83 concurrent requests.
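
This is Little's law applied to the provider's rate limit; a quick helper for sanity-checking worker counts:

// capacity.js - back-of-the-envelope concurrency ceiling from provider limits
function maxConcurrentRequests(requestsPerMinute, avgLatencySeconds) {
  // Little's law: concurrency = arrival rate (requests/second) * average time in system (seconds)
  return Math.floor((requestsPerMinute / 60) * avgLatencySeconds);
}

console.log(maxConcurrentRequests(500, 10)); // 83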

Load test with realistic prompts at realistic ratios. Do not use a single short prompt for all test requests. Build a test corpus that matches your production prompt length distribution. Tools like k6 or Artillery work well for this.

// load-test-config.js - k6 load test with a realistic prompt mix
// Run with: k6 run load-test-config.js
import http from "k6/http";
import { check, sleep } from "k6";

var prompts = [
  { text: "Summarize the following article in 3 bullet points...", weight: 40 },
  { text: "Generate a product description for...", weight: 30 },
  { text: "Analyze the sentiment of the following customer review...", weight: 20 },
  { text: "Write a detailed technical explanation of...", weight: 10 }
];

export var options = {
  // Simulate gradual ramp-up
  stages: [
    { duration: "2m", target: 10 },   // warm up
    { duration: "5m", target: 50 },   // ramp to expected load
    { duration: "10m", target: 50 },  // sustain
    { duration: "5m", target: 100 },  // push to 2x expected
    { duration: "5m", target: 0 }     // cool down
  ],
  // Thresholds for pass/fail
  thresholds: {
    "http_req_duration": ["p(95)<30000"],            // 95th percentile under 30s
    "http_req_failed": ["rate<0.05"],                // less than 5% failure
    "http_req_duration{status:202}": ["p(99)<1000"]  // queue acceptance under 1s
  }
};

// Pick a prompt according to its weight (weights above sum to 100)
function pickPrompt() {
  var roll = Math.random() * 100;
  for (var i = 0; i < prompts.length; i++) {
    roll -= prompts[i].weight;
    if (roll <= 0) return prompts[i].text;
  }
  return prompts[prompts.length - 1].text;
}

export default function() {
  // BASE_URL is an assumption; point it at your deployment with -e BASE_URL=...
  var res = http.post(
    (__ENV.BASE_URL || "http://localhost:3000") + "/api/generate",
    JSON.stringify({ prompt: pickPrompt() }),
    { headers: { "Content-Type": "application/json" } }
  );
  check(res, { "request accepted": function(r) { return r.status === 202 || r.status === 200; } });
  sleep(1);
}

Monitor these metrics during load tests:

  • Queue depth over time
  • p50/p95/p99 job latency
  • Error rate by error type
  • LLM provider rate limit hits
  • Memory usage per worker
  • Database connection pool utilization

Complete Working Example

Here is the full architecture wired together as a deployable Node.js application.

// app.js - Main application entry point
var express = require("express");
var { llmQueue } = require("./queue-setup");
var { rateLimitMiddleware, recordUsage } = require("./rate-limiter");
var { getFromCache, setCache } = require("./cache-layer");
var { router: sseRouter, notifyJobComplete, notifyJobFailed } = require("./sse-notifications");
var { router: metricsRouter } = require("./autoscaler-metrics");
var { router: healthRouter, incrementActive, decrementActive } = require("./health-check");
var { getContextForPrompt, storeResult } = require("./db-pool");
var { callLLM } = require("./llm-client");

var app = express();
app.use(express.json({ limit: "1mb" }));

// Health and metrics
app.use(healthRouter);
app.use(metricsRouter);
app.use(sseRouter);

// API routes with rate limiting
app.post("/api/generate", rateLimitMiddleware, function(req, res) {
  var { v4: uuidv4 } = require("uuid");
  var jobId = uuidv4();
  var userId = req.user ? req.user.id : "anonymous";

  // Check cache first
  getFromCache(req.body.prompt, req.body.model || "gpt-4o", {}).then(function(cached) {
    if (cached) {
      return res.json({
        jobId: jobId,
        status: "completed",
        result: cached.data,
        source: "cache"
      });
    }

    // Enqueue for processing
    return llmQueue.add({
      jobId: jobId,
      prompt: req.body.prompt,
      model: req.body.model || "gpt-4o",
      userId: userId,
      createdAt: Date.now()
    }, {
      jobId: jobId
    }).then(function() {
      res.status(202).json({
        jobId: jobId,
        status: "queued",
        pollUrl: "/api/jobs/" + jobId,
        streamUrl: "/api/jobs/" + jobId + "/stream"
      });
    });
  }).catch(function(error) {
    console.error("Request handling error:", error.message);
    res.status(500).json({ error: "Internal server error" });
  });
});

app.get("/api/jobs/:jobId", function(req, res) {
  llmQueue.getJob(req.params.jobId).then(function(job) {
    if (!job) {
      return res.status(404).json({ error: "Job not found" });
    }
    return job.getState().then(function(state) {
      var response = { jobId: req.params.jobId, status: state };
      if (state === "completed") response.result = job.returnvalue;
      if (state === "failed") response.error = job.failedReason;
      res.json(response);
    });
  }).catch(function() {
    res.status(500).json({ error: "Failed to retrieve job status" });
  });
});

// Worker processing (run in separate process via llm-worker.js)
// This handler is included here for reference but should be in its own process
if (process.env.ROLE === "worker") {
  var CONCURRENCY = parseInt(process.env.WORKER_CONCURRENCY) || 5;

  llmQueue.process(CONCURRENCY, function(job) {
    var data = job.data;
    incrementActive();

    return getContextForPrompt(data.userId, "default").then(function(context) {
      var fullPrompt = data.prompt;
      if (context.length > 0) {
        var contextText = context.map(function(c) { return c.content; }).join("\n");
        fullPrompt = "Context:\n" + contextText + "\n\nUser request:\n" + data.prompt;
      }

      return callLLM(fullPrompt, data.model);
    }).then(function(result) {
      return Promise.all([
        storeResult(data.jobId, data.userId, result),
        setCache(data.prompt, data.model, {}, result),
        recordUsage(data.userId, result.usage.total_tokens)
      ]).then(function() {
        notifyJobComplete(data.jobId, result);
        decrementActive();
        return result;
      });
    }).catch(function(error) {
      decrementActive();
      notifyJobFailed(data.jobId, error.message);
      throw error;
    });
  });
}

var PORT = process.env.PORT || 3000;
app.listen(PORT, function() {
  console.log("LLM application running on port " + PORT + " as " + (process.env.ROLE || "api"));
});

# docker-compose.yml - reference structure
services:
  api:
    build: .
    environment:
      ROLE: api
      PORT: "3000"
    ports:
      - "3000:3000"
    deploy:
      replicas: 3

  worker:
    build: .
    environment:
      ROLE: worker
      WORKER_CONCURRENCY: "5"
    deploy:
      replicas: 4
      resources:
        limits:
          memory: 1G

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: llmapp

Common Issues and Troubleshooting

1. Redis Connection Exhaustion

Error: connect ECONNREFUSED 127.0.0.1:6379
MaxRetriesPerRequestError: Reached the max retries per request limit

Bull creates three Redis connections per queue: a client, a subscriber, and a blocking client. With multiple queues and workers, you can exhaust Redis connection limits fast. Set maxRetriesPerRequest: null in your Redis config to prevent Bull from crashing on transient connection issues. Monitor Redis connections with INFO clients and increase maxclients in redis.conf. Use the createClient option in your Bull queue config to reuse the client and subscriber connections across queues, as sketched below.
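
A sketch of that connection sharing (Bull's documented pattern is to reuse the client and subscriber connections and create fresh connections only for blocking commands):

// shared-redis.js - sketch: reuse client and subscriber connections across Bull queues
var Queue = require("bull");
var Redis = require("ioredis");

var redisOptions = {
  host: process.env.REDIS_HOST || "localhost",
  port: parseInt(process.env.REDIS_PORT) || 6379,
  maxRetriesPerRequest: null,   // required when connections are shared with Bull
  enableReadyCheck: false
};

var client = new Redis(redisOptions);      // shared: regular commands
var subscriber = new Redis(redisOptions);  // shared: pub/sub events

function createSharedClient(type) {
  switch (type) {
    case "client":
      return client;
    case "subscriber":
      return subscriber;
    default:
      return new Redis(redisOptions);      // "bclient": blocking commands need dedicated connections
  }
}

// Every queue created with this factory reuses the two shared connections
var llmQueue = new Queue("llm-processing", { createClient: createSharedClient });

module.exports = { createSharedClient: createSharedClient };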

2. Worker Memory Leaks from Large LLM Responses

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
<--- Last few GCs --->
[12345:0x5f1d8a0]  4523 ms: Scavenge 1987.3 (2048.0) -> 1986.8 (2048.0) MB

Large LLM responses (especially with JSON mode producing massive structured outputs) pile up in worker memory when jobs complete faster than their payloads can be released and garbage collected. Set removeOnComplete and removeOnFail in your job options to limit how many finished jobs Bull retains. Run workers with --max-old-space-size=2048 and implement periodic restarts, handling SIGTERM so workers drain gracefully before exiting.
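
A sketch of that SIGTERM handling, assuming Bull's queue.close(), which stops fetching new jobs and waits for active ones before resolving:

// graceful-shutdown.js - sketch: drain a worker on SIGTERM before the process exits
var { llmQueue } = require("./queue-setup");

var SHUTDOWN_TIMEOUT_MS = 120000; // give in-flight LLM calls time to finish

process.on("SIGTERM", function() {
  console.log("SIGTERM received, draining worker...");

  // Force-exit if draining takes longer than one full LLM timeout
  var killTimer = setTimeout(function() {
    console.error("Forced shutdown after " + SHUTDOWN_TIMEOUT_MS + "ms");
    process.exit(1);
  }, SHUTDOWN_TIMEOUT_MS);
  killTimer.unref();

  // close() stops fetching new jobs and waits for active ones to complete
  llmQueue.close().then(function() {
    console.log("Worker drained cleanly");
    process.exit(0);
  }).catch(function(error) {
    console.error("Error during shutdown:", error.message);
    process.exit(1);
  });
});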

3. Queue Stalls from Unacknowledged Jobs

[Bull] WARNING: Job 4582 stalled more than allowable limit. Job will be re-processed.

When a worker crashes mid-processing without acknowledging a job, Bull considers it stalled after the stalledInterval (default 30 seconds). For LLM jobs that legitimately take 60+ seconds, this triggers false stall detections and re-processing, which means duplicate LLM calls and duplicate charges. Increase stalledInterval to at least 120 seconds and set maxStalledCount to 2:

var llmQueue = new Queue("llm-processing", {
  settings: {
    stalledInterval: 120000,  // 2 minutes
    maxStalledCount: 2
  }
});

4. LLM Provider Rate Limit Cascade

HTTP 429 Too Many Requests
{"error":{"message":"Rate limit reached for gpt-4o in organization org-xxx on tokens per min (TPM): Limit 800000, Used 795432, Requested 12000.","type":"tokens","code":"rate_limit_exceeded"}}

When multiple workers hit the rate limit simultaneously, they all retry at the same time, creating a thundering herd. Use exponential backoff with jitter on retries. Implement a shared semaphore in Redis that tracks your organization's token usage and prevents workers from sending requests when you are approaching the limit. Pre-calculate estimated token usage before sending to avoid over-limit requests.
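
A sketch of such a shared token budget in Redis; the per-minute limit, safety margin, and token estimate below are illustrative assumptions:

// token-budget.js - sketch: organization-wide token budget shared by all workers via Redis
var Redis = require("ioredis");
var redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");

var ORG_TPM_LIMIT = parseInt(process.env.ORG_TPM_LIMIT) || 800000; // provider tokens-per-minute limit
var SAFETY_MARGIN = 0.9; // stop sending at 90% to leave headroom for estimation error

// Reserve estimated tokens for the current minute; resolves true if the request may proceed
function reserveTokens(estimatedTokens) {
  var minuteKey = "org:tpm:" + Math.floor(Date.now() / 60000);

  return redis.incrby(minuteKey, estimatedTokens).then(function(used) {
    return redis.expire(minuteKey, 120).then(function() {
      if (used > ORG_TPM_LIMIT * SAFETY_MARGIN) {
        // Roll back the reservation and tell the caller to wait for the next window
        return redis.decrby(minuteKey, estimatedTokens).then(function() { return false; });
      }
      return true;
    });
  });
}

// Rough estimate: ~4 characters per token for English text, plus the output budget
function estimateTokens(prompt, maxOutputTokens) {
  return Math.ceil(prompt.length / 4) + (maxOutputTokens || 4096);
}

module.exports = { reserveTokens: reserveTokens, estimateTokens: estimateTokens };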

5. Database Connection Pool Timeout During Spikes

Error: timeout expired before a connection was available from the pool
    at Pool._pendingQueue.push (node_modules/pg-pool/index.js:186:27)

LLM requests hold database connections for the full duration of the request (fetch context, wait for LLM, store result). During traffic spikes, the pool runs dry. Split your database operations: acquire a connection to fetch context, release it, make the LLM call, then acquire a new connection to store results. Never hold a database connection while waiting on an external API call.
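
A sketch of that split using the pool from db-pool.js; the key point is that pool.query() checks a client out and returns it per query, so nothing from the pool is held while the LLM call is in flight:

// split-db-ops.js - sketch: no pooled client or open transaction spans the LLM call
var { pool } = require("./db-pool");
var { callLLM } = require("./llm-client");

function processWithContext(job) {
  // Step 1: short-lived query; pool.query() borrows a client and returns it immediately
  return pool.query(
    "SELECT content FROM user_contexts WHERE user_id = $1 ORDER BY created_at DESC LIMIT 10",
    [job.userId]
  ).then(function(result) {
    var context = result.rows.map(function(r) { return r.content; }).join("\n");

    // Step 2: the slow LLM call runs while holding no database resources
    return callLLM("Context:\n" + context + "\n\nUser request:\n" + job.prompt, job.model);
  }).then(function(llmResult) {
    // Step 3: a second short-lived query stores the result
    return pool.query(
      "INSERT INTO llm_results (job_id, user_id, content, usage_tokens, model, created_at) VALUES ($1, $2, $3, $4, $5, NOW())",
      [job.jobId, job.userId, llmResult.content, llmResult.usage.total_tokens, llmResult.model]
    ).then(function() { return llmResult; });
  });
}

module.exports = { processWithContext: processWithContext };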

Best Practices

  • Decouple everything. The API server should never make direct LLM calls. Always go through a queue. This is non-negotiable for any production LLM application.

  • Cache aggressively and at every layer. The cheapest and fastest LLM call is the one you never make. Implement semantic caching where similar (not just identical) prompts return cached results. Even a 20% cache hit rate can cut your LLM costs significantly.

  • Set hard concurrency limits per LLM provider. Do not rely on the provider to rate-limit you gracefully. Track your token usage in Redis and stop sending requests before you hit the limit. The provider's 429 response is your last line of defense, not your first.

  • Implement circuit breakers on all LLM calls. When a provider is down or degraded, fail fast instead of queuing up thousands of requests that will all time out. Return cached results or graceful degradation messages instead.

  • Monitor queue depth as your primary scaling signal. CPU and memory utilization are meaningless for I/O-bound LLM workers. Queue depth and job wait time tell you whether you need more workers. Scale based on the ratio of waiting jobs to active workers.

  • Use separate worker pools for different priority levels. Interactive user requests should never wait behind batch processing jobs. Create priority queues and dedicate workers to each tier.

  • Plan for provider outages. Support multiple LLM providers and implement fallback routing. If your primary provider returns errors, route to a secondary. The slight differences in output are better than downtime.

  • Log every LLM call with its cost. Track prompt tokens, completion tokens, model, latency, and cache status for every request. This data is essential for capacity planning, cost allocation, and debugging. Store it in a time-series database for efficient querying.

  • Set billing alerts and hard spending caps. A bug that causes a retry loop or a cache miss spike can generate thousands of dollars in LLM charges in minutes. Every LLM integration must have a circuit breaker tied to spend rate, not just error rate.

  • Load test with realistic prompt distributions. A load test that sends the same short prompt repeatedly tells you nothing about production behavior. Build test corpora that match your real prompt length and complexity distribution, and test at 2x your expected peak traffic.
