Rate Limiting Strategies for LLM APIs
Comprehensive strategies for rate limiting LLM API calls including token buckets, priority queues, backoff, and per-user quotas in Node.js.
Overview
Every LLM provider enforces rate limits, and if you are building production applications that rely on OpenAI, Anthropic, or any other language model API, you will hit those limits. The difference between a prototype that crashes under load and a production system that gracefully handles thousands of concurrent users comes down to how you manage request throughput on the client side. This article walks through battle-tested rate limiting strategies for LLM APIs in Node.js, from basic token buckets to distributed systems with Redis-backed coordination.
Prerequisites
- Working knowledge of Node.js and Express
- Familiarity with HTTP status codes, particularly 429 (Too Many Requests)
- Basic understanding of LLM API usage (OpenAI, Anthropic, etc.)
- Node.js v16+ installed locally
- Redis installed locally (for the distributed rate limiting section)
Understanding LLM Provider Rate Limits
LLM APIs do not rate limit the same way a typical REST API does. Most REST APIs care about requests per second. LLM providers care about two separate dimensions: requests per minute (RPM) and tokens per minute (TPM). You can hit either limit independently, and whichever one you exhaust first will trigger throttling.
Here is what makes this tricky. A single request to GPT-4 might consume 4,000 tokens while another consumes 50. If your rate limiter only counts requests, you will blow through your token budget on a handful of large prompts. If it only counts tokens, you might still exceed the RPM cap with many small requests.
The typical limits look something like this:
| Provider | Tier | RPM | TPM |
|---|---|---|---|
| OpenAI GPT-4o | Tier 1 | 500 | 30,000 |
| OpenAI GPT-4o | Tier 5 | 10,000 | 12,000,000 |
| Anthropic Claude Sonnet | Tier 1 | 50 | 40,000 |
| Anthropic Claude Sonnet | Tier 4 | 4,000 | 400,000 |
Your rate limiter needs to track both dimensions simultaneously. Let us build that.
Implementing Client-Side Rate Limiting with Token Buckets
The token bucket algorithm is the most natural fit for LLM rate limiting because it models the way providers actually enforce limits. You have a bucket that fills up at a steady rate. Each request drains tokens from the bucket. When the bucket is empty, requests wait.
function TokenBucket(options) {
this.maxTokens = options.maxTokens;
this.refillRate = options.refillRate; // tokens added per ms
this.tokens = options.maxTokens;
this.lastRefill = Date.now();
}
TokenBucket.prototype.refill = function () {
var now = Date.now();
var elapsed = now - this.lastRefill;
var newTokens = elapsed * this.refillRate;
this.tokens = Math.min(this.maxTokens, this.tokens + newTokens);
this.lastRefill = now;
};
TokenBucket.prototype.consume = function (count) {
this.refill();
if (this.tokens >= count) {
this.tokens -= count;
return true;
}
return false;
};
TokenBucket.prototype.waitTime = function (count) {
this.refill();
if (this.tokens >= count) {
return 0;
}
var deficit = count - this.tokens;
return Math.ceil(deficit / this.refillRate);
};
For LLM APIs, you need two buckets running in parallel — one for RPM and one for TPM:
function DualBucketLimiter(options) {
this.rpmBucket = new TokenBucket({
maxTokens: options.rpm,
refillRate: options.rpm / 60000 // requests per ms
});
this.tpmBucket = new TokenBucket({
maxTokens: options.tpm,
refillRate: options.tpm / 60000 // tokens per ms
});
}
DualBucketLimiter.prototype.canProceed = function (estimatedTokens) {
var rpmReady = this.rpmBucket.consume(1);
if (!rpmReady) {
return { allowed: false, waitMs: this.rpmBucket.waitTime(1), reason: "rpm" };
}
var tpmReady = this.tpmBucket.consume(estimatedTokens);
if (!tpmReady) {
// refund the RPM token since we cannot proceed
this.rpmBucket.tokens += 1;
return { allowed: false, waitMs: this.tpmBucket.waitTime(estimatedTokens), reason: "tpm" };
}
return { allowed: true, waitMs: 0 };
};
The key insight here is that you need to estimate token counts before making the request. A reasonable approximation is 1 token per 4 characters for English text. You can refine later with a proper tokenizer like tiktoken, but the estimate gets you close enough for rate limiting purposes.
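A minimal estimator along those lines might look like the following sketch; the estimateTokens name, the 4-characters-per-token ratio, and the per-message overhead are assumptions you should calibrate against the usage field in real API responses.
// Rough token estimate for chat messages: ~4 characters per token for English,
// plus a small per-message allowance for role and formatting tokens (a guess).
function estimateTokens(messages) {
  var total = 0;
  for (var i = 0; i < messages.length; i++) {
    var content = messages[i].content || "";
    total += Math.ceil(content.length / 4) + 4;
  }
  return total;
}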
Sliding Window Rate Limiting
Token buckets are great, but they allow bursts. If the bucket is full and you fire 500 requests in a single second, you can overwhelm the API even though you are technically "within limits." A sliding window approach smooths this out by counting everything that happened in the trailing window:
function SlidingWindowLimiter(options) {
this.windowMs = options.windowMs || 60000;
this.maxRequests = options.maxRequests;
this.maxTokens = options.maxTokens;
this.requests = []; // { timestamp, tokens }
}
SlidingWindowLimiter.prototype.cleanup = function () {
var cutoff = Date.now() - this.windowMs;
while (this.requests.length > 0 && this.requests[0].timestamp < cutoff) {
this.requests.shift();
}
};
SlidingWindowLimiter.prototype.canProceed = function (estimatedTokens) {
  this.cleanup();
  var currentRequests = this.requests.length;
  var currentTokens = 0;
  for (var i = 0; i < this.requests.length; i++) {
    currentTokens += this.requests[i].tokens;
  }
  // Time until the oldest entry falls out of the window (0 if the window is empty,
  // which also avoids reading this.requests[0] when a single oversized request
  // exceeds maxTokens before anything has been recorded)
  var oldestExpiry = currentRequests > 0
    ? Math.max(0, this.requests[0].timestamp + this.windowMs - Date.now())
    : 0;
  if (currentRequests >= this.maxRequests) {
    return { allowed: false, waitMs: oldestExpiry, reason: "rpm" };
  }
  if (currentTokens + estimatedTokens > this.maxTokens) {
    return { allowed: false, waitMs: oldestExpiry, reason: "tpm" };
  }
  return { allowed: true, waitMs: 0 };
};
SlidingWindowLimiter.prototype.record = function (tokens) {
this.requests.push({ timestamp: Date.now(), tokens: tokens });
};
I prefer the sliding window for production systems. It is more predictable and avoids the burst behavior that token buckets allow.
Queuing Requests with Priority Levels
Not all LLM requests are equal. A real-time chat response needs to go out immediately. A batch summarization job can wait. A priority queue ensures critical requests get through first when you are near your rate limit:
function PriorityQueue() {
this.queues = {
critical: [],
high: [],
normal: [],
low: []
};
this.priorities = ["critical", "high", "normal", "low"];
}
PriorityQueue.prototype.enqueue = function (item, priority) {
var level = priority || "normal";
if (!this.queues[level]) {
level = "normal";
}
this.queues[level].push(item);
};
PriorityQueue.prototype.dequeue = function () {
for (var i = 0; i < this.priorities.length; i++) {
var queue = this.queues[this.priorities[i]];
if (queue.length > 0) {
return queue.shift();
}
}
return null;
};
PriorityQueue.prototype.size = function () {
var total = 0;
for (var i = 0; i < this.priorities.length; i++) {
total += this.queues[this.priorities[i]].length;
}
return total;
};
PriorityQueue.prototype.isEmpty = function () {
return this.size() === 0;
};
You wire this into your rate limiter so that when a request cannot proceed immediately, it gets queued at the appropriate priority level and processed when capacity becomes available.
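One simple way to do that wiring is a periodic drain loop that checks the limiter before releasing work. This is a sketch built on the DualBucketLimiter and PriorityQueue above; the run and estimatedTokens fields on queued items and the 250 ms polling interval are assumptions, not part of the earlier code.
function QueueDrainer(limiter, queue) {
  this.limiter = limiter; // a DualBucketLimiter
  this.queue = queue;     // a PriorityQueue
  this.interval = null;
}
QueueDrainer.prototype.tick = function () {
  while (!this.queue.isEmpty()) {
    // There is no peek, so dequeue and put the item back if capacity is not there yet.
    var item = this.queue.dequeue();
    var verdict = this.limiter.canProceed(item.estimatedTokens);
    if (!verdict.allowed) {
      this.queue.enqueue(item, "critical"); // keep it at the front-most level
      break;
    }
    item.run(); // performs the actual API call
  }
};
QueueDrainer.prototype.start = function () {
  var self = this;
  this.interval = setInterval(function () { self.tick(); }, 250);
};
QueueDrainer.prototype.stop = function () {
  clearInterval(this.interval);
};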
Handling 429 Responses with Exponential Backoff
No matter how good your client-side rate limiter is, you will still occasionally receive 429 responses. The API might have reduced your limits temporarily, another client might be sharing your organization's quota, or your token estimation might be off. You need robust retry logic:
function RetryHandler(options) {
this.maxRetries = options.maxRetries || 5;
this.baseDelay = options.baseDelay || 1000;
this.maxDelay = options.maxDelay || 60000;
this.jitterFactor = options.jitterFactor || 0.1;
}
RetryHandler.prototype.getDelay = function (attempt, retryAfterHeader) {
if (retryAfterHeader) {
var retryAfter = parseInt(retryAfterHeader, 10);
if (!isNaN(retryAfter)) {
return retryAfter * 1000;
}
}
var exponentialDelay = this.baseDelay * Math.pow(2, attempt);
var jitter = exponentialDelay * this.jitterFactor * Math.random();
var delay = exponentialDelay + jitter;
return Math.min(delay, this.maxDelay);
};
RetryHandler.prototype.execute = function (fn, callback, attempt) {
var self = this;
attempt = attempt || 0;
fn(function (err, result, response) {
if (!err) {
return callback(null, result);
}
if (err.status !== 429 || attempt >= self.maxRetries) {
return callback(err);
}
var retryAfter = response && response.headers
? response.headers["retry-after"]
: null;
var delay = self.getDelay(attempt, retryAfter);
console.log(
"[RetryHandler] 429 received, retrying in " + delay + "ms (attempt " +
(attempt + 1) + "/" + self.maxRetries + ")"
);
setTimeout(function () {
self.execute(fn, callback, attempt + 1);
}, delay);
});
};
Always respect the Retry-After header when it is present. The provider is telling you exactly when to try again. Ignoring it and using your own shorter delay is a fast path to getting your API key suspended.
The jitter factor is not optional. Without jitter, all your retrying clients will hammer the API simultaneously when the backoff period expires, creating a thundering herd problem.
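Usage looks roughly like this; callModel and messages are placeholders, where callModel stands in for whatever function performs the HTTP request and invokes its callback as (err, result, response) with err.status set on HTTP failures.
var handler = new RetryHandler({ maxRetries: 5, baseDelay: 1000 });
handler.execute(function (done) {
  // done must be invoked as done(err, result, response)
  callModel({ model: "gpt-4o", messages: messages }, done);
}, function (err, result) {
  if (err) {
    return console.error("Request failed after retries:", err.status || err);
  }
  console.log(result.choices[0].message.content);
});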
Distributing Requests Across Multiple API Keys
For high-throughput applications, a single API key will not cut it. You can pool multiple API keys and distribute requests across them, effectively multiplying your rate limits:
function ApiKeyPool(keys) {
this.keys = keys.map(function (key) {
return {
key: key,
limiter: new DualBucketLimiter({ rpm: 500, tpm: 30000 }),
active: true,
cooldownUntil: 0
};
});
this.currentIndex = 0;
}
ApiKeyPool.prototype.getAvailableKey = function (estimatedTokens) {
var now = Date.now();
var startIndex = this.currentIndex;
for (var i = 0; i < this.keys.length; i++) {
var index = (startIndex + i) % this.keys.length;
var entry = this.keys[index];
if (!entry.active || entry.cooldownUntil > now) {
continue;
}
var result = entry.limiter.canProceed(estimatedTokens);
if (result.allowed) {
this.currentIndex = (index + 1) % this.keys.length;
return { key: entry.key, index: index };
}
}
return null;
};
ApiKeyPool.prototype.reportError = function (index, statusCode) {
if (statusCode === 429) {
this.keys[index].cooldownUntil = Date.now() + 30000;
}
if (statusCode === 401 || statusCode === 403) {
this.keys[index].active = false;
console.error("[ApiKeyPool] Key at index " + index + " deactivated: " + statusCode);
}
};
A word of caution: make sure your use of multiple API keys complies with the provider's terms of service. Some providers explicitly prohibit circumventing rate limits this way. Others allow multiple keys under a single organization. Know the rules before you implement this.
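If the terms do allow it, tying the pool into a request path might look like the sketch below; sendWithKey is a placeholder for your HTTP client, and the one-second retry delay when every key is exhausted is arbitrary.
var pool = new ApiKeyPool([process.env.KEY_A, process.env.KEY_B, process.env.KEY_C]);
function dispatch(estimatedTokens, payload, callback) {
  var slot = pool.getAvailableKey(estimatedTokens);
  if (!slot) {
    // Every key is rate limited or cooling down; try again shortly.
    return setTimeout(function () {
      dispatch(estimatedTokens, payload, callback);
    }, 1000);
  }
  sendWithKey(slot.key, payload, function (err, result, statusCode) {
    if (err || statusCode >= 400) {
      pool.reportError(slot.index, statusCode || 500);
    }
    callback(err, result);
  });
}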
Rate Limiting Per User in Multi-Tenant Applications
When you are building a SaaS product that wraps an LLM API, you need to rate limit your own users independently. You do not want one power user exhausting the quota for everyone:
function PerUserLimiter(options) {
this.limits = {
free: { rpm: 10, tpm: 5000, dailyRequests: 50 },
pro: { rpm: 60, tpm: 30000, dailyRequests: 1000 },
enterprise: { rpm: 300, tpm: 200000, dailyRequests: 10000 }
};
this.users = {};
this.dailyCounts = {};
}
PerUserLimiter.prototype.getLimiter = function (userId, tier) {
if (!this.users[userId]) {
var limits = this.limits[tier] || this.limits.free;
this.users[userId] = new SlidingWindowLimiter({
windowMs: 60000,
maxRequests: limits.rpm,
maxTokens: limits.tpm
});
this.dailyCounts[userId] = { count: 0, resetAt: this.getEndOfDay() };
}
return this.users[userId];
};
PerUserLimiter.prototype.getEndOfDay = function () {
var now = new Date();
var endOfDay = new Date(now);
endOfDay.setHours(23, 59, 59, 999);
return endOfDay.getTime();
};
PerUserLimiter.prototype.checkUser = function (userId, tier, estimatedTokens) {
var limiter = this.getLimiter(userId, tier);
var limits = this.limits[tier] || this.limits.free;
// Check daily limit
var daily = this.dailyCounts[userId];
if (Date.now() > daily.resetAt) {
daily.count = 0;
daily.resetAt = this.getEndOfDay();
}
if (daily.count >= limits.dailyRequests) {
return {
allowed: false,
reason: "daily_limit",
message: "Daily request limit reached. Resets at midnight UTC."
};
}
var result = limiter.canProceed(estimatedTokens);
if (!result.allowed) {
return {
allowed: false,
reason: result.reason,
waitMs: result.waitMs,
message: "Rate limit exceeded. Try again in " + Math.ceil(result.waitMs / 1000) + " seconds."
};
}
return { allowed: true };
};
PerUserLimiter.prototype.record = function (userId, tokens) {
var limiter = this.users[userId];
if (limiter) {
limiter.record(tokens);
}
if (this.dailyCounts[userId]) {
this.dailyCounts[userId].count += 1;
}
};
This gives you three layers of protection: per-minute request limits, per-minute token limits, and daily request caps. You can adjust these per pricing tier to create a natural upgrade incentive.
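In an Express app this check typically runs as middleware before the route handler ever touches the LLM API. The sketch below assumes your auth layer has already populated req.user, that request bodies follow the chat-completions shape, and it reuses the estimateTokens sketch from earlier.
var perUser = new PerUserLimiter();
function userRateLimit(req, res, next) {
  var userId = req.user ? req.user.id : "anonymous";
  var tier = req.user ? req.user.tier : "free";
  var estimated = estimateTokens(req.body.messages || []);
  var verdict = perUser.checkUser(userId, tier, estimated);
  if (!verdict.allowed) {
    // Tell well-behaved clients when to come back.
    res.set("Retry-After", String(Math.ceil((verdict.waitMs || 60000) / 1000)));
    return res.status(429).json({ error: verdict.message, reason: verdict.reason });
  }
  res.locals.estimatedTokens = estimated;
  next();
  // Remember to call perUser.record(userId, actualTokens) after a successful completion.
}
// app.post("/api/chat", userRateLimit, chatHandler);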
Monitoring Rate Limit Headroom from Response Headers
Most LLM providers return rate limit information in response headers. These are invaluable for adaptive rate limiting:
function HeaderMonitor() {
this.lastHeaders = {};
this.history = [];
}
HeaderMonitor.prototype.update = function (headers) {
var info = {
timestamp: Date.now(),
remainingRequests: parseInt(headers["x-ratelimit-remaining-requests"] || "0", 10),
remainingTokens: parseInt(headers["x-ratelimit-remaining-tokens"] || "0", 10),
limitRequests: parseInt(headers["x-ratelimit-limit-requests"] || "0", 10),
limitTokens: parseInt(headers["x-ratelimit-limit-tokens"] || "0", 10),
resetRequests: headers["x-ratelimit-reset-requests"] || "",
resetTokens: headers["x-ratelimit-reset-tokens"] || ""
};
this.lastHeaders = info;
this.history.push(info);
// Keep last 100 data points
if (this.history.length > 100) {
this.history.shift();
}
return info;
};
HeaderMonitor.prototype.getUtilization = function () {
var info = this.lastHeaders;
if (!info.limitRequests) {
return null;
}
return {
requestUtilization: 1 - (info.remainingRequests / info.limitRequests),
tokenUtilization: 1 - (info.remainingTokens / info.limitTokens),
requestsRemaining: info.remainingRequests,
tokensRemaining: info.remainingTokens
};
};
HeaderMonitor.prototype.shouldThrottle = function (threshold) {
var utilization = this.getUtilization();
if (!utilization) {
return false;
}
threshold = threshold || 0.8;
return utilization.requestUtilization > threshold ||
utilization.tokenUtilization > threshold;
};
I check utilization after every response. When you cross 80% utilization, start throttling proactively. Do not wait for the 429 to tell you that you have gone too far.
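Hooking the monitor into your client is a one-liner per response. This sketch assumes your HTTP layer exposes the raw headers object and that your provider uses OpenAI-style x-ratelimit-* header names; pauseQueue is a placeholder for whatever throttling action fits your system.
var monitor = new HeaderMonitor();
function onApiResponse(res) {
  monitor.update(res.headers);
  if (monitor.shouldThrottle(0.8)) {
    var util = monitor.getUtilization();
    console.warn("[HeaderMonitor] utilization above 80%", util);
    pauseQueue(5000); // placeholder: back off the drain loop for a few seconds
  }
}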
Adaptive Rate Limiting Based on Current Usage
Static rate limits are a starting point, but adaptive limiting adjusts throughput based on real-time feedback from the API:
function AdaptiveRateLimiter(options) {
this.baseRpm = options.rpm;
this.baseTpm = options.tpm;
this.currentRpm = options.rpm;
this.currentTpm = options.tpm;
this.monitor = new HeaderMonitor();
this.limiter = new DualBucketLimiter({ rpm: this.currentRpm, tpm: this.currentTpm });
this.adjustInterval = null;
}
AdaptiveRateLimiter.prototype.start = function () {
var self = this;
this.adjustInterval = setInterval(function () {
self.adjust();
}, 5000);
};
AdaptiveRateLimiter.prototype.stop = function () {
if (this.adjustInterval) {
clearInterval(this.adjustInterval);
}
};
AdaptiveRateLimiter.prototype.adjust = function () {
var utilization = this.monitor.getUtilization();
if (!utilization) {
return;
}
var maxUtil = Math.max(utilization.requestUtilization, utilization.tokenUtilization);
if (maxUtil > 0.9) {
// Aggressively reduce
this.currentRpm = Math.max(1, Math.floor(this.currentRpm * 0.5));
this.currentTpm = Math.max(100, Math.floor(this.currentTpm * 0.5));
console.log("[Adaptive] High utilization (" + (maxUtil * 100).toFixed(1) +
"%), reducing to RPM=" + this.currentRpm + " TPM=" + this.currentTpm);
} else if (maxUtil > 0.7) {
// Slight reduction
this.currentRpm = Math.max(1, Math.floor(this.currentRpm * 0.8));
this.currentTpm = Math.max(100, Math.floor(this.currentTpm * 0.8));
} else if (maxUtil < 0.3) {
// Ramp up towards base limits
this.currentRpm = Math.min(this.baseRpm, Math.ceil(this.currentRpm * 1.2));
this.currentTpm = Math.min(this.baseTpm, Math.ceil(this.currentTpm * 1.2));
console.log("[Adaptive] Low utilization, increasing to RPM=" +
this.currentRpm + " TPM=" + this.currentTpm);
}
this.limiter = new DualBucketLimiter({ rpm: this.currentRpm, tpm: this.currentTpm });
};
AdaptiveRateLimiter.prototype.onResponse = function (headers) {
this.monitor.update(headers);
};
Adaptive limiting is especially important when you share an organization-level quota with other services. Your allocation is not fixed — it shifts based on what other consumers are doing.
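Wiring the adaptive limiter in is mostly a matter of gating on adaptive.limiter (it is rebuilt on every adjustment, so always read it through the instance) and reporting headers back after each call. A rough sketch, where doCall is a placeholder that performs the request and yields the raw response:
var adaptive = new AdaptiveRateLimiter({ rpm: 500, tpm: 30000 });
adaptive.start();
function sendAdaptive(estimatedTokens, doCall) {
  var verdict = adaptive.limiter.canProceed(estimatedTokens);
  if (!verdict.allowed) {
    // Not enough headroom right now; try again after the suggested wait.
    return setTimeout(function () {
      sendAdaptive(estimatedTokens, doCall);
    }, verdict.waitMs || 500);
  }
  doCall(function (response) {
    adaptive.onResponse(response.headers);
  });
}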
Request Coalescing for Duplicate Prompts
In many applications, multiple users or processes submit identical prompts within a short window. Instead of sending duplicate requests to the API, coalesce them:
function RequestCoalescer(options) {
this.pending = {};
this.ttl = options.ttl || 5000; // ms to wait for duplicates
}
RequestCoalescer.prototype.generateKey = function (model, messages) {
var crypto = require("crypto");
var payload = JSON.stringify({ model: model, messages: messages });
return crypto.createHash("sha256").update(payload).digest("hex");
};
RequestCoalescer.prototype.execute = function (model, messages, apiCallFn, callback) {
var key = this.generateKey(model, messages);
var self = this;
if (this.pending[key]) {
// Duplicate request — attach to existing flight
this.pending[key].callbacks.push(callback);
return;
}
this.pending[key] = {
callbacks: [callback],
timestamp: Date.now()
};
apiCallFn(model, messages, function (err, result) {
var entry = self.pending[key];
delete self.pending[key];
if (entry) {
for (var i = 0; i < entry.callbacks.length; i++) {
entry.callbacks[i](err, result);
}
}
});
};
I have seen this save 15-30% of API calls in chatbot applications where users frequently ask similar questions. Pair this with a response cache for even bigger savings.
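A response cache can reuse the same SHA-256 key as the coalescer. The sketch below is a minimal in-memory TTL cache; the 60-second default TTL is an assumption, and you should only cache prompts where a slightly stale answer is acceptable (FAQs, classification tasks), never personalized chat.
function ResponseCache(ttlMs) {
  this.ttlMs = ttlMs || 60000;
  this.entries = {};
}
ResponseCache.prototype.get = function (key) {
  var entry = this.entries[key];
  if (!entry) {
    return null;
  }
  if (Date.now() - entry.storedAt > this.ttlMs) {
    delete this.entries[key];
    return null;
  }
  return entry.value;
};
ResponseCache.prototype.set = function (key, value) {
  this.entries[key] = { value: value, storedAt: Date.now() };
};
// Check the cache before coalescing:
// var cached = cache.get(key); if (cached) return callback(null, cached);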
Rate Limiting with Redis for Distributed Systems
When your application runs across multiple server instances, in-memory rate limiting breaks down. Each instance tracks its own counters independently, so your combined throughput can exceed the provider's limits by a factor equal to your instance count. Redis solves this:
var Redis = require("ioredis");
function RedisRateLimiter(options) {
this.redis = new Redis(options.redisUrl);
this.prefix = options.prefix || "ratelimit";
this.windowMs = options.windowMs || 60000;
this.maxRequests = options.maxRequests;
this.maxTokens = options.maxTokens;
}
RedisRateLimiter.prototype.check = function (key, estimatedTokens, callback) {
var self = this;
var now = Date.now();
var windowStart = now - self.windowMs;
var requestKey = self.prefix + ":req:" + key;
var tokenKey = self.prefix + ":tok:" + key;
var multi = self.redis.multi();
// Remove old entries
multi.zremrangebyscore(requestKey, "-inf", windowStart);
multi.zremrangebyscore(tokenKey, "-inf", windowStart);
// Count current entries
multi.zcard(requestKey);
multi.zrangebyscore(tokenKey, windowStart, "+inf");
multi.exec(function (err, results) {
if (err) {
return callback(err);
}
var currentRequests = results[2][1];
var tokenEntries = results[3][1] || [];
var currentTokens = 0;
for (var i = 0; i < tokenEntries.length; i++) {
currentTokens += parseInt(tokenEntries[i], 10) || 0;
}
if (currentRequests >= self.maxRequests) {
return callback(null, { allowed: false, reason: "rpm" });
}
if (currentTokens + estimatedTokens > self.maxTokens) {
return callback(null, { allowed: false, reason: "tpm" });
}
// Record this request. Sorted-set members must be unique, so encode the token
// count at the front of a unique member string; the parseInt in the summing loop
// above still reads the leading number.
var recordMulti = self.redis.multi();
recordMulti.zadd(requestKey, now, now + ":" + Math.random());
recordMulti.zadd(tokenKey, now, estimatedTokens + ":" + now + ":" + Math.random());
recordMulti.pexpire(requestKey, self.windowMs);
recordMulti.pexpire(tokenKey, self.windowMs);
recordMulti.exec(function (recordErr) {
if (recordErr) {
return callback(recordErr);
}
callback(null, { allowed: true });
});
});
};
The sorted set approach gives you a true sliding window. Each entry is scored by timestamp, so cleanup is efficient and the window slides smoothly rather than resetting on fixed boundaries. One caveat: the check and the record happen in two separate round trips, so under heavy concurrency two instances can occasionally both pass the check before either records. If you need strict enforcement, fold the whole check-and-record sequence into a single Lua script executed with EVAL.
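Using it looks like this sketch, assuming a local Redis and the Tier 1 limits from the table above; the "org:primary" key is just an example of scoping the counters to a shared organization quota.
var limiter = new RedisRateLimiter({
  redisUrl: process.env.REDIS_URL || "redis://127.0.0.1:6379",
  windowMs: 60000,
  maxRequests: 500,
  maxTokens: 30000
});
limiter.check("org:primary", 1200, function (err, verdict) {
  if (err) {
    return console.error("[RedisRateLimiter] check failed", err);
  }
  if (!verdict.allowed) {
    return console.log("Throttled (" + verdict.reason + "), queue or retry later");
  }
  // Safe to send the request now.
});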
Complete Working Example
Here is a production-ready rate limiting middleware that combines everything covered above:
var crypto = require("crypto");
// ---- Token Bucket ----
function TokenBucket(options) {
this.maxTokens = options.maxTokens;
this.refillRate = options.refillRate;
this.tokens = options.maxTokens;
this.lastRefill = Date.now();
}
TokenBucket.prototype.refill = function () {
var now = Date.now();
var elapsed = now - this.lastRefill;
this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
this.lastRefill = now;
};
TokenBucket.prototype.tryConsume = function (count) {
this.refill();
if (this.tokens >= count) {
this.tokens -= count;
return true;
}
return false;
};
// ---- Priority Queue ----
function PriorityRequestQueue() {
this.levels = { critical: [], high: [], normal: [], low: [] };
this.order = ["critical", "high", "normal", "low"];
}
PriorityRequestQueue.prototype.push = function (item, priority) {
var level = this.levels[priority] ? priority : "normal";
this.levels[level].push(item);
};
PriorityRequestQueue.prototype.shift = function () {
for (var i = 0; i < this.order.length; i++) {
if (this.levels[this.order[i]].length > 0) {
return this.levels[this.order[i]].shift();
}
}
return null;
};
PriorityRequestQueue.prototype.size = function () {
var n = 0;
for (var i = 0; i < this.order.length; i++) {
n += this.levels[this.order[i]].length;
}
return n;
};
// ---- Per-User Quotas ----
function UserQuotaTracker(tiers) {
this.tiers = tiers || {
free: { dailyRequests: 50 },
pro: { dailyRequests: 1000 },
enterprise: { dailyRequests: 10000 }
};
this.usage = {};
}
UserQuotaTracker.prototype.check = function (userId, tier) {
var limits = this.tiers[tier] || this.tiers.free;
if (!this.usage[userId]) {
this.usage[userId] = { count: 0, resetAt: endOfDayUTC() };
}
var u = this.usage[userId];
if (Date.now() > u.resetAt) {
u.count = 0;
u.resetAt = endOfDayUTC();
}
return u.count < limits.dailyRequests;
};
UserQuotaTracker.prototype.record = function (userId) {
if (this.usage[userId]) {
this.usage[userId].count += 1;
}
};
function endOfDayUTC() {
var d = new Date();
d.setUTCHours(23, 59, 59, 999);
return d.getTime();
}
// ---- Main Middleware ----
function LLMRateLimiter(options) {
this.rpmBucket = new TokenBucket({
maxTokens: options.rpm || 500,
refillRate: (options.rpm || 500) / 60000
});
this.tpmBucket = new TokenBucket({
maxTokens: options.tpm || 30000,
refillRate: (options.tpm || 30000) / 60000
});
this.queue = new PriorityRequestQueue();
this.quotas = new UserQuotaTracker(options.tiers);
this.maxRetries = options.maxRetries || 5;
this.baseDelay = options.baseDelay || 1000;
this.processing = false;
this.coalesceMap = {};
}
LLMRateLimiter.prototype.estimateTokens = function (messages) {
var text = "";
for (var i = 0; i < messages.length; i++) {
text += messages[i].content || "";
}
return Math.ceil(text.length / 4) + 50; // rough estimate plus overhead
};
LLMRateLimiter.prototype.coalesceKey = function (model, messages) {
return crypto.createHash("sha256")
.update(JSON.stringify({ m: model, msg: messages }))
.digest("hex");
};
LLMRateLimiter.prototype.submit = function (request, callback) {
// request: { model, messages, userId, tier, priority, apiCallFn }
var userId = request.userId || "anonymous";
var tier = request.tier || "free";
if (!this.quotas.check(userId, tier)) {
return callback({
status: 429,
message: "Daily quota exceeded for user " + userId
});
}
// Check coalescing
var key = this.coalesceKey(request.model, request.messages);
if (this.coalesceMap[key]) {
this.coalesceMap[key].push(callback);
return;
}
this.coalesceMap[key] = [callback];
var self = this;
var wrappedCallback = function (err, result) {
var callbacks = self.coalesceMap[key] || [];
delete self.coalesceMap[key];
for (var i = 0; i < callbacks.length; i++) {
callbacks[i](err, result);
}
};
this.queue.push({
model: request.model,
messages: request.messages,
userId: userId,
estimatedTokens: this.estimateTokens(request.messages),
apiCallFn: request.apiCallFn,
callback: wrappedCallback,
retries: 0
}, request.priority || "normal");
this.processQueue();
};
LLMRateLimiter.prototype.processQueue = function () {
if (this.processing) {
return;
}
this.processing = true;
var self = this;
function next() {
var item = self.queue.shift();
if (!item) {
self.processing = false;
return;
}
var rpmOk = self.rpmBucket.tryConsume(1);
var tpmOk = rpmOk ? self.tpmBucket.tryConsume(item.estimatedTokens) : false;
if (!rpmOk || !tpmOk) {
if (rpmOk && !tpmOk) {
self.rpmBucket.tokens += 1;
}
// Not enough capacity yet: re-queue at the highest priority level and retry shortly
self.queue.push(item, "critical");
setTimeout(function () {
next();
}, 500);
return;
}
item.apiCallFn(item.model, item.messages, function (err, result, response) {
if (err && err.status === 429 && item.retries < self.maxRetries) {
var retryAfter = (response && response.headers && response.headers["retry-after"])
? parseInt(response.headers["retry-after"], 10) * 1000
: self.baseDelay * Math.pow(2, item.retries) + Math.random() * 500;
item.retries += 1;
console.log("[LLMRateLimiter] 429 for user " + item.userId +
", retry " + item.retries + " in " + retryAfter + "ms");
setTimeout(function () {
self.queue.push(item, "high");
if (!self.processing) {
self.processQueue();
}
}, retryAfter);
next();
return;
}
if (!err) {
self.quotas.record(item.userId);
}
item.callback(err, result);
next();
});
}
next();
};
module.exports = LLMRateLimiter;
Usage with Express:
var express = require("express");
var LLMRateLimiter = require("./llm-rate-limiter");
var app = express();
app.use(express.json());
var limiter = new LLMRateLimiter({
rpm: 500,
tpm: 30000,
maxRetries: 5,
tiers: {
free: { dailyRequests: 50 },
pro: { dailyRequests: 1000 },
enterprise: { dailyRequests: 10000 }
}
});
function callOpenAI(model, messages, callback) {
var https = require("https");
var data = JSON.stringify({ model: model, messages: messages });
var options = {
hostname: "api.openai.com",
path: "/v1/chat/completions",
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": "Bearer " + process.env.OPENAI_API_KEY,
"Content-Length": Buffer.byteLength(data)
}
};
var req = https.request(options, function (res) {
var body = "";
res.on("data", function (chunk) { body += chunk; });
res.on("end", function () {
if (res.statusCode === 429) {
return callback({ status: 429 }, null, res);
}
if (res.statusCode !== 200) {
return callback({ status: res.statusCode, body: body }, null, res);
}
var parsed;
try {
parsed = JSON.parse(body);
} catch (parseErr) {
return callback(parseErr, null, res);
}
callback(null, parsed, res);
});
});
req.on("error", function (err) { callback(err); });
req.write(data);
req.end();
}
app.post("/api/chat", function (req, res) {
limiter.submit({
model: req.body.model || "gpt-4o",
messages: req.body.messages,
userId: req.user ? req.user.id : "anonymous",
tier: req.user ? req.user.tier : "free",
priority: req.body.priority || "normal",
apiCallFn: callOpenAI
}, function (err, result) {
if (err) {
return res.status(err.status || 500).json({ error: err.message || "LLM request failed" });
}
res.json(result);
});
});
app.listen(3000, function () {
console.log("Server running on port 3000");
});
Common Issues & Troubleshooting
1. "Rate limit exceeded" despite client-side limiting
Error: 429 Too Many Requests
{"error":{"message":"Rate limit reached for gpt-4o in organization org-xxx on requests per min (RPM): Limit 500, Used 500, Requested 1."}}
This happens when your token estimation is significantly off, or when other services in your organization are consuming the shared quota. Fix it by reading the actual limits from response headers and feeding them into your adaptive limiter, rather than relying on hardcoded values.
2. Queue memory leak under sustained load
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
If your queue grows without bound because incoming requests outpace your rate limit, you will eventually run out of memory. Add a maximum queue size and reject requests with a 503 when it is full:
if (self.queue.size() > 10000) {
return callback({ status: 503, message: "Service overloaded, try again later" });
}
3. Thundering herd after backoff window
[LLMRateLimiter] 429 for user user_123, retry 1 in 1000ms
[LLMRateLimiter] 429 for user user_456, retry 1 in 1000ms
[LLMRateLimiter] 429 for user user_789, retry 1 in 1000ms
// All three retry at exactly the same time, causing another 429
This is the classic thundering herd. All your retries fire simultaneously because they received the 429 at the same moment and calculated the same backoff. The fix is jitter — always add a random component to your delay. Replace Math.pow(2, attempt) * 1000 with Math.pow(2, attempt) * 1000 + Math.random() * 1000.
4. Redis connection failures silently disable rate limiting
[ioredis] Unhandled error event: Error: connect ECONNREFUSED 127.0.0.1:6379
When Redis goes down, your distributed rate limiter stops enforcing limits. Every instance thinks it is the only one, and your combined throughput multiplies by your instance count. Always implement a fallback to in-memory limiting when Redis is unavailable:
// memoryFallback is assumed to be an in-memory limiter (for example, a
// SlidingWindowLimiter per key) that you construct alongside the Redis client.
RedisRateLimiter.prototype.checkWithFallback = function (key, tokens, callback) {
  var self = this;
  if (self.redis.status !== "ready") {
    console.warn("[RateLimiter] Redis unavailable, falling back to in-memory limits");
    return self.memoryFallback.check(key, tokens, callback);
  }
  self.check(key, tokens, callback);
};
5. Token estimation drift causes premature throttling
Warning: TPM bucket exhausted but only 60% of actual API limit used
If your character-based token estimation consistently overestimates, you are leaving capacity on the table. Track the actual usage.total_tokens from API responses and use a rolling average to calibrate your estimator:
var calibrationFactor = 1; // multiply raw estimates by this before rate limiting
function calibrateEstimate(estimated, actual) {
  // Exponential moving average of the actual-to-estimated ratio
  var alpha = 0.1;
  calibrationFactor = calibrationFactor * (1 - alpha) + (actual / estimated) * alpha;
}
Best Practices
Always implement dual-dimension limiting. Track both RPM and TPM simultaneously. Tracking only one will lead to hitting the other limit unexpectedly, and you will spend hours debugging intermittent 429s.
Respect the Retry-After header unconditionally. When the provider tells you to wait 30 seconds, wait 30 seconds. Retrying sooner is not clever — it counts against your rate limit and may trigger escalating penalties.
Add jitter to every retry delay. Without randomization, coordinated retries from multiple clients or coroutines will create repeated spikes. A simple Math.random() * baseDelay appended to your exponential backoff eliminates this entirely.
Set hard caps on queue depth. An unbounded queue will eventually consume all available memory under sustained load. Define a maximum queue size and return 503 immediately when it is exceeded. Your users would rather get a fast rejection than wait 10 minutes for a queued response.
Monitor and alert on rate limit utilization, not just errors. By the time you are seeing 429 errors in your logs, the user experience has already degraded. Set up alerts at 70% utilization so you can scale your key pool or reduce traffic before users are impacted.
Cache aggressively for repeated prompts. Many LLM applications see significant prompt duplication — FAQs, system prompts appended to every request, identical classification tasks. A simple SHA-256 hash of the prompt plus a TTL cache can save 20-40% of your API budget.
Implement graceful degradation by priority. When approaching rate limits, shed low-priority traffic first. Batch jobs, pre-computation tasks, and analytics queries should yield to real-time user requests. A priority queue makes this trivial.
Use separate API keys for separate concerns. Do not share a single key between your real-time chat feature and your background batch processing pipeline. Isolate them so a batch job spike does not degrade your user-facing latency.
Log every 429 with context. Include the user ID, estimated tokens, queue depth, and current utilization in every rate limit log entry. When you are investigating a production incident at 2 AM, you will be grateful for the context.
Test your rate limiter under realistic load. Unit tests that verify the token bucket math are necessary but insufficient. Run load tests that simulate your actual traffic patterns — bursty user chat, steady batch processing, and the mix of both — to validate your configuration.