Rate Limiting AI Features Per User

Implement per-user rate limiting for AI features with token budgets, tier management, and usage dashboards in Node.js.

Rate limiting AI features per user is one of the most important things you will build when shipping LLM-powered functionality to production. Unlike traditional API rate limiting where you count requests, AI features burn real money on every call — a single uncapped user can rack up thousands of dollars in token costs in minutes. This article walks through building a complete per-user rate limiting system in Node.js with Redis-backed token buckets, tiered plans, usage tracking, and the middleware that ties it all together.

Prerequisites

  • Node.js v18 or later installed
  • Redis server running locally or accessible remotely
  • Basic understanding of Express.js middleware
  • Familiarity with how LLM APIs charge by token (input and output tokens)
  • A working Express.js application with user authentication in place

Install the dependencies we will use throughout this article:

npm install express ioredis uuid

Why Per-User Rate Limiting Is Critical for AI Features

Traditional rate limiting protects your server from being overwhelmed. Per-user rate limiting for AI features protects your bank account. Here is the difference: a standard REST endpoint that queries a database costs you fractions of a cent per request. A single GPT-4 call with a large context window can cost 10 to 30 cents. Multiply that by a thousand requests from one aggressive user and you have a serious problem.

There are three core reasons you need per-user limits specifically for AI features:

Cost protection. LLM providers charge by the token. Without per-user caps, a single bad actor or even an enthusiastic legitimate user can exhaust your monthly AI budget in hours. I have seen this happen on a production system where a developer left a retry loop running against an AI summarization endpoint. The bill was over $4,000 before anyone noticed.

Fair usage. If ten users share your AI features and one of them consumes 90% of the token budget, the other nine get degraded service. Per-user limits ensure everyone gets a reasonable share.

Predictable margins. If you are charging users a subscription fee, you need to know that each tier stays profitable. A $20/month plan where a user can consume $200 in AI tokens is not a business — it is a charity.

Designing Rate Limit Tiers

Before writing code, you need to define your tiers. Here is a practical starting point for an application with AI chat, AI search, and AI content generation features:

var TIER_CONFIG = {
  free: {
    dailyTokenBudget: 10000,
    monthlyTokenBudget: 100000,
    limits: {
      chat: { requestsPerMinute: 5, maxTokensPerRequest: 1000 },
      search: { requestsPerMinute: 10, maxTokensPerRequest: 500 },
      generation: { requestsPerMinute: 2, maxTokensPerRequest: 2000 }
    }
  },
  pro: {
    dailyTokenBudget: 100000,
    monthlyTokenBudget: 2000000,
    limits: {
      chat: { requestsPerMinute: 20, maxTokensPerRequest: 4000 },
      search: { requestsPerMinute: 40, maxTokensPerRequest: 2000 },
      generation: { requestsPerMinute: 10, maxTokensPerRequest: 8000 }
    }
  },
  enterprise: {
    dailyTokenBudget: 1000000,
    monthlyTokenBudget: 50000000,
    limits: {
      chat: { requestsPerMinute: 100, maxTokensPerRequest: 8000 },
      search: { requestsPerMinute: 200, maxTokensPerRequest: 4000 },
      generation: { requestsPerMinute: 50, maxTokensPerRequest: 16000 }
    }
  }
};

Notice that each AI feature has its own limits. This is deliberate. Content generation is far more expensive than search, so it gets tighter request limits and a higher token-per-request allowance. You do not want a user burning through their entire daily budget on one long generation when they might also need chat and search.

Token Bucket Algorithm for Per-User Limits

The token bucket algorithm is ideal for per-user AI rate limiting because it allows bursting while enforcing an average rate. The concept is simple: each user has a bucket that fills with tokens at a steady rate. Each request drains tokens from the bucket. If the bucket is empty, the request is denied.

For AI features, we use a modified token bucket where the "tokens" in the bucket represent LLM tokens (or a proxy for cost), not just request counts.

var Redis = require("ioredis");
var redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");

function TokenBucket(userId, feature, tierConfig) {
  this.userId = userId;
  this.feature = feature;
  this.config = tierConfig;
  this.bucketKey = "tb:" + userId + ":" + feature;
  this.lastRefillKey = "tb_refill:" + userId + ":" + feature;
}

TokenBucket.prototype.consume = function (tokensNeeded, callback) {
  var self = this;
  var now = Date.now();
  var maxTokens = self.config.limits[self.feature].maxTokensPerRequest;
  var refillRate = self.config.limits[self.feature].requestsPerMinute;

  if (tokensNeeded > maxTokens) {
    return callback(null, {
      allowed: false,
      reason: "Request exceeds maximum tokens per request (" + maxTokens + ")",
      remaining: 0
    });
  }

  // Note: this read-modify-write is not atomic across concurrent requests;
  // see the troubleshooting section for an atomic Lua-script variant.
  redis.multi()
    .get(self.bucketKey)
    .get(self.lastRefillKey)
    .exec(function (err, results) {
      if (err) return callback(err);

      // results[i] is [err, value]; a missing key means a fresh, full bucket.
      // Avoid `|| maxTokens` here: parseFloat("0") is falsy, so an empty
      // bucket would incorrectly reset to full.
      var rawTokens = results[0][1];
      var currentTokens = rawTokens === null ? maxTokens : parseFloat(rawTokens);
      var lastRefill = parseInt(results[1][1], 10) || now;

      // Refill continuously: requestsPerMinute full-size requests' worth
      // of LLM tokens accrue per minute
      var elapsed = (now - lastRefill) / 1000;
      var tokensToAdd = elapsed * (refillRate / 60) * maxTokens;
      currentTokens = Math.min(maxTokens, currentTokens + tokensToAdd);

      if (currentTokens < tokensNeeded) {
        var waitTime = Math.ceil((tokensNeeded - currentTokens) / ((refillRate / 60) * maxTokens));
        return callback(null, {
          allowed: false,
          reason: "Rate limit exceeded for " + self.feature,
          remaining: Math.floor(currentTokens),
          retryAfterSeconds: waitTime
        });
      }

      currentTokens = currentTokens - tokensNeeded;

      redis.multi()
        .set(self.bucketKey, currentTokens.toString(), "EX", 120)
        .set(self.lastRefillKey, now.toString(), "EX", 120)
        .exec(function (err2) {
          if (err2) return callback(err2);
          callback(null, {
            allowed: true,
            remaining: Math.floor(currentTokens)
          });
        });
    });
};

Daily and Monthly Budget Counters with Redis

Token buckets handle request-level rate limiting, but you also need daily and monthly budget caps. Calendar-aligned counters with carefully computed TTLs handle this well: each Redis key expires exactly at the end of the current day or month, so every user's budget resets at a predictable boundary.

function UsageCounter(userId) {
  this.userId = userId;
}

UsageCounter.prototype.recordUsage = function (feature, tokensUsed, callback) {
  var self = this;
  var now = Date.now();
  var dailyKey = "usage:daily:" + self.userId;
  var monthlyKey = "usage:monthly:" + self.userId;
  var featureKey = "usage:feature:" + self.userId + ":" + feature;
  var detailEntry = JSON.stringify({
    feature: feature,
    tokens: tokensUsed,
    timestamp: now
  });

  var todayEnd = new Date();
  todayEnd.setHours(23, 59, 59, 999);
  var dailyTTL = Math.ceil((todayEnd.getTime() - now) / 1000);

  var monthEnd = new Date();
  monthEnd.setMonth(monthEnd.getMonth() + 1, 1);
  monthEnd.setHours(0, 0, 0, 0);
  var monthlyTTL = Math.ceil((monthEnd.getTime() - now) / 1000);

  redis.multi()
    .incrby(dailyKey, tokensUsed)
    .expire(dailyKey, dailyTTL)
    .incrby(monthlyKey, tokensUsed)
    .expire(monthlyKey, monthlyTTL)
    .incrby(featureKey, tokensUsed)
    .expire(featureKey, monthlyTTL)
    .lpush("usage:log:" + self.userId, detailEntry)
    .ltrim("usage:log:" + self.userId, 0, 999)
    .expire("usage:log:" + self.userId, monthlyTTL)
    .exec(function (err) {
      if (err) return callback(err);
      callback(null);
    });
};

UsageCounter.prototype.getUsage = function (callback) {
  var self = this;
  redis.multi()
    .get("usage:daily:" + self.userId)
    .get("usage:monthly:" + self.userId)
    .get("usage:feature:" + self.userId + ":chat")
    .get("usage:feature:" + self.userId + ":search")
    .get("usage:feature:" + self.userId + ":generation")
    .exec(function (err, results) {
      if (err) return callback(err);
      callback(null, {
        dailyTokensUsed: parseInt(results[0][1]) || 0,
        monthlyTokensUsed: parseInt(results[1][1]) || 0,
        featureBreakdown: {
          chat: parseInt(results[2][1]) || 0,
          search: parseInt(results[3][1]) || 0,
          generation: parseInt(results[4][1]) || 0
        }
      });
    });
};

Limiting Based on Tokens Consumed, Not Just Requests

This is the single most important distinction for AI rate limiting. A user who sends "Summarize this in one sentence" and a user who sends "Write a 5000-word essay on quantum computing" should not count the same against their limits. You need to rate-limit on actual token consumption.

The pattern is: estimate tokens before the call, enforce the limit, then record actual tokens after the call completes.

function estimateTokens(text) {
  // English text averages roughly 4 characters per token. We divide by
  // 3.5 instead of 4 to deliberately over-estimate - it is safer to
  // reserve too much budget than too little.
  return Math.ceil((text || "").length / 3.5);
}

function checkBudget(userId, tier, estimatedTokens, callback) {
  var counter = new UsageCounter(userId);
  counter.getUsage(function (err, usage) {
    if (err) return callback(err);

    var tierConfig = TIER_CONFIG[tier];
    var dailyRemaining = tierConfig.dailyTokenBudget - usage.dailyTokensUsed;
    var monthlyRemaining = tierConfig.monthlyTokenBudget - usage.monthlyTokensUsed;

    if (estimatedTokens > dailyRemaining) {
      return callback(null, {
        allowed: false,
        reason: "Daily token budget exceeded",
        dailyUsed: usage.dailyTokensUsed,
        dailyLimit: tierConfig.dailyTokenBudget,
        resetsIn: getTimeUntilMidnight()
      });
    }

    if (estimatedTokens > monthlyRemaining) {
      return callback(null, {
        allowed: false,
        reason: "Monthly token budget exceeded",
        monthlyUsed: usage.monthlyTokensUsed,
        monthlyLimit: tierConfig.monthlyTokenBudget
      });
    }

    // Soft limit warning at 80% usage
    var dailyPercent = usage.dailyTokensUsed / tierConfig.dailyTokenBudget;
    var monthlyPercent = usage.monthlyTokensUsed / tierConfig.monthlyTokenBudget;

    callback(null, {
      allowed: true,
      warning: dailyPercent > 0.8 || monthlyPercent > 0.8,
      warningMessage: dailyPercent > 0.8
        ? "You have used " + Math.round(dailyPercent * 100) + "% of your daily budget"
        : monthlyPercent > 0.8
          ? "You have used " + Math.round(monthlyPercent * 100) + "% of your monthly budget"
          : null,
      dailyRemaining: dailyRemaining,
      monthlyRemaining: monthlyRemaining
    });
  });
}

function getTimeUntilMidnight() {
  var now = new Date();
  var midnight = new Date();
  midnight.setHours(24, 0, 0, 0);
  return Math.ceil((midnight.getTime() - now.getTime()) / 1000);
}

Communicating Limits to Users via Response Headers

Users and their client applications need to know where they stand. There is no finalized standard for rate limit headers, but the X-RateLimit-* family is widely recognized; set these on every AI feature response:

function setRateLimitHeaders(res, usage, tierConfig) {
  res.set("X-RateLimit-Limit-Daily", tierConfig.dailyTokenBudget.toString());
  res.set("X-RateLimit-Remaining-Daily", Math.max(0, tierConfig.dailyTokenBudget - usage.dailyTokensUsed).toString());
  res.set("X-RateLimit-Limit-Monthly", tierConfig.monthlyTokenBudget.toString());
  res.set("X-RateLimit-Remaining-Monthly", Math.max(0, tierConfig.monthlyTokenBudget - usage.monthlyTokensUsed).toString());
  res.set("X-RateLimit-Tokens-Used", usage.lastRequestTokens ? usage.lastRequestTokens.toString() : "0");

  if (usage.retryAfterSeconds) {
    res.set("Retry-After", usage.retryAfterSeconds.toString());
  }
}

Handling Rate Limit Exceeded: Queuing vs Rejecting

You have two strategies when a user hits their limit: reject immediately or queue the request. My strong recommendation is to reject for synchronous features and queue for asynchronous ones.

For chat and search (where the user is waiting), return a 429 immediately with a clear message. For content generation (where the result could be delivered later), consider a queue.

// In-memory queue for illustration only - see the troubleshooting section
// for why production queues belong in Redis or a proper message queue.
var requestQueue = {};

function enqueueOrReject(req, res, feature, callback) {
  var userId = req.user.id;

  if (feature === "generation") {
    if (!requestQueue[userId]) {
      requestQueue[userId] = [];
    }
    if (requestQueue[userId].length >= 3) {
      return res.status(429).json({
        error: "Queue full",
        message: "You have 3 generation requests already queued. Please wait for them to complete."
      });
    }

    var queueEntry = {
      id: require("uuid").v4(),
      feature: feature,
      payload: req.body,
      queuedAt: Date.now()
    };
    requestQueue[userId].push(queueEntry);

    res.status(202).json({
      message: "Request queued",
      queueId: queueEntry.id,
      position: requestQueue[userId].length,
      estimatedWait: requestQueue[userId].length * 30 + " seconds"
    });

    return callback(queueEntry);
  }

  // For synchronous features, reject immediately
  return res.status(429).json({
    error: "Rate limit exceeded",
    message: "You have exceeded your " + feature + " rate limit. Please try again later.",
    retryAfter: 60
  });
}

Soft Limits with Warnings Before Hard Cutoff

Nobody likes hitting a wall with no warning. Implement soft limits that warn users at 80% and 95% thresholds before the hard cutoff at 100%.

function evaluateSoftLimits(usage, tierConfig) {
  var dailyPercent = usage.dailyTokensUsed / tierConfig.dailyTokenBudget;
  var monthlyPercent = usage.monthlyTokensUsed / tierConfig.monthlyTokenBudget;

  var result = { level: "normal", warnings: [] };

  if (dailyPercent >= 1.0 || monthlyPercent >= 1.0) {
    result.level = "hard_limit";
    result.warnings.push("You have reached your usage limit. Requests will be denied until your budget resets.");
  } else if (dailyPercent >= 0.95 || monthlyPercent >= 0.95) {
    result.level = "critical";
    result.warnings.push("You are at 95% of your usage limit. Consider upgrading your plan.");
  } else if (dailyPercent >= 0.8 || monthlyPercent >= 0.8) {
    result.level = "warning";
    result.warnings.push("You have used over 80% of your usage limit for this period.");
  }

  return result;
}

Admin Overrides and Temporary Limit Increases

You will inevitably need to grant temporary limit increases for demos, sales prospects, or users who hit an edge case. Build this into your system from the start.

function AdminOverrides(userId) {
  this.userId = userId;
  this.overrideKey = "override:" + userId;
}

AdminOverrides.prototype.setTemporaryOverride = function (overrideConfig, expiresInHours, callback) {
  var self = this;
  var ttl = expiresInHours * 3600;
  var data = JSON.stringify({
    dailyTokenBudget: overrideConfig.dailyTokenBudget,
    monthlyTokenBudget: overrideConfig.monthlyTokenBudget,
    reason: overrideConfig.reason,
    grantedBy: overrideConfig.adminId,
    grantedAt: Date.now(),
    expiresAt: Date.now() + (ttl * 1000)
  });

  redis.set(self.overrideKey, data, "EX", ttl, function (err) {
    if (err) return callback(err);
    callback(null, { success: true, expiresInHours: expiresInHours });
  });
};

AdminOverrides.prototype.getEffectiveLimits = function (baseTierConfig, callback) {
  var self = this;
  redis.get(self.overrideKey, function (err, data) {
    if (err) return callback(err);
    if (!data) return callback(null, baseTierConfig);

    var override = JSON.parse(data);
    var effective = JSON.parse(JSON.stringify(baseTierConfig));
    if (override.dailyTokenBudget) {
      effective.dailyTokenBudget = override.dailyTokenBudget;
    }
    if (override.monthlyTokenBudget) {
      effective.monthlyTokenBudget = override.monthlyTokenBudget;
    }
    callback(null, effective);
  });
};

Tracking Usage in Real Time

Real-time usage tracking is essential for both the user dashboard and internal monitoring. We push every AI call into a Redis stream for real-time processing.

function trackUsageRealtime(userId, feature, tokensUsed, latencyMs) {
  var entry = {
    userId: userId,
    feature: feature,
    tokens: tokensUsed,
    latency: latencyMs,
    timestamp: Date.now()
  };

  // Add to Redis stream for real-time processing
  redis.xadd(
    "usage_stream",
    "MAXLEN", "~", "10000",
    "*",
    "data", JSON.stringify(entry),
    function (err) {
      if (err) {
        console.error("Failed to track usage:", err.message);
      }
    }
  );

  // Update real-time counters for dashboard
  var minuteKey = "rt:" + userId + ":" + Math.floor(Date.now() / 60000);
  redis.multi()
    .incrby(minuteKey, tokensUsed)
    .expire(minuteKey, 3600)
    .exec();
}
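
The producer above is only half the picture. Below is a minimal consumer sketch that tails the stream with a blocking XREAD on a dedicated connection; the handling logic is a placeholder you would replace with your own aggregation or alerting:

// Dedicated connection: blocking reads would stall the shared client
var consumerRedis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");

function consumeUsageStream(lastId) {
  consumerRedis.xread(
    "BLOCK", 5000, "STREAMS", "usage_stream", lastId,
    function (err, streams) {
      if (err) {
        console.error("Stream read error:", err.message);
        return setTimeout(function () { consumeUsageStream(lastId); }, 1000);
      }
      if (streams) {
        // Reply shape: [[streamKey, [[entryId, [field, value, ...]], ...]]]
        streams[0][1].forEach(function (entry) {
          lastId = entry[0];
          var data = JSON.parse(entry[1][1]);
          // Placeholder: aggregate for dashboards, alert on spikes, etc.
          console.log("usage:", data.userId, data.feature, data.tokens);
        });
      }
      consumeUsageStream(lastId);
    }
  );
}

consumeUsageStream("$"); // "$" means: start with entries added from now on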

Billing Integration: Usage-Based Pricing

When you tie AI usage to billing, you need an accurate, auditable record. Here is how to bridge rate limiting with a billing system:

function BillingTracker(userId) {
  this.userId = userId;
  this.billingKey = "billing:" + userId + ":" + getCurrentBillingPeriod();
}

function getCurrentBillingPeriod() {
  var now = new Date();
  return now.getFullYear() + "-" + String(now.getMonth() + 1).padStart(2, "0");
}

BillingTracker.prototype.recordBillableUsage = function (feature, inputTokens, outputTokens, model, callback) {
  var self = this;

  // Cost per 1K tokens (example rates)
  var costTable = {
    "gpt-4o": { input: 0.0025, output: 0.01 },
    "gpt-4o-mini": { input: 0.00015, output: 0.0006 },
    "claude-3-haiku": { input: 0.00025, output: 0.00125 }
  };

  var rates = costTable[model] || costTable["gpt-4o-mini"];
  var cost = ((inputTokens / 1000) * rates.input) + ((outputTokens / 1000) * rates.output);

  var record = JSON.stringify({
    feature: feature,
    inputTokens: inputTokens,
    outputTokens: outputTokens,
    model: model,
    cost: Math.round(cost * 1000000) / 1000000,
    timestamp: Date.now()
  });

  redis.multi()
    .rpush(self.billingKey, record)
    .expire(self.billingKey, 90 * 86400)
    .incrbyfloat("billing:total:" + self.userId, cost)
    .exec(function (err) {
      if (err) return callback(err);
      callback(null, { cost: cost, totalTokens: inputTokens + outputTokens });
    });
};

BillingTracker.prototype.getBillingSummary = function (callback) {
  var self = this;
  redis.lrange(self.billingKey, 0, -1, function (err, records) {
    if (err) return callback(err);

    var summary = { totalCost: 0, totalTokens: 0, byFeature: {}, byModel: {} };
    records.forEach(function (raw) {
      var record = JSON.parse(raw);
      summary.totalCost += record.cost;
      summary.totalTokens += record.inputTokens + record.outputTokens;

      if (!summary.byFeature[record.feature]) {
        summary.byFeature[record.feature] = { cost: 0, tokens: 0, requests: 0 };
      }
      summary.byFeature[record.feature].cost += record.cost;
      summary.byFeature[record.feature].tokens += record.inputTokens + record.outputTokens;
      summary.byFeature[record.feature].requests += 1;

      if (!summary.byModel[record.model]) {
        summary.byModel[record.model] = { cost: 0, tokens: 0 };
      }
      summary.byModel[record.model].cost += record.cost;
      summary.byModel[record.model].tokens += record.inputTokens + record.outputTokens;
    });

    summary.totalCost = Math.round(summary.totalCost * 100) / 100;
    callback(null, summary);
  });
};

Complete Working Example

Here is a complete Express.js middleware system that ties everything together. It is a solid foundation you can adapt for your own application; before production, pair it with the fail-open and atomicity fixes covered in the troubleshooting section below.

var express = require("express");
var Redis = require("ioredis");
var uuid = require("uuid");

var app = express();
var redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");

app.use(express.json());

// ============================================================
// Tier Configuration
// ============================================================
var TIER_CONFIG = {
  free: {
    dailyTokenBudget: 10000,
    monthlyTokenBudget: 100000,
    limits: {
      chat: { requestsPerMinute: 5, maxTokensPerRequest: 1000 },
      search: { requestsPerMinute: 10, maxTokensPerRequest: 500 },
      generation: { requestsPerMinute: 2, maxTokensPerRequest: 2000 }
    }
  },
  pro: {
    dailyTokenBudget: 100000,
    monthlyTokenBudget: 2000000,
    limits: {
      chat: { requestsPerMinute: 20, maxTokensPerRequest: 4000 },
      search: { requestsPerMinute: 40, maxTokensPerRequest: 2000 },
      generation: { requestsPerMinute: 10, maxTokensPerRequest: 8000 }
    }
  },
  enterprise: {
    dailyTokenBudget: 1000000,
    monthlyTokenBudget: 50000000,
    limits: {
      chat: { requestsPerMinute: 100, maxTokensPerRequest: 8000 },
      search: { requestsPerMinute: 200, maxTokensPerRequest: 4000 },
      generation: { requestsPerMinute: 50, maxTokensPerRequest: 16000 }
    }
  }
};

// ============================================================
// Helper Functions
// ============================================================
function estimateTokens(text) {
  return Math.ceil((text || "").length / 3.5);
}

function getCurrentBillingPeriod() {
  var now = new Date();
  return now.getFullYear() + "-" + String(now.getMonth() + 1).padStart(2, "0");
}

function getTimeUntilMidnight() {
  var now = new Date();
  var midnight = new Date();
  midnight.setHours(24, 0, 0, 0);
  return Math.ceil((midnight.getTime() - now.getTime()) / 1000);
}

// ============================================================
// Usage Tracker
// ============================================================
function getUsage(userId, callback) {
  redis.multi()
    .get("usage:daily:" + userId)
    .get("usage:monthly:" + userId)
    .get("usage:feature:" + userId + ":chat")
    .get("usage:feature:" + userId + ":search")
    .get("usage:feature:" + userId + ":generation")
    .exec(function (err, results) {
      if (err) return callback(err);
      callback(null, {
        dailyTokensUsed: parseInt(results[0][1]) || 0,
        monthlyTokensUsed: parseInt(results[1][1]) || 0,
        featureBreakdown: {
          chat: parseInt(results[2][1]) || 0,
          search: parseInt(results[3][1]) || 0,
          generation: parseInt(results[4][1]) || 0
        }
      });
    });
}

function recordUsage(userId, feature, tokensUsed, callback) {
  var now = Date.now();
  var dailyKey = "usage:daily:" + userId;
  var monthlyKey = "usage:monthly:" + userId;
  var featureKey = "usage:feature:" + userId + ":" + feature;

  var todayEnd = new Date();
  todayEnd.setHours(23, 59, 59, 999);
  var dailyTTL = Math.ceil((todayEnd.getTime() - now) / 1000);

  var monthEnd = new Date();
  monthEnd.setMonth(monthEnd.getMonth() + 1, 1);
  monthEnd.setHours(0, 0, 0, 0);
  var monthlyTTL = Math.ceil((monthEnd.getTime() - now) / 1000);

  redis.multi()
    .incrby(dailyKey, tokensUsed)
    .expire(dailyKey, dailyTTL)
    .incrby(monthlyKey, tokensUsed)
    .expire(monthlyKey, monthlyTTL)
    .incrby(featureKey, tokensUsed)
    .expire(featureKey, monthlyTTL)
    // Keep a capped per-user log so /api/usage/history has data to read
    .lpush("usage:log:" + userId, JSON.stringify({
      feature: feature,
      tokens: tokensUsed,
      timestamp: now
    }))
    .ltrim("usage:log:" + userId, 0, 999)
    .expire("usage:log:" + userId, monthlyTTL)
    .exec(function (err) {
      callback(err || null);
    });
}

// ============================================================
// Admin Override Check
// ============================================================
function getEffectiveTierConfig(userId, baseTier, callback) {
  var overrideKey = "override:" + userId;
  redis.get(overrideKey, function (err, data) {
    if (err || !data) return callback(null, TIER_CONFIG[baseTier] || TIER_CONFIG.free);

    var override = JSON.parse(data);
    var config = JSON.parse(JSON.stringify(TIER_CONFIG[baseTier] || TIER_CONFIG.free));
    if (override.dailyTokenBudget) config.dailyTokenBudget = override.dailyTokenBudget;
    if (override.monthlyTokenBudget) config.monthlyTokenBudget = override.monthlyTokenBudget;
    callback(null, config);
  });
}

// ============================================================
// Rate Limit Middleware
// ============================================================
function aiRateLimit(feature) {
  return function (req, res, next) {
    // req.user must be set by authentication middleware upstream
    if (!req.user || !req.user.id) {
      return res.status(401).json({ error: "Authentication required" });
    }

    var userId = req.user.id;
    var tier = req.user.tier || "free";
    var inputText = req.body.prompt || req.body.query || req.body.message || "";
    var estimatedTokens = estimateTokens(inputText);

    getEffectiveTierConfig(userId, tier, function (err, tierConfig) {
      if (err) {
        console.error("Failed to get tier config:", err.message);
        return res.status(500).json({ error: "Internal error checking rate limits" });
      }

      // Check feature-level limits
      var featureLimits = tierConfig.limits[feature];
      if (!featureLimits) {
        return res.status(400).json({ error: "Unknown AI feature: " + feature });
      }

      if (estimatedTokens > featureLimits.maxTokensPerRequest) {
        return res.status(413).json({
          error: "Request too large",
          message: "Estimated " + estimatedTokens + " tokens exceeds max of " + featureLimits.maxTokensPerRequest,
          maxTokensPerRequest: featureLimits.maxTokensPerRequest
        });
      }

      // Check per-minute rate via sliding window
      var minuteKey = "rpm:" + userId + ":" + feature + ":" + Math.floor(Date.now() / 60000);
      redis.get(minuteKey, function (err2, count) {
        if (err2) {
          console.error("Redis error:", err2.message);
          return res.status(500).json({ error: "Internal error" });
        }

        var currentCount = parseInt(count) || 0;
        if (currentCount >= featureLimits.requestsPerMinute) {
          res.set("Retry-After", "60");
          return res.status(429).json({
            error: "Rate limit exceeded",
            message: "You have exceeded " + featureLimits.requestsPerMinute + " requests per minute for " + feature,
            retryAfter: 60
          });
        }

        // Check budget caps
        getUsage(userId, function (err3, usage) {
          if (err3) {
            console.error("Usage check error:", err3.message);
            return res.status(500).json({ error: "Internal error" });
          }

          var dailyRemaining = tierConfig.dailyTokenBudget - usage.dailyTokensUsed;
          var monthlyRemaining = tierConfig.monthlyTokenBudget - usage.monthlyTokensUsed;

          // Set rate limit headers on every response
          res.set("X-RateLimit-Limit-Daily", tierConfig.dailyTokenBudget.toString());
          res.set("X-RateLimit-Remaining-Daily", Math.max(0, dailyRemaining).toString());
          res.set("X-RateLimit-Limit-Monthly", tierConfig.monthlyTokenBudget.toString());
          res.set("X-RateLimit-Remaining-Monthly", Math.max(0, monthlyRemaining).toString());

          if (estimatedTokens > dailyRemaining) {
            res.set("Retry-After", getTimeUntilMidnight().toString());
            return res.status(429).json({
              error: "Daily budget exceeded",
              message: "You have used " + usage.dailyTokensUsed + " of " + tierConfig.dailyTokenBudget + " daily tokens",
              resetsIn: getTimeUntilMidnight()
            });
          }

          if (estimatedTokens > monthlyRemaining) {
            return res.status(429).json({
              error: "Monthly budget exceeded",
              message: "You have used " + usage.monthlyTokensUsed + " of " + tierConfig.monthlyTokenBudget + " monthly tokens",
              upgradeUrl: "/pricing"
            });
          }

          // Soft limit warnings
          var dailyPercent = usage.dailyTokensUsed / tierConfig.dailyTokenBudget;
          var monthlyPercent = usage.monthlyTokensUsed / tierConfig.monthlyTokenBudget;
          if (dailyPercent > 0.8 || monthlyPercent > 0.8) {
            res.set("X-RateLimit-Warning", "Usage above 80% of budget");
          }

          // Increment per-minute counter
          redis.multi()
            .incr(minuteKey)
            .expire(minuteKey, 120)
            .exec();

          // Attach usage metadata to request for post-processing
          req.aiRateLimit = {
            feature: feature,
            estimatedTokens: estimatedTokens,
            tierConfig: tierConfig,
            usage: usage
          };

          next();
        });
      });
    });
  };
}

// ============================================================
// Post-Processing Middleware (records actual usage)
// ============================================================
function recordAiUsage(req, res, actualTokensUsed) {
  if (!req.user || !req.aiRateLimit) return;

  var userId = req.user.id;
  var feature = req.aiRateLimit.feature;

  recordUsage(userId, feature, actualTokensUsed, function (err) {
    if (err) {
      console.error("Failed to record usage for user " + userId + ":", err.message);
    }
  });
}

// ============================================================
// AI Feature Routes
// ============================================================
app.post("/api/ai/chat", aiRateLimit("chat"), function (req, res) {
  var startTime = Date.now();

  // Simulate AI call - replace with actual LLM integration
  var responseText = "This is a simulated AI chat response.";
  var actualTokensUsed = estimateTokens(req.body.message) + estimateTokens(responseText);

  recordAiUsage(req, res, actualTokensUsed);

  res.json({
    response: responseText,
    tokensUsed: actualTokensUsed,
    latencyMs: Date.now() - startTime
  });
});

app.post("/api/ai/search", aiRateLimit("search"), function (req, res) {
  var actualTokensUsed = estimateTokens(req.body.query) + 200;

  recordAiUsage(req, res, actualTokensUsed);

  res.json({
    results: [{ title: "Example result", snippet: "..." }],
    tokensUsed: actualTokensUsed
  });
});

app.post("/api/ai/generate", aiRateLimit("generation"), function (req, res) {
  var actualTokensUsed = estimateTokens(req.body.prompt) + 2000;

  recordAiUsage(req, res, actualTokensUsed);

  res.json({
    content: "Generated content goes here...",
    tokensUsed: actualTokensUsed
  });
});

// ============================================================
// User Dashboard Endpoints
// ============================================================
app.get("/api/usage/summary", function (req, res) {
  if (!req.user || !req.user.id) {
    return res.status(401).json({ error: "Authentication required" });
  }

  var userId = req.user.id;
  var tier = req.user.tier || "free";

  getEffectiveTierConfig(userId, tier, function (err, tierConfig) {
    if (err) return res.status(500).json({ error: "Internal error" });

    getUsage(userId, function (err2, usage) {
      if (err2) return res.status(500).json({ error: "Internal error" });

      res.json({
        tier: tier,
        daily: {
          used: usage.dailyTokensUsed,
          limit: tierConfig.dailyTokenBudget,
          remaining: Math.max(0, tierConfig.dailyTokenBudget - usage.dailyTokensUsed),
          percentUsed: Math.round((usage.dailyTokensUsed / tierConfig.dailyTokenBudget) * 100)
        },
        monthly: {
          used: usage.monthlyTokensUsed,
          limit: tierConfig.monthlyTokenBudget,
          remaining: Math.max(0, tierConfig.monthlyTokenBudget - usage.monthlyTokensUsed),
          percentUsed: Math.round((usage.monthlyTokensUsed / tierConfig.monthlyTokenBudget) * 100)
        },
        features: usage.featureBreakdown,
        limits: tierConfig.limits
      });
    });
  });
});

app.get("/api/usage/history", function (req, res) {
  if (!req.user || !req.user.id) {
    return res.status(401).json({ error: "Authentication required" });
  }

  redis.lrange("usage:log:" + req.user.id, 0, 99, function (err, records) {
    if (err) return res.status(500).json({ error: "Internal error" });

    var history = records.map(function (raw) {
      return JSON.parse(raw);
    });

    res.json({ history: history });
  });
});

// ============================================================
// Admin Endpoints
// ============================================================
app.post("/api/admin/override", function (req, res) {
  // In production, verify admin role from req.user
  var targetUserId = req.body.userId;
  var overrideData = JSON.stringify({
    dailyTokenBudget: req.body.dailyTokenBudget,
    monthlyTokenBudget: req.body.monthlyTokenBudget,
    reason: req.body.reason,
    grantedBy: req.user ? req.user.id : "admin",
    grantedAt: Date.now()
  });

  var ttl = (req.body.expiresInHours || 24) * 3600;

  redis.set("override:" + targetUserId, overrideData, "EX", ttl, function (err) {
    if (err) return res.status(500).json({ error: "Failed to set override" });
    res.json({ success: true, expiresInHours: req.body.expiresInHours || 24 });
  });
});

// ============================================================
// Start Server
// ============================================================
var PORT = process.env.PORT || 3000;
app.listen(PORT, function () {
  console.log("AI rate limiting server running on port " + PORT);
});

Common Issues and Troubleshooting

1. Redis connection failures cause all AI requests to fail.

Error: connect ECONNREFUSED 127.0.0.1:6379
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1595:16)

If Redis goes down, your rate limiter becomes a brick wall that blocks every request. Always implement a fallback. Add a try-catch around your Redis calls and default to allowing the request (with logging) rather than denying it. A brief period of unmetered usage is better than a full outage. Use a circuit breaker pattern on the Redis connection and fall back to in-memory rate limiting with a Map when Redis is unavailable.
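
Here is one shape that fallback can take: a sketch, assuming you wrap the limiter's Redis check in a helper of your own. The in-memory Map is a deliberately coarse per-minute backstop, not a full replacement:

// Fail-open wrapper: if the Redis-backed check errors, allow the request,
// log the failure, and fall back to a crude in-memory per-minute count.
var localCounts = new Map(); // key: "userId:feature:minute" -> count

function checkLimitFailOpen(userId, feature, limitPerMinute, redisCheck, callback) {
  redisCheck(function (err, result) {
    if (!err) return callback(null, result);

    console.error("Rate limiter degraded, failing open:", err.message);
    var key = userId + ":" + feature + ":" + Math.floor(Date.now() / 60000);
    var count = (localCounts.get(key) || 0) + 1;
    localCounts.set(key, count);

    // Local backstop so one user cannot go completely unmetered
    callback(null, { allowed: count <= limitPerMinute, degraded: true });
  });
}

// Periodically drop stale minute buckets so the Map cannot grow unbounded
setInterval(function () {
  var currentMinute = Math.floor(Date.now() / 60000);
  localCounts.forEach(function (value, key) {
    var minute = parseInt(key.split(":").pop(), 10);
    if (minute < currentMinute - 2) localCounts.delete(key);
  });
}, 60000);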

2. Race conditions on concurrent requests from the same user.

Error: User exceeded daily budget by 45000 tokens (budget: 100000, actual: 145000)

When two requests arrive simultaneously, both check the budget, both see sufficient tokens, and both proceed. The budget gets over-spent. Fix this by using Redis MULTI/EXEC for atomic check-and-decrement, or better yet, use a Lua script that performs the check and increment atomically on the Redis server side. The small amount of overage from race conditions is usually acceptable, but if you need exact enforcement, Lua scripts are the way to go.
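
A sketch of the Lua approach, using ioredis's defineCommand to register a script that checks and increments the daily counter in one atomic server-side step. The command name is our own, and the usage snippet assumes the userId, estimatedTokens, tierConfig, and callback variables from checkBudget above:

// Atomic budget check-and-increment via a server-side Lua script.
// Returns 1 and increments when the budget allows; returns 0 otherwise.
redis.defineCommand("consumeBudget", {
  numberOfKeys: 1,
  lua: [
    "local used = tonumber(redis.call('GET', KEYS[1]) or '0')",
    "local needed = tonumber(ARGV[1])",
    "local budget = tonumber(ARGV[2])",
    "if used + needed > budget then return 0 end",
    "redis.call('INCRBY', KEYS[1], needed)",
    "redis.call('EXPIRE', KEYS[1], tonumber(ARGV[3]))",
    "return 1"
  ].join("\n")
});

// Usage: atomically reserve estimatedTokens against the daily budget
redis.consumeBudget(
  "usage:daily:" + userId,
  estimatedTokens,
  tierConfig.dailyTokenBudget,
  getTimeUntilMidnight(),
  function (err, allowed) {
    if (err) return callback(err);
    callback(null, { allowed: allowed === 1 });
  }
);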

3. Token estimation is wildly inaccurate for non-English text.

Warning: Estimated 500 tokens, actual usage was 1847 tokens (269% over estimate)

The "divide by 3.5" estimation works reasonably for English but falls apart for CJK languages, code with lots of special characters, or highly technical text with acronyms. Use a proper tokenizer library like tiktoken (for OpenAI models) or @anthropic-ai/tokenizer for Claude. If you cannot use a tokenizer, multiply your estimate by 2x for non-English content as a safety margin. Always record actual tokens after the call and reconcile against estimates.

4. Monthly counters reset at different times for different users.

Bug report: "My monthly limit reset at midnight UTC but I'm in PST. I lost 8 hours of budget."

If you use calendar-month keys (like usage:monthly:user123), they expire based on when they were first set, not on the user's local midnight. For consistency, document that all limits reset at midnight UTC. If you need per-user timezone support, include the user's timezone offset in the key calculation and adjust TTLs accordingly. Most B2B SaaS applications just standardize on UTC and document it clearly.
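
If you do decide to support per-user reset times, here is a minimal sketch of the key and TTL math, assuming you store each user's UTC offset in minutes:

// Compute the daily usage key and TTL relative to the user's local
// midnight, given their UTC offset in minutes (e.g. PST = -480).
// Including the day index in the key sidesteps expiry-based resets.
function getLocalDailyKey(userId, utcOffsetMinutes) {
  var localNow = Date.now() + utcOffsetMinutes * 60000;
  var localDay = Math.floor(localNow / 86400000); // days since epoch, local
  var nextLocalMidnight = (localDay + 1) * 86400000 - utcOffsetMinutes * 60000;
  return {
    key: "usage:daily:" + userId + ":" + localDay,
    ttlSeconds: Math.ceil((nextLocalMidnight - Date.now()) / 1000)
  };
}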

5. Memory leaks from in-memory request queues.

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

If you implement request queuing in memory (as shown in the queuing example above), you must have strict limits on queue size and a cleanup mechanism for stale entries. Set a maximum queue depth per user (3-5 items), add a TTL to each queue entry (5 minutes max), and run a periodic cleanup sweep. Better yet, use a proper message queue like Redis lists or BullMQ for durable queuing that survives server restarts.
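
A minimal cleanup sweep for the in-memory queue shown earlier, assuming the requestQueue object and queuedAt timestamps from that example and a five-minute entry TTL:

// Periodic sweep: drop queue entries older than 5 minutes and remove
// empty per-user queues so the object cannot grow without bound.
var QUEUE_ENTRY_TTL_MS = 5 * 60 * 1000;

setInterval(function () {
  var cutoff = Date.now() - QUEUE_ENTRY_TTL_MS;
  Object.keys(requestQueue).forEach(function (userId) {
    requestQueue[userId] = requestQueue[userId].filter(function (entry) {
      return entry.queuedAt > cutoff;
    });
    if (requestQueue[userId].length === 0) {
      delete requestQueue[userId];
    }
  });
}, 60000);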

Best Practices

  • Rate limit on tokens, not just requests. A request sending a 100-word prompt and a request sending a 10,000-word document are not equivalent. Your rate limiter must account for the actual cost of each request, not just the count.

  • Always set rate limit headers. Include X-RateLimit-Limit, X-RateLimit-Remaining, and Retry-After on every AI endpoint response. This lets client applications implement their own backoff logic without hammering your API with doomed requests.

  • Implement separate limits per AI feature. Chat, search, and generation have different cost profiles. A user should not exhaust their search budget because they used too much generation. Separate limits also let you tune each feature independently as costs change.

  • Use soft limits before hard cutoffs. Warn users at 80% and 95% of their budget. Send an email or in-app notification when they cross the 80% threshold. This reduces support tickets and gives users time to adjust their usage or upgrade.

  • Make your rate limiter fail open, not closed. When Redis is down or your rate limiting logic throws an error, allow the request through and log the failure. A brief period of unmetered usage is vastly preferable to a complete service outage. Monitor these failures and alert on them.

  • Log every rate limit decision. Record when users are allowed, warned, and denied. This data is invaluable for tuning your tier limits, identifying users who consistently hit limits (upgrade candidates), and debugging billing disputes.

  • Use atomic operations for budget checks. Redis MULTI/EXEC or Lua scripts prevent race conditions where concurrent requests both pass the budget check and cause overspending. This is especially important for expensive generation requests.

  • Build admin overrides from day one. Sales will ask you to increase limits for a prospect. Support will ask you to temporarily lift limits for a user debugging an issue. Having an override system ready saves you from hot-patching production.

  • Record actual token usage, not estimates. Use the estimate for pre-flight checks, but always record the actual token count returned by the LLM API. Reconcile the two periodically and adjust your estimation function if the delta is consistently large.

  • Plan for tier changes mid-billing-cycle. When a user upgrades from free to pro, do they get the pro limits immediately? Does their usage counter reset? Define these rules explicitly and implement them. Most applications grant the new limits immediately without resetting counters, which is simple and fair; a minimal sketch of that policy follows this list.
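
In code, that policy is small. A minimal sketch, where userStore.updateTier is a stand-in for your own persistence layer:

// Apply the new tier immediately; leave usage counters untouched.
// The user's remaining budget grows because the limit grew, not
// because their historical usage was erased.
function applyTierChange(userId, newTier, callback) {
  userStore.updateTier(userId, newTier, function (err) {
    if (err) return callback(err);
    // Deliberately do NOT touch usage:daily:* or usage:monthly:* keys
    callback(null, { tier: newTier, countersReset: false });
  });
}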
