
Model Selection: Choosing the Right LLM for Your Task

Guide to selecting the right LLM for different tasks with cost analysis, benchmarking strategies, and a model router implementation in Node.js.


Not every task needs your most powerful model. Sending a simple classification request to Claude Opus is like hiring a brain surgeon to put on a bandage — it works, but you are burning money and time for no reason. This guide covers how to evaluate models systematically, build a routing layer that picks the right model per request, and keep your LLM costs under control without sacrificing output quality.

Prerequisites

  • Node.js v18 or later installed
  • Basic familiarity with LLM API calls (Anthropic, OpenAI)
  • API keys for at least one provider (Anthropic or OpenAI)
  • npm packages: @anthropic-ai/sdk, openai, express

The Current LLM Landscape

The model landscape shifts fast, but the fundamental tiers have stayed remarkably stable. Understanding these tiers matters more than memorizing specific model names, because names change every few months while the selection logic does not.

The Claude Family (Anthropic)

Anthropic's lineup follows a clear tiering strategy:

  • Claude Opus — The flagship. Best reasoning, longest context retention, most nuanced output. Expensive. Use it when quality is non-negotiable: complex code generation, multi-step analysis, long-form writing that needs to hold together across thousands of words.
  • Claude Sonnet — The workhorse. Roughly 80-90% of Opus quality at a fraction of the cost. Handles most production tasks well: summarization, moderate code generation, structured data extraction, conversational AI.
  • Claude Haiku — The speedster. Fast, cheap, and surprisingly capable for well-defined tasks. Classification, simple extraction, template-based generation, routing decisions. This is your default until you prove you need something bigger.

The GPT Family (OpenAI)

OpenAI follows a similar pattern:

  • GPT-4o — Their current flagship multimodal model. Strong across the board, particularly good at following complex instructions and working with images.
  • GPT-4o-mini — The budget option. Comparable to Haiku in cost, solid for straightforward tasks.
  • o1 / o3 series — Reasoning-focused models that "think" before responding. Excellent for math, logic puzzles, and multi-step reasoning. Slower and more expensive, but they solve problems that other models cannot.

Open-Source Models

Llama, Mistral, Mixtral, and others offer self-hosted alternatives. They make sense when you have strict data residency requirements, need to eliminate per-token costs at massive scale, or want to fine-tune on proprietary data. The tradeoff is operational complexity — you are now managing GPU infrastructure, model updates, and scaling yourself.

Benchmarking Models for Your Use Case

Public benchmarks like MMLU, HumanEval, and GPQA tell you how models perform on academic tasks. They tell you almost nothing about how a model will perform on your specific workload. You need your own benchmarks.

Build a Test Suite

Create a set of 50-100 representative inputs from your actual production data. For each input, define what a good output looks like. This does not need to be a single correct answer — it can be a rubric.

var testCases = [
  {
    input: "Classify this support ticket: 'My payment was charged twice'",
    expectedCategory: "billing",
    minConfidence: 0.9
  },
  {
    input: "Summarize this 2000-word technical document about Kubernetes networking",
    document: "...",
    criteria: {
      maxLength: 300,
      mustMention: ["pod networking", "service mesh", "CNI"],
      mustNotHallucinate: true
    }
  }
];

Run Comparisons Systematically

var Anthropic = require("@anthropic-ai/sdk");
var client = new Anthropic();

function runBenchmark(testCases, models) {
  var results = {};

  models.forEach(function(model) {
    results[model] = {
      totalTime: 0,
      totalTokens: 0,
      totalCost: 0,
      scores: []
    };
  });

  return Promise.all(testCases.map(function(testCase) {
    return Promise.all(models.map(function(model) {
      var start = Date.now();

      return client.messages.create({
        model: model,
        max_tokens: 1024,
        messages: [{ role: "user", content: testCase.input }]
      }).then(function(response) {
        var elapsed = Date.now() - start;
        var inputTokens = response.usage.input_tokens;
        var outputTokens = response.usage.output_tokens;
        var cost = calculateCost(model, inputTokens, outputTokens);

        results[model].totalTime += elapsed;
        results[model].totalTokens += inputTokens + outputTokens;
        results[model].totalCost += cost;
        results[model].scores.push(
          scoreOutput(response.content[0].text, testCase)
        );
      });
    }));
  })).then(function() {
    return results;
  });
}

This gives you real numbers — not vibes, not blog posts, not Twitter threads. Actual cost, latency, and quality data for your workload.
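
The benchmark above assumes two helpers that you supply yourself: calculateCost and scoreOutput. A minimal sketch, using placeholder per-million-token prices that match the table later in this guide and a simple keyword rubric (swap in your own pricing config and scoring criteria):

var PRICES = {
  // Keep these in sync with your provider's published per-1M-token pricing
  "claude-3-5-haiku-20241022": { input: 0.25, output: 1.25 },
  "claude-sonnet-4-20250514": { input: 3.00, output: 15.00 }
};

function calculateCost(model, inputTokens, outputTokens) {
  var p = PRICES[model] || { input: 0, output: 0 };
  return (inputTokens / 1000000) * p.input + (outputTokens / 1000000) * p.output;
}

function scoreOutput(text, testCase) {
  // Discrete tasks: exact category match
  if (testCase.expectedCategory) {
    return text.toLowerCase().indexOf(testCase.expectedCategory) !== -1 ? 1 : 0;
  }
  // Rubric tasks: fraction of required terms that appear in the output
  if (testCase.criteria && testCase.criteria.mustMention) {
    var mentioned = testCase.criteria.mustMention.filter(function(term) {
      return text.toLowerCase().indexOf(term.toLowerCase()) !== -1;
    });
    return mentioned.length / testCase.criteria.mustMention.length;
  }
  return 0;
}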

Task-Based Selection Criteria

Different tasks have fundamentally different requirements. Here is how I think about model selection by task type.

Classification and Labeling

Use the smallest model that hits your accuracy threshold. Classification tasks have discrete outputs, so you can measure accuracy precisely. Start with Haiku or GPT-4o-mini. In my experience, small models handle binary and multi-class classification at 95%+ accuracy when you write clear prompts with examples.
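
For example, a Haiku classification call with a fixed label set and a couple of in-prompt examples (the categories here are illustrative) can be as simple as:

var Anthropic = require("@anthropic-ai/sdk");
var client = new Anthropic();

function classifyTicket(ticketText) {
  var prompt =
    "Classify the support ticket into exactly one category: billing, technical, account, other.\n" +
    "Respond with only the category name.\n\n" +
    "Ticket: 'I was charged twice this month' -> billing\n" +
    "Ticket: 'The app crashes when I upload a file' -> technical\n\n" +
    "Ticket: '" + ticketText + "' ->";

  return client.messages.create({
    model: "claude-3-5-haiku-20241022",
    max_tokens: 10,
    messages: [{ role: "user", content: prompt }]
  }).then(function(response) {
    return response.content[0].text.trim().toLowerCase();
  });
}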

Summarization

Mid-tier models shine here. Sonnet and GPT-4o produce summaries that are nearly indistinguishable from Opus output. The key is a well-structured prompt that specifies length, tone, and which details to prioritize. Only escalate to a flagship model if your summaries need to handle highly technical or nuanced source material.

Code Generation

This is where model quality matters most. Small models generate code that looks right but has subtle bugs — off-by-one errors, incorrect edge case handling, missing null checks. For production code generation, use Opus or GPT-4o. For boilerplate, scaffolding, or simple utility functions, Sonnet or GPT-4o-mini are fine.

Conversational AI

Depends entirely on the conversation complexity. A customer support bot answering FAQ-style questions works great with Haiku. A technical advisor that needs to maintain context across a long conversation and reason about complex scenarios needs Sonnet or Opus. Match the model to the hardest conversation your bot needs to handle, then consider routing simple conversations to a cheaper model.

Data Extraction and Transformation

Structured output tasks — pulling fields from unstructured text, converting formats, parsing documents — are Haiku territory. The task is well-defined, the output format is constrained, and small models handle it efficiently. Just make sure your prompt includes explicit output format examples.
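
For instance, an extraction prompt that spells out the schema and shows one example output (the invoice fields here are hypothetical) keeps a small model from drifting:

function buildExtractionPrompt(invoiceText) {
  return (
    "Extract the following fields from the invoice text below. Respond with JSON only, no preamble.\n" +
    "Schema: {\"vendor\": string, \"invoiceNumber\": string, \"total\": number, \"dueDate\": \"YYYY-MM-DD\" or null}\n\n" +
    "Example output: {\"vendor\": \"Acme Corp\", \"invoiceNumber\": \"INV-0042\", \"total\": 129.95, \"dueDate\": \"2025-03-01\"}\n\n" +
    "Invoice text:\n" + invoiceText
  );
}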

Cost-Per-Token Comparison

Costs change frequently, but the relative ratios between tiers stay consistent. Here are approximate costs per million tokens as of early 2026:

Model           Input (per 1M tokens)   Output (per 1M tokens)
Claude Opus     $15.00                  $75.00
Claude Sonnet   $3.00                   $15.00
Claude Haiku    $0.25                   $1.25
GPT-4o          $2.50                   $10.00
GPT-4o-mini     $0.15                   $0.60
o1              $15.00                  $60.00

The gap between tiers is enormous. Moving a request from Opus to Haiku does not save you 80% — it saves you closer to 98% on that request. Route 80% of your traffic down-tier and the total bill drops by nearly 80%; route nearly all of it and a $50,000/month LLM bill becomes one in the low thousands.
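
To make that per-request figure concrete, here is the arithmetic for a request with 1,000 input and 300 output tokens (the token counts are illustrative):

// Illustrative: 1,000 input + 300 output tokens per request
var opusCost = (1000 / 1e6) * 15.00 + (300 / 1e6) * 75.00; // $0.0375
var haikuCost = (1000 / 1e6) * 0.25 + (300 / 1e6) * 1.25;  // $0.000625

var savings = 1 - haikuCost / opusCost;
console.log((savings * 100).toFixed(1) + "% cheaper per routed request"); // ~98.3%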

Latency vs Quality Tradeoffs

Latency is not just about user experience — it compounds through your system. If an LLM call sits in the middle of a request pipeline, every millisecond of model latency adds directly to your response time.

Typical response times for a moderate request (500 input tokens, 200 output tokens):

  • Haiku: 300-800ms
  • Sonnet: 800-2000ms
  • Opus: 2000-5000ms
  • GPT-4o-mini: 300-700ms
  • GPT-4o: 800-2000ms

For user-facing features where someone is staring at a loading spinner, latency matters as much as quality. For background processing jobs that run overnight, you can afford to wait for the best model. Factor this into your routing logic.
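
One way to factor it in (a sketch; the tier names match the router built later in this guide, and the latency numbers are assumptions) is to cap the allowed tier by a latency budget the caller passes in:

// Rough upper-bound latencies per tier, in milliseconds
var TIER_LATENCY_MS = { simple: 800, medium: 2000, complex: 5000 };

function capTierByLatency(tier, latencyBudgetMs) {
  if (!latencyBudgetMs) return tier; // background jobs: no cap
  var order = ["simple", "medium", "complex"];
  var index = order.indexOf(tier);
  // Step down to a faster tier until the estimate fits the budget
  while (index > 0 && TIER_LATENCY_MS[order[index]] > latencyBudgetMs) {
    index -= 1;
  }
  return order[index];
}

// A user-facing request with a 1-second budget never gets routed to the slowest tier
capTierByLatency("complex", 1000); // -> "simple"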

When to Use Small Models vs Large Models

I have a simple heuristic that works surprisingly well:

Use the small model when:

  • The task has a well-defined output format (JSON, categories, yes/no)
  • You can validate the output programmatically
  • A wrong answer is cheap to catch and retry (see the validate-then-escalate sketch after these lists)
  • The input is short and well-structured

Use the large model when:

  • The task requires reasoning across multiple pieces of information
  • Output quality is subjective and hard to validate automatically
  • Errors are expensive (customer-facing content, code that runs in production)
  • The input is long, ambiguous, or requires domain expertise to interpret

The middle tier (Sonnet, GPT-4o) covers the vast majority of production use cases. My default is always Sonnet until I have evidence that I need something bigger or can get away with something smaller.
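
A practical way to apply the "cheap to catch and retry" rule is a validate-then-escalate wrapper: call the small model first, check the output programmatically, and only re-run on a bigger model when validation fails. A minimal sketch (the model IDs and the validator are up to you):

var Anthropic = require("@anthropic-ai/sdk");
var client = new Anthropic();

function withEscalation(prompt, validate) {
  var models = ["claude-3-5-haiku-20241022", "claude-sonnet-4-20250514"];

  function tryModel(index) {
    return client.messages.create({
      model: models[index],
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }]
    }).then(function(response) {
      var text = response.content[0].text;
      // Escalate only when validation fails and a bigger model is available
      if (!validate(text) && index < models.length - 1) {
        return tryModel(index + 1);
      }
      return { text: text, model: models[index] };
    });
  }

  return tryModel(0);
}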

Building a Model Router

A model router examines each incoming request and decides which model should handle it. This is the most impactful cost optimization you can build into an LLM-powered system.

Router Design Principles

  1. Classify first, generate second. Use a cheap, fast call to assess request complexity before making the expensive generation call.
  2. Default to the cheapest model. Escalate up, do not start at the top and try to optimize down.
  3. Track everything. Log which model handled each request, the cost, the latency, and any quality signals. You need this data to tune your router.
  4. Make it overridable. Sometimes you know a request needs a specific model. Let callers specify a model preference that bypasses the router.

Complexity Classification

The router's core job is determining request complexity. You can do this with heuristics, a small classifier model, or a combination of both.

function classifyComplexity(prompt) {
  // Heuristic signals
  var wordCount = prompt.split(/\s+/).length;
  var hasCodeRequest = /(?:write|generate|create|implement|build).*(?:code|function|class)/i.test(prompt);
  var hasAnalysis = /analyze|compare|evaluate|reason|explain why|trade-?off/i.test(prompt);
  var hasSimpleTask = /classify|categorize|extract|convert|translate|summarize briefly/i.test(prompt);
  var questionCount = (prompt.match(/\?/g) || []).length;

  var score = 0;

  // Length-based scoring
  if (wordCount > 500) score += 2;
  else if (wordCount > 200) score += 1;

  // Task-based scoring
  if (hasCodeRequest) score += 2;
  if (hasAnalysis) score += 2;
  if (hasSimpleTask) score -= 1;
  if (questionCount > 3) score += 1;

  if (score <= 0) return "simple";
  if (score <= 2) return "medium";
  return "complex";
}

This heuristic approach is fast and free — no API call needed. For more accurate classification, you can use Haiku itself as the classifier, which costs fractions of a cent per classification.
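
A sketch of that Haiku-as-classifier variant (the prompt wording is an assumption; fall back to the heuristic if the label comes back malformed):

var Anthropic = require("@anthropic-ai/sdk");
var client = new Anthropic();

function classifyComplexityWithModel(prompt) {
  var classifierPrompt =
    "Rate the complexity of the following request as exactly one word: simple, medium, or complex.\n\n" +
    "Request:\n" + prompt;

  return client.messages.create({
    model: "claude-3-5-haiku-20241022",
    max_tokens: 5,
    messages: [{ role: "user", content: classifierPrompt }]
  }).then(function(response) {
    var label = response.content[0].text.trim().toLowerCase();
    // Guard against unexpected output from the classifier itself
    return ["simple", "medium", "complex"].indexOf(label) !== -1
      ? label
      : classifyComplexity(prompt);
  });
}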

A/B Testing Models in Production

You should not just trust your offline benchmarks. Models behave differently on real production data, and user perception of quality does not always match automated metrics.

Implementing an A/B Test

var crypto = require("crypto");

function selectModelVariant(userId, experimentId, variants) {
  var hash = crypto
    .createHash("md5")
    .update(userId + ":" + experimentId)
    .digest("hex");

  var bucket = parseInt(hash.substring(0, 8), 16) % 100;
  var cumulative = 0;

  for (var i = 0; i < variants.length; i++) {
    cumulative += variants[i].percentage;
    if (bucket < cumulative) {
      return variants[i];
    }
  }

  return variants[variants.length - 1];
}

// Usage
var variants = [
  { model: "claude-sonnet-4-20250514", percentage: 50, name: "control" },
  { model: "claude-haiku-3-5-20241022", percentage: 50, name: "challenger" }
];

var selected = selectModelVariant("user-123", "summarization-v2", variants);

The critical piece is logging outcomes alongside the variant. Track completion rates, user feedback signals (thumbs up/down, regeneration requests), task success metrics, and cost. Run the experiment for at least a week before drawing conclusions.
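
A minimal outcome log (the field names are just a suggestion) makes that comparison possible later:

var fs = require("fs");

function logOutcome(entry) {
  // Append one JSON line per request; load it into your analytics tool later
  var record = {
    timestamp: new Date().toISOString(),
    experimentId: entry.experimentId,
    variant: entry.variant.name,
    model: entry.variant.model,
    latencyMs: entry.latencyMs,
    cost: entry.cost,
    userFeedback: entry.userFeedback || null, // e.g. "thumbs_up", "regenerated"
    taskSucceeded: entry.taskSucceeded
  };
  fs.appendFileSync("ab-outcomes.jsonl", JSON.stringify(record) + "\n");
}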

Evaluating Model Outputs Programmatically

Automated evaluation is the key to scaling model selection decisions. You cannot manually review thousands of outputs.

Scoring Strategies

function evaluateOutput(output, criteria) {
  var scores = {};
  var totalWeight = 0;
  var weightedSum = 0;

  // Length compliance
  if (criteria.maxWords) {
    var wordCount = output.split(/\s+/).length;
    scores.length = wordCount <= criteria.maxWords ? 1.0 :
      Math.max(0, 1.0 - (wordCount - criteria.maxWords) / criteria.maxWords);
  }

  // Required content
  if (criteria.mustMention && criteria.mustMention.length > 0) {
    var mentioned = criteria.mustMention.filter(function(term) {
      return output.toLowerCase().indexOf(term.toLowerCase()) !== -1;
    });
    scores.coverage = mentioned.length / criteria.mustMention.length;
  }

  // Format compliance (JSON, etc.)
  if (criteria.expectedFormat === "json") {
    try {
      JSON.parse(output);
      scores.format = 1.0;
    } catch (e) {
      scores.format = 0.0;
    }
  }

  // Prohibited content
  if (criteria.mustNotContain) {
    var violations = criteria.mustNotContain.filter(function(term) {
      return output.toLowerCase().indexOf(term.toLowerCase()) !== -1;
    });
    scores.compliance = violations.length === 0 ? 1.0 : 0.0;
  }

  // Calculate weighted average
  Object.keys(scores).forEach(function(key) {
    var weight = (criteria.weights && criteria.weights[key]) || 1.0;
    weightedSum += scores[key] * weight;
    totalWeight += weight;
  });

  var overall = totalWeight > 0 ? weightedSum / totalWeight : 0;

  return {
    scores: scores,
    overall: overall,
    pass: overall >= (criteria.passThreshold || 0.8)
  };
}

For subjective quality — tone, coherence, helpfulness — use LLM-as-judge. Have a strong model (Opus) evaluate the output of a weaker model. This is cheaper than human evaluation and correlates well with human preferences when the judging prompt is specific.

LLM-as-Judge Pattern

function llmJudge(originalPrompt, modelOutput, judgingCriteria) {
  var Anthropic = require("@anthropic-ai/sdk");
  var client = new Anthropic();

  var judgePrompt = "You are evaluating an AI assistant's response.\n\n" +
    "Original request: " + originalPrompt + "\n\n" +
    "Response to evaluate:\n" + modelOutput + "\n\n" +
    "Score the response from 1-5 on each criterion:\n" +
    judgingCriteria + "\n\n" +
    "Respond with JSON only: {\"scores\": {\"criterion\": score}, \"explanation\": \"...\"}";

  return client.messages.create({
    model: "claude-haiku-3-5-20241022",
    max_tokens: 512,
    messages: [{ role: "user", content: judgePrompt }]
  }).then(function(response) {
    return JSON.parse(response.content[0].text);
  });
}

Model Fallback Chains

APIs go down. Rate limits get hit. Models get deprecated. A fallback chain keeps your system running when your primary model is unavailable.

function createFallbackChain(models) {
  return function callWithFallback(params, modelIndex) {
    modelIndex = modelIndex || 0;

    if (modelIndex >= models.length) {
      return Promise.reject(new Error("All models in fallback chain failed"));
    }

    var modelConfig = models[modelIndex];
    var client = modelConfig.client;

    return client.messages.create(
      Object.assign({}, params, { model: modelConfig.model })
    ).catch(function(err) {
      console.error(
        "Model " + modelConfig.model + " failed: " + err.message +
        ". Falling back to next model."
      );

      // Don't fall back on invalid requests; treat other failures (429, 5xx, timeouts) as transient
      if (err.status === 400) {
        return Promise.reject(err); // Bad request, don't retry
      }

      return callWithFallback(params, modelIndex + 1);
    });
  };
}

// Usage
var Anthropic = require("@anthropic-ai/sdk");
var OpenAI = require("openai");

var anthropic = new Anthropic();
var openai = new OpenAI();

var chain = createFallbackChain([
  { client: anthropic, model: "claude-sonnet-4-20250514" },
  { client: anthropic, model: "claude-haiku-3-5-20241022" },
  // Cross-provider fallback requires an adapter (see the sketch below)
]);

The order matters. Put your preferred model first, a same-tier alternative second, and a cheaper fallback last. Cross-provider fallback adds resilience against full-provider outages but requires normalizing request/response formats between APIs.
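
Cross-provider fallback works once each entry in the chain exposes the same call shape. A sketch of an adapter that wraps the OpenAI SDK to look like the Anthropic messages call used above (it maps only the fields the chain needs):

var OpenAI = require("openai");
var openaiClient = new OpenAI();

function makeOpenAIAdapter(client) {
  return {
    messages: {
      create: function(params) {
        return client.chat.completions.create({
          model: params.model,
          max_tokens: params.max_tokens,
          messages: params.messages
        }).then(function(completion) {
          // Normalize to the Anthropic-style response shape the chain expects
          return {
            content: [{ type: "text", text: completion.choices[0].message.content }],
            usage: {
              input_tokens: completion.usage.prompt_tokens,
              output_tokens: completion.usage.completion_tokens
            }
          };
        });
      }
    }
  };
}

// Appended as the last entry in the chain above:
// { client: makeOpenAIAdapter(openaiClient), model: "gpt-4o-mini" }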

Keeping Up with New Model Releases

New models drop every few weeks. Here is how I stay current without drowning:

  1. Subscribe to provider changelogs. Anthropic and OpenAI both publish release notes. Skim them when new models launch.
  2. Re-run your benchmark suite. When a new model appears in the same tier as one you use, run your test suite against it. This takes minutes and gives you concrete upgrade data.
  3. Abstract your model references. Never hardcode model names throughout your codebase. Use a config file or environment variables so swapping models is a one-line change (a minimal sketch follows this list).
  4. Set calendar reminders. Check quarterly whether your model choices still make sense. Cost and performance ratios shift with every release.
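
For point 3, the indirection layer can be a few lines (the file name and environment variable names here are arbitrary):

// models.js: logical names only; the rest of the codebase never sees a raw model ID
var MODELS = {
  fast: process.env.MODEL_FAST || "claude-3-5-haiku-20241022",
  balanced: process.env.MODEL_BALANCED || "claude-sonnet-4-20250514",
  powerful: process.env.MODEL_POWERFUL || "claude-opus-4-20250514"
};

module.exports = MODELS;

// Elsewhere:
// var MODELS = require("./models");
// client.messages.create({ model: MODELS.balanced, ... });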

Local and Self-Hosted Models vs API Models

The build-vs-buy decision for LLMs comes down to a handful of factors:

Choose API models when:

  • You need state-of-the-art quality
  • Your volume is low to moderate (under ~10M tokens/day)
  • You want zero infrastructure management
  • You need the latest models immediately

Choose self-hosted models when:

  • Data cannot leave your network (compliance, regulation)
  • You need to fine-tune on proprietary data
  • Your volume is very high and sustained
  • You need deterministic pricing regardless of usage spikes

Self-hosted models like Llama 3 running on your own GPUs can match API models on specific tasks after fine-tuning, but they require significant engineering investment in serving infrastructure, monitoring, and updates.

Complete Working Example: Model Router with Cost Tracking

Here is a production-ready model router that classifies requests, routes to the appropriate model, tracks costs, and supports fallback.

var Anthropic = require("@anthropic-ai/sdk");
var express = require("express");

// ============================================================
// Configuration
// ============================================================

var MODEL_CONFIG = {
  simple: {
    model: "claude-haiku-3-5-20241022",
    maxTokens: 1024,
    inputCostPer1M: 0.25,
    outputCostPer1M: 1.25
  },
  medium: {
    model: "claude-sonnet-4-20250514",
    maxTokens: 2048,
    inputCostPer1M: 3.00,
    outputCostPer1M: 15.00
  },
  complex: {
    model: "claude-opus-4-20250514",
    maxTokens: 4096,
    inputCostPer1M: 15.00,
    outputCostPer1M: 75.00
  }
};

var FALLBACK_ORDER = ["simple", "medium", "complex"];

// ============================================================
// Cost Tracker
// ============================================================

function CostTracker() {
  this.requests = [];
  this.totalCost = 0;
  this.costByModel = {};
  this.costByComplexity = {};
  this.requestCount = 0;
}

CostTracker.prototype.record = function(entry) {
  var inputCost = (entry.inputTokens / 1000000) * entry.config.inputCostPer1M;
  var outputCost = (entry.outputTokens / 1000000) * entry.config.outputCostPer1M;
  var totalCost = inputCost + outputCost;

  var record = {
    timestamp: new Date().toISOString(),
    model: entry.config.model,
    complexity: entry.complexity,
    inputTokens: entry.inputTokens,
    outputTokens: entry.outputTokens,
    cost: totalCost,
    latencyMs: entry.latencyMs
  };

  this.requests.push(record);
  this.totalCost += totalCost;
  this.requestCount += 1;

  if (!this.costByModel[record.model]) {
    this.costByModel[record.model] = { cost: 0, count: 0 };
  }
  this.costByModel[record.model].cost += totalCost;
  this.costByModel[record.model].count += 1;

  if (!this.costByComplexity[record.complexity]) {
    this.costByComplexity[record.complexity] = { cost: 0, count: 0 };
  }
  this.costByComplexity[record.complexity].cost += totalCost;
  this.costByComplexity[record.complexity].count += 1;

  return record;
};

CostTracker.prototype.getSummary = function() {
  return {
    totalRequests: this.requestCount,
    totalCost: "$" + this.totalCost.toFixed(6),
    costByModel: this.costByModel,
    costByComplexity: this.costByComplexity,
    averageCostPerRequest: "$" + (this.totalCost / (this.requestCount || 1)).toFixed(6),
    recentRequests: this.requests.slice(-10)
  };
};

// ============================================================
// Complexity Classifier
// ============================================================

function classifyComplexity(prompt, options) {
  options = options || {};

  // Allow manual override
  if (options.forceComplexity) {
    return options.forceComplexity;
  }

  var wordCount = prompt.split(/\s+/).length;
  var score = 0;

  // Length signals
  if (wordCount > 500) score += 2;
  else if (wordCount > 200) score += 1;

  // Complex task indicators
  var complexPatterns = [
    /write.*(?:code|function|class|module|application)/i,
    /analyze.*(?:and|then|also)/i,
    /compare.*(?:and|versus|vs)/i,
    /explain.*(?:why|how|reasoning|trade-?off)/i,
    /design.*(?:system|architecture|schema)/i,
    /debug|refactor|optimize/i,
    /step.by.step.*(?:plan|guide|tutorial)/i
  ];

  complexPatterns.forEach(function(pattern) {
    if (pattern.test(prompt)) score += 2;
  });

  // Simple task indicators
  var simplePatterns = [
    /^classify|^categorize|^label/i,
    /^extract.*(?:from|the)/i,
    /^convert.*(?:to|into)/i,
    /^translate/i,
    /^(?:is|does|can|will)\s/i,
    /yes.or.no/i,
    /true.or.false/i,
    /^summarize\s(?:this|the)\s(?:sentence|title|name)/i
  ];

  simplePatterns.forEach(function(pattern) {
    if (pattern.test(prompt)) score -= 1;
  });

  // Multi-question signals
  var questionCount = (prompt.match(/\?/g) || []).length;
  if (questionCount > 3) score += 1;

  // Classify based on score
  if (score <= 0) return "simple";
  if (score <= 3) return "medium";
  return "complex";
}

// ============================================================
// Model Router
// ============================================================

function ModelRouter(options) {
  options = options || {};
  this.client = new Anthropic();
  this.tracker = new CostTracker();
  this.config = MODEL_CONFIG;
}

ModelRouter.prototype.route = function(prompt, options) {
  options = options || {};
  var self = this;
  var complexity = classifyComplexity(prompt, options);
  var config = self.config[complexity];
  var start = Date.now();

  console.log(
    "[Router] Classified as " + complexity +
    " -> routing to " + config.model
  );

  return self._callWithFallback(prompt, complexity, config, options)
    .then(function(result) {
      var latencyMs = Date.now() - start;
      var record = self.tracker.record({
        config: result.config,
        complexity: complexity,
        inputTokens: result.response.usage.input_tokens,
        outputTokens: result.response.usage.output_tokens,
        latencyMs: latencyMs
      });

      return {
        text: result.response.content[0].text,
        model: result.config.model,
        complexity: complexity,
        cost: record.cost,
        latencyMs: latencyMs,
        inputTokens: result.response.usage.input_tokens,
        outputTokens: result.response.usage.output_tokens
      };
    });
};

ModelRouter.prototype._callWithFallback = function(prompt, complexity, config, options) {
  var self = this;
  var systemPrompt = options.system || undefined;

  var messages = [{ role: "user", content: prompt }];

  var requestParams = {
    model: config.model,
    max_tokens: options.maxTokens || config.maxTokens,
    messages: messages
  };

  if (systemPrompt) {
    requestParams.system = systemPrompt;
  }

  return self.client.messages.create(requestParams)
    .then(function(response) {
      return { response: response, config: config };
    })
    .catch(function(err) {
      console.error(
        "[Router] " + config.model + " failed (" + (err.status || "unknown") +
        "): " + err.message
      );

      // Find the next tier to fall back to
      var currentIndex = FALLBACK_ORDER.indexOf(complexity);
      if (err.status !== 400 && currentIndex < FALLBACK_ORDER.length - 1) {
        var nextComplexity = FALLBACK_ORDER[currentIndex + 1];
        var nextConfig = self.config[nextComplexity];
        console.log("[Router] Falling back to " + nextConfig.model);
        return self._callWithFallback(prompt, nextComplexity, nextConfig, options);
      }

      return Promise.reject(err);
    });
};

ModelRouter.prototype.getCostSummary = function() {
  return this.tracker.getSummary();
};

// ============================================================
// Express API Server
// ============================================================

function createServer(router) {
  var app = express();
  app.use(express.json());

  // Main routing endpoint
  app.post("/api/generate", function(req, res) {
    var prompt = req.body.prompt;
    var options = {
      system: req.body.system || undefined,
      maxTokens: req.body.maxTokens || undefined,
      forceComplexity: req.body.complexity || undefined
    };

    if (!prompt) {
      return res.status(400).json({ error: "prompt is required" });
    }

    router.route(prompt, options)
      .then(function(result) {
        res.json({
          success: true,
          text: result.text,
          metadata: {
            model: result.model,
            complexity: result.complexity,
            cost: "$" + result.cost.toFixed(6),
            latencyMs: result.latencyMs,
            inputTokens: result.inputTokens,
            outputTokens: result.outputTokens
          }
        });
      })
      .catch(function(err) {
        console.error("[Server] Error:", err.message);
        res.status(500).json({
          success: false,
          error: err.message
        });
      });
  });

  // Cost dashboard endpoint
  app.get("/api/costs", function(req, res) {
    res.json(router.getCostSummary());
  });

  // Health check
  app.get("/api/health", function(req, res) {
    res.json({ status: "ok", models: Object.keys(MODEL_CONFIG) });
  });

  return app;
}

// ============================================================
// Start the server
// ============================================================

var router = new ModelRouter();
var app = createServer(router);
var port = process.env.PORT || 3000;

app.listen(port, function() {
  console.log("Model router listening on port " + port);
});

module.exports = {
  ModelRouter: ModelRouter,
  CostTracker: CostTracker,
  classifyComplexity: classifyComplexity,
  createServer: createServer
};

Testing the Router

# Simple request → Haiku
curl -X POST http://localhost:3000/api/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Classify this as positive or negative: I love this product"}'

# Complex request → Opus
curl -X POST http://localhost:3000/api/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Design a distributed caching system that handles cache invalidation across 5 regions with eventual consistency. Include the data flow, failure modes, and explain the tradeoffs between different consistency models."}'

# Force a specific tier
curl -X POST http://localhost:3000/api/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello world", "complexity": "complex"}'

# Check cost summary
curl http://localhost:3000/api/costs

Common Issues and Troubleshooting

1. Rate Limit Errors Across Tiers

Error: 429 Too Many Requests
{"type":"error","error":{"type":"rate_limit_error","message":"Number of request tokens has exceeded your per-minute rate limit"}}

Each model tier has its own rate limits, and your fallback chain can burn through limits on the fallback model too. Implement exponential backoff with jitter and respect the retry-after header:

function retryWithBackoff(fn, maxRetries, baseDelay) {
  maxRetries = maxRetries || 3;
  baseDelay = baseDelay || 1000;

  return function attempt(retryCount) {
    retryCount = retryCount || 0;
    return fn().catch(function(err) {
      if (retryCount >= maxRetries || err.status !== 429) {
        return Promise.reject(err);
      }
      var retryAfter = err.headers && err.headers["retry-after"];
      var delay = retryAfter
        ? parseInt(retryAfter, 10) * 1000
        : baseDelay * Math.pow(2, retryCount) + Math.random() * 1000;

      console.log("[Retry] Waiting " + Math.round(delay) + "ms before retry " + (retryCount + 1));
      return new Promise(function(resolve) {
        setTimeout(resolve, delay);
      }).then(function() {
        return attempt(retryCount + 1);
      });
    });
  }();
}

2. Context Length Exceeded

Error: 400 Bad Request
{"type":"error","error":{"type":"invalid_request_error","message":"prompt is too long: 204831 tokens > 200000 maximum"}}

Different models have different context windows. Claude Haiku, Sonnet, and Opus all currently support 200K tokens, and your input length plus your max_tokens setting must fit within that window. Add a pre-check:

function estimateTokens(text) {
  // Rough estimate: 1 token per 4 characters for English
  return Math.ceil(text.length / 4);
}

function validateContextFit(prompt, config) {
  var estimatedInput = estimateTokens(prompt);
  var maxContext = 200000; // Adjust per model
  if (estimatedInput + config.maxTokens > maxContext) {
    return {
      fits: false,
      estimatedInput: estimatedInput,
      available: maxContext - config.maxTokens
    };
  }
  return { fits: true };
}

3. Inconsistent JSON Output from Smaller Models

SyntaxError: Unexpected token 'H' at position 0
  -- model returned: "Here is the JSON:\n{\"category\": \"billing\"}"

Smaller models are more likely to include preamble text before JSON output. Always extract JSON from the response rather than parsing the entire response:

function extractJSON(text) {
  var jsonMatch = text.match(/\{[\s\S]*\}/);
  if (!jsonMatch) {
    jsonMatch = text.match(/\[[\s\S]*\]/);
  }
  if (!jsonMatch) {
    throw new Error("No JSON found in model output: " + text.substring(0, 100));
  }
  return JSON.parse(jsonMatch[0]);
}

4. Cost Tracking Drift from Actual Invoice

Warning: Tracked costs ($142.50) differ from invoice ($158.30) by 11%

Token counting from the API response is exact, but cost calculations drift if you hardcode prices. Model pricing changes, and cached token pricing is different from standard pricing. Pull pricing from a config that you update when providers announce changes, and account for prompt caching discounts:

// Keep pricing in a separate config file that's easy to update
var PRICING_VERSION = "2026-02-01";
var pricing = require("./model-pricing.json");

function calculateCost(model, inputTokens, outputTokens, cachedTokens) {
  var modelPricing = pricing[model];
  if (!modelPricing) {
    console.warn("[Cost] No pricing found for " + model + ", using estimate");
    return 0;
  }
  var inputCost = ((inputTokens - (cachedTokens || 0)) / 1000000) * modelPricing.input;
  var cachedCost = ((cachedTokens || 0) / 1000000) * (modelPricing.cachedInput || modelPricing.input * 0.1);
  var outputCost = (outputTokens / 1000000) * modelPricing.output;
  return inputCost + cachedCost + outputCost;
}

Best Practices

  • Start with the smallest model and scale up. It is always easier to justify upgrading to a more expensive model with data than to explain why you started with the most expensive one. Run your test suite against Haiku first. You will be surprised how often it is sufficient.

  • Never hardcode model identifiers in business logic. Use a configuration layer that maps logical names ("fast", "balanced", "powerful") to specific model versions. When Claude 4.5 Sonnet drops next month, you change one config file instead of grepping through your entire codebase.

  • Implement cost alerting before you need it. Set up daily cost summaries and alerts that fire when spending exceeds thresholds. A misconfigured loop that sends Opus requests can burn through hundreds of dollars in minutes. Catch it early (a minimal budget check is sketched after this list).

  • Cache aggressively for repeated or similar prompts. If your system asks the same classification question for slight variations of input, cache the results. Anthropic's prompt caching reduces costs on the API side, but application-level caching eliminates the call entirely.

  • Use structured output modes when available. Constraining the model's output format reduces token waste and makes parsing more reliable. Both Anthropic and OpenAI support tool use / function calling patterns that enforce output structure. This is especially valuable with smaller models where unconstrained output is less predictable.

  • Separate your evaluation data from your tuning data. When you optimize prompts based on test case results, those test cases stop being valid evaluation data. Maintain a held-out set of examples that you only use for final validation, never for prompt engineering.

  • Log full request-response pairs for debugging. When a model produces bad output, you need to see exactly what went in and what came out. Store these logs with appropriate retention and access controls. They are invaluable for diagnosing regressions after prompt changes.

  • Plan for model deprecation from day one. Every model you use today will eventually be sunset. Build your system so that swapping models is a configuration change, not a refactoring project. Your router abstraction layer is your insurance policy against deprecation notices.
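
For the cost-alerting point above, a simple periodic check against the CostTracker from the full example is enough to start with (the budget threshold and notifier are placeholders):

var DAILY_BUDGET_USD = 50; // placeholder threshold

function checkSpend(tracker, notify) {
  var summary = tracker.getSummary();
  var spent = parseFloat(summary.totalCost.replace("$", ""));
  if (spent > DAILY_BUDGET_USD) {
    // notify() could post to Slack, email, or a pager; use whatever you already monitor
    notify("LLM spend is $" + spent.toFixed(2) + ", over the $" + DAILY_BUDGET_USD + " budget");
  }
}

// Check hourly against the router's tracker; snapshot or reset it daily
// so the threshold stays meaningful
setInterval(function() {
  checkSpend(router.tracker, console.warn);
}, 60 * 60 * 1000);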
