OpenAI API Mastery for Production Applications
Production guide to the OpenAI API covering chat completions, streaming, function calling, embeddings, and cost management in Node.js.
Overview
The OpenAI API is the most widely adopted large language model API in production software today, powering everything from chatbots and content generation to code analysis and semantic search. This guide covers the full surface area of the API from a production engineering perspective — not toy demos, but real patterns you will use when building systems that need to be reliable, cost-effective, and performant. We will work through the Node.js SDK, covering chat completions, streaming, function calling, embeddings, vision, fine-tuning, and the operational concerns (cost tracking, error handling, rate limiting) that separate a prototype from a production service.
Prerequisites
- Node.js v18 or later installed
- An OpenAI API key with billing configured (free tier is too restrictive for production work)
- Basic familiarity with REST APIs and async JavaScript
- A working understanding of what large language models do (you do not need ML expertise)
Install the official SDK:
npm install openai
At the time of writing, we are using openai version 4.x, which is a significant rewrite from the 3.x series. If you are migrating from the old openai npm package, the API surface has changed substantially.
Setting Up the OpenAI Node.js SDK
The SDK is straightforward to initialize. The only required configuration is your API key.
var OpenAI = require("openai");
var client = new OpenAI({
apiKey: process.env.OPENAI_API_KEY
});
That is all you need for basic usage. However, in production you will often want more control:
var OpenAI = require("openai");
var client = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
organization: process.env.OPENAI_ORG_ID,
timeout: 60000, // 60 second timeout
maxRetries: 3, // built-in retry logic
defaultHeaders: {
"X-Request-Source": "my-production-service"
}
});
The organization parameter matters when your API key has access to multiple organizations. Every API call gets billed to the organization, so setting this explicitly avoids surprise charges to the wrong account.
Project-Based Access
OpenAI now supports project-level API keys. If your company runs multiple products off the same organization, create separate projects in the OpenAI dashboard and generate project-scoped keys. This gives you per-project usage tracking and spending limits without needing separate organizations.
var client = new OpenAI({
apiKey: process.env.OPENAI_PROJECT_API_KEY,
project: process.env.OPENAI_PROJECT_ID
});
Chat Completions API
The chat completions endpoint is the workhorse of the API. Every interaction is modeled as a conversation with three message roles:
- system: Sets the behavior and personality of the model. Processed first and carries high weight.
- user: The human input or the prompt you are sending.
- assistant: Previous model responses (for multi-turn conversations).
var OpenAI = require("openai");
var client = new OpenAI();
async function generateSummary(text) {
var response = await client.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: "You are a technical writer. Summarize the given text in 2-3 sentences. Be precise and avoid filler."
},
{
role: "user",
content: "Summarize this:\n\n" + text
}
],
temperature: 0.3,
max_tokens: 200
});
return response.choices[0].message.content;
}
Temperature and Top_p
These two parameters control randomness in the output. They serve similar purposes, so OpenAI recommends changing one but not both.
| Parameter | Range | Effect |
|---|---|---|
| temperature | 0.0 – 2.0 | Lower = more deterministic, higher = more creative |
| top_p | 0.0 – 1.0 | Nucleus sampling; 0.1 means only the top 10% of probability mass is considered |
My rules of thumb after running these in production for two years:
- Structured extraction / classification: temperature: 0 — you want the same answer every time.
- Summarization / rewriting: temperature: 0.3 — slight variation is fine but you want consistency.
- Creative content / brainstorming: temperature: 0.8–1.0 — let the model explore.
- Never go above 1.2 unless you specifically want chaotic output.
Multi-Turn Conversations
For chat applications, you pass the full conversation history with each request. The API is stateless — it does not remember previous calls.
var conversationHistory = [
{ role: "system", content: "You are a helpful Node.js expert." }
];
async function chat(userMessage) {
conversationHistory.push({ role: "user", content: userMessage });
var response = await client.chat.completions.create({
model: "gpt-4o",
messages: conversationHistory,
temperature: 0.7
});
var assistantMessage = response.choices[0].message;
conversationHistory.push(assistantMessage);
return assistantMessage.content;
}
Warning: conversation history grows with every turn, and you are paying for every token in that history on every request. In production, implement a sliding window or summarization strategy to keep context size manageable.
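One approach is a sliding window that always keeps the system message and drops the oldest turns. This is a minimal sketch, assuming a fixed message budget; a token-based budget (see the tiktoken section later) is more precise.
// Keep the system message(s) plus the last N non-system messages.
function trimHistory(history, maxMessages) {
  var systemMessages = history.filter(function(m) { return m.role === "system"; });
  var rest = history.filter(function(m) { return m.role !== "system"; });
  if (rest.length <= maxMessages) return history;
  return systemMessages.concat(rest.slice(rest.length - maxMessages));
}

// Before each request:
// conversationHistory = trimHistory(conversationHistory, 20);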
Streaming Responses with Server-Sent Events
For any user-facing application, streaming is non-negotiable. Without it, the user stares at a blank screen for 5-15 seconds while the model generates. With streaming, tokens appear in real-time.
async function streamCompletion(prompt) {
var stream = await client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: prompt }],
stream: true
});
var fullResponse = "";
for await (var chunk of stream) {
var content = chunk.choices[0].delta.content;
if (content) {
process.stdout.write(content);
fullResponse += content;
}
}
console.log("\n\n--- Stream complete ---");
return fullResponse;
}
Streaming Over HTTP with Express
In a real web service, you pipe the stream to the client as server-sent events:
var express = require("express");
var OpenAI = require("openai");
var app = express();
var client = new OpenAI();
app.use(express.json());
app.post("/api/chat", function(req, res) {
res.setHeader("Content-Type", "text/event-stream");
res.setHeader("Cache-Control", "no-cache");
res.setHeader("Connection", "keep-alive");
var messages = req.body.messages || [];
(async function() {
try {
var stream = await client.chat.completions.create({
model: "gpt-4o-mini",
messages: messages,
stream: true
});
for await (var chunk of stream) {
var content = chunk.choices[0].delta.content;
if (content) {
res.write("data: " + JSON.stringify({ content: content }) + "\n\n");
}
}
res.write("data: [DONE]\n\n");
res.end();
} catch (err) {
res.write("data: " + JSON.stringify({ error: err.message }) + "\n\n");
res.end();
}
})();
});
The client consumes this with the standard EventSource API or a fetch-based SSE reader. Time to first token is typically 200-500ms for GPT-4o-mini and 300-800ms for GPT-4o.
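Note that EventSource only supports GET, so a POST endpoint like the one above is usually consumed with a fetch-based reader. A browser-side sketch (the /api/chat path matches the route above; everything else is illustrative):
// Read the SSE stream produced by the Express route above.
async function readChatStream(messages, onToken) {
  var response = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages: messages })
  });
  var reader = response.body.getReader();
  var decoder = new TextDecoder();
  var buffer = "";
  while (true) {
    var result = await reader.read();
    if (result.done) break;
    buffer += decoder.decode(result.value, { stream: true });
    var events = buffer.split("\n\n");
    buffer = events.pop(); // keep any partial event for the next chunk
    for (var i = 0; i < events.length; i++) {
      var line = events[i];
      if (line.indexOf("data: ") !== 0) continue;
      var payload = line.slice(6);
      if (payload === "[DONE]") return;
      var parsed = JSON.parse(payload);
      if (parsed.content) onToken(parsed.content);
    }
  }
}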
Function Calling and Tool Use Patterns
Function calling is one of the most powerful features in the API. It lets the model decide when to invoke external functions you define, effectively giving the LLM the ability to interact with your application's business logic.
var tools = [
{
type: "function",
function: {
name: "get_weather",
description: "Get the current weather for a location",
parameters: {
type: "object",
properties: {
location: {
type: "string",
description: "City and state, e.g. 'San Francisco, CA'"
},
unit: {
type: "string",
enum: ["celsius", "fahrenheit"],
description: "Temperature unit"
}
},
required: ["location"]
}
}
},
{
type: "function",
function: {
name: "search_database",
description: "Search the product database by name or category",
parameters: {
type: "object",
properties: {
query: { type: "string", description: "Search query" },
category: { type: "string", description: "Product category filter" },
limit: { type: "integer", description: "Max results to return" }
},
required: ["query"]
}
}
}
];
async function runWithTools(userMessage) {
var messages = [{ role: "user", content: userMessage }];
var response = await client.chat.completions.create({
model: "gpt-4o",
messages: messages,
tools: tools,
tool_choice: "auto"
});
var message = response.choices[0].message;
// Check if the model wants to call functions
if (message.tool_calls) {
messages.push(message);
for (var i = 0; i < message.tool_calls.length; i++) {
var toolCall = message.tool_calls[i];
var args = JSON.parse(toolCall.function.arguments);
var result;
// Dispatch to actual function implementations
if (toolCall.function.name === "get_weather") {
result = await fetchWeather(args.location, args.unit);
} else if (toolCall.function.name === "search_database") {
result = await searchProducts(args.query, args.category, args.limit);
}
messages.push({
role: "tool",
tool_call_id: toolCall.id,
content: JSON.stringify(result)
});
}
// Send function results back to the model for final response
var finalResponse = await client.chat.completions.create({
model: "gpt-4o",
messages: messages
});
return finalResponse.choices[0].message.content;
}
return message.content;
}
The model can call multiple functions in a single turn (parallel function calling). This is common when the user asks something like "What is the weather in New York and search for winter jackets" — the model will emit two tool calls in one response.
Forcing Function Calls
Use tool_choice to control behavior:
// Let the model decide (default)
tool_choice: "auto"
// Force a specific function
tool_choice: { type: "function", function: { name: "search_database" } }
// Prevent any function calls
tool_choice: "none"
Structured Output with JSON Mode
When you need the model to return structured data instead of prose, use response_format:
async function extractEntities(text) {
var response = await client.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: "Extract all named entities from the text. Return a JSON object with keys: people (array of strings), organizations (array of strings), locations (array of strings), dates (array of strings)."
},
{ role: "user", content: text }
],
response_format: { type: "json_object" },
temperature: 0
});
return JSON.parse(response.choices[0].message.content);
}
Output for a sample news article:
{
"people": ["Satya Nadella", "Sam Altman"],
"organizations": ["Microsoft", "OpenAI"],
"locations": ["San Francisco", "Redmond"],
"dates": ["January 2026", "Q1 2026"]
}
Important: when using response_format: { type: "json_object" }, you must include the word "JSON" somewhere in your system or user message. The API will reject the request otherwise.
For even stricter output, OpenAI supports structured outputs with JSON schemas:
var response = await client.chat.completions.create({
model: "gpt-4o",
messages: [
{ role: "system", content: "Extract product details from the description." },
{ role: "user", content: productDescription }
],
response_format: {
type: "json_schema",
json_schema: {
name: "product_extraction",
strict: true,
schema: {
type: "object",
properties: {
name: { type: "string" },
price: { type: "number" },
currency: { type: "string" },
features: { type: "array", items: { type: "string" } }
},
required: ["name", "price", "currency", "features"],
additionalProperties: false
}
}
}
});
The strict: true option guarantees the output conforms to your schema. This eliminates the need for post-hoc validation in most cases.
Vision API for Image Understanding
GPT-4o and GPT-4o-mini both support vision. You pass images as part of the message content:
async function analyzeImage(imageUrl) {
var response = await client.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "user",
content: [
{
type: "text",
text: "Describe what you see in this image. If there is text, transcribe it."
},
{
type: "image_url",
image_url: {
url: imageUrl,
detail: "high" // "low", "high", or "auto"
}
}
]
}
],
max_tokens: 500
});
return response.choices[0].message.content;
}
You can also pass base64-encoded images for local files:
var fs = require("fs");
var path = require("path");
async function analyzeLocalImage(filePath) {
var imageBuffer = fs.readFileSync(filePath);
var base64Image = imageBuffer.toString("base64");
var mimeType = path.extname(filePath) === ".png" ? "image/png" : "image/jpeg";
var response = await client.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "user",
content: [
{ type: "text", text: "Analyze this image in detail." },
{
type: "image_url",
image_url: {
url: "data:" + mimeType + ";base64," + base64Image
}
}
]
}
],
max_tokens: 1000
});
return response.choices[0].message.content;
}
The detail parameter controls image processing cost. Use "low" for simple classification tasks (fixed 85 tokens), "high" for OCR and detailed analysis (scales with image resolution, up to ~1600 tokens for the image alone).
Embeddings API for Semantic Search
Embeddings convert text into high-dimensional vectors that capture semantic meaning. Two texts about the same topic will have vectors that are close together, even if they use different words.
async function getEmbedding(text) {
var response = await client.embeddings.create({
model: "text-embedding-3-small",
input: text
});
return response.data[0].embedding; // Array of 1536 floats
}
async function getEmbeddings(texts) {
var response = await client.embeddings.create({
model: "text-embedding-3-small",
input: texts // Pass an array for batch processing
});
return response.data.map(function(item) {
return item.embedding;
});
}
Model Selection for Embeddings
| Model | Dimensions | Price per 1M tokens | Use Case |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02 | Most applications, best cost/performance |
| text-embedding-3-large | 3072 | $0.13 | When you need maximum retrieval accuracy |
| text-embedding-ada-002 | 1536 | $0.10 | Legacy, no reason to use for new projects |
Cosine Similarity for Search
function cosineSimilarity(a, b) {
var dotProduct = 0;
var normA = 0;
var normB = 0;
for (var i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
async function semanticSearch(query, documents) {
var queryEmbedding = await getEmbedding(query);
var docEmbeddings = await getEmbeddings(documents);
var results = documents.map(function(doc, index) {
return {
text: doc,
score: cosineSimilarity(queryEmbedding, docEmbeddings[index])
};
});
results.sort(function(a, b) { return b.score - a.score; });
return results;
}
In production, do not compute cosine similarity in JavaScript. Use a vector database like Pinecone, Weaviate, pgvector, or Qdrant. They handle indexing, approximate nearest neighbor search, and scale to millions of vectors.
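As one example, with pgvector the similarity search moves into SQL. This is a rough sketch, assuming a documents table with a vector(1536) embedding column already populated and the pg client from npm; one minus cosine distance gives the similarity score.
// Sketch only. Assumes `CREATE EXTENSION vector;` and a table like:
//   CREATE TABLE documents (id serial PRIMARY KEY, text text, embedding vector(1536));
var { Pool } = require("pg");
var pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function semanticSearchPg(query, limit) {
  var queryEmbedding = await getEmbedding(query); // helper from the embeddings section above
  // pgvector's <=> operator is cosine distance; lower means more similar
  var result = await pool.query(
    "SELECT id, text, 1 - (embedding <=> $1::vector) AS score " +
    "FROM documents ORDER BY embedding <=> $1::vector LIMIT $2",
    [JSON.stringify(queryEmbedding), limit || 10]
  );
  return result.rows;
}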
Fine-Tuning Models for Domain-Specific Tasks
When prompt engineering is not enough — maybe you need a specific output format, domain terminology, or a tone that is hard to describe in a system prompt — fine-tuning is the answer.
Preparing Training Data
Training data is JSONL format, one conversation per line:
{"messages":[{"role":"system","content":"You are a medical coding assistant."},{"role":"user","content":"Patient presents with acute bronchitis"},{"role":"assistant","content":"ICD-10: J20.9 - Acute bronchitis, unspecified"}]}
{"messages":[{"role":"system","content":"You are a medical coding assistant."},{"role":"user","content":"Type 2 diabetes with diabetic chronic kidney disease"},{"role":"assistant","content":"ICD-10: E11.22 - Type 2 diabetes mellitus with diabetic chronic kidney disease"}]}
You need a minimum of 10 examples, but 50-100 well-curated examples typically produce strong results.
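A malformed example wastes a training run, so it is worth a quick pre-flight check before uploading. A minimal sketch that only verifies each line parses and has a plausible messages array:
var fs = require("fs");

// Rough sanity check for a fine-tuning JSONL file.
function validateTrainingFile(filePath) {
  var lines = fs.readFileSync(filePath, "utf8").split("\n").filter(Boolean);
  var errors = [];
  lines.forEach(function(line, index) {
    try {
      var example = JSON.parse(line);
      if (!Array.isArray(example.messages) || example.messages.length < 2) {
        errors.push("Line " + (index + 1) + ": expected a messages array with at least a user and assistant turn");
      }
    } catch (e) {
      errors.push("Line " + (index + 1) + ": invalid JSON");
    }
  });
  return { examples: lines.length, errors: errors };
}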
Creating a Fine-Tuning Job
var fs = require("fs");
async function startFineTuning() {
// Step 1: Upload training file
var file = await client.files.create({
file: fs.createReadStream("training_data.jsonl"),
purpose: "fine-tune"
});
console.log("File uploaded:", file.id);
// Step 2: Create fine-tuning job
var job = await client.fineTuning.jobs.create({
training_file: file.id,
model: "gpt-4o-mini-2024-07-18",
hyperparameters: {
n_epochs: 3
}
});
console.log("Fine-tuning job started:", job.id);
return job;
}
Fine-tuning GPT-4o-mini is significantly cheaper than GPT-4o and is the right starting point for most use cases. Training typically takes 15-45 minutes for small datasets.
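Jobs run asynchronously, so you typically poll for completion. A minimal polling loop using the SDK's fineTuning.jobs.retrieve:
// Poll the fine-tuning job until it succeeds or fails.
async function waitForFineTune(jobId) {
  while (true) {
    var job = await client.fineTuning.jobs.retrieve(jobId);
    console.log("Status:", job.status);
    if (job.status === "succeeded") {
      return job.fine_tuned_model; // e.g. "ft:gpt-4o-mini-2024-07-18:..."
    }
    if (job.status === "failed" || job.status === "cancelled") {
      throw new Error("Fine-tuning ended with status: " + job.status);
    }
    // Check every 60 seconds; training usually takes tens of minutes
    await new Promise(function(r) { setTimeout(r, 60000); });
  }
}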
Using a Fine-Tuned Model
var response = await client.chat.completions.create({
model: "ft:gpt-4o-mini-2024-07-18:my-org:medical-coder:abc123",
messages: [
{ role: "system", content: "You are a medical coding assistant." },
{ role: "user", content: "Congestive heart failure" }
]
});
Managing Costs with Token Counting and Model Selection
Cost management is one of the biggest operational concerns with the OpenAI API. Without guardrails, a single runaway feature can generate a five-figure bill in a week.
Current Pricing (as of early 2026)
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| GPT-4 Turbo | $10.00 | $30.00 |
GPT-4o-mini is roughly 17x cheaper than GPT-4o on input and output. For many production tasks — classification, extraction, summarization of short texts, simple Q&A — GPT-4o-mini performs comparably to GPT-4o. I default to GPT-4o-mini and only upgrade to GPT-4o when the task demonstrably requires it.
Token Counting Before Sending
Use the tiktoken library to count tokens before making API calls:
npm install tiktoken
var tiktoken = require("tiktoken");
var encoder = tiktoken.encoding_for_model("gpt-4o");
function countTokens(text) {
var tokens = encoder.encode(text);
return tokens.length;
}
function estimateCost(inputText, estimatedOutputTokens, model) {
var inputTokens = countTokens(inputText);
var pricing = {
"gpt-4o": { input: 2.50 / 1000000, output: 10.00 / 1000000 },
"gpt-4o-mini": { input: 0.15 / 1000000, output: 0.60 / 1000000 }
};
var rates = pricing[model] || pricing["gpt-4o-mini"];
var cost = (inputTokens * rates.input) + (estimatedOutputTokens * rates.output);
return {
inputTokens: inputTokens,
estimatedOutputTokens: estimatedOutputTokens,
estimatedCost: "$" + cost.toFixed(6),
model: model
};
}
// Example usage
var estimate = estimateCost("Explain quantum computing in 500 words.", 700, "gpt-4o");
console.log(estimate);
// { inputTokens: 9, estimatedOutputTokens: 700, estimatedCost: "$0.007023", model: "gpt-4o" }
Tracking Usage from API Responses
Every API response includes a usage object:
var response = await client.chat.completions.create({
model: "gpt-4o-mini",
messages: [{ role: "user", content: "Hello" }]
});
console.log(response.usage);
// {
// prompt_tokens: 8,
// completion_tokens: 12,
// total_tokens: 20
// }
In production, log this for every request. Build dashboards. Set alerts when daily spend exceeds thresholds.
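The exact tooling is up to you. As a sketch of the alerting side, a simple in-process guard works; the budget, reset strategy, and alert hook here are placeholders:
// Hypothetical daily spend guard. costPerRequest comes from the estimateCost
// logic above (token counts times the per-token rates).
var dailySpend = 0;
var DAILY_BUDGET_USD = 50; // arbitrary threshold; set your own

function recordSpend(costPerRequest) {
  dailySpend += costPerRequest;
  if (dailySpend > DAILY_BUDGET_USD) {
    console.error("ALERT: daily OpenAI spend exceeded $" + DAILY_BUDGET_USD);
    // hook in Slack / PagerDuty / email here
  }
}

// Reset dailySpend at midnight, e.g. via a cron job or setInterval.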
Error Handling and Automatic Retries
The SDK has built-in retry logic, but you should understand the error types to handle them appropriately.
var OpenAI = require("openai");
var client = new OpenAI({
maxRetries: 3 // retries on 429, 500, 503 with exponential backoff
});
async function safeCompletion(messages, options) {
options = options || {};
var model = options.model || "gpt-4o-mini";
var maxTokens = options.maxTokens || 1000;
try {
var response = await client.chat.completions.create({
model: model,
messages: messages,
max_tokens: maxTokens
});
return {
success: true,
content: response.choices[0].message.content,
usage: response.usage,
finishReason: response.choices[0].finish_reason
};
} catch (error) {
if (error instanceof OpenAI.APIError) {
switch (error.status) {
case 400:
console.error("Bad request:", error.message);
return { success: false, error: "invalid_request", message: error.message };
case 401:
console.error("Authentication failed");
return { success: false, error: "auth_failed", message: "Invalid API key" };
case 403:
console.error("Permission denied — check organization/project access");
return { success: false, error: "forbidden", message: error.message };
case 429:
console.error("Rate limited — SDK retries exhausted");
return { success: false, error: "rate_limited", message: "Please try again later" };
case 500:
case 503:
console.error("OpenAI service error:", error.status);
return { success: false, error: "service_error", message: "OpenAI is experiencing issues" };
default:
console.error("Unexpected API error:", error.status, error.message);
return { success: false, error: "unknown", message: error.message };
}
}
// Network errors, timeouts, etc.
if (error instanceof OpenAI.APIConnectionError) {
console.error("Connection failed:", error.message);
return { success: false, error: "connection_failed", message: "Could not reach OpenAI" };
}
throw error; // Unknown error, let it propagate
}
}
Rate Limiting with Tiered Access
OpenAI uses a tier system based on your spending history:
| Tier | Usage Threshold | GPT-4o RPM | GPT-4o TPM |
|---|---|---|---|
| Free | $0 | 3 | 200 |
| Tier 1 | $5 | 500 | 30,000 |
| Tier 2 | $50 | 5,000 | 450,000 |
| Tier 3 | $100 | 5,000 | 800,000 |
| Tier 4 | $250 | 10,000 | 2,000,000 |
| Tier 5 | $1,000 | 10,000 | 30,000,000 |
RPM = Requests Per Minute, TPM = Tokens Per Minute. For production applications, you want to be at least Tier 3. If you are at Tier 1, batch processing large datasets will be painfully slow due to the 500 RPM limit.
To manage rate limits in your application:
var requestQueue = [];
var isProcessing = false;
var requestsThisMinute = 0;
var maxRequestsPerMinute = 500; // Tier 1 limit
function resetCounter() {
requestsThisMinute = 0;
}
setInterval(resetCounter, 60000);
async function rateLimitedRequest(messages, model) {
return new Promise(function(resolve, reject) {
requestQueue.push({ messages: messages, model: model, resolve: resolve, reject: reject });
processQueue();
});
}
async function processQueue() {
if (isProcessing || requestQueue.length === 0) return;
isProcessing = true;
while (requestQueue.length > 0) {
if (requestsThisMinute >= maxRequestsPerMinute) {
// Back off briefly, then re-check the counter (it resets every 60 seconds)
await new Promise(function(r) { setTimeout(r, 5000); });
continue;
}
var item = requestQueue.shift();
requestsThisMinute++;
try {
var result = await client.chat.completions.create({
model: item.model,
messages: item.messages
});
item.resolve(result);
} catch (err) {
item.reject(err);
}
}
isProcessing = false;
}
Complete Working Example: Production OpenAI Service
Here is a complete, production-ready Node.js service that wraps the OpenAI API with streaming, function calling, automatic retries, and cost tracking.
// openai-service.js
var OpenAI = require("openai");
var EventEmitter = require("events");
function OpenAIService(config) {
config = config || {};
this.client = new OpenAI({
apiKey: config.apiKey || process.env.OPENAI_API_KEY,
organization: config.organization || process.env.OPENAI_ORG_ID,
timeout: config.timeout || 60000,
maxRetries: config.maxRetries || 3
});
this.defaultModel = config.defaultModel || "gpt-4o-mini";
// Cost tracking
this.usage = {
totalInputTokens: 0,
totalOutputTokens: 0,
totalRequests: 0,
totalCost: 0,
byModel: {}
};
this.pricing = {
"gpt-4o": { input: 2.50 / 1000000, output: 10.00 / 1000000 },
"gpt-4o-mini": { input: 0.15 / 1000000, output: 0.60 / 1000000 }
};
}
OpenAIService.prototype.trackUsage = function(model, usage) {
var rates = this.pricing[model] || this.pricing["gpt-4o-mini"];
var cost = (usage.prompt_tokens * rates.input) + (usage.completion_tokens * rates.output);
this.usage.totalInputTokens += usage.prompt_tokens;
this.usage.totalOutputTokens += usage.completion_tokens;
this.usage.totalRequests += 1;
this.usage.totalCost += cost;
if (!this.usage.byModel[model]) {
this.usage.byModel[model] = { requests: 0, inputTokens: 0, outputTokens: 0, cost: 0 };
}
this.usage.byModel[model].requests += 1;
this.usage.byModel[model].inputTokens += usage.prompt_tokens;
this.usage.byModel[model].outputTokens += usage.completion_tokens;
this.usage.byModel[model].cost += cost;
};
OpenAIService.prototype.getUsageReport = function() {
return {
totalRequests: this.usage.totalRequests,
totalTokens: this.usage.totalInputTokens + this.usage.totalOutputTokens,
totalCost: "$" + this.usage.totalCost.toFixed(4),
byModel: this.usage.byModel
};
};
OpenAIService.prototype.complete = async function(messages, options) {
options = options || {};
var model = options.model || this.defaultModel;
try {
var requestParams = {
model: model,
messages: messages,
temperature: options.temperature !== undefined ? options.temperature : 0.7,
max_tokens: options.maxTokens || 1000
};
if (options.tools) {
requestParams.tools = options.tools;
requestParams.tool_choice = options.toolChoice || "auto";
}
if (options.responseFormat) {
requestParams.response_format = options.responseFormat;
}
var response = await this.client.chat.completions.create(requestParams);
this.trackUsage(model, response.usage);
return {
success: true,
content: response.choices[0].message.content,
toolCalls: response.choices[0].message.tool_calls || null,
finishReason: response.choices[0].finish_reason,
usage: response.usage
};
} catch (error) {
console.error("[OpenAIService] Error:", error.status || "unknown", error.message);
return {
success: false,
error: error.status || "unknown",
message: error.message
};
}
};
OpenAIService.prototype.stream = async function(messages, options) {
options = options || {};
var model = options.model || this.defaultModel;
var emitter = new EventEmitter();
var self = this;
(async function() {
try {
var stream = await self.client.chat.completions.create({
model: model,
messages: messages,
temperature: options.temperature !== undefined ? options.temperature : 0.7,
max_tokens: options.maxTokens || 1000,
stream: true,
stream_options: { include_usage: true }
});
var fullContent = "";
for await (var chunk of stream) {
if (chunk.choices && chunk.choices[0] && chunk.choices[0].delta.content) {
var content = chunk.choices[0].delta.content;
fullContent += content;
emitter.emit("data", content);
}
// Usage is included in the final chunk when stream_options.include_usage is true
if (chunk.usage) {
self.trackUsage(model, chunk.usage);
emitter.emit("usage", chunk.usage);
}
}
emitter.emit("end", fullContent);
} catch (error) {
emitter.emit("error", error);
}
})();
return emitter;
};
OpenAIService.prototype.completeWithTools = async function(messages, tools, functionMap) {
var model = this.defaultModel;
var maxIterations = 5;
var currentMessages = messages.slice();
for (var i = 0; i < maxIterations; i++) {
var result = await this.complete(currentMessages, { tools: tools });
if (!result.success) return result;
if (result.toolCalls) {
currentMessages.push({
role: "assistant",
tool_calls: result.toolCalls,
content: result.content
});
for (var j = 0; j < result.toolCalls.length; j++) {
var toolCall = result.toolCalls[j];
var fnName = toolCall.function.name;
var args = JSON.parse(toolCall.function.arguments);
var fnResult;
if (functionMap[fnName]) {
fnResult = await functionMap[fnName](args);
} else {
fnResult = { error: "Unknown function: " + fnName };
}
currentMessages.push({
role: "tool",
tool_call_id: toolCall.id,
content: JSON.stringify(fnResult)
});
}
} else {
return result;
}
}
return { success: false, error: "max_iterations", message: "Tool calling loop exceeded max iterations" };
};
module.exports = OpenAIService;
Using the Service
var OpenAIService = require("./openai-service");
var service = new OpenAIService({ defaultModel: "gpt-4o-mini" });
// Simple completion
(async function() {
var result = await service.complete([
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "What is Node.js?" }
], { temperature: 0.3 });
console.log(result.content);
console.log("Tokens used:", result.usage.total_tokens);
// Streaming
var stream = await service.stream([
{ role: "user", content: "Write a haiku about JavaScript." }
]);
stream.on("data", function(chunk) {
process.stdout.write(chunk);
});
stream.on("end", function(fullContent) {
console.log("\n\nFull response:", fullContent);
console.log("\nUsage report:", JSON.stringify(service.getUsageReport(), null, 2));
});
// Function calling
var tools = [
{
type: "function",
function: {
name: "lookup_user",
description: "Look up a user by email address",
parameters: {
type: "object",
properties: {
email: { type: "string", description: "User email address" }
},
required: ["email"]
}
}
}
];
var functionMap = {
lookup_user: async function(args) {
// Simulate database lookup
return { name: "Jane Doe", email: args.email, plan: "pro", active: true };
}
};
var toolResult = await service.completeWithTools(
[{ role: "user", content: "Look up the user [email protected] and tell me their plan" }],
tools,
functionMap
);
console.log(toolResult.content);
})();
Common Issues and Troubleshooting
1. "You exceeded your current quota"
Error: 429 You exceeded your current quota, please check your plan and billing details.
This does not mean you hit a rate limit. It means your account has no credits or your spending limit has been reached. Go to platform.openai.com/account/billing and add credits or raise your spending limit. This is the number one issue I see from developers who think their code is broken.
2. "This model's maximum context length is 128000 tokens"
Error: 400 This model's maximum context length is 128000 tokens. However, your messages resulted in 131426 tokens. Please reduce the length of the messages.
You are sending too much context. This happens often in multi-turn conversations or when stuffing large documents into the prompt. Solutions: truncate older messages, summarize conversation history, or split the content into chunks. For RAG applications, retrieve only the top-K most relevant chunks, not the entire document.
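For the chunking route, here is a rough token-based splitter using the tiktoken setup from the cost section (the chunk size is arbitrary, and boundaries can split words mid-token, so add overlap if that matters for your use case):
var tiktoken = require("tiktoken");
var encoder = tiktoken.encoding_for_model("gpt-4o");

// Split text into chunks of at most maxTokens tokens each.
function chunkByTokens(text, maxTokens) {
  var tokens = encoder.encode(text);
  var chunks = [];
  for (var i = 0; i < tokens.length; i += maxTokens) {
    var slice = tokens.slice(i, i + maxTokens);
    // The WASM tiktoken build decodes to bytes, hence the TextDecoder
    chunks.push(new TextDecoder().decode(encoder.decode(slice)));
  }
  return chunks;
}

// e.g. chunkByTokens(longDocument, 4000) -> array of chunks to process separately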
3. "Invalid value for 'response_format'"
Error: 400 Invalid value for 'response_format': when using 'json_object', you must include the word 'JSON' in the prompt.
When you set response_format: { type: "json_object" }, you must mention "JSON" in either the system or user message. Add something like "Respond in JSON format" to your system prompt. This is a requirement from OpenAI to prevent the model from silently ignoring the format directive.
4. "Connection error" or Timeout on Long Completions
Error: APIConnectionError: Connection error.
The default timeout in the SDK is 10 minutes, but network intermediaries (load balancers, proxies, API gateways) may have shorter timeouts. If you are generating long responses, set explicit timeouts and consider using streaming — streamed connections stay alive as long as data is flowing. Also check if you are behind a corporate proxy that terminates idle connections.
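If a specific call needs a different ceiling, the v4 SDK accepts per-request options as a second argument to create(). A sketch; tune the values to whatever your gateway allows:
// Per-request override: request options apply to this call only.
var response = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Generate a detailed report..." }],
    max_tokens: 4000
  },
  { timeout: 120000, maxRetries: 2 } // 2-minute ceiling for this request
);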
5. "Rate limit reached for model" in Batch Processing
Error: 429 Rate limit reached for gpt-4o-mini in organization org-xxx on tokens per min (TPM): Limit 200000
When processing large datasets, implement a queue with rate limiting. Use OpenAI's Batch API for non-time-sensitive workloads — it offers a 50% discount and has separate, higher rate limits. Batch requests complete within 24 hours.
// Use the Batch API for bulk processing.
// Upload a JSONL file of prepared requests first (purpose must be "batch").
var fs = require("fs");
var uploadedFile = await client.files.create({
  file: fs.createReadStream("batch_requests.jsonl"),
  purpose: "batch"
});
var batch = await client.batches.create({
  input_file_id: uploadedFile.id,
  endpoint: "/v1/chat/completions",
  completion_window: "24h"
});
Best Practices
Default to GPT-4o-mini. It handles 80% of production use cases at a fraction of the cost. Only escalate to GPT-4o for tasks that demonstrably require stronger reasoning: complex multi-step logic, nuanced writing, or difficult code generation.
Always set max_tokens. Without it, the model generates until it naturally stops or hits the model's context limit. A runaway generation with GPT-4o can cost dollars per request. Set an explicit ceiling.
Use streaming for anything user-facing. The perceived latency difference between waiting 8 seconds for a response vs. seeing the first token in 300ms is enormous. Users will not tolerate staring at a spinner.
Log every API call. Record the model, input tokens, output tokens, finish reason, latency, and any errors. This data is invaluable for cost optimization, debugging, and identifying which features consume the most budget.
Implement circuit breakers. If OpenAI has an outage, your retry logic will burn through your queue and stack up errors. Detect consecutive failures and stop making requests until the service recovers. A simple counter with a cooldown period works.
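A bare-bones sketch of that counter-plus-cooldown idea (the thresholds are arbitrary):
// Minimal circuit breaker: stop calling OpenAI after repeated failures,
// then allow traffic again after a cooldown.
var consecutiveFailures = 0;
var circuitOpenUntil = 0;
var FAILURE_THRESHOLD = 5;
var COOLDOWN_MS = 60000;

async function guardedCompletion(params) {
  if (Date.now() < circuitOpenUntil) {
    throw new Error("Circuit open: skipping OpenAI call until cooldown expires");
  }
  try {
    var response = await client.chat.completions.create(params);
    consecutiveFailures = 0; // success resets the counter
    return response;
  } catch (err) {
    consecutiveFailures++;
    if (consecutiveFailures >= FAILURE_THRESHOLD) {
      circuitOpenUntil = Date.now() + COOLDOWN_MS;
    }
    throw err;
  }
}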
Pin model versions in production. Instead of gpt-4o, use gpt-4o-2024-08-06. When OpenAI updates the default alias, your application behavior can change overnight. Pinning gives you control over when to upgrade.
Use structured outputs for data extraction. If you are parsing the model's response with regex or string splitting, you are doing it wrong. Use response_format with a JSON schema. It is more reliable and eliminates an entire class of parsing bugs.
Cache aggressively. If you are sending the same prompt repeatedly (e.g., classifying a fixed set of categories), cache the results. OpenAI charges for every token, every time. Redis or even an in-memory LRU cache can save substantial money.
Set spending limits in the OpenAI dashboard. This is your safety net. Set a hard monthly limit that you are comfortable with. Even with perfect code, a misconfigured batch job or a traffic spike can cause unexpected spend.
Test with GPT-4o-mini, validate with GPT-4o. During development, use the cheap model for rapid iteration. Before shipping, run your evaluation suite with GPT-4o to see if the quality improvement justifies the 17x cost increase for your specific use case.