Context Window Management Strategies
Strategies for managing LLM context windows including sliding windows, summarization, priority selection, and token budgeting in Node.js.
Overview
Every LLM has a hard limit on how much text it can process in a single request, and if you are building anything beyond a simple one-shot prompt, you will hit that limit faster than you think. Context window management is the set of strategies you use to decide what goes into that window and what gets dropped, summarized, or retrieved on demand. Getting this right is the difference between a chatbot that forgets what you said three messages ago and one that maintains coherent, useful conversations across hundreds of turns.
Prerequisites
- Node.js 18+ installed
- Working knowledge of LLM API calls (OpenAI, Anthropic, or similar)
- Basic understanding of tokens and how LLMs process text
- An OpenAI API key (for the working examples)
- npm packages:
openai, tiktoken, gpt-tokenizer
Install the dependencies:
npm install openai tiktoken gpt-tokenizer
Understanding Context Windows
A context window is the total number of tokens an LLM can process in a single request. This includes everything: the system prompt, all conversation messages, any injected context, and the model's response. Think of it as a fixed-size buffer. Once it is full, something has to go.
Here is what you are working with across major providers as of early 2026:
| Model | Context Window | Approximate Words |
|---|---|---|
| GPT-4o | 128K tokens | ~96,000 words |
| GPT-4o-mini | 128K tokens | ~96,000 words |
| Claude 3.5 Sonnet | 200K tokens | ~150,000 words |
| Claude Opus 4 | 200K tokens | ~150,000 words |
| Gemini 1.5 Pro | 2M tokens | ~1,500,000 words |
| Llama 3.1 405B | 128K tokens | ~96,000 words |
| Mistral Large | 128K tokens | ~96,000 words |
Those numbers look generous until you realize a typical enterprise chatbot might have a 2,000-token system prompt, inject 10,000 tokens of retrieved documents per turn, and need to reserve 4,000 tokens for the response. Suddenly your 128K window is more like 112K of usable conversation history, and that fills up in a long session.
The other thing people miss: larger context windows do not mean free performance. Models tend to degrade on recall tasks when the context gets very long. Research on the "lost in the middle" problem shows that information buried in the middle of a long context is recalled less reliably than information at the beginning or end. So even if you can fit 200K tokens, you probably should not just dump everything in there.
Measuring Conversation Size with Token Counting
Before you can manage context, you need to measure it. Tokens are not characters and they are not words. A token is roughly 3-4 characters in English, but it varies by language and content type. Code tends to use more tokens per semantic unit than prose.
var { encoding_for_model } = require("tiktoken");
function countTokens(text, model) {
var enc = encoding_for_model(model || "gpt-4o");
var tokens = enc.encode(text);
var count = tokens.length;
enc.free();
return count;
}
function countMessageTokens(messages, model) {
var total = 0;
var enc = encoding_for_model(model || "gpt-4o");
messages.forEach(function(message) {
// Every message has overhead: role tokens + content tokens + separators
total += 4; // <|im_start|>role\n ... <|im_end|>\n
total += enc.encode(message.role).length;
total += enc.encode(message.content).length;
});
total += 2; // priming tokens for assistant response
enc.free();
return total;
}
// Usage
var messages = [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "How do I read a file in Node.js?" },
{ role: "assistant", content: "You can use fs.readFileSync or fs.readFile..." }
];
var tokenCount = countMessageTokens(messages, "gpt-4o");
console.log("Total tokens:", tokenCount);
// Total tokens: 38 (approximate; exact counts vary by model tokenizer and per-message overhead)
If you are working with Anthropic's API, the token counting is different. Anthropic provides token usage in the API response, but for pre-flight estimation you can still use tiktoken with the cl100k_base encoding as a rough approximation. For production systems, I recommend tracking actual usage from API responses and using that to calibrate your estimates.
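To make that calibration concrete, here is a small sketch (my own illustration, not a provider API): track the ratio between the API-reported prompt tokens and your local estimate, and scale future estimates by the observed average.

```javascript
// Sketch: calibrate local token estimates against actual API-reported usage.
// The class and its rolling-window design are illustrative assumptions;
// plug in whatever local counter you use (tiktoken, gpt-tokenizer, chars/4).
function TokenCalibrator() {
  this.samples = [];
}

// Record one request: local estimate vs. what the API actually billed.
TokenCalibrator.prototype.record = function(localEstimate, actualPromptTokens) {
  if (localEstimate > 0) {
    this.samples.push(actualPromptTokens / localEstimate);
    if (this.samples.length > 100) this.samples.shift(); // rolling window
  }
};

// Scale a raw estimate by the average observed ratio (identity until data arrives).
TokenCalibrator.prototype.calibrate = function(localEstimate) {
  if (this.samples.length === 0) return localEstimate;
  var sum = this.samples.reduce(function(a, b) { return a + b; }, 0);
  return Math.round(localEstimate * (sum / this.samples.length));
};
```

After a few dozen requests the calibrated estimate tracks the server's count closely enough that a modest safety buffer covers the residual error.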
Sliding Window Strategies
The simplest context management strategy is the sliding window: keep the N most recent messages and drop everything else. It is crude but effective for many use cases.
function slidingWindow(messages, maxTokens, model) {
var systemMessages = messages.filter(function(m) { return m.role === "system"; });
var nonSystemMessages = messages.filter(function(m) { return m.role !== "system"; });
var systemTokens = countMessageTokens(systemMessages, model);
var availableTokens = maxTokens - systemTokens;
var window = [];
var currentTokens = 0;
// Walk backwards from the most recent message
for (var i = nonSystemMessages.length - 1; i >= 0; i--) {
var msgTokens = countMessageTokens([nonSystemMessages[i]], model);
if (currentTokens + msgTokens > availableTokens) {
break;
}
window.unshift(nonSystemMessages[i]);
currentTokens += msgTokens;
}
return systemMessages.concat(window);
}
The sliding window has an obvious weakness: once a message falls off the window, it is gone completely. The model has no idea the user mentioned their name, their project requirements, or any other critical context from earlier turns. That is where more sophisticated strategies come in.
Summarization-Based Compression
Instead of dropping old messages entirely, you can summarize them. When the conversation gets too long, you compress the oldest messages into a summary and keep that summary as a synthetic message at the top of the conversation.
var OpenAI = require("openai");
var client = new OpenAI();
function summarizeMessages(messages) {
var conversationText = messages.map(function(m) {
return m.role + ": " + m.content;
}).join("\n");
return client.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content: "Summarize this conversation concisely. Preserve key facts, decisions, user preferences, and any specific requirements mentioned. Be thorough but brief."
},
{
role: "user",
content: conversationText
}
],
max_tokens: 500
}).then(function(response) {
return response.choices[0].message.content;
});
}
function compressConversation(messages, maxTokens, model) {
var systemMessages = messages.filter(function(m) { return m.role === "system"; });
var nonSystemMessages = messages.filter(function(m) { return m.role !== "system"; });
var currentTokens = countMessageTokens(messages, model);
if (currentTokens <= maxTokens) {
return Promise.resolve(messages);
}
// Split: older half gets summarized, newer half stays intact
var splitPoint = Math.floor(nonSystemMessages.length / 2);
var olderMessages = nonSystemMessages.slice(0, splitPoint);
var newerMessages = nonSystemMessages.slice(splitPoint);
return summarizeMessages(olderMessages).then(function(summary) {
var summaryMessage = {
role: "system",
content: "Summary of earlier conversation:\n" + summary
};
return systemMessages.concat([summaryMessage]).concat(newerMessages);
});
}
The trade-off here is clear: you are spending an extra API call (and tokens) to create the summary, and summaries inherently lose detail. I use gpt-4o-mini for summarization because it is cheap, fast, and good enough for this task. You do not need a frontier model to compress text.
Hierarchical Memory
The most robust approach combines multiple layers of memory, similar to how a CPU has L1/L2/L3 cache. Each layer stores information at a different level of detail and cost.
function HierarchicalMemory(options) {
this.recentMessages = []; // L1: Full messages, last N turns
this.summaryBuffer = ""; // L2: Rolling summary of older messages
this.keyFacts = []; // L3: Extracted key facts (user name, preferences, etc.)
this.maxRecentMessages = options.maxRecentMessages || 20;
this.maxRecentTokens = options.maxRecentTokens || 8000;
this.model = options.model || "gpt-4o";
}
HierarchicalMemory.prototype.addMessage = function(message) {
this.recentMessages.push(message);
// Check if we need to compress
var recentTokens = countMessageTokens(this.recentMessages, this.model);
if (this.recentMessages.length > this.maxRecentMessages || recentTokens > this.maxRecentTokens) {
return this._compress();
}
return Promise.resolve();
};
HierarchicalMemory.prototype._compress = function() {
var self = this;
// Move the oldest half of recent messages into the summary
var splitPoint = Math.floor(this.recentMessages.length / 2);
var toSummarize = this.recentMessages.slice(0, splitPoint);
this.recentMessages = this.recentMessages.slice(splitPoint);
return this._extractKeyFacts(toSummarize).then(function(facts) {
facts.forEach(function(fact) {
if (self.keyFacts.indexOf(fact) === -1) {
self.keyFacts.push(fact);
}
});
return self._updateSummary(toSummarize);
});
};
HierarchicalMemory.prototype._extractKeyFacts = function(messages) {
var text = messages.map(function(m) { return m.role + ": " + m.content; }).join("\n");
return client.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content: "Extract key facts from this conversation as a JSON array of strings. Include: user name, preferences, project details, technical requirements, decisions made. Only include concrete facts, not general discussion."
},
{ role: "user", content: text }
],
max_tokens: 300,
response_format: { type: "json_object" }
}).then(function(response) {
var parsed = JSON.parse(response.choices[0].message.content);
return parsed.facts || [];
});
};
HierarchicalMemory.prototype._updateSummary = function(messages) {
var text = messages.map(function(m) { return m.role + ": " + m.content; }).join("\n");
var self = this;
var prompt = this.summaryBuffer
? "Update this existing summary with new conversation content.\n\nExisting summary:\n" + this.summaryBuffer + "\n\nNew conversation:\n" + text
: "Summarize this conversation:\n" + text;
return client.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content: "Create a concise rolling summary. Preserve chronological order and key details."
},
{ role: "user", content: prompt }
],
max_tokens: 500
}).then(function(response) {
self.summaryBuffer = response.choices[0].message.content;
});
};
HierarchicalMemory.prototype.buildContext = function(systemPrompt) {
var messages = [{ role: "system", content: systemPrompt }];
if (this.keyFacts.length > 0) {
messages.push({
role: "system",
content: "Key facts about this conversation:\n- " + this.keyFacts.join("\n- ")
});
}
if (this.summaryBuffer) {
messages.push({
role: "system",
content: "Earlier conversation summary:\n" + this.summaryBuffer
});
}
return messages.concat(this.recentMessages);
};
This three-tier approach gives you the best of all worlds: full fidelity for recent messages, compressed summaries for older context, and persistent key facts that never get lost. I have shipped this pattern in production chatbots and it works remarkably well for conversations that span hundreds of turns.
Priority-Based Message Selection
Not all messages are equally important. A user correcting the model's misunderstanding is more important than a generic greeting. Priority-based selection lets you assign weights and keep the most important messages regardless of recency.
function prioritizeMessages(messages, maxTokens, model) {
var scored = messages.map(function(msg, index) {
var score = 0;
// System messages are always highest priority
if (msg.role === "system") score += 1000;
// Recency bonus: more recent = higher score
score += index;
// Length penalty: very short messages are less important
if (msg.content.length < 20) score -= 5;
// Content signals that indicate importance
if (msg.content.match(/\b(requirement|must|always|never|important|critical)\b/i)) {
score += 20;
}
// User corrections and clarifications
if (msg.content.match(/\b(actually|correction|I meant|no,|wrong)\b/i)) {
score += 15;
}
// Code blocks are usually important context
if (msg.content.indexOf("```") !== -1) {
score += 10;
}
// Custom priority field if provided
if (msg.priority) score += msg.priority;
return { message: msg, score: score, index: index };
});
// Sort by score descending
scored.sort(function(a, b) { return b.score - a.score; });
// Select messages until we hit the token limit
var selected = [];
var currentTokens = 0;
scored.forEach(function(item) {
var msgTokens = countMessageTokens([item.message], model);
if (currentTokens + msgTokens <= maxTokens) {
selected.push(item);
currentTokens += msgTokens;
}
});
// Re-sort by original index to maintain conversation order
selected.sort(function(a, b) { return a.index - b.index; });
return selected.map(function(item) { return item.message; });
}
The priority scoring is where your domain expertise matters. The heuristics above are a starting point. In production, I have seen teams add scoring for messages that contain entity names relevant to the current query, messages where the user provided explicit constraints, and messages that represent decisions or agreements.
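As one illustration of that kind of extension, here is a hedged sketch of query-relevance scoring. The capitalized-word entity grab is deliberately naive (a stand-in for a real NER or embedding step), and the 5-point weight is arbitrary:

```javascript
// Naive entity extraction: unique capitalized words from the text.
// This is an illustrative stand-in, not a real NER pipeline.
function extractEntities(text) {
  var matches = text.match(/\b[A-Z][a-zA-Z0-9]+\b/g) || [];
  return matches.filter(function(w, i, arr) { return arr.indexOf(w) === i; });
}

// Boost a message's score for each entity it shares with the current query.
function entityRelevanceScore(message, query) {
  var entities = extractEntities(query);
  var score = 0;
  entities.forEach(function(entity) {
    if (message.content.indexOf(entity) !== -1) score += 5; // arbitrary weight
  });
  return score;
}
```

You would add this score into the `prioritizeMessages` heuristics above, passing the current user query alongside the history.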
Chunking Long Documents
When you need to feed a large document into context, you cannot just shove the whole thing in. You need to chunk it intelligently and select the most relevant chunks.
function chunkDocument(text, maxChunkTokens, overlap) {
var enc = encoding_for_model("gpt-4o");
var tokens = enc.encode(text);
var chunks = [];
var overlapTokens = overlap || 100;
var start = 0;
while (start < tokens.length) {
var end = Math.min(start + maxChunkTokens, tokens.length);
var chunkTokens = tokens.slice(start, end);
var chunkBytes = enc.decode(chunkTokens); // tiktoken decodes to a byte array, not a string
chunks.push({
text: new TextDecoder().decode(chunkBytes),
startToken: start,
endToken: end,
tokenCount: end - start
});
if (end >= tokens.length) break; // stop at the end, or the overlap step re-chunks the tail forever
start = end - overlapTokens;
}
enc.free();
return chunks;
}
// Chunk by semantic boundaries (paragraphs) instead of raw tokens
function chunkBySections(text, maxChunkTokens) {
var paragraphs = text.split(/\n\n+/);
var chunks = [];
var currentChunk = "";
var currentTokens = 0;
paragraphs.forEach(function(paragraph) {
var paraTokens = countTokens(paragraph, "gpt-4o");
if (currentTokens + paraTokens > maxChunkTokens && currentChunk.length > 0) {
chunks.push(currentChunk.trim());
currentChunk = "";
currentTokens = 0;
}
currentChunk += paragraph + "\n\n";
currentTokens += paraTokens;
});
if (currentChunk.trim().length > 0) {
chunks.push(currentChunk.trim());
}
return chunks;
}
Semantic chunking (splitting by paragraphs, sections, or headings) consistently outperforms naive token-based splitting. When a chunk boundary falls in the middle of a sentence or code block, you lose coherence and the model struggles to use that information effectively.
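For markdown sources specifically, one further variant along the same semantic-boundary lines (my addition) is to split on headings, so every chunk is a complete section:

```javascript
// Split a markdown document into chunks at heading boundaries, so a chunk
// never starts mid-section. A sketch; combine with a token cap as needed.
function chunkByHeadings(markdown) {
  var lines = markdown.split("\n");
  var chunks = [];
  var current = [];
  lines.forEach(function(line) {
    // A new heading closes the current chunk (unless we're at the start)
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      chunks.push(current.join("\n").trim());
      current = [];
    }
    current.push(line);
  });
  if (current.length > 0) chunks.push(current.join("\n").trim());
  return chunks.filter(function(c) { return c.length > 0; });
}
```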
Context Window Budgeting
For production systems, I recommend explicit budget allocation. You decide upfront how many tokens go to each component, and the system enforces those limits.
function ContextBudget(maxTokens) {
this.maxTokens = maxTokens;
this.allocations = {
systemPrompt: { max: 2000, required: true },
keyFacts: { max: 500, required: true },
retrievedDocs: { max: 4000, required: false },
conversationSummary: { max: 1000, required: false },
recentMessages: { max: 0, required: true }, // gets remainder
responseReserve: { max: 4000, required: true }
};
}
ContextBudget.prototype.calculate = function() {
var self = this;
var reserved = 0;
Object.keys(this.allocations).forEach(function(key) {
var alloc = self.allocations[key];
if (key !== "recentMessages") {
reserved += alloc.max;
}
});
// Recent messages get whatever is left
this.allocations.recentMessages.max = this.maxTokens - reserved;
return JSON.parse(JSON.stringify(this.allocations));
};
// Usage
var budget = new ContextBudget(128000);
var plan = budget.calculate();
console.log("Token budget plan:");
console.log(JSON.stringify(plan, null, 2));
// {
// "systemPrompt": { "max": 2000, "required": true },
// "keyFacts": { "max": 500, "required": true },
// "retrievedDocs": { "max": 4000, "required": false },
// "conversationSummary": { "max": 1000, "required": false },
// "recentMessages": { "max": 116500, "required": true },
// "responseReserve": { "max": 4000, "required": true }
// }
The response reserve is critical and frequently overlooked. If you fill the context window completely, the model has no room to generate a response. I have seen production systems fail silently because the context was 127,900 tokens in a 128K window, leaving room for maybe a sentence of output before hitting the limit. Always reserve at least 2,000-4,000 tokens for the response, more if you expect long answers.
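A cheap guard makes that failure loud instead of silent. The function below is a minimal sketch; the error wording and return value are my own choices:

```javascript
// Fail fast if the prompt leaves less room than the response reserve.
// Returns the tokens actually available for the response.
function assertRoomForResponse(promptTokens, maxContextTokens, responseReserve) {
  var available = maxContextTokens - promptTokens;
  if (available < responseReserve) {
    throw new Error(
      "Prompt uses " + promptTokens + " of " + maxContextTokens +
      " tokens; only " + available + " left, but " + responseReserve +
      " are reserved for the response. Trim the context before sending."
    );
  }
  return available;
}
```

Call this right before every completion request, after your final token count.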
Retrieval-Augmented Context
Instead of keeping everything in the conversation history, you can store information externally and pull in only what is relevant to the current query. This is the core idea behind RAG (Retrieval-Augmented Generation), but you can apply it to conversation context too.
var conversationStore = [];
function storeMessage(message, embedding) {
conversationStore.push({
message: message,
embedding: embedding,
timestamp: Date.now()
});
}
function cosineSimilarity(a, b) {
var dotProduct = 0;
var normA = 0;
var normB = 0;
for (var i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
function retrieveRelevantContext(queryEmbedding, maxTokens, topK) {
var scored = conversationStore.map(function(item) {
return {
message: item.message,
similarity: cosineSimilarity(queryEmbedding, item.embedding),
timestamp: item.timestamp
};
});
scored.sort(function(a, b) { return b.similarity - a.similarity; });
var selected = [];
var currentTokens = 0;
var limit = topK || 10;
for (var i = 0; i < Math.min(scored.length, limit); i++) {
var msgTokens = countTokens(scored[i].message.content, "gpt-4o");
if (currentTokens + msgTokens > maxTokens) break;
selected.push(scored[i].message);
currentTokens += msgTokens;
}
return selected;
}
This approach shines when your conversations cover many different topics. Instead of keeping a linear history, you pull in only the messages that are semantically relevant to the current question. The downside is the latency and cost of generating embeddings for every message. For most applications, I combine retrieval with a sliding window: the most recent N messages are always included, and older messages are retrieved by relevance.
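Here is a sketch of that hybrid, under some stated assumptions: each message carries a precomputed embedding, and estimateTokens is a crude chars/4 stand-in for a real tokenizer. The most recent recentCount messages are always kept; older ones fill the remaining budget by similarity.

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a, b) {
  var dp = 0, na = 0, nb = 0;
  for (var i = 0; i < a.length; i++) {
    dp += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dp / (Math.sqrt(na) * Math.sqrt(nb));
}

// Crude stand-in for a real tokenizer (roughly 4 chars per token in English).
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hybrid selection: recent messages always included, older ones retrieved
// by similarity until the token budget is spent, then chronological order.
function hybridSelect(messages, queryEmbedding, maxTokens, recentCount) {
  var recent = messages.slice(-recentCount);
  var older = messages.slice(0, Math.max(0, messages.length - recentCount));
  var used = recent.reduce(function(sum, m) { return sum + estimateTokens(m.content); }, 0);
  var ranked = older.slice().sort(function(a, b) {
    return cosine(queryEmbedding, b.embedding) - cosine(queryEmbedding, a.embedding);
  });
  var retrieved = [];
  ranked.forEach(function(m) {
    var t = estimateTokens(m.content);
    if (used + t <= maxTokens) { retrieved.push(m); used += t; }
  });
  var selected = retrieved.concat(recent);
  selected.sort(function(a, b) { return messages.indexOf(a) - messages.indexOf(b); });
  return selected;
}
```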
Prompt Caching
Several providers now offer prompt caching, where repeated prefixes in your messages are cached and charged at a reduced rate. This matters for context management because your system prompt and frequently-used context can be structured as a stable prefix.
// Structure messages to maximize cache hits
function buildCachedContext(systemPrompt, staticContext, dynamicMessages) {
// The stable prefix (system prompt + static context) gets cached
var messages = [
{
role: "system",
content: systemPrompt + "\n\n---\n\nReference Documentation:\n" + staticContext
}
];
// Dynamic messages change per request, so they won't be cached
// But the prefix above will be cached across requests
return messages.concat(dynamicMessages);
}
// Anthropic-style cache control
function buildAnthropicCachedContext(systemPrompt, referenceDoc, conversation) {
return {
system: [
{
type: "text",
text: systemPrompt,
cache_control: { type: "ephemeral" }
},
{
type: "text",
text: "Reference documentation:\n" + referenceDoc,
cache_control: { type: "ephemeral" }
}
],
messages: conversation
};
}
With OpenAI, prompt caching happens automatically for repeated prefixes longer than 1,024 tokens. With Anthropic, you use explicit cache_control markers. The key principle is the same: put stable content at the front and changing content at the end. A well-structured cache can reduce your costs by 50-90% on repeated interactions.
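To verify caching is actually working, check the usage object the API returns. With OpenAI's chat completions the cached prefix shows up under prompt_tokens_details.cached_tokens at the time of writing; treat the exact field layout as something to confirm against your SDK version.

```javascript
// Fraction of prompt tokens served from cache, from an OpenAI-style usage
// object. Returns 0 when the usage object or the details field is absent.
function cacheHitRatio(usage) {
  var details = usage && usage.prompt_tokens_details;
  var cached = (details && details.cached_tokens) || 0;
  if (!usage || !usage.prompt_tokens) return 0;
  return cached / usage.prompt_tokens;
}
```

Logging this ratio per request tells you quickly whether a prompt restructure broke your stable prefix.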
Comparing Context Strategies Across Providers
Different providers have different strengths, and your context strategy should account for them.
| Strategy | OpenAI | Anthropic | Google (Gemini) |
|---|---|---|---|
| Max context | 128K | 200K | 2M |
| Prompt caching | Automatic (1024+ prefix) | Explicit cache_control | Context caching API |
| Token counting | tiktoken (exact) | Approximate (use API response) | countTokens API |
| Streaming | Yes | Yes | Yes |
| Long-context recall | Good | Very good | Good |
| Cost at 100K tokens | ~$0.25 input | ~$0.30 input | ~$0.125 input |
Google's Gemini has the largest context window by far, but context window size alone is not the deciding factor. I have found that Claude handles long-context recall tasks more reliably than GPT-4o at equivalent context lengths, while Gemini's 2M window is useful for whole-codebase analysis where you genuinely need to process everything at once.
Managing Context in Multi-Turn Chat Applications
In a real chat application, you need to manage context across multiple turns, handle concurrent users, and deal with edge cases like very long individual messages.
function ChatSession(options) {
this.sessionId = options.sessionId;
this.model = options.model || "gpt-4o";
this.maxContextTokens = options.maxContextTokens || 120000;
this.responseReserve = options.responseReserve || 4000;
this.systemPrompt = options.systemPrompt || "You are a helpful assistant.";
this.messages = [];
this.memory = new HierarchicalMemory({
maxRecentMessages: 30,
maxRecentTokens: 10000,
model: this.model
});
}
ChatSession.prototype.sendMessage = function(userMessage) {
var self = this;
// Guard against single messages that are too large
var msgTokens = countTokens(userMessage, this.model);
var maxSingleMessage = Math.floor(this.maxContextTokens * 0.5);
if (msgTokens > maxSingleMessage) {
return Promise.reject(
new Error("Message too large: " + msgTokens + " tokens exceeds limit of " + maxSingleMessage)
);
}
return this.memory.addMessage({ role: "user", content: userMessage })
.then(function() {
var context = self.memory.buildContext(self.systemPrompt);
var contextTokens = countMessageTokens(context, self.model);
console.log("Context size: " + contextTokens + " tokens (" +
Math.round(contextTokens / self.maxContextTokens * 100) + "% of limit)");
return client.chat.completions.create({
model: self.model,
messages: context,
max_tokens: self.responseReserve
});
})
.then(function(response) {
var assistantMessage = response.choices[0].message.content;
return self.memory.addMessage({
role: "assistant",
content: assistantMessage
}).then(function() {
return {
content: assistantMessage,
usage: {
promptTokens: response.usage.prompt_tokens,
completionTokens: response.usage.completion_tokens,
totalTokens: response.usage.total_tokens
}
};
});
});
};
Complete Working Example
Here is a full conversation manager that combines sliding window, summarization, and priority-based message selection into a single cohesive system. This is production-ready code that I have used as the foundation for several deployed chatbots.
var OpenAI = require("openai");
var { encoding_for_model } = require("tiktoken");
var openai = new OpenAI();
// ============================================
// Token Counting Utilities
// ============================================
function countTokens(text, model) {
var enc = encoding_for_model(model || "gpt-4o");
var count = enc.encode(text).length;
enc.free();
return count;
}
function countMessageTokens(messages, model) {
var total = 0;
var enc = encoding_for_model(model || "gpt-4o");
messages.forEach(function(msg) {
total += 4;
total += enc.encode(msg.role).length;
total += enc.encode(msg.content).length;
});
total += 2;
enc.free();
return total;
}
// ============================================
// Conversation Manager
// ============================================
function ConversationManager(options) {
this.model = options.model || "gpt-4o";
this.maxContextTokens = options.maxContextTokens || 120000;
this.responseReserve = options.responseReserve || 4000;
this.summaryThreshold = options.summaryThreshold || 40;
this.systemPrompt = options.systemPrompt || "You are a helpful assistant.";
this.messages = [];
this.summary = "";
this.keyFacts = [];
this.turnCount = 0;
this.totalTokensUsed = 0;
}
ConversationManager.prototype.addMessage = function(role, content, options) {
var msg = {
role: role,
content: content,
turnNumber: this.turnCount,
priority: (options && options.priority) || 0,
timestamp: Date.now()
};
this.messages.push(msg);
if (role === "user") {
this.turnCount++;
}
};
ConversationManager.prototype.getUsableBudget = function() {
var systemTokens = countTokens(this.systemPrompt, this.model) + 10;
var summaryTokens = this.summary ? countTokens(this.summary, this.model) + 10 : 0;
var factsTokens = this.keyFacts.length > 0
? countTokens("Key facts:\n- " + this.keyFacts.join("\n- "), this.model) + 10
: 0;
return this.maxContextTokens - this.responseReserve - systemTokens - summaryTokens - factsTokens;
};
ConversationManager.prototype.selectMessages = function() {
var budget = this.getUsableBudget();
var selected = [];
var usedTokens = 0;
// Phase 1: Add recent messages (most important)
for (var i = this.messages.length - 1; i >= 0; i--) {
var msg = this.messages[i];
var tokens = countTokens(msg.content, this.model) + 6;
if (usedTokens + tokens > budget * 0.8) {
break; // Reserve 20% of budget for priority messages
}
selected.unshift({ message: msg, index: i });
usedTokens += tokens;
}
// Phase 2: Scan older messages for high-priority ones
var oldestSelected = selected.length > 0 ? selected[0].index : this.messages.length;
for (var j = 0; j < oldestSelected; j++) {
var oldMsg = this.messages[j];
if (oldMsg.priority >= 10 || this._isImportant(oldMsg)) {
var oldTokens = countTokens(oldMsg.content, this.model) + 6;
if (usedTokens + oldTokens <= budget) {
selected.push({ message: oldMsg, index: j });
usedTokens += oldTokens;
}
}
}
// Re-sort by original index
selected.sort(function(a, b) { return a.index - b.index; });
return selected.map(function(item) {
return { role: item.message.role, content: item.message.content };
});
};
ConversationManager.prototype._isImportant = function(msg) {
var importantPatterns = [
/\b(my name is|i am|i'm)\b/i,
/\b(requirement|must|always|never|constraint)\b/i,
/\b(actually|correction|no,|wrong|I meant)\b/i,
/\b(remember|don't forget|keep in mind)\b/i
];
return importantPatterns.some(function(pattern) {
return pattern.test(msg.content);
});
};
ConversationManager.prototype.buildPrompt = function() {
var messages = [{ role: "system", content: this.systemPrompt }];
if (this.keyFacts.length > 0) {
messages.push({
role: "system",
content: "Key facts from this conversation:\n- " + this.keyFacts.join("\n- ")
});
}
if (this.summary) {
messages.push({
role: "system",
content: "Summary of earlier conversation:\n" + this.summary
});
}
var selected = this.selectMessages();
messages = messages.concat(selected);
return messages;
};
ConversationManager.prototype.compress = function() {
var self = this;
if (this.messages.length < this.summaryThreshold) {
return Promise.resolve();
}
var splitPoint = Math.floor(this.messages.length * 0.4);
var toCompress = this.messages.slice(0, splitPoint);
this.messages = this.messages.slice(splitPoint);
var conversationText = toCompress.map(function(m) {
return m.role.toUpperCase() + ": " + m.content;
}).join("\n\n");
// Extract facts and summarize in parallel
var factsPromise = openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content: "Extract key facts as a JSON object with a \"facts\" array of strings. Include: names, preferences, technical requirements, constraints, decisions. Only concrete facts."
},
{
role: "user",
content: conversationText
}
],
max_tokens: 400,
response_format: { type: "json_object" }
});
var existingSummary = self.summary
? "Existing summary to update:\n" + self.summary + "\n\nNew conversation to incorporate:\n"
: "Conversation to summarize:\n";
var summaryPromise = openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content: "Create a concise summary preserving chronological order, key technical details, and decisions made."
},
{
role: "user",
content: existingSummary + conversationText
}
],
max_tokens: 600
});
return Promise.all([factsPromise, summaryPromise]).then(function(results) {
// Merge new facts
try {
var parsed = JSON.parse(results[0].choices[0].message.content);
var newFacts = parsed.facts || [];
newFacts.forEach(function(fact) {
if (self.keyFacts.indexOf(fact) === -1) {
self.keyFacts.push(fact);
}
});
} catch (e) {
console.error("Failed to parse facts:", e.message);
}
// Update summary
self.summary = results[1].choices[0].message.content;
console.log("Compressed conversation: " + toCompress.length + " messages summarized");
console.log("Key facts: " + self.keyFacts.length);
console.log("Remaining messages: " + self.messages.length);
});
};
ConversationManager.prototype.chat = function(userInput) {
var self = this;
this.addMessage("user", userInput);
return this.compress().then(function() {
var prompt = self.buildPrompt();
var promptTokens = countMessageTokens(prompt, self.model);
console.log("[Turn " + self.turnCount + "] Prompt: " + promptTokens + " tokens | " +
"Messages: " + self.messages.length + " | " +
"Facts: " + self.keyFacts.length + " | " +
"Has summary: " + (self.summary ? "yes" : "no"));
return openai.chat.completions.create({
model: self.model,
messages: prompt,
max_tokens: self.responseReserve
});
}).then(function(response) {
var reply = response.choices[0].message.content;
self.addMessage("assistant", reply);
self.totalTokensUsed += response.usage.total_tokens;
return {
reply: reply,
stats: {
turn: self.turnCount,
promptTokens: response.usage.prompt_tokens,
completionTokens: response.usage.completion_tokens,
totalSessionTokens: self.totalTokensUsed,
messagesInMemory: self.messages.length,
keyFacts: self.keyFacts.length,
hasSummary: !!self.summary
}
};
});
};
// ============================================
// Usage Example
// ============================================
function main() {
var manager = new ConversationManager({
model: "gpt-4o",
maxContextTokens: 120000,
responseReserve: 4000,
summaryThreshold: 30,
systemPrompt: "You are a senior Node.js architect. Help the user design and build production systems. Be concise and practical."
});
// Simulate a multi-turn conversation
var questions = [
"My name is Alex and I'm building a real-time analytics dashboard. We're using Node.js with PostgreSQL.",
"What's the best way to handle WebSocket connections at scale?",
"We expect about 50,000 concurrent connections. Is that feasible with Node.js?",
"Should we use Redis for pub/sub between our Node.js instances?",
"Show me a basic implementation of the WebSocket server with Redis pub/sub.",
"Actually, we also need to support HTTP long-polling as a fallback. How does that change things?"
];
var chain = Promise.resolve();
questions.forEach(function(question) {
chain = chain.then(function() {
return manager.chat(question);
}).then(function(result) {
console.log("\nUser: " + questions[0]);
console.log("Assistant: " + result.reply.substring(0, 200) + "...");
console.log("Stats:", JSON.stringify(result.stats));
console.log("---");
});
});
return chain;
}
main().catch(function(err) {
console.error("Error:", err);
});
This example demonstrates all three strategies working together: the sliding window selects recent messages, the summarization system compresses older turns, and priority-based selection ensures important messages (like the user's name and project requirements) are never lost even when they fall outside the recent window.
Common Issues & Troubleshooting
1. Token Count Mismatch Between Client and Server
Error: This model's maximum context length is 128000 tokens.
However, your messages resulted in 128,247 tokens.
Please reduce the length of the messages.
This happens when your client-side token counting is slightly off from the server's count. OpenAI uses different tokenizers for different models, and the overhead per message (role tokens, separators) is not always documented precisely. Fix: Always leave a buffer of at least 500 tokens below the stated limit. Count with the correct model-specific tokenizer, not a generic one.
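One way to apply that advice mechanically (a sketch; the 500-token default and the drop-oldest-first policy are my assumptions) is to trim until your own counter fits under the stated limit minus the margin:

```javascript
// Trim oldest non-system messages until the count fits under the stated
// limit minus a safety margin. countFn is any token counter you trust;
// messages[0] is assumed to be the system prompt and is always kept.
function trimToFit(messages, statedLimit, countFn, safetyMargin) {
  var limit = statedLimit - (safetyMargin || 500);
  var trimmed = messages.slice();
  while (trimmed.length > 1 && countFn(trimmed) > limit) {
    trimmed.splice(1, 1); // drop the oldest non-system message first
  }
  return trimmed;
}
```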
2. Summarization Losing Critical Information
Symptoms: The model "forgets" user requirements or constraints that were mentioned early in the conversation and then summarized.
This is a fundamental limitation of lossy compression. Summaries cannot preserve everything. Fix: Use the hierarchical memory approach with explicit key fact extraction. Facts like user names, project constraints, and explicit requirements should be extracted into the key facts tier where they persist indefinitely, separate from the summary.
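As a rough illustration of the key-fact tier, here is a sketch that pulls a couple of fact patterns out with regexes and stores them separately from the summarizable history. The patterns are illustrative assumptions; a production system would typically use a cheap LLM call for extraction instead:

```javascript
// Sketch: extract durable facts from user messages so they survive summarization.
// The regex patterns are illustrative assumptions, not a real extraction strategy.
function extractKeyFacts(messageText) {
  var facts = [];
  var nameMatch = messageText.match(/\bmy name is (\w+)/i);
  if (nameMatch) facts.push("User's name: " + nameMatch[1]);
  var reqMatch = messageText.match(/\bwe (?:expect|need|require) ([^.]+)/i);
  if (reqMatch) facts.push("Requirement: " + reqMatch[1].trim());
  return facts;
}

function Memory() {
  this.keyFacts = []; // persists indefinitely, never summarized
  this.summary = "";  // lossy, rebuilt on each compression cycle
}
Memory.prototype.addUserMessage = function (text) {
  var facts = extractKeyFacts(text);
  for (var i = 0; i < facts.length; i++) {
    if (this.keyFacts.indexOf(facts[i]) === -1) this.keyFacts.push(facts[i]);
  }
};
```

The important design point is the separation: the summarizer is free to be lossy because the key facts tier is append-only and never passes through compression.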
3. tiktoken Memory Leak
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
If you call encoding_for_model() in a loop without calling enc.free(), you will leak WASM memory. Each encoder instance allocates memory that the JavaScript garbage collector cannot reclaim. Fix: Always call enc.free() when you are done with an encoder, or create a single encoder instance and reuse it across calls.
// BAD: leaks memory
function countTokensBad(text) {
var enc = encoding_for_model("gpt-4o");
return enc.encode(text).length;
// enc.free() never called!
}
// GOOD: properly frees memory
function countTokensGood(text) {
var enc = encoding_for_model("gpt-4o");
var count = enc.encode(text).length;
enc.free();
return count;
}
4. Concurrent Session Context Corruption
When managing multiple chat sessions in the same process, shared mutable state can corrupt context across sessions.
User A sees: "As we discussed earlier, your React project..."
(But User A never mentioned React - that was User B's conversation)
Fix: Each session must have its own ConversationManager instance. Never share message arrays or memory objects between sessions. Use the session ID as a key in a Map or store session state externally in Redis or a database.
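One way to enforce that isolation is a registry keyed by session ID. createManager below is a hypothetical factory standing in for however you construct your per-session ConversationManager:

```javascript
// Sketch: per-session isolation via a registry keyed on session ID.
// createManager is a hypothetical factory; substitute your own constructor.
function SessionRegistry(createManager) {
  this.sessions = new Map();
  this.createManager = createManager;
}
SessionRegistry.prototype.get = function (sessionId) {
  if (!this.sessions.has(sessionId)) {
    this.sessions.set(sessionId, this.createManager(sessionId));
  }
  return this.sessions.get(sessionId);
};
SessionRegistry.prototype.end = function (sessionId) {
  this.sessions.delete(sessionId); // lets the session's history be garbage collected
};
```

For multi-process deployments the same idea applies, but the Map is replaced by Redis or a database so any instance can rehydrate a session's state.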
5. Runaway Summarization Costs
When your compression triggers too aggressively, you can end up spending more on summarization API calls than on the actual conversation. I have seen cases where every other turn triggered a compression cycle.
Fix: Set your summaryThreshold high enough that compression is infrequent. A threshold of 30-50 messages is reasonable for most applications. Also use gpt-4o-mini (not the full model) for summarization. The cost difference is 10-20x for a task that does not require frontier intelligence.
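A sketch of that guard, combining the message threshold with a cooldown so compression can never fire on back-to-back turns. The default numbers are assumptions to tune per application:

```javascript
// Sketch: gate compression on a message threshold AND a cooldown between runs.
// Defaults are assumptions; tune them against your own traffic.
function CompressionGate(options) {
  this.summaryThreshold = options.summaryThreshold || 40; // messages before compressing
  this.cooldownTurns = options.cooldownTurns || 10;       // minimum turns between runs
  this.lastCompressedAt = -Infinity;
}
CompressionGate.prototype.shouldCompress = function (messageCount, turn) {
  var pastThreshold = messageCount >= this.summaryThreshold;
  var cooledDown = turn - this.lastCompressedAt >= this.cooldownTurns;
  if (pastThreshold && cooledDown) {
    this.lastCompressedAt = turn;
    return true;
  }
  return false;
};
```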
Best Practices
Always reserve response tokens. Calculate your usable context as maxTokens - responseReserve - systemPromptTokens, not just maxTokens. Failing to do this is the number one cause of truncated or failed responses in production.
Count tokens, do not estimate. A "2000 word" document might be 2,400 tokens or 3,200 tokens depending on the content. Use tiktoken or the provider's token counting API. Estimation formulas like "tokens = words * 1.3" will betray you at the worst possible time.
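That budget calculation can be extended into a full allocation across context tiers. The 20/30/50 split below is an illustrative assumption, not a provider recommendation:

```javascript
// Sketch: split the usable window across context tiers by fixed ratios.
// The 20/30/50 split is an assumption; tune it per application.
function allocateBudget(maxTokens, responseReserve, systemPromptTokens) {
  var usable = maxTokens - responseReserve - systemPromptTokens;
  return {
    usable: usable,
    summary: Math.floor(usable * 0.2),   // compressed older history
    retrieved: Math.floor(usable * 0.3), // RAG / injected documents
    recent: Math.floor(usable * 0.5)     // verbatim recent messages
  };
}
```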
Put the most important context at the beginning and end. Due to the "lost in the middle" effect, models recall information at the start and end of the context more reliably than information in the middle. Place your system prompt and key facts first, recent messages last, and summaries in between.
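Concretely, that ordering might be assembled like this (a sketch; the message shapes follow the OpenAI chat format):

```javascript
// Sketch: order context to exploit start/end recall.
// System prompt and key facts first, summary in the middle, recent turns last.
function assembleMessages(systemPrompt, keyFacts, summary, recentMessages) {
  var messages = [{ role: "system", content: systemPrompt }];
  if (keyFacts.length > 0) {
    messages.push({ role: "system", content: "Key facts:\n" + keyFacts.join("\n") });
  }
  if (summary) {
    messages.push({ role: "system", content: "Earlier conversation summary:\n" + summary });
  }
  return messages.concat(recentMessages);
}
```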
Use the cheapest model for context management tasks. Summarization, fact extraction, and relevance scoring do not need GPT-4o or Claude Opus. Use GPT-4o-mini, Claude Haiku, or similar lightweight models for these auxiliary tasks. The quality difference is negligible for compression work.
Log your context usage in production. Track prompt tokens, completion tokens, and context utilization percentage per request. This data tells you when to adjust your budget allocations, when compression is triggering too often, and when users are hitting limits.
Design for graceful degradation. When the context is full and compression fails, your system should still work. Drop the summary, drop retrieved context, keep only the system prompt and last few messages. A slightly forgetful chatbot is better than one that throws errors.
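One way to structure that degradation is a fixed fallback ladder. This sketch assumes countTokens is injected and that parts holds plain strings plus an array of recent messages; the four stages mirror the drop order described above:

```javascript
// Sketch: degrade in stages instead of failing when the context is full.
// Stage 1: everything. Stage 2: drop summary. Stage 3: drop retrieved docs.
// Stage 4: keep only the last few messages. Last resort: one message.
function degradeContext(parts, budget, countTokens) {
  var candidates = [
    parts,
    Object.assign({}, parts, { summary: "" }),
    Object.assign({}, parts, { summary: "", retrieved: "" }),
    { system: parts.system, summary: "", retrieved: "", recent: parts.recent.slice(-4) }
  ];
  for (var i = 0; i < candidates.length; i++) {
    var c = candidates[i];
    var size = countTokens(c.system) + countTokens(c.summary) +
               countTokens(c.retrieved) + countTokens(c.recent.join("\n"));
    if (size <= budget) return c;
  }
  // Last resort: system prompt plus the single most recent message
  return { system: parts.system, summary: "", retrieved: "", recent: parts.recent.slice(-1) };
}
```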
Test with adversarial conversation lengths. Do not just test with 5-turn conversations. Simulate 200-turn sessions, sessions with very long individual messages, and sessions that change topics repeatedly. These edge cases expose problems in context management that short tests miss.
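A small generator for such adversarial sessions; the shapes it produces (topic churn every turn, optionally padded message bodies) are illustrative assumptions:

```javascript
// Sketch: generate synthetic conversations that stress context management.
// Topic rotation and padding sizes are arbitrary test assumptions.
function syntheticConversation(turns, options) {
  var topics = ["websockets", "postgres", "redis", "auth", "billing"];
  var messages = [];
  for (var i = 0; i < turns; i++) {
    var topic = topics[i % topics.length]; // topic changes every turn
    var pad = (options && options.longMessages)
      ? new Array(2000).join("word ")      // ~10KB message body
      : "";
    messages.push({
      role: i % 2 === 0 ? "user" : "assistant",
      content: "Turn " + i + " about " + topic + ". " + pad
    });
  }
  return messages;
}
```

Feed the output through your full pipeline (token counting, compression, assembly) and assert that no request ever exceeds the model limit.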
Version your system prompts. If you change the system prompt, the cached prefix invalidates and your costs spike. Track system prompt versions and deploy changes deliberately, understanding the caching impact.
References
- OpenAI Tokenizer Tool - Interactive token counting for OpenAI models
- OpenAI Token Counting Guide - Official guide to tiktoken usage
- Anthropic Prompt Caching - Cache control documentation for Anthropic API
- Lost in the Middle: How Language Models Use Long Contexts - Research on long-context recall degradation
- tiktoken npm package - Token counting library for Node.js
- OpenAI Chat Completions API - Official API documentation