Agent Memory Systems: Short-Term and Long-Term
Build agent memory systems with short-term buffers, long-term PostgreSQL storage, episodic/semantic memory, and vector retrieval in Node.js.
AI agents without memory are goldfish. Every conversation starts from zero, every lesson learned evaporates the moment the context window fills up, and every user preference has to be re-stated. Building real memory systems — short-term buffers for the current session and long-term persistent stores for accumulated knowledge — is what separates a toy chatbot from an agent that actually improves over time.
This article walks through the architecture of agent memory systems, from in-memory conversation buffers through PostgreSQL-backed episodic and semantic memory with vector retrieval, and shows you how to wire it all together in Node.js.
Prerequisites
- Node.js v18+ and npm
- PostgreSQL 15+ with the pgvector extension installed
- An OpenAI API key (for generating embeddings)
- Working knowledge of Express.js and SQL
- Familiarity with LLM prompt construction and token limits
Why Agents Need Memory Beyond the Context Window
Every LLM has a finite context window. Even models advertising 128k or 200k tokens hit a wall eventually. But the real problem is not just size — it is cost and relevance. Stuffing 100k tokens of raw conversation history into every prompt is expensive, slow, and noisy. Most of that history is irrelevant to the current query.
Memory systems solve this by acting as an external brain. Short-term memory holds what matters right now — the current conversation thread, intermediate reasoning steps, scratchpad notes. Long-term memory persists across sessions and stores distilled knowledge: what happened in past conversations, what facts the agent has learned, what tool-usage patterns work best.
The human brain does this naturally. Working memory holds roughly seven items. Important experiences get consolidated into long-term memory during sleep. Irrelevant details get forgotten. Agent memory systems should mirror this architecture.
Short-Term Memory
Short-term memory lives in-process. It exists for the duration of a session and gets discarded (or selectively consolidated) when the session ends.
Conversation Buffer
The simplest form of short-term memory is a conversation buffer — an ordered list of messages in the current interaction.
var EventEmitter = require("events");
function ConversationBuffer(options) {
options = options || {};
this.maxMessages = options.maxMessages || 50;
this.maxTokens = options.maxTokens || 8000;
this.messages = [];
this.emitter = new EventEmitter();
}
ConversationBuffer.prototype.add = function (role, content, metadata) {
var message = {
role: role,
content: content,
timestamp: Date.now(),
tokenEstimate: Math.ceil(content.length / 4),
metadata: metadata || {}
};
this.messages.push(message);
this._enforceLimit();
this.emitter.emit("message_added", message);
return message;
};
ConversationBuffer.prototype._enforceLimit = function () {
// Remove oldest messages if we exceed count limit
while (this.messages.length > this.maxMessages) {
var removed = this.messages.shift();
this.emitter.emit("message_evicted", removed);
}
// Remove oldest messages if we exceed token budget
var totalTokens = this._estimateTokens();
while (totalTokens > this.maxTokens && this.messages.length > 1) {
var removed = this.messages.shift();
this.emitter.emit("message_evicted", removed);
totalTokens = this._estimateTokens();
}
};
ConversationBuffer.prototype._estimateTokens = function () {
var total = 0;
for (var i = 0; i < this.messages.length; i++) {
total += this.messages[i].tokenEstimate;
}
return total;
};
ConversationBuffer.prototype.getMessages = function () {
return this.messages.map(function (m) {
return { role: m.role, content: m.content };
});
};
ConversationBuffer.prototype.getRecent = function (n) {
return this.messages.slice(-n);
};
ConversationBuffer.prototype.clear = function () {
this.messages = [];
};
The key design decisions here: a dual limit on both message count and estimated tokens, and an event emitter so that evicted messages can be intercepted by a consolidation layer (more on that later).
Working Memory (Scratchpad)
Beyond conversation history, agents need a scratchpad for intermediate state — partial results, extracted entities, in-progress reasoning chains. This is working memory.
function WorkingMemory() {
this.entries = {};
this.created = Date.now();
}
WorkingMemory.prototype.set = function (key, value, ttlMs) {
this.entries[key] = {
value: value,
createdAt: Date.now(),
expiresAt: ttlMs ? Date.now() + ttlMs : null
};
};
WorkingMemory.prototype.get = function (key) {
var entry = this.entries[key];
if (!entry) return null;
if (entry.expiresAt && Date.now() > entry.expiresAt) {
delete this.entries[key];
return null;
}
return entry.value;
};
WorkingMemory.prototype.getAll = function () {
var self = this;
var result = {};
var keys = Object.keys(this.entries);
keys.forEach(function (key) {
var val = self.get(key); // triggers TTL check
if (val !== null) {
result[key] = val;
}
});
return result;
};
WorkingMemory.prototype.remove = function (key) {
delete this.entries[key];
};
WorkingMemory.prototype.toPromptString = function () {
var all = this.getAll();
var keys = Object.keys(all);
if (keys.length === 0) return "";
var lines = keys.map(function (k) {
var val = typeof all[k] === "string" ? all[k] : JSON.stringify(all[k]);
return "- " + k + ": " + val;
});
return "## Current Working Memory\n" + lines.join("\n");
};
Working memory entries have optional TTLs. An entity extracted from a user message might be relevant for 5 minutes but not for the rest of the session. This prevents stale intermediate state from cluttering prompts.
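The TTL check happens lazily at read time. Here is a standalone sketch of the same semantics with an injectable clock (`makeScratchpad` is a hypothetical helper, used so expiry can be demonstrated without real waiting):

```javascript
// TTL semantics with an injectable clock. Mirrors the read-time expiry
// check in WorkingMemory.get(); the clock function is the only difference.
function makeScratchpad(clock) {
  var entries = {};
  return {
    set: function (key, value, ttlMs) {
      entries[key] = { value: value, expiresAt: ttlMs ? clock() + ttlMs : null };
    },
    get: function (key) {
      var e = entries[key];
      if (!e) return null;
      if (e.expiresAt && clock() > e.expiresAt) {
        delete entries[key]; // expired entries are removed on read
        return null;
      }
      return e.value;
    }
  };
}

var fakeNow = 0;
var pad = makeScratchpad(function () { return fakeNow; });
pad.set("extractedEntity", "order #1234", 5 * 60 * 1000); // 5-minute TTL
console.log(pad.get("extractedEntity")); // "order #1234"
fakeNow += 6 * 60 * 1000; // six minutes later
console.log(pad.get("extractedEntity")); // null
```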
Long-Term Memory Types
Long-term memory persists across sessions and serves three distinct purposes, mirroring how human long-term memory is categorized in cognitive science.
Episodic Memory
Episodic memory records specific past interactions — what the user asked, what the agent did, what the outcome was. Think of it as a diary. It answers questions like "What did we talk about last Tuesday?" or "What happened the last time I asked about database migrations?"
Semantic Memory
Semantic memory stores facts, knowledge, and generalizations. It is the agent's accumulated knowledge base, detached from specific episodes. "The user prefers TypeScript" or "The production database runs PostgreSQL 15" are semantic memories.
Procedural Memory
Procedural memory captures learned behaviors — which tools work well for which tasks, what prompt patterns produce good results, what sequences of actions reliably solve particular problem types. It is the agent's muscle memory.
Implementing Long-Term Memory with PostgreSQL
PostgreSQL with the pgvector extension gives us relational storage for structured memory records and vector similarity search for semantic retrieval — all in one database.
Schema
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE agent_memories (
id SERIAL PRIMARY KEY,
agent_id VARCHAR(64) NOT NULL,
user_id VARCHAR(64),
memory_type VARCHAR(20) NOT NULL CHECK (memory_type IN ('episodic', 'semantic', 'procedural')),
content TEXT NOT NULL,
summary TEXT,
embedding vector(1536),
importance REAL DEFAULT 0.5,
access_count INTEGER DEFAULT 0,
last_accessed TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ,
metadata JSONB DEFAULT '{}'::jsonb
);
CREATE INDEX idx_memories_agent ON agent_memories(agent_id);
CREATE INDEX idx_memories_type ON agent_memories(agent_id, memory_type);
CREATE INDEX idx_memories_embedding ON agent_memories
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_memories_importance ON agent_memories(agent_id, importance DESC);
CREATE INDEX idx_memories_metadata ON agent_memories USING gin (metadata);
The schema stores all three memory types in one table with a discriminator column. The embedding column holds 1536-dimensional vectors from OpenAI's text-embedding-3-small model. The importance score (0.0 to 1.0) lets retrieval prioritize high-value memories.
Long-Term Memory Store
var pg = require("pg");
// Node 18+ ships a global fetch, so no extra HTTP client is needed
function LongTermMemory(options) {
this.pool = new pg.Pool({
connectionString: options.connectionString
});
this.agentId = options.agentId;
this.openaiKey = options.openaiKey;
this.embeddingModel = options.embeddingModel || "text-embedding-3-small";
}
LongTermMemory.prototype.generateEmbedding = function (text) {
var self = this;
return fetch("https://api.openai.com/v1/embeddings", {
method: "POST",
headers: {
"Authorization": "Bearer " + self.openaiKey,
"Content-Type": "application/json"
},
body: JSON.stringify({
model: self.embeddingModel,
input: text.substring(0, 8000)
})
})
.then(function (res) { return res.json(); })
.then(function (data) {
if (data.error) {
throw new Error("Embedding error: " + data.error.message);
}
return data.data[0].embedding;
});
};
LongTermMemory.prototype.store = function (memoryType, content, options) {
var self = this;
options = options || {};
return self.generateEmbedding(content).then(function (embedding) {
var query = [
"INSERT INTO agent_memories",
"(agent_id, user_id, memory_type, content, summary, embedding,",
" importance, expires_at, metadata)",
"VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)",
"RETURNING id"
].join(" ");
var embeddingStr = "[" + embedding.join(",") + "]";
var params = [
self.agentId,
options.userId || null,
memoryType,
content,
options.summary || null,
embeddingStr,
options.importance || 0.5,
options.expiresAt || null,
JSON.stringify(options.metadata || {})
];
return self.pool.query(query, params);
}).then(function (result) {
return result.rows[0].id;
});
};
Episodic Memory: Storing Past Interactions
Episodic memories capture the full arc of an interaction — what was asked, what the agent did, what the outcome was.
LongTermMemory.prototype.storeEpisode = function (episode) {
var content = [
"User request: " + episode.userMessage,
"Agent action: " + episode.agentAction,
"Outcome: " + episode.outcome,
"Tools used: " + (episode.toolsUsed || []).join(", ")
].join("\n");
var summary = "User asked about " + episode.topic +
". Agent " + episode.agentAction.substring(0, 100) +
". Result: " + episode.outcome.substring(0, 100);
return this.store("episodic", content, {
userId: episode.userId,
summary: summary,
importance: episode.importance || 0.5,
metadata: {
topic: episode.topic,
toolsUsed: episode.toolsUsed,
successful: episode.successful,
sessionId: episode.sessionId
}
});
};
Semantic Memory: Knowledge Base with Vector Search
Semantic memories are facts and preferences extracted from conversations. They get stored with embeddings and retrieved via vector similarity.
LongTermMemory.prototype.storeKnowledge = function (fact, options) {
options = options || {};
return this.store("semantic", fact, {
userId: options.userId,
summary: fact.substring(0, 200),
importance: options.importance || 0.6,
metadata: {
source: options.source || "conversation",
category: options.category || "general",
confidence: options.confidence || 0.8
}
});
};
LongTermMemory.prototype.searchSemantic = function (query, options) {
var self = this;
options = options || {};
var limit = options.limit || 5;
var threshold = options.threshold || 0.7;
return self.generateEmbedding(query).then(function (queryEmbedding) {
var embeddingStr = "[" + queryEmbedding.join(",") + "]";
var sql = [
"SELECT id, content, summary, importance, metadata,",
" 1 - (embedding <=> $1::vector) AS similarity,",
" created_at, access_count",
"FROM agent_memories",
"WHERE agent_id = $2",
" AND memory_type = $3",
" AND (expires_at IS NULL OR expires_at > NOW())",
" AND 1 - (embedding <=> $1::vector) > $4",
"ORDER BY similarity DESC",
"LIMIT $5"
].join(" ");
return self.pool.query(sql, [embeddingStr, self.agentId, "semantic", threshold, limit]);
}).then(function (result) {
// Update access counts
var ids = result.rows.map(function (r) { return r.id; });
if (ids.length > 0) {
self.pool.query(
"UPDATE agent_memories SET access_count = access_count + 1, last_accessed = NOW() WHERE id = ANY($1)",
[ids]
).catch(function (err) {
// Fire-and-forget, but catch to avoid unhandled promise rejections
console.error("Access count update error:", err.message);
});
}
return result.rows;
});
};
The <=> operator is pgvector's cosine distance. We convert it to similarity with 1 - distance. The threshold parameter (default 0.7) filters out weakly related memories so the agent does not get distracted by noise.
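If the distance-to-similarity conversion feels abstract, this small sketch reproduces it in plain JavaScript (no database involved):

```javascript
// Cosine distance as pgvector's <=> computes it, and the 1 - distance
// conversion used in the queries above.
function cosineDistance(a, b) {
  var dot = 0, normA = 0, normB = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Vectors pointing the same direction: distance ~0, similarity ~1
var similarity = 1 - cosineDistance([1, 2, 0], [2, 4, 0]);
console.log(similarity); // ~1

// Orthogonal vectors: distance 1, similarity 0
var orthogonal = 1 - cosineDistance([1, 0], [0, 1]);
console.log(orthogonal); // 0
```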
Procedural Memory: Learned Tool Patterns
Procedural memory records what the agent has learned about how to accomplish tasks effectively.
LongTermMemory.prototype.storeToolPattern = function (pattern) {
var content = [
"Task type: " + pattern.taskType,
"Tool sequence: " + pattern.toolSequence.join(" -> "),
"Success rate: " + pattern.successRate,
"Avg duration: " + pattern.avgDuration + "ms",
"Notes: " + (pattern.notes || "none")
].join("\n");
return this.store("procedural", content, {
importance: Math.min(pattern.successRate, 1.0),
metadata: {
taskType: pattern.taskType,
toolSequence: pattern.toolSequence,
successRate: pattern.successRate,
sampleSize: pattern.sampleSize
}
});
};
LongTermMemory.prototype.getToolPatterns = function (taskDescription) {
var self = this;
return self.generateEmbedding(taskDescription).then(function (embedding) {
var embeddingStr = "[" + embedding.join(",") + "]";
var sql = [
"SELECT content, metadata,",
" 1 - (embedding <=> $1::vector) AS similarity",
"FROM agent_memories",
"WHERE agent_id = $2 AND memory_type = 'procedural'",
" AND (metadata->>'successRate')::float > 0.6",
"ORDER BY similarity DESC, importance DESC",
"LIMIT 3"
].join(" ");
return self.pool.query(sql, [embeddingStr, self.agentId]);
}).then(function (result) {
return result.rows;
});
};
This is where agents get genuinely useful over time. After the agent discovers that a particular sequence of API calls reliably solves a class of problems, that pattern gets recorded. Future encounters with similar tasks surface the proven approach.
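The article leaves pattern detection open; one plausible approach is a small in-memory aggregator that only reports a pattern once enough samples accumulate. `PatternTracker` below is a hypothetical helper whose output matches the shape `storeToolPattern` expects:

```javascript
// Hypothetical aggregator: records tool-run outcomes per task type and
// reports a pattern only after minSamples runs have been observed.
function PatternTracker(minSamples) {
  this.minSamples = minSamples || 5;
  this.stats = {}; // taskType|sequence -> running totals
}

PatternTracker.prototype.record = function (taskType, toolSequence, success, durationMs) {
  var key = taskType + "|" + toolSequence.join(">");
  var s = this.stats[key] ||
    { successes: 0, total: 0, totalMs: 0, taskType: taskType, toolSequence: toolSequence };
  s.successes += success ? 1 : 0;
  s.total += 1;
  s.totalMs += durationMs;
  this.stats[key] = s;
  if (s.total < this.minSamples) return null; // not enough evidence yet
  return {
    taskType: s.taskType,
    toolSequence: s.toolSequence,
    successRate: s.successes / s.total,
    avgDuration: Math.round(s.totalMs / s.total),
    sampleSize: s.total
  };
};

var tracker = new PatternTracker(3);
tracker.record("db-migration", ["plan", "backup", "migrate"], true, 1200);
tracker.record("db-migration", ["plan", "backup", "migrate"], true, 900);
var pattern = tracker.record("db-migration", ["plan", "backup", "migrate"], false, 1500);
console.log(pattern.sampleSize); // 3
// pattern could now be handed to longTermMemory.storeToolPattern(pattern)
```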
Memory Retrieval Strategies
Retrieval is the critical bottleneck. Storing memories is easy. Getting the right ones back at the right time is the hard part.
Hybrid Scoring
Pure vector similarity is not enough. A memory from five minutes ago about the current topic should outrank a slightly more similar memory from six months ago about something tangential. Hybrid scoring combines multiple signals.
function scoreMemory(memory, queryEmbedding, now) {
var similarity = memory.similarity || 0;
var recencyMs = now - new Date(memory.created_at).getTime();
var recencyDays = recencyMs / (1000 * 60 * 60 * 24);
var recencyScore = Math.exp(-0.05 * recencyDays); // exponential decay
var importance = memory.importance || 0.5;
var accessFrequency = Math.min(memory.access_count / 20, 1.0);
// Weighted combination
var score = (
similarity * 0.40 +
recencyScore * 0.25 +
importance * 0.25 +
accessFrequency * 0.10
);
return score;
}
LongTermMemory.prototype.retrieve = function (query, options) {
var self = this;
options = options || {};
var limit = options.limit || 10;
var finalLimit = options.finalLimit || 5;
return self.generateEmbedding(query).then(function (queryEmbedding) {
var embeddingStr = "[" + queryEmbedding.join(",") + "]";
// Fetch more candidates than needed, then re-rank
var sql = [
"SELECT id, content, summary, importance, metadata,",
" 1 - (embedding <=> $1::vector) AS similarity,",
" created_at, access_count, memory_type",
"FROM agent_memories",
"WHERE agent_id = $2",
" AND (expires_at IS NULL OR expires_at > NOW())",
"ORDER BY embedding <=> $1::vector",
"LIMIT $3"
].join(" ");
return self.pool.query(sql, [embeddingStr, self.agentId, limit])
.then(function (result) {
var now = Date.now();
var scored = result.rows.map(function (row) {
row.compositeScore = scoreMemory(row, queryEmbedding, now);
return row;
});
scored.sort(function (a, b) {
return b.compositeScore - a.compositeScore;
});
return scored.slice(0, finalLimit);
});
});
};
The weights here are a starting point. In production, you would tune them based on your agent's domain. A customer support agent might weight recency higher. A research agent might weight importance and similarity higher.
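To experiment with tuning, the same scoring logic can take its weights as a parameter. The memories and weight profiles below are made-up values for illustration:

```javascript
// scoreMemory with the weights pulled into a parameter, for experimentation.
function scoreMemoryWeighted(memory, now, weights) {
  var recencyDays = (now - memory.createdAt) / (1000 * 60 * 60 * 24);
  var recencyScore = Math.exp(-0.05 * recencyDays);
  var accessFrequency = Math.min(memory.accessCount / 20, 1.0);
  return memory.similarity * weights.similarity +
    recencyScore * weights.recency +
    memory.importance * weights.importance +
    accessFrequency * weights.frequency;
}

var now = Date.now();
var DAY = 24 * 60 * 60 * 1000;
// Made-up candidates: one recent but loosely related, one older near-match
var recentButVague = { similarity: 0.70, importance: 0.5, accessCount: 2, createdAt: now - 1 * DAY };
var oldButSpecific = { similarity: 0.95, importance: 0.5, accessCount: 2, createdAt: now - 10 * DAY };

var defaults = { similarity: 0.40, recency: 0.25, importance: 0.25, frequency: 0.10 };
var recencyHeavy = { similarity: 0.30, recency: 0.45, importance: 0.15, frequency: 0.10 };

// Default weights: the near-match wins despite being older
console.log(scoreMemoryWeighted(oldButSpecific, now, defaults) >
  scoreMemoryWeighted(recentButVague, now, defaults)); // true
// A recency-heavy profile flips the ordering
console.log(scoreMemoryWeighted(oldButSpecific, now, recencyHeavy) >
  scoreMemoryWeighted(recentButVague, now, recencyHeavy)); // false
```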
Memory Consolidation
Memory consolidation is the bridge between short-term and long-term memory. When a conversation ends (or a buffer evicts messages), the system decides what is worth persisting.
function MemoryConsolidator(options) {
this.longTermMemory = options.longTermMemory;
this.llmClient = options.llmClient;
this.minImportance = options.minImportance || 0.4;
}
MemoryConsolidator.prototype.consolidate = function (messages, userId) {
var self = this;
// Use the LLM to extract memorable content
var transcript = messages.map(function (m) {
return m.role + ": " + m.content;
}).join("\n");
var extractionPrompt = [
"Analyze this conversation and extract memories worth preserving.",
"Return a JSON array of objects with these fields:",
"- type: 'episodic', 'semantic', or 'procedural'",
"- content: the memory content",
"- importance: 0.0 to 1.0",
"- topic: brief topic label",
"",
"Only extract genuinely useful information. Skip small talk and filler.",
"Episodic: specific interactions with clear outcomes.",
"Semantic: facts, preferences, or knowledge learned.",
"Procedural: successful strategies or tool patterns discovered.",
"",
"Conversation:",
transcript
].join("\n");
return self.llmClient.complete(extractionPrompt).then(function (response) {
var memories;
try {
memories = JSON.parse(response);
} catch (e) {
console.error("Failed to parse consolidation output:", e.message);
return [];
}
var storePromises = [];
memories.forEach(function (memory) {
if (memory.importance < self.minImportance) return;
var promise = self.longTermMemory.store(memory.type, memory.content, {
userId: userId,
importance: memory.importance,
metadata: { topic: memory.topic, source: "consolidation" }
});
storePromises.push(promise);
});
return Promise.all(storePromises);
});
};
The consolidation step uses the LLM itself to decide what is worth remembering. This is expensive but effective. A cheaper alternative is rule-based extraction: always store messages containing entities, always store successful tool invocations, never store greetings.
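A minimal sketch of that rule-based alternative (the specific patterns are illustrative, not a fixed recipe):

```javascript
// Cheap, deterministic alternative to LLM-based consolidation: keyword and
// structure rules decide what to keep. The patterns here are illustrative.
var PREFERENCE_PATTERN = /\b(i prefer|i always|i never|my favorite|call me)\b/i;
var GREETING_PATTERN = /^(hi|hello|hey|thanks|thank you|bye)\b/i;

function ruleBasedExtract(messages) {
  var memories = [];
  messages.forEach(function (msg) {
    if (GREETING_PATTERN.test(msg.content.trim())) return; // skip small talk
    if (msg.role === "user" && PREFERENCE_PATTERN.test(msg.content)) {
      memories.push({ type: "semantic", content: msg.content, importance: 0.7 });
    }
    if (msg.metadata && msg.metadata.toolsUsed && msg.metadata.successful) {
      memories.push({
        type: "procedural",
        content: "Successful tools: " + msg.metadata.toolsUsed.join(", "),
        importance: 0.6
      });
    }
  });
  return memories;
}

var extracted = ruleBasedExtract([
  { role: "user", content: "hi there" },
  { role: "user", content: "I prefer TypeScript for new services" },
  { role: "assistant", content: "Done.", metadata: { toolsUsed: ["deploy"], successful: true } }
]);
console.log(extracted.length); // 2 (the greeting is skipped)
```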
Forgetting Mechanisms
Memory systems that never forget eventually drown in noise. Intentional forgetting is as important as intentional remembering.
LongTermMemory.prototype.decay = function () {
// Remove expired memories
var expiredQuery = "DELETE FROM agent_memories WHERE expires_at IS NOT NULL AND expires_at < NOW()";
// Reduce importance of old, infrequently accessed memories
var decayQuery = [
"UPDATE agent_memories",
"SET importance = importance * 0.95",
"WHERE last_accessed < NOW() - INTERVAL '30 days'",
" AND access_count < 3",
" AND importance > 0.1"
].join(" ");
// Delete memories that have decayed below threshold
var pruneQuery = "DELETE FROM agent_memories WHERE importance < 0.1 AND access_count < 2";
var self = this;
return self.pool.query(expiredQuery)
.then(function (r) {
console.log("Expired memories removed:", r.rowCount);
return self.pool.query(decayQuery);
})
.then(function (r) {
console.log("Memories decayed:", r.rowCount);
return self.pool.query(pruneQuery);
})
.then(function (r) {
console.log("Low-value memories pruned:", r.rowCount);
});
};
Run this on a cron schedule — daily or weekly depending on volume. The three-step process mirrors how human memory works: explicit expiration (TTL), gradual fading (decay), and pruning of weak memories.
Memory-Augmented Prompt Construction
The final piece is assembling all memory sources into a coherent prompt that gives the LLM the context it needs without blowing the token budget.
function MemoryAugmentedPromptBuilder(options) {
this.conversationBuffer = options.conversationBuffer;
this.workingMemory = options.workingMemory;
this.longTermMemory = options.longTermMemory;
this.maxMemoryTokens = options.maxMemoryTokens || 2000;
}
MemoryAugmentedPromptBuilder.prototype.build = function (systemPrompt, userMessage) {
var self = this;
return self.longTermMemory.retrieve(userMessage, { finalLimit: 5 })
.then(function (memories) {
var parts = [];
// System prompt
parts.push({ role: "system", content: systemPrompt });
// Long-term memory context
if (memories.length > 0) {
var memoryBlock = "## Relevant Memories\n";
var usedTokens = Math.ceil(memoryBlock.length / 4);
memories.forEach(function (mem) {
var typeLabel = "[" + mem.memory_type.toUpperCase() + "]";
var score = mem.compositeScore.toFixed(2);
var line = typeLabel + " (relevance: " + score + ") " +
(mem.summary || mem.content.substring(0, 200)) + "\n\n";
var lineTokens = Math.ceil(line.length / 4);
// Stop adding memories once the token budget is exhausted
if (usedTokens + lineTokens > self.maxMemoryTokens) return;
memoryBlock += line;
usedTokens += lineTokens;
});
parts.push({ role: "system", content: memoryBlock });
}
// Working memory
var workingStr = self.workingMemory.toPromptString();
if (workingStr) {
parts.push({ role: "system", content: workingStr });
}
// Conversation history
var history = self.conversationBuffer.getMessages();
history.forEach(function (msg) {
parts.push(msg);
});
// Current user message
parts.push({ role: "user", content: userMessage });
return parts;
});
};
The ordering matters. System prompt first, then long-term memory context, then working memory, then conversation history, then the current message. This puts the most stable context at the top (where the LLM is least likely to lose track of it) and the most dynamic context at the bottom (where recency bias helps).
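For a toy session, the assembled array looks like this (stub content standing in for real memories):

```javascript
// Shape of the final prompt array, with stub content in place of real memory
// blocks. Order: system -> long-term memory -> working memory -> history -> user.
var parts = [
  { role: "system", content: "You are a helpful support agent." },
  { role: "system", content: "## Relevant Memories\n[SEMANTIC] (relevance: 0.82) User prefers TypeScript" },
  { role: "system", content: "## Current Working Memory\n- currentTicket: #4521" },
  { role: "user", content: "Can you check on my ticket?" },
  { role: "assistant", content: "Ticket #4521 is in review." },
  { role: "user", content: "Any update since this morning?" }
];

var roles = parts.map(function (p) { return p.role; });
console.log(roles.join(",")); // system,system,system,user,assistant,user
```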
Complete Working Example
Here is the full agent memory system wired together in a single Express endpoint.
var express = require("express");
// -- Initialize components --
var conversationBuffers = {}; // sessionId -> ConversationBuffer
var workingMemories = {}; // sessionId -> WorkingMemory
var longTermMemory = new LongTermMemory({
connectionString: process.env.POSTGRES_CONNECTION_STRING,
agentId: "support-agent-v1",
openaiKey: process.env.OPENAI_API_KEY
});
var consolidator = new MemoryConsolidator({
longTermMemory: longTermMemory,
llmClient: { complete: callLLM }, // your LLM wrapper
minImportance: 0.4
});
function getSessionMemory(sessionId) {
if (!conversationBuffers[sessionId]) {
conversationBuffers[sessionId] = new ConversationBuffer({
maxMessages: 40,
maxTokens: 6000
});
// Wire up consolidation on eviction
conversationBuffers[sessionId].emitter.on("message_evicted", function (msg) {
// Batch evicted messages for consolidation
if (!conversationBuffers[sessionId]._evicted) {
conversationBuffers[sessionId]._evicted = [];
}
conversationBuffers[sessionId]._evicted.push(msg);
});
}
if (!workingMemories[sessionId]) {
workingMemories[sessionId] = new WorkingMemory();
}
return {
buffer: conversationBuffers[sessionId],
working: workingMemories[sessionId]
};
}
var app = express();
app.use(express.json());
app.post("/agent/chat", function (req, res) {
var sessionId = req.body.sessionId;
var userId = req.body.userId;
var userMessage = req.body.message;
if (!sessionId || !userMessage) {
return res.status(400).json({ error: "sessionId and message required" });
}
var session = getSessionMemory(sessionId);
// Add user message to buffer
session.buffer.add("user", userMessage, { userId: userId });
// Build memory-augmented prompt
var builder = new MemoryAugmentedPromptBuilder({
conversationBuffer: session.buffer,
workingMemory: session.working,
longTermMemory: longTermMemory,
maxMemoryTokens: 2000
});
var systemPrompt = [
"You are a helpful support agent.",
"Use the provided memories to personalize your responses.",
"If you learn new facts about the user, note them explicitly."
].join(" ");
builder.build(systemPrompt, userMessage)
.then(function (messages) {
return callLLM(messages);
})
.then(function (response) {
// Add assistant response to buffer
session.buffer.add("assistant", response);
// Extract any new knowledge the agent identified
extractAndStoreKnowledge(response, userId);
res.json({
response: response,
sessionId: sessionId
});
})
.catch(function (err) {
console.error("Agent chat error:", err);
res.status(500).json({ error: "Internal agent error" });
});
});
// Session end -> consolidate memories
app.post("/agent/end-session", function (req, res) {
var sessionId = req.body.sessionId;
var userId = req.body.userId;
var session = getSessionMemory(sessionId);
var evicted = session.buffer._evicted || [];
consolidator.consolidate(evicted.concat(session.buffer.messages), userId)
.then(function (stored) {
// Cleanup
delete conversationBuffers[sessionId];
delete workingMemories[sessionId];
res.json({ consolidated: stored.length });
})
.catch(function (err) {
console.error("Consolidation error:", err);
res.status(500).json({ error: "Consolidation failed" });
});
});
function extractAndStoreKnowledge(response, userId) {
// Simple pattern: if the agent says "I'll remember that" or "noted",
// extract the preceding user message as semantic memory
if (/\bi('ll| will) remember\b|\bnoted\b|\bgot it\b/i.test(response)) {
var session = Object.values(conversationBuffers)[0]; // simplified
if (session && session.messages.length >= 2) {
var lastUserMsg = session.messages[session.messages.length - 2];
if (lastUserMsg.role === "user") {
longTermMemory.storeKnowledge(lastUserMsg.content, {
userId: userId,
importance: 0.7,
source: "explicit_acknowledgment"
}).catch(function (err) {
console.error("Knowledge storage error:", err);
});
}
}
}
}
function callLLM(messagesOrPrompt) {
// Implement your LLM call here (OpenAI, Anthropic, etc.)
// Return a promise that resolves with the response text
var messages = Array.isArray(messagesOrPrompt)
? messagesOrPrompt
: [{ role: "user", content: messagesOrPrompt }];
return fetch("https://api.openai.com/v1/chat/completions", {
method: "POST",
headers: {
"Authorization": "Bearer " + process.env.OPENAI_API_KEY,
"Content-Type": "application/json"
},
body: JSON.stringify({
model: "gpt-4o",
messages: messages,
temperature: 0.7
})
})
.then(function (r) { return r.json(); })
.then(function (data) {
if (data.error) {
throw new Error("LLM error: " + data.error.message);
}
return data.choices[0].message.content;
});
}
// Run memory decay daily
var cron = require("node-cron");
cron.schedule("0 3 * * *", function () {
console.log("Running memory decay...");
longTermMemory.decay().catch(function (err) {
console.error("Memory decay error:", err);
});
});
var port = process.env.PORT || 3000;
app.listen(port, function () {
console.log("Agent with memory running on port " + port);
});
This gives you a fully functional agent with conversation buffering, working memory, cross-session episodic and semantic storage, vector-based retrieval with hybrid scoring, LLM-powered consolidation at session end, and automatic memory decay.
Common Issues and Troubleshooting
1. pgvector Extension Not Found
ERROR: could not open extension control file
"/usr/share/postgresql/15/extension/vector.control": No such file or directory
This means pgvector is not installed on your PostgreSQL server. On Ubuntu: sudo apt install postgresql-15-pgvector. On macOS with Homebrew: brew install pgvector. On managed databases (AWS RDS, DigitalOcean), enable it through the provider's dashboard — not all tiers support it.
2. Embedding Dimension Mismatch
ERROR: expected 1536 dimensions, not 3072
This happens when you switch embedding models without updating the schema. The text-embedding-3-small model produces 1536-dimensional vectors. The text-embedding-3-large model produces 3072. If you change models, you must alter the column: ALTER TABLE agent_memories ALTER COLUMN embedding TYPE vector(3072) and regenerate all existing embeddings. Do not mix dimensions.
3. IVFFlat Index Requires Training Data
ERROR: index "idx_memories_embedding" is not valid
An ivfflat index builds its cluster lists from the rows present when the index is created, so creating it on an empty or near-empty table degrades retrieval badly. The lists parameter (set to 100 in our schema) should be roughly sqrt(n), where n is your expected row count. Either create the index after loading data and reduce lists (e.g., to 10) for small datasets, or switch to HNSW indexing: USING hnsw (embedding vector_cosine_ops), which works well at any scale.
4. Memory Consolidation Produces Invalid JSON
SyntaxError: Unexpected token < in JSON at position 0
The LLM sometimes returns JSON wrapped in markdown code fences, or adds explanatory text before the array. Wrap the parse in a more resilient extractor:
function parseJsonFromLLM(text) {
// Strip markdown code fences (``` or ```json)
var cleaned = text.replace(/```(?:json)?/gi, "").trim();
// Try to find a JSON array
var match = cleaned.match(/\[[\s\S]*\]/);
if (match) {
return JSON.parse(match[0]);
}
throw new Error("No JSON array found in LLM output");
}
5. Token Budget Overflows on Memory-Heavy Prompts
Error: This model's maximum context length is 128000 tokens.
However, your messages resulted in 135421 tokens.
This happens when too many memories get injected. Always enforce a hard token cap in the prompt builder. Count estimated tokens for each memory block and stop adding memories once you hit the budget. The maxMemoryTokens parameter in MemoryAugmentedPromptBuilder handles this, but you should also add per-memory truncation:
var truncatedContent = mem.content.length > 500
? mem.content.substring(0, 500) + "..."
: mem.content;
Best Practices
Separate memory types rigorously. Episodic, semantic, and procedural memories have different lifecycles, retrieval patterns, and decay rates. Storing them in one table with a discriminator column is fine for queries, but do not let the boundaries blur in your application logic.
Embed summaries, not raw content. Generating embeddings from a 200-word summary produces better retrieval than embedding a 5000-word transcript. The embedding captures the gist, which is what you want for similarity search. Store the full content separately for when you need the details.
Budget tokens aggressively for memory context. Reserve no more than 15-20% of your total context window for memory injection. If your model supports 128k tokens, cap memory at 20k. The rest needs to go to the system prompt, conversation history, and leaving room for the response.
Run memory decay on a schedule, not inline. Decay and pruning queries touch a lot of rows. Running them inside a request handler adds latency. Use a cron job or a background worker. Daily is a good starting frequency.
Use importance scoring as a first-class signal. Not all memories are created equal. A user's stated preference ("I always deploy to us-east-1") is more important than a casual remark ("nice weather today"). Score importance at write time and use it as a retrieval weight.
Test retrieval quality, not just retrieval speed. Build a small evaluation set: 20-30 queries where you know which memories should be returned. Run retrieval and score precision/recall. Tune your hybrid scoring weights based on actual results, not intuition.
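Precision and recall for such an evaluation set take only a few lines to compute (the ids below are illustrative):

```javascript
// Tiny retrieval-evaluation sketch: compare retrieved memory ids against a
// labeled expected set. Ids are made up for illustration.
function evaluateRetrieval(retrievedIds, expectedIds) {
  var expected = new Set(expectedIds);
  var hits = retrievedIds.filter(function (id) { return expected.has(id); }).length;
  return {
    precision: retrievedIds.length ? hits / retrievedIds.length : 0,
    recall: expectedIds.length ? hits / expectedIds.length : 0
  };
}

var result = evaluateRetrieval([12, 7, 99, 3], [7, 3, 41]);
console.log(result.precision); // 0.5  (2 of 4 retrieved were relevant)
console.log(result.recall);    // ~0.67 (2 of 3 relevant were retrieved)
```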
Implement explicit memory commands. Let users say "remember that I prefer dark mode" or "forget what I told you about my old API key." Explicit memory operations build trust and give users control. Parse these commands before they hit the LLM.
Keep embedding model versions pinned. Changing from text-embedding-3-small to a different model changes the vector space. Old embeddings become incompatible with new queries. If you must migrate, re-embed everything in a batch job and swap atomically.
Log consolidation decisions. When the LLM extracts memories during consolidation, log what it extracted and what it ignored. This is your audit trail for debugging weird agent behavior ("why does it think I like TypeScript?"). Store the raw consolidation output alongside the individual memories.
References
- pgvector: Open-Source Vector Similarity Search for PostgreSQL
- OpenAI Embeddings Documentation
- Cognitive Architectures for Language Agents (CoALA)
- MemGPT: Towards LLMs as Operating Systems
- LangChain Memory Module Documentation
- Letta (formerly MemGPT) Agent Framework
- Human Memory Systems — Atkinson-Shiffrin Model