
Building Conversational Memory with LLM APIs

Build conversational memory systems with session management, summarization, entity tracking, and PostgreSQL persistence in Node.js.

Large language models are stateless by design. Every API call starts from zero context, which means your chatbot forgets everything the moment a response is returned. Conversational memory is the engineering layer that bridges this gap, giving your LLM-powered applications the ability to recall past interactions, track entities, and build coherent multi-turn conversations. This article covers the full spectrum of memory architectures, from simple sliding windows to hybrid systems with summarization, entity tracking, and semantic retrieval, all built in Node.js with PostgreSQL persistence.

Prerequisites

  • Node.js v18 or later installed
  • PostgreSQL 14+ running locally or remotely
  • An OpenAI API key (or compatible LLM API)
  • Familiarity with Express.js and basic SQL
  • Working knowledge of LLM chat completion APIs

Install the required packages:

npm install openai pg uuid express

Why Conversational Memory Matters

Without memory, every interaction with an LLM is isolated. A user says "My name is Sarah" in turn one, asks "What's my name?" in turn two, and the model has no idea. This is not a limitation of the model itself — it is a limitation of how we call it. The chat completion API accepts a messages array, and whatever you put in that array is the model's entire universe of context.

This creates three fundamental engineering problems:

  1. Context window limits. Models have finite token budgets. GPT-4o gives you 128K tokens. Claude gives you 200K. Sounds like a lot until your chatbot has been running for 200 turns and the conversation history alone consumes 80K tokens, leaving little room for system prompts, retrieval results, or tool definitions.

  2. Cost scaling. You pay per token, both input and output. Sending the entire conversation history on every request means the cumulative cost of a conversation grows quadratically with its length. A 50-turn conversation costs far more than 50 times what a single turn costs.

  3. Relevance decay. Not all past messages are equally important. The user's preference stated 100 turns ago matters more than the small talk from turn 47. Naive approaches that just truncate old messages lose critical context while preserving irrelevant chatter.

Memory systems solve all three problems by intelligently managing what context reaches the model and in what form.

In-Memory Conversation Storage with Session Management

The simplest memory implementation stores conversations in a JavaScript object keyed by session ID. This works well for prototypes and low-traffic applications.

var conversations = {};

function getOrCreateSession(sessionId) {
  if (!conversations[sessionId]) {
    conversations[sessionId] = {
      id: sessionId,
      messages: [],
      createdAt: new Date(),
      lastAccessedAt: new Date(),
      metadata: {}
    };
  }
  conversations[sessionId].lastAccessedAt = new Date();
  return conversations[sessionId];
}

function addMessage(sessionId, role, content) {
  var session = getOrCreateSession(sessionId);
  var message = {
    role: role,
    content: content,
    timestamp: new Date(),
    tokenEstimate: Math.ceil(content.length / 4)
  };
  session.messages.push(message);
  return message;
}

function getMessages(sessionId) {
  var session = getOrCreateSession(sessionId);
  return session.messages.map(function(msg) {
    return { role: msg.role, content: msg.content };
  });
}

The tokenEstimate field uses a rough heuristic of 4 characters per token. It is not precise, but it is fast and good enough for deciding when to trim. For production systems, use the tiktoken library for exact counts.
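
If you do need exact counts, here is a minimal sketch using the tiktoken npm package (installed separately with npm install tiktoken):

var { get_encoding } = require("tiktoken");

function countTokens(text) {
  // cl100k_base is the encoding used by many OpenAI chat and embedding models
  var encoding = get_encoding("cl100k_base");
  var count = encoding.encode(text).length;
  encoding.free(); // the WASM-backed encoder must be released explicitly
  return count;
}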

The obvious limitation here is that everything lives in process memory. Restart the server and all conversations vanish. That is where PostgreSQL comes in.

Persisting Conversations to PostgreSQL

A durable memory system needs a database. PostgreSQL is an excellent choice because it handles JSON natively, supports full-text search, and has the pgvector extension for embedding-based retrieval.

First, create the schema:

CREATE TABLE conversations (
  id UUID PRIMARY KEY,
  created_at TIMESTAMP DEFAULT NOW(),
  last_accessed_at TIMESTAMP DEFAULT NOW(),
  metadata JSONB DEFAULT '{}'::jsonb
);

CREATE TABLE messages (
  id SERIAL PRIMARY KEY,
  conversation_id UUID REFERENCES conversations(id) ON DELETE CASCADE,
  role VARCHAR(20) NOT NULL,
  content TEXT NOT NULL,
  token_count INTEGER DEFAULT 0,
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE conversation_summaries (
  id SERIAL PRIMARY KEY,
  conversation_id UUID REFERENCES conversations(id) ON DELETE CASCADE,
  summary TEXT NOT NULL,
  messages_summarized INTEGER NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE conversation_entities (
  id SERIAL PRIMARY KEY,
  conversation_id UUID REFERENCES conversations(id) ON DELETE CASCADE,
  entity_name VARCHAR(255) NOT NULL,
  entity_type VARCHAR(100),
  entity_value TEXT,
  last_mentioned_at TIMESTAMP DEFAULT NOW(),
  mention_count INTEGER DEFAULT 1
);

CREATE INDEX idx_messages_conversation ON messages(conversation_id, created_at);
CREATE INDEX idx_entities_conversation ON conversation_entities(conversation_id);
CREATE INDEX idx_entities_name ON conversation_entities(entity_name);
CREATE INDEX idx_conversations_last_accessed ON conversations(last_accessed_at);

Now the data access layer:

var { Pool } = require("pg");
var { v4: uuidv4 } = require("uuid");

var pool = new Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING
});

function createConversation(metadata) {
  var id = uuidv4();
  return pool.query(
    "INSERT INTO conversations (id, metadata) VALUES ($1, $2) RETURNING *",
    [id, JSON.stringify(metadata || {})]
  ).then(function(result) {
    return result.rows[0];
  });
}

function saveMessage(conversationId, role, content, tokenCount) {
  return pool.query(
    "INSERT INTO messages (conversation_id, role, content, token_count) VALUES ($1, $2, $3, $4) RETURNING *",
    [conversationId, role, content, tokenCount || 0]
  ).then(function(result) {
    return pool.query(
      "UPDATE conversations SET last_accessed_at = NOW() WHERE id = $1",
      [conversationId]
    ).then(function() {
      return result.rows[0];
    });
  });
}

function getRecentMessages(conversationId, limit) {
  return pool.query(
    "SELECT role, content, created_at FROM messages WHERE conversation_id = $1 ORDER BY created_at DESC LIMIT $2",
    [conversationId, limit || 50]
  ).then(function(result) {
    return result.rows.reverse();
  });
}
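
A quick usage sketch of the data access layer (the metadata and token counts here are illustrative):

createConversation({ channel: "web" }).then(function(conversation) {
  return saveMessage(conversation.id, "user", "My name is Sarah", 5)
    .then(function() {
      return saveMessage(conversation.id, "assistant", "Nice to meet you, Sarah!", 7);
    })
    .then(function() {
      return getRecentMessages(conversation.id, 10);
    })
    .then(function(messages) {
      console.log(messages); // chronological order, oldest first
    });
});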

This gives you durable, queryable conversation storage. But storing every message is only half the problem. The real challenge is deciding which messages to include in the prompt.

Implementing a Sliding Window Memory

The sliding window is the most common memory strategy. Keep the last N messages, discard everything else. It is simple, predictable, and keeps token usage bounded.

var MAX_MESSAGES = 20;
var MAX_TOKENS = 4000;

function slidingWindowMemory(conversationId) {
  return getRecentMessages(conversationId, MAX_MESSAGES).then(function(messages) {
    var totalTokens = 0;
    var windowMessages = [];

    for (var i = messages.length - 1; i >= 0; i--) {
      var tokenCount = Math.ceil(messages[i].content.length / 4);
      if (totalTokens + tokenCount > MAX_TOKENS) {
        break;
      }
      totalTokens += tokenCount;
      windowMessages.unshift(messages[i]);
    }

    return windowMessages;
  });
}

This iterates backward from the most recent message, adding messages until the token budget is exhausted. The dual constraint on both message count and token count is important. A single very long message could blow your budget even with a small window size.

The weakness of sliding windows is obvious: important context from earlier in the conversation gets dropped entirely. The user's name, their project details, their preferences — all gone once they scroll past the window boundary.

Summarization Memory

Summarization memory solves the information loss problem by compressing old messages into a concise summary. When messages fall outside the sliding window, you pass them to the LLM with a summarization prompt and store the result.

var OpenAI = require("openai");
var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function summarizeMessages(messages) {
  var transcript = messages.map(function(msg) {
    return msg.role + ": " + msg.content;
  }).join("\n");

  return openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: "Summarize this conversation excerpt. Preserve all key facts, user preferences, decisions made, and action items. Be concise but thorough. Output only the summary, no preamble."
      },
      {
        role: "user",
        content: transcript
      }
    ],
    max_tokens: 500,
    temperature: 0.2
  }).then(function(response) {
    return response.choices[0].message.content;
  });
}

function saveSummary(conversationId, summary, messageCount) {
  return pool.query(
    "INSERT INTO conversation_summaries (conversation_id, summary, messages_summarized) VALUES ($1, $2, $3) RETURNING *",
    [conversationId, summary, messageCount]
  ).then(function(result) {
    return result.rows[0];
  });
}

function getLatestSummary(conversationId) {
  return pool.query(
    "SELECT summary, messages_summarized FROM conversation_summaries WHERE conversation_id = $1 ORDER BY created_at DESC LIMIT 1",
    [conversationId]
  ).then(function(result) {
    return result.rows[0] || null;
  });
}

The summarization prompt is critical. A vague instruction like "summarize this" produces generic summaries that lose important details. Be specific about what to preserve: facts, preferences, decisions, and action items.

Use a cheap, fast model for summarization. GPT-4o-mini or Claude Haiku work well. You do not need a frontier model to compress text — you need it for the actual conversation.
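
To wire these pieces together, one approach is to check after each turn whether the conversation has outgrown the window and, if so, summarize everything that falls outside it. A sketch, with an illustrative threshold and the MAX_MESSAGES window size from the previous section:

var SUMMARIZE_AFTER = 40;

function maybeSummarize(conversationId) {
  return pool.query(
    "SELECT COUNT(*) AS count FROM messages WHERE conversation_id = $1",
    [conversationId]
  ).then(function(result) {
    var total = parseInt(result.rows[0].count);
    if (total < SUMMARIZE_AFTER) return null;

    // Summarize everything except the most recent window of messages.
    return pool.query(
      "SELECT role, content FROM messages WHERE conversation_id = $1 ORDER BY created_at LIMIT $2",
      [conversationId, total - MAX_MESSAGES]
    ).then(function(older) {
      return summarizeMessages(older.rows).then(function(summary) {
        return saveSummary(conversationId, summary, total - MAX_MESSAGES);
      });
    });
  });
}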

Entity Memory

Entity memory extracts and tracks specific pieces of information across the conversation: names, dates, preferences, project details, technical specifications. Instead of relying on the model to infer these from raw messages, you explicitly extract and store them.

function extractEntities(message) {
  return openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: 'Extract named entities and key facts from this message. Return a JSON array of objects with keys: name, type, value. Types should be: person, preference, project, date, technical, location, organization. If no entities found, return an empty array. Return ONLY valid JSON.'
      },
      {
        role: "user",
        content: message
      }
    ],
    max_tokens: 300,
    temperature: 0
  }).then(function(response) {
    try {
      return JSON.parse(response.choices[0].message.content);
    } catch (e) {
      return [];
    }
  });
}

function upsertEntity(conversationId, entity) {
  return pool.query(
    "SELECT id, mention_count FROM conversation_entities WHERE conversation_id = $1 AND entity_name = $2",
    [conversationId, entity.name]
  ).then(function(result) {
    if (result.rows.length > 0) {
      return pool.query(
        "UPDATE conversation_entities SET entity_value = $1, last_mentioned_at = NOW(), mention_count = mention_count + 1 WHERE id = $2",
        [entity.value, result.rows[0].id]
      );
    }
    return pool.query(
      "INSERT INTO conversation_entities (conversation_id, entity_name, entity_type, entity_value) VALUES ($1, $2, $3, $4)",
      [conversationId, entity.name, entity.type, entity.value]
    );
  });
}

function getEntities(conversationId) {
  return pool.query(
    "SELECT entity_name, entity_type, entity_value, mention_count FROM conversation_entities WHERE conversation_id = $1 ORDER BY mention_count DESC, last_mentioned_at DESC",
    [conversationId]
  ).then(function(result) {
    return result.rows;
  });
}

Entity memory has a distinct advantage: it provides structured, queryable facts rather than unstructured text. You can ask "What is the user's name?" by querying the entities table rather than hoping the model remembers it from a 500-line conversation history.

The extraction step does add latency and cost to every turn. In practice, run extraction asynchronously: fire it off after saving the message and do not await the result. The entities will be available for the next turn, which is when they are actually needed.
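
A sketch of that pattern, reusing the helpers above (the glue function name is hypothetical):

function handleUserMessage(conversationId, content) {
  return saveMessage(conversationId, "user", content, Math.ceil(content.length / 4))
    .then(function(saved) {
      // Fire-and-forget: extraction runs in the background and is ready by the next turn.
      extractEntities(content).then(function(entities) {
        return Promise.all(entities.map(function(entity) {
          return upsertEntity(conversationId, entity);
        }));
      }).catch(function(err) {
        console.error("Entity extraction failed:", err.message);
      });
      return saved;
    });
}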

Hybrid Memory Architecture

The most effective memory systems combine multiple strategies. Here is the architecture I use in production:

  1. System prompt — static instructions, persona, tools
  2. Entity context — structured facts about the user and conversation
  3. Conversation summary — compressed history of everything before the window
  4. Sliding window — the last N messages verbatim

var WINDOW_SIZE = 15;
var SUMMARY_THRESHOLD = 30;

function buildHybridMemory(conversationId) {
  return Promise.all([
    getLatestSummary(conversationId),
    getEntities(conversationId),
    getRecentMessages(conversationId, WINDOW_SIZE),
    getTotalMessageCount(conversationId)
  ]).then(function(results) {
    var summary = results[0];
    var entities = results[1];
    var recentMessages = results[2];
    var totalCount = results[3];

    var memoryContext = [];

    if (entities.length > 0) {
      var entityBlock = "Known facts about this conversation:\n";
      entities.forEach(function(entity) {
        entityBlock += "- " + entity.entity_name + " (" + entity.entity_type + "): " + entity.entity_value + "\n";
      });
      memoryContext.push(entityBlock);
    }

    if (summary && totalCount > WINDOW_SIZE) {
      memoryContext.push("Summary of earlier conversation:\n" + summary.summary);
    }

    return {
      memoryContext: memoryContext.join("\n\n"),
      recentMessages: recentMessages,
      stats: {
        totalMessages: totalCount,
        windowSize: recentMessages.length,
        entityCount: entities.length,
        hasSummary: !!summary
      }
    };
  });
}

function getTotalMessageCount(conversationId) {
  return pool.query(
    "SELECT COUNT(*) as count FROM messages WHERE conversation_id = $1",
    [conversationId]
  ).then(function(result) {
    return parseInt(result.rows[0].count);
  });
}

The summary is only triggered after the conversation exceeds a threshold length. For short conversations, the sliding window alone is sufficient. This avoids unnecessary summarization calls and their associated cost.
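
A usage sketch that feeds the assembled memory into a chat completion (the wrapper function is illustrative):

function chatWithMemory(conversationId, systemPrompt) {
  return buildHybridMemory(conversationId).then(function(memory) {
    var messages = [{ role: "system", content: systemPrompt + "\n\n" + memory.memoryContext }];
    memory.recentMessages.forEach(function(msg) {
      messages.push({ role: msg.role, content: msg.content });
    });
    return openai.chat.completions.create({
      model: "gpt-4o",
      messages: messages,
      max_tokens: 1000
    });
  });
}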

Memory Retrieval with Embeddings

For long-running conversations or agents that need to recall specific details from hundreds of turns ago, embedding-based retrieval is essential. Instead of relying on recency or summarization, you embed each message and retrieve the most semantically relevant ones based on the current query.

First, add the pgvector extension and an embedding column:

CREATE EXTENSION IF NOT EXISTS vector;
ALTER TABLE messages ADD COLUMN embedding vector(1536);
CREATE INDEX idx_messages_embedding ON messages USING ivfflat (embedding vector_cosine_ops);

Then build the retrieval pipeline:

function embedText(text) {
  return openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  }).then(function(response) {
    return response.data[0].embedding;
  });
}

function saveMessageWithEmbedding(conversationId, role, content) {
  return embedText(content).then(function(embedding) {
    var embeddingStr = "[" + embedding.join(",") + "]";
    return pool.query(
      "INSERT INTO messages (conversation_id, role, content, token_count, embedding) VALUES ($1, $2, $3, $4, $5::vector) RETURNING *",
      [conversationId, role, content, Math.ceil(content.length / 4), embeddingStr]
    );
  }).then(function(result) {
    return result.rows[0];
  });
}

function retrieveRelevantMessages(conversationId, query, limit) {
  return embedText(query).then(function(queryEmbedding) {
    var embeddingStr = "[" + queryEmbedding.join(",") + "]";
    return pool.query(
      "SELECT role, content, created_at, 1 - (embedding <=> $1::vector) as similarity FROM messages WHERE conversation_id = $2 AND embedding IS NOT NULL ORDER BY embedding <=> $1::vector LIMIT $3",
      [embeddingStr, conversationId, limit || 5]
    );
  }).then(function(result) {
    return result.rows;
  });
}

Embedding retrieval works best as an augmentation to the hybrid memory approach. Use it to inject relevant historical context that the sliding window and summary might have missed. A typical pattern is to embed the user's current message, retrieve the top 3-5 most similar past exchanges, and include them in a "relevant past context" block in the prompt.

Be aware that embedding every message adds latency and cost. For cost-sensitive applications, batch embed at summarization time rather than on every message, or only embed messages that contain substantive content (skip "ok", "thanks", "got it").
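
A sketch of such a filter (the length threshold and filler list are illustrative):

var FILLER = ["ok", "okay", "thanks", "thank you", "got it", "sure"];

function shouldEmbed(content) {
  var normalized = content.trim().toLowerCase();
  if (normalized.length < 20) return false;            // too short to be worth retrieving later
  if (FILLER.indexOf(normalized) !== -1) return false;  // pure acknowledgement
  return true;
}

function saveMessageMaybeEmbed(conversationId, role, content) {
  if (shouldEmbed(content)) {
    return saveMessageWithEmbedding(conversationId, role, content);
  }
  return saveMessage(conversationId, role, content, Math.ceil(content.length / 4));
}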

Conversation Branching and Forking

Some applications need the ability to fork a conversation — "what if I had said X instead of Y?" This is common in agent systems where you want to explore multiple reasoning paths, or in collaborative tools where users want to try alternative approaches without losing the original thread.

function forkConversation(sourceConversationId, forkAtMessageId) {
  var newId = uuidv4();

  return pool.query(
    "INSERT INTO conversations (id, metadata) SELECT $1, metadata || '{\"forked_from\": \"" + sourceConversationId + "\"}'::jsonb FROM conversations WHERE id = $2 RETURNING *",
    [newId, sourceConversationId]
  ).then(function() {
    var query;
    if (forkAtMessageId) {
      query = "INSERT INTO messages (conversation_id, role, content, token_count, created_at) " +
              "SELECT $1, role, content, token_count, created_at FROM messages " +
              "WHERE conversation_id = $2 AND id <= $3 ORDER BY created_at";
      return pool.query(query, [newId, sourceConversationId, forkAtMessageId]);
    }
    query = "INSERT INTO messages (conversation_id, role, content, token_count, created_at) " +
            "SELECT $1, role, content, token_count, created_at FROM messages " +
            "WHERE conversation_id = $2 ORDER BY created_at";
    return pool.query(query, [newId, sourceConversationId]);
  }).then(function() {
    return pool.query(
      "INSERT INTO conversation_entities (conversation_id, entity_name, entity_type, entity_value, mention_count) " +
      "SELECT $1, entity_name, entity_type, entity_value, mention_count FROM conversation_entities " +
      "WHERE conversation_id = $2",
      [newId, sourceConversationId]
    );
  }).then(function() {
    return newId;
  });
}

This copies messages up to a specified point and all associated entities into a new conversation. The fork metadata tracks lineage, which is useful for debugging and analytics.
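
Usage is a single call; the message id here is hypothetical:

forkConversation(originalConversationId, 42).then(function(forkedId) {
  // Continue the conversation on the new branch without touching the original thread.
  return saveMessage(forkedId, "user", "Actually, let's try a different approach.");
});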

Memory Sharing Across Conversations

In multi-agent or multi-session systems, you sometimes need shared memory — facts that persist across separate conversations. A user's profile, preferences, or project context should not be re-extracted every time they start a new chat.

function getSharedEntities(userId) {
  return pool.query(
    "SELECT DISTINCT ON (entity_name) entity_name, entity_type, entity_value, last_mentioned_at " +
    "FROM conversation_entities ce " +
    "JOIN conversations c ON ce.conversation_id = c.id " +
    "WHERE c.metadata->>'userId' = $1 " +
    "ORDER BY entity_name, last_mentioned_at DESC",
    [userId]
  ).then(function(result) {
    return result.rows;
  });
}

function injectSharedMemory(userId, conversationId) {
  return getSharedEntities(userId).then(function(entities) {
    if (entities.length === 0) return "";

    var context = "Known facts about this user (from previous conversations):\n";
    entities.forEach(function(entity) {
      context += "- " + entity.entity_name + ": " + entity.entity_value + "\n";
    });
    return context;
  });
}

This approach queries entities across all conversations for a given user, taking the most recently mentioned value for each entity name. It gives new conversations immediate access to accumulated knowledge about the user without requiring the user to repeat themselves.
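
A sketch of seeding a new conversation with that shared context (the wrapper is illustrative and reuses createConversation from earlier):

function startConversationWithProfile(userId, baseSystemPrompt) {
  return createConversation({ userId: userId }).then(function(conversation) {
    return injectSharedMemory(userId, conversation.id).then(function(sharedContext) {
      var systemPrompt = baseSystemPrompt;
      if (sharedContext) {
        systemPrompt += "\n\n" + sharedContext;
      }
      return { conversationId: conversation.id, systemPrompt: systemPrompt };
    });
  });
}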

TTL and Cleanup Strategies

Conversations accumulate. Without cleanup, your database grows indefinitely, your queries slow down, and you are storing context that nobody will ever retrieve again. Implement time-to-live policies and regular cleanup.

var CONVERSATION_TTL_DAYS = 30;
var MESSAGE_ARCHIVE_DAYS = 90;

function cleanupExpiredConversations() {
  return pool.query(
    "DELETE FROM conversations WHERE last_accessed_at < NOW() - INTERVAL '" + CONVERSATION_TTL_DAYS + " days' RETURNING id"
  ).then(function(result) {
    console.log("Cleaned up " + result.rowCount + " expired conversations");
    return result.rowCount;
  });
}

function archiveOldMessages() {
  return pool.query(
    "WITH old_messages AS ( " +
    "  SELECT m.id, m.conversation_id FROM messages m " +
    "  JOIN conversations c ON m.conversation_id = c.id " +
    "  WHERE m.created_at < NOW() - INTERVAL '" + MESSAGE_ARCHIVE_DAYS + " days' " +
    "  AND c.last_accessed_at > NOW() - INTERVAL '" + CONVERSATION_TTL_DAYS + " days' " +
    ") " +
    "DELETE FROM messages WHERE id IN (SELECT id FROM old_messages) RETURNING conversation_id"
  ).then(function(result) {
    var affectedConversations = {};
    result.rows.forEach(function(row) {
      affectedConversations[row.conversation_id] = true;
    });
    console.log("Archived " + result.rowCount + " old messages from " + Object.keys(affectedConversations).length + " conversations");
    return result.rowCount;
  });
}

function scheduleCleanup() {
  setInterval(function() {
    cleanupExpiredConversations().catch(function(err) {
      console.error("Cleanup failed:", err.message);
    });
  }, 24 * 60 * 60 * 1000);
}

For active conversations with many old messages, trigger a summarization before archiving. This preserves the important context in compressed form while removing the raw messages.

Consider partitioning the messages table by month in PostgreSQL if you have high volume. Partition-level drops are far more efficient than row-level deletes for bulk cleanup.
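
A sketch of what declarative monthly partitioning might look like (note that the partition key must be part of the primary key, so this is a new table definition rather than a change to the earlier schema):

CREATE TABLE messages_partitioned (
  id BIGSERIAL,
  conversation_id UUID REFERENCES conversations(id) ON DELETE CASCADE,
  role VARCHAR(20) NOT NULL,
  content TEXT NOT NULL,
  token_count INTEGER DEFAULT 0,
  created_at TIMESTAMP DEFAULT NOW(),
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE messages_2024_06 PARTITION OF messages_partitioned
  FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');

-- Bulk cleanup becomes a cheap partition drop instead of a row-level delete:
DROP TABLE messages_2024_06;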

Memory-Aware Prompt Construction

All of these memory components need to come together in a well-structured prompt. The order and formatting of memory blocks significantly affects how the model uses the information.

function constructPrompt(systemPrompt, memoryContext, recentMessages, retrievedMessages) {
  var messages = [];

  var systemContent = systemPrompt;
  if (memoryContext) {
    systemContent += "\n\n---\n\n" + memoryContext;
  }

  if (retrievedMessages && retrievedMessages.length > 0) {
    systemContent += "\n\n---\n\nRelevant context from earlier in this conversation:\n";
    retrievedMessages.forEach(function(msg) {
      systemContent += "[" + msg.role + "]: " + msg.content + "\n";
    });
  }

  messages.push({ role: "system", content: systemContent });

  recentMessages.forEach(function(msg) {
    messages.push({ role: msg.role, content: msg.content });
  });

  return messages;
}

The key principle: put durable, structured context (entities, summaries, retrieved passages) in the system message where the model treats it as authoritative background. Put the recent conversation in the user/assistant turn sequence where the model treats it as the active dialogue. This separation helps the model distinguish between "things I know about this user" and "what we are currently discussing."
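
A sketch of a single turn that pulls the earlier helpers together (the wrapper name is illustrative):

function prepareTurn(conversationId, userMessage, systemPrompt) {
  return Promise.all([
    buildHybridMemory(conversationId),
    retrieveRelevantMessages(conversationId, userMessage, 3)
  ]).then(function(results) {
    var memory = results[0];
    var retrieved = results[1];
    return constructPrompt(systemPrompt, memory.memoryContext, memory.recentMessages, retrieved);
  });
}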

Complete Working Example

Here is a complete conversational memory system that ties everything together. This Express.js server provides a chat API with hybrid memory: sliding window for recent messages, automatic summarization when conversations grow long, entity extraction on every user message, and memory-aware prompt construction.

var express = require("express");
var { Pool } = require("pg");
var { v4: uuidv4 } = require("uuid");
var OpenAI = require("openai");

var app = express();
app.use(express.json());

var pool = new Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING
});

var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

var WINDOW_SIZE = 15;
var MAX_WINDOW_TOKENS = 4000;
var SUMMARY_TRIGGER = 30;
var SYSTEM_PROMPT = "You are a helpful assistant. Use the provided context about the user and conversation history to give personalized, relevant responses.";

// --- Database helpers ---

function createConversation(userId) {
  var id = uuidv4();
  return pool.query(
    "INSERT INTO conversations (id, metadata) VALUES ($1, $2) RETURNING *",
    [id, JSON.stringify({ userId: userId })]
  ).then(function(result) {
    return result.rows[0];
  });
}

function saveMessage(conversationId, role, content) {
  var tokenCount = Math.ceil(content.length / 4);
  return pool.query(
    "INSERT INTO messages (conversation_id, role, content, token_count) VALUES ($1, $2, $3, $4) RETURNING *",
    [conversationId, role, content, tokenCount]
  ).then(function(result) {
    pool.query("UPDATE conversations SET last_accessed_at = NOW() WHERE id = $1", [conversationId]);
    return result.rows[0];
  });
}

function getRecentMessages(conversationId) {
  return pool.query(
    "SELECT role, content, token_count, created_at FROM messages WHERE conversation_id = $1 ORDER BY created_at DESC LIMIT $2",
    [conversationId, WINDOW_SIZE]
  ).then(function(result) {
    var messages = result.rows.reverse();
    var totalTokens = 0;
    var trimmed = [];

    for (var i = messages.length - 1; i >= 0; i--) {
      if (totalTokens + messages[i].token_count > MAX_WINDOW_TOKENS) break;
      totalTokens += messages[i].token_count;
      trimmed.unshift(messages[i]);
    }
    return trimmed;
  });
}

function getMessageCount(conversationId) {
  return pool.query(
    "SELECT COUNT(*) as count FROM messages WHERE conversation_id = $1",
    [conversationId]
  ).then(function(result) {
    return parseInt(result.rows[0].count);
  });
}

// --- Summarization ---

function getUnsummarizedMessages(conversationId, skipCount) {
  // Skip messages already covered by the previous summary, then take the next batch.
  return pool.query(
    "SELECT role, content FROM messages WHERE conversation_id = $1 ORDER BY created_at OFFSET $2 LIMIT $3",
    [conversationId, skipCount || 0, SUMMARY_TRIGGER]
  ).then(function(result) {
    return result.rows;
  });
}

function summarize(conversationId) {
  return pool.query(
    "SELECT summary, messages_summarized FROM conversation_summaries WHERE conversation_id = $1 ORDER BY created_at DESC LIMIT 1",
    [conversationId]
  ).then(function(result) {
    var existingSummary = result.rows[0] || null;

    return getMessageCount(conversationId).then(function(total) {
      var summarizedCount = existingSummary ? existingSummary.messages_summarized : 0;
      var unsummarized = total - summarizedCount;

      if (unsummarized < SUMMARY_TRIGGER) return existingSummary;

      return getUnsummarizedMessages(conversationId, summarizedCount).then(function(messages) {
        var transcript = "";
        if (existingSummary) {
          transcript += "Previous summary:\n" + existingSummary.summary + "\n\nNew messages:\n";
        }
        messages.forEach(function(msg) {
          transcript += msg.role + ": " + msg.content + "\n";
        });

        return openai.chat.completions.create({
          model: "gpt-4o-mini",
          messages: [
            {
              role: "system",
              content: "Create a comprehensive summary of this conversation. Preserve all key facts, user preferences, decisions, technical details, and action items. If a previous summary is provided, integrate the new messages into an updated summary. Be concise but thorough. Output only the summary."
            },
            { role: "user", content: transcript }
          ],
          max_tokens: 600,
          temperature: 0.2
        }).then(function(response) {
          var summary = response.choices[0].message.content;
          return pool.query(
            "INSERT INTO conversation_summaries (conversation_id, summary, messages_summarized) VALUES ($1, $2, $3) RETURNING *",
            [conversationId, summary, summarizedCount + messages.length]
          ).then(function(result) {
            return result.rows[0];
          });
        });
      });
    });
  });
}

// --- Entity extraction ---

function extractAndSaveEntities(conversationId, content) {
  return openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: 'Extract key entities and facts from this message. Return a JSON array of objects with "name", "type", and "value" keys. Types: person, preference, project, technical, date, location, organization. Return ONLY valid JSON. Empty array if none found.'
      },
      { role: "user", content: content }
    ],
    max_tokens: 300,
    temperature: 0
  }).then(function(response) {
    var entities;
    try {
      entities = JSON.parse(response.choices[0].message.content);
    } catch (e) {
      return [];
    }

    if (!Array.isArray(entities)) return [];

    var promises = entities.map(function(entity) {
      return pool.query(
        "SELECT id FROM conversation_entities WHERE conversation_id = $1 AND entity_name = $2",
        [conversationId, entity.name]
      ).then(function(result) {
        if (result.rows.length > 0) {
          return pool.query(
            "UPDATE conversation_entities SET entity_value = $1, last_mentioned_at = NOW(), mention_count = mention_count + 1 WHERE id = $2",
            [entity.value, result.rows[0].id]
          );
        }
        return pool.query(
          "INSERT INTO conversation_entities (conversation_id, entity_name, entity_type, entity_value) VALUES ($1, $2, $3, $4)",
          [conversationId, entity.name, entity.type, entity.value]
        );
      });
    });

    return Promise.all(promises).then(function() {
      return entities;
    });
  }).catch(function(err) {
    console.error("Entity extraction failed:", err.message);
    return [];
  });
}

function getEntities(conversationId) {
  return pool.query(
    "SELECT entity_name, entity_type, entity_value FROM conversation_entities WHERE conversation_id = $1 ORDER BY mention_count DESC LIMIT 20",
    [conversationId]
  ).then(function(result) {
    return result.rows;
  });
}

// --- Memory assembly ---

function assembleMemory(conversationId) {
  return Promise.all([
    getRecentMessages(conversationId),
    getEntities(conversationId),
    summarize(conversationId)
  ]).then(function(results) {
    var recentMessages = results[0];
    var entities = results[1];
    var summary = results[2];

    var systemContent = SYSTEM_PROMPT;

    if (entities.length > 0) {
      systemContent += "\n\n## Known Facts\n";
      entities.forEach(function(e) {
        systemContent += "- " + e.entity_name + " (" + e.entity_type + "): " + e.entity_value + "\n";
      });
    }

    if (summary) {
      systemContent += "\n\n## Conversation Summary\n" + summary.summary;
    }

    var messages = [{ role: "system", content: systemContent }];
    recentMessages.forEach(function(msg) {
      messages.push({ role: msg.role, content: msg.content });
    });

    return messages;
  });
}

// --- API routes ---

app.post("/api/conversations", function(req, res) {
  var userId = req.body.userId || "anonymous";
  createConversation(userId).then(function(conversation) {
    res.json({ conversationId: conversation.id });
  }).catch(function(err) {
    res.status(500).json({ error: err.message });
  });
});

app.post("/api/conversations/:id/messages", function(req, res) {
  var conversationId = req.params.id;
  var userMessage = req.body.message;

  if (!userMessage) {
    return res.status(400).json({ error: "Message is required" });
  }

  saveMessage(conversationId, "user", userMessage)
    .then(function() {
      // Run entity extraction in parallel with memory assembly
      extractAndSaveEntities(conversationId, userMessage);

      return assembleMemory(conversationId);
    })
    .then(function(messages) {
      return openai.chat.completions.create({
        model: "gpt-4o",
        messages: messages,
        max_tokens: 1000,
        temperature: 0.7
      });
    })
    .then(function(response) {
      var assistantContent = response.choices[0].message.content;
      return saveMessage(conversationId, "assistant", assistantContent).then(function() {
        res.json({
          response: assistantContent,
          usage: response.usage
        });
      });
    })
    .catch(function(err) {
      console.error("Chat error:", err.message);
      res.status(500).json({ error: "Failed to process message" });
    });
});

app.get("/api/conversations/:id/memory", function(req, res) {
  var conversationId = req.params.id;

  Promise.all([
    getEntities(conversationId),
    getMessageCount(conversationId),
    pool.query(
      "SELECT summary, messages_summarized, created_at FROM conversation_summaries WHERE conversation_id = $1 ORDER BY created_at DESC LIMIT 1",
      [conversationId]
    )
  ]).then(function(results) {
    res.json({
      entities: results[0],
      totalMessages: results[1],
      latestSummary: results[2].rows[0] || null
    });
  }).catch(function(err) {
    res.status(500).json({ error: err.message });
  });
});

var PORT = process.env.PORT || 3000;
app.listen(PORT, function() {
  console.log("Conversational memory server running on port " + PORT);
});

To test the server:

# Start the server
node server.js

# Create a conversation
curl -X POST http://localhost:3000/api/conversations \
  -H "Content-Type: application/json" \
  -d '{"userId": "user-123"}'

# Send messages
curl -X POST http://localhost:3000/api/conversations/CONV_ID/messages \
  -H "Content-Type: application/json" \
  -d '{"message": "Hi, my name is Sarah and I am working on a React e-commerce project"}'

curl -X POST http://localhost:3000/api/conversations/CONV_ID/messages \
  -H "Content-Type: application/json" \
  -d '{"message": "What was my name again?"}'

# Inspect memory state
curl http://localhost:3000/api/conversations/CONV_ID/memory

Common Issues and Troubleshooting

1. "Error: connection refused" or "ECONNREFUSED 127.0.0.1:5432"

PostgreSQL is not running or not accepting connections on the expected port. Verify the service is started and the connection string is correct.

Error: connect ECONNREFUSED 127.0.0.1:5432
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1595:16)

Check your connection string format: postgresql://user:password@localhost:5432/dbname. If using Docker, ensure the container is running and the port is mapped. Also verify pg_hba.conf allows connections from your host.

2. "error: relation 'conversation_entities' does not exist"

The database schema has not been applied. Run the SQL schema script before starting the server. This is especially common when connecting to a fresh database or after a database migration that missed a table.

error: relation "conversation_entities" does not exist
    at /node_modules/pg-pool/index.js:45:11

Run the full schema script from the article against your database: psql -d yourdb -f schema.sql.

3. "RateLimitError: Rate limit reached for gpt-4o-mini"

Entity extraction and summarization both call the LLM API. In high-throughput scenarios, you can hit rate limits quickly. Implement exponential backoff and consider queuing entity extraction.

RateLimitError: 429 Rate limit reached for gpt-4o-mini in organization org-xxx on tokens per min (TPM): Limit 200000

Fix: add retry logic with exponential backoff, or batch entity extraction calls. For summarization, run it as a background job rather than inline with the request.

function retryWithBackoff(fn, retries, delay) {
  return fn().catch(function(err) {
    if (retries <= 0 || err.status !== 429) throw err;
    return new Promise(function(resolve) {
      setTimeout(resolve, delay);
    }).then(function() {
      return retryWithBackoff(fn, retries - 1, delay * 2);
    });
  });
}
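
For example, to wrap the summarization call with three retries starting at a one-second delay:

retryWithBackoff(function() {
  return summarizeMessages(messages);
}, 3, 1000);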

4. "SyntaxError: Unexpected token" when parsing entity extraction response

The LLM occasionally returns malformed JSON, especially with smaller models. Always wrap JSON parsing in a try/catch and return a safe default.

SyntaxError: Unexpected token 'H', "Here are t"... is not valid JSON
    at JSON.parse (<anonymous>)

This happens when the model ignores the "return ONLY valid JSON" instruction and adds conversational preamble. Mitigation strategies: use response_format: { type: "json_object" } if available on your model, set temperature to 0, and always have a fallback path for parse failures.
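
A sketch of the JSON mode variant (assumes your model supports response_format; the prompt must mention JSON for this mode to be accepted):

function extractEntitiesJson(message) {
  return openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: 'Extract entities from the message. Respond with a JSON object containing an "entities" array of {name, type, value} objects.'
      },
      { role: "user", content: message }
    ],
    response_format: { type: "json_object" },
    temperature: 0
  }).then(function(response) {
    try {
      return JSON.parse(response.choices[0].message.content).entities || [];
    } catch (e) {
      return []; // still keep the fallback path
    }
  });
}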

5. "Error: could not access file 'vector': No such file or directory"

The pgvector extension is not installed on your PostgreSQL instance. This is required only for the embedding retrieval feature.

ERROR:  could not access file "$libdir/vector": No such file or directory

Install pgvector following the instructions for your platform. On Ubuntu: sudo apt install postgresql-14-pgvector. On macOS with Homebrew: brew install pgvector. For managed databases (RDS, Supabase), pgvector is usually available as a supported extension that you enable through the dashboard.

Best Practices

  • Budget your tokens explicitly. Allocate a fixed percentage of your context window to each memory component: 10% for entities, 15% for summary, 40% for recent messages, 35% for the model's response and system prompt. Track actual usage and adjust.

  • Use cheap models for memory operations. Summarization, entity extraction, and embedding generation do not need your frontier model. GPT-4o-mini, Claude Haiku, or similar models are fast, cheap, and more than capable for these tasks. Reserve the expensive model for the actual conversation.

  • Run memory operations asynchronously. Entity extraction and summarization can happen after the response is sent to the user. The extracted entities will be available for the next turn, which is when they matter. This keeps response latency low.

  • Version your summaries. Never overwrite a summary — always create a new row. This gives you an audit trail and the ability to roll back if a summarization run produces poor results. It also lets you analyze how the conversation's summary evolved over time.

  • Implement circuit breakers for LLM calls in the memory pipeline. If the entity extraction or summarization API is down, the chat should still work. Degrade gracefully by skipping memory enrichment rather than failing the entire request; a minimal sketch appears after this list. Your users care about getting a response, not about whether entity extraction succeeded.

  • Set reasonable TTLs and enforce them. Conversations from six months ago that nobody has touched are dead weight. Clean them up. If you need long-term user knowledge, promote important entities to a dedicated user profile store that persists independently of individual conversations.

  • Test with long conversations. It is easy to build a memory system that works for 10-turn conversations and falls apart at 500 turns. Seed your test database with realistic long conversations and verify that memory assembly stays within token budgets, summarization produces coherent results, and latency remains acceptable.

  • Monitor memory quality over time. Log the memory context that gets sent with each request. Periodically review these logs to check whether summaries are accurate, entities are correct, and the model is actually using the provided context. Bad memory is worse than no memory — it confidently misleads the model.
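
As a follow-up to the circuit breaker advice above, here is a minimal sketch of the pattern (the failure threshold and cool-down values are illustrative):

var breaker = { failures: 0, openUntil: 0 };
var FAILURE_THRESHOLD = 3;
var COOL_DOWN_MS = 60 * 1000;

function withCircuitBreaker(fn, fallbackValue) {
  if (Date.now() < breaker.openUntil) {
    // Circuit is open: skip the call entirely and degrade gracefully.
    return Promise.resolve(fallbackValue);
  }
  return fn().then(function(result) {
    breaker.failures = 0; // a success closes the circuit
    return result;
  }).catch(function(err) {
    breaker.failures += 1;
    if (breaker.failures >= FAILURE_THRESHOLD) {
      breaker.openUntil = Date.now() + COOL_DOWN_MS;
    }
    console.error("Memory pipeline call failed:", err.message);
    return fallbackValue; // never fail the chat because memory enrichment failed
  });
}

// Usage: entity extraction degrades to an empty array if the API is struggling.
// withCircuitBreaker(function() { return extractEntities(userMessage); }, []);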
