
Document Chunking Strategies for RAG

Master document chunking for RAG with fixed, sentence, paragraph, and semantic strategies plus metadata preservation in Node.js.

Overview

Document chunking is the single most impactful decision you will make when building a Retrieval-Augmented Generation (RAG) system. How you split documents into chunks directly determines retrieval quality, and poor chunking will sabotage even the best embedding model and vector database. This article covers every major chunking strategy, shows you how to implement them in Node.js, and gives you a practical framework for choosing the right approach for your data.

Prerequisites

  • Node.js v18+ installed
  • Basic understanding of RAG architecture (embeddings, vector search, LLM generation)
  • Familiarity with text processing concepts
  • A working knowledge of JavaScript and npm

Install the dependencies we will use throughout this article:

npm install natural compromise cheerio marked tiktoken

Why Chunking Matters for RAG Quality

When a user asks your RAG system a question, the pipeline looks like this: the query is embedded into a vector, your vector database finds the most similar chunks, and those chunks are injected into the LLM prompt as context. If your chunks are poorly constructed, the retrieved context will be either too vague to answer the question or too fragmented to make sense.

I have seen teams spend weeks fine-tuning embedding models and switching vector databases when the real problem was that their chunks were cutting sentences in half or stuffing entire chapters into a single vector. Chunking is not a preprocessing step you rush through. It is a core architectural decision.

The fundamental tension in chunking is this: smaller chunks give you more precise retrieval but lose surrounding context, while larger chunks preserve context but dilute the relevance signal. Every strategy in this article is a different approach to managing that tradeoff.

Fixed-Size Chunking

Fixed-size chunking is the simplest approach. You split text into chunks of a predetermined size, measured in either characters or tokens. It is fast, predictable, and works reasonably well as a baseline.

Character-Based Fixed Chunking

var DEFAULT_CHUNK_SIZE = 1000;
var DEFAULT_OVERLAP = 200;

function chunkByCharacters(text, chunkSize, overlap) {
  chunkSize = chunkSize || DEFAULT_CHUNK_SIZE;
  overlap = overlap || DEFAULT_OVERLAP;

  var chunks = [];
  var start = 0;

  while (start < text.length) {
    var end = Math.min(start + chunkSize, text.length);
    var chunk = text.slice(start, end);

    chunks.push({
      content: chunk,
      startIndex: start,
      endIndex: end,
      chunkIndex: chunks.length
    });

    if (end === text.length) break; // avoid a redundant tail-only chunk
    start = start + chunkSize - overlap;
  }

  return chunks;
}

The problem with raw character splitting is obvious: you will cut words, sentences, and paragraphs at arbitrary points. The word "implementation" might become "implementa" at the end of one chunk and "tion" at the start of the next. This is why overlap exists, but it does not solve the fundamental problem.
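One mitigation worth sketching is to snap each chunk boundary back to the nearest whitespace so no word is ever severed. `chunkByCharactersWordSafe` below is a hypothetical variant of `chunkByCharacters`, under the assumption that a plain space is an acceptable boundary:

```javascript
// Sketch: a word-safe variant of chunkByCharacters. Each chunk end is
// snapped back to the last space inside the window, so words are never
// severed (falling back to a hard cut only when a window has no spaces).
function chunkByCharactersWordSafe(text, chunkSize, overlap) {
  chunkSize = chunkSize || 1000;
  overlap = overlap || 200;

  var chunks = [];
  var start = 0;

  while (start < text.length) {
    var end = Math.min(start + chunkSize, text.length);

    if (end < text.length) {
      var lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > start) end = lastSpace;
    }

    chunks.push({
      content: text.slice(start, end),
      startIndex: start,
      endIndex: end,
      chunkIndex: chunks.length
    });

    if (end === text.length) break;
    start = Math.max(end - overlap, start + 1); // overlap measured back from the boundary
  }

  return chunks;
}
```

The tradeoff is that chunks are no longer exactly `chunkSize` characters, which is almost always worth it.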

Token-Based Fixed Chunking

Token counts are more meaningful than character counts because embedding models and LLMs operate on tokens, not characters. A chunk of 512 tokens maps directly to the model's context window.

var tiktoken = require("tiktoken");

function chunkByTokens(text, maxTokens, overlapTokens) {
  maxTokens = maxTokens || 512;
  overlapTokens = overlapTokens || 50;

  var encoder = tiktoken.encoding_for_model("text-embedding-ada-002");
  var tokens = encoder.encode(text);
  var chunks = [];
  var start = 0;

  while (start < tokens.length) {
    var end = Math.min(start + maxTokens, tokens.length);
    var chunkTokens = tokens.slice(start, end);
    // encoder.decode returns a Uint8Array of UTF-8 bytes
    var chunkBytes = encoder.decode(chunkTokens);

    chunks.push({
      content: Buffer.from(chunkBytes).toString("utf-8"),
      tokenCount: chunkTokens.length,
      chunkIndex: chunks.length
    });

    if (end === tokens.length) break; // avoid a redundant tail-only chunk
    start = start + maxTokens - overlapTokens;
  }

  encoder.free();
  return chunks;
}

Token-based chunking is my recommended starting point for any RAG system. It gives you direct control over how much context fits into each embedding and each LLM prompt.

Sentence-Based Chunking

Sentence-based chunking respects natural language boundaries. Instead of cutting at arbitrary positions, you split on sentence endings and then group sentences together until you hit a size limit.

var natural = require("natural");
var tokenizer = new natural.SentenceTokenizer();

function chunkBySentences(text, maxChunkSize, overlapSentences) {
  maxChunkSize = maxChunkSize || 1000;
  overlapSentences = overlapSentences || 2;

  var sentences = tokenizer.tokenize(text);
  var chunks = [];
  var currentChunk = [];
  var currentLength = 0;

  for (var i = 0; i < sentences.length; i++) {
    var sentence = sentences[i];

    if (currentLength + sentence.length > maxChunkSize && currentChunk.length > 0) {
      chunks.push({
        content: currentChunk.join(" "),
        sentenceCount: currentChunk.length,
        chunkIndex: chunks.length
      });

      // Keep overlap sentences for context continuity
      var overlapStart = Math.max(0, currentChunk.length - overlapSentences);
      var overlapContent = currentChunk.slice(overlapStart);
      currentChunk = overlapContent;
      currentLength = overlapContent.join(" ").length;
    }

    currentChunk.push(sentence);
    currentLength = currentLength + sentence.length + 1;
  }

  if (currentChunk.length > 0) {
    chunks.push({
      content: currentChunk.join(" "),
      sentenceCount: currentChunk.length,
      chunkIndex: chunks.length
    });
  }

  return chunks;
}

Sentence-based chunking produces much more readable chunks than fixed-size splitting. The overlap is measured in sentences rather than characters, which means the overlap content always makes grammatical sense. This is a significant advantage when the LLM needs to reason over the retrieved context.

Paragraph-Based Chunking

Paragraphs are the most natural unit of thought in written text. A paragraph typically covers one idea, making it a strong candidate for a self-contained chunk.

function chunkByParagraphs(text, maxChunkSize, minParagraphLength) {
  maxChunkSize = maxChunkSize || 1500;
  minParagraphLength = minParagraphLength || 50;

  var paragraphs = text.split(/\n\s*\n/).filter(function(p) {
    return p.trim().length >= minParagraphLength;
  });

  var chunks = [];
  var currentChunk = [];
  var currentLength = 0;

  for (var i = 0; i < paragraphs.length; i++) {
    var paragraph = paragraphs[i].trim();

    if (currentLength + paragraph.length > maxChunkSize && currentChunk.length > 0) {
      chunks.push({
        content: currentChunk.join("\n\n"),
        paragraphCount: currentChunk.length,
        chunkIndex: chunks.length
      });

      // Keep last paragraph as overlap
      currentChunk = [currentChunk[currentChunk.length - 1]];
      currentLength = currentChunk[0].length;
    }

    currentChunk.push(paragraph);
    currentLength = currentLength + paragraph.length + 2;
  }

  if (currentChunk.length > 0) {
    chunks.push({
      content: currentChunk.join("\n\n"),
      paragraphCount: currentChunk.length,
      chunkIndex: chunks.length
    });
  }

  return chunks;
}

Paragraph-based chunking works extremely well for well-structured documents like articles, documentation, and reports. It falls apart with poorly formatted text that has no paragraph breaks, or with hard-wrapped text that uses single line breaks instead of double, where the splitter sees one giant paragraph or none at all.
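For hard-wrapped text, one workaround is to normalize it before chunking. `normalizeParagraphBreaks` is a hypothetical helper, sketched under the assumption that single newlines are soft wraps rather than meaningful breaks:

```javascript
// Sketch: normalize soft line wraps before paragraph chunking.
// Single newlines (hard-wrapped text) are collapsed into spaces;
// blank lines are kept as the paragraph boundary.
function normalizeParagraphBreaks(text) {
  return text
    .split(/\n\s*\n/) // keep real paragraph breaks
    .map(function(block) {
      return block.replace(/\s*\n\s*/g, " ").trim(); // unwrap soft line breaks
    })
    .filter(function(block) { return block.length > 0; })
    .join("\n\n");
}
```

Run this before chunkByParagraphs and hard-wrapped plain text behaves like properly formatted prose.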

Semantic Chunking

Semantic chunking is the most sophisticated approach. Instead of splitting on syntactic boundaries (sentences, paragraphs), you split where the topic actually changes. This requires computing embeddings for individual sentences and detecting where the cosine similarity between consecutive sentences drops below a threshold.

var natural = require("natural");
var sentenceTokenizer = new natural.SentenceTokenizer();

function cosineSimilarity(vecA, vecB) {
  var dotProduct = 0;
  var normA = 0;
  var normB = 0;

  for (var i = 0; i < vecA.length; i++) {
    dotProduct = dotProduct + vecA[i] * vecB[i];
    normA = normA + vecA[i] * vecA[i];
    normB = normB + vecB[i] * vecB[i];
  }

  var denominator = Math.sqrt(normA) * Math.sqrt(normB);
  if (denominator === 0) return 0;
  return dotProduct / denominator;
}

function chunkBySemantic(text, getEmbedding, similarityThreshold, maxChunkSize) {
  similarityThreshold = similarityThreshold || 0.75;
  maxChunkSize = maxChunkSize || 1500;

  var sentences = sentenceTokenizer.tokenize(text);
  if (sentences.length === 0) {
    return Promise.resolve([]);
  }

  // Get embeddings for all sentences
  return getEmbeddings(sentences, getEmbedding).then(function(embeddings) {
    var chunks = [];
    var currentChunk = [sentences[0]];
    var currentLength = sentences[0].length;

    for (var i = 1; i < sentences.length; i++) {
      var similarity = cosineSimilarity(embeddings[i - 1], embeddings[i]);
      var wouldExceedMax = currentLength + sentences[i].length > maxChunkSize;

      if (similarity < similarityThreshold || wouldExceedMax) {
        chunks.push({
          content: currentChunk.join(" "),
          sentenceCount: currentChunk.length,
          chunkIndex: chunks.length,
          splitReason: wouldExceedMax ? "max_size" : "topic_change"
        });

        currentChunk = [];
        currentLength = 0;
      }

      currentChunk.push(sentences[i]);
      currentLength = currentLength + sentences[i].length + 1;
    }

    if (currentChunk.length > 0) {
      chunks.push({
        content: currentChunk.join(" "),
        sentenceCount: currentChunk.length,
        chunkIndex: chunks.length,
        splitReason: "end_of_document"
      });
    }

    return chunks;
  });
}

function getEmbeddings(sentences, getEmbedding) {
  var promises = sentences.map(function(sentence) {
    return getEmbedding(sentence);
  });
  return Promise.all(promises);
}

Semantic chunking produces the highest quality chunks because each chunk is topically coherent. The downside is cost and latency: you need to embed every sentence individually before you can split. For a 10,000-word document, that could be hundreds of embedding API calls. I recommend batching these calls and caching the results.
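That batching-and-caching advice can be sketched as follows. `getEmbeddingsBatched` is a hypothetical drop-in for the `getEmbeddings` helper above, assuming the same caller-supplied `getEmbedding(sentence) -> Promise<number[]>` contract:

```javascript
// Sketch: cache sentence embeddings and issue requests in small batches,
// assuming a caller-supplied getEmbedding(sentence) -> Promise<number[]>.
var embeddingCache = Object.create(null);

function getEmbeddingsBatched(sentences, getEmbedding, batchSize) {
  batchSize = batchSize || 20;
  var results = new Array(sentences.length);
  var batches = [];

  for (var i = 0; i < sentences.length; i += batchSize) {
    batches.push(sentences.slice(i, i + batchSize).map(function(s, j) {
      return { sentence: s, index: i + j };
    }));
  }

  // Process batches sequentially to stay under rate limits
  return batches.reduce(function(prev, batch) {
    return prev.then(function() {
      return Promise.all(batch.map(function(item) {
        if (embeddingCache[item.sentence]) {
          results[item.index] = embeddingCache[item.sentence];
          return Promise.resolve();
        }
        return getEmbedding(item.sentence).then(function(vec) {
          embeddingCache[item.sentence] = vec;
          results[item.index] = vec;
        });
      }));
    });
  }, Promise.resolve()).then(function() {
    return results;
  });
}
```

Repeated sentences (boilerplate, headers) are embedded once, which matters more than you might expect on large corpora.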

Recursive Chunking

Recursive chunking is a pragmatic approach that tries to preserve the largest meaningful units possible. You start by splitting on the coarsest boundary (double newlines for paragraphs), and if any resulting chunk exceeds the size limit, you split it again with the next, finer separator (single newlines, then sentences, then characters).

function chunkRecursive(text, maxChunkSize, separators, overlap) {
  maxChunkSize = maxChunkSize || 1000;
  overlap = overlap || 100;
  separators = separators || ["\n\n", "\n", ". ", " "];

  if (text.length <= maxChunkSize) {
    return [{
      content: text,
      chunkIndex: 0,
      separatorUsed: "none"
    }];
  }

  var separator = separators[0];
  var remainingSeparators = separators.slice(1);
  var sections = text.split(separator);

  var chunks = [];
  var currentChunk = "";

  for (var i = 0; i < sections.length; i++) {
    var section = sections[i];
    var candidateChunk = currentChunk
      ? currentChunk + separator + section
      : section;

    if (candidateChunk.length > maxChunkSize) {
      if (currentChunk.length > 0) {
        // Current chunk is within limits, save it
        if (currentChunk.length <= maxChunkSize) {
          chunks.push({
            content: currentChunk,
            chunkIndex: chunks.length,
            separatorUsed: separator
          });
        } else if (remainingSeparators.length > 0) {
          // Recursively split with finer separators
          var subChunks = chunkRecursive(
            currentChunk, maxChunkSize, remainingSeparators, overlap
          );
          subChunks.forEach(function(sc) {
            sc.chunkIndex = chunks.length;
            chunks.push(sc);
          });
        } else {
          // Last resort: force split into fixed-size pieces so no text is dropped
          for (var j = 0; j < currentChunk.length; j += maxChunkSize) {
            chunks.push({
              content: currentChunk.substring(j, j + maxChunkSize),
              chunkIndex: chunks.length,
              separatorUsed: "force_split"
            });
          }
        }
      }
      currentChunk = section;
    } else {
      currentChunk = candidateChunk;
    }
  }

  if (currentChunk.length > 0) {
    if (currentChunk.length <= maxChunkSize) {
      chunks.push({
        content: currentChunk,
        chunkIndex: chunks.length,
        separatorUsed: separator
      });
    } else if (remainingSeparators.length > 0) {
      var finalSubChunks = chunkRecursive(
        currentChunk, maxChunkSize, remainingSeparators, overlap
      );
      finalSubChunks.forEach(function(sc) {
        sc.chunkIndex = chunks.length;
        chunks.push(sc);
      });
    } else {
      // Last resort: force split into fixed-size pieces so no text is dropped
      for (var k = 0; k < currentChunk.length; k += maxChunkSize) {
        chunks.push({
          content: currentChunk.substring(k, k + maxChunkSize),
          chunkIndex: chunks.length,
          separatorUsed: "force_split"
        });
      }
    }
  }

  return chunks;
}

Recursive chunking is what LangChain popularized with its RecursiveCharacterTextSplitter, and for good reason. It handles heterogeneous content well because it adapts the splitting strategy to the actual structure of the text. A well-formatted document with clear paragraphs will get split on paragraph boundaries, while a wall of text will fall through to sentence or character splitting.

Handling Different Document Formats

Real-world RAG systems ingest documents in multiple formats. Each format requires specific preprocessing before chunking.

Markdown Documents

function preprocessMarkdown(markdown) {
  var sections = [];
  var currentSection = { heading: "", level: 0, content: [] };

  var lines = markdown.split("\n");

  for (var i = 0; i < lines.length; i++) {
    var line = lines[i];
    var headingMatch = line.match(/^(#{1,6})\s+(.+)$/);

    if (headingMatch) {
      if (currentSection.content.length > 0 || currentSection.heading) {
        sections.push({
          heading: currentSection.heading,
          level: currentSection.level,
          content: currentSection.content.join("\n").trim()
        });
      }
      currentSection = {
        heading: headingMatch[2],
        level: headingMatch[1].length,
        content: []
      };
    } else {
      currentSection.content.push(line);
    }
  }

  if (currentSection.content.length > 0 || currentSection.heading) {
    sections.push({
      heading: currentSection.heading,
      level: currentSection.level,
      content: currentSection.content.join("\n").trim()
    });
  }

  return sections;
}

Markdown is the friendliest format for chunking because headings provide explicit section boundaries. I always split markdown on headings first, then apply size-based chunking within each section.
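That two-stage flow can be sketched with a small adapter. `chunkMarkdownDocument` is hypothetical: it assumes sections shaped like the output of `preprocessMarkdown` plus any size-based chunker that returns objects with a `content` field:

```javascript
// Sketch: split markdown on headings first (e.g. via preprocessMarkdown),
// then apply any size-based chunker within each section, carrying the
// heading along as metadata.
function chunkMarkdownDocument(sections, chunkFn) {
  var chunks = [];

  sections.forEach(function(section) {
    chunkFn(section.content).forEach(function(chunk) {
      chunks.push({
        content: chunk.content,
        sectionHeading: section.heading,
        sectionLevel: section.level,
        chunkIndex: chunks.length
      });
    });
  });

  return chunks;
}
```

Because each chunk keeps its heading, no chunk ever straddles a section boundary, and the heading is available later as retrieval metadata.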

HTML Documents

var cheerio = require("cheerio");

function preprocessHTML(html) {
  var $ = cheerio.load(html);

  // Remove scripts, styles, and navigation
  $("script, style, nav, header, footer").remove();

  var sections = [];

  $("h1, h2, h3, h4, h5, h6").each(function() {
    var heading = $(this).text().trim();
    var level = parseInt(this.tagName.charAt(1));
    var contentParts = [];

    var sibling = $(this).next();
    while (sibling.length && !sibling.is("h1, h2, h3, h4, h5, h6")) {
      var text = sibling.text().trim();
      if (text.length > 0) {
        contentParts.push(text);
      }
      sibling = sibling.next();
    }

    sections.push({
      heading: heading,
      level: level,
      content: contentParts.join("\n\n")
    });
  });

  return sections;
}

Code Files

Code requires special handling because splitting inside a function or class definition will produce nonsensical chunks. I recommend splitting on top-level constructs: functions, classes, and module-level blocks.

function chunkCodeFile(code, language, maxChunkSize) {
  maxChunkSize = maxChunkSize || 1500;

  // Split on function/class boundaries
  var patterns = {
    javascript: /(?=\nfunction\s|\nvar\s+\w+\s*=\s*function|\nmodule\.exports)/g,
    python: /(?=\ndef\s|\nclass\s)/g,
    java: /(?=\npublic\s|\nprivate\s|\nprotected\s|\nclass\s)/g
  };

  var pattern = patterns[language];
  if (!pattern) {
    return chunkRecursive(code, maxChunkSize);
  }

  var sections = code.split(pattern).filter(function(s) {
    return s.trim().length > 0;
  });

  var chunks = [];

  for (var i = 0; i < sections.length; i++) {
    var section = sections[i].trim();

    if (section.length <= maxChunkSize) {
      chunks.push({
        content: section,
        chunkIndex: chunks.length,
        type: "code_block"
      });
    } else {
      // Large function/class: split on blank lines within it
      var subChunks = chunkRecursive(section, maxChunkSize, ["\n\n", "\n"]);
      subChunks.forEach(function(sc) {
        sc.chunkIndex = chunks.length;
        sc.type = "code_block_fragment";
        chunks.push(sc);
      });
    }
  }

  return chunks;
}

Overlap Strategies and Why They Matter

Overlap is the most underestimated parameter in chunking. Without overlap, critical information that spans a chunk boundary gets split across two chunks, and neither chunk contains the full context. Consider a sentence like "The API returns a 429 status code, which means you should implement exponential backoff." If the chunk boundary falls between "429 status code," and "which means," neither chunk is useful on its own.

There are three overlap strategies worth considering:

Fixed overlap uses a constant number of characters or tokens. Simple but effective. I recommend 10-15% of your chunk size as a starting point.

Sentence overlap keeps the last N sentences from the previous chunk at the beginning of the next chunk. This guarantees grammatically complete overlap.

Sliding window overlap creates chunks at regular intervals but with a large overlap, effectively creating a dense set of overlapping views over the document. This is expensive in storage but can improve retrieval recall significantly.

function chunkWithSlidingWindow(text, windowSize, stepSize) {
  windowSize = windowSize || 1000;
  stepSize = stepSize || 200; // High overlap: 80%

  var chunks = [];
  var start = 0;

  while (start < text.length) {
    var end = Math.min(start + windowSize, text.length);

    chunks.push({
      content: text.slice(start, end),
      startIndex: start,
      endIndex: end,
      chunkIndex: chunks.length,
      overlapRatio: 1 - (stepSize / windowSize)
    });

    if (end === text.length) break; // stop once the window reaches the end of the text
    start = start + stepSize;
  }

  return chunks;
}

The sliding window approach produces many more chunks, which increases storage costs and embedding API usage. But in my experience, it consistently delivers the best retrieval accuracy for complex questions that require information from multiple parts of a document.

Chunk Size Optimization

Chunk size is a tuning parameter, not a fixed constant. The optimal size depends on your embedding model, your query patterns, and your content type.

Too small (under 100 tokens): Chunks lose context. A chunk like "returns a 429 error" tells you nothing about which API, under what conditions, or how to handle it. The embedding for this chunk will match many irrelevant queries about 429 errors.

Too large (over 1000 tokens): Chunks become diluted. A chunk containing three different topics will have an embedding that is a vague average of all three topics, making it a mediocre match for any specific query.

The sweet spot: For most use cases, I have found 200-500 tokens to work best. Technical documentation tends to work better at the higher end (400-500 tokens) because concepts require more context. FAQ-style content works well at 150-250 tokens because each question-answer pair is naturally concise.

function analyzeChunkSizes(chunks) {
  var sizes = chunks.map(function(c) { return c.content.length; });
  sizes.sort(function(a, b) { return a - b; });

  var sum = sizes.reduce(function(acc, s) { return acc + s; }, 0);

  return {
    count: sizes.length,
    mean: Math.round(sum / sizes.length),
    median: sizes[Math.floor(sizes.length / 2)],
    min: sizes[0],
    max: sizes[sizes.length - 1],
    stdDev: Math.round(Math.sqrt(
      sizes.reduce(function(acc, s) {
        return acc + Math.pow(s - (sum / sizes.length), 2);
      }, 0) / sizes.length
    ))
  };
}

High standard deviation in chunk sizes is a red flag. It means your chunking is inconsistent, which leads to unpredictable retrieval behavior. Aim for a coefficient of variation (stdDev / mean) under 0.5.
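A minimal sketch of that check, assuming the stats object returned by `analyzeChunkSizes` above:

```javascript
// Sketch: flag inconsistent chunking using the stats from analyzeChunkSizes.
// A coefficient of variation above the threshold suggests the strategy is
// producing wildly uneven chunks.
function checkChunkConsistency(stats, maxCoefficientOfVariation) {
  maxCoefficientOfVariation = maxCoefficientOfVariation || 0.5;
  var cv = stats.stdDev / stats.mean;
  return {
    coefficientOfVariation: Math.round(cv * 100) / 100,
    consistent: cv <= maxCoefficientOfVariation
  };
}
```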

Metadata Preservation Per Chunk

Every chunk should carry metadata about where it came from. Without metadata, retrieved chunks are just floating text with no provenance. This metadata is critical for citation, debugging, and re-ranking.

function enrichChunkMetadata(chunk, documentMeta, sections) {
  return {
    content: chunk.content,
    metadata: {
      // Document-level metadata
      sourceDocument: documentMeta.filename,
      documentTitle: documentMeta.title,
      documentType: documentMeta.type,
      documentUrl: documentMeta.url || null,

      // Position metadata
      chunkIndex: chunk.chunkIndex,
      totalChunks: null, // Set after all chunks are created
      pageNumber: documentMeta.pageNumber || null,

      // Section metadata
      sectionHeading: findParentHeading(chunk, sections),
      sectionPath: buildSectionPath(chunk, sections),

      // Content metadata
      charCount: chunk.content.length,
      wordCount: chunk.content.split(/\s+/).length,

      // Timestamps
      indexedAt: new Date().toISOString(),
      documentModified: documentMeta.lastModified || null
    }
  };
}

function findParentHeading(chunk, sections) {
  for (var i = sections.length - 1; i >= 0; i--) {
    if (sections[i].startIndex <= chunk.startIndex) {
      return sections[i].heading;
    }
  }
  return null;
}

function buildSectionPath(chunk, sections) {
  var path = [];
  var currentLevel = Infinity;

  for (var i = sections.length - 1; i >= 0; i--) {
    if (sections[i].startIndex <= chunk.startIndex && sections[i].level < currentLevel) {
      path.unshift(sections[i].heading);
      currentLevel = sections[i].level;
    }
  }

  return path.join(" > ");
}

The sectionPath is particularly valuable. When your RAG system retrieves a chunk, the LLM can see that it came from "User Guide > Authentication > OAuth2 Flow," which provides hierarchical context even if the chunk itself does not mention authentication.
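One way to surface that path at generation time is to prepend it to each retrieved chunk when assembling the prompt. `formatContextWithProvenance` is a hypothetical helper, assuming chunks shaped like `enrichChunkMetadata`'s output:

```javascript
// Sketch: carry each retrieved chunk's section path into the prompt so the
// LLM sees hierarchical provenance alongside the text.
function formatContextWithProvenance(retrievedChunks) {
  return retrievedChunks.map(function(chunk) {
    var header = chunk.metadata.sectionPath
      ? "[" + chunk.metadata.sectionPath + "]"
      : "[" + (chunk.metadata.sourceDocument || "unknown source") + "]";
    return header + "\n" + chunk.content;
  }).join("\n\n---\n\n");
}
```

The same headers double as citation anchors if you ask the model to reference its sources.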

Special Handling for Tables, Code Blocks, and Lists

Certain content structures must never be split mid-element. A table row without its header is meaningless. A code block split in the middle will not compile. A numbered list with items 1-3 in one chunk and 4-6 in another loses its logical structure.

function extractProtectedBlocks(text) {
  var blocks = [];
  var placeholder = "___PROTECTED_BLOCK_{index}___";

  // Protect code blocks
  var codeBlockRegex = /```[\s\S]*?```/g;
  var match;

  while ((match = codeBlockRegex.exec(text)) !== null) {
    blocks.push({
      content: match[0],
      type: "code_block",
      index: blocks.length
    });
  }

  // Protect tables (markdown)
  var tableRegex = /\|.+\|\n\|[-\s|:]+\|\n(\|.+\|\n)*/g;

  while ((match = tableRegex.exec(text)) !== null) {
    blocks.push({
      content: match[0],
      type: "table",
      index: blocks.length
    });
  }

  // Replace protected blocks with placeholders
  var processedText = text;
  for (var i = blocks.length - 1; i >= 0; i--) {
    processedText = processedText.replace(
      blocks[i].content,
      placeholder.replace("{index}", i)
    );
  }

  return {
    text: processedText,
    blocks: blocks,
    restore: function(chunkedText) {
      var restored = chunkedText;
      for (var j = 0; j < blocks.length; j++) {
        // Use a function replacement so "$" sequences in block
        // content are not interpreted as replacement patterns
        restored = restored.replace(
          placeholder.replace("{index}", j),
          function() { return blocks[j].content; }
        );
      }
      return restored;
    }
  };
}

The strategy is simple: extract protected blocks before chunking, replace them with placeholders, chunk the remaining text, then restore the blocks. If a protected block exceeds the chunk size limit, keep it as its own chunk rather than splitting it.
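For the common case of markdown code fences, the extract-chunk-restore flow can also be collapsed into one pass. This is a sketch: `chunkProtectingCode` is hypothetical, and `chunkFn` stands in for any size-based splitter returning objects with a `content` field:

```javascript
// Sketch: keep fenced code blocks whole while chunking markdown text.
// Oversized protected blocks become standalone chunks instead of being split.
function chunkProtectingCode(text, maxChunkSize, chunkFn) {
  var parts = text.split(/(`{3}[\s\S]*?`{3})/); // capture group keeps the fences
  var chunks = [];

  parts.forEach(function(part) {
    if (/^`{3}/.test(part)) {
      // Protected block: emit as its own chunk, never split it
      chunks.push({ content: part, chunkIndex: chunks.length, type: "code_block" });
    } else if (part.trim().length > 0) {
      chunkFn(part, maxChunkSize).forEach(function(c) {
        chunks.push({ content: c.content, chunkIndex: chunks.length, type: "text" });
      });
    }
  });

  return chunks;
}
```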

Hierarchical Indexing with Parent-Child Chunks

One of the most effective advanced techniques is hierarchical chunking, where you create both large "parent" chunks and smaller "child" chunks. During retrieval, you search against the fine-grained child chunks for precision, but you return the parent chunk to the LLM for context.

function createHierarchicalChunks(text, parentSize, childSize) {
  parentSize = parentSize || 2000;
  childSize = childSize || 500;

  var parentChunks = chunkByParagraphs(text, parentSize);
  var hierarchy = [];

  for (var i = 0; i < parentChunks.length; i++) {
    var parent = parentChunks[i];
    var parentId = "parent_" + i;

    var children = chunkBySentences(parent.content, childSize, 1);

    hierarchy.push({
      id: parentId,
      role: "parent",
      content: parent.content,
      childIds: children.map(function(_, j) {
        return parentId + "_child_" + j;
      })
    });

    for (var j = 0; j < children.length; j++) {
      hierarchy.push({
        id: parentId + "_child_" + j,
        role: "child",
        content: children[j].content,
        parentId: parentId
      });
    }
  }

  return hierarchy;
}

function retrieveWithHierarchy(queryEmbedding, index, topK) {
  topK = topK || 5;

  // Search only child chunks for precision
  var childResults = index.search(queryEmbedding, {
    filter: { role: "child" },
    topK: topK
  });

  // Deduplicate by parent and return parent content
  var seenParents = {};
  var parentChunks = [];

  for (var i = 0; i < childResults.length; i++) {
    var parentId = childResults[i].metadata.parentId;
    if (!seenParents[parentId]) {
      seenParents[parentId] = true;
      parentChunks.push(index.getById(parentId));
    }
  }

  return parentChunks;
}

This pattern gives you the best of both worlds. The child chunks have focused embeddings that match specific queries well. The parent chunks provide the LLM with enough surrounding context to generate a coherent answer.
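The index object that retrieveWithHierarchy assumes can be stubbed in memory for testing. `createInMemoryIndex` is a hypothetical minimal implementation; a real system would use a vector database:

```javascript
// Sketch: a minimal in-memory index satisfying the search/getById API
// assumed by retrieveWithHierarchy. The similarity function is injected
// so any scoring scheme (e.g. cosineSimilarity above) can be used.
function createInMemoryIndex(entries, similarityFn) {
  return {
    search: function(queryEmbedding, options) {
      return entries
        .filter(function(e) { return e.metadata.role === options.filter.role; })
        .map(function(e) {
          return {
            id: e.id,
            metadata: e.metadata,
            score: similarityFn(queryEmbedding, e.embedding)
          };
        })
        .sort(function(a, b) { return b.score - a.score; })
        .slice(0, options.topK);
    },
    getById: function(id) {
      return entries.find(function(e) { return e.id === id; });
    }
  };
}
```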

Evaluating Chunk Quality

You cannot improve what you do not measure. Here is a practical evaluation framework for chunk quality.

function evaluateChunkQuality(chunks, testQueries, getEmbedding, expectedChunkIndices) {
  var results = {
    totalQueries: testQueries.length,
    hits: 0,
    misses: 0,
    averageRank: 0,
    details: []
  };

  return Promise.all(testQueries.map(function(query, queryIndex) {
    return getEmbedding(query).then(function(queryEmbedding) {
      var similarities = chunks.map(function(chunk, chunkIdx) {
        return {
          chunkIndex: chunkIdx,
          similarity: cosineSimilarity(queryEmbedding, chunk.embedding)
        };
      });

      similarities.sort(function(a, b) {
        return b.similarity - a.similarity;
      });

      var expectedIndex = expectedChunkIndices[queryIndex];
      var rank = -1;

      for (var i = 0; i < similarities.length; i++) {
        if (similarities[i].chunkIndex === expectedIndex) {
          rank = i + 1;
          break;
        }
      }

      var hit = rank > 0 && rank <= 5; // Top-5 retrieval (rank -1 means not found)
      if (hit) results.hits++;
      else results.misses++;

      results.details.push({
        query: query,
        expectedChunk: expectedIndex,
        actualRank: rank,
        topSimilarity: similarities[0].similarity,
        hit: hit
      });

      return rank;
    });
  })).then(function(ranks) {
    var validRanks = ranks.filter(function(r) { return r > 0; });
    results.averageRank = validRanks.reduce(function(a, b) { return a + b; }, 0) / validRanks.length;
    results.retrievalAccuracy = results.hits / results.totalQueries;
    return results;
  });
}

Build a test set of 20-50 queries with known correct chunks. Run your evaluation after every change to your chunking strategy. A retrieval accuracy below 80% at top-5 means your chunking needs work.
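The 80% threshold is easy to check once you have the rank of the expected chunk for each query. `topKAccuracy` is a small hypothetical helper; the ranks below are illustrative:

```javascript
// Sketch: top-K hit rate over a labeled query set. rankings[i] is the
// 1-based rank of the expected chunk for query i (-1 when it was not
// retrieved at all).
function topKAccuracy(rankings, k) {
  var hits = rankings.filter(function(r) { return r > 0 && r <= k; }).length;
  return hits / rankings.length;
}
```

Track this number in version control alongside your chunking config so regressions are visible in review.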

Complete Working Example

Here is a complete Node.js chunking library that ties together all the strategies discussed above.

// chunking-pipeline.js
var natural = require("natural");
var cheerio = require("cheerio");
var crypto = require("crypto");

var sentenceTokenizer = new natural.SentenceTokenizer();

// ============================================================
// Core Chunking Strategies
// ============================================================

var ChunkingStrategies = {
  fixed: function(text, options) {
    var chunkSize = options.chunkSize || 1000;
    var overlap = options.overlap || 200;
    var chunks = [];
    var start = 0;

    while (start < text.length) {
      var end = Math.min(start + chunkSize, text.length);
      chunks.push({
        content: text.slice(start, end),
        strategy: "fixed",
        startIndex: start,
        endIndex: end
      });
      if (end === text.length) break; // avoid a redundant tail-only chunk
      start = start + chunkSize - overlap;
    }

    return chunks;
  },

  sentence: function(text, options) {
    var maxSize = options.chunkSize || 1000;
    var overlapCount = options.overlapSentences || 2;
    var sentences = sentenceTokenizer.tokenize(text);
    var chunks = [];
    var current = [];
    var currentLen = 0;

    for (var i = 0; i < sentences.length; i++) {
      if (currentLen + sentences[i].length > maxSize && current.length > 0) {
        chunks.push({
          content: current.join(" "),
          strategy: "sentence",
          sentenceCount: current.length
        });
        var keep = Math.max(0, current.length - overlapCount);
        current = current.slice(keep);
        currentLen = current.join(" ").length;
      }
      current.push(sentences[i]);
      currentLen = currentLen + sentences[i].length + 1;
    }

    if (current.length > 0) {
      chunks.push({
        content: current.join(" "),
        strategy: "sentence",
        sentenceCount: current.length
      });
    }

    return chunks;
  },

  paragraph: function(text, options) {
    var maxSize = options.chunkSize || 1500;
    var minLength = options.minParagraphLength || 30;
    var paragraphs = text.split(/\n\s*\n/).filter(function(p) {
      return p.trim().length >= minLength;
    });

    var chunks = [];
    var current = [];
    var currentLen = 0;

    for (var i = 0; i < paragraphs.length; i++) {
      var para = paragraphs[i].trim();

      if (currentLen + para.length > maxSize && current.length > 0) {
        chunks.push({
          content: current.join("\n\n"),
          strategy: "paragraph",
          paragraphCount: current.length
        });
        current = [current[current.length - 1]];
        currentLen = current[0].length;
      }

      current.push(para);
      currentLen = currentLen + para.length + 2;
    }

    if (current.length > 0) {
      chunks.push({
        content: current.join("\n\n"),
        strategy: "paragraph",
        paragraphCount: current.length
      });
    }

    return chunks;
  },

  recursive: function(text, options) {
    var maxSize = options.chunkSize || 1000;
    var separators = ["\n\n", "\n", ". ", " "];

    function splitRecursive(txt, seps) {
      if (txt.length <= maxSize) {
        return [{ content: txt, strategy: "recursive" }];
      }

      if (seps.length === 0) {
        // Force split into fixed-size pieces so no text is dropped
        var forced = [];
        for (var j = 0; j < txt.length; j += maxSize) {
          forced.push({ content: txt.substring(j, j + maxSize), strategy: "recursive" });
        }
        return forced;
      }

      var sep = seps[0];
      var parts = txt.split(sep);
      var results = [];
      var current = "";

      for (var i = 0; i < parts.length; i++) {
        var candidate = current ? current + sep + parts[i] : parts[i];
        if (candidate.length > maxSize) {
          if (current) {
            if (current.length <= maxSize) {
              results.push({ content: current, strategy: "recursive" });
            } else {
              var sub = splitRecursive(current, seps.slice(1));
              results = results.concat(sub);
            }
          }
          current = parts[i];
        } else {
          current = candidate;
        }
      }

      if (current) {
        if (current.length <= maxSize) {
          results.push({ content: current, strategy: "recursive" });
        } else {
          var finalSub = splitRecursive(current, seps.slice(1));
          results = results.concat(finalSub);
        }
      }

      return results;
    }

    return splitRecursive(text, separators);
  }
};

// ============================================================
// Document Format Preprocessors
// ============================================================

var Preprocessors = {
  markdown: function(text) {
    var sections = [];
    var currentHeading = "";
    var currentLevel = 0;
    var currentContent = [];
    var lines = text.split("\n");

    for (var i = 0; i < lines.length; i++) {
      var match = lines[i].match(/^(#{1,6})\s+(.+)$/);
      if (match) {
        if (currentContent.length > 0 || currentHeading) {
          sections.push({
            heading: currentHeading,
            level: currentLevel,
            content: currentContent.join("\n").trim()
          });
        }
        currentHeading = match[2];
        currentLevel = match[1].length;
        currentContent = [];
      } else {
        currentContent.push(lines[i]);
      }
    }

    if (currentContent.length > 0 || currentHeading) {
      sections.push({
        heading: currentHeading,
        level: currentLevel,
        content: currentContent.join("\n").trim()
      });
    }

    return sections;
  },

  html: function(html) {
    var $ = cheerio.load(html);
    $("script, style, nav, header, footer, aside").remove();

    var sections = [];
    $("h1, h2, h3, h4, h5, h6").each(function() {
      var heading = $(this).text().trim();
      var level = parseInt(this.tagName.charAt(1), 10);
      var parts = [];
      var sibling = $(this).next();

      while (sibling.length && !sibling.is("h1,h2,h3,h4,h5,h6")) {
        var txt = sibling.text().trim();
        if (txt) parts.push(txt);
        sibling = sibling.next();
      }

      sections.push({ heading: heading, level: level, content: parts.join("\n\n") });
    });

    // Fallback: if no headings found, return body text
    if (sections.length === 0) {
      sections.push({
        heading: $("title").text() || "",
        level: 1,
        content: $("body").text().trim()
      });
    }

    return sections;
  },

  plaintext: function(text) {
    return [{ heading: "", level: 0, content: text }];
  }
};

// ============================================================
// Chunking Pipeline
// ============================================================

function ChunkingPipeline(options) {
  this.strategy = options.strategy || "sentence";
  this.chunkSize = options.chunkSize || 800;
  this.overlap = options.overlap || 100;
  this.overlapSentences = options.overlapSentences || 2;
  this.preserveBlocks = options.preserveBlocks !== false;
}

ChunkingPipeline.prototype.process = function(text, format, documentMeta) {
  format = format || "plaintext";
  documentMeta = documentMeta || {};

  // Step 1: Detect and protect special blocks
  var protectedResult = null;
  if (this.preserveBlocks) {
    protectedResult = this._extractProtectedBlocks(text);
    text = protectedResult.text;
  }

  // Step 2: Preprocess by format
  var preprocessor = Preprocessors[format] || Preprocessors.plaintext;
  var sections = preprocessor(text);

  // Step 3: Chunk each section
  var allChunks = [];
  var chunkFn = ChunkingStrategies[this.strategy];

  if (!chunkFn) {
    throw new Error("Unknown chunking strategy: " + this.strategy);
  }

  var self = this;
  var globalIndex = 0;

  for (var s = 0; s < sections.length; s++) {
    var section = sections[s];
    if (!section.content || section.content.trim().length === 0) continue;

    var sectionChunks = chunkFn(section.content, {
      chunkSize: self.chunkSize,
      overlap: self.overlap,
      overlapSentences: self.overlapSentences
    });

    for (var c = 0; c < sectionChunks.length; c++) {
      var chunk = sectionChunks[c];

      // Restore protected blocks
      if (protectedResult) {
        chunk.content = protectedResult.restore(chunk.content);
      }

      // Enrich with metadata
      chunk.chunkIndex = globalIndex;
      chunk.sectionHeading = section.heading || null;
      chunk.sectionLevel = section.level || 0;
      chunk.metadata = {
        source: documentMeta.filename || "unknown",
        title: documentMeta.title || null,
        url: documentMeta.url || null,
        format: format,
        chunkId: self._generateChunkId(chunk.content),
        charCount: chunk.content.length,
        wordCount: chunk.content.split(/\s+/).filter(Boolean).length,
        processedAt: new Date().toISOString()
      };

      allChunks.push(chunk);
      globalIndex++;
    }
  }

  // Set totalChunks on all chunks
  for (var t = 0; t < allChunks.length; t++) {
    allChunks[t].metadata.totalChunks = allChunks.length;
  }

  return allChunks;
};

ChunkingPipeline.prototype._extractProtectedBlocks = function(text) {
  var blocks = [];
  var processed = text;

  // Protect fenced code blocks
  processed = processed.replace(/```[\s\S]*?```/g, function(match) {
    var idx = blocks.length;
    blocks.push({ content: match, type: "code" });
    return "___BLOCK_" + idx + "___";
  });

  // Protect markdown tables
  processed = processed.replace(
    /\|.+\|\n\|[-\s|:]+\|\n(\|.+\|(\n|$))*/g,
    function(match) {
      var idx = blocks.length;
      blocks.push({ content: match, type: "table" });
      return "___BLOCK_" + idx + "___";
    }
  );

  return {
    text: processed,
    blocks: blocks,
    restore: function(chunkedText) {
      var result = chunkedText;
      for (var i = 0; i < blocks.length; i++) {
        // Use a replacer function so "$" sequences inside restored code
        // blocks are not interpreted as special replacement patterns
        result = result.replace("___BLOCK_" + i + "___", function() {
          return blocks[i].content;
        });
      }
      return result;
    }
  };
};

ChunkingPipeline.prototype._generateChunkId = function(content) {
  return crypto.createHash("md5").update(content).digest("hex").substring(0, 12);
};

// ============================================================
// Quality Evaluation
// ============================================================

function evaluateChunks(chunks) {
  var sizes = chunks.map(function(c) { return c.content.length; });
  var totalSize = sizes.reduce(function(a, b) { return a + b; }, 0);
  var meanSize = totalSize / sizes.length;
  var variance = sizes.reduce(function(acc, s) {
    return acc + Math.pow(s - meanSize, 2);
  }, 0) / sizes.length;
  var stdDev = Math.sqrt(variance);

  sizes.sort(function(a, b) { return a - b; });

  var emptyChunks = chunks.filter(function(c) {
    return c.content.trim().length < 20;
  });

  return {
    totalChunks: chunks.length,
    totalCharacters: totalSize,
    meanChunkSize: Math.round(meanSize),
    medianChunkSize: sizes[Math.floor(sizes.length / 2)],
    minChunkSize: sizes[0],
    maxChunkSize: sizes[sizes.length - 1],
    stdDev: Math.round(stdDev),
    coefficientOfVariation: (stdDev / meanSize).toFixed(3),
    emptyOrTinyChunks: emptyChunks.length,
    qualityScore: calculateQualityScore(meanSize, stdDev, emptyChunks.length, chunks.length)
  };
}

function calculateQualityScore(meanSize, stdDev, emptyCount, totalCount) {
  var score = 100;

  // Penalize extreme mean sizes
  if (meanSize < 100) score = score - 20;
  else if (meanSize > 3000) score = score - 15;

  // Penalize high variance
  var cv = stdDev / meanSize;
  if (cv > 0.8) score = score - 25;
  else if (cv > 0.5) score = score - 10;

  // Penalize empty/tiny chunks
  var emptyRatio = emptyCount / totalCount;
  if (emptyRatio > 0.1) score = score - 20;
  else if (emptyRatio > 0.05) score = score - 10;

  return Math.max(0, Math.min(100, score));
}

// ============================================================
// Usage Example
// ============================================================

function main() {
  var sampleMarkdown = [
    "# Getting Started with Express.js",
    "",
    "Express.js is a minimal web framework for Node.js. It provides a robust",
    "set of features for building web applications and APIs. Express has been",
    "the de facto standard for Node.js web development for over a decade.",
    "",
    "## Installation",
    "",
    "Install Express using npm. You will also want to install nodemon for",
    "development to automatically restart your server when files change.",
    "",
    "```",
    "npm install express",
    "npm install --save-dev nodemon",
    "```",
    "",
    "## Creating Your First Server",
    "",
    "Create a file called app.js in your project root. Import Express and",
    "create an application instance. Define your routes and start listening",
    "on a port.",
    "",
    "The basic pattern is to create an Express app, define middleware and",
    "routes, then call app.listen() with a port number. Express will handle",
    "incoming HTTP requests and route them to the appropriate handler.",
    "",
    "## Middleware",
    "",
    "Middleware functions have access to the request object, the response",
    "object, and the next middleware function. They can execute code, modify",
    "the request and response, end the request-response cycle, or call the",
    "next middleware. Middleware is the backbone of Express applications."
  ].join("\n");

  var pipeline = new ChunkingPipeline({
    strategy: "sentence",
    chunkSize: 500,
    overlapSentences: 1,
    preserveBlocks: true
  });

  var chunks = pipeline.process(sampleMarkdown, "markdown", {
    filename: "express-guide.md",
    title: "Getting Started with Express.js",
    url: "https://example.com/express-guide"
  });

  console.log("=== Chunking Results ===\n");
  for (var i = 0; i < chunks.length; i++) {
    console.log("Chunk " + i + " [" + chunks[i].sectionHeading + "]:");
    console.log(chunks[i].content.substring(0, 120) + "...");
    console.log("Words: " + chunks[i].metadata.wordCount);
    console.log("---");
  }

  console.log("\n=== Quality Report ===\n");
  var quality = evaluateChunks(chunks);
  console.log(JSON.stringify(quality, null, 2));
}

if (require.main === module) {
  main();
}

module.exports = {
  ChunkingPipeline: ChunkingPipeline,
  ChunkingStrategies: ChunkingStrategies,
  Preprocessors: Preprocessors,
  evaluateChunks: evaluateChunks
};

Save this as chunking-pipeline.js and run it with node chunking-pipeline.js to see the chunking and quality evaluation in action.

Common Issues and Troubleshooting

1. Chunks Are Empty or Contain Only Whitespace

Error: Chunk at index 7 has empty content after trimming

This happens when your separator pattern matches multiple consecutive times, creating empty segments. The fix is to filter out empty segments after splitting:

var segments = text.split("\n\n").filter(function(s) {
  return s.trim().length > 0;
});

Also check for documents that use \r\n line endings. Normalize line endings before chunking with text.replace(/\r\n/g, "\n").

2. Protected Blocks Exceed Chunk Size Limit

Warning: Code block at line 142 is 3,847 characters, exceeding chunk size of 1,000

A single code block or table can exceed your chunk size limit. You have two options: increase the chunk size limit for chunks that contain protected blocks, or treat oversized protected blocks as standalone chunks. I recommend the latter:

if (block.content.length > maxChunkSize) {
  // Keep as standalone chunk rather than splitting code
  chunks.push({
    content: block.content,
    strategy: "protected_block",
    oversized: true
  });
}

3. Sentence Tokenizer Fails on Non-Standard Text

TypeError: Cannot read property 'length' of undefined
  at SentenceTokenizer.tokenize

The Natural library's sentence tokenizer can choke on text with unusual punctuation, URLs, or numeric sequences like "v2.3.1." that look like sentence endings. Preprocess your text to handle these cases:

function sanitizeForSentenceTokenization(text) {
  // Protect dots inside version numbers like "v2.3.1"
  text = text.replace(/v\d+\.\d+\.\d+/g, function(version) {
    return version.replace(/\./g, "_PROTECTED_DOT_");
  });
  // Protect dots inside URLs
  text = text.replace(/(https?:\/\/[^\s]+)/g, function(url) {
    return url.replace(/\./g, "_PROTECTED_DOT_");
  });
  return text;
}

After tokenization, reverse the substitution on each sentence with sentence.replace(/_PROTECTED_DOT_/g, ".").

4. Duplicate Content from Excessive Overlap

Issue: Vector search returns the same passage from 4 different chunks, all with similarity > 0.95

If your overlap is too large relative to your chunk size, you end up with chunks that are mostly identical. This wastes storage and pollutes search results. Keep your overlap under 20% of your chunk size, or deduplicate search results by embedding similarity:

function cosineSimilarity(a, b) {
  var dot = 0, normA = 0, normB = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function deduplicateResults(results, similarityThreshold) {
  similarityThreshold = similarityThreshold || 0.95;
  var seen = [];

  return results.filter(function(result) {
    for (var i = 0; i < seen.length; i++) {
      if (cosineSimilarity(result.embedding, seen[i]) > similarityThreshold) {
        return false;
      }
    }
    seen.push(result.embedding);
    return true;
  });
}

5. Memory Issues with Large Documents

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

Processing a 500-page PDF as a single string will exhaust Node.js memory. Stream-process large documents by splitting them into pages or sections first, then chunking each section independently:

function processLargeDocument(pages, pipeline, documentMeta) {
  var allChunks = [];

  for (var i = 0; i < pages.length; i++) {
    var pageChunks = pipeline.process(pages[i], "plaintext", {
      filename: documentMeta.filename,
      title: documentMeta.title,
      pageNumber: i + 1
    });
    allChunks = allChunks.concat(pageChunks);
  }

  // Re-index after combining
  for (var j = 0; j < allChunks.length; j++) {
    allChunks[j].chunkIndex = j;
    allChunks[j].metadata.totalChunks = allChunks.length;
  }

  return allChunks;
}

Best Practices

  • Start with sentence-based chunking and a 400-token target size. This is the best default for most use cases. Only switch strategies when you have evidence that another approach performs better on your specific data.

  • Always include overlap between chunks. A 10-15% overlap prevents information loss at chunk boundaries. Sentence overlap is preferable to character overlap because it preserves grammatical completeness.

  • Preserve metadata aggressively. Every chunk should carry its source document, section heading, page number, and position within the document. This metadata is essential for citation, debugging, and contextual re-ranking.

  • Never split code blocks, tables, or structured lists. These elements lose their meaning when fragmented. Extract them as protected blocks before chunking, and keep them intact even if they exceed the chunk size limit.

  • Build a retrieval evaluation set early. Create 30-50 test queries with known correct answers and measure retrieval accuracy at top-5. Run this evaluation whenever you change your chunking strategy, chunk size, or overlap. Without measurement, you are tuning blind.

  • Normalize text before chunking. Standardize line endings, collapse excessive whitespace, remove invisible Unicode characters, and convert smart quotes to straight quotes. Inconsistent formatting creates inconsistent chunks.

  • Use hierarchical chunking for long documents. Create parent chunks (1500-2000 characters) and child chunks (300-500 characters). Search against children for precision, return parents to the LLM for context. This consistently outperforms single-level chunking on complex queries.

  • Monitor chunk size distribution in production. A coefficient of variation above 0.5 indicates your chunking is inconsistent. Log chunk size statistics and alert on drift, because document format changes will silently degrade your chunking quality.

  • Version your chunking configuration. When you change chunk size, overlap, or strategy, you need to re-embed all your documents. Track your chunking parameters alongside your embedding model version so you can reproduce results and roll back if needed.
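The normalization advice above can be sketched as a single pass; the character classes here are a reasonable starting point rather than an exhaustive list:

```javascript
// Normalize a document before chunking: line endings, invisible characters,
// smart quotes, trailing whitespace, and runs of blank lines
function normalizeText(text) {
  return text
    .replace(/\r\n?/g, "\n")                    // CRLF and bare CR to LF
    .replace(/[\u200B\u200C\u200D\uFEFF]/g, "") // zero-width characters and BOM
    .replace(/[\u2018\u2019]/g, "'")            // smart single quotes
    .replace(/[\u201C\u201D]/g, '"')            // smart double quotes
    .replace(/[ \t]+\n/g, "\n")                 // trailing whitespace on lines
    .replace(/\n{3,}/g, "\n\n");                // collapse 3+ newlines to 2
}
```

Run this once per document, before protected-block extraction, so placeholders and chunk boundaries see consistent input.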
