Document Chunking Strategies for RAG
Master document chunking for RAG with fixed, sentence, paragraph, and semantic strategies plus metadata preservation in Node.js.
Overview
Document chunking is the single most impactful decision you will make when building a Retrieval-Augmented Generation (RAG) system. How you split documents into chunks directly determines retrieval quality, and poor chunking will sabotage even the best embedding model and vector database. This article covers every major chunking strategy, shows you how to implement them in Node.js, and gives you a practical framework for choosing the right approach for your data.
Prerequisites
- Node.js v18+ installed
- Basic understanding of RAG architecture (embeddings, vector search, LLM generation)
- Familiarity with text processing concepts
- A working knowledge of JavaScript and npm
Install the dependencies we will use throughout this article:
npm install natural compromise cheerio marked tiktoken
Why Chunking Matters for RAG Quality
When a user asks your RAG system a question, the pipeline looks like this: the query gets embedded into a vector, your vector database finds the most similar chunks, and those chunks get injected into the LLM prompt as context. If your chunks are poorly constructed, the retrieved context will be either too vague to answer the question or too fragmented to make sense.
I have seen teams spend weeks fine-tuning embedding models and switching vector databases when the real problem was that their chunks were cutting sentences in half or stuffing entire chapters into a single vector. Chunking is not a preprocessing step you rush through. It is a core architectural decision.
The fundamental tension in chunking is this: smaller chunks give you more precise retrieval but lose surrounding context, while larger chunks preserve context but dilute the relevance signal. Every strategy in this article is a different approach to managing that tradeoff.
Fixed-Size Chunking
Fixed-size chunking is the simplest approach. You split text into chunks of a predetermined size, measured in either characters or tokens. It is fast, predictable, and works reasonably well as a baseline.
Character-Based Fixed Chunking
var DEFAULT_CHUNK_SIZE = 1000;
var DEFAULT_OVERLAP = 200;
function chunkByCharacters(text, chunkSize, overlap) {
chunkSize = chunkSize || DEFAULT_CHUNK_SIZE;
overlap = overlap || DEFAULT_OVERLAP;
// Guard against an infinite loop: the stride (chunkSize - overlap) must be positive
if (overlap >= chunkSize) {
throw new Error("overlap must be smaller than chunkSize");
}
var chunks = [];
var start = 0;
while (start < text.length) {
var end = Math.min(start + chunkSize, text.length);
var chunk = text.slice(start, end);
chunks.push({
content: chunk,
startIndex: start,
endIndex: end,
chunkIndex: chunks.length
});
start = start + chunkSize - overlap;
}
return chunks;
}
The problem with raw character splitting is obvious: you will cut words, sentences, and paragraphs at arbitrary points. The word "implementation" might become "implementa" at the end of one chunk and "tion" at the start of the next. This is why overlap exists, but it does not solve the fundamental problem.
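If you want to keep the simplicity of fixed-size chunking without shredding words, you can snap each cut back to the nearest whitespace. A minimal sketch (the chunkByWholeWords name and the hard-cut fallback for oversized words are my own choices, not an established API):

```javascript
var DEFAULT_CHUNK_SIZE = 1000;
var DEFAULT_OVERLAP = 200;

function chunkByWholeWords(text, chunkSize, overlap) {
  chunkSize = chunkSize || DEFAULT_CHUNK_SIZE;
  if (typeof overlap !== "number") {
    overlap = DEFAULT_OVERLAP;
  }
  var chunks = [];
  var start = 0;
  while (start < text.length) {
    var end = Math.min(start + chunkSize, text.length);
    // If the cut lands mid-word, pull it back to the last space.
    // When a single word exceeds the whole budget, fall through
    // to a hard cut rather than looping forever.
    if (end < text.length && !/\s/.test(text.charAt(end))) {
      var lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > start) {
        end = lastSpace;
      }
    }
    chunks.push({
      content: text.slice(start, end).trim(),
      startIndex: start,
      endIndex: end,
      chunkIndex: chunks.length
    });
    if (end === text.length) break;
    // Step forward, guaranteeing progress even with large overlaps
    start = Math.max(end - overlap, start + 1);
  }
  return chunks;
}
```

Because the cut position now varies, the overlap is applied from the actual end of each chunk rather than on a fixed stride.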
Token-Based Fixed Chunking
Token counts are more meaningful than character counts because embedding models and LLMs operate on tokens, not characters. A 512-token chunk occupies exactly 512 tokens of the model's context window, so you can budget prompt and embedding space precisely.
var tiktoken = require("tiktoken");
function chunkByTokens(text, maxTokens, overlapTokens) {
maxTokens = maxTokens || 512;
overlapTokens = overlapTokens || 50;
// Guard against an infinite loop: the stride (maxTokens - overlapTokens) must be positive
if (overlapTokens >= maxTokens) {
throw new Error("overlapTokens must be smaller than maxTokens");
}
var encoder = tiktoken.encoding_for_model("text-embedding-ada-002");
var tokens = encoder.encode(text);
var chunks = [];
var start = 0;
while (start < tokens.length) {
var end = Math.min(start + maxTokens, tokens.length);
var chunkTokens = tokens.slice(start, end);
// decode() returns a Uint8Array in the tiktoken npm package
var chunkText = encoder.decode(chunkTokens);
chunks.push({
content: Buffer.from(chunkText).toString("utf-8"),
tokenCount: chunkTokens.length,
chunkIndex: chunks.length
});
start = start + maxTokens - overlapTokens;
}
encoder.free();
return chunks;
}
Token-based chunking is my recommended starting point for any RAG system. It gives you direct control over how much context fits into each embedding and each LLM prompt.
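When you only need rough budgets at ingestion time, a character-based heuristic avoids loading a tokenizer at all. This is an approximation only (roughly 4 characters per token holds for typical English prose, less so for code or non-Latin scripts); use tiktoken when you need exact counts against a hard context-window limit:

```javascript
// Rule of thumb: ~4 characters per token for typical English text
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Convert a token budget into an approximate character budget for
// the character-based chunkers in this article
function charBudgetForTokens(maxTokens) {
  return maxTokens * 4;
}
```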
Sentence-Based Chunking
Sentence-based chunking respects natural language boundaries. Instead of cutting at arbitrary positions, you split on sentence endings and then group sentences together until you hit a size limit.
var natural = require("natural");
var tokenizer = new natural.SentenceTokenizer();
function chunkBySentences(text, maxChunkSize, overlapSentences) {
maxChunkSize = maxChunkSize || 1000;
overlapSentences = overlapSentences || 2;
var sentences = tokenizer.tokenize(text);
var chunks = [];
var currentChunk = [];
var currentLength = 0;
for (var i = 0; i < sentences.length; i++) {
var sentence = sentences[i];
if (currentLength + sentence.length > maxChunkSize && currentChunk.length > 0) {
chunks.push({
content: currentChunk.join(" "),
sentenceCount: currentChunk.length,
chunkIndex: chunks.length
});
// Keep overlap sentences for context continuity
var overlapStart = Math.max(0, currentChunk.length - overlapSentences);
var overlapContent = currentChunk.slice(overlapStart);
currentChunk = overlapContent;
currentLength = overlapContent.join(" ").length;
}
currentChunk.push(sentence);
currentLength = currentLength + sentence.length + 1;
}
if (currentChunk.length > 0) {
chunks.push({
content: currentChunk.join(" "),
sentenceCount: currentChunk.length,
chunkIndex: chunks.length
});
}
return chunks;
}
Sentence-based chunking produces much more readable chunks than fixed-size splitting. The overlap is measured in sentences rather than characters, which means the overlap content always makes grammatical sense. This is a significant advantage when the LLM needs to reason over the retrieved context.
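If you would rather avoid the natural dependency for simple inputs, a regex splitter is a serviceable fallback. Treat this as a sketch with known failure modes (abbreviations like "e.g." and "Dr." followed by a space will cause false splits):

```javascript
// Splits on whitespace that follows sentence-final punctuation
function splitSentences(text) {
  return text
    .split(/(?<=[.!?])\s+/)
    .map(function(s) { return s.trim(); })
    .filter(function(s) { return s.length > 0; });
}
```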
Paragraph-Based Chunking
Paragraphs are the most natural unit of thought in written text. A paragraph typically covers one idea, making it a strong candidate for a self-contained chunk.
function chunkByParagraphs(text, maxChunkSize, minParagraphLength) {
maxChunkSize = maxChunkSize || 1500;
minParagraphLength = minParagraphLength || 50;
var paragraphs = text.split(/\n\s*\n/).filter(function(p) {
return p.trim().length >= minParagraphLength;
});
var chunks = [];
var currentChunk = [];
var currentLength = 0;
for (var i = 0; i < paragraphs.length; i++) {
var paragraph = paragraphs[i].trim();
if (currentLength + paragraph.length > maxChunkSize && currentChunk.length > 0) {
chunks.push({
content: currentChunk.join("\n\n"),
paragraphCount: currentChunk.length,
chunkIndex: chunks.length
});
// Keep last paragraph as overlap
currentChunk = [currentChunk[currentChunk.length - 1]];
currentLength = currentChunk[0].length;
}
currentChunk.push(paragraph);
currentLength = currentLength + paragraph.length + 2;
}
if (currentChunk.length > 0) {
chunks.push({
content: currentChunk.join("\n\n"),
paragraphCount: currentChunk.length,
chunkIndex: chunks.length
});
}
return chunks;
}
Paragraph-based chunking works extremely well for well-structured documents like articles, documentation, and reports. It falls apart with poorly formatted text that has no paragraph breaks or with text that uses single line breaks instead of double.
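One mitigation for single-line-break text is to normalize it before paragraph chunking. The heuristic below is my own rule of thumb, not a standard algorithm: a line break after sentence-final punctuation becomes a paragraph break, and any other single line break is treated as a hard-wrapped line and joined back together:

```javascript
// Blank lines still separate paragraphs as usual; only single
// line breaks are reinterpreted by the punctuation heuristic
function normalizeLineBreaks(text) {
  var merged = text.split("\n").reduce(function(acc, line) {
    var trimmed = line.trim();
    if (trimmed.length === 0) {
      acc.push("");
      return acc;
    }
    var prev = acc[acc.length - 1];
    if (prev === undefined || prev === "" || /[.!?:]$/.test(prev)) {
      acc.push(trimmed);
    } else {
      // Previous line has no sentence-final punctuation: assume
      // this line is a wrapped continuation and join it
      acc[acc.length - 1] = prev + " " + trimmed;
    }
    return acc;
  }, []);
  return merged.filter(function(l) { return l.length > 0; }).join("\n\n");
}
```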
Semantic Chunking
Semantic chunking is the most sophisticated approach. Instead of splitting on syntactic boundaries (sentences, paragraphs), you split where the topic actually changes. This requires computing embeddings for individual sentences and detecting where the cosine similarity between consecutive sentences drops below a threshold.
var natural = require("natural");
var sentenceTokenizer = new natural.SentenceTokenizer();
function cosineSimilarity(vecA, vecB) {
var dotProduct = 0;
var normA = 0;
var normB = 0;
for (var i = 0; i < vecA.length; i++) {
dotProduct = dotProduct + vecA[i] * vecB[i];
normA = normA + vecA[i] * vecA[i];
normB = normB + vecB[i] * vecB[i];
}
var denominator = Math.sqrt(normA) * Math.sqrt(normB);
if (denominator === 0) return 0;
return dotProduct / denominator;
}
function chunkBySemantic(text, getEmbedding, similarityThreshold, maxChunkSize) {
similarityThreshold = similarityThreshold || 0.75;
maxChunkSize = maxChunkSize || 1500;
var sentences = sentenceTokenizer.tokenize(text);
if (!sentences || sentences.length === 0) {
return Promise.resolve([]);
}
// Get embeddings for all sentences
return getEmbeddings(sentences, getEmbedding).then(function(embeddings) {
var chunks = [];
var currentChunk = [sentences[0]];
var currentLength = sentences[0].length;
for (var i = 1; i < sentences.length; i++) {
var similarity = cosineSimilarity(embeddings[i - 1], embeddings[i]);
var wouldExceedMax = currentLength + sentences[i].length > maxChunkSize;
if (similarity < similarityThreshold || wouldExceedMax) {
chunks.push({
content: currentChunk.join(" "),
sentenceCount: currentChunk.length,
chunkIndex: chunks.length,
splitReason: wouldExceedMax ? "max_size" : "topic_change"
});
currentChunk = [];
currentLength = 0;
}
currentChunk.push(sentences[i]);
currentLength = currentLength + sentences[i].length + 1;
}
if (currentChunk.length > 0) {
chunks.push({
content: currentChunk.join(" "),
sentenceCount: currentChunk.length,
chunkIndex: chunks.length,
splitReason: "end_of_document"
});
}
return chunks;
});
}
function getEmbeddings(sentences, getEmbedding) {
var promises = sentences.map(function(sentence) {
return getEmbedding(sentence);
});
return Promise.all(promises);
}
Semantic chunking produces the highest quality chunks because each chunk is topically coherent. The downside is cost and latency: you need to embed every sentence individually before you can split. For a 10,000-word document, that could be hundreds of embedding API calls. I recommend batching these calls and caching the results.
Recursive Chunking
Recursive chunking is a pragmatic approach that tries to preserve the largest meaningful units possible. You start by splitting on the coarsest boundary (double newlines for paragraphs), and if any resulting chunk exceeds the size limit, you split it on the next finest boundary (single newlines, then sentences, then characters).
function chunkRecursive(text, maxChunkSize, separators) {
maxChunkSize = maxChunkSize || 1000;
separators = separators || ["\n\n", "\n", ". ", " "];
if (text.length <= maxChunkSize) {
return [{
content: text,
chunkIndex: 0,
separatorUsed: "none"
}];
}
var separator = separators[0];
var remainingSeparators = separators.slice(1);
var sections = text.split(separator);
var chunks = [];
var currentChunk = "";
for (var i = 0; i < sections.length; i++) {
var section = sections[i];
var candidateChunk = currentChunk
? currentChunk + separator + section
: section;
if (candidateChunk.length > maxChunkSize) {
if (currentChunk.length > 0) {
// Current chunk is within limits, save it
if (currentChunk.length <= maxChunkSize) {
chunks.push({
content: currentChunk,
chunkIndex: chunks.length,
separatorUsed: separator
});
} else if (remainingSeparators.length > 0) {
// Recursively split with finer separators
var subChunks = chunkRecursive(
currentChunk, maxChunkSize, remainingSeparators
);
subChunks.forEach(function(sc) {
sc.chunkIndex = chunks.length;
chunks.push(sc);
});
} else {
// Last resort: force split
chunks.push({
content: currentChunk.substring(0, maxChunkSize),
chunkIndex: chunks.length,
separatorUsed: "force_split"
});
}
}
currentChunk = section;
} else {
currentChunk = candidateChunk;
}
}
if (currentChunk.length > 0) {
if (currentChunk.length <= maxChunkSize) {
chunks.push({
content: currentChunk,
chunkIndex: chunks.length,
separatorUsed: separator
});
} else if (remainingSeparators.length > 0) {
var finalSubChunks = chunkRecursive(
currentChunk, maxChunkSize, remainingSeparators
);
finalSubChunks.forEach(function(sc) {
sc.chunkIndex = chunks.length;
chunks.push(sc);
});
}
}
return chunks;
}
Recursive chunking is what LangChain popularized with its RecursiveCharacterTextSplitter, and for good reason. It handles heterogeneous content well because it adapts the splitting strategy to the actual structure of the text. A well-formatted document with clear paragraphs will get split on paragraph boundaries, while a wall of text will fall through to sentence or character splitting.
Handling Different Document Formats
Real-world RAG systems ingest documents in multiple formats. Each format requires specific preprocessing before chunking.
Markdown Documents
function preprocessMarkdown(markdown) {
var sections = [];
var currentSection = { heading: "", level: 0, content: [] };
var lines = markdown.split("\n");
for (var i = 0; i < lines.length; i++) {
var line = lines[i];
var headingMatch = line.match(/^(#{1,6})\s+(.+)$/);
if (headingMatch) {
if (currentSection.content.length > 0 || currentSection.heading) {
sections.push({
heading: currentSection.heading,
level: currentSection.level,
content: currentSection.content.join("\n").trim()
});
}
currentSection = {
heading: headingMatch[2],
level: headingMatch[1].length,
content: []
};
} else {
currentSection.content.push(line);
}
}
if (currentSection.content.length > 0 || currentSection.heading) {
sections.push({
heading: currentSection.heading,
level: currentSection.level,
content: currentSection.content.join("\n").trim()
});
}
return sections;
}
Markdown is the friendliest format for chunking because headings provide explicit section boundaries. I always split markdown on headings first, then apply size-based chunking within each section.
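A refinement I find worthwhile once sections are split: prepend the section heading to each chunk so the chunk stays self-describing after retrieval. A sketch, assuming the section shape produced by preprocessMarkdown above and any chunk function with a (text, maxSize) signature returning objects with a content field:

```javascript
function chunkSectionsWithHeadings(sections, chunkFn, maxSize) {
  var chunks = [];
  sections.forEach(function(section) {
    if (!section.content || section.content.trim().length === 0) return;
    chunkFn(section.content, maxSize).forEach(function(chunk) {
      // Heading prefix makes the chunk self-describing in isolation
      var prefix = section.heading ? section.heading + "\n\n" : "";
      chunks.push({
        content: prefix + chunk.content,
        sectionHeading: section.heading || null,
        chunkIndex: chunks.length
      });
    });
  });
  return chunks;
}
```

The prefix costs a few tokens per chunk but noticeably improves embedding quality for sections whose body text never repeats the topic named in the heading.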
HTML Documents
var cheerio = require("cheerio");
function preprocessHTML(html) {
var $ = cheerio.load(html);
// Remove scripts, styles, and navigation
$("script, style, nav, header, footer").remove();
var sections = [];
$("h1, h2, h3, h4, h5, h6").each(function() {
var heading = $(this).text().trim();
var level = parseInt(this.tagName.charAt(1));
var contentParts = [];
var sibling = $(this).next();
while (sibling.length && !sibling.is("h1, h2, h3, h4, h5, h6")) {
var text = sibling.text().trim();
if (text.length > 0) {
contentParts.push(text);
}
sibling = sibling.next();
}
sections.push({
heading: heading,
level: level,
content: contentParts.join("\n\n")
});
});
return sections;
}
Code Files
Code requires special handling because splitting inside a function or class definition will produce nonsensical chunks. I recommend splitting on top-level constructs: functions, classes, and module-level blocks.
function chunkCodeFile(code, language, maxChunkSize) {
maxChunkSize = maxChunkSize || 1500;
// Split on function/class boundaries
var patterns = {
javascript: /(?=\nfunction\s|\nvar\s+\w+\s*=\s*function|\nmodule\.exports)/g,
python: /(?=\ndef\s|\nclass\s)/g,
java: /(?=\npublic\s|\nprivate\s|\nprotected\s|\nclass\s)/g
};
var pattern = patterns[language];
if (!pattern) {
return chunkRecursive(code, maxChunkSize);
}
var sections = code.split(pattern).filter(function(s) {
return s.trim().length > 0;
});
var chunks = [];
for (var i = 0; i < sections.length; i++) {
var section = sections[i].trim();
if (section.length <= maxChunkSize) {
chunks.push({
content: section,
chunkIndex: chunks.length,
type: "code_block"
});
} else {
// Large function/class: split on blank lines within it
var subChunks = chunkRecursive(section, maxChunkSize, ["\n\n", "\n"]);
subChunks.forEach(function(sc) {
sc.chunkIndex = chunks.length;
sc.type = "code_block_fragment";
chunks.push(sc);
});
}
}
return chunks;
}
Overlap Strategies and Why They Matter
Overlap is the most underestimated parameter in chunking. Without overlap, critical information that spans a chunk boundary gets split across two chunks, and neither chunk contains the full context. Consider a sentence like "The API returns a 429 status code, which means you should implement exponential backoff." If the chunk boundary falls between "429 status code," and "which means," neither chunk is useful on its own.
There are three overlap strategies worth considering:
Fixed overlap uses a constant number of characters or tokens. Simple but effective. I recommend 10-15% of your chunk size as a starting point.
Sentence overlap keeps the last N sentences from the previous chunk at the beginning of the next chunk. This guarantees grammatically complete overlap.
Sliding window overlap creates chunks at regular intervals but with a large overlap, effectively creating a dense set of overlapping views over the document. This is expensive in storage but can improve retrieval recall significantly.
function chunkWithSlidingWindow(text, windowSize, stepSize) {
windowSize = windowSize || 1000;
stepSize = stepSize || 200; // High overlap: 80%
var chunks = [];
var start = 0;
while (start < text.length) {
var end = Math.min(start + windowSize, text.length);
chunks.push({
content: text.slice(start, end),
startIndex: start,
endIndex: end,
chunkIndex: chunks.length,
overlapRatio: 1 - (stepSize / windowSize)
});
start = start + stepSize;
}
return chunks;
}
The sliding window approach produces many more chunks, which increases storage costs and embedding API usage. But in my experience, it consistently delivers the best retrieval accuracy for complex questions that require information from multiple parts of a document.
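The storage math is easy to sanity-check before committing. Because the loop above emits one chunk per step until start passes the end of the text, the chunk count is roughly ceil(length / stepSize):

```javascript
// Estimates how many chunks the sliding window produces versus plain
// non-overlapping chunks of the same window size
function slidingWindowCost(textLength, windowSize, stepSize) {
  var slidingChunks = Math.ceil(textLength / stepSize);
  var fixedChunks = Math.ceil(textLength / windowSize);
  return {
    slidingChunks: slidingChunks,
    fixedChunks: fixedChunks,
    storageMultiplier: slidingChunks / fixedChunks
  };
}
```

With the defaults above (1000-character window, 200-character step), a 10,000-character document produces 50 chunks instead of 10: a 5x increase in storage and embedding spend.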
Chunk Size Optimization
Chunk size is a tuning parameter, not a fixed constant. The optimal size depends on your embedding model, your query patterns, and your content type.
Too small (under 100 tokens): Chunks lose context. A chunk like "returns a 429 error" tells you nothing about which API, under what conditions, or how to handle it. The embedding for this chunk will match many irrelevant queries about 429 errors.
Too large (over 1000 tokens): Chunks become diluted. A chunk containing three different topics will have an embedding that is a vague average of all three topics, making it a mediocre match for any specific query.
The sweet spot: For most use cases, I have found 200-500 tokens to work best. Technical documentation tends to work better at the higher end (400-500 tokens) because concepts require more context. FAQ-style content works well at 150-250 tokens because each question-answer pair is naturally concise.
function analyzeChunkSizes(chunks) {
var sizes = chunks.map(function(c) { return c.content.length; });
sizes.sort(function(a, b) { return a - b; });
var sum = sizes.reduce(function(acc, s) { return acc + s; }, 0);
return {
count: sizes.length,
mean: Math.round(sum / sizes.length),
median: sizes[Math.floor(sizes.length / 2)],
min: sizes[0],
max: sizes[sizes.length - 1],
stdDev: Math.round(Math.sqrt(
sizes.reduce(function(acc, s) {
return acc + Math.pow(s - (sum / sizes.length), 2);
}, 0) / sizes.length
))
};
}
High standard deviation in chunk sizes is a red flag. It means your chunking is inconsistent, which leads to unpredictable retrieval behavior. Aim for a coefficient of variation (stdDev / mean) under 0.5.
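That coefficient-of-variation check is cheap to compute directly from the size list:

```javascript
// Coefficient of variation (stdDev / mean) for a list of chunk sizes;
// values above ~0.5 suggest inconsistent chunking
function coefficientOfVariation(sizes) {
  var mean = sizes.reduce(function(a, b) { return a + b; }, 0) / sizes.length;
  var variance = sizes.reduce(function(acc, s) {
    return acc + Math.pow(s - mean, 2);
  }, 0) / sizes.length;
  return Math.sqrt(variance) / mean;
}
```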
Metadata Preservation Per Chunk
Every chunk should carry metadata about where it came from. Without metadata, retrieved chunks are just floating text with no provenance. This metadata is critical for citation, debugging, and re-ranking.
function enrichChunkMetadata(chunk, documentMeta, sections) {
return {
content: chunk.content,
metadata: {
// Document-level metadata
sourceDocument: documentMeta.filename,
documentTitle: documentMeta.title,
documentType: documentMeta.type,
documentUrl: documentMeta.url || null,
// Position metadata
chunkIndex: chunk.chunkIndex,
totalChunks: null, // Set after all chunks are created
pageNumber: documentMeta.pageNumber || null,
// Section metadata
sectionHeading: findParentHeading(chunk, sections),
sectionPath: buildSectionPath(chunk, sections),
// Content metadata
charCount: chunk.content.length,
wordCount: chunk.content.split(/\s+/).length,
// Timestamps
indexedAt: new Date().toISOString(),
documentModified: documentMeta.lastModified || null
}
};
}
function findParentHeading(chunk, sections) {
for (var i = sections.length - 1; i >= 0; i--) {
if (sections[i].startIndex <= chunk.startIndex) {
return sections[i].heading;
}
}
return null;
}
function buildSectionPath(chunk, sections) {
var path = [];
var currentLevel = Infinity;
for (var i = sections.length - 1; i >= 0; i--) {
if (sections[i].startIndex <= chunk.startIndex && sections[i].level < currentLevel) {
path.unshift(sections[i].heading);
currentLevel = sections[i].level;
}
}
return path.join(" > ");
}
The sectionPath is particularly valuable. When your RAG system retrieves a chunk, the LLM can see that it came from "User Guide > Authentication > OAuth2 Flow," which provides hierarchical context even if the chunk itself does not mention authentication.
Special Handling for Tables, Code Blocks, and Lists
Certain content structures must never be split mid-element. A table row without its header is meaningless. A code block split in the middle will not compile. A numbered list with items 1-3 in one chunk and 4-6 in another loses its logical structure.
function extractProtectedBlocks(text) {
var blocks = [];
var placeholder = "___PROTECTED_BLOCK_{index}___";
// Protect code blocks
var codeBlockRegex = /```[\s\S]*?```/g;
var match;
while ((match = codeBlockRegex.exec(text)) !== null) {
blocks.push({
content: match[0],
type: "code_block",
index: blocks.length
});
}
// Protect tables (markdown)
var tableRegex = /\|.+\|\n\|[-\s|:]+\|\n(\|.+\|\n)*/g;
while ((match = tableRegex.exec(text)) !== null) {
blocks.push({
content: match[0],
type: "table",
index: blocks.length
});
}
// Replace protected blocks with placeholders
var processedText = text;
for (var i = blocks.length - 1; i >= 0; i--) {
processedText = processedText.replace(
blocks[i].content,
placeholder.replace("{index}", i)
);
}
return {
text: processedText,
blocks: blocks,
restore: function(chunkedText) {
var restored = chunkedText;
for (var j = 0; j < blocks.length; j++) {
// Use a replacer function so "$" sequences inside the block
// are not interpreted as special replacement patterns
restored = restored.replace(
placeholder.replace("{index}", j),
function() { return blocks[j].content; }
);
}
return restored;
}
};
}
The strategy is simple: extract protected blocks before chunking, replace them with placeholders, chunk the remaining text, then restore the blocks. If a protected block exceeds the chunk size limit, keep it as its own chunk rather than splitting it.
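The oversized-block rule is worth making explicit in code. Here is a sketch that consumes the result of extractProtectedBlocks above and promotes any block that cannot fit inside a normal chunk to a standalone chunk:

```javascript
// Protected blocks larger than the chunk limit become chunks of
// their own instead of being split mid-table or mid-code
function blocksAsStandaloneChunks(protectedResult, maxChunkSize) {
  return protectedResult.blocks
    .filter(function(block) { return block.content.length > maxChunkSize; })
    .map(function(block) {
      return {
        content: block.content,
        type: block.type,
        standalone: true
      };
    });
}
```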
Hierarchical Indexing with Parent-Child Chunks
One of the most effective advanced techniques is hierarchical chunking, where you create both large "parent" chunks and smaller "child" chunks. During retrieval, you search against the fine-grained child chunks for precision, but you return the parent chunk to the LLM for context.
function createHierarchicalChunks(text, parentSize, childSize) {
parentSize = parentSize || 2000;
childSize = childSize || 500;
var parentChunks = chunkByParagraphs(text, parentSize);
var hierarchy = [];
for (var i = 0; i < parentChunks.length; i++) {
var parent = parentChunks[i];
var parentId = "parent_" + i;
var children = chunkBySentences(parent.content, childSize, 1);
hierarchy.push({
id: parentId,
role: "parent",
content: parent.content,
childIds: children.map(function(_, j) {
return parentId + "_child_" + j;
})
});
for (var j = 0; j < children.length; j++) {
hierarchy.push({
id: parentId + "_child_" + j,
role: "child",
content: children[j].content,
parentId: parentId
});
}
}
return hierarchy;
}
function retrieveWithHierarchy(queryEmbedding, index, topK) {
topK = topK || 5;
// Search only child chunks for precision
var childResults = index.search(queryEmbedding, {
filter: { role: "child" },
topK: topK
});
// Deduplicate by parent and return parent content
var seenParents = {};
var parentChunks = [];
for (var i = 0; i < childResults.length; i++) {
var parentId = childResults[i].metadata.parentId;
if (!seenParents[parentId]) {
seenParents[parentId] = true;
parentChunks.push(index.getById(parentId));
}
}
return parentChunks;
}
This pattern gives you the best of both worlds. The child chunks have focused embeddings that match specific queries well. The parent chunks provide the LLM with enough surrounding context to generate a coherent answer.
Evaluating Chunk Quality
You cannot improve what you do not measure. Here is a practical evaluation framework for chunk quality.
function evaluateChunkQuality(chunks, testQueries, getEmbedding, expectedChunkIndices) {
var results = {
totalQueries: testQueries.length,
hits: 0,
misses: 0,
averageRank: 0,
details: []
};
return Promise.all(testQueries.map(function(query, queryIndex) {
return getEmbedding(query).then(function(queryEmbedding) {
// Assumes each chunk arrives with a precomputed embedding attached
var similarities = chunks.map(function(chunk, chunkIdx) {
return {
chunkIndex: chunkIdx,
similarity: cosineSimilarity(queryEmbedding, chunk.embedding)
};
});
similarities.sort(function(a, b) {
return b.similarity - a.similarity;
});
var expectedIndex = expectedChunkIndices[queryIndex];
var rank = -1;
for (var i = 0; i < similarities.length; i++) {
if (similarities[i].chunkIndex === expectedIndex) {
rank = i + 1;
break;
}
}
var hit = rank > 0 && rank <= 5; // Top-5 retrieval; rank of -1 means not found
if (hit) results.hits++;
else results.misses++;
results.details.push({
query: query,
expectedChunk: expectedIndex,
actualRank: rank,
topSimilarity: similarities[0].similarity,
hit: hit
});
return rank;
});
})).then(function(ranks) {
var validRanks = ranks.filter(function(r) { return r > 0; });
results.averageRank = validRanks.length > 0
? validRanks.reduce(function(a, b) { return a + b; }, 0) / validRanks.length
: null;
results.retrievalAccuracy = results.hits / results.totalQueries;
return results;
});
}
Build a test set of 20-50 queries with known correct chunks. Run your evaluation after every change to your chunking strategy. A retrieval accuracy below 80% at top-5 means your chunking needs work.
Complete Working Example
Here is a complete Node.js chunking library that ties together all the strategies discussed above.
// chunking-pipeline.js
var natural = require("natural");
var cheerio = require("cheerio");
var crypto = require("crypto");
var sentenceTokenizer = new natural.SentenceTokenizer();
// ============================================================
// Core Chunking Strategies
// ============================================================
var ChunkingStrategies = {
fixed: function(text, options) {
var chunkSize = options.chunkSize || 1000;
var overlap = options.overlap || 200;
var chunks = [];
var start = 0;
while (start < text.length) {
var end = Math.min(start + chunkSize, text.length);
chunks.push({
content: text.slice(start, end),
strategy: "fixed",
startIndex: start,
endIndex: end
});
start = start + chunkSize - overlap;
}
return chunks;
},
sentence: function(text, options) {
var maxSize = options.chunkSize || 1000;
var overlapCount = options.overlapSentences || 2;
var sentences = sentenceTokenizer.tokenize(text);
var chunks = [];
var current = [];
var currentLen = 0;
for (var i = 0; i < sentences.length; i++) {
if (currentLen + sentences[i].length > maxSize && current.length > 0) {
chunks.push({
content: current.join(" "),
strategy: "sentence",
sentenceCount: current.length
});
var keep = Math.max(0, current.length - overlapCount);
current = current.slice(keep);
currentLen = current.join(" ").length;
}
current.push(sentences[i]);
currentLen = currentLen + sentences[i].length + 1;
}
if (current.length > 0) {
chunks.push({
content: current.join(" "),
strategy: "sentence",
sentenceCount: current.length
});
}
return chunks;
},
paragraph: function(text, options) {
var maxSize = options.chunkSize || 1500;
var minLength = options.minParagraphLength || 30;
var paragraphs = text.split(/\n\s*\n/).filter(function(p) {
return p.trim().length >= minLength;
});
var chunks = [];
var current = [];
var currentLen = 0;
for (var i = 0; i < paragraphs.length; i++) {
var para = paragraphs[i].trim();
if (currentLen + para.length > maxSize && current.length > 0) {
chunks.push({
content: current.join("\n\n"),
strategy: "paragraph",
paragraphCount: current.length
});
current = [current[current.length - 1]];
currentLen = current[0].length;
}
current.push(para);
currentLen = currentLen + para.length + 2;
}
if (current.length > 0) {
chunks.push({
content: current.join("\n\n"),
strategy: "paragraph",
paragraphCount: current.length
});
}
return chunks;
},
recursive: function(text, options) {
var maxSize = options.chunkSize || 1000;
var separators = ["\n\n", "\n", ". ", " "];
function splitRecursive(txt, seps) {
if (txt.length <= maxSize) {
return [{ content: txt, strategy: "recursive" }];
}
if (seps.length === 0) {
return [{ content: txt.substring(0, maxSize), strategy: "recursive" }];
}
var sep = seps[0];
var parts = txt.split(sep);
var results = [];
var current = "";
for (var i = 0; i < parts.length; i++) {
var candidate = current ? current + sep + parts[i] : parts[i];
if (candidate.length > maxSize) {
if (current) {
if (current.length <= maxSize) {
results.push({ content: current, strategy: "recursive" });
} else {
var sub = splitRecursive(current, seps.slice(1));
results = results.concat(sub);
}
}
current = parts[i];
} else {
current = candidate;
}
}
if (current) {
if (current.length <= maxSize) {
results.push({ content: current, strategy: "recursive" });
} else {
var finalSub = splitRecursive(current, seps.slice(1));
results = results.concat(finalSub);
}
}
return results;
}
return splitRecursive(text, separators);
}
};
// ============================================================
// Document Format Preprocessors
// ============================================================
var Preprocessors = {
markdown: function(text) {
var sections = [];
var currentHeading = "";
var currentLevel = 0;
var currentContent = [];
var lines = text.split("\n");
for (var i = 0; i < lines.length; i++) {
var match = lines[i].match(/^(#{1,6})\s+(.+)$/);
if (match) {
if (currentContent.length > 0 || currentHeading) {
sections.push({
heading: currentHeading,
level: currentLevel,
content: currentContent.join("\n").trim()
});
}
currentHeading = match[2];
currentLevel = match[1].length;
currentContent = [];
} else {
currentContent.push(lines[i]);
}
}
if (currentContent.length > 0 || currentHeading) {
sections.push({
heading: currentHeading,
level: currentLevel,
content: currentContent.join("\n").trim()
});
}
return sections;
},
html: function(html) {
var $ = cheerio.load(html);
$("script, style, nav, header, footer, aside").remove();
var sections = [];
$("h1, h2, h3, h4, h5, h6").each(function() {
var heading = $(this).text().trim();
var level = parseInt(this.tagName.charAt(1));
var parts = [];
var sibling = $(this).next();
while (sibling.length && !sibling.is("h1,h2,h3,h4,h5,h6")) {
var txt = sibling.text().trim();
if (txt) parts.push(txt);
sibling = sibling.next();
}
sections.push({ heading: heading, level: level, content: parts.join("\n\n") });
});
// Fallback: if no headings found, return body text
if (sections.length === 0) {
sections.push({
heading: $("title").text() || "",
level: 1,
content: $("body").text().trim()
});
}
return sections;
},
plaintext: function(text) {
return [{ heading: "", level: 0, content: text }];
}
};
// ============================================================
// Chunking Pipeline
// ============================================================
function ChunkingPipeline(options) {
this.strategy = options.strategy || "sentence";
this.chunkSize = options.chunkSize || 800;
this.overlap = options.overlap || 100;
this.overlapSentences = options.overlapSentences || 2;
this.preserveBlocks = options.preserveBlocks !== false;
}
ChunkingPipeline.prototype.process = function(text, format, documentMeta) {
  format = format || "plaintext";
  documentMeta = documentMeta || {};
  // Step 1: Detect and protect special blocks (code fences, tables)
  var protectedResult = null;
  if (this.preserveBlocks) {
    protectedResult = this._extractProtectedBlocks(text);
    text = protectedResult.text;
  }
  // Step 2: Preprocess by format
  var preprocessor = Preprocessors[format] || Preprocessors.plaintext;
  var sections = preprocessor(text);
  // Step 3: Chunk each section independently
  var allChunks = [];
  var chunkFn = ChunkingStrategies[this.strategy];
  if (!chunkFn) {
    throw new Error("Unknown chunking strategy: " + this.strategy);
  }
  var self = this;
  var globalIndex = 0;
  for (var s = 0; s < sections.length; s++) {
    var section = sections[s];
    if (!section.content || section.content.trim().length === 0) continue;
    var sectionChunks = chunkFn(section.content, {
      chunkSize: self.chunkSize,
      overlap: self.overlap,
      overlapSentences: self.overlapSentences
    });
    for (var c = 0; c < sectionChunks.length; c++) {
      var chunk = sectionChunks[c];
      // Restore protected blocks before measuring or storing content
      if (protectedResult) {
        chunk.content = protectedResult.restore(chunk.content);
      }
      // Enrich with metadata
      chunk.chunkIndex = globalIndex;
      chunk.sectionHeading = section.heading || null;
      chunk.sectionLevel = section.level || 0;
      chunk.metadata = {
        source: documentMeta.filename || "unknown",
        title: documentMeta.title || null,
        url: documentMeta.url || null,
        format: format,
        chunkId: self._generateChunkId(chunk.content),
        charCount: chunk.content.length,
        wordCount: chunk.content.split(/\s+/).filter(Boolean).length,
        processedAt: new Date().toISOString()
      };
      allChunks.push(chunk);
      globalIndex++;
    }
  }
  // Set totalChunks on every chunk now that the final count is known
  for (var t = 0; t < allChunks.length; t++) {
    allChunks[t].metadata.totalChunks = allChunks.length;
  }
  return allChunks;
};
ChunkingPipeline.prototype._extractProtectedBlocks = function(text) {
  var blocks = [];
  var processed = text;
  // Protect fenced code blocks
  processed = processed.replace(/```[\s\S]*?```/g, function(match) {
    var idx = blocks.length;
    blocks.push({ content: match, type: "code" });
    return "___BLOCK_" + idx + "___";
  });
  // Protect markdown tables
  processed = processed.replace(
    /\|.+\|\n\|[-\s|:]+\|\n(\|.+\|\n)*/g,
    function(match) {
      var idx = blocks.length;
      blocks.push({ content: match, type: "table" });
      return "___BLOCK_" + idx + "___";
    }
  );
  return {
    text: processed,
    blocks: blocks,
    restore: function(chunkedText) {
      var result = chunkedText;
      for (var i = 0; i < blocks.length; i++) {
        result = result.replace("___BLOCK_" + i + "___", blocks[i].content);
      }
      return result;
    }
  };
};
ChunkingPipeline.prototype._generateChunkId = function(content) {
  return crypto.createHash("md5").update(content).digest("hex").substring(0, 12);
};
// ============================================================
// Quality Evaluation
// ============================================================
function evaluateChunks(chunks) {
  var sizes = chunks.map(function(c) { return c.content.length; });
  var totalSize = sizes.reduce(function(a, b) { return a + b; }, 0);
  var meanSize = totalSize / sizes.length;
  var variance = sizes.reduce(function(acc, s) {
    return acc + Math.pow(s - meanSize, 2);
  }, 0) / sizes.length;
  var stdDev = Math.sqrt(variance);
  sizes.sort(function(a, b) { return a - b; });
  var emptyChunks = chunks.filter(function(c) {
    return c.content.trim().length < 20;
  });
  return {
    totalChunks: chunks.length,
    totalCharacters: totalSize,
    meanChunkSize: Math.round(meanSize),
    medianChunkSize: sizes[Math.floor(sizes.length / 2)],
    minChunkSize: sizes[0],
    maxChunkSize: sizes[sizes.length - 1],
    stdDev: Math.round(stdDev),
    coefficientOfVariation: (stdDev / meanSize).toFixed(3),
    emptyOrTinyChunks: emptyChunks.length,
    qualityScore: calculateQualityScore(meanSize, stdDev, emptyChunks.length, chunks.length)
  };
}
function calculateQualityScore(meanSize, stdDev, emptyCount, totalCount) {
  var score = 100;
  // Penalize extreme mean sizes
  if (meanSize < 100) score -= 20;
  else if (meanSize > 3000) score -= 15;
  // Penalize high size variance
  var cv = stdDev / meanSize;
  if (cv > 0.8) score -= 25;
  else if (cv > 0.5) score -= 10;
  // Penalize empty/tiny chunks
  var emptyRatio = emptyCount / totalCount;
  if (emptyRatio > 0.1) score -= 20;
  else if (emptyRatio > 0.05) score -= 10;
  return Math.max(0, Math.min(100, score));
}
// ============================================================
// Usage Example
// ============================================================
function main() {
  var sampleMarkdown = [
    "# Getting Started with Express.js",
    "",
    "Express.js is a minimal web framework for Node.js. It provides a robust",
    "set of features for building web applications and APIs. Express has been",
    "the de facto standard for Node.js web development for over a decade.",
    "",
    "## Installation",
    "",
    "Install Express using npm. You will also want to install nodemon for",
    "development to automatically restart your server when files change.",
    "",
    "```",
    "npm install express",
    "npm install --save-dev nodemon",
    "```",
    "",
    "## Creating Your First Server",
    "",
    "Create a file called app.js in your project root. Import Express and",
    "create an application instance. Define your routes and start listening",
    "on a port.",
    "",
    "The basic pattern is to create an Express app, define middleware and",
    "routes, then call app.listen() with a port number. Express will handle",
    "incoming HTTP requests and route them to the appropriate handler.",
    "",
    "## Middleware",
    "",
    "Middleware functions have access to the request object, the response",
    "object, and the next middleware function. They can execute code, modify",
    "the request and response, end the request-response cycle, or call the",
    "next middleware. Middleware is the backbone of Express applications."
  ].join("\n");
  var pipeline = new ChunkingPipeline({
    strategy: "sentence",
    chunkSize: 500,
    overlapSentences: 1,
    preserveBlocks: true
  });
  var chunks = pipeline.process(sampleMarkdown, "markdown", {
    filename: "express-guide.md",
    title: "Getting Started with Express.js",
    url: "https://example.com/express-guide"
  });
  console.log("=== Chunking Results ===\n");
  for (var i = 0; i < chunks.length; i++) {
    console.log("Chunk " + i + " [" + chunks[i].sectionHeading + "]:");
    console.log(chunks[i].content.substring(0, 120) + "...");
    console.log("Words: " + chunks[i].metadata.wordCount);
    console.log("---");
  }
  console.log("\n=== Quality Report ===\n");
  var quality = evaluateChunks(chunks);
  console.log(JSON.stringify(quality, null, 2));
}
// Only run the demo when executed directly, not when required as a module
if (require.main === module) {
  main();
}
module.exports = {
  ChunkingPipeline: ChunkingPipeline,
  ChunkingStrategies: ChunkingStrategies,
  Preprocessors: Preprocessors,
  evaluateChunks: evaluateChunks
};
Save this as chunking-pipeline.js and run it with node chunking-pipeline.js to see the chunking and quality evaluation in action.
Common Issues and Troubleshooting
1. Chunks Are Empty or Contain Only Whitespace
Error: Chunk at index 7 has empty content after trimming
This happens when your separator pattern matches multiple consecutive times, creating empty segments. The fix is to filter out empty segments after splitting:
var segments = text.split("\n\n").filter(function(s) {
  return s.trim().length > 0;
});
Also check for documents that use \r\n line endings. Normalize line endings before chunking with text.replace(/\r\n/g, "\n").
2. Protected Blocks Exceed Chunk Size Limit
Warning: Code block at line 142 is 3,847 characters, exceeding chunk size of 1,000
A single code block or table can exceed your chunk size limit. You have two options: increase the chunk size limit for chunks that contain protected blocks, or treat oversized protected blocks as standalone chunks. I recommend the latter:
if (block.content.length > maxChunkSize) {
  // Keep the block as a standalone chunk rather than splitting code mid-block
  chunks.push({
    content: block.content,
    strategy: "protected_block",
    oversized: true
  });
}
3. Sentence Tokenizer Fails on Non-Standard Text
TypeError: Cannot read property 'length' of undefined
at SentenceTokenizer.tokenize
The Natural library's sentence tokenizer can choke on text with unusual punctuation, URLs, or numeric sequences like "v2.3.1." that look like sentence endings. Preprocess your text to handle these cases:
function sanitizeForSentenceTokenization(text) {
  // Protect the dots inside version numbers like "v2.3.1" so they are
  // not mistaken for sentence boundaries
  text = text.replace(/v\d+\.\d+\.\d+/g, function(version) {
    return version.replace(/\./g, "_PROTECTED_DOT_");
  });
  // Protect the dots inside URLs
  text = text.replace(/https?:\/\/[^\s]+/g, function(url) {
    return url.replace(/\./g, "_PROTECTED_DOT_");
  });
  return text;
}
After tokenizing, restore the original text by replacing _PROTECTED_DOT_ with a period in each sentence: sentence.replace(/_PROTECTED_DOT_/g, ".").
4. Duplicate Content from Excessive Overlap
Issue: Vector search returns the same passage from 4 different chunks, all with similarity > 0.95
If your overlap is too large relative to your chunk size, you end up with chunks that are mostly identical. This wastes storage and pollutes search results. Keep your overlap under 20% of your chunk size, or deduplicate near-identical results at query time by comparing their embeddings:
function deduplicateResults(results, similarityThreshold) {
  similarityThreshold = similarityThreshold || 0.95;
  var seen = [];
  return results.filter(function(result) {
    for (var i = 0; i < seen.length; i++) {
      // cosineSimilarity(a, b) is assumed to be defined elsewhere in your codebase
      if (cosineSimilarity(result.embedding, seen[i]) > similarityThreshold) {
        return false;
      }
    }
    seen.push(result.embedding);
    return true;
  });
}
5. Memory Issues with Large Documents
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
Processing a 500-page PDF as a single string will exhaust Node.js memory. Stream-process large documents by splitting them into pages or sections first, then chunking each section independently:
function processLargeDocument(pages, pipeline, documentMeta) {
  var allChunks = [];
  for (var i = 0; i < pages.length; i++) {
    var pageChunks = pipeline.process(pages[i], "plaintext", {
      filename: documentMeta.filename,
      title: documentMeta.title,
      pageNumber: i + 1
    });
    allChunks = allChunks.concat(pageChunks);
  }
  // Re-index after combining so chunkIndex and totalChunks span the whole document
  for (var j = 0; j < allChunks.length; j++) {
    allChunks[j].chunkIndex = j;
    allChunks[j].metadata.totalChunks = allChunks.length;
  }
  return allChunks;
}
Best Practices
Start with sentence-based chunking and a 400-token target size. This is the best default for most use cases. Only switch strategies when you have evidence that another approach performs better on your specific data.
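The pipeline in this article measures chunk size in characters, while embedding models count tokens. For exact counts use the tiktoken package from the prerequisites; for quick budgeting, a rule of thumb of roughly four characters per English token is close enough. This sketch (the 4:1 ratio is an approximation, not a tokenizer) maps a token budget onto the character-based chunkSize option:

```javascript
// Rough token estimate for English prose: about four characters per
// token. For exact counts, use the tiktoken package instead; this
// heuristic is only an approximation.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Map a token budget onto the character-based chunkSize option that
// ChunkingPipeline expects.
function tokensToChunkSize(tokenBudget) {
  return tokenBudget * 4;
}

console.log(tokensToChunkSize(400)); // 1600
```

So a 400-token target translates to a chunkSize of roughly 1600 characters. Re-check the ratio for code-heavy or non-English corpora, where tokens run shorter.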
Always include overlap between chunks. A 10-15% overlap prevents information loss at chunk boundaries. Sentence overlap is preferable to character overlap because it preserves grammatical completeness.
Preserve metadata aggressively. Every chunk should carry its source document, section heading, page number, and position within the document. This metadata is essential for citation, debugging, and contextual re-ranking.
Never split code blocks, tables, or structured lists. These elements lose their meaning when fragmented. Extract them as protected blocks before chunking, and keep them intact even if they exceed the chunk size limit.
Build a retrieval evaluation set early. Create 30-50 test queries with known correct answers and measure retrieval accuracy at top-5. Run this evaluation whenever you change your chunking strategy, chunk size, or overlap. Without measurement, you are tuning blind.
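Here is a minimal sketch of that evaluation loop. The retrieve function is a stand-in for whatever queries your vector store, and the expectedChunkIds come from hand-labeling; both names are placeholders rather than part of the pipeline above:

```javascript
// Minimal recall@k evaluation. "retrieve" is assumed to be your own
// function returning the top-k chunk IDs for a query; each test case
// lists the chunk ID(s) that contain a correct answer.
function evaluateRetrieval(testSet, retrieve, k) {
  k = k || 5;
  var hits = 0;
  for (var i = 0; i < testSet.length; i++) {
    var retrieved = retrieve(testSet[i].query, k);
    var found = testSet[i].expectedChunkIds.some(function(id) {
      return retrieved.indexOf(id) !== -1;
    });
    if (found) hits++;
  }
  return { recallAtK: hits / testSet.length, k: k, queries: testSet.length };
}

// Usage with a stub retriever that always returns the same IDs:
var testSet = [
  { query: "How do I install Express?", expectedChunkIds: ["a1b2c3"] },
  { query: "What is middleware?", expectedChunkIds: ["d4e5f6"] }
];
var stubRetrieve = function(query, k) { return ["a1b2c3", "zzz"]; };
console.log(evaluateRetrieval(testSet, stubRetrieve).recallAtK); // 0.5
```

Run this against every chunking configuration you try and keep the scores side by side.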
Normalize text before chunking. Standardize line endings, collapse excessive whitespace, remove invisible Unicode characters, and convert smart quotes to straight quotes. Inconsistent formatting creates inconsistent chunks.
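A normalization pass can be as simple as a chain of replacements. The character lists below are a starting-point assumption; extend them for whatever your corpus actually contains:

```javascript
// One way to normalize text before chunking. The specific character
// classes here are assumptions to adapt to your documents.
function normalizeText(text) {
  return text
    .replace(/\r\n?/g, "\n")                    // standardize line endings
    .replace(/[\u200B\u200C\u200D\uFEFF]/g, "") // strip zero-width characters
    .replace(/[\u2018\u2019]/g, "'")            // smart single quotes -> straight
    .replace(/[\u201C\u201D]/g, '"')            // smart double quotes -> straight
    .replace(/[ \t]+/g, " ")                    // collapse runs of spaces and tabs
    .replace(/\n{3,}/g, "\n\n")                 // cap consecutive blank lines
    .trim();
}
```

Run this once per document, before the pipeline's protected-block extraction, so every downstream step sees consistent input.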
Use hierarchical chunking for long documents. Create parent chunks (1500-2000 characters) and child chunks (300-500 characters). Search against children for precision, return parents to the LLM for context. This consistently outperforms single-level chunking on complex queries.
Monitor chunk size distribution in production. A coefficient of variation above 0.5 indicates your chunking is inconsistent. Log chunk size statistics and alert on drift, because document format changes will silently degrade your chunking quality.
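One lightweight way to implement that monitoring is to compare each batch's evaluateChunks report against a stored baseline. The threshold values here are assumptions to tune for your corpus:

```javascript
// Drift check over two evaluateChunks-style reports. Flags when the
// coefficient of variation crosses 0.5, rises sharply, or the mean
// chunk size shifts by more than 25% (all thresholds are assumptions).
function checkChunkingDrift(currentStats, baselineStats) {
  var cvNow = parseFloat(currentStats.coefficientOfVariation);
  var cvBase = parseFloat(baselineStats.coefficientOfVariation);
  var meanShift = Math.abs(currentStats.meanChunkSize - baselineStats.meanChunkSize) /
    baselineStats.meanChunkSize;
  return {
    cvAboveThreshold: cvNow > 0.5,
    cvIncreased: cvNow - cvBase > 0.1,
    meanShifted: meanShift > 0.25,
    shouldAlert: cvNow > 0.5 || cvNow - cvBase > 0.1 || meanShift > 0.25
  };
}

var report = checkChunkingDrift(
  { coefficientOfVariation: "0.62", meanChunkSize: 480 },
  { coefficientOfVariation: "0.31", meanChunkSize: 510 }
);
console.log(report.shouldAlert); // true
```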
Version your chunking configuration. When you change chunk size, overlap, or strategy, you need to re-embed all your documents. Track your chunking parameters alongside your embedding model version so you can reproduce results and roll back if needed.
References
- OpenAI Embeddings Guide - Official documentation on embedding models and best practices for text preparation.
- LangChain Text Splitters Documentation - Reference implementation of recursive character text splitting and other strategies.
- Pinecone Chunking Strategies Guide - Comprehensive overview of chunking approaches with benchmarks.
- Natural Language Toolkit for Node.js - The natural npm package used for sentence tokenization in this article.
- Tiktoken for JavaScript - Token counting library compatible with OpenAI models.
- Cheerio HTML Parser - Fast HTML parsing library used for preprocessing HTML documents before chunking.