Building a Research Agent with Web Search
Build a research agent that searches the web, reads sources, synthesizes findings, and produces cited reports in Node.js.
A research agent automates the process of answering complex questions by searching the web, reading multiple sources, reconciling conflicting information, and producing a structured report with citations. This article walks through the full architecture and implementation of a production-grade research agent in Node.js, covering everything from query planning to fact-checking and output formatting.
Prerequisites
- Node.js v18 or later
- An API key for a search provider (Brave Search API or SerpAPI)
- An OpenAI API key (or Anthropic API key) for LLM-powered synthesis
- Familiarity with async programming in Node.js
- Basic understanding of LLM tool calling and agent loops
- npm packages: axios, cheerio, openai, node-cache
Install the dependencies:
npm install axios cheerio openai node-cache
What a Research Agent Does
A research agent follows a loop that mirrors how a human researcher works: receive a question, break it into searchable queries, run those searches, read the results, extract relevant information, identify gaps, run follow-up searches, and finally synthesize everything into a coherent answer with sources.
The core loop looks like this:
Query → Plan → Search → Read → Summarize → Identify Gaps → Search Again → Synthesize → Report
This is fundamentally different from a single LLM call. A single call relies entirely on training data. A research agent grounds its answers in live, retrievable sources. That distinction matters when you need current information, verifiable facts, or coverage of a niche topic the model was never trained on.
The agent must handle real-world messiness: paywalled sites, rate limits, conflicting sources, dead links, irrelevant results, and duplicate content. A naive implementation that just searches and concatenates results produces garbage. A good research agent is opinionated about what it reads, skeptical about what it finds, and disciplined about what it reports.
Designing the Research Agent Architecture
The agent breaks down into five components, each with a clear responsibility:
Planner — Takes the user's question and generates a set of search queries. A question like "What are the performance differences between PostgreSQL and MySQL for analytical workloads?" might produce queries like "PostgreSQL vs MySQL OLAP performance benchmarks 2024", "PostgreSQL analytical query optimization", and "MySQL columnar storage performance".
Searcher — Executes search queries against a web search API and returns ranked results with titles, URLs, and snippets.
Reader — Fetches web pages, strips HTML boilerplate, and extracts the meaningful text content.
Synthesizer — Takes the collected source summaries and produces a coherent answer that addresses the original question, noting where sources agree and disagree.
Reporter — Formats the final output as a structured document with inline citations and a references section.
// agent/researchAgent.js
var Planner = require("./planner");
var Searcher = require("./searcher");
var Reader = require("./reader");
var Synthesizer = require("./synthesizer");
var Reporter = require("./reporter");
function ResearchAgent(config) {
this.planner = new Planner(config);
this.searcher = new Searcher(config);
this.reader = new Reader(config);
this.synthesizer = new Synthesizer(config);
this.reporter = new Reporter(config);
this.maxIterations = config.maxIterations || 3;
this.maxSourcesPerQuery = config.maxSourcesPerQuery || 5;
this.sources = [];
this.summaries = [];
}
ResearchAgent.prototype.research = function (question) {
var self = this;
return self.planner.generateQueries(question)
.then(function (queries) {
return self._iterativeSearch(question, queries, 0);
})
.then(function () {
return self.synthesizer.synthesize(question, self.summaries);
})
.then(function (synthesis) {
return self.reporter.format(question, synthesis, self.sources);
});
};
ResearchAgent.prototype._iterativeSearch = function (question, queries, iteration) {
var self = this;
if (iteration >= self.maxIterations) {
return Promise.resolve();
}
return self._executeSearchBatch(queries)
.then(function (results) {
return self._readAndSummarize(results);
})
.then(function () {
return self.planner.identifyGaps(question, self.summaries);
})
.then(function (gaps) {
if (gaps.length === 0) {
return Promise.resolve();
}
return self._iterativeSearch(question, gaps, iteration + 1);
});
};
module.exports = ResearchAgent;
Implementing Web Search Tools
The searcher wraps a web search API. I recommend Brave Search API for its generous free tier, but SerpAPI works just as well. The key design decision is normalizing results into a consistent format regardless of which provider you use.
// agent/searcher.js
var axios = require("axios");
var NodeCache = require("node-cache");
function Searcher(config) {
this.apiKey = config.searchApiKey;
this.provider = config.searchProvider || "brave";
this.cache = new NodeCache({ stdTTL: 3600 });
this.rateLimiter = {
lastCall: 0,
minInterval: 1000
};
}
Searcher.prototype.search = function (query) {
var self = this;
var cacheKey = "search:" + query.toLowerCase().trim();
var cached = self.cache.get(cacheKey);
if (cached) {
return Promise.resolve(cached);
}
return self._waitForRateLimit()
.then(function () {
if (self.provider === "brave") {
return self._searchBrave(query);
}
return self._searchSerpApi(query);
})
.then(function (results) {
self.cache.set(cacheKey, results);
return results;
});
};
Searcher.prototype._searchBrave = function (query) {
return axios.get("https://api.search.brave.com/res/v1/web/search", {
headers: {
"Accept": "application/json",
"Accept-Encoding": "gzip",
"X-Subscription-Token": this.apiKey
},
params: {
q: query,
count: 10
}
}).then(function (response) {
var results = response.data.web && response.data.web.results || [];
return results.map(function (r) {
return {
title: r.title,
url: r.url,
snippet: r.description || "",
source: "brave"
};
});
});
};
Searcher.prototype._searchSerpApi = function (query) {
return axios.get("https://serpapi.com/search", {
params: {
q: query,
api_key: this.apiKey,
engine: "google"
}
}).then(function (response) {
var results = response.data.organic_results || [];
return results.map(function (r) {
return {
title: r.title,
url: r.link,
snippet: r.snippet || "",
source: "serpapi"
};
});
});
};
Searcher.prototype._waitForRateLimit = function () {
var self = this;
var now = Date.now();
var elapsed = now - self.rateLimiter.lastCall;
var waitTime = Math.max(0, self.rateLimiter.minInterval - elapsed);
return new Promise(function (resolve) {
setTimeout(function () {
self.rateLimiter.lastCall = Date.now();
resolve();
}, waitTime);
});
};
module.exports = Searcher;
The cache prevents redundant API calls when follow-up queries overlap with earlier searches. The rate limiter ensures you stay within API quotas. Both are essential for cost control.
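To try the searcher on its own, instantiate it with the config keys its constructor expects and call search. This is a minimal usage sketch; the query string is just an illustration:
// example-search.js — standalone use of the Searcher component
var Searcher = require("./agent/searcher");
var searcher = new Searcher({
searchApiKey: process.env.BRAVE_SEARCH_API_KEY,
searchProvider: "brave"
});
searcher.search("PostgreSQL vs MySQL OLAP performance benchmarks")
.then(function (results) {
// Every result has the normalized shape { title, url, snippet, source }
results.forEach(function (r) {
console.log(r.title + " - " + r.url);
});
})
.catch(function (err) {
console.error("Search failed:", err.message);
});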
Page Fetching and Content Extraction
Fetching a web page is easy. Extracting the useful content is hard. Most pages are 90% navigation, ads, footers, and JavaScript. The reader needs to strip all of that and return clean text.
Cheerio handles HTML parsing on the server side without a browser. The extraction strategy focuses on finding the main content area and removing known noise elements:
// agent/reader.js
var axios = require("axios");
var cheerio = require("cheerio");
var NodeCache = require("node-cache");
function Reader(config) {
this.cache = new NodeCache({ stdTTL: 7200 });
this.timeout = config.fetchTimeout || 10000;
this.maxContentLength = config.maxContentLength || 50000;
}
Reader.prototype.read = function (url) {
var self = this;
var cached = self.cache.get(url);
if (cached) {
return Promise.resolve(cached);
}
return axios.get(url, {
timeout: self.timeout,
maxContentLength: 5 * 1024 * 1024,
headers: {
"User-Agent": "ResearchBot/1.0 (Educational Research Agent)",
"Accept": "text/html,application/xhtml+xml"
},
validateStatus: function (status) {
return status < 400;
}
}).then(function (response) {
var content = self._extractContent(response.data, url);
self.cache.set(url, content);
return content;
}).catch(function (err) {
return {
url: url,
title: "",
text: "",
error: err.message,
success: false
};
});
};
Reader.prototype._extractContent = function (html, url) {
var $ = cheerio.load(html);
// Remove noise elements
var noiseSelectors = [
"script", "style", "nav", "footer", "header",
".sidebar", ".advertisement", ".ad", ".popup",
".cookie-banner", ".newsletter-signup", ".social-share",
"#comments", ".comments", "iframe", "noscript"
];
noiseSelectors.forEach(function (selector) {
$(selector).remove();
});
// Try to find main content area
var mainContent = $("article").first();
if (mainContent.length === 0) mainContent = $("[role='main']").first();
if (mainContent.length === 0) mainContent = $(".post-content").first();
if (mainContent.length === 0) mainContent = $(".article-body").first();
if (mainContent.length === 0) mainContent = $("main").first();
if (mainContent.length === 0) mainContent = $("body");
var text = mainContent.text()
.replace(/\s+/g, " ")
.trim();
// Truncate if too long
if (text.length > this.maxContentLength) {
text = text.substring(0, this.maxContentLength) + "... [truncated]";
}
var title = $("title").text().trim() ||
$("h1").first().text().trim() ||
"";
return {
url: url,
title: title,
text: text,
wordCount: text.split(/\s+/).length,
success: true
};
};
module.exports = Reader;
A few things worth noting here. The User-Agent header identifies your bot honestly. Some sites block unknown crawlers, and pretending to be a browser is both fragile and ethically questionable. The validateStatus check treats any status below 400 as success, so redirect responses do not cause axios to throw. The truncation prevents a single enormous page from blowing up your LLM context window.
Implementing the Reading Comprehension Step
Raw page text is too long and noisy to feed directly to the synthesizer. Each source needs a focused summary of what it contributes to the research question. This is the reading comprehension step.
// agent/comprehension.js
var OpenAI = require("openai");
function Comprehension(config) {
this.client = new OpenAI({ apiKey: config.openaiApiKey });
this.model = config.comprehensionModel || "gpt-4o-mini";
}
Comprehension.prototype.summarizeSource = function (question, source) {
var self = this;
if (!source.success || source.text.length < 100) {
return Promise.resolve(null);
}
var prompt = [
"You are a research assistant. Given the research question and a source document,",
"extract the key facts, claims, data points, and arguments that are relevant to",
"answering the research question. Include specific numbers, dates, and quotes.",
"Note any caveats, limitations, or biases in the source.",
"If the source is not relevant to the question, respond with IRRELEVANT.",
"",
"Research question: " + question,
"",
"Source title: " + source.title,
"Source URL: " + source.url,
"",
"Source content:",
source.text.substring(0, 15000)
].join("\n");
return self.client.chat.completions.create({
model: self.model,
messages: [{ role: "user", content: prompt }],
max_tokens: 1000,
temperature: 0.1
}).then(function (response) {
var summary = response.choices[0].message.content.trim();
if (summary === "IRRELEVANT") {
return null;
}
return {
url: source.url,
title: source.title,
summary: summary,
relevance: "relevant"
};
});
};
module.exports = Comprehension;
Using a smaller, cheaper model (gpt-4o-mini) for comprehension is deliberate. This step runs once per source, and you might read 15-30 sources in a single research session. Sending all of that through a flagship model would be expensive. The comprehension task is straightforward enough that a smaller model handles it well.
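With comprehension in place, the two helper methods the ResearchAgent skeleton referenced earlier (_executeSearchBatch and _readAndSummarize) can be filled in. The sketch below is one possible implementation, assuming the constructor also creates this.comprehension = new Comprehension(config) and that research() stores the question on this.question before starting the loop:
// agent/researchAgent.js (continued) — a possible implementation of the
// helper methods used by _iterativeSearch. Assumes `this.comprehension`
// exists and `this.question` was set at the start of research().
ResearchAgent.prototype._executeSearchBatch = function (queries) {
var self = this;
var collected = [];
var chain = Promise.resolve();
queries.forEach(function (query) {
chain = chain.then(function () {
return self.searcher.search(query);
}).then(function (results) {
collected = collected.concat(results.slice(0, self.maxSourcesPerQuery));
});
});
return chain.then(function () {
return collected;
});
};
ResearchAgent.prototype._readAndSummarize = function (results) {
var self = this;
var chain = Promise.resolve();
results.forEach(function (result) {
chain = chain.then(function () {
return self.reader.read(result.url);
}).then(function (page) {
return self.comprehension.summarizeSource(self.question, page);
}).then(function (summary) {
if (summary) {
self.summaries.push(summary);
self.sources.push({
url: result.url,
title: result.title,
accessedAt: new Date().toISOString()
});
}
});
});
return chain;
};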
Source Tracking and Citation Management
Every claim in the final report needs a citation back to its source. The source tracker maintains a deduplicated registry of all sources and assigns each one a stable reference number.
// agent/sourceTracker.js
var crypto = require("crypto");
function SourceTracker() {
this.sources = [];
this.urlIndex = {};
this.contentHashes = {};
}
SourceTracker.prototype.addSource = function (source) {
// Deduplicate by URL
var normalizedUrl = this._normalizeUrl(source.url);
if (this.urlIndex[normalizedUrl] !== undefined) {
return this.urlIndex[normalizedUrl];
}
// Deduplicate by content hash
var contentHash = this._hashContent(source.summary || source.text || "");
if (this.contentHashes[contentHash]) {
return this.contentHashes[contentHash];
}
var index = this.sources.length + 1;
this.sources.push({
index: index,
url: source.url,
title: source.title,
summary: source.summary || "",
accessedAt: new Date().toISOString()
});
this.urlIndex[normalizedUrl] = index;
if (contentHash) {
this.contentHashes[contentHash] = index;
}
return index;
};
SourceTracker.prototype._normalizeUrl = function (url) {
try {
var parsed = new URL(url);
// Remove trailing slash, fragments, common tracking params
parsed.hash = "";
parsed.searchParams.delete("utm_source");
parsed.searchParams.delete("utm_medium");
parsed.searchParams.delete("utm_campaign");
parsed.searchParams.delete("ref");
return parsed.toString().replace(/\/$/, "");
} catch (e) {
return url.toLowerCase().trim();
}
};
SourceTracker.prototype._hashContent = function (text) {
if (!text || text.length < 50) return null;
var normalized = text.substring(0, 500).toLowerCase().replace(/\s+/g, " ");
return crypto.createHash("md5").update(normalized).digest("hex");
};
SourceTracker.prototype.getReferences = function () {
return this.sources.map(function (s) {
return "[" + s.index + "] " + s.title + " - " + s.url +
" (accessed " + s.accessedAt.split("T")[0] + ")";
});
};
module.exports = SourceTracker;
URL normalization is critical. The same article might appear in search results as https://example.com/article, https://example.com/article/, and https://example.com/article?utm_source=google. Without normalization, you end up citing the same source three times. Content hashing catches the less obvious case where different URLs serve identical content, such as syndicated articles or mirror sites.
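To see the URL deduplication in action, feed the tracker two variants of the same (made-up) article URL; both calls return the same reference number:
// example-dedup.js: two URL variants collapse into a single source entry
var SourceTracker = require("./agent/sourceTracker");
var tracker = new SourceTracker();
var first = tracker.addSource({
url: "https://example.com/article?utm_source=google",
title: "Example Article",
summary: "Key findings extracted from the example article for the report."
});
var second = tracker.addSource({
url: "https://example.com/article/",
title: "Example Article",
summary: "Key findings extracted from the example article for the report."
});
console.log(first === second); // true: same reference number both times
console.log(tracker.getReferences());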
Iterative Research
The first round of searches rarely answers the question completely. The planner needs to review what was found, identify gaps, and generate follow-up queries.
// agent/planner.js
var OpenAI = require("openai");
function Planner(config) {
this.client = new OpenAI({ apiKey: config.openaiApiKey });
this.model = config.plannerModel || "gpt-4o";
}
Planner.prototype.generateQueries = function (question) {
var prompt = [
"You are a research planner. Given a research question, generate 3-5 specific",
"search queries that would help answer it comprehensively. Each query should",
"target a different aspect of the question. Return ONLY a JSON array of strings.",
"",
"Research question: " + question
].join("\n");
return this.client.chat.completions.create({
model: this.model,
messages: [{ role: "user", content: prompt }],
max_tokens: 500,
temperature: 0.3
}).then(function (response) {
var text = response.choices[0].message.content.trim();
// Strip markdown code fences if present
text = text.replace(/```json\n?/g, "").replace(/```\n?/g, "");
return JSON.parse(text);
});
};
Planner.prototype.identifyGaps = function (question, summaries) {
if (summaries.length === 0) {
return Promise.resolve([]);
}
var summaryText = summaries.map(function (s, i) {
return "Source " + (i + 1) + ": " + s.summary;
}).join("\n\n");
var prompt = [
"You are a research planner reviewing collected sources for completeness.",
"Given the research question and summaries of sources found so far,",
"identify any important gaps in the research. If there are gaps, return",
"a JSON array of follow-up search queries. If the research is complete,",
"return an empty array [].",
"",
"Research question: " + question,
"",
"Sources collected so far:",
summaryText
].join("\n");
return this.client.chat.completions.create({
model: this.model,
messages: [{ role: "user", content: prompt }],
max_tokens: 500,
temperature: 0.3
}).then(function (response) {
var text = response.choices[0].message.content.trim();
text = text.replace(/```json\n?/g, "").replace(/```\n?/g, "");
return JSON.parse(text);
});
};
module.exports = Planner;
The planner uses the more capable model (gpt-4o) because query generation and gap analysis require reasoning about what is missing, which is a harder task than summarizing what is present.
Handling Conflicting Information
When multiple sources disagree, the agent should not silently pick one version. It should flag the conflict and present both sides. This is built into the synthesizer.
// agent/synthesizer.js
var OpenAI = require("openai");
function Synthesizer(config) {
this.client = new OpenAI({ apiKey: config.openaiApiKey });
this.model = config.synthesisModel || "gpt-4o";
}
Synthesizer.prototype.synthesize = function (question, summaries) {
var numberedSummaries = summaries.map(function (s, i) {
return "[" + (i + 1) + "] " + s.title + "\n" + s.summary;
}).join("\n\n---\n\n");
var prompt = [
"You are a research synthesizer. Given a question and numbered source summaries,",
"produce a comprehensive answer. Follow these rules strictly:",
"",
"1. Cite every factual claim with [N] references matching the source numbers.",
"2. When sources conflict, explicitly state the disagreement and cite both sides.",
"3. Distinguish between well-supported facts (multiple sources) and single-source claims.",
"4. Note when information might be outdated and suggest verification.",
"5. Organize the answer with clear sections and headers.",
"6. End with a confidence assessment: what is well-established vs. uncertain.",
"",
"Research question: " + question,
"",
"Sources:",
numberedSummaries
].join("\n");
return this.client.chat.completions.create({
model: this.model,
messages: [{ role: "user", content: prompt }],
max_tokens: 3000,
temperature: 0.2
}).then(function (response) {
return response.choices[0].message.content.trim();
});
};
module.exports = Synthesizer;
Temperature 0.2 is intentional. Research synthesis should be conservative and factual. Higher temperatures introduce creativity, which is the last thing you want in a fact-based report.
Implementing a Fact-Checking Step
Before final output, a fact-checking pass reviews the synthesized report for internal consistency and claims that are not supported by the collected sources.
// agent/factChecker.js
var OpenAI = require("openai");
function FactChecker(config) {
this.client = new OpenAI({ apiKey: config.openaiApiKey });
this.model = config.factCheckModel || "gpt-4o";
}
FactChecker.prototype.check = function (synthesis, summaries) {
var sourceText = summaries.map(function (s, i) {
return "[" + (i + 1) + "] " + s.summary;
}).join("\n\n");
var prompt = [
"You are a fact-checker. Review the following research synthesis against the",
"source summaries provided. Identify:",
"",
"1. Claims in the synthesis that are NOT supported by any source (hallucinations)",
"2. Claims that contradict the sources",
"3. Missing citations where claims should reference a source",
"4. Numerical or factual inconsistencies",
"",
"Return a JSON object with:",
'{ "issues": [{ "claim": "...", "problem": "...", "severity": "high|medium|low" }],',
' "verified": true/false }',
"",
"Synthesis:",
synthesis,
"",
"Sources:",
sourceText
].join("\n");
return this.client.chat.completions.create({
model: this.model,
messages: [{ role: "user", content: prompt }],
max_tokens: 1500,
temperature: 0.1
}).then(function (response) {
var text = response.choices[0].message.content.trim();
text = text.replace(/```json\n?/g, "").replace(/```\n?/g, "");
return JSON.parse(text);
});
};
module.exports = FactChecker;
If the fact-checker finds high-severity issues, you can either flag them in the output or loop back to the synthesizer with corrections. In practice, I recommend flagging rather than auto-correcting, because the correction loop can introduce new errors.
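The fact-checker is not wired into the ResearchAgent class shown earlier. One way to slot it in between synthesis and reporting looks like the sketch below; the this.factChecker property and the researchWithFactCheck method name are assumptions:
// agent/researchAgent.js (variation) — running the fact-check pass before
// formatting. Assumes the constructor also creates
// `this.factChecker = new FactChecker(config)`.
ResearchAgent.prototype.researchWithFactCheck = function (question) {
var self = this;
return self.planner.generateQueries(question)
.then(function (queries) {
return self._iterativeSearch(question, queries, 0);
})
.then(function () {
return self.synthesizer.synthesize(question, self.summaries);
})
.then(function (synthesis) {
return self.factChecker.check(synthesis, self.summaries)
.then(function (factCheck) {
// Flag issues in the report rather than auto-correcting them
return self.reporter.format(question, synthesis, self.sources, factCheck);
});
});
};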
Output Formatting
The reporter takes the synthesis, sources, and optional fact-check results, and produces a clean, structured document.
// agent/reporter.js
function Reporter() {}
Reporter.prototype.format = function (question, synthesis, sources, factCheck) {
var sections = [];
sections.push("# Research Report");
sections.push("");
sections.push("**Question:** " + question);
sections.push("**Date:** " + new Date().toISOString().split("T")[0]);
sections.push("**Sources consulted:** " + sources.length);
sections.push("");
sections.push("---");
sections.push("");
sections.push("## Findings");
sections.push("");
sections.push(synthesis);
if (factCheck && factCheck.issues && factCheck.issues.length > 0) {
sections.push("");
sections.push("## Verification Notes");
sections.push("");
factCheck.issues.forEach(function (issue) {
var icon = issue.severity === "high" ? "[!]" :
issue.severity === "medium" ? "[?]" : "[i]";
sections.push("- " + icon + " " + issue.problem + ': "' + issue.claim + '"');
});
}
sections.push("");
sections.push("## References");
sections.push("");
sources.forEach(function (s, i) {
sections.push((i + 1) + ". [" + s.title + "](" + s.url + ") — accessed " +
(s.accessedAt || new Date().toISOString()).split("T")[0]);
});
sections.push("");
sections.push("---");
sections.push("*Generated by Research Agent v1.0*");
return sections.join("\n");
};
module.exports = Reporter;
Rate Limiting and Caching Search Results
We already saw per-component caching above, but a production agent needs centralized rate limiting across all API calls. Here is a generic rate limiter that can be shared across searcher, reader, and LLM calls:
// agent/rateLimiter.js
function RateLimiter(options) {
this.maxRequests = options.maxRequests || 10;
this.windowMs = options.windowMs || 60000;
this.queue = [];
this.timestamps = [];
}
RateLimiter.prototype.acquire = function () {
var self = this;
return new Promise(function (resolve) {
self._tryAcquire(resolve);
});
};
RateLimiter.prototype._tryAcquire = function (callback) {
var self = this;
var now = Date.now();
// Remove timestamps outside the window
self.timestamps = self.timestamps.filter(function (t) {
return now - t < self.windowMs;
});
if (self.timestamps.length < self.maxRequests) {
self.timestamps.push(now);
callback();
} else {
var oldestInWindow = self.timestamps[0];
var waitTime = self.windowMs - (now - oldestInWindow) + 10;
setTimeout(function () {
self._tryAcquire(callback);
}, waitTime);
}
};
module.exports = RateLimiter;
Integrate this into any component that makes external API calls. For the Brave Search free tier, you get 1 query per second and 2,000 queries per month. For OpenAI, the rate limits depend on your tier but you should still throttle to avoid hitting them during burst research sessions.
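As a sketch of that integration, a single shared limiter can gate every search call; the one-request-per-second settings below mirror the Brave free tier mentioned above:
// Sketch: gating search calls through a shared RateLimiter instance
var RateLimiter = require("./agent/rateLimiter");
var searchLimiter = new RateLimiter({ maxRequests: 1, windowMs: 1000 });
function rateLimitedSearch(searcher, query) {
// acquire() resolves as soon as a slot is free in the current window
return searchLimiter.acquire().then(function () {
return searcher.search(query);
});
}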
Cost Control for Research Agents
Research agents can get expensive fast. A single research session might involve 5 search API calls, 20 page fetches, 20 comprehension LLM calls, 2 planning calls, 1 synthesis call, and 1 fact-check call. Here is a cost tracker that enforces budgets:
// agent/costTracker.js
function CostTracker(budget) {
this.budget = budget || 1.00;
this.spent = 0;
this.breakdown = {
search: 0,
llm: 0,
total: 0
};
}
CostTracker.prototype.recordSearchCall = function () {
// Brave Search: ~$0.005 per query on paid plan
var cost = 0.005;
this.breakdown.search += cost;
this.spent += cost;
this._checkBudget();
};
CostTracker.prototype.recordLLMCall = function (model, inputTokens, outputTokens) {
var cost = 0;
var pricing = {
"gpt-4o": { input: 2.50 / 1000000, output: 10.00 / 1000000 },
"gpt-4o-mini": { input: 0.15 / 1000000, output: 0.60 / 1000000 }
};
var modelPricing = pricing[model] || pricing["gpt-4o-mini"];
cost = (inputTokens * modelPricing.input) + (outputTokens * modelPricing.output);
this.breakdown.llm += cost;
this.spent += cost;
this._checkBudget();
};
CostTracker.prototype._checkBudget = function () {
this.breakdown.total = this.spent;
if (this.spent >= this.budget) {
throw new Error(
"Research budget exceeded: $" + this.spent.toFixed(4) +
" of $" + this.budget.toFixed(2) + " budget. " +
"Breakdown - Search: $" + this.breakdown.search.toFixed(4) +
", LLM: $" + this.breakdown.llm.toFixed(4)
);
}
};
CostTracker.prototype.getSummary = function () {
return {
spent: this.spent,
budget: this.budget,
remaining: this.budget - this.spent,
breakdown: this.breakdown
};
};
module.exports = CostTracker;
Set conservative defaults. A $1.00 budget is usually enough for a thorough research session. If the agent hits the budget, it should synthesize what it has rather than throwing away all work.
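Wiring the tracker into LLM calls is straightforward because the OpenAI client returns token counts on every response. The wrapper below is a sketch of that hookup, not part of the earlier code; trackedLLM is a hypothetical helper:
// Sketch: recording LLM spend from the usage field on each completion
var OpenAI = require("openai");
var CostTracker = require("./agent/costTracker");
var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
var tracker = new CostTracker(1.00);
function trackedLLM(prompt, model, maxTokens) {
var chosenModel = model || "gpt-4o-mini";
return openai.chat.completions.create({
model: chosenModel,
messages: [{ role: "user", content: prompt }],
max_tokens: maxTokens || 1000,
temperature: 0.2
}).then(function (res) {
// res.usage reports prompt_tokens and completion_tokens for the call
if (res.usage) {
tracker.recordLLMCall(chosenModel, res.usage.prompt_tokens, res.usage.completion_tokens);
}
return res.choices[0].message.content.trim();
});
}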
Complete Working Example
Here is the full research agent wired together, ready to use from the command line:
// research.js — Complete research agent
var axios = require("axios");
var cheerio = require("cheerio");
var OpenAI = require("openai");
var NodeCache = require("node-cache");
var crypto = require("crypto");
// --- Configuration ---
var config = {
searchApiKey: process.env.BRAVE_SEARCH_API_KEY,
openaiApiKey: process.env.OPENAI_API_KEY,
maxIterations: 2,
maxSourcesPerQuery: 5,
fetchTimeout: 10000,
budget: 1.00
};
var openai = new OpenAI({ apiKey: config.openaiApiKey });
var searchCache = new NodeCache({ stdTTL: 3600 });
var pageCache = new NodeCache({ stdTTL: 7200 });
var sourceRegistry = [];
var urlIndex = {};
// --- Search ---
function searchWeb(query) {
var cached = searchCache.get(query);
if (cached) return Promise.resolve(cached);
return axios.get("https://api.search.brave.com/res/v1/web/search", {
headers: {
"Accept": "application/json",
"X-Subscription-Token": config.searchApiKey
},
params: { q: query, count: config.maxSourcesPerQuery }
}).then(function (res) {
var results = (res.data.web && res.data.web.results || []).map(function (r) {
return { title: r.title, url: r.url, snippet: r.description || "" };
});
searchCache.set(query, results);
return results;
});
}
// --- Read Page ---
function readPage(url) {
var cached = pageCache.get(url);
if (cached) return Promise.resolve(cached);
return axios.get(url, {
timeout: config.fetchTimeout,
headers: { "User-Agent": "ResearchBot/1.0" },
validateStatus: function (s) { return s < 400; }
}).then(function (res) {
var $ = cheerio.load(res.data);
["script", "style", "nav", "footer", "header", ".sidebar", ".ad", "iframe"]
.forEach(function (sel) { $(sel).remove(); });
var main = $("article").first();
if (!main.length) main = $("main").first();
if (!main.length) main = $("body");
var text = main.text().replace(/\s+/g, " ").trim();
if (text.length > 30000) text = text.substring(0, 30000) + "...";
var page = {
url: url,
title: $("title").text().trim() || $("h1").first().text().trim(),
text: text,
success: true
};
pageCache.set(url, page);
return page;
}).catch(function (err) {
return { url: url, title: "", text: "", success: false, error: err.message };
});
}
// --- Track Source ---
function trackSource(source) {
var normalized = source.url.replace(/\/$/, "").split("#")[0];
if (urlIndex[normalized] !== undefined) return urlIndex[normalized];
var idx = sourceRegistry.length;
sourceRegistry.push({
index: idx + 1,
url: source.url,
title: source.title,
summary: source.summary || "",
accessedAt: new Date().toISOString()
});
urlIndex[normalized] = idx;
return idx;
}
// --- LLM Call ---
function llm(prompt, model, maxTokens) {
return openai.chat.completions.create({
model: model || "gpt-4o-mini",
messages: [{ role: "user", content: prompt }],
max_tokens: maxTokens || 1000,
temperature: 0.2
}).then(function (res) {
return res.choices[0].message.content.trim();
});
}
// --- Plan ---
function planQueries(question) {
var prompt = "Generate 3-5 web search queries to research this question. " +
"Return ONLY a JSON array of strings.\n\nQuestion: " + question;
return llm(prompt, "gpt-4o", 500).then(function (text) {
text = text.replace(/```json\n?/g, "").replace(/```\n?/g, "");
return JSON.parse(text);
});
}
function identifyGaps(question, summaries) {
if (summaries.length < 2) return Promise.resolve([]);
var text = summaries.map(function (s, i) {
return "[" + (i + 1) + "] " + s;
}).join("\n");
var prompt = "Given this research question and collected summaries, return a JSON " +
"array of follow-up search queries for any gaps. Return [] if complete.\n\n" +
"Question: " + question + "\n\nSummaries:\n" + text;
return llm(prompt, "gpt-4o", 500).then(function (text) {
text = text.replace(/```json\n?/g, "").replace(/```\n?/g, "");
return JSON.parse(text);
});
}
// --- Summarize Source ---
function summarizeSource(question, page) {
if (!page.success || page.text.length < 100) return Promise.resolve(null);
var prompt = "Extract key facts relevant to this research question from the source.\n" +
"Question: " + question + "\nSource (" + page.title + "):\n" +
page.text.substring(0, 12000);
return llm(prompt, "gpt-4o-mini", 800);
}
// --- Synthesize ---
function synthesize(question, summaries) {
var numbered = summaries.map(function (s, i) {
return "[" + (i + 1) + "] " + s;
}).join("\n\n---\n\n");
var prompt = "Synthesize these sources into a comprehensive answer. Cite sources " +
"with [N]. Note conflicts between sources. End with a confidence assessment.\n\n" +
"Question: " + question + "\n\nSources:\n" + numbered;
return llm(prompt, "gpt-4o", 3000);
}
// --- Format Report ---
function formatReport(question, synthesis) {
var lines = [
"# Research Report",
"",
"**Question:** " + question,
"**Date:** " + new Date().toISOString().split("T")[0],
"**Sources consulted:** " + sourceRegistry.length,
"",
"---",
"",
"## Findings",
"",
synthesis,
"",
"## References",
""
];
sourceRegistry.forEach(function (s) {
lines.push(s.index + ". [" + s.title + "](" + s.url + ")");
});
return lines.join("\n");
}
// --- Main Research Loop ---
function research(question) {
var allSummaries = [];
console.log("Planning research queries...");
return planQueries(question)
.then(function (queries) {
console.log("Generated " + queries.length + " queries");
return executeRound(question, queries, allSummaries, 0);
})
.then(function () {
console.log("Synthesizing " + allSummaries.length + " sources...");
return synthesize(question, allSummaries);
})
.then(function (synthesis) {
var report = formatReport(question, synthesis);
return report;
});
}
function executeRound(question, queries, allSummaries, iteration) {
if (iteration >= config.maxIterations || queries.length === 0) {
return Promise.resolve();
}
console.log("Search round " + (iteration + 1) + ": " + queries.length + " queries");
// Execute searches sequentially to respect rate limits
var chain = Promise.resolve();
var roundResults = [];
queries.forEach(function (query) {
chain = chain.then(function () {
return searchWeb(query);
}).then(function (results) {
roundResults = roundResults.concat(results);
// 1-second delay between searches
return new Promise(function (r) { setTimeout(r, 1000); });
});
});
return chain.then(function () {
// Deduplicate by URL
var seen = {};
var unique = roundResults.filter(function (r) {
if (seen[r.url]) return false;
seen[r.url] = true;
return true;
});
console.log("Reading " + unique.length + " pages...");
// Read and summarize each page
var readChain = Promise.resolve();
unique.forEach(function (result) {
readChain = readChain.then(function () {
return readPage(result.url).then(function (page) {
return summarizeSource(question, page).then(function (summary) {
if (summary) {
trackSource({ url: result.url, title: result.title, summary: summary });
allSummaries.push(summary);
}
});
});
});
});
return readChain;
}).then(function () {
return identifyGaps(question, allSummaries);
}).then(function (gaps) {
if (gaps.length > 0) {
console.log("Found " + gaps.length + " gaps, running follow-up...");
return executeRound(question, gaps, allSummaries, iteration + 1);
}
});
}
// --- Run ---
var question = process.argv.slice(2).join(" ") ||
"What are the best practices for implementing rate limiting in distributed systems?";
console.log("Researching: " + question);
console.log("");
research(question)
.then(function (report) {
console.log("\n" + report);
})
.catch(function (err) {
console.error("Research failed:", err.message);
process.exit(1);
});
Run it:
export BRAVE_SEARCH_API_KEY=your-key-here
export OPENAI_API_KEY=your-key-here
node research.js "What are the trade-offs between REST and GraphQL for mobile applications?"
The agent will plan queries, search the web, read and summarize each source, check for gaps, run follow-up searches if needed, and output a formatted report with citations.
Common Issues and Troubleshooting
1. Search API returns empty results
Error: Cannot read properties of undefined (reading 'map')
This happens when the Brave Search API returns a response without the web.results field, typically due to an invalid API key or exhausted quota. Always guard against missing nested properties:
var results = response.data.web && response.data.web.results || [];
Check your API key and quota at the Brave Search dashboard.
2. Page fetch timeouts on slow sites
Error: timeout of 10000ms exceeded
Some sites take longer than 10 seconds to respond, especially during high traffic. Increase the timeout for important sources, but always set an upper bound. A 30-second timeout that blocks your entire pipeline is worse than skipping one source:
var timeout = isHighPriority ? 20000 : 10000;
Also consider that the page might be behind a JavaScript-rendered SPA. Cheerio cannot execute JavaScript. If you need to handle SPAs, you will need Puppeteer, which adds significant complexity and resource usage. For most research tasks, skipping JS-heavy pages and relying on other sources is the pragmatic choice.
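A cheap heuristic for detecting JS-rendered pages without Puppeteer is to compare the extracted text against the raw HTML size. The 200-character and 1% thresholds below are arbitrary starting points, not tested values:
// Sketch: skip pages that are probably rendered client-side
function looksJavaScriptRendered(html, extractedText) {
if (!extractedText || extractedText.length < 200) return true;
// Almost all of the payload is markup and scripts, almost none is visible text
return extractedText.length / html.length < 0.01;
}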
3. LLM returns malformed JSON from planner
SyntaxError: Unexpected token 'H' in JSON at position 0
The LLM sometimes returns text like "Here are the queries:" followed by the JSON instead of pure JSON. The code strips markdown code fences, but it does not handle prefixed prose. Add a more robust JSON extraction:
function extractJSON(text) {
text = text.replace(/```json\n?/g, "").replace(/```\n?/g, "");
var match = text.match(/\[[\s\S]*\]/);
if (match) return JSON.parse(match[0]);
match = text.match(/\{[\s\S]*\}/);
if (match) return JSON.parse(match[0]);
throw new Error("No JSON found in LLM response: " + text.substring(0, 100));
}
4. Budget exceeded mid-research
Error: Research budget exceeded: $1.0234 of $1.00 budget. Breakdown - Search: $0.0250, LLM: $0.9984
When you hit the budget, the agent should gracefully synthesize whatever it has collected so far rather than throwing away all work. Wrap the research loop in a try-catch that falls through to synthesis:
return executeRound(question, queries, allSummaries, 0)
.catch(function (err) {
if (err.message.indexOf("budget exceeded") > -1) {
console.warn("Budget hit, synthesizing available sources...");
return Promise.resolve();
}
throw err;
})
.then(function () {
return synthesize(question, allSummaries);
});
5. Duplicate content inflating source count
When syndicated content appears on multiple sites, you end up with 5 "sources" that all say exactly the same thing, creating false confidence. The content hash in the source tracker catches exact duplicates, but paraphrased duplicates require semantic similarity detection. For production use, add an embedding-based similarity check before registering a new source.
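A sketch of that semantic check using OpenAI embeddings and cosine similarity is below. The text-embedding-3-small model choice and the 0.9 threshold are assumptions to tune, not tested recommendations:
// Sketch: near-duplicate detection for source summaries via embeddings
var OpenAI = require("openai");
var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
function embed(text) {
return openai.embeddings.create({
model: "text-embedding-3-small",
input: text.substring(0, 2000)
}).then(function (res) {
return res.data[0].embedding;
});
}
function cosineSimilarity(a, b) {
var dot = 0, normA = 0, normB = 0;
for (var i = 0; i < a.length; i++) {
dot += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Returns true if the candidate embedding is close to any already-registered one
function isNearDuplicate(candidateEmbedding, existingEmbeddings) {
return existingEmbeddings.some(function (existing) {
return cosineSimilarity(candidateEmbedding, existing) > 0.9;
});
}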
Best Practices
Set hard limits on search depth and breadth. Two iterations of 5 queries each with 5 results per query means a maximum of 50 pages. That is usually more than enough. Unbounded research loops burn money and rarely improve quality.
Use the cheapest model that works for each step. Comprehension and summarization are simple tasks well suited for gpt-4o-mini. Planning and synthesis require more reasoning and benefit from gpt-4o. Do not use the same model for everything.
Cache aggressively at every layer. Search results, fetched pages, and source summaries should all be cached. A 1-hour TTL for search results and a 2-hour TTL for page content are reasonable defaults for most research tasks.
Always track provenance. Every fact in the final report must trace back to a specific source URL. This is not optional. Without citations, a research agent is just a chatbot with extra steps.
Handle failures gracefully at the source level. A single page timing out or returning a 403 should not crash the entire research session. Log the failure, skip the source, and continue with the remaining results.
Normalize and deduplicate URLs before reading. Strip tracking parameters, fragments, and trailing slashes. This prevents wasting API calls and LLM tokens on content you have already processed.
Respect robots.txt and rate limits. Just because you can scrape a site does not mean you should. Check robots.txt before fetching and honor Crawl-delay directives. Set a minimum of 1 second between requests to any single domain.
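A simplified check might look like the sketch below; it only reads the wildcard User-agent block and Disallow prefixes, which is far less than a real robots.txt parser handles:
// Sketch: minimal robots.txt check (wildcard agent + Disallow prefixes only)
var axios = require("axios");
function isAllowedByRobots(pageUrl) {
var parsed = new URL(pageUrl);
var robotsUrl = parsed.origin + "/robots.txt";
return axios.get(robotsUrl, { timeout: 5000 })
.then(function (res) {
var applies = false;
var disallowed = [];
String(res.data).split("\n").forEach(function (line) {
var trimmed = line.trim();
if (/^user-agent:/i.test(trimmed)) {
applies = trimmed.split(":")[1].trim() === "*";
} else if (applies && /^disallow:/i.test(trimmed)) {
var path = trimmed.split(":")[1].trim();
if (path) disallowed.push(path);
}
});
return !disallowed.some(function (prefix) {
return parsed.pathname.indexOf(prefix) === 0;
});
})
.catch(function () {
// No robots.txt or unreachable: assume fetching is allowed
return true;
});
}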
Keep the synthesizer temperature low. A research report should be conservative and factual. Set temperature to 0.2 or lower. Higher temperatures introduce creative phrasing that can distort factual claims.
Log cost and token usage per research session. Without visibility into costs, research agents become expensive surprises. Track every API call and present a cost summary alongside the report.
Test with known-answer questions first. Before trusting the agent with open research, run it on questions where you already know the answer. This reveals gaps in the pipeline, such as poor content extraction or hallucinated citations, before they affect real work.