Building a Web Scraping MCP Server

Build a web scraping MCP server with cheerio, Puppeteer, structured data extraction, and caching for use with Claude Desktop.

Overview

Web scraping is one of the most immediately useful capabilities you can give an AI assistant. By wrapping scraping logic inside a Model Context Protocol (MCP) server, you let Claude fetch live web pages, extract structured data, navigate JavaScript-rendered sites, and cache results — all through a clean, tool-based interface. This article walks through building a production-grade MCP server that handles static and dynamic pages, respects rate limits, and returns structured JSON that Claude can reason about directly.

Prerequisites

Before diving in, make sure you have the following:

  • Node.js v18+ installed (v20 recommended)
  • Claude Desktop installed and configured
  • Familiarity with the MCP protocol basics (tool definitions, resources, JSON-RPC)
  • Working knowledge of CSS selectors
  • A terminal and text editor

Install the core dependencies we will use throughout:

npm init -y
npm install @modelcontextprotocol/sdk cheerio puppeteer robots-parser node-cache

Here is what each package does:

  • @modelcontextprotocol/sdk: Official MCP server SDK
  • cheerio: Fast HTML parsing (jQuery-like API, no browser needed)
  • puppeteer: Headless Chrome for JavaScript-rendered pages
  • robots-parser: Parse and respect robots.txt
  • node-cache: In-memory caching for scraped results
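Since the end goal is running this server under Claude Desktop, it helps to know where it gets registered. A sketch of the relevant claude_desktop_config.json entry, assuming the server file is saved as scraper-server.js (the absolute path is a placeholder you must adjust for your machine):

```json
{
  "mcpServers": {
    "web-scraper": {
      "command": "node",
      "args": ["/absolute/path/to/scraper-server.js"]
    }
  }
}
```

After editing the config, restart Claude Desktop so it picks up the new server.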

Designing MCP Tools for Web Scraping

A good scraping MCP server needs distinct tools that map to the fundamental operations of web scraping. You do not want one monolithic "scrape everything" tool. Instead, design small, composable tools that Claude can chain together intelligently.

Here is the tool breakdown I recommend:

  1. fetch_page — Fetch raw HTML from a URL using a lightweight HTTP request
  2. extract_data — Extract structured data from HTML using CSS selectors
  3. fetch_rendered_page — Fetch a JavaScript-rendered page using Puppeteer
  4. follow_links — Extract and optionally follow links matching a pattern
  5. scrape_paginated — Handle multi-page scraping with pagination

Each tool returns structured JSON. Claude can inspect the output from one tool and decide which tool to call next. This composability is the entire point of building an MCP server rather than a monolithic scraper script.

Let us start with the server skeleton:

var { McpServer } = require("@modelcontextprotocol/sdk/server/mcp.js");
var { StdioServerTransport } = require("@modelcontextprotocol/sdk/server/stdio.js");
var cheerio = require("cheerio");
var puppeteer = require("puppeteer");
var robotsParser = require("robots-parser");
var NodeCache = require("node-cache");
var https = require("https");
var http = require("http");

var cache = new NodeCache({ stdTTL: 300, checkperiod: 60 });
var rateLimitMap = {};

var server = new McpServer({
  name: "web-scraper",
  version: "1.0.0"
});

Fetching Pages with Cheerio

The fetch_page tool is the workhorse. Most websites serve static HTML that does not require a full browser. Cheerio parses this HTML at roughly 10x the speed of Puppeteer because it never boots a browser process.

function fetchUrl(targetUrl) {
  return new Promise(function(resolve, reject) {
    var protocol = targetUrl.startsWith("https") ? https : http;
    var options = {
      headers: {
        "User-Agent": "MCPWebScraper/1.0 (compatible; research bot)",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9"
      }
    };

    protocol.get(targetUrl, options, function(res) {
      if (res.statusCode >= 300 && res.statusCode < 400 && res.headers.location) {
        // Location headers may be relative; resolve against the requested URL
        var redirectUrl;
        try {
          redirectUrl = new URL(res.headers.location, targetUrl).toString();
        } catch (e) {
          return reject(new Error("Invalid redirect URL: " + res.headers.location));
        }
        return fetchUrl(redirectUrl).then(resolve).catch(reject);
      }
      if (res.statusCode !== 200) {
        res.resume(); // drain the response so the socket is released
        return reject(new Error("HTTP " + res.statusCode + " for " + targetUrl));
      }

      var chunks = [];
      res.on("data", function(chunk) { chunks.push(chunk); });
      res.on("end", function() { resolve(Buffer.concat(chunks).toString()); });
      res.on("error", reject);
    }).on("error", reject);
  });
}

server.tool(
  "fetch_page",
  "Fetch a web page and return its HTML content. Use this for static pages that do not require JavaScript rendering.",
  {
    url: { type: "string", description: "The URL to fetch" },
    use_cache: { type: "boolean", description: "Whether to use cached results", default: true }
  },
  function(params) {
    var cacheKey = "page:" + params.url;

    if (params.use_cache !== false) {
      var cached = cache.get(cacheKey);
      if (cached) {
        return {
          content: [{ type: "text", text: JSON.stringify({
            source: "cache", url: params.url,
            html_length: cached.length, html: cached
          }) }]
        };
      }
    }

    return checkRobotsTxt(params.url).then(function(allowed) {
      if (!allowed) {
        return {
          content: [{ type: "text", text: JSON.stringify({
            error: "Blocked by robots.txt", url: params.url
          }) }]
        };
      }

      return enforceRateLimit(params.url).then(function() {
        return fetchUrl(params.url);
      }).then(function(html) {
        cache.set(cacheKey, html);
        return {
          content: [{ type: "text", text: JSON.stringify({
            source: "live", url: params.url,
            html_length: html.length, html: html
          }) }]
        };
      });
    });
  }
);

This tool handles redirects, respects robots.txt (we will implement that shortly), caches results for 5 minutes, and returns both the HTML and its length so Claude knows what it is working with.

Extracting Structured Data with CSS Selectors

Raw HTML is not very useful to an AI. The extract_data tool takes HTML and a set of CSS selectors, then returns clean, structured JSON. This is where cheerio shines.

server.tool(
  "extract_data",
  "Extract structured data from HTML using CSS selectors. Returns an array of matched elements with their text content and attributes.",
  {
    html: { type: "string", description: "HTML content to parse" },
    selectors: {
      type: "object",
      description: "Named CSS selectors to extract. Keys are field names, values are CSS selectors.",
      additionalProperties: { type: "string" }
    },
    list_selector: {
      type: "string",
      description: "Optional: CSS selector for repeating items (e.g., '.product-card'). Each match gets all named selectors applied within it."
    },
    limit: { type: "number", description: "Max items to return", default: 50 }
  },
  function(params) {
    var $ = cheerio.load(params.html);
    var results = [];
    var limit = params.limit || 50;

    if (params.list_selector) {
      $(params.list_selector).each(function(i) {
        if (i >= limit) return false;
        var $el = $(this);
        var item = {};

        Object.keys(params.selectors).forEach(function(field) {
          var selector = params.selectors[field];
          var matched = $el.find(selector).first();
          item[field] = {
            text: matched.text().trim(),
            href: matched.attr("href") || null,
            src: matched.attr("src") || null
          };
        });

        results.push(item);
      });
    } else {
      Object.keys(params.selectors).forEach(function(field) {
        var selector = params.selectors[field];
        var matches = [];

        $(selector).each(function(i) {
          if (i >= limit) return false;
          matches.push({
            text: $(this).text().trim(),
            href: $(this).attr("href") || null,
            src: $(this).attr("src") || null,
            html: $(this).html()
          });
        });

        results.push({ field: field, selector: selector, matches: matches });
      });
    }

    return {
      content: [{ type: "text", text: JSON.stringify({
        item_count: results.length, data: results
      }, null, 2) }]
    };
  }
);

The list_selector parameter is critical for scraping repeating content like product listings, search results, or article feeds. Claude can call this tool like:

{
  "list_selector": ".search-result",
  "selectors": {
    "title": "h3 a",
    "price": ".price-tag",
    "rating": ".star-rating"
  }
}

And it gets back a clean array of objects — no HTML parsing needed on Claude's end.

Handling JavaScript-Rendered Pages with Puppeteer

Some pages load their content via JavaScript after the initial page load. Single-page apps built with React, Vue, or Angular are the usual suspects. For these, we need a real browser.

var browserInstance = null;

function getBrowser() {
  if (browserInstance) return Promise.resolve(browserInstance);

  return puppeteer.launch({
    headless: "new",
    args: [
      "--no-sandbox",
      "--disable-setuid-sandbox",
      "--disable-dev-shm-usage",
      "--disable-gpu",
      "--single-process"
    ]
  }).then(function(browser) {
    browserInstance = browser;
    return browser;
  });
}

server.tool(
  "fetch_rendered_page",
  "Fetch a page using a headless browser to execute JavaScript. Use this for SPAs and pages that load content dynamically. Slower than fetch_page (~3-8 seconds).",
  {
    url: { type: "string", description: "The URL to fetch" },
    wait_for: { type: "string", description: "CSS selector to wait for before capturing HTML" },
    wait_timeout: { type: "number", description: "Max wait time in ms", default: 10000 },
    scroll_to_bottom: { type: "boolean", description: "Scroll to bottom to trigger lazy loading", default: false }
  },
  function(params) {
    var cacheKey = "rendered:" + params.url;
    var cached = cache.get(cacheKey);
    if (cached) {
      return {
        content: [{ type: "text", text: JSON.stringify({
          source: "cache", url: params.url,
          html_length: cached.length, html: cached
        }) }]
      };
    }

    return checkRobotsTxt(params.url).then(function(allowed) {
      if (!allowed) {
        return {
          content: [{ type: "text", text: JSON.stringify({
            error: "Blocked by robots.txt", url: params.url
          }) }]
        };
      }

      return enforceRateLimit(params.url).then(function() {
        return getBrowser();
      }).then(function(browser) {
        return browser.newPage().then(function(page) {
          return page.setUserAgent("MCPWebScraper/1.0 (compatible; research bot)")
            .then(function() {
              return page.goto(params.url, {
                waitUntil: "networkidle2",
                timeout: params.wait_timeout || 10000
              });
            })
            .then(function() {
              if (params.wait_for) {
                return page.waitForSelector(params.wait_for, {
                  timeout: params.wait_timeout || 10000
                });
              }
            })
            .then(function() {
              if (params.scroll_to_bottom) {
                return page.evaluate(function() {
                  return new Promise(function(resolve) {
                    var totalHeight = 0;
                    var distance = 300;
                    var timer = setInterval(function() {
                      window.scrollBy(0, distance);
                      totalHeight += distance;
                      if (totalHeight >= document.body.scrollHeight) {
                        clearInterval(timer);
                        resolve();
                      }
                    }, 200);
                  });
                });
              }
            })
            .then(function() {
              return page.content();
            })
            .then(function(html) {
              cache.set(cacheKey, html);
              return page.close().then(function() {
                return {
                  content: [{ type: "text", text: JSON.stringify({
                    source: "live-rendered", url: params.url,
                    html_length: html.length, html: html
                  }) }]
                };
              });
            });
        });
      });
    }).catch(function(err) {
      return {
        content: [{ type: "text", text: JSON.stringify({
          error: err.message, url: params.url
        }) }]
      };
    });
  }
);

A few design decisions worth calling out:

  • Browser reuse: We keep a single browser instance alive and create a new page (tab) for each request. Launching a browser takes 2-4 seconds; reusing one removes that launch cost from every subsequent request.
  • scroll_to_bottom: Many sites lazy-load images and content as you scroll. This parameter triggers a scroll loop that fires lazy loaders.
  • wait_for: Critical for SPAs. Without this, you capture the page before React or Vue has finished rendering. Always pass in a selector that appears only after the content loads.

Typical timing: fetch_page completes in 200-500ms. fetch_rendered_page takes 3-8 seconds depending on the target site's complexity. Choose accordingly.

Following Links and Pagination

Web scraping almost never stops at a single page. You need to follow links and handle pagination.

server.tool(
  "follow_links",
  "Extract links from HTML that match a pattern. Returns an array of URLs with their anchor text.",
  {
    html: { type: "string", description: "HTML content to scan for links" },
    base_url: { type: "string", description: "Base URL for resolving relative links" },
    pattern: { type: "string", description: "Regex pattern to filter links (applied to href)" },
    selector: { type: "string", description: "CSS selector to narrow which links to extract", default: "a[href]" },
    limit: { type: "number", description: "Max links to return", default: 25 }
  },
  function(params) {
    var $ = cheerio.load(params.html);
    var links = [];
    var seen = {};
    var regex = params.pattern ? new RegExp(params.pattern) : null;
    var limit = params.limit || 25;

    $(params.selector || "a[href]").each(function() {
      if (links.length >= limit) return false;

      var href = $(this).attr("href");
      if (!href || href.startsWith("#") || href.startsWith("javascript:")) return;

      // Resolve relative URLs
      try {
        var resolved = new URL(href, params.base_url).toString();
      } catch (e) {
        return;
      }

      if (seen[resolved]) return;
      if (regex && !regex.test(resolved)) return;

      seen[resolved] = true;
      links.push({
        url: resolved,
        text: $(this).text().trim(),
        rel: $(this).attr("rel") || null
      });
    });

    return {
      content: [{ type: "text", text: JSON.stringify({
        link_count: links.length, links: links
      }, null, 2) }]
    };
  }
);
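The core of follow_links is pure URL logic (resolve relative hrefs, dedupe, filter by regex), so it can be sanity-checked without any HTML. Here is a standalone sketch of that inner loop; resolveLinks is a helper name introduced for illustration:

```javascript
// Resolve a list of raw href values against a base URL, skipping fragments
// and javascript: links, deduplicating, and applying an optional regex filter.
// This mirrors the per-anchor logic inside the follow_links tool.
function resolveLinks(hrefs, baseUrl, pattern) {
  var regex = pattern ? new RegExp(pattern) : null;
  var seen = {};
  var out = [];

  hrefs.forEach(function(href) {
    if (!href || href.startsWith("#") || href.startsWith("javascript:")) return;

    var resolved;
    try {
      resolved = new URL(href, baseUrl).toString();
    } catch (e) {
      return; // unparseable href, skip it
    }

    if (seen[resolved]) return;
    if (regex && !regex.test(resolved)) return;

    seen[resolved] = true;
    out.push(resolved);
  });

  return out;
}
```

Note that absolute paths ("/a") resolve against the origin, while bare paths ("a") resolve against the current directory of the base URL, exactly as a browser would treat them.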

Now for the paginated scraping tool, which chains multiple page fetches together:

server.tool(
  "scrape_paginated",
  "Scrape multiple pages following a pagination pattern. Fetches each page and extracts data using provided selectors.",
  {
    start_url: { type: "string", description: "URL of the first page" },
    next_selector: { type: "string", description: "CSS selector for the next page link" },
    item_selector: { type: "string", description: "CSS selector for repeating items" },
    fields: {
      type: "object",
      description: "Named selectors for fields within each item",
      additionalProperties: { type: "string" }
    },
    max_pages: { type: "number", description: "Maximum pages to scrape", default: 5 }
  },
  function(params) {
    var allItems = [];
    var currentUrl = params.start_url;
    var pageCount = 0;
    var maxPages = params.max_pages || 5;

    function scrapePage(pageUrl) {
      if (pageCount >= maxPages || !pageUrl) {
        return Promise.resolve({
          content: [{
            type: "text",
            text: JSON.stringify({
              pages_scraped: pageCount,
              total_items: allItems.length,
              data: allItems
            }, null, 2)
          }]
        });
      }

      return enforceRateLimit(pageUrl).then(function() {
        return fetchUrl(pageUrl);
      }).then(function(html) {
        pageCount++;
        var $ = cheerio.load(html);

        $(params.item_selector).each(function() {
          var $el = $(this);
          var item = { _page: pageCount, _source_url: pageUrl };

          Object.keys(params.fields).forEach(function(field) {
            var matched = $el.find(params.fields[field]).first();
            item[field] = matched.text().trim();
            var href = matched.attr("href");
            if (href) item[field + "_url"] = href;
          });

          allItems.push(item);
        });

        // Find next page link
        var nextLink = $(params.next_selector).attr("href");
        if (nextLink) {
          try {
            var nextUrl = new URL(nextLink, pageUrl).toString();
            return scrapePage(nextUrl);
          } catch (e) {
            // Invalid URL, stop pagination
          }
        }

        return {
          content: [{
            type: "text",
            text: JSON.stringify({
              pages_scraped: pageCount,
              total_items: allItems.length,
              data: allItems
            }, null, 2)
          }]
        };
      }).catch(function(err) {
        return {
          content: [{
            type: "text",
            text: JSON.stringify({
              error: err.message,
              pages_scraped: pageCount,
              total_items: allItems.length,
              data: allItems
            }, null, 2)
          }]
        };
      });
    }

    return scrapePage(currentUrl);
  }
);

The scrape_paginated tool is the most powerful one in the set. Claude tells it where to start, what the "next" button looks like, and what data to extract. The tool follows the pagination chain and returns everything in one structured response. The max_pages default of 5 prevents runaway scraping — you never want a tool that can accidentally fire off 500 requests.
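A typical invocation, in the same spirit as the extract_data example earlier; the selectors here are hypothetical and always site-specific:

```json
{
  "start_url": "https://example.com/blog/page/1",
  "next_selector": "a.pagination-next",
  "item_selector": "article.post",
  "fields": {
    "title": "h2 a",
    "date": "time",
    "excerpt": ".summary"
  },
  "max_pages": 3
}
```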

Rate Limiting and Politeness

If you are building tools that hit other people's servers, you need to be a good citizen. This means two things: respecting robots.txt and rate limiting your requests.

var robotsCache = {};

function checkRobotsTxt(targetUrl) {
  var parsed = new URL(targetUrl);
  var robotsUrl = parsed.origin + "/robots.txt";

  if (robotsCache[parsed.origin]) {
    var robot = robotsCache[parsed.origin];
    return Promise.resolve(robot.isAllowed(targetUrl, "MCPWebScraper"));
  }

  return fetchUrl(robotsUrl).then(function(robotsTxt) {
    var robot = robotsParser(robotsUrl, robotsTxt);
    robotsCache[parsed.origin] = robot;
    return robot.isAllowed(targetUrl, "MCPWebScraper");
  }).catch(function() {
    // No robots.txt means everything is allowed
    return true;
  });
}

function enforceRateLimit(targetUrl) {
  var parsed = new URL(targetUrl);
  var domain = parsed.hostname;
  var now = Date.now();
  var minInterval = 2000; // 2 seconds between requests to same domain

  if (rateLimitMap[domain] && (now - rateLimitMap[domain]) < minInterval) {
    var waitTime = minInterval - (now - rateLimitMap[domain]);
    return new Promise(function(resolve) {
      setTimeout(function() {
        rateLimitMap[domain] = Date.now();
        resolve();
      }, waitTime);
    });
  }

  rateLimitMap[domain] = now;
  return Promise.resolve();
}

The rate limiter enforces a minimum 2-second gap between requests to the same domain. This is conservative enough to avoid getting blocked on most sites while still being practical for multi-page scrapes. For robots.txt, we cache the parsed result per origin so we only fetch it once per domain.

If a site publishes a Crawl-delay directive in its robots.txt, you should honor that too. The robots-parser library exposes it via robot.getCrawlDelay("MCPWebScraper"). Here is a helper that resolves the per-domain delay; to integrate it, substitute a getRateLimit(targetUrl) call for the hard-coded minInterval in enforceRateLimit:

function getRateLimit(targetUrl) {
  var parsed = new URL(targetUrl);
  var origin = parsed.origin;

  if (robotsCache[origin]) {
    var crawlDelay = robotsCache[origin].getCrawlDelay("MCPWebScraper");
    if (crawlDelay) return crawlDelay * 1000; // convert seconds to ms
  }

  return 2000; // default 2 seconds
}
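The waiting math itself is easy to unit-test in isolation. A pure sketch (computeWait is a name introduced here) that combines the last-seen timestamp with a per-origin minimum interval such as the one getRateLimit returns:

```javascript
// Given when a domain was last requested and the minimum interval for it,
// return how many milliseconds to sleep before the next request (0 = go now).
function computeWait(lastRequestMs, nowMs, minIntervalMs) {
  if (lastRequestMs === undefined) return 0; // never seen this domain
  var elapsed = nowMs - lastRequestMs;
  return elapsed >= minIntervalMs ? 0 : minIntervalMs - elapsed;
}
```

Factoring the arithmetic out like this also makes it trivial to swap in a Crawl-delay-derived interval per origin without touching the timer code.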

Caching Scraped Results as MCP Resources

MCP resources let you expose data that Claude can reference without re-fetching. This is perfect for scraped results that should persist across a conversation.

var scrapedResources = {};

server.resource(
  "scraped-pages",
  "scraped://pages/{url}",
  function(uri) {
    var targetUrl = decodeURIComponent(uri.pathname.slice(1));
    var data = scrapedResources[targetUrl];

    if (!data) {
      return {
        contents: [{
          uri: uri.toString(),
          text: JSON.stringify({ error: "No cached data for this URL" }),
          mimeType: "application/json"
        }]
      };
    }

    return {
      contents: [{
        uri: uri.toString(),
        text: JSON.stringify(data, null, 2),
        mimeType: "application/json"
      }]
    };
  }
);

// Utility to save scraped data as a resource
function saveAsResource(targetUrl, data) {
  scrapedResources[targetUrl] = {
    url: targetUrl,
    scraped_at: new Date().toISOString(),
    data: data
  };
}

When Claude scrapes a page and extracts data, the server can save it as a resource. Later in the conversation, Claude can reference scraped://pages/https%3A%2F%2Fexample.com to access the data without re-scraping. This is especially useful when Claude needs to compare data from multiple pages or revisit earlier results after exploring new ones.
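Because the resource handler decodes the path with decodeURIComponent, the URI round-trip is worth pinning down. A small sketch; resourceUri and urlFromResourceUri are helper names introduced here for illustration:

```javascript
// Build the resource URI for a scraped page, and recover the original URL
// the same way the resource handler does (decodeURIComponent on the path).
function resourceUri(targetUrl) {
  return "scraped://pages/" + encodeURIComponent(targetUrl);
}

function urlFromResourceUri(uri) {
  return decodeURIComponent(uri.slice("scraped://pages/".length));
}
```

Encoding the full URL keeps slashes and query strings from being misread as URI path segments.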

You can also expose a resource listing that shows all cached pages:

server.resource(
  "scraped-index",
  "scraped://index",
  function(uri) {
    var index = Object.keys(scrapedResources).map(function(key) {
      return {
        url: key,
        scraped_at: scrapedResources[key].scraped_at,
        fields: Object.keys(scrapedResources[key].data || {})
      };
    });

    return {
      contents: [{
        uri: uri.toString(),
        text: JSON.stringify({ total: index.length, pages: index }, null, 2),
        mimeType: "application/json"
      }]
    };
  }
);

Error Handling for Unreliable Web Targets

Web scraping is inherently unreliable. Pages go down, structures change, rate limits kick in, and CAPTCHAs appear. Good error handling is not optional.

function withRetry(fn, maxRetries, delayMs) {
  var retries = 0;

  function attempt() {
    return fn().catch(function(err) {
      retries++;
      if (retries >= maxRetries) {
        throw err;
      }

      var backoff = delayMs * Math.pow(2, retries - 1);
      console.error(
        "Retry " + retries + "/" + maxRetries +
        " after " + backoff + "ms: " + err.message
      );

      return new Promise(function(resolve) {
        setTimeout(resolve, backoff);
      }).then(attempt);
    });
  }

  return attempt();
}

function safeFetch(targetUrl) {
  return withRetry(function() {
    return fetchUrl(targetUrl);
  }, 3, 1000);
}
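To see the backoff in action, here is a self-contained run (withRetry repeated verbatim so the snippet executes standalone) against a stand-in function that fails twice before succeeding:

```javascript
function withRetry(fn, maxRetries, delayMs) {
  var retries = 0;
  function attempt() {
    return fn().catch(function(err) {
      retries++;
      if (retries >= maxRetries) throw err;
      var backoff = delayMs * Math.pow(2, retries - 1);
      return new Promise(function(resolve) {
        setTimeout(resolve, backoff);
      }).then(attempt);
    });
  }
  return attempt();
}

// A flaky operation: rejects on the first two calls, resolves on the third.
var attempts = 0;
function flaky() {
  attempts++;
  if (attempts < 3) return Promise.reject(new Error("transient failure " + attempts));
  return Promise.resolve("ok");
}

withRetry(flaky, 3, 10).then(function(result) {
  console.log(result + " after " + attempts + " attempts"); // "ok after 3 attempts"
});
```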

The retry wrapper uses exponential backoff: with a 1-second base delay and three attempts, it waits 1 second after the first failure and 2 seconds after the second before giving up. This handles transient network errors and temporary 503 responses. For a production MCP server, I also recommend wrapping every tool handler in a try-catch that returns a structured error rather than crashing:

function wrapToolHandler(handler) {
  return function(params) {
    try {
      var result = handler(params);
      if (result && typeof result.then === "function") {
        return result.catch(function(err) {
          return {
            content: [{
              type: "text",
              text: JSON.stringify({
                error: err.message,
                error_type: err.constructor.name,
                params: params
              })
            }]
          };
        });
      }
      return result;
    } catch (err) {
      return {
        content: [{
          type: "text",
          text: JSON.stringify({
            error: err.message,
            error_type: err.constructor.name,
            params: params
          })
        }]
      };
    }
  };
}

This ensures Claude always gets a structured JSON response, even when things go wrong. Claude can then decide whether to retry, try a different approach, or report the error to the user. Never let an unhandled exception crash your MCP server — a dead server is worse than an error message.

Returning Structured JSON from Scraped Content

The key insight behind this entire architecture is that Claude does not want HTML. It wants structured data. Every tool in this server is designed to move content one step closer to clean JSON.

Here is the typical flow Claude follows:

  1. fetch_page returns raw HTML (or fetch_rendered_page for SPAs)
  2. extract_data transforms that HTML into structured JSON using CSS selectors
  3. follow_links discovers additional URLs to scrape
  4. scrape_paginated automates multi-page extraction

The output from extract_data looks like this when scraping a product listing:

{
  "item_count": 3,
  "data": [
    {
      "title": { "text": "Wireless Keyboard", "href": "/products/kb-100", "src": null },
      "price": { "text": "$49.99", "href": null, "src": null },
      "image": { "text": "", "href": null, "src": "/images/kb-100.jpg" }
    },
    {
      "title": { "text": "Ergonomic Mouse", "href": "/products/ms-200", "src": null },
      "price": { "text": "$34.99", "href": null, "src": null },
      "image": { "text": "", "href": null, "src": "/images/ms-200.jpg" }
    },
    {
      "title": { "text": "USB-C Hub", "href": "/products/hub-50", "src": null },
      "price": { "text": "$29.99", "href": null, "src": null },
      "image": { "text": "", "href": null, "src": "/images/hub-50.jpg" }
    }
  ]
}

Claude can immediately reason about this data — compare prices, filter by criteria, generate summaries — without wading through thousands of lines of HTML markup. The href and src fields on every extracted element give Claude the option to follow links or reference images without a second extraction pass.
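Because the tool output is plain JSON, downstream reasoning reduces to ordinary data manipulation. A sketch using the sample rows above, flattening the extracted fields and sorting by numeric price; parsePrice is a helper introduced here for illustration:

```javascript
// Sample rows in the shape extract_data returns for a list_selector scrape.
var items = [
  { title: { text: "Wireless Keyboard", href: "/products/kb-100" }, price: { text: "$49.99" } },
  { title: { text: "Ergonomic Mouse", href: "/products/ms-200" }, price: { text: "$34.99" } },
  { title: { text: "USB-C Hub", href: "/products/hub-50" }, price: { text: "$29.99" } }
];

// Pull the first numeric run out of a price string like "$49.99".
function parsePrice(text) {
  var match = text.match(/\d+(\.\d+)?/);
  return match ? parseFloat(match[0]) : null;
}

var cheapestFirst = items
  .map(function(item) {
    return { title: item.title.text, url: item.title.href, price: parsePrice(item.price.text) };
  })
  .sort(function(a, b) { return a.price - b.price; });

console.log(cheapestFirst[0].title); // "USB-C Hub"
```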

Complete Working Example

Here is the full MCP server in a single file. Save this as scraper-server.js:

var { McpServer } = require("@modelcontextprotocol/sdk/server/mcp.js");
var { StdioServerTransport } = require("@modelcontextprotocol/sdk/server/stdio.js");
var cheerio = require("cheerio");
var puppeteer = require("puppeteer");
var robotsParser = require("robots-parser");
var NodeCache = require("node-cache");
var https = require("https");
var http = require("http");

// --- Configuration ---
var RATE_LIMIT_MS = 2000;
var CACHE_TTL = 300;
var MAX_RETRIES = 3;
var RETRY_DELAY = 1000;
var USER_AGENT = "MCPWebScraper/1.0 (compatible; research bot)";

// --- State ---
var cache = new NodeCache({ stdTTL: CACHE_TTL, checkperiod: 60 });
var rateLimitMap = {};
var robotsCache = {};
var browserInstance = null;
var scrapedResources = {};

// --- Utilities ---

function fetchUrl(targetUrl) {
  return new Promise(function(resolve, reject) {
    var protocol = targetUrl.startsWith("https") ? https : http;
    var options = {
      headers: {
        "User-Agent": USER_AGENT,
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9"
      }
    };

    protocol.get(targetUrl, options, function(res) {
      if (res.statusCode >= 300 && res.statusCode < 400 && res.headers.location) {
        var redirectUrl;
        try {
          redirectUrl = new URL(res.headers.location, targetUrl).toString();
        } catch (e) {
          return reject(new Error("Invalid redirect URL: " + res.headers.location));
        }
        return fetchUrl(redirectUrl).then(resolve).catch(reject);
      }
      if (res.statusCode !== 200) {
        res.resume();
        return reject(new Error("HTTP " + res.statusCode + " for " + targetUrl));
      }

      var chunks = [];
      res.on("data", function(chunk) { chunks.push(chunk); });
      res.on("end", function() { resolve(Buffer.concat(chunks).toString()); });
      res.on("error", reject);
    }).on("error", reject);
  });
}

function withRetry(fn, maxRetries, delayMs) {
  var retries = 0;
  function attempt() {
    return fn().catch(function(err) {
      retries++;
      if (retries >= maxRetries) throw err;
      var backoff = delayMs * Math.pow(2, retries - 1);
      console.error("Retry " + retries + "/" + maxRetries + " after " + backoff + "ms");
      return new Promise(function(resolve) {
        setTimeout(resolve, backoff);
      }).then(attempt);
    });
  }
  return attempt();
}

function checkRobotsTxt(targetUrl) {
  var parsed = new URL(targetUrl);
  var origin = parsed.origin;
  var robotsUrl = origin + "/robots.txt";

  if (robotsCache[origin]) {
    return Promise.resolve(robotsCache[origin].isAllowed(targetUrl, USER_AGENT));
  }

  return fetchUrl(robotsUrl).then(function(robotsTxt) {
    robotsCache[origin] = robotsParser(robotsUrl, robotsTxt);
    return robotsCache[origin].isAllowed(targetUrl, USER_AGENT);
  }).catch(function() {
    return true;
  });
}

function enforceRateLimit(targetUrl) {
  var domain = new URL(targetUrl).hostname;
  var now = Date.now();

  if (rateLimitMap[domain] && (now - rateLimitMap[domain]) < RATE_LIMIT_MS) {
    var waitTime = RATE_LIMIT_MS - (now - rateLimitMap[domain]);
    return new Promise(function(resolve) {
      setTimeout(function() {
        rateLimitMap[domain] = Date.now();
        resolve();
      }, waitTime);
    });
  }

  rateLimitMap[domain] = now;
  return Promise.resolve();
}

function getBrowser() {
  if (browserInstance) return Promise.resolve(browserInstance);
  return puppeteer.launch({
    headless: "new",
    args: ["--no-sandbox", "--disable-setuid-sandbox", "--disable-dev-shm-usage"]
  }).then(function(browser) {
    browserInstance = browser;
    return browser;
  });
}

// --- MCP Server Setup ---

var server = new McpServer({
  name: "web-scraper",
  version: "1.0.0"
});

// Tool: fetch_page
server.tool(
  "fetch_page",
  "Fetch a web page and return raw HTML. Fast, no JS rendering. ~200-500ms.",
  {
    url: { type: "string", description: "URL to fetch" },
    use_cache: { type: "boolean", description: "Use cached result if available" }
  },
  function(params) {
    var cacheKey = "page:" + params.url;

    if (params.use_cache !== false) {
      var cached = cache.get(cacheKey);
      if (cached) {
        return {
          content: [{ type: "text", text: JSON.stringify({
            source: "cache", url: params.url,
            html_length: cached.length, html: cached
          }) }]
        };
      }
    }

    return checkRobotsTxt(params.url).then(function(allowed) {
      if (!allowed) {
        return {
          content: [{ type: "text", text: JSON.stringify({
            error: "Blocked by robots.txt", url: params.url
          }) }]
        };
      }
      return enforceRateLimit(params.url).then(function() {
        return withRetry(function() { return fetchUrl(params.url); }, MAX_RETRIES, RETRY_DELAY);
      }).then(function(html) {
        cache.set(cacheKey, html);
        return {
          content: [{ type: "text", text: JSON.stringify({
            source: "live", url: params.url,
            html_length: html.length, html: html
          }) }]
        };
      });
    }).catch(function(err) {
      return {
        content: [{ type: "text", text: JSON.stringify({
          error: err.message, url: params.url
        }) }]
      };
    });
  }
);

// Tool: extract_data
server.tool(
  "extract_data",
  "Extract structured data from HTML using CSS selectors. Returns clean JSON.",
  {
    html: { type: "string", description: "HTML to parse" },
    selectors: {
      type: "object",
      description: "Named CSS selectors (key=field name, value=selector)",
      additionalProperties: { type: "string" }
    },
    list_selector: { type: "string", description: "CSS selector for repeating items" },
    limit: { type: "number", description: "Max items to return" }
  },
  function(params) {
    var $ = cheerio.load(params.html);
    var results = [];
    var limit = params.limit || 50;

    if (params.list_selector) {
      $(params.list_selector).each(function(i) {
        if (i >= limit) return false;
        var $el = $(this);
        var item = {};
        Object.keys(params.selectors).forEach(function(field) {
          var matched = $el.find(params.selectors[field]).first();
          item[field] = {
            text: matched.text().trim(),
            href: matched.attr("href") || null,
            src: matched.attr("src") || null
          };
        });
        results.push(item);
      });
    } else {
      Object.keys(params.selectors).forEach(function(field) {
        var matches = [];
        $(params.selectors[field]).each(function(i) {
          if (i >= limit) return false;
          matches.push({
            text: $(this).text().trim(),
            href: $(this).attr("href") || null,
            src: $(this).attr("src") || null
          });
        });
        results.push({ field: field, matches: matches });
      });
    }

    return {
      content: [{ type: "text", text: JSON.stringify({
        item_count: results.length, data: results
      }, null, 2) }]
    };
  }
);

// Tool: fetch_rendered_page
server.tool(
  "fetch_rendered_page",
  "Fetch a JavaScript-rendered page using headless Chrome. Slower (~3-8s) but handles SPAs.",
  {
    url: { type: "string", description: "URL to fetch" },
    wait_for: { type: "string", description: "CSS selector to wait for before capturing" },
    wait_timeout: { type: "number", description: "Max wait in ms" },
    scroll_to_bottom: { type: "boolean", description: "Scroll to trigger lazy loading" }
  },
  function(params) {
    var cacheKey = "rendered:" + params.url;
    var cached = cache.get(cacheKey);
    if (cached) {
      return {
        content: [{ type: "text", text: JSON.stringify({
          source: "cache", url: params.url,
          html_length: cached.length, html: cached
        }) }]
      };
    }

    return checkRobotsTxt(params.url).then(function(allowed) {
      if (!allowed) {
        return {
          content: [{ type: "text", text: JSON.stringify({
            error: "Blocked by robots.txt", url: params.url
          }) }]
        };
      }
      return enforceRateLimit(params.url).then(function() {
        return getBrowser();
      }).then(function(browser) {
        var page; // captured so the error path below can close it
        return browser.newPage().then(function(p) {
          page = p;
          var timeout = params.wait_timeout || 10000;
          return page.setUserAgent(USER_AGENT)
            .then(function() {
              return page.goto(params.url, {
                waitUntil: "networkidle2", timeout: timeout
              });
            })
            .then(function() {
              if (params.wait_for) {
                return page.waitForSelector(params.wait_for, { timeout: timeout });
              }
            })
            .then(function() {
              if (params.scroll_to_bottom) {
                return page.evaluate(function() {
                  return new Promise(function(resolve) {
                    var total = 0;
                    var timer = setInterval(function() {
                      window.scrollBy(0, 300);
                      total += 300;
                      if (total >= document.body.scrollHeight) {
                        clearInterval(timer);
                        resolve();
                      }
                    }, 200);
                  });
                });
              }
            })
            .then(function() { return page.content(); })
            .then(function(html) {
              cache.set(cacheKey, html);
              return page.close().then(function() {
                return {
                  content: [{ type: "text", text: JSON.stringify({
                    source: "live-rendered", url: params.url,
                    html_length: html.length, html: html
                  }) }]
                };
              });
            });
        }).catch(function(err) {
          // Close the page on failure so tabs do not pile up in the shared browser
          if (page) page.close().catch(function() {});
          throw err;
        });
      });
    }).catch(function(err) {
      return {
        content: [{ type: "text", text: JSON.stringify({
          error: err.message, url: params.url
        }) }]
      };
    });
  }
);

// Tool: follow_links
server.tool(
  "follow_links",
  "Extract links from HTML matching a regex pattern. Returns URLs and anchor text.",
  {
    html: { type: "string", description: "HTML to scan" },
    base_url: { type: "string", description: "Base URL for resolving relative links" },
    pattern: { type: "string", description: "Regex to filter hrefs" },
    selector: { type: "string", description: "CSS selector for links" },
    limit: { type: "number", description: "Max links" }
  },
  function(params) {
    var $ = cheerio.load(params.html);
    var links = [];
    var seen = {};
    var regex = params.pattern ? new RegExp(params.pattern) : null;
    var limit = params.limit || 25;

    $(params.selector || "a[href]").each(function() {
      if (links.length >= limit) return false;
      var href = $(this).attr("href");
      if (!href || href.startsWith("#") || href.startsWith("javascript:")) return;

      try {
        var resolved = new URL(href, params.base_url).toString();
      } catch (e) { return; }

      if (seen[resolved]) return;
      if (regex && !regex.test(resolved)) return;

      seen[resolved] = true;
      links.push({ url: resolved, text: $(this).text().trim() });
    });

    return {
      content: [{ type: "text", text: JSON.stringify({
        link_count: links.length, links: links
      }, null, 2) }]
    };
  }
);

// Tool: scrape_paginated
server.tool(
  "scrape_paginated",
  "Scrape multiple pages following pagination. Fetches each page, extracts items.",
  {
    start_url: { type: "string", description: "First page URL" },
    next_selector: { type: "string", description: "CSS selector for next-page link" },
    item_selector: { type: "string", description: "CSS selector for items" },
    fields: {
      type: "object",
      description: "Named selectors for item fields",
      additionalProperties: { type: "string" }
    },
    max_pages: { type: "number", description: "Max pages to scrape" }
  },
  function(params) {
    var allItems = [];
    var pageCount = 0;
    var maxPages = params.max_pages || 5;

    function scrapePage(pageUrl) {
      if (pageCount >= maxPages || !pageUrl) {
        return Promise.resolve({
          content: [{ type: "text", text: JSON.stringify({
            pages_scraped: pageCount,
            total_items: allItems.length,
            data: allItems
          }, null, 2) }]
        });
      }
      return enforceRateLimit(pageUrl).then(function() {
        return withRetry(function() { return fetchUrl(pageUrl); }, MAX_RETRIES, RETRY_DELAY);
      }).then(function(html) {
        pageCount++;
        var $ = cheerio.load(html);
        $(params.item_selector).each(function() {
          var $el = $(this);
          var item = { _page: pageCount, _source_url: pageUrl };
          Object.keys(params.fields).forEach(function(field) {
            var matched = $el.find(params.fields[field]).first();
            item[field] = matched.text().trim();
            var href = matched.attr("href");
            if (href) {
              // Resolve relative links against the page they came from
              try { item[field + "_url"] = new URL(href, pageUrl).toString(); }
              catch (e) { item[field + "_url"] = href; }
            }
          });
          allItems.push(item);
        });
        var nextLink = $(params.next_selector).attr("href");
        if (nextLink) {
          try {
            return scrapePage(new URL(nextLink, pageUrl).toString());
          } catch (e) { /* stop */ }
        }
        return {
          content: [{ type: "text", text: JSON.stringify({
            pages_scraped: pageCount,
            total_items: allItems.length,
            data: allItems
          }, null, 2) }]
        };
      }).catch(function(err) {
        return {
          content: [{ type: "text", text: JSON.stringify({
            error: err.message,
            pages_scraped: pageCount,
            partial_data: allItems
          }, null, 2) }]
        };
      });
    }

    return scrapePage(params.start_url);
  }
);

// Resource: cached scraped results
server.resource(
  "scraped-pages",
  "scraped://pages/{url}",
  function(uri) {
    var targetUrl = decodeURIComponent(uri.pathname.slice(1));
    var data = scrapedResources[targetUrl] || { error: "No cached data" };
    return {
      contents: [{
        uri: uri.toString(),
        text: JSON.stringify(data, null, 2),
        mimeType: "application/json"
      }]
    };
  }
);

// --- Start Server ---

var transport = new StdioServerTransport();
server.connect(transport).then(function() {
  console.error("Web Scraper MCP Server running on stdio");
});

// Cleanup on exit
process.on("SIGINT", function() {
  var closing = browserInstance ? browserInstance.close() : Promise.resolve();
  // Wait for Chromium to shut down so no orphaned processes are left behind
  closing.then(function() { process.exit(0); }, function() { process.exit(0); });
});

Project Structure

Your final project directory should look like this:

scraper-mcp-server/
  package.json
  scraper-server.js
  node_modules/

Total disk footprint is roughly 350MB due to Puppeteer's bundled Chromium. If you do not need the fetch_rendered_page tool, swap puppeteer for puppeteer-core and point it at an existing Chrome installation to save ~280MB.

Claude Desktop Configuration

Add this to your claude_desktop_config.json:

{
  "mcpServers": {
    "web-scraper": {
      "command": "node",
      "args": ["C:/projects/scraper-mcp-server/scraper-server.js"]
    }
  }
}

On macOS, the config file lives at ~/Library/Application Support/Claude/claude_desktop_config.json. On Windows, it is at %APPDATA%\Claude\claude_desktop_config.json. (Community Linux builds generally read ~/.config/Claude/claude_desktop_config.json.)

Testing It

Restart Claude Desktop after updating the config. Then try prompts like:

  • "Fetch the Hacker News front page and extract the top 10 story titles and links"
  • "Scrape the first 3 pages of results from this site and extract product names and prices"
  • "Fetch this React app page using the rendered page tool, waiting for the .content div to load"

Claude will chain the tools together: fetch the page, extract links or data, follow pagination, and present structured results.

Common Issues and Troubleshooting

1. Puppeteer Fails to Launch

Error: Failed to launch the browser process!
/home/user/.cache/puppeteer/chrome/linux-xxx/chrome-linux64/chrome:
error while loading shared libraries: libnss3.so: cannot open shared object file

This happens on Linux servers missing Chrome's system dependencies. Fix it:

# Debian/Ubuntu
sudo apt-get install -y libx11-xcb1 libxcomposite1 libxdamage1 \
  libxrandr2 libgbm1 libasound2 libpangocairo-1.0-0 libatk1.0-0 \
  libcups2 libnss3 libxss1

# Or use puppeteer's built-in installer
npx puppeteer browsers install chrome

On Windows, if you see Error: Failed to launch the browser process! spawn UNKNOWN, make sure the Chromium binary was downloaded correctly by running npx puppeteer browsers install chrome in your project directory.

2. ETIMEDOUT on Certain Domains

Error: connect ETIMEDOUT 104.26.10.78:443

Some sites block datacenter IP ranges or require specific TLS configurations. Three options:

  • Increase your timeout: { timeout: 30000 }
  • Add a custom TLS configuration to the fetch options
  • Use fetch_rendered_page instead, as Puppeteer's browser handles TLS differently than Node's https module
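For the first option, rather than threading a timeout through every call site, you can wrap any promise-returning call with a generic timeout. withTimeout below is a hypothetical helper, not part of the server code above; it is a minimal sketch in the same promise-chain style:

```javascript
// Reject a slow operation after a deadline instead of hanging indefinitely.
function withTimeout(promiseFactory, ms) {
  return new Promise(function(resolve, reject) {
    var timer = setTimeout(function() {
      reject(new Error("Timed out after " + ms + "ms"));
    }, ms);
    promiseFactory().then(
      function(value) { clearTimeout(timer); resolve(value); },
      function(err) { clearTimeout(timer); reject(err); }
    );
  });
}

// Usage sketch: give slow domains 30 seconds instead of the default
// withTimeout(function() { return fetchUrl(params.url); }, 30000);
```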

3. Empty Extraction Results

{ "item_count": 0, "data": [] }

This usually means the page loads content via JavaScript and you used fetch_page instead of fetch_rendered_page. Check by viewing the page source in your browser (Ctrl+U). If the content is not in the source HTML, you need the Puppeteer tool.

Another common cause: your CSS selectors are wrong. Test them in Chrome DevTools first using document.querySelectorAll(".your-selector") in the console.
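If you want to automate that view-source check, a rough heuristic can flag pages that are probably client-rendered. looksJsRendered is an illustrative helper, not part of the server, and the thresholds are rough assumptions rather than tuned values:

```javascript
// Heuristic: several <script> tags plus almost no visible text usually
// means a client-rendered SPA, so fetch_rendered_page is the better tool.
function looksJsRendered(html) {
  var scriptCount = (html.match(/<script\b/gi) || []).length;
  var visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  return scriptCount >= 3 && visibleText.length < 200;
}
```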

4. Rate Limiting / HTTP 429 Responses

Error: HTTP 429 for https://api.example.com/search?q=test

The target site is telling you to slow down. The built-in 2-second delay might not be enough. Increase RATE_LIMIT_MS for aggressive sites:

var RATE_LIMIT_MS = 5000; // 5 seconds between requests

For APIs that return Retry-After headers, you could parse that value and wait accordingly before the next request.
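Per RFC 9110, Retry-After carries either a number of seconds or an HTTP-date. A sketch of that parsing, with parseRetryAfter as a hypothetical helper you would call from the 429 handler:

```javascript
// Convert a Retry-After header value into a wait in milliseconds,
// falling back to a caller-supplied delay when the header is absent or bogus.
function parseRetryAfter(headerValue, fallbackMs) {
  if (!headerValue) return fallbackMs || 0;
  var seconds = Number(headerValue);
  if (!isNaN(seconds)) return Math.max(0, seconds * 1000); // delay-seconds form
  var date = Date.parse(headerValue);
  if (!isNaN(date)) return Math.max(0, date - Date.now()); // HTTP-date form
  return fallbackMs || 0;
}

// Usage sketch: on a 429 response,
// wait parseRetryAfter(res.headers["retry-after"], RATE_LIMIT_MS) before retrying.
```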

5. Memory Leaks with Puppeteer

If you notice memory growing over time, it is usually because browser pages are not being closed. Always close pages on both the success and the error path:

var page;
return browser.newPage()
  .then(function(p) { page = p; return page.goto(pageUrl); })
  .then(function() { return page.content(); })
  .then(function(html) {
    return page.close().then(function() { return html; });
  })
  .catch(function(err) {
    if (page) page.close();
    throw err;
  });

Monitor memory with process.memoryUsage().heapUsed. If it grows past 500MB, consider recycling the browser instance entirely by calling browserInstance.close() and setting browserInstance = null.
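That recycling logic can be sketched as a small check run between requests. This assumes browserInstance is the shared Puppeteer browser from the server above (declared here only so the sketch is self-contained), and the 500MB threshold is an assumption you should tune:

```javascript
var browserInstance = null; // in the real server, the shared Puppeteer browser
var HEAP_LIMIT_BYTES = 500 * 1024 * 1024; // ~500MB, an assumed threshold

function shouldRecycleBrowser(heapUsedBytes) {
  return heapUsedBytes > HEAP_LIMIT_BYTES;
}

function maybeRecycleBrowser() {
  if (browserInstance && shouldRecycleBrowser(process.memoryUsage().heapUsed)) {
    var closing = browserInstance.close();
    browserInstance = null; // the next getBrowser() call launches a fresh instance
    return closing;
  }
  return Promise.resolve();
}
```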

6. Character Encoding Issues

Garbled output: â€œ and â€ appearing in place of “ and ” quotes

This happens when a page serves content in a non-UTF-8 encoding but we decode it as UTF-8. Check the Content-Type header for charset info and decode accordingly:

var iconv = require("iconv-lite");

// In the response handler:
var contentType = res.headers["content-type"] || "";
var charsetMatch = contentType.match(/charset=([^\s;]+)/i);
var encoding = charsetMatch ? charsetMatch[1] : "utf-8";
var html = iconv.decode(Buffer.concat(chunks), encoding);

Install iconv-lite with npm install iconv-lite if you encounter this on non-English sites.
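Some servers omit the charset from the header entirely and declare it only in the markup. A sketch of a fallback that also checks the <meta> declaration; detectCharset is a hypothetical helper, not part of the server above:

```javascript
// Pick a charset: Content-Type header first, then a <meta charset> or
// <meta http-equiv> declaration in the document, then UTF-8 as a default.
function detectCharset(contentTypeHeader, htmlHead) {
  var headerMatch = (contentTypeHeader || "").match(/charset=([^\s;]+)/i);
  if (headerMatch) return headerMatch[1].toLowerCase();
  var metaMatch = (htmlHead || "").match(/<meta[^>]+charset=["']?([^"'\s;>]+)/i);
  if (metaMatch) return metaMatch[1].toLowerCase();
  return "utf-8"; // sensible default for the modern web
}
```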

Best Practices

  • Always check robots.txt first. It is both ethical and practical. Sites that block scrapers will also block you at the IP level if you ignore their directives, making all subsequent requests fail.

  • Cache aggressively. Web pages change infrequently within a single conversation. A 5-minute cache TTL eliminates redundant requests and makes tool chaining instant. For longer sessions, consider a 15-minute TTL.

  • Return structured JSON, not raw HTML. Claude works dramatically better with clean JSON than with raw HTML. The extract_data tool exists specifically for this purpose — use it as the second step after any fetch.

  • Set hard limits on pagination. Never let max_pages exceed a reasonable bound. Five pages is a good default. Ten is the practical maximum for most use cases. A runaway pagination loop can fire hundreds of requests before you notice.

  • Use fetch_page by default, fetch_rendered_page only when needed. The Puppeteer tool is 10-20x slower and consumes significantly more memory. Only reach for it when you know the target page requires JavaScript rendering. Most content sites, blogs, and documentation pages work fine with a simple HTTP fetch.

  • Include timing and size metadata in responses. When Claude sees html_length: 145832, it knows it is dealing with a large page and can adjust its extraction strategy. When it sees source: "cache", it knows the data might be slightly stale.

  • Handle partial failures gracefully in paginated scraping. If page 3 of 5 fails, return the data from pages 1-2 along with the error. Partial data is almost always more useful than no data.

  • Use descriptive tool descriptions. The tool description is what Claude reads to decide which tool to use. Include performance characteristics ("200-500ms" vs "3-8s") so Claude can make informed choices.

  • Sanitize output size. Some pages are enormous. Consider truncating HTML beyond a reasonable limit (500KB) or stripping out scripts, styles, and SVGs before returning content to Claude. This keeps context windows manageable.

  • Log to stderr, not stdout. MCP servers communicate over stdout. Any logging must go to stderr via console.error(). Writing to stdout will corrupt the JSON-RPC protocol stream and crash the connection.
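The output-sanitization bullet above can be sketched as a small pre-return filter. stripAndTruncate is a hypothetical helper you could apply to HTML before putting it in a tool response; the 500KB cap matches the suggestion above but is otherwise an assumption:

```javascript
var MAX_HTML_BYTES = 500 * 1024; // ~500KB cap before returning HTML to Claude

// Strip bulky non-content markup, then cap the size with a visible marker.
function stripAndTruncate(html) {
  var cleaned = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<svg[\s\S]*?<\/svg>/gi, "")
    .replace(/<!--[\s\S]*?-->/g, "");
  if (cleaned.length > MAX_HTML_BYTES) {
    cleaned = cleaned.slice(0, MAX_HTML_BYTES) +
      "\n<!-- truncated at " + MAX_HTML_BYTES + " bytes -->";
  }
  return cleaned;
}
```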
