Automated Testing with LLMs
Generate automated tests with LLMs including unit tests, integration tests, edge cases, and test data for Node.js applications.
Large language models are remarkably good at reading source code and producing structured output, which makes them a natural fit for automated test generation. Instead of treating LLMs as a replacement for your testing discipline, you can use them as a force multiplier that drafts test suites, identifies edge cases you missed, and generates realistic test fixtures -- all while keeping a human in the loop for validation. This article walks through practical techniques for integrating LLMs into your Node.js testing workflow, from generating individual unit tests to building a full test generation pipeline that plugs into CI/CD.
Prerequisites
- Node.js 18 or later installed
- An OpenAI API key (or equivalent LLM provider)
- Familiarity with Mocha, Chai, and Express.js
- Basic understanding of how LLM APIs work (prompt in, completion out)
- A working Express.js application you want to test
Install the dependencies you will need:
npm install openai mocha chai sinon supertest --save-dev
How LLMs Can Augment Testing Workflows
The testing problem in most organizations is not that engineers refuse to write tests. It is that writing thorough tests is tedious and time-consuming, so test suites end up shallow. LLMs change the economics of test authoring. A model can read a 200-line module and produce 40 test cases in under a minute, including edge cases that would take a human 30 minutes to reason through.
The key insight is that LLMs are not replacing your judgment about what to test. They are doing the mechanical work of translating specifications into executable assertions. You still review the output, but you start from a draft instead of a blank file.
There are several places in the testing lifecycle where LLMs add real value:
- Unit test generation from source code analysis
- Integration test scaffolding from API specifications
- Edge case identification that humans systematically overlook
- Test data generation with realistic, varied fixtures
- Failure analysis that interprets stack traces and suggests fixes
- Mutation testing where the LLM generates targeted code mutations
Let us walk through each of these in detail.
Generating Unit Tests from Source Code
The most straightforward use case is feeding a function's source code to an LLM and asking for unit tests. The model can infer the expected behavior from function names, parameter names, conditional branches, and return values.
var fs = require("fs");
var OpenAI = require("openai");
var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
function generateUnitTests(sourceCode, functionName, callback) {
var prompt = [
"You are an expert Node.js test engineer. Generate a complete Mocha/Chai test suite for the following function.",
"Use describe/it blocks. Use chai expect assertions. Cover happy paths, error cases, and edge cases.",
"Use var, function(), and require() syntax. No const, let, or arrow functions.",
"Return ONLY the test code, no explanation.",
"",
"Function name: " + functionName,
"Source code:",
"```javascript",
sourceCode,
"```"
].join("\n");
client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: prompt }],
temperature: 0.2
}).then(function(response) {
var testCode = response.choices[0].message.content;
testCode = testCode.replace(/```javascript\n?/g, "").replace(/```\n?/g, "");
callback(null, testCode);
}).catch(function(err) {
callback(err);
});
}
The temperature setting matters. At 0.2, the model produces consistent, conventional test patterns. At higher temperatures you get more creative edge cases but also more hallucinated assertions. I recommend 0.2 for unit tests and 0.5 for edge case brainstorming.
Generating Integration Tests from API Specifications
If you have an OpenAPI specification or even just route handler source code, an LLM can generate supertest-based integration tests that exercise your endpoints with realistic payloads.
function generateIntegrationTests(openApiSpec, callback) {
var prompt = [
"Generate Mocha/Chai integration tests using supertest for the following API specification.",
"Test each endpoint with valid requests, invalid requests, missing required fields, and authentication failures.",
"Use var, function(), and require() syntax throughout.",
"Include setup and teardown hooks where appropriate.",
"",
"API Specification:",
typeof openApiSpec === "string" ? openApiSpec : JSON.stringify(openApiSpec, null, 2)
].join("\n");
client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: prompt }],
temperature: 0.2
}).then(function(response) {
var testCode = response.choices[0].message.content;
testCode = testCode.replace(/```javascript\n?/g, "").replace(/```\n?/g, "");
callback(null, testCode);
}).catch(function(err) {
callback(err);
});
}
For Express.js applications, you can also extract route definitions programmatically by reading the route files and passing them directly to the LLM. This approach works even when you do not have a formal API specification, which is most real-world projects.
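For example, here is a minimal sketch, assuming your route handlers live in a routes/ directory (an illustrative path): it concatenates the route source files and passes the result to generateIntegrationTests in place of a formal spec.
var fs = require("fs");
var path = require("path");

var routeDir = "routes"; // illustrative directory name
var routeSources = fs.readdirSync(routeDir)
  .filter(function(f) { return f.endsWith(".js"); })
  .map(function(f) {
    // Label each file so the model can attribute routes to modules.
    return "// File: " + f + "\n" + fs.readFileSync(path.join(routeDir, f), "utf8");
  })
  .join("\n\n");

// The prompt still labels the input an "API Specification", which works well
// enough in practice when you hand it raw route source instead.
generateIntegrationTests(routeSources, function(err, testCode) {
  if (err) return console.error(err);
  fs.mkdirSync("test/integration", { recursive: true });
  fs.writeFileSync("test/integration/generated.test.js", testCode);
});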
Test Case Generation from Requirements
User stories and requirements documents are another excellent input for LLM-driven test generation. The model translates natural language acceptance criteria into executable test cases.
function generateTestsFromRequirements(requirements, callback) {
var prompt = [
"Convert the following user story and acceptance criteria into Mocha/Chai test cases.",
"Generate both positive and negative test scenarios.",
"For each acceptance criterion, generate at least 2 test cases.",
"Use descriptive test names that reflect the business requirement.",
"",
"Requirements:",
requirements
].join("\n");
client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: prompt }],
temperature: 0.3
}).then(function(response) {
callback(null, response.choices[0].message.content);
}).catch(function(err) {
callback(err);
});
}
This is especially powerful in teams practicing BDD, where you already have structured requirements. The LLM bridges the gap between the Gherkin-style acceptance criteria and the actual test implementation.
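As a usage sketch, any Gherkin-style text you already maintain can be passed through unchanged; the story below is purely illustrative.
var fs = require("fs");

var story = [
  "As a registered user, I want to reset my password via email.",
  "Acceptance criteria:",
  "Given a registered email, when I request a reset, then a reset link is emailed.",
  "Given an unregistered email, when I request a reset, then the API still returns 200 (no account enumeration).",
  "Given an expired reset token, when I submit a new password, then the API returns 410."
].join("\n");

generateTestsFromRequirements(story, function(err, testCode) {
  if (err) return console.error(err);
  fs.writeFileSync("test/password-reset.test.js", testCode);
});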
Using LLMs to Identify Edge Cases Humans Miss
This is where LLMs genuinely shine. Humans are terrible at systematic edge case enumeration because we get anchored on the happy path. An LLM can analyze a function and produce a comprehensive list of boundary conditions, type coercion pitfalls, and failure modes.
function identifyEdgeCases(sourceCode, callback) {
var prompt = [
"Analyze the following Node.js function and identify ALL edge cases that should be tested.",
"Consider: null/undefined inputs, empty strings, empty arrays, negative numbers, zero,",
"very large inputs, special characters, concurrent access, type coercion, prototype pollution,",
"encoding issues, timezone problems, floating point precision, and any domain-specific edge cases.",
"",
"For each edge case, provide:",
"1. A description of the scenario",
"2. The specific input values",
"3. The expected behavior",
"4. Why this edge case matters",
"",
"Source code:",
"```javascript",
sourceCode,
"```"
].join("\n");
client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: prompt }],
temperature: 0.5
}).then(function(response) {
callback(null, response.choices[0].message.content);
}).catch(function(err) {
callback(err);
});
}
In my experience, an LLM will consistently find 3-5 edge cases per function that a senior engineer would have missed. The most common categories are Unicode handling, empty collection inputs, and off-by-one errors in pagination logic.
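A simple way to use this output is as a review checklist rather than executable code. A brief usage sketch, with illustrative file paths:
var fs = require("fs");

var source = fs.readFileSync("utils/pagination.js", "utf8"); // hypothetical module
identifyEdgeCases(source, function(err, analysis) {
  if (err) return console.error(err);
  fs.mkdirSync("test/notes", { recursive: true });
  fs.writeFileSync("test/notes/pagination-edge-cases.md", analysis);
  console.log("Edge case checklist written for review");
});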
Implementing a Test Generation Pipeline
A production-grade test generation system is more than a single API call. You need a pipeline that reads code, analyzes it, generates tests, validates they actually compile and run, and outputs the results. Here is the architecture:
Source Code --> Parse & Extract Functions --> Generate Tests per Function
--> Validate Syntax --> Run Tests --> Filter Passing Tests --> Output
var fs = require("fs");
var path = require("path");
var childProcess = require("child_process");
var OpenAI = require("openai");
var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
function TestGenerationPipeline(options) {
this.sourceDir = options.sourceDir;
this.outputDir = options.outputDir;
this.model = options.model || "gpt-4o";
this.maxRetries = options.maxRetries || 2;
}
TestGenerationPipeline.prototype.run = function(callback) {
var self = this;
var sourceFiles = fs.readdirSync(this.sourceDir).filter(function(f) {
return f.endsWith(".js");
});
var results = { generated: 0, validated: 0, failed: 0, files: [] };
var index = 0;
function processNext() {
if (index >= sourceFiles.length) {
return callback(null, results);
}
var file = sourceFiles[index];
index++;
var sourceCode = fs.readFileSync(path.join(self.sourceDir, file), "utf8");
self.generateTests(sourceCode, file, function(err, testCode) {
if (err) {
results.failed++;
return processNext();
}
var testFile = path.join(self.outputDir, file.replace(".js", ".test.js"));
fs.writeFileSync(testFile, testCode);
self.validateTests(testFile, 0, function(validationErr, passed) {
if (passed) {
results.validated++;
results.files.push(testFile);
} else {
results.failed++;
fs.unlinkSync(testFile);
}
results.generated++;
processNext();
});
});
}
processNext();
};
TestGenerationPipeline.prototype.generateTests = function(sourceCode, fileName, callback) {
var prompt = [
"Generate a complete Mocha/Chai test suite for the following Node.js module.",
"The module is from file: " + fileName,
"Requirements:",
"- Use var, function(), require() syntax only",
"- Use chai expect style assertions",
"- Mock external dependencies with sinon",
"- Cover happy paths, error handling, and edge cases",
"- Each describe block should have at least 3 test cases",
"- Include proper before/after hooks for setup and teardown",
"- Return ONLY valid JavaScript, no markdown fences",
"",
"Source code:",
sourceCode
].join("\n");
client.chat.completions.create({
model: this.model,
messages: [{ role: "user", content: prompt }],
temperature: 0.2
}).then(function(response) {
var code = response.choices[0].message.content;
code = code.replace(/```javascript\n?/g, "").replace(/```\n?/g, "");
callback(null, code);
}).catch(function(err) {
callback(err);
});
};
TestGenerationPipeline.prototype.validateTests = function(testFile, attempt, callback) {
var self = this;
childProcess.exec("npx mocha " + testFile + " --dry-run --timeout 5000", function(err) {
if (!err) {
return callback(null, true);
}
if (attempt < self.maxRetries) {
var testCode = fs.readFileSync(testFile, "utf8");
var fixPrompt = [
"The following test file has errors. Fix them and return ONLY the corrected JavaScript code.",
"Error: " + err.message,
"",
"Test code:",
testCode
].join("\n");
client.chat.completions.create({
model: self.model,
messages: [{ role: "user", content: fixPrompt }],
temperature: 0.1
}).then(function(response) {
var fixed = response.choices[0].message.content;
fixed = fixed.replace(/```javascript\n?/g, "").replace(/```\n?/g, "");
fs.writeFileSync(testFile, fixed);
self.validateTests(testFile, attempt + 1, callback);
}).catch(function() {
callback(null, false);
});
} else {
callback(null, false);
}
});
};
module.exports = TestGenerationPipeline;
The retry loop is critical. LLM-generated tests frequently have minor issues on the first attempt -- missing requires, wrong module paths, or incorrect mock setups. Feeding the error back to the model for a second pass fixes the majority of these issues.
Generating Test Data and Fixtures with LLMs
Realistic test data is one of the most underappreciated aspects of good tests. An LLM can generate diverse, realistic fixtures that cover a broader range of inputs than hand-written test data.
function generateTestFixtures(schemaDescription, count, callback) {
var prompt = [
"Generate " + count + " realistic test fixtures as a JSON array for the following data schema.",
"Make the data diverse: include edge cases like very long strings, special characters,",
"Unicode, empty optional fields, boundary values for numbers, and realistic variations.",
"Return ONLY valid JSON, no explanation.",
"",
"Schema:",
schemaDescription
].join("\n");
client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: prompt }],
temperature: 0.7,
response_format: { type: "json_object" }
}).then(function(response) {
try {
var parsed = JSON.parse(response.choices[0].message.content);
// JSON mode returns an object, so unwrap the fixtures array.
callback(null, parsed.fixtures || parsed);
} catch (e) {
callback(new Error("Failed to parse fixture JSON: " + e.message));
}
}).catch(function(err) {
callback(err);
});
}
// Usage
generateTestFixtures(
"User object: { name: string, email: string, age: number (18-120), role: 'admin'|'user'|'moderator', bio: string (optional, max 500 chars) }",
20,
function(err, fixtures) {
if (err) return console.error(err);
fs.writeFileSync("test/fixtures/users.json", JSON.stringify(fixtures, null, 2));
console.log("Generated " + Object.keys(fixtures).length + " test fixtures");
}
);
The temperature of 0.7 is deliberate here. You want diversity in test data, not deterministic repetition. Higher temperature produces more varied and interesting edge case data.
Using LLMs for Test Result Analysis
When tests fail, the stack trace and assertion output can be cryptic, especially in integration tests with complex setup. An LLM can parse test output and provide actionable analysis.
function analyzeTestFailures(testOutput, sourceCode, callback) {
var prompt = [
"Analyze the following test failure output and provide:",
"1. Root cause analysis - what specifically is failing and why",
"2. Whether this is a test bug or a code bug",
"3. A specific fix with code changes",
"4. Whether any other tests might be affected by the same issue",
"",
"Test output:",
testOutput,
"",
"Source code under test:",
sourceCode
].join("\n");
client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: prompt }],
temperature: 0.1
}).then(function(response) {
callback(null, response.choices[0].message.content);
}).catch(function(err) {
callback(err);
});
}
This is particularly valuable in CI environments where a developer sees a red build and needs to quickly understand what broke. Piping the test output through an LLM analysis step can save significant debugging time.
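One way to wire this up is a small CI helper that runs the suite and only invokes the analysis on failure. A sketch, assuming you point it at whichever source file is under suspicion (the path here is illustrative):
var fs = require("fs");
var childProcess = require("child_process");

childProcess.exec("npx mocha --reporter spec", function(err, stdout, stderr) {
  if (!err) {
    console.log("All tests passed");
    return;
  }
  // Illustrative: in practice, pick the file(s) related to the failing suite.
  var source = fs.readFileSync("routes/articles.js", "utf8");
  analyzeTestFailures(stdout + "\n" + stderr, source, function(analysisErr, analysis) {
    if (!analysisErr) {
      console.log("=== LLM failure analysis ===\n" + analysis);
    }
    process.exit(1); // keep the build red; the analysis is advisory
  });
});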
Mutation Testing Assisted by LLMs
Traditional mutation testing tools make random syntactic changes to your code and check whether your tests catch them. LLMs can generate smarter, more semantically meaningful mutations that are more likely to represent real bugs.
function generateMutations(sourceCode, callback) {
var prompt = [
"Generate 10 realistic code mutations for the following function.",
"Each mutation should represent a plausible bug a developer might introduce.",
"Focus on:",
"- Off-by-one errors",
"- Wrong comparison operators (< vs <=, === vs ==)",
"- Missing null checks",
"- Incorrect default values",
"- Swapped function arguments",
"- Missing await/callback calls",
"",
"Return each mutation as a JSON object with: description, originalLine, mutatedLine, lineNumber",
"Return a JSON array, no explanation.",
"",
"Source code:",
sourceCode
].join("\n");
client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: prompt }],
temperature: 0.4,
response_format: { type: "json_object" }
}).then(function(response) {
try {
var parsed = JSON.parse(response.choices[0].message.content);
// JSON mode returns an object, so unwrap the mutations array.
callback(null, parsed.mutations || parsed);
} catch (e) {
callback(new Error("Failed to parse mutations: " + e.message));
}
}).catch(function(err) {
callback(err);
});
}
The semantic mutations an LLM generates are far more useful than random token flipping. A traditional mutation tool might change + to -, which is trivially caught. An LLM might change users.filter(u => u.active) to users.filter(u => u.active && u.verified), which tests a real business logic boundary.
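To turn the mutation list into a score, you still have to apply each mutation and check whether the existing suite kills it. A rough sketch, using plain string substitution and assuming originalLine appears verbatim in the source file:
var fs = require("fs");
var childProcess = require("child_process");

function scoreMutations(sourceFile, mutations, callback) {
  var original = fs.readFileSync(sourceFile, "utf8");
  var killed = 0;
  var index = 0;
  function next() {
    if (index >= mutations.length) {
      fs.writeFileSync(sourceFile, original); // restore the original source
      return callback(null, { killed: killed, total: mutations.length });
    }
    var m = mutations[index];
    index++;
    // Apply the mutation, run the suite, and count a failing run as a kill.
    fs.writeFileSync(sourceFile, original.replace(m.originalLine, m.mutatedLine));
    childProcess.exec("npx mocha --exit", function(runErr) {
      if (runErr) killed++;
      fs.writeFileSync(sourceFile, original);
      next();
    });
  }
  next();
}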
Visual Regression and Accessibility Testing
LLMs with vision capabilities can analyze screenshots for visual regressions and evaluate UI elements for accessibility compliance.
function analyzeScreenshotForRegression(baselineBase64, currentBase64, callback) {
var prompt = [
"Compare these two UI screenshots. The first is the baseline and the second is the current version.",
"Identify any visual differences including:",
"- Layout shifts or alignment changes",
"- Color or typography changes",
"- Missing or added elements",
"- Spacing or padding differences",
"Return a JSON object with: { hasRegression: boolean, differences: string[], severity: 'none'|'minor'|'major' }"
].join("\n");
client.chat.completions.create({
model: "gpt-4o",
messages: [{
role: "user",
content: [
{ type: "text", text: prompt },
{ type: "image_url", image_url: { url: "data:image/png;base64," + baselineBase64 } },
{ type: "image_url", image_url: { url: "data:image/png;base64," + currentBase64 } }
]
}],
temperature: 0.1
}).then(function(response) {
callback(null, response.choices[0].message.content);
}).catch(function(err) {
callback(err);
});
}
function analyzeAccessibility(htmlContent, callback) {
var prompt = [
"Analyze the following HTML for accessibility issues (WCAG 2.1 AA compliance).",
"Check for: missing alt text, poor color contrast indicators, missing ARIA labels,",
"incorrect heading hierarchy, missing form labels, keyboard navigation issues,",
"and missing skip navigation links.",
"Return a JSON array of issues with: { element, issue, severity, fix }",
"",
"HTML:",
htmlContent
].join("\n");
client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: prompt }],
temperature: 0.1
}).then(function(response) {
callback(null, response.choices[0].message.content);
}).catch(function(err) {
callback(err);
});
}
LLM-based accessibility testing does not replace tools like axe-core, but it catches semantic issues that rule-based tools miss -- like a button labeled "Click here" that provides no context, or an icon-only button whose aria-label does not match its visual meaning.
Quality Gates for Generated Tests
You should never blindly trust LLM-generated tests. Implement quality gates that every generated test must pass before it enters your codebase.
var fs = require("fs");
var childProcess = require("child_process");
function QualityGate(options) {
this.minAssertions = options.minAssertions || 2;
this.maxTestDuration = options.maxTestDuration || 10000;
this.requireErrorHandling = options.requireErrorHandling !== false;
}
QualityGate.prototype.validate = function(testCode, callback) {
var issues = [];
// Gate 1: Must parse as valid JavaScript
try {
new Function(testCode);
} catch (e) {
issues.push("Syntax error: " + e.message);
}
// Gate 2: Must contain assertions
var assertionPatterns = [/expect\(/, /assert\./, /should\./, /\.to\./, /\.equal\(/, /\.deep\./];
var assertionCount = 0;
assertionPatterns.forEach(function(pattern) {
var matches = testCode.match(new RegExp(pattern.source, "g"));
if (matches) assertionCount += matches.length;
});
if (assertionCount < this.minAssertions) {
issues.push("Insufficient assertions: found " + assertionCount + ", minimum " + this.minAssertions);
}
// Gate 3: Must have describe/it structure
if (!/describe\s*\(/.test(testCode)) {
issues.push("Missing describe block");
}
if (!/it\s*\(/.test(testCode)) {
issues.push("Missing it block");
}
// Gate 4: Must handle error cases
if (this.requireErrorHandling) {
var hasErrorTests = /error|throw|reject|fail|invalid|missing|null|undefined/i.test(testCode);
if (!hasErrorTests) {
issues.push("No error handling test cases detected");
}
}
// Gate 5: Must not contain hardcoded API keys or secrets
var secretPatterns = [/sk-[a-zA-Z0-9]{20,}/, /api[_-]?key\s*[:=]\s*["'][^"']{10,}["']/i];
secretPatterns.forEach(function(pattern) {
if (pattern.test(testCode)) {
issues.push("Possible hardcoded secret detected");
}
});
// Gate 6: Run the actual tests
if (issues.length === 0) {
var tmpFile = "/tmp/test_validate_" + Date.now() + ".js";
fs.writeFileSync(tmpFile, testCode);
childProcess.exec(
"npx mocha " + tmpFile + " --timeout " + this.maxTestDuration + " --exit",
function(err, stdout, stderr) {
fs.unlinkSync(tmpFile);
if (err) {
issues.push("Test execution failed: " + stderr.split("\n").slice(0, 5).join("\n"));
}
callback(null, { passed: issues.length === 0, issues: issues });
}
);
} else {
callback(null, { passed: false, issues: issues });
}
};
These gates catch the most common problems with LLM-generated tests: syntax errors from hallucinated APIs, empty test bodies with no real assertions, tests that pass trivially by not asserting anything meaningful, and tests that accidentally leak secrets from the training data.
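To make the gate concrete, here is a usage sketch that validates a freshly generated file before it is committed; the file path is illustrative.
var gate = new QualityGate({ minAssertions: 3 });
var generated = fs.readFileSync("generated-tests/articles.test.js", "utf8");

gate.validate(generated, function(err, result) {
  if (err) return console.error(err);
  if (!result.passed) {
    console.error("Rejected generated tests:\n- " + result.issues.join("\n- "));
    process.exit(1);
  }
  console.log("Generated tests passed all quality gates");
});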
Cost-Effective Test Generation
LLM API calls add up fast. A large codebase with hundreds of modules can cost serious money if you generate tests naively. Here are strategies to control costs.
var crypto = require("crypto");
function TestGenerationCache(cacheDir) {
this.cacheDir = cacheDir;
if (!fs.existsSync(cacheDir)) {
fs.mkdirSync(cacheDir, { recursive: true });
}
}
TestGenerationCache.prototype.getHash = function(sourceCode) {
return crypto.createHash("sha256").update(sourceCode).digest("hex");
};
TestGenerationCache.prototype.get = function(sourceCode) {
var hash = this.getHash(sourceCode);
var cachePath = path.join(this.cacheDir, hash + ".js");
if (fs.existsSync(cachePath)) {
return fs.readFileSync(cachePath, "utf8");
}
return null;
};
TestGenerationCache.prototype.set = function(sourceCode, testCode) {
var hash = this.getHash(sourceCode);
var cachePath = path.join(this.cacheDir, hash + ".js");
fs.writeFileSync(cachePath, testCode);
};
// Batch processing to reduce overhead
function batchGenerateTests(modules, batchSize, callback) {
var cache = new TestGenerationCache(".test-cache");
var results = [];
var batches = [];
// Separate cached and uncached modules
var uncached = [];
modules.forEach(function(mod) {
var cached = cache.get(mod.source);
if (cached) {
results.push({ file: mod.file, tests: cached, fromCache: true });
} else {
uncached.push(mod);
}
});
// Batch uncached modules into groups
for (var i = 0; i < uncached.length; i += batchSize) {
batches.push(uncached.slice(i, i + batchSize));
}
var batchIndex = 0;
function processNextBatch() {
if (batchIndex >= batches.length) {
return callback(null, results);
}
var batch = batches[batchIndex];
batchIndex++;
// Combine multiple small modules into a single prompt
var combinedPrompt = batch.map(function(mod, idx) {
return "=== Module " + (idx + 1) + ": " + mod.file + " ===\n" + mod.source;
}).join("\n\n");
var prompt = [
"Generate separate Mocha/Chai test suites for each of the following " + batch.length + " modules.",
"Separate each test suite with the comment: // === Tests for: {filename} ===",
"Use var, function(), require() syntax.",
"",
combinedPrompt
].join("\n");
client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: prompt }],
temperature: 0.2
}).then(function(response) {
var output = response.choices[0].message.content;
var sections = output.split(/\/\/ === Tests for: .+ ===/);
batch.forEach(function(mod, idx) {
var testCode = sections[idx + 1] || "";
cache.set(mod.source, testCode.trim());
results.push({ file: mod.file, tests: testCode.trim(), fromCache: false });
});
// Rate limiting: wait 1 second between batches
setTimeout(processNextBatch, 1000);
}).catch(function(err) {
callback(err);
});
}
processNextBatch();
}
The caching strategy here is content-addressed. If the source code has not changed, the cached test is returned immediately. This alone cuts costs by 80-90% in typical CI workflows where most files do not change between runs. Batching multiple small modules into a single prompt further reduces the per-module cost.
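To tie the cache and the batcher together, here is a usage sketch assuming your modules live in src/ and tests go to generated-tests/ (both illustrative paths).
var modules = fs.readdirSync("src")
  .filter(function(f) { return f.endsWith(".js"); })
  .map(function(f) {
    return { file: f, source: fs.readFileSync(path.join("src", f), "utf8") };
  });

batchGenerateTests(modules, 3, function(err, results) {
  if (err) return console.error(err);
  fs.mkdirSync("generated-tests", { recursive: true });
  results.forEach(function(r) {
    fs.writeFileSync(path.join("generated-tests", r.file.replace(".js", ".test.js")), r.tests);
  });
  var fromCache = results.filter(function(r) { return r.fromCache; }).length;
  console.log(fromCache + " of " + results.length + " test suites came from cache");
});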
Integrating LLM Test Generation into CI/CD
The final step is wiring your test generation pipeline into your CI/CD system. Here is a practical GitHub Actions workflow.
name: LLM Test Generation

on:
  pull_request:
    paths:
      - 'routes/**'
      - 'models/**'
      - 'utils/**'

jobs:
  generate-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - name: Get changed files
        id: changes
        run: |
          echo "files=$(git diff --name-only origin/main...HEAD -- 'routes/*.js' 'models/*.js' 'utils/*.js' | tr '\n' ',')" >> $GITHUB_OUTPUT
      - name: Generate tests for changed files
        if: steps.changes.outputs.files != ''
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: node scripts/generate-tests.js --files "${{ steps.changes.outputs.files }}"
      - name: Run generated tests
        run: npx mocha generated-tests/**/*.test.js --timeout 10000 --exit
      - name: Upload test report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: generated-tests
          path: generated-tests/
The key design decision is to only generate tests for changed files. This keeps both cost and execution time manageable. The generated tests run as part of the PR check, giving reviewers additional confidence in the changes.
Complete Working Example
Here is a complete, self-contained test generation tool that reads Express.js route handlers, generates Mocha/Chai test suites, validates them, and outputs the test files.
#!/usr/bin/env node
var fs = require("fs");
var path = require("path");
var childProcess = require("child_process");
var OpenAI = require("openai");
var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
// Configuration
var CONFIG = {
model: "gpt-4o",
temperature: 0.2,
maxRetries: 2,
testTimeout: 10000,
outputDir: path.join(process.cwd(), "generated-tests")
};
// Ensure output directory exists
if (!fs.existsSync(CONFIG.outputDir)) {
fs.mkdirSync(CONFIG.outputDir, { recursive: true });
}
function readRouteFile(filePath) {
var content = fs.readFileSync(filePath, "utf8");
return {
path: filePath,
name: path.basename(filePath, ".js"),
content: content
};
}
function extractRouteInfo(sourceCode) {
var routes = [];
var routePattern = /router\.(get|post|put|patch|delete)\s*\(\s*["']([^"']+)["']/g;
var match;
while ((match = routePattern.exec(sourceCode)) !== null) {
routes.push({ method: match[1].toUpperCase(), path: match[2] });
}
return routes;
}
function generateTestSuite(routeFile, callback) {
var routes = extractRouteInfo(routeFile.content);
var routeSummary = routes.map(function(r) {
return r.method + " " + r.path;
}).join(", ");
var prompt = [
"Generate a comprehensive Mocha/Chai test suite for the following Express.js route handler module.",
"",
"REQUIREMENTS:",
"- Use var, function(), require() syntax ONLY. No const, let, arrow functions, or ES modules.",
"- Use supertest for HTTP assertions",
"- Use sinon for mocking external dependencies (database calls, APIs, etc.)",
"- Use chai expect style assertions",
"- Mock the Express app properly: var app = require('express')(); app.use('/', router);",
"- Include these test categories:",
" 1. Happy path: valid requests return expected status codes and response bodies",
" 2. Validation: missing/invalid parameters return 400 with error messages",
" 3. Error handling: internal errors return 500 with generic error message",
" 4. Edge cases: empty strings, very long strings, special characters, SQL injection attempts",
" 5. Authentication/authorization if applicable",
"- Each route should have at least 4 test cases",
"- Include proper before/after hooks for setup and teardown",
"- Add descriptive test names that explain the expected behavior",
"",
"Routes detected: " + routeSummary,
"",
"IMPORTANT: The require path for the route module should be: " + routeFile.path,
"",
"Source code:",
"```javascript",
routeFile.content,
"```",
"",
"Return ONLY valid JavaScript test code. No markdown fences. No explanations."
].join("\n");
console.log("Generating tests for: " + routeFile.name + " (" + routes.length + " routes)");
client.chat.completions.create({
model: CONFIG.model,
messages: [{ role: "user", content: prompt }],
temperature: CONFIG.temperature
}).then(function(response) {
var testCode = response.choices[0].message.content;
testCode = testCode.replace(/```javascript\n?/g, "").replace(/```\n?/g, "").trim();
callback(null, testCode);
}).catch(function(err) {
callback(err);
});
}
function validateTestSyntax(testCode) {
try {
new Function(testCode);
return { valid: true };
} catch (e) {
return { valid: false, error: e.message };
}
}
function runTests(testFilePath, callback) {
var cmd = "npx mocha \"" + testFilePath + "\" --timeout " + CONFIG.testTimeout + " --exit --reporter json";
childProcess.exec(cmd, { timeout: CONFIG.testTimeout + 5000 }, function(err, stdout, stderr) {
var result = { passed: false, output: stdout, error: stderr };
try {
var report = JSON.parse(stdout);
result.stats = report.stats;
result.passed = report.stats.failures === 0;
result.totalTests = report.stats.tests;
result.passingTests = report.stats.passes;
result.failingTests = report.stats.failures;
} catch (e) {
result.parseError = e.message;
}
callback(null, result);
});
}
function attemptFix(testCode, errorMessage, attempt, callback) {
if (attempt >= CONFIG.maxRetries) {
return callback(null, null);
}
var prompt = [
"The following Mocha/Chai test file has errors. Fix the errors and return the corrected code.",
"Use var, function(), require() syntax ONLY. No const, let, arrow functions.",
"Return ONLY the corrected JavaScript code. No markdown fences. No explanations.",
"",
"Error:",
errorMessage,
"",
"Test code:",
testCode
].join("\n");
client.chat.completions.create({
model: CONFIG.model,
messages: [{ role: "user", content: prompt }],
temperature: 0.1
}).then(function(response) {
var fixed = response.choices[0].message.content;
fixed = fixed.replace(/```javascript\n?/g, "").replace(/```\n?/g, "").trim();
callback(null, fixed);
}).catch(function(err) {
callback(err);
});
}
function processRouteFile(filePath, callback) {
var routeFile = readRouteFile(filePath);
var outputPath = path.join(CONFIG.outputDir, routeFile.name + ".test.js");
generateTestSuite(routeFile, function(err, testCode) {
if (err) {
console.error(" Generation failed: " + err.message);
return callback(err);
}
// Validate syntax
var syntaxResult = validateTestSyntax(testCode);
if (!syntaxResult.valid) {
console.log(" Syntax error detected, attempting fix...");
attemptFix(testCode, syntaxResult.error, 0, function(fixErr, fixedCode) {
if (fixErr || !fixedCode) {
console.error(" Could not fix syntax errors after retries");
return callback(new Error("Syntax validation failed"));
}
testCode = fixedCode;
writeAndRun(testCode, outputPath, routeFile.name, callback);
});
} else {
writeAndRun(testCode, outputPath, routeFile.name, callback);
}
});
}
function writeAndRun(testCode, outputPath, name, callback) {
fs.writeFileSync(outputPath, testCode);
console.log(" Written to: " + outputPath);
runTests(outputPath, function(runErr, result) {
if (runErr) {
console.error(" Test execution error: " + runErr.message);
return callback(runErr);
}
if (result.passed) {
console.log(" All " + result.totalTests + " tests passed");
callback(null, { file: outputPath, name: name, result: result });
} else {
console.log(" " + result.failingTests + "/" + result.totalTests + " tests failed, attempting fix...");
var errorInfo = result.error || result.output;
var currentCode = fs.readFileSync(outputPath, "utf8");
attemptFix(currentCode, errorInfo, 0, function(fixErr, fixedCode) {
if (fixErr || !fixedCode) {
console.log(" Could not fix test failures. Keeping file for manual review.");
callback(null, { file: outputPath, name: name, result: result, needsReview: true });
return;
}
fs.writeFileSync(outputPath, fixedCode);
runTests(outputPath, function(rerunErr, rerunResult) {
if (rerunResult && rerunResult.passed) {
console.log(" Fixed! All " + rerunResult.totalTests + " tests now pass");
} else {
console.log(" Still has failures after fix. File saved for manual review.");
}
callback(null, {
file: outputPath,
name: name,
result: rerunResult || result,
needsReview: !(rerunResult && rerunResult.passed)
});
});
});
}
});
}
// Main execution
function main() {
var args = process.argv.slice(2);
if (args.length === 0) {
console.log("Usage: node generate-tests.js <route-file-or-directory>");
console.log("Example: node generate-tests.js routes/articles.js");
console.log("Example: node generate-tests.js routes/");
process.exit(1);
}
var target = args[0];
var files = [];
if (fs.statSync(target).isDirectory()) {
files = fs.readdirSync(target)
.filter(function(f) { return f.endsWith(".js"); })
.map(function(f) { return path.join(target, f); });
} else {
files = [target];
}
console.log("Test Generation Pipeline");
console.log("========================");
console.log("Files to process: " + files.length);
console.log("Output directory: " + CONFIG.outputDir);
console.log("");
var results = [];
var index = 0;
function processNext() {
if (index >= files.length) {
printSummary(results);
return;
}
var file = files[index];
index++;
console.log("[" + index + "/" + files.length + "] Processing: " + file);
processRouteFile(file, function(err, result) {
if (err) {
results.push({ file: file, error: err.message });
} else {
results.push(result);
}
processNext();
});
}
processNext();
}
function printSummary(results) {
console.log("\n========================");
console.log("Generation Summary");
console.log("========================");
var passed = results.filter(function(r) { return r.result && r.result.passed; });
var needsReview = results.filter(function(r) { return r.needsReview; });
var failed = results.filter(function(r) { return r.error; });
console.log("Total files processed: " + results.length);
console.log("Fully passing: " + passed.length);
console.log("Needs review: " + needsReview.length);
console.log("Failed to generate: " + failed.length);
if (needsReview.length > 0) {
console.log("\nFiles needing manual review:");
needsReview.forEach(function(r) {
console.log(" - " + r.file);
});
}
if (failed.length > 0) {
console.log("\nFailed files:");
failed.forEach(function(r) {
console.log(" - " + r.file + ": " + r.error);
});
}
}
main();
Save this as scripts/generate-tests.js and run it:
export OPENAI_API_KEY=your-key-here
node scripts/generate-tests.js routes/articles.js
The tool will read the route file, extract route definitions, generate a test suite, validate the syntax, run the tests, attempt to fix any failures, and output the final test file to generated-tests/.
Common Issues and Troubleshooting
1. Generated tests import non-existent modules
Error: Cannot find module '../utils/helpers'
at Function.Module._resolveFilename (node:internal/modules/cjs/loader:1075:15)
LLMs hallucinate require paths. The model sees patterns in your code that suggest a helper module exists even when it does not. Fix this by including a list of actual project files in your prompt, or by post-processing the generated code to validate all require paths against the filesystem.
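A post-processing sketch for the second approach: pull every relative require() out of the generated code and flag paths that do not resolve from the test file's directory.
var fs = require("fs");
var path = require("path");

function findBrokenRequires(testCode, testDir) {
  var broken = [];
  var requirePattern = /require\(\s*["'](\.[^"']+)["']\s*\)/g;
  var match;
  while ((match = requirePattern.exec(testCode)) !== null) {
    var target = path.resolve(testDir, match[1]);
    // Accept the bare path, a .js file, or a directory with an index.js.
    var exists = fs.existsSync(target) ||
      fs.existsSync(target + ".js") ||
      fs.existsSync(path.join(target, "index.js"));
    if (!exists) broken.push(match[1]);
  }
  return broken;
}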
2. Tests pass trivially with no real assertions
12 passing (45ms)
0 failing
This looks great until you realize every test body is expect(true).to.equal(true). The LLM generated placeholder assertions instead of testing actual behavior. This is why the quality gate that counts meaningful assertions is critical. If you see a test suite with a suspiciously high pass rate and very fast execution, inspect the assertions manually.
3. Rate limiting from the LLM provider
Error: 429 Too Many Requests
Rate limit exceeded. Please retry after 20s.
When processing many files, you will hit rate limits. Implement exponential backoff with jitter:
function retryWithBackoff(fn, maxRetries, callback) {
var attempt = 0;
function tryOnce() {
fn(function(err, result) {
if (err && err.status === 429 && attempt < maxRetries) {
attempt++;
var delay = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
console.log("Rate limited. Retrying in " + Math.round(delay / 1000) + "s...");
setTimeout(tryOnce, delay);
} else {
callback(err, result);
}
});
}
tryOnce();
}
4. Generated tests have stale mocks that do not match current function signatures
TypeError: sinon.stub(db, 'findUser').returns(...)
findUser is not a function
This happens when the LLM hallucinates method names based on the function being tested rather than the actual dependency interface. The fix is to include the dependency source code in the prompt alongside the code under test, so the model knows exactly what methods are available to mock.
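A sketch of that prompt enrichment, where dependencyPaths is whatever the module under test actually requires (the helper name and paths are illustrative):
var fs = require("fs");

function buildPromptWithDependencies(sourceCode, dependencyPaths) {
  // Concatenate each dependency's source so the model sees the real interface.
  var dependencySection = dependencyPaths.map(function(p) {
    return "=== Dependency: " + p + " ===\n" + fs.readFileSync(p, "utf8");
  }).join("\n\n");
  return [
    "Generate a Mocha/Chai test suite for the module below.",
    "Only stub methods that actually exist on the dependencies shown.",
    "",
    "Dependencies available for mocking:",
    dependencySection,
    "",
    "Module under test:",
    sourceCode
  ].join("\n");
}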
5. Async test timeouts from missing done() callbacks
Error: Timeout of 2000ms exceeded. For async tests and hooks, ensure "done()"
is called; if returning a Promise, ensure it resolves.
LLMs frequently generate tests with mixed async patterns -- starting with a callback-style done parameter but then returning a Promise or forgetting to call done(). Include explicit instructions in your prompt about which async pattern to use, and consider adding a post-processing step that ensures consistent async handling across all generated tests.
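A crude heuristic for that post-processing step: flag it() callbacks that declare a done parameter but never call done(). It will not catch every mixed pattern, but it surfaces the most common timeout cause cheaply.
function findSuspectAsyncTests(testCode) {
  var suspects = [];
  var itPattern = /it\(\s*["']([^"']+)["']\s*,\s*function\s*\(\s*done\s*\)\s*\{/g;
  var match;
  while ((match = itPattern.exec(testCode)) !== null) {
    // Inspect a window of the test body for a done() call; a rough bound is fine here.
    var body = testCode.slice(match.index, match.index + 2000);
    if (body.indexOf("done(") === -1) {
      suspects.push(match[1]);
    }
  }
  return suspects;
}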
Best Practices
Always review generated tests before committing them. LLM-generated tests can have subtle issues like testing implementation details instead of behavior, or making assertions that are technically correct but meaningless. A human review pass is non-negotiable.
Use content-addressed caching to avoid regenerating tests for unchanged code. Hash the source file content and cache the generated tests. This cuts API costs by 80-90% in CI environments and makes the pipeline fast enough for PR-level feedback.
Include dependency source code in your prompts. The most common failure mode for generated tests is incorrect mocking. Giving the LLM the actual interface of the dependencies being mocked dramatically improves accuracy.
Set temperature low for test generation (0.1-0.2) and higher for edge case discovery (0.4-0.6). Deterministic output is what you want for test structure and assertions. Save the creativity for brainstorming what to test, not how to test it.
Implement quality gates that run generated tests before accepting them. At minimum: syntax validation, assertion counting, test execution, and secret detection. Tests that do not pass all gates should be flagged for manual review, not silently discarded.
Generate tests incrementally on changed files, not the entire codebase. Full-codebase generation is expensive and slow. Use git diff to identify changed files and only generate tests for those. This keeps CI feedback loops under 5 minutes.
Version your prompts alongside your code. The prompt template is part of your testing infrastructure. When you improve it, you want that improvement tracked and reviewable just like any other code change.
Use structured output (JSON mode) when you need parseable results. For mutation lists, fixture generation, and failure analysis, request JSON output. This eliminates brittle string parsing of the LLM's natural language responses.
Monitor your test generation costs and set budget alerts. Track token usage per run and set hard limits. A runaway CI pipeline generating tests on every commit can burn through API credits quickly.
References
- OpenAI API Documentation - API reference for the chat completions endpoint used throughout this article
- Mocha Testing Framework - Documentation for the test runner used in all examples
- Chai Assertion Library - Reference for the expect-style assertions
- Sinon.JS - Mocking and stubbing library used for dependency isolation
- Supertest - HTTP assertion library for testing Express.js routes
- Mutation Testing Theory - Background on mutation testing principles
- WCAG 2.1 Guidelines - Web Content Accessibility Guidelines referenced in accessibility testing section