Building a Code Generation Pipeline with LLMs

Overview

Code generation with LLMs is not a single API call — it is a multi-stage pipeline that transforms a specification into production-ready code through parsing, generation, validation, and iterative refinement. This article walks through designing and building a complete code generation pipeline in Node.js that takes natural language specifications, generates Express.js API code with tests, validates everything in a sandboxed environment, and iterates on failures until the output meets quality gates. If you have ever pasted code from ChatGPT into your project and spent an hour fixing it, this pipeline is the engineering solution to that problem.

Prerequisites

  • Node.js v18 or later installed
  • An Anthropic API key (Claude) or OpenAI API key
  • Familiarity with Express.js and REST API patterns
  • Basic understanding of AST parsing and code linting (ESLint)
  • A working knowledge of child processes in Node.js

What a Code Generation Pipeline Does

A code generation pipeline is a structured system that converts a specification into validated, tested code. Think of it as a compiler where the input is English instead of a programming language. The pipeline follows a deterministic flow:

Specification → Parse → Generate → Post-Process → Test → Validate → Output

Each stage has a clear input, output, and failure mode. If testing fails, the pipeline loops back to generation with error context. If validation fails, it loops back to post-processing. This iterative approach is what separates a production pipeline from a naive "ask the LLM and hope for the best" workflow.

The key insight is that LLMs are probabilistic. They will produce code that looks right but has subtle bugs — wrong import paths, missing error handling, incompatible library versions. The pipeline exists to catch those errors systematically and correct them before a human ever sees the output.

Designing the Pipeline Stages

A well-designed pipeline separates concerns into discrete stages, each with its own responsibility and failure handling. Here is the architecture I use in production:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Spec      │────>│   Context   │────>│   Parse     │
│   Input     │     │   Gather    │     │   Spec      │
└─────────────┘     └─────────────┘     └─────────────┘
                                              │
                                              v
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Quality   │<────│   Post-     │<────│   Generate  │
│   Gates     │     │   Process   │     │   Code      │
└─────────────┘     └─────────────┘     └─────────────┘
      │                                       ^
      v                                       │
┌─────────────┐     ┌─────────────┐           │
│   Sandbox   │────>│   Iterate   │───────────┘
│   Test      │     │   on Errors │
└─────────────┘     └─────────────┘
      │
      v
┌─────────────┐
│   Output    │
│   Files     │
└─────────────┘

Each stage is implemented as an independent function that takes structured input and returns structured output. This makes it trivial to swap out implementations — you can replace the LLM provider, change the testing framework, or add new quality gates without touching the rest of the pipeline.
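
As a deliberately simplified sketch of that contract, each stage can be modeled as a function from a state object to a promise of an updated state object, and the pipeline becomes a reduce over the stage list. The stage names in the usage comment are placeholders for the functions built later in this article.

// Minimal sketch of the stage contract: (state) -> Promise<state>.
// Because every stage has the same shape, stages can be reordered
// or swapped without touching the rest of the pipeline.
function runStages(stages, initialState) {
  return stages.reduce(function(promise, stage) {
    return promise.then(function(state) {
      return Promise.resolve(stage(state));
    });
  }, Promise.resolve(initialState));
}

// Hypothetical usage once the stage functions below exist:
// runStages([gatherStage, parseStage, generateStage, testStage], { spec: rawSpec });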

Gathering Context

Before generating any code, the pipeline needs to understand the project it is generating code for. Dropping an LLM into a codebase without context produces generic code that does not match existing conventions.

var fs = require("fs");
var path = require("path");
var glob = require("glob");

function gatherProjectContext(projectRoot) {
  var context = {
    packageJson: null,
    existingRoutes: [],
    models: [],
    middleware: [],
    conventions: {}
  };

  // Read package.json for dependencies and scripts
  var pkgPath = path.join(projectRoot, "package.json");
  if (fs.existsSync(pkgPath)) {
    context.packageJson = JSON.parse(fs.readFileSync(pkgPath, "utf8"));
  }

  // Find existing route files to understand patterns
  var routeFiles = glob.sync(path.join(projectRoot, "routes", "**", "*.js"));
  routeFiles.forEach(function(filePath) {
    var content = fs.readFileSync(filePath, "utf8");
    context.existingRoutes.push({
      path: path.relative(projectRoot, filePath),
      content: content.substring(0, 2000) // First 2000 chars for pattern matching
    });
  });

  // Find model files
  var modelFiles = glob.sync(path.join(projectRoot, "models", "**", "*.js"));
  modelFiles.forEach(function(filePath) {
    var content = fs.readFileSync(filePath, "utf8");
    context.models.push({
      path: path.relative(projectRoot, filePath),
      content: content.substring(0, 2000)
    });
  });

  // Detect conventions from existing code
  context.conventions = detectConventions(context.existingRoutes);

  return context;
}

function detectConventions(routeFiles) {
  var conventions = {
    usesAsync: false,
    errorHandling: "callback",
    exportStyle: "module.exports",
    indentation: "spaces",
    semicolons: true
  };

  if (routeFiles.length === 0) return conventions;

  var sampleContent = routeFiles[0].content;

  if (sampleContent.indexOf("async") !== -1) {
    conventions.usesAsync = true;
  }
  if (sampleContent.indexOf("try {") !== -1) {
    conventions.errorHandling = "try-catch";
  }
  if (sampleContent.indexOf("exports.") !== -1) {
    conventions.exportStyle = "exports";
  }
  if (sampleContent.indexOf("\t") !== -1) {
    conventions.indentation = "tabs";
  }

  return conventions;
}

This context is injected into every LLM prompt so the generated code matches the project's existing patterns. Without this step, you end up with code that works in isolation but looks alien in your codebase.

Specification Parsing

Natural language specifications are ambiguous. The pipeline needs to convert them into a structured format that the code generation stage can consume reliably.

var Anthropic = require("@anthropic-ai/sdk");

var anthropic = new Anthropic();

function parseSpecification(rawSpec) {
  return anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2000,
    messages: [
      {
        role: "user",
        content: "Parse the following feature specification into a structured JSON format.\n\n" +
          "Specification:\n" + rawSpec + "\n\n" +
          "Return a JSON object with these fields:\n" +
          "- name: string (camelCase identifier)\n" +
          "- description: string (one sentence summary)\n" +
          "- endpoints: array of { method, path, description, requestBody, responseBody, statusCodes }\n" +
          "- models: array of { name, fields: [{ name, type, required, validation }] }\n" +
          "- middleware: array of { name, description, appliesTo }\n" +
          "- dependencies: array of npm package names needed\n\n" +
          "Return ONLY valid JSON, no markdown fences."
      }
    ]
  }).then(function(response) {
    var text = response.content[0].text.trim();
    // Strip markdown fences if the model included them anyway
    text = text.replace(/^```json\n?/, "").replace(/\n?```$/, "");
    return JSON.parse(text);
  });
}

The parsed specification becomes the contract for every downstream stage. If the spec says there are three endpoints, the generator must produce three endpoints, and the test suite must test all three. This structured representation eliminates the ambiguity that causes LLM outputs to drift from the original intent.
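
To make the contract concrete, here is a hypothetical example of what parseSpecification might return for a small spec. The values are illustrative, but the shape matches the fields requested in the prompt above.

{
  "name": "healthCheck",
  "description": "Exposes a health-check endpoint that reports service status.",
  "endpoints": [
    {
      "method": "get",
      "path": "/api/health",
      "description": "Return current service status and uptime",
      "requestBody": null,
      "responseBody": { "status": "string", "uptime": "number" },
      "statusCodes": [200, 500]
    }
  ],
  "models": [],
  "middleware": [],
  "dependencies": []
}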

Implementing Code Generation with Claude

The generation stage is where the LLM does the heavy lifting. The quality of the output depends almost entirely on the prompt. After building dozens of these pipelines, I have landed on a prompt structure that consistently produces usable code.

function generateCode(parsedSpec, projectContext) {
  var systemPrompt = buildSystemPrompt(projectContext);
  var userPrompt = buildGenerationPrompt(parsedSpec);

  return anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 8000,
    system: systemPrompt,
    messages: [
      { role: "user", content: userPrompt }
    ]
  }).then(function(response) {
    var text = response.content[0].text;
    return extractFiles(text);
  });
}

function buildSystemPrompt(projectContext) {
  var prompt = "You are a senior Node.js engineer generating production-ready Express.js code.\n\n";
  prompt += "Project conventions:\n";
  prompt += "- Error handling: " + projectContext.conventions.errorHandling + "\n";
  prompt += "- Export style: " + projectContext.conventions.exportStyle + "\n";
  prompt += "- Indentation: " + projectContext.conventions.indentation + "\n";
  prompt += "- Uses async/await: " + projectContext.conventions.usesAsync + "\n\n";

  prompt += "Existing dependencies available:\n";
  if (projectContext.packageJson && projectContext.packageJson.dependencies) {
    Object.keys(projectContext.packageJson.dependencies).forEach(function(dep) {
      prompt += "- " + dep + "@" + projectContext.packageJson.dependencies[dep] + "\n";
    });
  }

  prompt += "\nExample route pattern from this project:\n";
  if (projectContext.existingRoutes.length > 0) {
    prompt += "```javascript\n" + projectContext.existingRoutes[0].content.substring(0, 1000) + "\n```\n";
  }

  prompt += "\nRules:\n";
  prompt += "- Use var for all variable declarations\n";
  prompt += "- Use function() for all function expressions\n";
  prompt += "- Use require() for all imports\n";
  prompt += "- Include complete error handling\n";
  prompt += "- Include input validation\n";
  prompt += "- Return files in the format: === FILE: path/to/file.js ===\n";

  return prompt;
}

function buildGenerationPrompt(parsedSpec) {
  var prompt = "Generate the following Express.js API feature:\n\n";
  prompt += "Name: " + parsedSpec.name + "\n";
  prompt += "Description: " + parsedSpec.description + "\n\n";

  prompt += "Endpoints:\n";
  parsedSpec.endpoints.forEach(function(endpoint, i) {
    prompt += (i + 1) + ". " + endpoint.method.toUpperCase() + " " + endpoint.path + "\n";
    prompt += "   Description: " + endpoint.description + "\n";
    if (endpoint.requestBody) {
      prompt += "   Request body: " + JSON.stringify(endpoint.requestBody) + "\n";
    }
    prompt += "   Response: " + JSON.stringify(endpoint.responseBody) + "\n";
    prompt += "   Status codes: " + endpoint.statusCodes.join(", ") + "\n\n";
  });

  prompt += "Generate these files:\n";
  prompt += "1. Route handler file\n";
  prompt += "2. Model/data access file\n";
  prompt += "3. Input validation middleware\n";
  prompt += "4. Test file (using mocha + chai + supertest)\n\n";
  prompt += "Use the === FILE: path/to/file.js === delimiter between files.";

  return prompt;
}

function extractFiles(responseText) {
  var files = [];
  var parts = responseText.split(/===\s*FILE:\s*/);

  parts.forEach(function(part) {
    if (!part.trim()) return;

    var lines = part.split("\n");
    var filePath = lines[0].replace(/\s*===\s*$/, "").trim();

    var content = lines.slice(1).join("\n");
    // Strip markdown code fences
    content = content.replace(/^```[a-z]*\n?/, "").replace(/\n?```\s*$/, "").trim();

    if (filePath && content) {
      files.push({ path: filePath, content: content });
    }
  });

  return files;
}

The file delimiter pattern (=== FILE: path ===) is critical. It gives the LLM a structured way to output multiple files in a single response, and it gives us a reliable way to parse them. I have tried JSON-based output for multi-file generation and it breaks constantly — models escape strings incorrectly, forget closing braces, or truncate long files. Plain text with delimiters is far more robust.
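
For reference, a well-formed response in this format looks like the following (contents abbreviated); extractFiles splits it into two file objects.

=== FILE: routes/users.js ===
var express = require("express");
var router = express.Router();
// ...route handlers...
module.exports = router;

=== FILE: test/users.test.js ===
var request = require("supertest");
var expect = require("chai").expect;
// ...tests...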

Post-Processing Generated Code

Raw LLM output needs cleaning before it can be tested. Formatting inconsistencies, missing semicolons, and wrong import paths are common issues that post-processing catches automatically.

var prettier = require("prettier"); // Prettier v3+, where format() returns a Promise

function postProcessFiles(files, projectRoot) {
  return Promise.all(files.map(function(file) {
    return postProcessFile(file, projectRoot);
  }));
}

function postProcessFile(file, projectRoot) {
  var content = file.content;

  // Fix common import path issues
  content = fixImportPaths(content, projectRoot);

  // Fix missing semicolons and formatting with Prettier
  return prettier.format(content, {
    parser: "babel",
    semi: true,
    singleQuote: false,
    tabWidth: 2,
    trailingComma: "none"
  }).then(function(formatted) {
    return {
      path: file.path,
      content: formatted,
      originalContent: file.content
    };
  }).catch(function(err) {
    // If Prettier fails, the code has syntax errors
    // Return as-is and let the testing stage catch it
    console.warn("Prettier failed for " + file.path + ": " + err.message);
    return {
      path: file.path,
      content: content,
      originalContent: file.content,
      formatError: err.message
    };
  });
}

function fixImportPaths(content, projectRoot) {
  // Fix relative paths that reference the project root
  content = content.replace(
    /require\(["']\.\.\/\.\.\/models\//g,
    'require("' + path.join(projectRoot, "models") + "/"
  );

  // Remove duplicate requires
  var seen = {};
  var lines = content.split("\n");
  var cleaned = lines.filter(function(line) {
    var match = line.match(/var\s+(\w+)\s*=\s*require\(/);
    if (match) {
      if (seen[match[1]]) return false;
      seen[match[1]] = true;
    }
    return true;
  });

  return cleaned.join("\n");
}

Post-processing is cheap insurance. It catches the 80% of formatting issues that would otherwise waste a refinement cycle. Every iteration through the LLM costs money and time, so fixing what you can deterministically is always worth it.

Automated Testing of Generated Code

This is where the pipeline proves its value. Generated code runs in a sandboxed environment with its own node_modules, temporary database connections, and isolated file system.

var childProcess = require("child_process");
var os = require("os");

function testInSandbox(files, parsedSpec) {
  var sandboxDir = path.join(os.tmpdir(), "codegen-sandbox-" + Date.now());

  // Create sandbox directory structure
  fs.mkdirSync(sandboxDir, { recursive: true });
  fs.mkdirSync(path.join(sandboxDir, "routes"), { recursive: true });
  fs.mkdirSync(path.join(sandboxDir, "models"), { recursive: true });
  fs.mkdirSync(path.join(sandboxDir, "middleware"), { recursive: true });
  fs.mkdirSync(path.join(sandboxDir, "test"), { recursive: true });

  // Write generated files to sandbox
  files.forEach(function(file) {
    var fullPath = path.join(sandboxDir, file.path);
    var dir = path.dirname(fullPath);
    fs.mkdirSync(dir, { recursive: true });
    fs.writeFileSync(fullPath, file.content);
  });

  // Create minimal package.json
  var packageJson = {
    name: "codegen-sandbox",
    version: "1.0.0",
    scripts: {
      test: "mocha test/**/*.test.js --timeout 10000 --exit"
    },
    dependencies: {
      express: "^4.18.0",
      "body-parser": "^1.20.0"
    },
    devDependencies: {
      mocha: "^10.0.0",
      chai: "^4.3.0",
      supertest: "^6.3.0"
    }
  };

  // Add spec-required dependencies
  if (parsedSpec.dependencies) {
    parsedSpec.dependencies.forEach(function(dep) {
      packageJson.dependencies[dep] = "latest";
    });
  }

  fs.writeFileSync(
    path.join(sandboxDir, "package.json"),
    JSON.stringify(packageJson, null, 2)
  );

  // Install dependencies
  return runCommand("npm install", sandboxDir, 60000)
    .then(function() {
      // Run tests
      return runCommand("npm test", sandboxDir, 30000);
    })
    .then(function(result) {
      return {
        success: true,
        output: result.stdout,
        sandboxDir: sandboxDir
      };
    })
    .catch(function(err) {
      return {
        success: false,
        error: err.message,
        stdout: err.stdout || "",
        stderr: err.stderr || "",
        sandboxDir: sandboxDir
      };
    });
}

function runCommand(command, cwd, timeout) {
  return new Promise(function(resolve, reject) {
    childProcess.exec(command, {
      cwd: cwd,
      timeout: timeout || 30000,
      maxBuffer: 1024 * 1024
    }, function(err, stdout, stderr) {
      if (err) {
        err.stdout = stdout;
        err.stderr = stderr;
        reject(err);
      } else {
        resolve({ stdout: stdout, stderr: stderr });
      }
    });
  });
}

The sandbox is disposable. Every generation attempt gets a fresh directory, a fresh npm install, and a clean test run. This eliminates stale state as a source of false results. Yes, the npm install adds 10-15 seconds per iteration, but it prevents an entire class of "works on my machine" bugs.
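
One thing the code above does not do is delete the sandbox afterwards, so temp directories accumulate across runs. A small cleanup helper you can call once the result has been inspected, assuming Node 14.14+ for fs.rmSync:

var fs = require("fs");
var os = require("os");

function cleanupSandbox(sandboxDir) {
  // Refuse to delete anything outside the OS temp directory
  if (sandboxDir.indexOf(os.tmpdir()) !== 0) return;
  fs.rmSync(sandboxDir, { recursive: true, force: true });
}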

Iterative Refinement

When tests fail, the pipeline feeds the error output back to the LLM with instructions to fix the specific issue. This is the loop that turns mediocre LLM output into working code.

function iterativeGenerate(parsedSpec, projectContext, maxAttempts) {
  maxAttempts = maxAttempts || 3;
  var attempt = 0;
  var files = null;
  var history = [];

  function iterate() {
    attempt++;
    console.log("Generation attempt " + attempt + " of " + maxAttempts);

    var generateFn;
    if (attempt === 1) {
      generateFn = generateCode(parsedSpec, projectContext);
    } else {
      generateFn = regenerateWithErrors(parsedSpec, projectContext, history);
    }

    return generateFn
      .then(function(generatedFiles) {
        files = generatedFiles;
        return postProcessFiles(files, projectContext.projectRoot);
      })
      .then(function(processedFiles) {
        files = processedFiles;
        return testInSandbox(files, parsedSpec);
      })
      .then(function(testResult) {
        if (testResult.success) {
          console.log("Tests passed on attempt " + attempt);
          return { success: true, files: files, attempts: attempt, sandboxDir: testResult.sandboxDir };
        }

        history.push({
          attempt: attempt,
          files: files.map(function(f) { return { path: f.path, content: f.content }; }),
          error: testResult.error,
          stdout: testResult.stdout,
          stderr: testResult.stderr
        });

        if (attempt >= maxAttempts) {
          console.log("Max attempts reached. Returning best effort.");
          return {
            success: false,
            files: files,
            attempts: attempt,
            sandboxDir: testResult.sandboxDir,
            lastError: testResult.error
          };
        }

        return iterate();
      });
  }

  return iterate();
}

function regenerateWithErrors(parsedSpec, projectContext, history) {
  var lastAttempt = history[history.length - 1];

  var prompt = "The previous code generation had errors. Fix them.\n\n";
  prompt += "Original specification:\n" + JSON.stringify(parsedSpec, null, 2) + "\n\n";
  prompt += "Previous code that failed:\n";

  lastAttempt.files.forEach(function(file) {
    prompt += "=== FILE: " + file.path + " ===\n" + file.content + "\n\n";
  });

  prompt += "Test errors:\n";
  prompt += lastAttempt.stderr || lastAttempt.error || "Unknown error";
  prompt += "\n\nFix ALL errors and return the complete corrected files using the same === FILE: path === format.";
  prompt += "\nDo not explain the changes. Return only the corrected code files.";

  return anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 8000,
    system: buildSystemPrompt(projectContext),
    messages: [
      { role: "user", content: prompt }
    ]
  }).then(function(response) {
    return extractFiles(response.content[0].text);
  });
}

Three iterations is the sweet spot. In my experience, if the LLM cannot fix the code in three attempts with full error context, the specification is ambiguous or the task is too complex for a single generation pass. At that point, you are better off breaking the spec into smaller pieces.
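
When you do hit that wall, one pragmatic option is to split the parsed spec into per-endpoint sub-specs and run the pipeline once per piece. A rough sketch, assuming the parsed-spec shape described earlier; the naming convention for sub-specs is arbitrary:

function splitSpecByEndpoint(parsedSpec) {
  return parsedSpec.endpoints.map(function(endpoint, index) {
    return {
      name: parsedSpec.name + "Part" + (index + 1),
      description: endpoint.description,
      endpoints: [endpoint],
      // Shared models, middleware, and dependencies travel with every sub-spec
      models: parsedSpec.models,
      middleware: parsedSpec.middleware,
      dependencies: parsedSpec.dependencies
    };
  });
}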

Handling Multi-File Generation

Real features span multiple files — routes, models, middleware, tests, and sometimes migrations. The pipeline needs to handle dependencies between generated files.

function generateMultiFile(parsedSpec, projectContext) {
  // Define file generation order based on dependencies
  var fileOrder = [
    { type: "model", template: "Generate the data model/schema" },
    { type: "middleware", template: "Generate validation middleware" },
    { type: "route", template: "Generate the route handler" },
    { type: "test", template: "Generate the test suite" }
  ];

  var generatedFiles = [];

  function generateNext(index) {
    if (index >= fileOrder.length) {
      return Promise.resolve(generatedFiles);
    }

    var stage = fileOrder[index];
    var prompt = stage.template + " for:\n" + JSON.stringify(parsedSpec) + "\n\n";

    // Include previously generated files as context
    if (generatedFiles.length > 0) {
      prompt += "Already generated files (use these as context for imports and references):\n";
      generatedFiles.forEach(function(file) {
        prompt += "=== FILE: " + file.path + " ===\n" + file.content + "\n\n";
      });
    }

    return anthropic.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 4000,
      system: buildSystemPrompt(projectContext),
      messages: [{ role: "user", content: prompt }]
    }).then(function(response) {
      var files = extractFiles(response.content[0].text);
      generatedFiles = generatedFiles.concat(files);
      return generateNext(index + 1);
    });
  }

  return generateNext(0);
}

Generating files sequentially with prior files as context produces dramatically better results than generating everything in a single prompt. The model can reference actual function signatures, actual export names, and actual file paths from earlier files instead of guessing.

Template-Based vs Free-Form Generation

Not all code generation needs an LLM. Template-based generation is faster, cheaper, and more predictable for boilerplate.

var Handlebars = require("handlebars");

// camelCase and lower are not built-in Handlebars helpers, so register
// simple versions before compiling the template below.
Handlebars.registerHelper("camelCase", function(str) {
  // The spec parser already asks for camelCase names; this just lowercases the first letter
  return String(str).charAt(0).toLowerCase() + String(str).slice(1);
});
Handlebars.registerHelper("lower", function(str) {
  return String(str).toLowerCase();
});

function templateBasedGeneration(parsedSpec) {
  var routeTemplate = Handlebars.compile(
    'var express = require("express");\n' +
    'var router = express.Router();\n' +
    'var {{camelCase name}} = require("../models/{{name}}");\n\n' +
    '{{#each endpoints}}\n' +
    'router.{{lower method}}("{{path}}", function(req, res) {\n' +
    '  // TODO: implement {{description}}\n' +
    '  res.status(200).json({ message: "Not implemented" });\n' +
    '});\n\n' +
    '{{/each}}\n' +
    'module.exports = router;\n'
  );

  return routeTemplate(parsedSpec);
}

function hybridGeneration(parsedSpec, projectContext) {
  // Use templates for structure, LLM for implementation
  var skeleton = templateBasedGeneration(parsedSpec);

  var prompt = "Complete the implementation of this Express.js route skeleton:\n\n";
  prompt += "```javascript\n" + skeleton + "\n```\n\n";
  prompt += "Specification: " + JSON.stringify(parsedSpec, null, 2) + "\n\n";
  prompt += "Replace each TODO comment with complete, working implementation code.";
  prompt += " Keep the existing structure and imports intact.";

  return anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4000,
    system: buildSystemPrompt(projectContext),
    messages: [{ role: "user", content: prompt }]
  }).then(function(response) {
    return response.content[0].text;
  });
}

The hybrid approach — template for structure, LLM for logic — gives you the best of both worlds. The template guarantees correct file structure, import ordering, and export patterns. The LLM fills in the business logic where creativity and understanding actually matter.
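
The hybrid output plugs into the same downstream stages as free-form generation. A possible wiring (the output path is an assumption; adjust it to your project layout):

// Hypothetical wiring: fill in the skeleton, then reuse the existing
// post-processing and sandbox stages on the single generated route file.
hybridGeneration(parsedSpec, projectContext)
  .then(function(routeSource) {
    // Strip markdown fences the model may add around the completed skeleton
    var content = routeSource.replace(/^```[a-z]*\n?/, "").replace(/\n?```\s*$/, "").trim();
    var files = [{ path: "routes/" + parsedSpec.name + ".js", content: content }];
    return postProcessFiles(files, projectContext.projectRoot);
  })
  .then(function(processedFiles) {
    return testInSandbox(processedFiles, parsedSpec);
  });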

Implementing Quality Gates

Quality gates are boolean checks that generated code must pass before the pipeline considers it complete. Each gate is independent and produces a clear pass/fail result.

function runQualityGates(files, sandboxDir) {
  var gates = [
    { name: "syntax", check: checkSyntax },
    { name: "lint", check: checkLint },
    { name: "tests", check: checkTests },
    { name: "coverage", check: checkCoverage },
    { name: "security", check: checkSecurity }
  ];

  var results = [];

  return gates.reduce(function(promise, gate) {
    return promise.then(function() {
      return gate.check(files, sandboxDir).then(function(result) {
        results.push({
          gate: gate.name,
          passed: result.passed,
          message: result.message
        });
        console.log("[" + (result.passed ? "PASS" : "FAIL") + "] " + gate.name + ": " + result.message);
      });
    });
  }, Promise.resolve()).then(function() {
    var allPassed = results.every(function(r) { return r.passed; });
    return { passed: allPassed, results: results };
  });
}

function checkSyntax(files) {
  var errors = [];
  files.forEach(function(file) {
    if (!file.path.endsWith(".js")) return;
    try {
      // Use Node's built-in parser to check syntax
      require("vm").createScript(file.content, { filename: file.path });
    } catch (err) {
      errors.push(file.path + ": " + err.message);
    }
  });

  return Promise.resolve({
    passed: errors.length === 0,
    message: errors.length === 0 ? "All files have valid syntax" : errors.join("; ")
  });
}

function checkLint(files, sandboxDir) {
  return runCommand(
    "npx eslint --no-eslintrc --rule '{\"no-undef\": \"error\", \"no-unused-vars\": \"warn\"}' .",
    sandboxDir,
    15000
  ).then(function() {
    return { passed: true, message: "No lint errors" };
  }).catch(function(err) {
    return { passed: false, message: err.stdout || err.message };
  });
}

function checkTests(files, sandboxDir) {
  return runCommand("npm test", sandboxDir, 30000)
    .then(function(result) {
      var passing = result.stdout.match(/(\d+) passing/);
      var count = passing ? passing[1] : "unknown";
      return { passed: true, message: count + " tests passing" };
    })
    .catch(function(err) {
      return { passed: false, message: err.stderr || err.message };
    });
}

function checkCoverage(files, sandboxDir) {
  return runCommand("npx nyc --reporter=text npm test", sandboxDir, 30000)
    .then(function(result) {
      var coverageMatch = result.stdout.match(/All files\s*\|\s*([\d.]+)/);
      var coverage = coverageMatch ? parseFloat(coverageMatch[1]) : 0;
      return {
        passed: coverage >= 70,
        message: "Coverage: " + coverage + "% (minimum: 70%)"
      };
    })
    .catch(function(err) {
      return { passed: false, message: "Coverage check failed: " + err.message };
    });
}

function checkSecurity(files) {
  var issues = [];
  files.forEach(function(file) {
    // Check for common security anti-patterns
    if (file.content.indexOf("eval(") !== -1) {
      issues.push(file.path + ": uses eval()");
    }
    if (file.content.match(/sql.*\+.*req\./i)) {
      issues.push(file.path + ": potential SQL injection");
    }
    if (file.content.indexOf("innerHTML") !== -1) {
      issues.push(file.path + ": potential XSS via innerHTML");
    }
  });

  return Promise.resolve({
    passed: issues.length === 0,
    message: issues.length === 0 ? "No security issues detected" : issues.join("; ")
  });
}

The coverage gate at 70% is intentionally lower than what you would require for hand-written code. Generated test suites tend to cover the happy path thoroughly but miss edge cases. Getting 70% coverage from a generated test suite is actually a strong signal that the code and tests are coherent.

Code Review Automation

Beyond automated checks, you can use the LLM itself to review generated code for issues that static analysis misses.

function automatedCodeReview(files, parsedSpec) {
  var codeForReview = files.map(function(file) {
    return "=== " + file.path + " ===\n" + file.content;
  }).join("\n\n");

  var prompt = "Review this generated code for a feature: " + parsedSpec.description + "\n\n";
  prompt += codeForReview + "\n\n";
  prompt += "Check for:\n";
  prompt += "1. Missing error handling\n";
  prompt += "2. Missing input validation\n";
  prompt += "3. Race conditions or concurrency issues\n";
  prompt += "4. Missing edge cases in tests\n";
  prompt += "5. Incorrect HTTP status codes\n";
  prompt += "6. Security vulnerabilities\n\n";
  prompt += "Return a JSON array of issues found. Each issue: { severity: 'high'|'medium'|'low', file: string, line: number, description: string }\n";
  prompt += "If no issues found, return an empty array [].";

  return anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2000,
    messages: [{ role: "user", content: prompt }]
  }).then(function(response) {
    var text = response.content[0].text.trim();
    text = text.replace(/^```json\n?/, "").replace(/\n?```$/, "");
    return JSON.parse(text);
  });
}

Using a different model (or even a different prompt style) for review than for generation catches issues that the generator's blind spots would otherwise miss. It is similar to having a different developer review a pull request — fresh eyes catch different problems.
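
If you want to try that, the simplest change is to make the review model configurable instead of hard-coding it. Which model to use is an open choice, so the environment variable and fallback below are just an assumption:

// Hypothetical: let the review stage use a different model than generation.
var REVIEW_MODEL = process.env.CODEGEN_REVIEW_MODEL || "claude-sonnet-4-20250514";

// In automatedCodeReview, replace the hard-coded model with:
//   model: REVIEW_MODEL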

Integrating with CI/CD

Generated code should go through the same CI/CD pipeline as hand-written code. The generation pipeline outputs files and metadata that the CI system can use.

function prepareForCI(result, outputDir) {
  // Write generated files
  result.files.forEach(function(file) {
    var fullPath = path.join(outputDir, file.path);
    fs.mkdirSync(path.dirname(fullPath), { recursive: true });
    fs.writeFileSync(fullPath, file.content);
  });

  // Write generation metadata
  var metadata = {
    generatedAt: new Date().toISOString(),
    attempts: result.attempts,
    model: "claude-sonnet-4-20250514",
    specification: result.specification,
    qualityGates: result.qualityResults,
    reviewIssues: result.reviewIssues,
    cost: result.cost
  };

  fs.writeFileSync(
    path.join(outputDir, ".generation-metadata.json"),
    JSON.stringify(metadata, null, 2)
  );

  // Write a PR description
  var prBody = "## Auto-Generated Code\n\n";
  prBody += "**Feature:** " + result.specification.description + "\n\n";
  prBody += "**Files generated:**\n";
  result.files.forEach(function(file) {
    prBody += "- `" + file.path + "`\n";
  });
  prBody += "\n**Generation attempts:** " + result.attempts + "\n";
  prBody += "**Quality gates:** " + (result.qualityResults.passed ? "All passed" : "Some failed") + "\n";

  fs.writeFileSync(path.join(outputDir, ".pr-description.md"), prBody);

  return metadata;
}

The .generation-metadata.json file is crucial for auditability. When you find a bug in generated code six months later, you need to know which model produced it, what specification it was working from, and what quality gates it passed. Without this metadata, generated code becomes a black box.

Cost Tracking

Every API call costs money. A pipeline that iterates four times on a complex feature can cost several dollars. Track it.

function createCostTracker() {
  var tracker = {
    calls: [],
    totalInputTokens: 0,
    totalOutputTokens: 0
  };

  tracker.record = function(response, stage) {
    var usage = response.usage;
    var inputCost = (usage.input_tokens / 1000000) * 3.0;   // Claude Sonnet pricing
    var outputCost = (usage.output_tokens / 1000000) * 15.0;

    var entry = {
      stage: stage,
      inputTokens: usage.input_tokens,
      outputTokens: usage.output_tokens,
      cost: inputCost + outputCost,
      timestamp: new Date().toISOString()
    };

    tracker.calls.push(entry);
    tracker.totalInputTokens += usage.input_tokens;
    tracker.totalOutputTokens += usage.output_tokens;
  };

  tracker.getTotalCost = function() {
    return tracker.calls.reduce(function(sum, call) {
      return sum + call.cost;
    }, 0);
  };

  tracker.getSummary = function() {
    return {
      totalCalls: tracker.calls.length,
      totalInputTokens: tracker.totalInputTokens,
      totalOutputTokens: tracker.totalOutputTokens,
      totalCost: tracker.getTotalCost().toFixed(4),
      byStage: tracker.calls.reduce(function(acc, call) {
        if (!acc[call.stage]) acc[call.stage] = { calls: 0, cost: 0 };
        acc[call.stage].calls++;
        acc[call.stage].cost += call.cost;
        return acc;
      }, {})
    };
  };

  return tracker;
}

In production, I set a cost ceiling per generation request — typically $2.00. If the pipeline hits that ceiling, it stops iterating and returns the best result it has. Without a ceiling, a pathological specification can burn through your API budget in minutes.
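
The ceiling itself is a one-line guard. A sketch of how it might look, assuming the tracker is passed into the iteration loop and checked before each retry; the $2.00 figure is my default, not anything enforced by the API:

var COST_CEILING_USD = 2.0;

function underCostCeiling(costTracker) {
  // Check this inside iterate() before starting another attempt;
  // when it returns false, stop and return the best effort so far.
  return costTracker.getTotalCost() < COST_CEILING_USD;
}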

Complete Working Example

Here is the full pipeline wired together. It takes a specification string, runs through all stages, and outputs production-ready files.

var fs = require("fs");
var path = require("path");
var Anthropic = require("@anthropic-ai/sdk");

var anthropic = new Anthropic();

function runPipeline(specificationText, projectRoot, outputDir) {
  var costTracker = createCostTracker(); // call costTracker.record(response, stage) on each API response so the summary reflects real spend
  var startTime = Date.now();

  console.log("=== Code Generation Pipeline ===");
  console.log("Project root: " + projectRoot);
  console.log("Output dir: " + outputDir);
  console.log("");

  // Stage 1: Gather context
  console.log("[1/6] Gathering project context...");
  var projectContext = gatherProjectContext(projectRoot);
  projectContext.projectRoot = projectRoot;

  // Stage 2: Parse specification
  console.log("[2/6] Parsing specification...");
  return parseSpecification(specificationText)
    .then(function(parsedSpec) {
      console.log("Parsed spec: " + parsedSpec.name + " (" + parsedSpec.endpoints.length + " endpoints)");

      // Stage 3: Generate and iterate
      console.log("[3/6] Generating code (up to 3 attempts)...");
      return iterativeGenerate(parsedSpec, projectContext, 3)
        .then(function(genResult) {
          // Stage 4: Quality gates
          console.log("[4/6] Running quality gates...");
          return runQualityGates(genResult.files, genResult.sandboxDir || "")
            .then(function(qualityResults) {

              // Stage 5: Automated review
              console.log("[5/6] Running automated code review...");
              return automatedCodeReview(genResult.files, parsedSpec)
                .then(function(reviewIssues) {

                  // Stage 6: Output
                  console.log("[6/6] Writing output files...");
                  var result = {
                    success: genResult.success && qualityResults.passed,
                    files: genResult.files,
                    attempts: genResult.attempts,
                    specification: parsedSpec,
                    qualityResults: qualityResults,
                    reviewIssues: reviewIssues,
                    cost: costTracker.getSummary(),
                    duration: Date.now() - startTime
                  };

                  prepareForCI(result, outputDir);

                  console.log("");
                  console.log("=== Pipeline Complete ===");
                  console.log("Success: " + result.success);
                  console.log("Attempts: " + result.attempts);
                  console.log("Files: " + result.files.length);
                  console.log("Cost: $" + costTracker.getTotalCost().toFixed(4));
                  console.log("Duration: " + (result.duration / 1000).toFixed(1) + "s");
                  console.log("Review issues: " + reviewIssues.length);

                  if (reviewIssues.length > 0) {
                    console.log("\nReview issues found:");
                    reviewIssues.forEach(function(issue) {
                      console.log("  [" + issue.severity + "] " + issue.file + ":" + issue.line + " - " + issue.description);
                    });
                  }

                  return result;
                });
            });
        });
    });
}

// Usage
var spec = "Create a user management API with endpoints to create users (POST /api/users), " +
  "get a user by ID (GET /api/users/:id), list all users with pagination (GET /api/users), " +
  "and delete a user (DELETE /api/users/:id). Users have name, email, and role fields. " +
  "Email must be unique. Include input validation and proper error responses.";

runPipeline(spec, "/path/to/project", "/path/to/output")
  .then(function(result) {
    if (!result.success) {
      console.error("Pipeline completed with failures.");
      process.exit(1);
    }
  })
  .catch(function(err) {
    console.error("Pipeline error: " + err.message);
    process.exit(1);
  });

Run this with node pipeline.js and you will get a complete Express.js API feature with routes, models, validation, tests, generation metadata, and a PR description — all validated and reviewed.

Common Issues and Troubleshooting

1. SyntaxError: Unexpected token in JSON response

SyntaxError: Unexpected token '`' at position 0
    at JSON.parse (<anonymous>)
    at parseSpecification (pipeline.js:47)

This happens when the LLM wraps its JSON response in markdown code fences despite being told not to. The fix is to always strip fences before parsing. The text.replace(/^```json\n?/, "").replace(/\n?```$/, "") pattern handles this. You should also handle cases where the model returns explanatory text before or after the JSON by finding the first { and last } in the response.
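
A small helper that does both (strip fences, then fall back to slicing between the first { and the last }) keeps this failure mode out of the rest of the pipeline. This is a heuristic, not a full JSON repair:

function extractJson(text) {
  var cleaned = text.trim()
    .replace(/^```json\n?/, "")
    .replace(/\n?```$/, "");
  try {
    return JSON.parse(cleaned);
  } catch (err) {
    // Fall back to the substring between the first { and the last }
    var start = cleaned.indexOf("{");
    var end = cleaned.lastIndexOf("}");
    if (start === -1 || end <= start) throw err;
    return JSON.parse(cleaned.substring(start, end + 1));
  }
}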

2. ENOENT: npm install fails in sandbox

Error: ENOENT: no such file or directory, open '/tmp/codegen-sandbox-1706123456/node_modules/.package-lock.json'
npm ERR! code ENOENT
npm ERR! syscall open

This occurs when the sandbox directory is on a filesystem with restricted write permissions (common in containerized environments). Fix it by setting the HOME and npm_config_cache environment variables in the childProcess.exec call:

childProcess.exec(command, {
  cwd: sandboxDir,
  env: Object.assign({}, process.env, {
    HOME: sandboxDir,
    npm_config_cache: path.join(sandboxDir, ".npm-cache")
  })
});

3. MaxTokens exceeded during multi-file generation

Error: max_tokens: 4000 is less than the minimum required for this request

Complex features generate a lot of code. When the LLM hits the token limit, it truncates the output mid-file, breaking the file delimiter parser. Increase max_tokens for generation calls (8000-12000 for multi-file), and implement a fallback that generates files one at a time if the combined output exceeds limits.
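
Truncation can also be detected before parsing: the Anthropic Messages API reports a stop_reason of "max_tokens" when the output was cut off. A sketch of a guard that falls back to the sequential per-file generator shown earlier (generateMultiFile):

function extractFilesOrFallback(response, parsedSpec, projectContext) {
  if (response.stop_reason === "max_tokens") {
    // The output was cut off mid-file; regenerate one file at a time instead
    console.warn("Response truncated at max_tokens; falling back to per-file generation");
    return generateMultiFile(parsedSpec, projectContext);
  }
  return Promise.resolve(extractFiles(response.content[0].text));
}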

4. Test timeout in sandbox: tests hang indefinitely

Error: Timeout - Async callback was not called within the 10000ms timeout specified by mochajs
    at listOnTimeout (node:internal/timers:569:17)

Generated tests often forget to close server connections or database pools. The --exit flag on mocha helps, but the real fix is to include explicit cleanup instructions in the generation prompt: "Ensure all test files call server.close() in an after() hook and that no open handles remain after tests complete."

5. Rate limiting on rapid iterations

Error: 429 Too Many Requests
{"error":{"type":"rate_limit_error","message":"Number of request tokens has exceeded your per-minute rate limit"}}

When the pipeline iterates quickly, it can hit API rate limits. Add exponential backoff between iteration attempts:

function delay(ms) {
  return new Promise(function(resolve) { setTimeout(resolve, ms); });
}

// In the iterate function, before retrying:
return delay(1000 * Math.pow(2, attempt)).then(function() {
  return iterate();
});

Best Practices

  • Always sandbox generated code. Never run LLM-generated code directly in your project directory or with access to production resources. Use isolated temp directories with minimal permissions. The cost of creating a sandbox is trivial compared to the risk of generated code deleting files or corrupting data.

  • Set hard limits on iterations and cost. Three generation attempts and a $2.00 cost ceiling are sensible defaults. Without limits, a poorly-specified feature can loop forever, burning API credits and producing no useful output. Fail fast and let a human intervene.

  • Include existing code as context, not instruction. When you show the LLM existing code patterns, frame it as "here is how this project does things" rather than "modify this code." Context produces code that matches conventions. Instructions produce code that breaks existing functionality.

  • Version your prompts alongside your pipeline code. Prompt changes affect output quality as much as code changes. Store your system prompts and generation templates in version control. When a generation produces bad results, you can diff the prompt to understand why.

  • Prefer sequential file generation over all-at-once. Generating model, then middleware, then routes, then tests — with each subsequent file seeing the previous files — produces far better inter-file consistency than generating everything in a single prompt. The token cost is higher, but the quality improvement is worth it.

  • Test the test suite, not just the code. A generated test suite that passes but tests nothing is worse than useless — it gives false confidence. Check that the test file actually imports the generated code, calls the endpoints, and asserts on response bodies. A test file full of expect(true).to.be.true is a common LLM failure mode. A cheap heuristic check for this is sketched after this list.

  • Log everything for debugging. Every prompt sent, every response received, every test result, every quality gate outcome — log it all. When generation fails, the logs tell you exactly which stage failed, what the LLM was told, and what it produced. Without logs, debugging a failed generation is guesswork.

  • Use structured output formats with simple delimiters. The === FILE: path === delimiter pattern is far more robust than asking the LLM to output valid JSON containing code. JSON inside JSON with escaped newlines and quotes breaks constantly. Plain text with clear markers is easier for the LLM to produce and easier for you to parse.
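
As a concrete version of the "test the test suite" practice above, a cheap heuristic is to reject a generated test file that never references the code under test, never issues an HTTP request, or never asserts anything:

var path = require("path");

function looksLikeRealTestSuite(testFile, generatedFiles) {
  // Does the test file mention at least one other generated module by name?
  var referencesCode = generatedFiles.some(function(file) {
    if (file.path === testFile.path) return false;
    return testFile.content.indexOf(path.basename(file.path, ".js")) !== -1;
  });
  // Does it exercise endpoints and make assertions at all?
  var makesRequests = /supertest|request\(/.test(testFile.content);
  var hasAssertions = /expect\(|assert\./.test(testFile.content);
  return referencesCode && makesRequests && hasAssertions;
}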
