Code Generation Agents: Architecture and Implementation
Build code generation agents with sandbox execution, test-driven generation, iterative refinement, and git integration in Node.js.
Code generation agents are autonomous systems that go far beyond autocomplete. They read a task description, examine existing code, plan an approach, generate implementation with tests, execute those tests in a sandbox, and iterate until everything passes. This article walks through the architecture and implementation of a production-grade code generation agent in Node.js, covering sandbox execution, AST validation, git integration, and the guardrails that keep agents from doing damage.
If you have built LLM-powered tools before and found yourself wishing they could actually run code, verify their own output, and fix their own mistakes, this is the guide you have been waiting for.
Prerequisites
- Node.js v18+ installed
- Familiarity with LLM APIs (OpenAI, Anthropic, or similar)
- Understanding of ASTs (Abstract Syntax Trees) at a conceptual level
- Working knowledge of the child_process, fs, and path modules
- Git installed and configured locally
- A project directory you are comfortable experimenting in
Code Generation Agents vs. Simple Code Completion
Code completion tools predict the next few tokens based on cursor context. They are reactive, stateless, and narrowly scoped. A code generation agent is a fundamentally different beast. It maintains state across multiple steps, understands project structure, makes architectural decisions, and validates its own output.
Here is the distinction in concrete terms:
| Capability | Code Completion | Code Generation Agent |
|---|---|---|
| Scope | Single line or block | Multiple files, full features |
| Context | Current file, maybe imports | Entire project structure |
| Validation | None | Runs tests, parses AST, checks style |
| Iteration | One-shot | Generate, test, fix, re-test loop |
| Side effects | None | Writes files, runs commands, creates branches |
The agent pattern is what turns an LLM from a fancy autocomplete into something that can actually ship code.
Agent Architecture: The Core Loop
Every code generation agent follows the same fundamental loop, regardless of implementation details:
- Understand the task -- Parse the request, identify what needs to change, what constraints exist
- Gather context -- Read existing files, understand patterns, check dependencies
- Plan -- Decide which files to create or modify, in what order, with what approach
- Generate -- Produce code (and tests) based on the plan
- Validate -- Parse the generated code, run tests, check for errors
- Iterate -- If validation fails, feed errors back into the LLM and regenerate
This loop is deceptively simple. The complexity lives in each step's implementation, and in knowing when to stop.
┌──────────────┐
│ Task Input │
└──────┬───────┘
│
v
┌──────────────┐ ┌──────────────────┐
│ Understand │────>│ Gather Context │
└──────────────┘ └────────┬─────────┘
│
v
┌──────────────────┐
│ Plan │
└────────┬─────────┘
│
v
┌──────────────────┐
┌─────>│ Generate │
│ └────────┬─────────┘
│ │
│ v
│ ┌──────────────────┐
│ │ Validate │
│ └────────┬─────────┘
│ │
│ Pass?│
│ ┌─────┴─────┐
│ │No │Yes
│ v v
│ ┌────────────┐ ┌──────┐
└──│ Fix │ │ Done │
└────────────┘ └──────┘
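Expressed as code, the skeleton of the loop is short. The helper names below (understandTask, gatherContext, planApproach, generate, validate, fixWithErrors) are placeholders for the components built in the rest of this article, not a fixed API:
function runAgentLoop(task, maxAttempts) {
  var plan = planApproach(understandTask(task), gatherContext());
  var output = generate(plan);
  for (var attempt = 1; attempt <= maxAttempts; attempt++) {
    var result = validate(output); // parse the AST, run the tests
    if (result.passed) {
      return { success: true, output: output, attempts: attempt };
    }
    output = fixWithErrors(output, result.errors); // feed errors back to the LLM
  }
  return { success: false, output: output, attempts: maxAttempts };
}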
Implementing File System Tools
The agent needs controlled access to the file system. You do not want it writing to arbitrary paths. Here is a toolkit that wraps fs operations with safety checks:
var fs = require("fs");
var path = require("path");
function createFileTools(workspaceRoot) {
var allowedRoot = path.resolve(workspaceRoot);
function isPathAllowed(filePath) {
var resolved = path.resolve(filePath);
return resolved.startsWith(allowedRoot);
}
function readFile(filePath) {
if (!isPathAllowed(filePath)) {
return { error: "Path outside workspace: " + filePath };
}
try {
var content = fs.readFileSync(filePath, "utf8");
return { content: content, lines: content.split("\n").length };
} catch (err) {
return { error: "Cannot read file: " + err.message };
}
}
function writeFile(filePath, content) {
if (!isPathAllowed(filePath)) {
return { error: "Path outside workspace: " + filePath };
}
var maxSize = 500 * 1024; // 500KB limit
if (Buffer.byteLength(content, "utf8") > maxSize) {
return { error: "File exceeds 500KB size limit" };
}
try {
var dir = path.dirname(filePath);
if (!fs.existsSync(dir)) {
fs.mkdirSync(dir, { recursive: true });
}
fs.writeFileSync(filePath, content, "utf8");
return { success: true, path: filePath };
} catch (err) {
return { error: "Cannot write file: " + err.message };
}
}
function listFiles(dirPath, pattern) {
if (!isPathAllowed(dirPath)) {
return { error: "Path outside workspace: " + dirPath };
}
try {
var results = [];
var entries = fs.readdirSync(dirPath, { withFileTypes: true });
entries.forEach(function(entry) {
var fullPath = path.join(dirPath, entry.name);
if (entry.isDirectory()) {
var subResults = listFiles(fullPath, pattern);
if (subResults.files) {
results = results.concat(subResults.files);
}
} else if (!pattern || entry.name.match(new RegExp(pattern))) {
results.push(fullPath);
}
});
return { files: results };
} catch (err) {
return { error: "Cannot list directory: " + err.message };
}
}
function searchFiles(dirPath, searchPattern) {
var fileList = listFiles(dirPath, "\\.js$");
var matches = [];
if (!fileList.files) return { matches: [] };
fileList.files.forEach(function(filePath) {
var content = fs.readFileSync(filePath, "utf8");
var lines = content.split("\n");
lines.forEach(function(line, index) {
if (line.indexOf(searchPattern) !== -1) {
matches.push({
file: filePath,
line: index + 1,
content: line.trim()
});
}
});
});
return { matches: matches };
}
return {
readFile: readFile,
writeFile: writeFile,
listFiles: listFiles,
searchFiles: searchFiles
};
}
The isPathAllowed check is non-negotiable. Without it, a misbehaving LLM could overwrite system files, configuration, or anything else on disk. Every file operation must go through this gate.
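A quick usage sketch -- the workspace path and file names here are hypothetical:
var tools = createFileTools("/home/dev/my-project");

var read = tools.readFile("/home/dev/my-project/src/index.js");
console.log(read.lines || read.error);

var blocked = tools.writeFile("/etc/passwd", "overwritten");
console.log(blocked.error); // "Path outside workspace: /etc/passwd"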
Implementing a Code Execution Sandbox
Running LLM-generated code is inherently risky. You need a sandbox that limits execution time, memory usage, and system access. The child_process module gives us the building blocks:
var childProcess = require("child_process");
var path = require("path");
var os = require("os");
function createSandbox(options) {
var timeout = options.timeout || 30000;
var maxBuffer = options.maxBuffer || 1024 * 1024; // 1MB
var workDir = options.workDir;
function execute(command, args) {
return new Promise(function(resolve) {
var startTime = Date.now();
try {
var env = Object.assign({}, process.env);
// Strip sensitive environment variables
delete env.OPENAI_API_KEY;
delete env.ANTHROPIC_API_KEY;
delete env.AWS_SECRET_ACCESS_KEY;
delete env.DATABASE_URL;
var result = childProcess.spawnSync(command, args, {
cwd: workDir,
timeout: timeout,
maxBuffer: maxBuffer,
env: env,
stdio: ["pipe", "pipe", "pipe"],
shell: false // Prevent shell injection
});
var elapsed = Date.now() - startTime;
if (result.error) {
if (result.error.code === "ETIMEDOUT") {
resolve({
success: false,
error: "Execution timed out after " + timeout + "ms",
elapsed: elapsed
});
return;
}
resolve({
success: false,
error: result.error.message,
elapsed: elapsed
});
return;
}
resolve({
success: result.status === 0,
stdout: result.stdout ? result.stdout.toString("utf8") : "",
stderr: result.stderr ? result.stderr.toString("utf8") : "",
exitCode: result.status,
elapsed: elapsed
});
} catch (err) {
resolve({
success: false,
error: "Sandbox error: " + err.message,
elapsed: Date.now() - startTime
});
}
});
}
function runTests(testFile) {
var testRunner = path.join(workDir, "node_modules", ".bin", "mocha");
if (!require("fs").existsSync(testRunner)) {
testRunner = "npx";
return execute(testRunner, ["mocha", testFile, "--timeout", "10000", "--exit"]);
}
return execute(testRunner, [testFile, "--timeout", "10000", "--exit"]);
}
function runNode(scriptFile) {
return execute(process.execPath, [scriptFile]);
}
return {
execute: execute,
runTests: runTests,
runNode: runNode
};
}
A few critical details here. Setting shell: false prevents shell injection attacks, where the LLM might try to sneak semicolons or pipes into a command. Stripping API keys from the environment prevents the generated code from calling external services. The timeout prevents infinite loops from hanging your system.
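A usage sketch, assuming a hypothetical smoke-test script inside the workspace:
var sandbox = createSandbox({ workDir: "/home/dev/my-project", timeout: 30000 });

sandbox.runNode("scripts/smoke-test.js").then(function(result) {
  if (!result.success) {
    console.error("Failed after " + result.elapsed + "ms: " + (result.error || result.stderr));
    return;
  }
  console.log("Output: " + result.stdout);
});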
AST-Based Code Validation
Before writing generated code to disk, parse it and verify it is syntactically valid. This catches a huge class of errors before they waste a test execution cycle:
var acorn = require("acorn");
function validateJavaScript(code, filePath) {
var errors = [];
// Syntax check via AST parsing
try {
acorn.parse(code, {
ecmaVersion: 2020,
sourceType: "script",
allowReturnOutsideFunction: false
});
} catch (parseError) {
errors.push({
type: "syntax",
message: parseError.message,
line: parseError.loc ? parseError.loc.line : null,
column: parseError.loc ? parseError.loc.column : null
});
}
// Check for common LLM mistakes
if (code.indexOf("```") !== -1) {
errors.push({
type: "format",
message: "Code contains markdown fences (```). LLM did not extract code properly."
});
}
if (code.indexOf("// TODO: implement") !== -1 || code.indexOf("// ...") !== -1) {
errors.push({
type: "incomplete",
message: "Code contains placeholder comments indicating incomplete implementation."
});
}
// Check for accidental import/export (if targeting CommonJS)
var importMatch = code.match(/^import\s+/m);
if (importMatch) {
errors.push({
type: "module",
message: "Code uses ES module import syntax but project uses CommonJS require()."
});
}
// Verify file would not be unreasonably large
var lineCount = code.split("\n").length;
if (lineCount > 2000) {
errors.push({
type: "size",
message: "Generated file has " + lineCount + " lines. Maximum is 2000."
});
}
return {
valid: errors.length === 0,
errors: errors
};
}
The markdown fence check is one you learn the hard way. LLMs love to wrap code in triple backticks even when you tell them not to. Catching it in validation saves you from writing a file that starts with ```javascript and immediately fails to parse.
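For example, a response that still carries its markdown fences is flagged before it ever reaches the sandbox:
var badOutput = "```javascript\nfunction add(a, b) { return a + b; }\n```";
var result = validateJavaScript(badOutput, "src/add.js");
console.log(result.valid); // false
result.errors.forEach(function(err) {
  console.log(err.type + ": " + err.message); // the "format" check catches the fences
});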
Test-Driven Code Generation
The most reliable pattern for code generation agents is test-driven generation: write the tests first, then generate code to pass them. This gives the agent an unambiguous success criterion.
function buildTestFirstPrompt(task, existingCode, testFramework) {
var prompt = "You are implementing the following task:\n\n";
prompt += task.description + "\n\n";
if (existingCode.length > 0) {
prompt += "Here is the existing code in the project that is relevant:\n\n";
existingCode.forEach(function(file) {
prompt += "--- " + file.path + " ---\n";
prompt += file.content + "\n\n";
});
}
prompt += "Step 1: Write comprehensive tests for the functionality described above.\n";
prompt += "Use " + testFramework + " as the test framework.\n";
prompt += "Cover normal operation, edge cases, and error conditions.\n";
prompt += "Use require() syntax, var declarations, and function() expressions.\n\n";
prompt += "Step 2: Write the implementation that passes all tests.\n\n";
prompt += "Return your response as JSON with this structure:\n";
prompt += '{ "testFile": { "path": "...", "content": "..." },';
prompt += ' "sourceFiles": [{ "path": "...", "content": "..." }] }\n';
return prompt;
}
The key insight is that tests provide a machine-verifiable contract. The agent does not need human judgment to know if its output is correct. Either the tests pass or they do not.
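Calling it is plain string assembly. The task and the existing-code sample below are hypothetical:
var task = { description: "Add a slugify(title) helper that lowercases, trims, and hyphenates titles." };
var existing = [
  { path: "lib/strings.js", content: "module.exports = {};" }
];
var prompt = buildTestFirstPrompt(task, existing, "mocha");
// Send prompt to the LLM; the response should contain a testFile and sourceFiles as JSON.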
Iterative Refinement: The Fix Loop
When tests fail, the agent needs to see the error output and try again. This is where the real power of the agent pattern emerges -- it can debug its own code:
function createRefinementLoop(options) {
var maxAttempts = options.maxAttempts || 5;
var llmClient = options.llmClient;
var sandbox = options.sandbox;
var fileTools = options.fileTools;
function refine(generatedFiles, testFile) {
return new Promise(function(resolve) {
var attempt = 0;
var history = [];
function tryIteration() {
attempt++;
console.log("Attempt " + attempt + " of " + maxAttempts);
// Write all generated files
generatedFiles.forEach(function(file) {
fileTools.writeFile(file.path, file.content);
});
// Write test file
fileTools.writeFile(testFile.path, testFile.content);
// Run tests
sandbox.runTests(testFile.path).then(function(result) {
history.push({
attempt: attempt,
stdout: result.stdout,
stderr: result.stderr,
success: result.success,
elapsed: result.elapsed
});
if (result.success) {
resolve({
success: true,
attempts: attempt,
files: generatedFiles,
testFile: testFile,
history: history
});
return;
}
if (attempt >= maxAttempts) {
resolve({
success: false,
attempts: attempt,
lastError: result.stderr || result.stdout,
history: history
});
return;
}
// Build fix prompt with error context
var fixPrompt = buildFixPrompt(
generatedFiles,
testFile,
result.stderr || result.stdout,
history
);
llmClient.generate(fixPrompt).then(function(fixResponse) {
var fixed = parseGeneratedFiles(fixResponse);
if (fixed) {
generatedFiles = fixed.sourceFiles;
if (fixed.testFile) {
testFile = fixed.testFile;
}
}
tryIteration();
});
});
}
tryIteration();
});
}
return { refine: refine };
}
function buildFixPrompt(sourceFiles, testFile, errorOutput, history) {
var prompt = "The tests are failing. Fix the implementation.\n\n";
prompt += "ERROR OUTPUT:\n" + errorOutput + "\n\n";
prompt += "CURRENT SOURCE FILES:\n";
sourceFiles.forEach(function(file) {
prompt += "--- " + file.path + " ---\n" + file.content + "\n\n";
});
prompt += "TEST FILE:\n--- " + testFile.path + " ---\n" + testFile.content + "\n\n";
prompt += "ATTEMPT HISTORY:\n";
history.forEach(function(h) {
prompt += "Attempt " + h.attempt + ": " + (h.success ? "PASS" : "FAIL") + "\n";
});
prompt += "\nFix the source files so all tests pass. ";
prompt += "Do NOT modify the tests unless they contain a genuine bug. ";
prompt += "Return the same JSON format as before.\n";
return prompt;
}
The attempt history is important. Without it, the LLM will sometimes oscillate between two broken implementations. Showing it the full history helps it avoid repeating mistakes.
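One extension worth considering -- not part of buildFixPrompt above -- is to attach a short error snippet to each history entry, so the model sees which failures it has already produced instead of a bare PASS/FAIL:
history.forEach(function(h) {
  var snippet = (h.stderr || h.stdout || "").split("\n").slice(0, 3).join(" | ");
  prompt += "Attempt " + h.attempt + ": " + (h.success ? "PASS" : "FAIL");
  prompt += snippet ? " -- " + snippet.substring(0, 200) + "\n" : "\n";
});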
Context Gathering: Matching Existing Patterns
A code generation agent that ignores existing project conventions will produce code that works but looks alien. Context gathering solves this:
function gatherProjectContext(workspaceRoot, fileTools) {
var context = {
packageJson: null,
existingPatterns: [],
dependencies: [],
style: {}
};
// Read package.json for dependencies and scripts
var pkgResult = fileTools.readFile(path.join(workspaceRoot, "package.json"));
if (pkgResult.content) {
try {
context.packageJson = JSON.parse(pkgResult.content);
context.dependencies = Object.keys(context.packageJson.dependencies || {});
} catch (e) {
// malformed package.json, proceed without it
}
}
// Sample existing source files to learn patterns
var sourceFiles = fileTools.listFiles(workspaceRoot, "\\.js$");
if (sourceFiles.files) {
var sampled = sourceFiles.files.slice(0, 5);
sampled.forEach(function(filePath) {
var content = fileTools.readFile(filePath);
if (content.content) {
context.existingPatterns.push({
path: filePath,
usesRequire: content.content.indexOf("require(") !== -1,
usesImport: content.content.indexOf("import ") !== -1,
usesSemicolons: content.content.indexOf(";") !== -1,
indentation: detectIndentation(content.content),
lineCount: content.lines
});
}
});
}
// Determine style consensus
var tabCount = 0;
var spaceCount = 0;
var semicolonCount = 0;
context.existingPatterns.forEach(function(p) {
if (p.indentation === "tabs") tabCount++;
else spaceCount++;
if (p.usesSemicolons) semicolonCount++;
});
context.style = {
indentation: tabCount > spaceCount ? "tabs" : "spaces",
semicolons: semicolonCount > context.existingPatterns.length / 2,
moduleSystem: context.existingPatterns.some(function(p) { return p.usesImport; })
? "esm" : "commonjs"
};
return context;
}
function detectIndentation(code) {
var lines = code.split("\n").slice(0, 50);
var tabs = 0;
var spaces = 0;
lines.forEach(function(line) {
if (line.match(/^\t/)) tabs++;
else if (line.match(/^ /)) spaces++;
});
return tabs > spaces ? "tabs" : "spaces";
}
This context gets injected into the generation prompt so the LLM knows to use require() instead of import, tabs instead of spaces, or whatever conventions the project follows.
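The resulting style object is small and easy to fold into a prompt. Using the file tools from earlier, a typical CommonJS project might yield:
var context = gatherProjectContext(process.cwd(), fileTools);
console.log(context.style);
// e.g. { indentation: "spaces", semicolons: true, moduleSystem: "commonjs" }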
Handling Multi-File Code Generation
Real features span multiple files. The agent needs to handle file dependencies, generate them in the right order, and maintain consistency across them:
function planMultiFileGeneration(plannedFiles) {
// plannedFiles comes from the LLM's planning step, e.g.:
// [
// { path: "models/user.js", type: "model", dependsOn: [] },
// { path: "routes/users.js", type: "route", dependsOn: ["models/user.js"] },
// { path: "test/users.test.js", type: "test", dependsOn: ["models/user.js", "routes/users.js"] }
// ]
// Topological sort to determine generation order
function topoSort(files) {
var sorted = [];
var visited = {};
var visiting = {};
function visit(filePath) {
if (visited[filePath]) return;
if (visiting[filePath]) {
throw new Error("Circular dependency detected: " + filePath);
}
visiting[filePath] = true;
var file = files.find(function(f) { return f.path === filePath; });
if (file && file.dependsOn) {
file.dependsOn.forEach(function(dep) {
visit(dep);
});
}
delete visiting[filePath];
visited[filePath] = true;
sorted.push(filePath);
}
files.forEach(function(f) { visit(f.path); });
return sorted;
}
return {
files: plannedFiles,
order: topoSort(plannedFiles)
};
}
The topological sort ensures models are generated before routes that depend on them, and tests are generated last so they can reference everything. Without this ordering, the LLM might generate a route file that references a model that does not exist yet, producing inconsistent code.
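Using the hypothetical plan from the comments above:
var plan = planMultiFileGeneration([
  { path: "models/user.js", type: "model", dependsOn: [] },
  { path: "routes/users.js", type: "route", dependsOn: ["models/user.js"] },
  { path: "test/users.test.js", type: "test", dependsOn: ["models/user.js", "routes/users.js"] }
]);
console.log(plan.order);
// ["models/user.js", "routes/users.js", "test/users.test.js"]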
Diff-Based Output
For modifying existing files, generating a full replacement is wasteful and error-prone. Diffs are better because they show exactly what changed, making review easier and merge conflicts less likely:
var diffLib = require("diff");
function generateDiff(originalContent, newContent, filePath) {
var patch = diffLib.createPatch(
filePath,
originalContent,
newContent,
"original",
"modified"
);
return patch;
}
function applyDiff(originalContent, patch) {
var result = diffLib.applyPatch(originalContent, patch);
if (result === false) {
return { error: "Failed to apply patch. File may have changed since diff was generated." };
}
return { content: result };
}
function parseLLMDiff(llmOutput) {
// LLMs sometimes produce approximate diffs. This normalizes them.
var lines = llmOutput.split("\n");
var hunks = [];
var currentHunk = null;
lines.forEach(function(line) {
if (line.match(/^@@/)) {
if (currentHunk) hunks.push(currentHunk);
currentHunk = { header: line, lines: [] };
} else if (currentHunk) {
currentHunk.lines.push(line);
}
});
if (currentHunk) hunks.push(currentHunk);
return hunks;
}
In practice, I have found that asking the LLM to produce standard unified diffs works about 70% of the time. For the other 30%, you need a fallback that does full file replacement. The agent should try diff-based output first and fall back gracefully.
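A minimal sketch of that fallback, assuming the LLM response carries either a patch string or a full newContent field (both field names are hypothetical):
function applyLLMChange(originalContent, llmChange) {
  if (llmChange.patch) {
    var patched = applyDiff(originalContent, llmChange.patch);
    if (!patched.error) {
      return { content: patched.content, strategy: "diff" };
    }
  }
  // Fall back to full-file replacement when the diff is missing or does not apply
  if (llmChange.newContent) {
    return { content: llmChange.newContent, strategy: "replace" };
  }
  return { error: "Response contained neither a usable patch nor full file content" };
}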
Implementing Guardrails
Guardrails are the difference between a useful tool and a liability. Here are the ones I consider mandatory:
function createGuardrails(config) {
var maxFileSize = config.maxFileSize || 500 * 1024;
var maxFilesPerRun = config.maxFilesPerRun || 20;
var allowedExtensions = config.allowedExtensions || [".js", ".json", ".md", ".txt", ".css", ".html"];
var restrictedPaths = config.restrictedPaths || [
"node_modules",
".git",
".env",
"package-lock.json"
];
var filesWritten = 0;
function checkWrite(filePath, content) {
var errors = [];
// Check extension
var ext = path.extname(filePath);
if (allowedExtensions.indexOf(ext) === -1) {
errors.push("File extension " + ext + " is not allowed");
}
// Check restricted paths
restrictedPaths.forEach(function(restricted) {
if (filePath.indexOf(restricted) !== -1) {
errors.push("Path contains restricted segment: " + restricted);
}
});
// Check file size
if (Buffer.byteLength(content, "utf8") > maxFileSize) {
errors.push("File size exceeds limit of " + maxFileSize + " bytes");
}
// Check files-per-run limit
filesWritten++;
if (filesWritten > maxFilesPerRun) {
errors.push("Maximum files per run (" + maxFilesPerRun + ") exceeded");
}
// Check for suspicious patterns in generated code
var suspicious = [
{ pattern: /process\.env\.\w+/, message: "Code accesses environment variables" },
{ pattern: /require\s*\(\s*['"]child_process['"]/, message: "Code imports child_process" },
{ pattern: /require\s*\(\s*['"]net['"]/, message: "Code imports net module" },
{ pattern: /\.exec\s*\(/, message: "Code calls exec()" },
{ pattern: /eval\s*\(/, message: "Code uses eval()" }
];
suspicious.forEach(function(check) {
if (check.pattern.test(content)) {
errors.push("WARNING: " + check.message);
}
});
return {
allowed: errors.filter(function(e) { return e.indexOf("WARNING") === -1; }).length === 0,
errors: errors
};
}
function reset() {
filesWritten = 0;
}
return {
checkWrite: checkWrite,
reset: reset
};
}
The suspicious pattern checks are warnings, not hard blocks. You want the human operator to see them and decide. The restricted path checks and file size limits are hard blocks because the downside of getting them wrong is severe.
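A usage sketch with hypothetical file contents:
var guardrails = createGuardrails({ maxFilesPerRun: 10 });

var ok = guardrails.checkWrite("src/shorten.js", "var crypto = require(\"crypto\");\n");
console.log(ok.allowed); // true -- warnings, if any, appear in ok.errors without blocking

var blocked = guardrails.checkWrite(".env", "SECRET=1");
console.log(blocked.allowed); // false -- restricted path, and dotfiles have no allowed extension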
Integrating with Git
A well-behaved agent works on a branch, commits its changes, and makes the diff reviewable:
var childProcess = require("child_process");
function createGitTools(repoPath) {
function git(args) {
var result = childProcess.spawnSync("git", args, {
cwd: repoPath,
timeout: 15000,
encoding: "utf8"
});
return {
success: result.status === 0,
stdout: (result.stdout || "").trim(),
stderr: (result.stderr || "").trim()
};
}
function createBranch(branchName) {
var result = git(["checkout", "-b", branchName]);
if (!result.success) {
// Branch might already exist
result = git(["checkout", branchName]);
}
return result;
}
function stageFiles(files) {
return git(["add"].concat(files));
}
function commit(message) {
return git(["commit", "-m", message]);
}
function diff() {
return git(["diff", "--staged"]);
}
function getCurrentBranch() {
return git(["rev-parse", "--abbrev-ref", "HEAD"]);
}
function getStatus() {
return git(["status", "--porcelain"]);
}
function revertAll() {
git(["checkout", "."]);
git(["clean", "-fd"]);
return { success: true };
}
return {
createBranch: createBranch,
stageFiles: stageFiles,
commit: commit,
diff: diff,
getCurrentBranch: getCurrentBranch,
getStatus: getStatus,
revertAll: revertAll
};
}
The revertAll function is your emergency brake. If the agent produces garbage, one call undoes everything. Always work on a branch, never on main.
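One extra check worth running before the agent writes anything -- a sketch, not part of createGitTools above -- is to refuse to start while sitting on the main branch:
var gitTools = createGitTools(process.cwd());
var branch = gitTools.getCurrentBranch();
if (branch.stdout === "main" || branch.stdout === "master") {
  throw new Error("Refusing to run: the agent must work on a dedicated branch, not " + branch.stdout);
}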
Measuring Code Quality
You need metrics to know if the agent is producing good code or just code that passes tests. Here is a basic quality scorer:
function measureCodeQuality(code, filePath) {
var metrics = {
lineCount: 0,
functionCount: 0,
avgFunctionLength: 0,
commentRatio: 0,
cyclomaticComplexity: 0,
duplicateBlocks: 0,
score: 0
};
var lines = code.split("\n");
metrics.lineCount = lines.length;
// Count functions
var functionMatches = code.match(/function\s*\w*\s*\(/g);
metrics.functionCount = functionMatches ? functionMatches.length : 0;
// Count comments
var commentLines = lines.filter(function(line) {
return line.trim().match(/^\/\//) || line.trim().match(/^\/?\*/);
});
metrics.commentRatio = lines.length > 0
? (commentLines.length / lines.length).toFixed(2)
: 0;
// Estimate cyclomatic complexity (branches)
var branches = code.match(/\b(if|else|for|while|switch|case|catch|\?\s*:)\b/g);
metrics.cyclomaticComplexity = branches ? branches.length : 0;
// Detect duplicate blocks (simplified: consecutive identical lines)
var lineSet = {};
var duplicates = 0;
lines.forEach(function(line) {
var trimmed = line.trim();
if (trimmed.length > 20) {
if (lineSet[trimmed]) duplicates++;
lineSet[trimmed] = true;
}
});
metrics.duplicateBlocks = duplicates;
// Calculate composite score (0-100)
var score = 100;
if (metrics.cyclomaticComplexity > 20) score -= 15;
if (metrics.commentRatio < 0.05) score -= 10;
if (metrics.commentRatio > 0.4) score -= 5; // over-commented
if (metrics.duplicateBlocks > 5) score -= 15;
if (metrics.functionCount > 0 && metrics.lineCount / metrics.functionCount > 50) {
score -= 10; // functions too long
}
metrics.score = Math.max(0, score);
return metrics;
}
Track these metrics over time. If the agent's quality score trends downward, you have a prompt engineering problem to solve.
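A simple way to track the trend is to append each run's metrics to a JSONL log; the log filename here is arbitrary:
var fs = require("fs");

function recordQuality(filePath, code) {
  var metrics = measureCodeQuality(code, filePath);
  var entry = {
    timestamp: new Date().toISOString(),
    file: filePath,
    score: metrics.score,
    complexity: metrics.cyclomaticComplexity
  };
  fs.appendFileSync("agent-quality.jsonl", JSON.stringify(entry) + "\n", "utf8");
  return metrics;
}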
Complete Working Example
Here is the full agent that ties everything together. It reads a task, examines existing code, generates an implementation with tests, runs them in a sandbox, and iterates until they pass:
var fs = require("fs");
var path = require("path");
var https = require("https");
// Import our modules from the sections above
// In a real project these would be separate files
var fileTools = createFileTools(process.cwd());
var sandbox = createSandbox({ workDir: process.cwd(), timeout: 30000 });
var guardrails = createGuardrails({});
var gitTools = createGitTools(process.cwd());
function CodeGenerationAgent(config) {
var apiKey = config.apiKey;
var model = config.model || "claude-sonnet-4-20250514";
var maxIterations = config.maxIterations || 5;
function callLLM(messages) {
return new Promise(function(resolve, reject) {
var body = JSON.stringify({
model: model,
max_tokens: 8192,
messages: messages
});
var options = {
hostname: "api.anthropic.com",
path: "/v1/messages",
method: "POST",
headers: {
"Content-Type": "application/json",
"x-api-key": apiKey,
"anthropic-version": "2023-06-01"
}
};
var req = https.request(options, function(res) {
var data = "";
res.on("data", function(chunk) { data += chunk; });
res.on("end", function() {
try {
var parsed = JSON.parse(data);
var text = parsed.content[0].text;
resolve(text);
} catch (e) {
reject(new Error("Failed to parse LLM response: " + e.message));
}
});
});
req.on("error", reject);
req.write(body);
req.end();
});
}
function run(taskDescription) {
console.log("Starting code generation agent...");
console.log("Task: " + taskDescription);
// Step 1: Gather context
var context = gatherProjectContext(process.cwd(), fileTools);
console.log("Project uses " + context.style.moduleSystem + " modules");
console.log("Found " + context.dependencies.length + " dependencies");
// Step 2: Create a working branch
var branchName = "agent/gen-" + Date.now();
gitTools.createBranch(branchName);
console.log("Working on branch: " + branchName);
// Step 3: Generate code with tests
var systemPrompt = "You are a code generation agent. ";
systemPrompt += "Generate production-quality Node.js code using CommonJS (require, module.exports). ";
systemPrompt += "Use var for declarations and function() for functions. ";
systemPrompt += "Always include comprehensive tests using mocha and assert. ";
systemPrompt += "Return valid JSON with testFile and sourceFiles arrays.";
var contextSummary = "Project dependencies: " + context.dependencies.join(", ") + "\n";
contextSummary += "Module system: " + context.style.moduleSystem + "\n";
contextSummary += "Style: " + (context.style.semicolons ? "semicolons" : "no semicolons") + ", ";
contextSummary += context.style.indentation + "\n";
var messages = [
{ role: "user", content: contextSummary + "\nTask: " + taskDescription +
"\n\nGenerate the implementation and tests as JSON." }
];
return callLLM(messages).then(function(response) {
// Parse the generated files from LLM response
var generated;
try {
// Extract JSON from response (handle markdown wrapping)
var jsonStr = response;
var jsonMatch = response.match(/```(?:json)?\s*([\s\S]*?)```/);
if (jsonMatch) jsonStr = jsonMatch[1];
generated = JSON.parse(jsonStr);
} catch (e) {
console.error("Failed to parse LLM output as JSON");
return { success: false, error: "Invalid LLM response format" };
}
// Step 4: Validate before writing
var allValid = true;
generated.sourceFiles.forEach(function(file) {
var validation = validateJavaScript(file.content, file.path);
if (!validation.valid) {
console.error("Validation failed for " + file.path + ":");
validation.errors.forEach(function(err) {
console.error(" " + err.type + ": " + err.message);
});
allValid = false;
}
});
if (!allValid) {
console.log("Generated code failed validation. Requesting fix...");
// Feed validation errors back to LLM for another attempt
// (abbreviated here -- uses the refinement loop from above)
}
// Step 5: Check guardrails and write files
var filesToWrite = generated.sourceFiles.concat([generated.testFile]);
filesToWrite.forEach(function(file) {
var check = guardrails.checkWrite(file.path, file.content);
if (!check.allowed) {
console.error("Guardrail blocked write to " + file.path + ": " +
check.errors.join(", "));
return;
}
check.errors.forEach(function(warning) {
console.warn(warning);
});
fileTools.writeFile(path.join(process.cwd(), file.path), file.content);
});
// Step 6: Run tests and iterate
var loop = createRefinementLoop({
maxAttempts: maxIterations,
llmClient: { generate: function(prompt) {
return callLLM([{ role: "user", content: prompt }]);
}},
sandbox: sandbox,
fileTools: fileTools
});
return loop.refine(generated.sourceFiles, generated.testFile).then(function(result) {
if (result.success) {
console.log("All tests pass after " + result.attempts + " attempt(s)");
// Step 7: Measure quality
result.files.forEach(function(file) {
var quality = measureCodeQuality(file.content, file.path);
console.log(file.path + " quality score: " + quality.score + "/100");
});
// Step 8: Commit
var allPaths = result.files.map(function(f) { return f.path; });
allPaths.push(result.testFile.path);
gitTools.stageFiles(allPaths);
gitTools.commit("feat: " + taskDescription.substring(0, 72));
console.log("Changes committed on branch " + branchName);
return { success: true, branch: branchName, attempts: result.attempts };
} else {
console.error("Failed after " + result.attempts + " attempts");
console.error("Last error: " + result.lastError);
gitTools.revertAll();
return { success: false, error: result.lastError, history: result.history };
}
});
});
}
return { run: run };
}
// Usage
var agent = CodeGenerationAgent({
apiKey: process.env.ANTHROPIC_API_KEY,
maxIterations: 5
});
agent.run("Create a URL shortener module with base62 encoding and collision detection")
.then(function(result) {
if (result.success) {
console.log("Done! Review changes on branch: " + result.branch);
} else {
console.log("Agent failed. Check logs for details.");
}
})
.catch(function(err) {
console.error("Fatal error: " + err.message);
});
Common Issues and Troubleshooting
1. LLM Returns Markdown-Wrapped Code Instead of Raw JSON
SyntaxError: Unexpected token '`' at position 0
at JSON.parse (<anonymous>)
This happens constantly. The LLM wraps its JSON response in ```json fences despite instructions not to. Always strip markdown fences before parsing:
var jsonStr = response;
var jsonMatch = response.match(/```(?:json)?\s*([\s\S]*?)```/);
if (jsonMatch) jsonStr = jsonMatch[1];
2. Sandbox Timeout on Large Test Suites
Error: Execution timed out after 30000ms
The default 30-second timeout is generous for unit tests but too short if the generated code has an infinite loop or if the test suite is large. Increase the timeout for integration tests, but more importantly, add a per-test timeout in the test framework config:
// In mocha: --timeout 10000
// Or in the test file itself:
// this.timeout(10000);
If you keep hitting timeouts, the LLM is probably generating code with infinite loops. Add a check for while (true) and for (;;) in your AST validation.
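A crude version of that check -- a regex sketch rather than a real AST walk -- slots into validateJavaScript alongside the other checks:
var loopPatterns = [/while\s*\(\s*true\s*\)/, /for\s*\(\s*;\s*;\s*\)/];
loopPatterns.forEach(function(pattern) {
  if (pattern.test(code)) {
    errors.push({
      type: "loop",
      message: "Code contains a potentially unbounded loop (" + pattern.source + ")"
    });
  }
});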
3. Generated Code References Uninstalled Dependencies
Error: Cannot find module 'lodash'
at Function.Module._resolveFilename (node:internal/modules/cjs/loader:933:15)
The LLM will import packages that seem useful but are not in your package.json. Before running tests, parse the generated code for require() calls and verify each dependency exists:
var requirePattern = /require\s*\(\s*['"]([^./][^'"]*)['"]\s*\)/g;
var match;
var missing = [];
while ((match = requirePattern.exec(code)) !== null) {
var moduleName = match[1].split("/")[0]; // top-level package name, or the @scope directory for scoped packages
var modulePath = path.join(workDir, "node_modules", moduleName);
if (!fs.existsSync(modulePath)) {
missing.push(moduleName);
}
}
4. Path Traversal in Generated File Paths
Error: Path outside workspace: /etc/passwd
LLMs can be manipulated (intentionally or through prompt confusion) into generating file paths that escape the workspace. The isPathAllowed check catches this, but you need to handle it gracefully and retry with explicit path constraints:
// In the fix prompt:
prompt += "IMPORTANT: All file paths must be relative to the project root. ";
prompt += "Do not use absolute paths or paths containing '..' segments. ";
prompt += "Allowed directories: src/, test/, lib/\n";
5. Oscillating Fix Attempts
When the agent fixes error A but introduces error B, then fixes B but reintroduces A, you are stuck in an oscillation. The attempt history helps detect this:
function detectOscillation(history) {
if (history.length < 3) return false;
var lastErrors = history.slice(-3).map(function(h) {
return (h.stderr || "").substring(0, 200);
});
return lastErrors[0] === lastErrors[2]; // Same error every other attempt
}
When oscillation is detected, change strategy: ask the LLM to rewrite from scratch rather than patch the existing code.
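Inside the refinement loop, the strategy switch can be as simple as swapping the prompt once oscillation is detected (a sketch that assumes the variables from createRefinementLoop):
if (detectOscillation(history)) {
  fixPrompt = "Previous fixes have oscillated between the same failures.\n" +
    "Discard the current implementation and rewrite the source files from scratch,\n" +
    "keeping the same file paths and making all tests pass.\n\n" +
    "TEST FILE:\n" + testFile.content + "\n\n" +
    "LAST ERROR OUTPUT:\n" + (result.stderr || result.stdout) + "\n";
}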
Best Practices
Always work on a branch. Never let an agent modify files on main. If the agent fails, git checkout main undoes everything cleanly. This is your most important safety net.
Set a hard iteration limit. Five attempts is usually right. If the agent cannot fix its code in five tries, it is not going to fix it in fifty. Escalate to a human at that point rather than burning tokens.
Validate before executing. AST parsing catches syntax errors instantly. Running syntactically broken code through the sandbox wastes 5-10 seconds per attempt. Parse first, execute second.
Inject project context aggressively. The more existing code the LLM sees, the better it matches your project's style. Show it 3-5 representative files. This single step eliminates most "works but looks wrong" outcomes.
Log everything. Every LLM prompt, every response, every test result, every file write. When something goes wrong (and it will), you need the full transcript to debug it. Structure logs as JSON for easy parsing.
Treat the test suite as the contract, not the code. If the tests pass, the implementation is correct by definition. Resist the urge to manually inspect generated code for every run. Reserve manual review for the final PR.
Use deterministic LLM settings. Set temperature to 0 for code generation; a sketch of the request change follows this list. You want reproducibility, not creativity. Save non-zero temperature for brainstorming tasks, not implementation.
Restrict network access in the sandbox. Generated code should never make HTTP requests during testing. Mock external dependencies. This prevents both security issues and flaky tests.
Size-limit generated files. A single 10,000-line generated file is almost certainly wrong. Set a maximum of 500 lines per file and instruct the LLM to split larger implementations across modules.
Version your prompts. Prompt engineering is engineering. Track prompt changes in version control alongside your agent code. When generation quality changes, you need to know which prompt change caused it.
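For the Anthropic-style callLLM helper in the complete example, pinning the temperature is a single extra field in the request body (a sketch):
var body = JSON.stringify({
  model: model,
  max_tokens: 8192,
  temperature: 0, // deterministic output for code generation
  messages: messages
});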
References
- Acorn JavaScript Parser -- Lightweight, fast ECMAScript parser used for AST validation
- Node.js child_process Documentation -- Official docs for spawn, exec, and fork
- diff npm Package -- Text diffing library for generating and applying patches
- Mocha Test Framework -- Feature-rich test framework for Node.js
- Anthropic API Reference -- Claude API documentation for message-based completions
- OpenAI Function Calling -- Structured tool use for controlling LLM output format
- Tree-sitter -- Incremental parser generator for building language-aware tools