Agent Evaluation Methods and Benchmarks

Evaluate AI agents with task completion metrics, LLM-as-judge scoring, regression testing, and benchmark suites in Node.js.

Evaluating AI agents is fundamentally different from evaluating a single LLM call. Agents operate over multiple steps, make tool-selection decisions, recover from errors, and produce outcomes that depend on the entire trajectory of their reasoning. Building a rigorous evaluation framework is the difference between shipping an agent that works in demos and one that holds up in production.

Prerequisites

  • Node.js v18+ installed
  • Working knowledge of LLM APIs (OpenAI, Anthropic, or similar)
  • Familiarity with agent architectures (tool-calling loops, ReAct patterns)
  • Basic understanding of statistical concepts (mean, variance, confidence intervals)
  • An LLM API key for automated scoring

Why Evaluating Agents Is Harder Than Evaluating Single LLM Calls

When you evaluate a single LLM call, you have a clean input-output pair. You send a prompt, you get a response, you compare it against a reference. Traditional NLP metrics like BLEU, ROUGE, or exact-match accuracy work reasonably well. The evaluation surface is small and deterministic enough to reason about.

Agents destroy that simplicity. An agent might take three steps or thirty to complete the same task. It might call the right tool with wrong arguments, call the wrong tool and then self-correct, or produce the correct final answer through a completely unexpected path. The evaluation surface explodes combinatorially.

Here are the specific dimensions that make agent evaluation hard:

Non-deterministic trajectories. The same agent given the same task will often take different paths on repeated runs. You cannot just check if the output matches a gold standard -- you need to evaluate whether the outcome is acceptable regardless of the path taken.

Partial credit problems. An agent that completes 80% of a multi-step task and fails on the last step is fundamentally different from one that fails immediately. Binary pass/fail scoring loses critical information.

Tool interaction side effects. Agents that write files, call APIs, or modify databases produce side effects that need verification beyond just checking the final text response.

Cost and latency variance. Two agents might both succeed, but one uses 50,000 tokens and the other uses 5,000. Evaluation must capture efficiency, not just correctness.

Cascading failures. A bad decision at step 2 can make step 7 impossible. Understanding where agents fail requires trajectory-level analysis, not just endpoint analysis.

Task Completion Rate: The Primary Metric

Task completion rate (TCR) is the single most important metric for any agent evaluation. It answers the most basic question: does the agent finish what you asked it to do?

var completionRate = successfulTasks / totalTasks;

But raw TCR is deceptively simple. You need to break it down by difficulty tier, task category, and failure mode. A 90% TCR that hides consistent failures on your most critical task type is worse than an 80% TCR with evenly distributed errors.

I track TCR at three levels:

  1. Hard completion -- the agent produced exactly the expected outcome with no manual intervention.
  2. Soft completion -- the agent produced an acceptable outcome that differs from the reference but would satisfy the user.
  3. Partial completion -- the agent made meaningful progress but did not finish the task.

This three-tier system gives you a much richer picture of agent capability than a single binary metric.
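
To make the tiers actionable, roll them up into cumulative rates. Here is a minimal sketch, assuming your scoring step assigns each result a hypothetical completionTier field of "hard", "soft", "partial", or "none":

function summarizeCompletion(results) {
  var counts = { hard: 0, soft: 0, partial: 0, none: 0 };

  results.forEach(function (r) {
    var tier = counts.hasOwnProperty(r.completionTier) ? r.completionTier : "none";
    counts[tier]++;
  });

  var total = results.length || 1;
  return {
    hardCompletionRate: counts.hard / total,
    // Soft completion is cumulative: anything a user would accept, including exact matches.
    softCompletionRate: (counts.hard + counts.soft) / total,
    anyProgressRate: (counts.hard + counts.soft + counts.partial) / total
  };
}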

Defining Success Criteria for Agent Tasks

Before you can evaluate anything, you need to define what success looks like for each task. This is where most teams get sloppy, and it costs them later.

For each evaluation task, you need:

  • A clear task description -- what you are asking the agent to do, written exactly as a user would phrase it.
  • Expected outcomes -- the concrete, verifiable results. This might be a file with specific content, an API response matching a schema, or a database record with particular values.
  • Acceptance criteria -- the rules for determining whether the outcome is acceptable. These can be exact-match, semantic similarity thresholds, or structural validation.
  • Maximum allowed steps -- an upper bound on how many actions the agent should need. This catches agents that spiral into infinite loops or take wildly inefficient paths.

Here is what a well-defined evaluation case looks like in practice:

var evaluationCase = {
  id: "task-042",
  description: "Create a new Express route at /api/users that returns a JSON array of users from the database",
  category: "code-generation",
  difficulty: "medium",
  expectedOutcome: {
    type: "file-content",
    path: "routes/users.js",
    mustContain: ["router.get", "res.json", "module.exports"],
    mustNotContain: ["console.log(password)"],
    schema: "express-route"
  },
  acceptanceCriteria: {
    syntacticallyValid: true,
    passesLinting: true,
    semanticMatch: 0.85
  },
  maxSteps: 10,
  maxTokens: 15000,
  timeout: 60000
};

Building Evaluation Datasets

Your evaluation dataset is the foundation of your entire evaluation system. A bad dataset will produce misleading results no matter how sophisticated your scoring pipeline is.

I recommend building datasets with at least these properties:

Coverage across difficulty levels. Include trivial tasks that any agent should handle, medium tasks that test core capabilities, and hard tasks that push the limits. A 30/50/20 split (easy/medium/hard) works well for most use cases.

Realistic distribution. Weight your dataset toward the tasks your agent actually handles in production. If 60% of real usage is data retrieval, your evaluation set should reflect that.

Known failure cases. Include tasks that previous agent versions failed on. These become regression tests.

Adversarial cases. Include ambiguous instructions, tasks with missing information, and tasks that require the agent to ask clarifying questions instead of guessing.

var fs = require("fs");
var path = require("path");

function loadEvaluationDataset(datasetPath) {
  var raw = fs.readFileSync(datasetPath, "utf-8");
  var dataset = JSON.parse(raw);

  var valid = dataset.filter(function (task) {
    return task.id && task.description && task.expectedOutcome && task.category;
  });

  if (valid.length !== dataset.length) {
    console.warn(
      "Warning: " + (dataset.length - valid.length) + " tasks dropped due to missing fields"
    );
  }

  var byCategory = {};
  valid.forEach(function (task) {
    if (!byCategory[task.category]) {
      byCategory[task.category] = [];
    }
    byCategory[task.category].push(task);
  });

  return {
    tasks: valid,
    byCategory: byCategory,
    totalCount: valid.length,
    categories: Object.keys(byCategory)
  };
}

A good starting dataset has 50-100 tasks. You can start smaller (20-30) for rapid iteration, but you need volume to draw statistically meaningful conclusions. I have seen teams try to evaluate agents with 5 test cases and draw sweeping conclusions. That is not evaluation, that is anecdote.
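
To keep coverage honest, compare the loaded dataset against your target difficulty split before trusting any results. A small sketch, assuming each task carries the same difficulty field used elsewhere in this article and that the 30/50/20 split is your target:

function checkDifficultyDistribution(dataset, targetSplit) {
  // targetSplit example: { easy: 0.3, medium: 0.5, hard: 0.2 }
  var counts = {};
  dataset.tasks.forEach(function (task) {
    counts[task.difficulty] = (counts[task.difficulty] || 0) + 1;
  });

  var report = {};
  Object.keys(targetSplit).forEach(function (level) {
    var actual = (counts[level] || 0) / dataset.totalCount;
    report[level] = {
      target: targetSplit[level],
      actual: Math.round(actual * 100) / 100,
      deviation: Math.round((actual - targetSplit[level]) * 100) / 100
    };
  });
  return report;
}

A deviation of more than a few percentage points is a signal to add or retire tasks before drawing conclusions from a run.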

Automated Evaluation with LLM-as-Judge

The LLM-as-judge pattern is the workhorse of modern agent evaluation. You use a separate LLM call to score whether an agent's output satisfies the task requirements. This scales far better than human evaluation while producing surprisingly well-calibrated results.

The key insight is that judging is easier than generating. An LLM that cannot reliably write production code can still reliably judge whether a piece of code meets a specification. You are reducing the problem from open-ended generation to constrained classification.

var https = require("https");

function llmJudge(taskDescription, expectedOutcome, actualOutput, callback) {
  var prompt = [
    "You are an evaluation judge for an AI agent system.",
    "Given a task description, expected outcome, and the agent's actual output,",
    "score the output on the following dimensions:",
    "",
    "1. Correctness (0-10): Does the output satisfy the task requirements?",
    "2. Completeness (0-10): Are all aspects of the task addressed?",
    "3. Quality (0-10): Is the output well-structured and production-ready?",
    "",
    "Task: " + taskDescription,
    "",
    "Expected Outcome: " + JSON.stringify(expectedOutcome),
    "",
    "Actual Output: " + actualOutput,
    "",
    "Respond with ONLY valid JSON in this format:",
    '{"correctness": <score>, "completeness": <score>, "quality": <score>,',
    '"pass": <true/false>, "reasoning": "<brief explanation>"}'
  ].join("\n");

  var requestBody = JSON.stringify({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    temperature: 0.1,
    max_tokens: 500
  });

  var options = {
    hostname: "api.openai.com",
    path: "/v1/chat/completions",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
      "Content-Length": Buffer.byteLength(requestBody)
    }
  };

  var req = https.request(options, function (res) {
    var data = "";
    res.on("data", function (chunk) { data += chunk; });
    res.on("end", function () {
      try {
        var response = JSON.parse(data);
        var content = response.choices[0].message.content;
        var scores = JSON.parse(content);
        callback(null, scores);
      } catch (err) {
        callback(new Error("Failed to parse judge response: " + err.message));
      }
    });
  });

  req.on("error", function (err) { callback(err); });
  req.write(requestBody);
  req.end();
}

A few critical implementation details:

Use low temperature for the judge. You want consistent, reproducible scores. A temperature of 0.0-0.2 works best.

Use a stronger model than the agent being evaluated. If your agent runs on GPT-4o-mini, use GPT-4o or Claude as the judge. Judging with a weaker model introduces noise and bias.

Include rubric anchors. Do not just say "score 0-10." Define what a 3 looks like versus a 7 versus a 10. Anchored rubrics dramatically improve inter-run consistency.
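
As an illustration, the correctness dimension in the judge prompt above could be anchored like this (the wording is an example, not a fixed standard):

var correctnessAnchors = [
  "Correctness anchors:",
  "  0-2: Does not address the task, or is functionally wrong.",
  "  3-5: Addresses the task but with significant errors or omissions.",
  "  6-8: Satisfies the core requirements with only minor issues.",
  "  9-10: Fully satisfies the requirements with no meaningful defects."
].join("\n");

// Append to the judge prompt before the task description so every run
// interprets the 0-10 scale the same way.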

Human Evaluation Protocols and Inter-Rater Agreement

LLM-as-judge is excellent for scale, but you need human evaluation to calibrate it. Humans catch failure modes that LLMs systematically miss, especially around safety, tone, and real-world usability.

A solid human evaluation protocol needs:

Structured rubrics. Give evaluators specific criteria and score anchors. "Rate quality 1-5" produces garbage data. "Rate whether the generated code handles edge cases: 1 = no edge case handling, 3 = handles common edge cases, 5 = comprehensive edge case handling" produces useful data.

Multiple raters per task. A single human opinion is unreliable. Use at least 2-3 raters per task and measure inter-rater agreement with Cohen's kappa or Krippendorff's alpha.

Calibration sessions. Before the actual evaluation, have all raters score the same 10 examples together and discuss disagreements. This aligns their interpretation of the rubric.

function calculateCohensKappa(ratingsA, ratingsB) {
  var n = ratingsA.length;
  var agreements = 0;

  for (var i = 0; i < n; i++) {
    if (ratingsA[i] === ratingsB[i]) {
      agreements++;
    }
  }

  var observedAgreement = agreements / n;

  var categoryCounts = {};
  ratingsA.concat(ratingsB).forEach(function (r) {
    categoryCounts[r] = (categoryCounts[r] || 0) + 1;
  });

  var expectedAgreement = 0;
  var categories = Object.keys(categoryCounts);
  categories.forEach(function (cat) {
    // Compare as strings: object keys are always strings, even for numeric ratings.
    var countA = ratingsA.filter(function (r) { return String(r) === cat; }).length;
    var countB = ratingsB.filter(function (r) { return String(r) === cat; }).length;
    expectedAgreement += (countA / n) * (countB / n);
  });

  if (expectedAgreement === 1) return 1;

  var kappa = (observedAgreement - expectedAgreement) / (1 - expectedAgreement);
  return Math.round(kappa * 1000) / 1000;
}

A kappa above 0.8 indicates strong agreement, values between 0.6 and 0.8 indicate moderate agreement, and anything below 0.6 means your rubric needs work or your task definitions are ambiguous.

Measuring Efficiency: Steps, Tokens, Cost

Correctness alone does not tell you if an agent is production-ready. Two agents might both score 95% on task completion, but if one averages 3 tool calls per task and the other averages 15, the cost and latency implications are enormous.

Track these efficiency metrics for every evaluation run:

  • Steps per task -- how many LLM calls and tool invocations the agent made.
  • Tokens consumed -- total input and output tokens across the entire trajectory.
  • Cost per task -- dollar cost based on your model's pricing.
  • Time to completion -- wall-clock time from task start to final answer.
  • Retry count -- how many times the agent had to retry after errors.

function AgentMetricsTracker() {
  this.runs = [];
}

AgentMetricsTracker.prototype.startRun = function (taskId) {
  var run = {
    taskId: taskId,
    startTime: Date.now(),
    endTime: null,
    steps: 0,
    inputTokens: 0,
    outputTokens: 0,
    toolCalls: [],
    retries: 0,
    errors: []
  };
  this.runs.push(run);
  return run;
};

AgentMetricsTracker.prototype.recordStep = function (run, stepData) {
  run.steps++;
  run.inputTokens += stepData.inputTokens || 0;
  run.outputTokens += stepData.outputTokens || 0;

  if (stepData.toolCall) {
    run.toolCalls.push({
      tool: stepData.toolCall,
      timestamp: Date.now(),
      success: stepData.success !== false
    });
  }

  if (stepData.isRetry) {
    run.retries++;
  }

  if (stepData.error) {
    run.errors.push(stepData.error);
  }
};

AgentMetricsTracker.prototype.endRun = function (run) {
  run.endTime = Date.now();
  run.durationMs = run.endTime - run.startTime;
  run.costUsd = calculateCost(run.inputTokens, run.outputTokens);
  return run;
};

function calculateCost(inputTokens, outputTokens) {
  // Example GPT-4o list prices (USD per token); substitute your model's current rates.
  var inputRate = 2.50 / 1000000;
  var outputRate = 10.00 / 1000000;
  return (inputTokens * inputRate) + (outputTokens * outputRate);
}

AgentMetricsTracker.prototype.getSummary = function () {
  var totalRuns = this.runs.length;
  if (totalRuns === 0) return null;

  var totalSteps = 0;
  var totalTokens = 0;
  var totalCost = 0;
  var totalDuration = 0;

  this.runs.forEach(function (run) {
    totalSteps += run.steps;
    totalTokens += run.inputTokens + run.outputTokens;
    totalCost += run.costUsd || 0;
    totalDuration += run.durationMs || 0;
  });

  return {
    totalRuns: totalRuns,
    avgSteps: Math.round((totalSteps / totalRuns) * 100) / 100,
    avgTokens: Math.round(totalTokens / totalRuns),
    avgCostUsd: Math.round((totalCost / totalRuns) * 10000) / 10000,
    avgDurationMs: Math.round(totalDuration / totalRuns),
    totalCostUsd: Math.round(totalCost * 100) / 100
  };
};

I set budget thresholds for every agent I deploy. If a task exceeds 50,000 tokens or 30 steps, something is wrong and the run should be flagged for review even if it eventually succeeds.
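
A sketch of that budget check over the tracker's runs; the 50,000-token and 30-step limits match the thresholds above and should be tuned per agent:

function flagBudgetViolations(runs, limits) {
  var maxTokens = (limits && limits.maxTokens) || 50000;
  var maxSteps = (limits && limits.maxSteps) || 30;

  return runs
    .filter(function (run) {
      return (run.inputTokens + run.outputTokens) > maxTokens || run.steps > maxSteps;
    })
    .map(function (run) {
      return {
        taskId: run.taskId,
        tokens: run.inputTokens + run.outputTokens,
        steps: run.steps,
        reason: run.steps > maxSteps ? "step_budget_exceeded" : "token_budget_exceeded"
      };
    });
}

Flagged runs go into the same review queue as failures, even when the judge marks them as passing.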

Latency Profiling for Agent Workflows

Latency in agent systems is not a single number. It is a distribution across multiple phases, and you need to understand where time is being spent before you can optimize.

Break latency into these segments:

  • LLM inference time -- time waiting for model responses.
  • Tool execution time -- time spent running tools (API calls, file I/O, database queries).
  • Orchestration overhead -- time spent in your agent framework between steps.
  • Queue/scheduling time -- time waiting for rate limits or resource availability.

function LatencyProfiler() {
  this.segments = [];
}

LatencyProfiler.prototype.startSegment = function (name, category) {
  var segment = {
    name: name,
    category: category,
    startTime: process.hrtime.bigint(),
    endTime: null,
    durationNs: null
  };
  this.segments.push(segment);
  return segment;
};

LatencyProfiler.prototype.endSegment = function (segment) {
  segment.endTime = process.hrtime.bigint();
  segment.durationNs = Number(segment.endTime - segment.startTime);
  segment.durationMs = segment.durationNs / 1000000;
  return segment;
};

LatencyProfiler.prototype.getReport = function () {
  var byCategory = {};

  this.segments.forEach(function (seg) {
    if (!seg.durationMs) return;
    if (!byCategory[seg.category]) {
      byCategory[seg.category] = { totalMs: 0, count: 0, segments: [] };
    }
    byCategory[seg.category].totalMs += seg.durationMs;
    byCategory[seg.category].count++;
    byCategory[seg.category].segments.push(seg.name + ": " + Math.round(seg.durationMs) + "ms");
  });

  var totalMs = 0;
  Object.keys(byCategory).forEach(function (cat) {
    totalMs += byCategory[cat].totalMs;
  });

  var report = { totalMs: Math.round(totalMs), breakdown: {} };
  Object.keys(byCategory).forEach(function (cat) {
    report.breakdown[cat] = {
      totalMs: Math.round(byCategory[cat].totalMs),
      percentage: Math.round((byCategory[cat].totalMs / totalMs) * 100),
      count: byCategory[cat].count
    };
  });

  return report;
};

In my experience, LLM inference dominates latency for most agent workflows (60-80% of total time). But tool execution can spike unpredictably, especially with external API calls. Profiling helps you decide whether to invest in model optimization, tool caching, or parallel execution.
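
Instrumenting an agent step looks roughly like this; callModel and runTool are placeholders for your own model call and tool executor:

var profiler = new LatencyProfiler();

function instrumentedStep(messages, callback) {
  var llmSegment = profiler.startSegment("step-llm", "llm_inference");

  // callModel is a placeholder for your LLM client call.
  callModel(messages, function (err, modelResponse) {
    profiler.endSegment(llmSegment);
    if (err) return callback(err);

    var toolSegment = profiler.startSegment("step-tool", "tool_execution");

    // runTool is a placeholder for your tool executor.
    runTool(modelResponse.toolCall, function (toolErr, toolResult) {
      profiler.endSegment(toolSegment);
      callback(toolErr, toolResult);
    });
  });
}

// After the run: console.log(profiler.getReport());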

Regression Testing Agent Behavior Across Model Updates

Model providers update their models constantly. GPT-4o today is not the same GPT-4o from three months ago. Anthropic, Google, and every other provider make ongoing tweaks that can subtly change agent behavior.

Regression testing catches these changes before they reach your users. The approach is straightforward:

  1. Maintain a fixed evaluation dataset.
  2. Run it against your agent after every model update or agent code change.
  3. Compare results against a baseline (your last known-good evaluation run).
  4. Flag any task where the score changed by more than a threshold.

var fs = require("fs");

function RegressionTracker(baselinePath) {
  this.baselinePath = baselinePath;
  this.baseline = null;

  if (fs.existsSync(baselinePath)) {
    this.baseline = JSON.parse(fs.readFileSync(baselinePath, "utf-8"));
  }
}

RegressionTracker.prototype.compare = function (currentResults) {
  if (!this.baseline) {
    // Persist the first run so the message below is actually true.
    this.saveBaseline(currentResults);
    return { isFirstRun: true, message: "No baseline found. Current results saved as baseline." };
  }

  var regressions = [];
  var improvements = [];
  var threshold = 1.5;

  currentResults.forEach(function (current) {
    var baselineResult = this.baseline.find(function (b) {
      return b.taskId === current.taskId;
    });

    if (!baselineResult) return;

    var scoreDiff = current.score - baselineResult.score;

    if (scoreDiff < -threshold) {
      regressions.push({
        taskId: current.taskId,
        baselineScore: baselineResult.score,
        currentScore: current.score,
        delta: scoreDiff,
        category: current.category
      });
    } else if (scoreDiff > threshold) {
      improvements.push({
        taskId: current.taskId,
        baselineScore: baselineResult.score,
        currentScore: current.score,
        delta: scoreDiff,
        category: current.category
      });
    }
  }.bind(this));

  return {
    totalTasks: currentResults.length,
    regressions: regressions,
    improvements: improvements,
    regressionRate: Math.round((regressions.length / currentResults.length) * 100) + "%",
    isAcceptable: regressions.length < currentResults.length * 0.05
  };
};

RegressionTracker.prototype.saveBaseline = function (results) {
  fs.writeFileSync(this.baselinePath, JSON.stringify(results, null, 2));
};

I run regression tests on a weekly cadence for production agents, and immediately after any model version change. The isAcceptable threshold of 5% regressions is a starting point -- tune it based on your tolerance.

A/B Testing Agent Architectures

When you are comparing two agent architectures -- say, ReAct versus plan-then-execute, or single-agent versus multi-agent -- you need controlled A/B testing. Random variation in LLM outputs makes naive comparisons unreliable.

The protocol:

  1. Run both architectures against the identical evaluation dataset.
  2. Run each task multiple times (at least 3-5 repetitions) to account for non-determinism.
  3. Compare distributions, not single numbers. Use statistical tests (Mann-Whitney U or bootstrap confidence intervals) to determine if differences are significant.

function abTestResults(resultsA, resultsB) {
  var scoresA = resultsA.map(function (r) { return r.score; });
  var scoresB = resultsB.map(function (r) { return r.score; });

  var meanA = scoresA.reduce(function (s, v) { return s + v; }, 0) / scoresA.length;
  var meanB = scoresB.reduce(function (s, v) { return s + v; }, 0) / scoresB.length;

  var varianceA = scoresA.reduce(function (s, v) {
    return s + Math.pow(v - meanA, 2);
  }, 0) / (scoresA.length - 1);

  var varianceB = scoresB.reduce(function (s, v) {
    return s + Math.pow(v - meanB, 2);
  }, 0) / (scoresB.length - 1);

  var pooledSE = Math.sqrt(
    (varianceA / scoresA.length) + (varianceB / scoresB.length)
  );

  var tStatistic = pooledSE > 0 ? (meanA - meanB) / pooledSE : 0;
  // Normal approximation to a Welch-style test; reasonable once each arm has roughly 30+ runs.
  var isSignificant = Math.abs(tStatistic) > 1.96;

  return {
    architectureA: { mean: meanA.toFixed(3), variance: varianceA.toFixed(3), n: scoresA.length },
    architectureB: { mean: meanB.toFixed(3), variance: varianceB.toFixed(3), n: scoresB.length },
    delta: (meanA - meanB).toFixed(3),
    tStatistic: tStatistic.toFixed(3),
    significantAtP05: isSignificant,
    recommendation: isSignificant
      ? (meanA > meanB ? "Architecture A is significantly better" : "Architecture B is significantly better")
      : "No significant difference detected"
  };
}

Do not make architecture decisions based on a single evaluation run. I have seen teams switch entire agent frameworks based on a 2% score difference that was well within random variance. Statistical rigor matters.
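
The function above relies on a normal approximation. If you prefer the bootstrap confidence interval mentioned in the protocol, here is a minimal percentile-bootstrap sketch over the two score arrays:

function bootstrapMeanDiff(scoresA, scoresB, iterations) {
  iterations = iterations || 2000;

  function resampledMean(scores) {
    var sum = 0;
    for (var i = 0; i < scores.length; i++) {
      sum += scores[Math.floor(Math.random() * scores.length)];
    }
    return sum / scores.length;
  }

  var diffs = [];
  for (var j = 0; j < iterations; j++) {
    diffs.push(resampledMean(scoresA) - resampledMean(scoresB));
  }
  diffs.sort(function (a, b) { return a - b; });

  var lower = diffs[Math.floor(iterations * 0.025)];
  var upper = diffs[Math.floor(iterations * 0.975)];

  return {
    ci95: [Math.round(lower * 1000) / 1000, Math.round(upper * 1000) / 1000],
    // If the interval excludes zero, the gap is unlikely to be random variation.
    significant: lower > 0 || upper < 0
  };
}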

Tool Usage Analytics

For tool-calling agents, understanding how the agent uses its tools is as important as understanding its final output. Track:

  • Tool selection accuracy -- did the agent pick the right tool for each step?
  • Argument correctness -- were the tool arguments properly formed?
  • Unnecessary tool calls -- did the agent call tools it did not need?
  • Missing tool calls -- did the agent skip tools it should have used?

function analyzeToolUsage(expectedTools, actualToolCalls) {
  var correctSelections = 0;
  var incorrectSelections = 0;
  var unnecessaryCalls = 0;
  var missedTools = [];

  var expectedSet = {};
  expectedTools.forEach(function (t) { expectedSet[t] = false; });

  actualToolCalls.forEach(function (call) {
    if (expectedSet.hasOwnProperty(call.tool)) {
      if (!expectedSet[call.tool]) {
        correctSelections++;
        expectedSet[call.tool] = true;
      } else {
        // Repeat calls to an already-satisfied tool are waste, not extra accuracy.
        unnecessaryCalls++;
      }
    } else {
      unnecessaryCalls++;
      incorrectSelections++;
    }
  });

  Object.keys(expectedSet).forEach(function (tool) {
    if (!expectedSet[tool]) {
      missedTools.push(tool);
    }
  });

  var totalExpected = expectedTools.length;
  var selectionAccuracy = totalExpected > 0
    ? Math.round((correctSelections / totalExpected) * 100)
    : 100;

  return {
    selectionAccuracy: selectionAccuracy + "%",
    correctSelections: correctSelections,
    incorrectSelections: incorrectSelections,
    unnecessaryCalls: unnecessaryCalls,
    missedTools: missedTools,
    totalActualCalls: actualToolCalls.length,
    efficiency: actualToolCalls.length > 0
      ? Math.round((correctSelections / actualToolCalls.length) * 100) + "%"
      : "N/A"
  };
}

Tool usage analytics are especially valuable for debugging. When an agent fails a task, the tool trace usually reveals the exact decision point where things went wrong.

Error Analysis: Categorizing Failure Modes

Not all failures are equal. Categorizing failure modes helps you prioritize improvements and understand systemic weaknesses.

I use this taxonomy:

  • Planning failures -- the agent chose the wrong strategy or sequence of steps.
  • Tool selection failures -- the agent picked the wrong tool for a step.
  • Argument errors -- the agent called the right tool with bad arguments.
  • Context overflow -- the agent lost track of relevant information due to context window limits.
  • Hallucination -- the agent fabricated information that was not in the context.
  • Loop/stuck -- the agent entered a repetitive cycle without making progress.
  • Timeout -- the agent exceeded the time or step budget.
  • External failures -- a tool or API the agent depended on failed.

function categorizeFailure(run) {
  if (run.durationMs > run.timeout) return "timeout";
  if (run.retries > 5) return "loop_stuck";

  var lastFiveTools = run.toolCalls.slice(-5);
  var uniqueTools = {};
  lastFiveTools.forEach(function (t) { uniqueTools[t.tool] = true; });

  if (lastFiveTools.length === 5 && Object.keys(uniqueTools).length === 1) {
    return "loop_stuck";
  }

  var externalErrors = run.errors.filter(function (e) {
    return e.includes("ECONNREFUSED") || e.includes("timeout") || e.includes("503") || e.includes("429");
  });
  if (externalErrors.length > 0) return "external_failure";

  var badArgErrors = run.errors.filter(function (e) {
    return e.includes("invalid argument") || e.includes("missing required") || e.includes("TypeError");
  });
  if (badArgErrors.length > 0) return "argument_error";

  if (run.inputTokens > 120000) return "context_overflow";

  return "planning_failure";
}

function buildFailureReport(failedRuns) {
  var categories = {};

  failedRuns.forEach(function (run) {
    var category = categorizeFailure(run);
    if (!categories[category]) {
      categories[category] = { count: 0, taskIds: [], examples: [] };
    }
    categories[category].count++;
    categories[category].taskIds.push(run.taskId);
    if (categories[category].examples.length < 3) {
      categories[category].examples.push({
        taskId: run.taskId,
        errors: run.errors.slice(0, 3),
        steps: run.steps
      });
    }
  });

  return {
    totalFailures: failedRuns.length,
    byCategory: categories,
    topFailureMode: Object.keys(categories).sort(function (a, b) {
      return categories[b].count - categories[a].count;
    })[0]
  };
}

Once you have this data, the optimization priorities become obvious. If 40% of failures are loop/stuck, you need better loop detection and exit conditions. If 30% are argument errors, you need better tool descriptions or few-shot examples.
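
For the loop/stuck case specifically, a lightweight guard inside the agent loop can cut runs short before they burn the step budget. A sketch, assuming each recorded tool call carries the tool name and (hypothetically) its serialized arguments:

function isLooping(toolCalls, windowSize) {
  windowSize = windowSize || 4;
  if (toolCalls.length < windowSize) return false;

  var recent = toolCalls.slice(-windowSize).map(function (call) {
    // call.args is assumed here; the tracker earlier only records the tool name.
    return call.tool + ":" + JSON.stringify(call.args || {});
  });

  // If every call in the window is identical, the agent is almost certainly stuck.
  return recent.every(function (sig) { return sig === recent[0]; });
}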

Complete Working Example: Node.js Agent Evaluation Framework

Here is a complete evaluation framework that ties everything together. It runs an agent against a test suite, scores results with LLM-as-judge, tracks metrics, and generates comparison reports.

var fs = require("fs");
var path = require("path");
var https = require("https");

// ---- Configuration ----

var CONFIG = {
  apiKey: process.env.OPENAI_API_KEY,
  judgeModel: "gpt-4o",
  resultsDir: "./eval-results",
  baselinePath: "./eval-results/baseline.json",
  maxConcurrency: 3,
  retryLimit: 2
};

// ---- Evaluation Runner ----

function EvalRunner(agent, dataset, config) {
  this.agent = agent;
  this.dataset = dataset;
  this.config = config || CONFIG;
  this.metrics = new AgentMetricsTracker();
  this.results = [];
}

EvalRunner.prototype.run = function (callback) {
  var self = this;
  var tasks = this.dataset.tasks.slice();
  var completed = 0;
  var total = tasks.length;

  console.log("Starting evaluation: " + total + " tasks");
  console.log("Agent: " + self.agent.name);
  console.log("Timestamp: " + new Date().toISOString());
  console.log("---");

  function processNext() {
    if (tasks.length === 0) {
      if (completed === total) {
        self.finalize(callback);
      }
      return;
    }

    var task = tasks.shift();
    console.log("[" + (completed + 1) + "/" + total + "] Running: " + task.id);

    self.runSingleTask(task, function (err, result) {
      completed++;

      if (err) {
        console.error("  FAILED: " + err.message);
        self.results.push({
          taskId: task.id,
          category: task.category,
          difficulty: task.difficulty,
          status: "error",
          error: err.message,
          score: 0,
          metrics: null
        });
      } else {
        var status = result.scores.pass ? "PASS" : "FAIL";
        console.log("  " + status + " (score: " + result.scores.correctness + "/10)");
        self.results.push(result);
      }

      processNext();
    });
  }

  // Start initial batch
  var initialBatch = Math.min(self.config.maxConcurrency, tasks.length);
  for (var i = 0; i < initialBatch; i++) {
    processNext();
  }
};

EvalRunner.prototype.runSingleTask = function (task, callback) {
  var self = this;
  var run = this.metrics.startRun(task.id);

  // Execute the agent on this task
  self.agent.execute(task, run, function (err, agentOutput) {
    self.metrics.endRun(run);

    if (err) {
      return callback(err);
    }

    // Score with LLM-as-judge
    llmJudge(
      task.description,
      task.expectedOutcome,
      agentOutput,
      function (judgeErr, scores) {
        if (judgeErr) {
          return callback(judgeErr);
        }

        var toolAnalysis = null;
        if (task.expectedTools && run.toolCalls.length > 0) {
          toolAnalysis = analyzeToolUsage(task.expectedTools, run.toolCalls);
        }

        callback(null, {
          taskId: task.id,
          category: task.category,
          difficulty: task.difficulty,
          status: scores.pass ? "pass" : "fail",
          score: (scores.correctness + scores.completeness + scores.quality) / 3,
          scores: scores,
          toolAnalysis: toolAnalysis,
          metrics: {
            steps: run.steps,
            inputTokens: run.inputTokens,
            outputTokens: run.outputTokens,
            durationMs: run.durationMs,
            costUsd: run.costUsd,
            retries: run.retries
          }
        });
      }
    );
  });
};

EvalRunner.prototype.finalize = function (callback) {
  var self = this;

  // Calculate summary statistics
  var passed = self.results.filter(function (r) { return r.status === "pass"; });
  var failed = self.results.filter(function (r) { return r.status === "fail"; });
  var errors = self.results.filter(function (r) { return r.status === "error"; });

  var scores = self.results
    .filter(function (r) { return typeof r.score === "number" && r.score > 0; })
    .map(function (r) { return r.score; });

  var avgScore = scores.length > 0
    ? scores.reduce(function (s, v) { return s + v; }, 0) / scores.length
    : 0;

  // Build category breakdown
  var byCategory = {};
  self.results.forEach(function (r) {
    if (!byCategory[r.category]) {
      byCategory[r.category] = { pass: 0, fail: 0, error: 0, scores: [] };
    }
    byCategory[r.category][r.status]++;
    if (typeof r.score === "number") {
      byCategory[r.category].scores.push(r.score);
    }
  });

  // Build difficulty breakdown
  var byDifficulty = {};
  self.results.forEach(function (r) {
    if (!byDifficulty[r.difficulty]) {
      byDifficulty[r.difficulty] = { pass: 0, fail: 0, error: 0 };
    }
    byDifficulty[r.difficulty][r.status]++;
  });

  // Failure analysis
  var failedRuns = failed.map(function (r) {
    return {
      taskId: r.taskId,
      durationMs: r.metrics ? r.metrics.durationMs : 0,
      timeout: 60000,
      retries: r.metrics ? r.metrics.retries : 0,
      toolCalls: [],
      errors: r.error ? [r.error] : [],
      inputTokens: r.metrics ? r.metrics.inputTokens : 0,
      steps: r.metrics ? r.metrics.steps : 0
    };
  });

  var failureReport = failedRuns.length > 0 ? buildFailureReport(failedRuns) : null;

  // Regression check
  var regressionTracker = new RegressionTracker(self.config.baselinePath);
  var regressionReport = regressionTracker.compare(self.results);

  var report = {
    timestamp: new Date().toISOString(),
    agent: self.agent.name,
    summary: {
      total: self.results.length,
      passed: passed.length,
      failed: failed.length,
      errors: errors.length,
      passRate: Math.round((passed.length / self.results.length) * 100) + "%",
      averageScore: Math.round(avgScore * 100) / 100
    },
    efficiency: self.metrics.getSummary(),
    byCategory: byCategory,
    byDifficulty: byDifficulty,
    failureAnalysis: failureReport,
    regressions: regressionReport,
    results: self.results
  };

  // Save report
  if (!fs.existsSync(self.config.resultsDir)) {
    fs.mkdirSync(self.config.resultsDir, { recursive: true });
  }

  var reportPath = path.join(
    self.config.resultsDir,
    "eval-" + Date.now() + ".json"
  );
  fs.writeFileSync(reportPath, JSON.stringify(report, null, 2));
  console.log("\n--- Evaluation Complete ---");
  console.log("Pass rate: " + report.summary.passRate);
  console.log("Average score: " + report.summary.averageScore);
  console.log("Report saved: " + reportPath);

  callback(null, report);
};

// ---- Comparison Report Generator ----

function generateComparisonReport(reportPaths) {
  var reports = reportPaths.map(function (p) {
    return JSON.parse(fs.readFileSync(p, "utf-8"));
  });

  var comparison = {
    generated: new Date().toISOString(),
    runs: reports.map(function (r) {
      return {
        agent: r.agent,
        timestamp: r.timestamp,
        passRate: r.summary.passRate,
        averageScore: r.summary.averageScore,
        avgSteps: r.efficiency ? r.efficiency.avgSteps : "N/A",
        avgCost: r.efficiency ? r.efficiency.avgCostUsd : "N/A",
        avgDuration: r.efficiency ? r.efficiency.avgDurationMs : "N/A"
      };
    })
  };

  // Find per-task deltas between first and last run
  if (reports.length >= 2) {
    var first = reports[0];
    var last = reports[reports.length - 1];
    var deltas = [];

    last.results.forEach(function (current) {
      var baseline = first.results.find(function (b) {
        return b.taskId === current.taskId;
      });
      if (baseline && typeof baseline.score === "number" && typeof current.score === "number") {
        var delta = current.score - baseline.score;
        if (Math.abs(delta) > 0.5) {
          deltas.push({
            taskId: current.taskId,
            baselineScore: baseline.score,
            currentScore: current.score,
            delta: Math.round(delta * 100) / 100
          });
        }
      }
    });

    comparison.significantChanges = deltas.sort(function (a, b) {
      return Math.abs(b.delta) - Math.abs(a.delta);
    });
  }

  return comparison;
}

// ---- Usage Example ----

// Define a mock agent for demonstration
var myAgent = {
  name: "research-agent-v2",
  execute: function (task, run, callback) {
    // In production, this would be your actual agent logic
    // with tool calls, LLM interactions, etc.
    setTimeout(function () {
      run.steps = 4;
      run.inputTokens = 8500;
      run.outputTokens = 1200;
      callback(null, "Agent completed task: " + task.description);
    }, 100);
  }
};

// Load dataset and run evaluation
var dataset = loadEvaluationDataset("./eval-dataset.json");
var runner = new EvalRunner(myAgent, dataset);

runner.run(function (err, report) {
  if (err) {
    console.error("Evaluation failed: " + err.message);
    process.exit(1);
  }

  console.log("\nCategory breakdown:");
  Object.keys(report.byCategory).forEach(function (cat) {
    var data = report.byCategory[cat];
    console.log("  " + cat + ": " + data.pass + " pass / " + data.fail + " fail");
  });

  if (report.regressions && report.regressions.regressions) {
    console.log("\nRegressions detected: " + report.regressions.regressions.length);
    report.regressions.regressions.forEach(function (reg) {
      console.log("  " + reg.taskId + ": " + reg.baselineScore + " -> " + reg.currentScore);
    });
  }
});

Building Benchmark Suites for Common Agent Patterns

Beyond custom evaluation datasets, you should maintain benchmark suites that test common agent capabilities in isolation. These benchmarks help you identify which fundamental skills are strong or weak, independent of specific tasks.

Here are the benchmark categories I use:

Information retrieval. Can the agent find and extract specific facts from provided context? Test with varying context lengths and distractor information.

Multi-step reasoning. Can the agent chain together multiple logical steps? Test with problems that require 3-5 intermediate conclusions.

Tool selection. Given a set of available tools, does the agent consistently pick the right one? Test with overlapping tool capabilities where the distinction matters.

Error recovery. When a tool call fails, does the agent adapt its strategy? Test by injecting simulated failures at specific steps.

Instruction following. Does the agent follow complex, multi-part instructions precisely? Test with instructions that include constraints, ordering requirements, and format specifications.

var BENCHMARK_SUITE = {
  information_retrieval: [
    {
      id: "ir-001",
      description: "Find the database connection timeout setting in the provided configuration",
      context: "... (large config file with the answer buried in the middle) ...",
      expectedAnswer: "30000",
      matchType: "exact"
    }
  ],
  tool_selection: [
    {
      id: "ts-001",
      description: "Get the current weather in San Francisco",
      availableTools: ["web_search", "weather_api", "calculator", "file_reader"],
      expectedTool: "weather_api",
      wrongButCommon: "web_search"
    }
  ],
  error_recovery: [
    {
      id: "er-001",
      description: "Read the user's profile from the primary database",
      simulatedFailure: { step: 1, error: "ECONNREFUSED", tool: "primary_db_query" },
      expectedRecovery: "fallback_db_query",
      maxRecoverySteps: 3
    }
  ]
};

Run benchmarks on a regular schedule -- weekly at minimum -- and track trends over time. A sudden drop in the tool-selection benchmark after a prompt change tells you exactly what broke.
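
A simple way to track those trends is to append one record per benchmark run to a JSON Lines file and graph it later. A sketch (the history path and score shape are arbitrary):

var fs = require("fs");

function appendBenchmarkRecord(historyPath, categoryScores) {
  // categoryScores example: { information_retrieval: 0.92, tool_selection: 0.81 }
  var record = {
    timestamp: new Date().toISOString(),
    scores: categoryScores
  };
  fs.appendFileSync(historyPath, JSON.stringify(record) + "\n");
}

function loadBenchmarkHistory(historyPath) {
  if (!fs.existsSync(historyPath)) return [];
  return fs.readFileSync(historyPath, "utf-8")
    .split("\n")
    .filter(function (line) { return line.trim().length > 0; })
    .map(function (line) { return JSON.parse(line); });
}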

Common Issues and Troubleshooting

Issue: LLM-as-judge scores are inconsistent across runs.

Run 1: task-015 score: 8.5
Run 2: task-015 score: 4.0
Run 3: task-015 score: 7.5

This usually means your judge prompt is underspecified. Add explicit rubric anchors defining what each score level means. Set temperature to 0. If inconsistency persists, switch to a more capable judge model. Also consider using majority voting across 3 judge calls and taking the median score.
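
A sketch of that voting wrapper around the llmJudge function from earlier: three judge calls, median correctness, majority pass vote:

function judgeWithVoting(taskDescription, expectedOutcome, actualOutput, callback) {
  var results = [];
  var pending = 3;
  var errored = false;

  for (var i = 0; i < 3; i++) {
    llmJudge(taskDescription, expectedOutcome, actualOutput, function (err, scores) {
      if (errored) return;
      if (err) { errored = true; return callback(err); }

      results.push(scores);
      pending--;
      if (pending > 0) return;

      var sorted = results
        .map(function (r) { return r.correctness; })
        .sort(function (a, b) { return a - b; });
      var passVotes = results.filter(function (r) { return r.pass; }).length;

      callback(null, {
        correctness: sorted[1],  // median of three
        pass: passVotes >= 2,    // majority vote
        individual: results
      });
    });
  }
}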

Issue: Evaluation hangs on certain tasks with no timeout.

Error: Agent exceeded maximum execution time
  at EvalRunner.runSingleTask (eval.js:142:15)
  (no further output - process hangs)

Always wrap agent execution in a timeout. Use setTimeout with a Promise.race pattern or the AbortController API. Set aggressive timeouts (60-120 seconds per task) during evaluation. A task that takes 5 minutes to complete is not useful in production regardless of whether it eventually succeeds.
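
A sketch of the Promise.race pattern; it assumes the agent call has been wrapped to return a Promise (for callback-style agents, util.promisify or a manual wrapper works):

function runWithTimeout(agentPromise, timeoutMs) {
  var timer;
  var timeout = new Promise(function (resolve, reject) {
    timer = setTimeout(function () {
      reject(new Error("Agent exceeded maximum execution time (" + timeoutMs + "ms)"));
    }, timeoutMs);
  });

  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([agentPromise, timeout]).finally(function () {
    clearTimeout(timer);
  });
}

// Usage (agentExecuteAsPromise is a hypothetical promise-returning wrapper):
// runWithTimeout(agentExecuteAsPromise(task), 90000)
//   .then(handleResult)
//   .catch(handleTimeoutOrError);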

Issue: Token counts do not match between evaluation runs.

Baseline: avgTokens: 12500
Current:  avgTokens: 34200
Scores:   identical

Model providers sometimes change tokenization or model behavior in subtle ways. If token counts shift dramatically without score changes, it often means the model is being more verbose in its reasoning. Check if your agent framework is passing conversation history efficiently and look for unnecessary context accumulation across steps.

Issue: Evaluation dataset becomes stale over time.

Pass rate: 99.2% (but production failure reports increasing)

Your evaluation dataset has drifted from real production usage. Implement a feedback loop: sample failed production tasks, anonymize them, and add them to your evaluation dataset quarterly. Also review your dataset for tasks that no longer reflect current agent capabilities or user needs. A perfect eval score with increasing production failures is the biggest red flag in agent evaluation.

Issue: Rate limiting causes sporadic evaluation failures.

Error: 429 Too Many Requests
  at IncomingMessage.<anonymous> (eval.js:89:22)
  status: 429, message: "Rate limit exceeded. Please retry after 30s"

Build retry logic with exponential backoff into both your agent runner and your judge calls. Limit concurrency to stay within your API tier's rate limits. For large evaluation runs, use batch APIs when available or spread the run across a longer time window.
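
A sketch of callback-style retry with exponential backoff and jitter, usable around both agent steps and judge calls (the retryable-error matching is an assumption; adjust it to your client's error messages):

function withBackoff(fn, args, maxRetries, callback) {
  var attempt = 0;

  function tryOnce() {
    fn.apply(null, args.concat(function (err, result) {
      var retryable = err && /429|ECONNRESET|ETIMEDOUT|503/.test(err.message || "");
      if (retryable && attempt < maxRetries) {
        attempt++;
        // Exponential backoff with jitter: 2s, 4s, 8s... plus up to 500ms of noise.
        var delayMs = Math.pow(2, attempt) * 1000 + Math.floor(Math.random() * 500);
        return setTimeout(tryOnce, delayMs);
      }
      callback(err, result);
    }));
  }

  tryOnce();
}

// Usage:
// withBackoff(llmJudge, [task.description, task.expectedOutcome, output], 4, done);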

Best Practices

  • Separate your judge model from your agent model. Using the same model to generate and evaluate creates blind spots. The judge should be at least as capable as the agent, and ideally from a different provider to avoid correlated biases.

  • Version everything. Your evaluation datasets, agent prompts, tool definitions, and judge prompts should all be version-controlled. When a regression appears, you need to know exactly what changed. Without versioning, you are debugging in the dark.

  • Run evaluations in CI/CD. Do not treat evaluation as a manual, occasional activity. Integrate your evaluation suite into your deployment pipeline. Block deploys that cause more than N% regressions.

  • Track metrics longitudinally. A single evaluation snapshot tells you almost nothing. Store every evaluation run and graph metrics over time. Trends matter more than absolute numbers. A gradual decline in tool-selection accuracy over three weeks is invisible in a single run but obvious in a time series.

  • Design for partial credit from the start. Binary pass/fail evaluation throws away information. Use multi-dimensional scoring (correctness, completeness, quality, efficiency) and weight the dimensions based on your application's priorities.

  • Include cost in your success criteria. An agent that solves every task but costs $0.50 per execution might be worse for your business than one that solves 90% of tasks at $0.03 each. Build cost thresholds into your evaluation framework as first-class metrics.

  • Maintain a "golden set" of 20-30 tasks that never changes. This is your fixed benchmark for longitudinal comparison. Your broader evaluation dataset should evolve, but the golden set stays constant so you can compare results across months and years.

  • Test the judge itself. Create a small meta-evaluation set where you know the correct scores, and verify that your LLM-as-judge produces scores within acceptable ranges. A miscalibrated judge corrupts every evaluation that depends on it.

References

  • LMSYS Chatbot Arena -- large-scale LLM evaluation through human preference
  • OpenAI Evals Framework -- open-source evaluation framework for LLMs
  • SWE-bench -- benchmark for evaluating LLM agents on real-world software engineering tasks
  • AgentBench -- multi-dimensional benchmark for LLM-as-Agent
  • GAIA Benchmark -- general AI assistant benchmark with real-world questions
  • HumanEval -- code generation evaluation by OpenAI
  • BIG-Bench -- collaborative benchmark for large language models
  • Inspect AI -- framework for LLM evaluation from the UK AI Safety Institute