Quality Assurance for AI-Generated Content

Build QA pipelines for AI-generated content with automated checks, LLM scoring, hallucination detection, and review workflows in Node.js.

Generating content with LLMs is the easy part. The hard part is knowing whether the output is actually good enough to publish. Unlike traditional software, where you can write deterministic tests against known outputs, AI-generated content lives in a probabilistic space where "correct" is subjective, hallucinations hide in plausible-sounding sentences, and quality can silently degrade when your provider swaps a model version out from under you.

This article walks through building a production-grade QA pipeline for AI-generated content in Node.js. We cover automated structural checks, LLM-as-judge scoring, hallucination detection via source comparison, human review workflows, quality scorecards, regression testing, and content guardrails. If you are shipping AI-generated content at any meaningful scale, you need every one of these layers.

Prerequisites

  • Node.js 18+ installed
  • Working knowledge of Express.js and REST APIs
  • An OpenAI API key (or equivalent LLM provider)
  • MongoDB for storing content and review state
  • Basic understanding of prompt engineering
  • Familiarity with content management workflows

The QA Challenge for AI Content

Traditional QA works because your system is deterministic. Given the same input, you get the same output. You write a test, it passes or fails, and you move on. AI-generated content breaks every assumption behind that model.

Non-determinism is the first problem. Run the same prompt twice with the same model and temperature, and you will get different outputs. This means snapshot testing is useless. You cannot pin a "golden" response and diff against it. Every generation is a unique artifact that needs its own evaluation.

Defining "correct" is the second problem. What makes a technical article good? Accuracy matters, but so does tone, depth, structure, readability, and audience fit. These are multi-dimensional qualities that resist binary pass/fail classification. A piece can be factually perfect but tonally wrong for your audience, or engagingly written but riddled with subtle inaccuracies.

Silent degradation is the third problem. Model providers update their systems constantly. A prompt that produced excellent output in January might produce mediocre output in March because the underlying model weights changed. Without continuous quality monitoring, you will not notice until readers start complaining.

The solution is defense in depth: multiple independent quality checks, each catching different failure modes, combined into a pipeline that gates content before it reaches production.
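
Concretely, the layers stack up like this. The stage names below match the pipeline we assemble later in the article, and blocking marks the checks whose failure stops the run:

// Preview of the QA layers built in this article; each catches a different failure mode
var qaLayers = [
  { name: "structure", blocking: true },      // word counts, required sections, placeholder text
  { name: "quality_score", blocking: true },  // LLM-as-judge rubric scoring
  { name: "fact_check", blocking: false },    // claim extraction and source verification
  { name: "hallucination", blocking: false }, // diff against reference material
  { name: "style", blocking: false },         // brand voice consistency
  { name: "guardrails", blocking: true }      // blocked topics, disclaimers, PII
];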

Building Automated Quality Checks

Start with the cheapest, fastest checks. These catch obvious structural problems before you spend API tokens on deeper evaluation.

// contentValidator.js -- cheap structural checks that run before any LLM calls

function validateStructure(content, rules) {
  var errors = [];
  var warnings = [];

  // Check minimum and maximum word count
  var wordCount = content.split(/\s+/).filter(function(w) { return w.length > 0; }).length;
  if (wordCount < rules.minWords) {
    errors.push("Content has " + wordCount + " words, minimum is " + rules.minWords);
  }
  if (rules.maxWords && wordCount > rules.maxWords) {
    warnings.push("Content has " + wordCount + " words, maximum recommended is " + rules.maxWords);
  }

  // Check for required sections via headings
  var headings = content.match(/^#{1,3}\s+.+$/gm) || [];
  var headingTexts = headings.map(function(h) {
    return h.replace(/^#+\s+/, "").toLowerCase().trim();
  });

  if (rules.requiredSections) {
    rules.requiredSections.forEach(function(section) {
      var found = headingTexts.some(function(h) {
        return h.indexOf(section.toLowerCase()) !== -1;
      });
      if (!found) {
        errors.push("Missing required section: " + section);
      }
    });
  }

  // Check for code examples if required
  var codeBlocks = content.match(/```[\s\S]*?```/g) || [];
  if (rules.minCodeBlocks && codeBlocks.length < rules.minCodeBlocks) {
    errors.push("Found " + codeBlocks.length + " code blocks, minimum is " + rules.minCodeBlocks);
  }

  // Check for broken markdown patterns
  var brokenLinks = content.match(/\[[^\]]*\]\(\s*\)/g) || [];
  if (brokenLinks.length > 0) {
    errors.push("Found " + brokenLinks.length + " empty markdown links");
  }

  // Check for placeholder text left by the LLM
  var placeholders = content.match(/\[(?:TODO|PLACEHOLDER|INSERT|TBD|FIXME)[^\]]*\]/gi) || [];
  if (placeholders.length > 0) {
    errors.push("Found placeholder text: " + placeholders.join(", "));
  }

  // Check for excessive repetition
  var sentences = content.split(/[.!?]+/).filter(function(s) { return s.trim().length > 20; });
  var seen = {};
  var duplicates = 0;
  sentences.forEach(function(s) {
    var normalized = s.trim().toLowerCase();
    if (seen[normalized]) {
      duplicates++;
    }
    seen[normalized] = true;
  });
  if (duplicates > 2) {
    warnings.push("Found " + duplicates + " duplicate sentences, possible repetition loop");
  }

  return {
    valid: errors.length === 0,
    errors: errors,
    warnings: warnings,
    stats: {
      wordCount: wordCount,
      headingCount: headings.length,
      codeBlockCount: codeBlocks.length,
      sentenceCount: sentences.length
    }
  };
}

module.exports = { validateStructure: validateStructure };

These checks run in milliseconds and catch the most common LLM failure modes: truncated output, missing sections, placeholder text the model forgot to fill in, and the dreaded repetition loop where the model gets stuck regurgitating the same paragraph.
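
A quick usage sketch, assuming the validator lives in contentValidator.js and draftMarkdown holds the raw LLM output:

var validateStructure = require("./contentValidator").validateStructure;

// draftMarkdown is assumed to hold the raw LLM output
var result = validateStructure(draftMarkdown, {
  minWords: 800,
  maxWords: 6000,
  requiredSections: ["prerequisites", "troubleshooting"],
  minCodeBlocks: 2
});

if (!result.valid) {
  console.log("Structural check failed:", result.errors);
}
console.log("Warnings:", result.warnings);
console.log("Stats:", result.stats);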

Implementing LLM-as-Judge for Content Quality Scoring

Automated structural checks catch format problems, but they cannot evaluate whether the content is actually good. For that, you need a second LLM acting as a quality judge. This is the single most important technique in AI content QA.

var axios = require("axios");

function buildScoringPrompt(content, criteria) {
  var rubric = criteria.map(function(c, i) {
    return (i + 1) + ". " + c.name + " (weight: " + c.weight + "): " + c.description;
  }).join("\n");

  return "You are an expert content quality evaluator. Score the following content on each dimension.\n\n" +
    "SCORING RUBRIC:\n" + rubric + "\n\n" +
    "For each dimension, provide:\n" +
    "- score: integer from 1-10\n" +
    "- reasoning: one sentence explaining the score\n\n" +
    "Respond ONLY with valid JSON in this format:\n" +
    '{"scores": [{"dimension": "name", "score": N, "reasoning": "..."}], "overall": N, "summary": "..."}\n\n' +
    "CONTENT TO EVALUATE:\n" + content;
}

function scoreContent(content, options) {
  var criteria = options.criteria || [
    { name: "accuracy", weight: 3, description: "Are claims factually correct? Are code examples valid?" },
    { name: "depth", weight: 2, description: "Does it go beyond surface-level explanation?" },
    { name: "clarity", weight: 2, description: "Is the writing clear and well-organized?" },
    { name: "practical_value", weight: 2, description: "Can a reader directly apply this knowledge?" },
    { name: "completeness", weight: 1, description: "Does it cover the topic adequately?" }
  ];

  var prompt = buildScoringPrompt(content, criteria);

  return axios.post("https://api.openai.com/v1/chat/completions", {
    model: options.judgeModel || "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    temperature: 0.1,
    response_format: { type: "json_object" }
  }, {
    headers: {
      "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
      "Content-Type": "application/json"
    }
  }).then(function(response) {
    var result = JSON.parse(response.data.choices[0].message.content);

    // Calculate weighted score
    var totalWeight = 0;
    var weightedSum = 0;
    result.scores.forEach(function(s) {
      var criterion = criteria.find(function(c) { return c.name === s.dimension; });
      var weight = criterion ? criterion.weight : 1;
      totalWeight += weight;
      weightedSum += s.score * weight;
    });
    result.weightedScore = Math.round((weightedSum / totalWeight) * 10) / 10;

    return result;
  });
}

module.exports = { scoreContent: scoreContent };

A critical detail: use a low temperature (0.1 or lower) for the judge model. You want scoring to be as consistent as possible. Also, always use a different model or at least a different configuration than the one that generated the content. Self-evaluation bias is real -- models tend to rate their own output higher than it deserves.
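
As a usage sketch, here is how a draft might be scored with an explicit judge model and rubric; draft is a placeholder for your generated content, and the model name is only an example:

// draft is a placeholder for the generated content; the judge model is only an example
scoreContent(draft, {
  judgeModel: "gpt-4o",
  criteria: [
    { name: "accuracy", weight: 3, description: "Claims and code examples are correct" },
    { name: "clarity", weight: 2, description: "Clear, well-organized writing" },
    { name: "practical_value", weight: 2, description: "Reader can apply this directly" }
  ]
}).then(function(result) {
  console.log("Weighted score:", result.weightedScore);
  result.scores.forEach(function(s) {
    console.log(s.dimension + ": " + s.score + " -- " + s.reasoning);
  });
});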

Factual Accuracy Checking with Retrieval-Based Verification

Hallucination detection is the hardest problem in AI content QA. The model generates text that sounds authoritative and reads smoothly, but states something that is simply not true. The most effective approach is retrieval-based verification: extract specific claims from the content, then check each one against authoritative sources.

function extractClaims(content) {
  var prompt = "Extract all specific factual claims from this technical content. " +
    "Include version numbers, performance claims, API behaviors, configuration defaults, " +
    "and architectural assertions. Ignore opinions and subjective statements.\n\n" +
    "Return JSON: {\"claims\": [{\"text\": \"...\", \"type\": \"version|performance|behavior|fact\", \"confidence_needed\": \"high|medium|low\"}]}\n\n" +
    "CONTENT:\n" + content;

  return callLLM(prompt, { temperature: 0 }).then(function(result) {
    return JSON.parse(result).claims;
  });
}

function verifyClaim(claim, sources) {
  var sourceText = sources.map(function(s) {
    return "SOURCE [" + s.name + "]:\n" + s.content;
  }).join("\n\n---\n\n");

  var prompt = "Verify whether the following claim is supported, contradicted, or not addressed by the provided sources.\n\n" +
    "CLAIM: " + claim.text + "\n\n" +
    "SOURCES:\n" + sourceText + "\n\n" +
    'Return JSON: {"verdict": "supported|contradicted|unverifiable", "evidence": "...", "source": "...", "correction": "..."}';

  return callLLM(prompt, { temperature: 0 });
}

function runFactCheck(content, sourceDocuments) {
  return extractClaims(content).then(function(claims) {
    var highPriorityClaims = claims.filter(function(c) {
      return c.confidence_needed === "high";
    });

    var checks = highPriorityClaims.map(function(claim) {
      return verifyClaim(claim, sourceDocuments).then(function(result) {
        var verdict = JSON.parse(result);
        verdict.claim = claim.text;
        return verdict;
      });
    });

    return Promise.all(checks).then(function(results) {
      var contradictions = results.filter(function(r) { return r.verdict === "contradicted"; });
      var unverifiable = results.filter(function(r) { return r.verdict === "unverifiable"; });

      return {
        totalClaims: claims.length,
        checkedClaims: results.length,
        contradictions: contradictions,
        unverifiable: unverifiable,
        passRate: results.length > 0
          ? Math.round(((results.length - contradictions.length) / results.length) * 100)
          : 100,
        results: results
      };
    });
  });
}

The key insight here is that you do not try to verify everything. You extract claims, prioritize the ones where being wrong would be most damaging (version numbers, API behaviors, security recommendations), and verify those against your own source documents. An unverifiable claim is not automatically wrong, but it should trigger a flag for human review.
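
A usage sketch, assuming your trusted reference material is stored as local files; the paths and the articleDraft variable are placeholders:

var fs = require("fs");

// The file paths and articleDraft variable are placeholders for your own setup
var sources = [
  { name: "official-docs", content: fs.readFileSync("./refs/official-docs.md", "utf8") },
  { name: "internal-runbook", content: fs.readFileSync("./refs/runbook.md", "utf8") }
];

runFactCheck(articleDraft, sources).then(function(report) {
  if (report.contradictions.length > 0) {
    console.log("Contradicted claims found:", report.contradictions);
  }
  if (report.unverifiable.length > 0) {
    console.log(report.unverifiable.length + " unverifiable claims -- flag for human review");
  }
  console.log("Fact-check pass rate: " + report.passRate + "%");
});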

Detecting Hallucinations with Source Comparison

Beyond factual claims, LLMs hallucinate in subtler ways: inventing API methods that do not exist, referencing libraries with wrong function signatures, or fabricating configuration options. Source comparison catches these by diffing the generated content against known-good reference material.

function detectHallucinations(generatedContent, referenceContent, topic) {
  var prompt = "Compare the GENERATED content against the REFERENCE material. " +
    "Identify any statements in the generated content that:\n" +
    "1. Contradict the reference material\n" +
    "2. Add specific technical details not present in the reference (possible fabrication)\n" +
    "3. Use API names, function signatures, or config options not in the reference\n\n" +
    "Topic context: " + topic + "\n\n" +
    "GENERATED:\n" + generatedContent.substring(0, 4000) + "\n\n" +
    "REFERENCE:\n" + referenceContent.substring(0, 4000) + "\n\n" +
    'Return JSON: {"hallucinations": [{"text": "...", "severity": "high|medium|low", "type": "contradiction|fabrication|unsupported", "explanation": "..."}], "risk_level": "high|medium|low"}';

  return callLLM(prompt, { temperature: 0 }).then(function(result) {
    return JSON.parse(result);
  });
}

You should always run this check when you have reference documentation available. For technical content about specific libraries or APIs, pull the actual documentation as your reference source. The cost is a few extra API calls per piece of content, but catching a hallucinated API method before it goes live is worth far more than the token cost.
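
A minimal sketch of that workflow, assuming the reference documentation is stored locally; the file path, topic string, and articleDraft variable are placeholders:

var fs = require("fs");

// The reference file and topic string are placeholders for your own material
var referenceDocs = fs.readFileSync("./refs/library-docs.md", "utf8");

detectHallucinations(articleDraft, referenceDocs, "library API usage").then(function(report) {
  if (report.risk_level === "high") {
    console.log("High hallucination risk -- route to human review");
  }
  report.hallucinations.forEach(function(h) {
    console.log("[" + h.severity + "] " + h.type + ": " + h.text);
  });
});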

Tone and Style Consistency Checks

If you are generating content for a specific brand or publication, every piece needs to sound like it came from the same author. LLMs are inconsistent stylists by default; they shift register, vocabulary level, and personality across generations. A style consistency check scores content against a defined style guide.

function checkStyleConsistency(content, styleGuide) {
  var prompt = "Evaluate whether this content matches the specified writing style.\n\n" +
    "STYLE GUIDE:\n" + JSON.stringify(styleGuide, null, 2) + "\n\n" +
    "CONTENT:\n" + content.substring(0, 3000) + "\n\n" +
    "Score each style dimension 1-10 and flag specific passages that deviate from the guide.\n" +
    'Return JSON: {"scores": {"voice": N, "technicality": N, "formality": N, "conciseness": N}, "deviations": [{"passage": "...", "issue": "...", "suggestion": "..."}], "overall_consistency": N}';

  return callLLM(prompt, { temperature: 0.1 });
}

// Example style guide definition
var engineeringBlogStyle = {
  voice: "First person, experienced practitioner sharing lessons learned",
  technicality: "Advanced - assumes reader knows fundamentals, dives into implementation details",
  formality: "Informal but professional. No slang, no academic language. Direct.",
  conciseness: "Tight paragraphs, no filler. Every sentence earns its place.",
  codeStyle: "Working examples with comments. Var declarations, require() imports, no arrow functions.",
  avoidPatterns: [
    "In this article we will explore...",
    "Let's dive in!",
    "In conclusion...",
    "It's worth noting that...",
    "As we all know..."
  ]
};

The avoidPatterns array is particularly valuable. Every brand has phrases that are telltale signs of generic AI output. Train your style checker to flag these aggressively.
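
You do not even need an LLM call to catch them. A cheap local scan, sketched below, flags avoided phrases before the style judge ever runs; findAvoidedPhrases and draft are illustrative names:

// Cheap local scan for avoided phrases -- no API call needed
function findAvoidedPhrases(content, styleGuide) {
  var lower = content.toLowerCase();
  return (styleGuide.avoidPatterns || []).filter(function(phrase) {
    return lower.indexOf(phrase.toLowerCase()) !== -1;
  });
}

var hits = findAvoidedPhrases(draft, engineeringBlogStyle);
if (hits.length > 0) {
  console.log("Generic AI phrasing detected:", hits);
}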

Implementing a Content Review Pipeline

Now we assemble the individual checks into a coherent pipeline. Content flows through stages: generation, automated QA, human review (if needed), and publication. Each stage has clear pass/fail criteria.

var EventEmitter = require("events");

function ContentPipeline(config) {
  this.config = config;
  this.emitter = new EventEmitter();
  this.stages = [
    { name: "structure", fn: this.checkStructure.bind(this), blocking: true },
    { name: "quality_score", fn: this.checkQuality.bind(this), blocking: true },
    { name: "fact_check", fn: this.checkFacts.bind(this), blocking: false },
    { name: "hallucination", fn: this.checkHallucinations.bind(this), blocking: false },
    { name: "style", fn: this.checkStyle.bind(this), blocking: false },
    { name: "guardrails", fn: this.checkGuardrails.bind(this), blocking: true }
  ];
}

ContentPipeline.prototype.run = function(content, metadata) {
  var self = this;
  var results = { stages: {}, passed: true, blocked: false, requiresHumanReview: false };
  var contentObj = { content: content, metadata: metadata };

  return self.stages.reduce(function(chain, stage) {
    return chain.then(function() {
      // Once a blocking stage has failed, skip the remaining stages entirely
      if (results.blocked) {
        return;
      }
      self.emitter.emit("stage:start", stage.name);
      return stage.fn(contentObj).then(function(stageResult) {
        results.stages[stage.name] = stageResult;
        self.emitter.emit("stage:complete", stage.name, stageResult);

        if (stage.blocking && !stageResult.passed) {
          results.passed = false;
          results.blocked = true;
          self.emitter.emit("pipeline:blocked", stage.name, stageResult);
        }

        if (stageResult.requiresHumanReview) {
          results.requiresHumanReview = true;
        }
      });
    });
  }, Promise.resolve()).then(function() {
    results.scorecard = self.buildScorecard(results);
    self.emitter.emit("pipeline:complete", results);
    return results;
  });
};

ContentPipeline.prototype.checkStructure = function(contentObj) {
  var result = validateStructure(contentObj.content, this.config.structureRules);
  return Promise.resolve({
    passed: result.valid,
    errors: result.errors,
    warnings: result.warnings,
    stats: result.stats
  });
};

ContentPipeline.prototype.checkQuality = function(contentObj) {
  return scoreContent(contentObj.content, this.config.scoring).then(function(scores) {
    var threshold = 6.5;
    return {
      passed: scores.weightedScore >= threshold,
      score: scores.weightedScore,
      dimensions: scores.scores,
      summary: scores.summary,
      requiresHumanReview: scores.weightedScore >= 5.0 && scores.weightedScore < threshold
    };
  });
};

ContentPipeline.prototype.checkGuardrails = function(contentObj) {
  var content = contentObj.content.toLowerCase();
  var violations = [];

  // Check blocked topics
  var blockedTopics = this.config.guardrails.blockedTopics || [];
  blockedTopics.forEach(function(topic) {
    if (content.indexOf(topic.toLowerCase()) !== -1) {
      violations.push("Contains blocked topic: " + topic);
    }
  });

  // Check required disclaimers
  var requiredDisclaimers = this.config.guardrails.requiredDisclaimers || [];
  requiredDisclaimers.forEach(function(disclaimer) {
    if (content.indexOf(disclaimer.toLowerCase()) === -1) {
      violations.push("Missing required disclaimer: " + disclaimer);
    }
  });

  // Check for PII patterns
  var emailPattern = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;
  var phonePattern = /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g;
  var emails = contentObj.content.match(emailPattern) || [];
  var phones = contentObj.content.match(phonePattern) || [];

  if (emails.length > 0) {
    violations.push("Contains possible email addresses: " + emails.join(", "));
  }
  if (phones.length > 0) {
    violations.push("Contains possible phone numbers: " + phones.join(", "));
  }

  return Promise.resolve({
    passed: violations.length === 0,
    violations: violations
  });
};

ContentPipeline.prototype.buildScorecard = function(results) {
  var dimensions = {};
  var stages = results.stages;

  dimensions.structure = stages.structure && stages.structure.passed ? 10 : 0;
  dimensions.quality = stages.quality_score ? stages.quality_score.score : 0;
  dimensions.factual = stages.fact_check ? stages.fact_check.passRate / 10 : null;
  dimensions.hallucination_risk = stages.hallucination
    ? (stages.hallucination.risk_level === "low" ? 9 : stages.hallucination.risk_level === "medium" ? 5 : 2)
    : null;
  dimensions.style = stages.style ? stages.style.overall_consistency : null;
  dimensions.guardrails = stages.guardrails && stages.guardrails.passed ? 10 : 0;

  var scores = Object.keys(dimensions).filter(function(k) {
    return dimensions[k] !== null;
  }).map(function(k) {
    return dimensions[k];
  });

  var overall = scores.length > 0
    ? Math.round((scores.reduce(function(a, b) { return a + b; }, 0) / scores.length) * 10) / 10
    : 0;

  return {
    dimensions: dimensions,
    overall: overall,
    grade: overall >= 8 ? "A" : overall >= 6.5 ? "B" : overall >= 5 ? "C" : "F",
    timestamp: new Date().toISOString()
  };
};

The pipeline uses a sequential stage execution pattern. Blocking stages halt the pipeline on failure; non-blocking stages record their results but let processing continue. This means a structurally invalid article gets rejected immediately without burning API tokens on quality scoring, while a style deviation gets flagged but does not prevent publication. The checkFacts, checkHallucinations, and checkStyle stages are thin wrappers around the runFactCheck, detectHallucinations, and checkStyleConsistency functions from the earlier sections.
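
Here is a usage sketch showing how the pipeline might be instantiated and observed through its event emitter; draftContent and the config values are placeholders:

var pipeline = new ContentPipeline({
  structureRules: { minWords: 500, requiredSections: ["prerequisites"], minCodeBlocks: 1 },
  scoring: { judgeModel: "gpt-4o" },
  guardrails: { blockedTopics: [], requiredDisclaimers: [] }
});

pipeline.emitter.on("pipeline:blocked", function(stageName, result) {
  console.log("Blocked at " + stageName + ":", result.errors || result.violations);
});

// draftContent is a placeholder for the generated article
pipeline.run(draftContent, { contentId: "article-123", model: "gpt-4o" }).then(function(results) {
  console.log("Passed:", results.passed, "Grade:", results.scorecard.grade);
});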

Human Review Workflows for High-Stakes Content

Automation handles the bulk of QA, but some content needs human eyes. Medical advice, legal information, financial guidance, and anything where being wrong has real consequences should always pass through human review. Build this into your pipeline as a first-class concept, not an afterthought.

var mongo = require("./dataAccess");

function ReviewQueue(db) {
  this.collection = db.collection("content_reviews");
}

ReviewQueue.prototype.submit = function(contentId, content, pipelineResults) {
  var reviewItem = {
    contentId: contentId,
    content: content,
    pipelineResults: pipelineResults,
    scorecard: pipelineResults.scorecard,
    status: "pending_review",
    priority: this.calculatePriority(pipelineResults),
    assignedTo: null,
    submittedAt: new Date(),
    reviewedAt: null,
    reviewNotes: null,
    decision: null
  };

  return this.collection.insertOne(reviewItem);
};

ReviewQueue.prototype.calculatePriority = function(results) {
  var score = results.scorecard.overall;
  // Lower quality scores get higher priority (reviewed first).
  // rank is numeric so the queue sort puts critical items ahead of everything else.
  if (score < 5) return { label: "critical", rank: 1 };
  if (score < 6.5) return { label: "high", rank: 2 };
  if (score < 8) return { label: "medium", rank: 3 };
  return { label: "low", rank: 4 };
};

ReviewQueue.prototype.getNextForReview = function(reviewerId) {
  return this.collection.findOneAndUpdate(
    { status: "pending_review", assignedTo: null },
    { $set: { assignedTo: reviewerId, status: "in_review" } },
    { sort: { "priority.rank": 1, submittedAt: 1 }, returnDocument: "after" }
  );
};

ReviewQueue.prototype.submitDecision = function(contentId, reviewerId, decision, notes) {
  return this.collection.updateOne(
    { contentId: contentId, assignedTo: reviewerId },
    {
      $set: {
        decision: decision,
        reviewNotes: notes,
        reviewedAt: new Date(),
        status: decision === "approved" ? "approved" : decision === "rejected" ? "rejected" : "revision_needed"
      }
    }
  );
};

The priority calculation ensures that the worst content gets reviewed first. A human reviewer should never have to wade through content that scored 9/10 before seeing the piece that scored 4/10 and might contain hallucinations.

Quality Metrics Tracking Over Time

Individual content QA is necessary but insufficient. You also need aggregate metrics to detect trends: is your content quality improving or degrading? Did that prompt change actually help? Did the model update break something?

function QualityTracker(db) {
  this.collection = db.collection("quality_metrics");
}

QualityTracker.prototype.record = function(contentId, scorecard, metadata) {
  return this.collection.insertOne({
    contentId: contentId,
    scorecard: scorecard,
    metadata: metadata,
    model: metadata.model,
    promptVersion: metadata.promptVersion,
    timestamp: new Date()
  });
};

QualityTracker.prototype.getAverageScores = function(days, groupBy) {
  var cutoff = new Date(Date.now() - days * 24 * 60 * 60 * 1000);
  var groupField = groupBy === "model" ? "$model" : "$promptVersion";

  return this.collection.aggregate([
    { $match: { timestamp: { $gte: cutoff } } },
    {
      $group: {
        _id: groupField,
        avgOverall: { $avg: "$scorecard.overall" },
        avgQuality: { $avg: "$scorecard.dimensions.quality" },
        count: { $sum: 1 },
        passRate: {
          $avg: {
            $cond: [{ $gte: ["$scorecard.overall", 6.5] }, 1, 0]
          }
        }
      }
    },
    { $sort: { avgOverall: -1 } }
  ]).toArray();
};

QualityTracker.prototype.detectRegression = function(baselineWindow, currentWindow) {
  var self = this;
  var now = Date.now();
  var baselineStart = new Date(now - (baselineWindow + currentWindow) * 24 * 60 * 60 * 1000);
  var currentStart = new Date(now - currentWindow * 24 * 60 * 60 * 1000);

  return Promise.all([
    self.collection.aggregate([
      { $match: { timestamp: { $gte: baselineStart, $lt: currentStart } } },
      { $group: { _id: null, avg: { $avg: "$scorecard.overall" } } }
    ]).toArray(),
    self.collection.aggregate([
      { $match: { timestamp: { $gte: currentStart } } },
      { $group: { _id: null, avg: { $avg: "$scorecard.overall" } } }
    ]).toArray()
  ]).then(function(results) {
    var baseline = results[0][0] ? results[0][0].avg : 0;
    var current = results[1][0] ? results[1][0].avg : 0;
    var delta = current - baseline;

    return {
      baseline: Math.round(baseline * 100) / 100,
      current: Math.round(current * 100) / 100,
      delta: Math.round(delta * 100) / 100,
      regressed: delta < -0.5,
      improved: delta > 0.5
    };
  });
};

The detectRegression method compares average quality scores between a baseline window and a current window. A delta of more than -0.5 points triggers a regression alert. This catches model updates, prompt regressions, and any other systemic quality shifts before they accumulate into a visible problem.
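
A minimal sketch of wiring this into a daily check; alertOps is a placeholder for whatever notification hook you already use:

// Check once a day; alertOps is a placeholder for your notification hook (Slack, email, etc.)
setInterval(function() {
  qualityTracker.detectRegression(14, 7).then(function(report) {
    if (report.regressed) {
      alertOps("Content quality regression: " + report.baseline + " -> " + report.current);
    }
  }).catch(function(err) {
    console.error("Regression check failed:", err.message);
  });
}, 24 * 60 * 60 * 1000);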

Regression Testing Content Quality Across Model Updates

When your LLM provider updates a model, you need to know immediately if your content quality changed. Build a regression test suite that generates content from a fixed set of prompts and compares the quality scores against your baseline.

// generateContent() below is assumed to be your own generation wrapper (prompt in, content string out)
function regressionTest(prompts, baselineScores, currentModel) {
  var results = [];

  return prompts.reduce(function(chain, prompt, index) {
    return chain.then(function() {
      return generateContent(prompt, { model: currentModel });
    }).then(function(content) {
      return scoreContent(content, { judgeModel: "gpt-4o" });
    }).then(function(scores) {
      var baseline = baselineScores[index];
      var delta = scores.weightedScore - baseline.weightedScore;
      results.push({
        prompt: prompt.name,
        baseline: baseline.weightedScore,
        current: scores.weightedScore,
        delta: delta,
        regressed: delta < -1.0,
        details: scores
      });
    });
  }, Promise.resolve()).then(function() {
    var regressions = results.filter(function(r) { return r.regressed; });
    return {
      totalTests: results.length,
      regressions: regressions.length,
      passed: regressions.length === 0,
      results: results
    };
  });
}

Run this suite on a schedule -- weekly at minimum, or triggered whenever you detect a model version change in the API response. Store the results so you can track quality trends across model generations.
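
One lightweight trigger, sketched below: the chat completions response body includes the resolved model identifier, so you can record it per generation and kick off the suite when it changes. runRegressionSuite is a placeholder for your own wrapper around regressionTest:

var lastSeenModel = null;

// Call this with each axios response from the chat completions endpoint
function trackModelVersion(response) {
  var current = response.data.model;
  if (lastSeenModel && current !== lastSeenModel) {
    console.log("Model changed from " + lastSeenModel + " to " + current + ", running regression suite");
    runRegressionSuite(current); // placeholder for your wrapper around regressionTest()
  }
  lastSeenModel = current;
  return response;
}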

Sampling Strategies for Manual Review at Scale

You cannot have humans review every piece of content at scale. But you cannot skip human review entirely either. The answer is strategic sampling.

function selectForReview(contentBatch, sampleConfig) {
  var selected = [];

  // Always review: lowest scoring content
  var sorted = contentBatch.slice().sort(function(a, b) {
    return a.scorecard.overall - b.scorecard.overall;
  });
  var bottomN = sorted.slice(0, sampleConfig.bottomCount || 3);
  bottomN.forEach(function(item) {
    item.reviewReason = "bottom_percentile";
    selected.push(item);
  });

  // Always review: content flagged by any QA stage
  contentBatch.forEach(function(item) {
    if (item.requiresHumanReview && selected.indexOf(item) === -1) {
      item.reviewReason = "flagged_by_pipeline";
      selected.push(item);
    }
  });

  // Random sample from passing content (Fisher-Yates shuffle for an unbiased pick)
  var passing = contentBatch.filter(function(item) {
    return item.scorecard.overall >= 6.5 && selected.indexOf(item) === -1;
  });
  var randomCount = Math.ceil(passing.length * (sampleConfig.randomPercent || 0.1));
  var shuffled = passing.slice();
  for (var i = shuffled.length - 1; i > 0; i--) {
    var j = Math.floor(Math.random() * (i + 1));
    var tmp = shuffled[i];
    shuffled[i] = shuffled[j];
    shuffled[j] = tmp;
  }
  shuffled.slice(0, randomCount).forEach(function(item) {
    item.reviewReason = "random_sample";
    selected.push(item);
  });

  return selected;
}

The three-tier strategy covers the important cases. Bottom-percentile review catches systematic quality problems. Flagged review catches content that automation found suspicious. Random sampling provides an unbiased quality signal and prevents over-reliance on automated scores.
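
A usage sketch, assuming each item in the batch carries its contentId plus the scorecard and requiresHumanReview flag set by the pipeline:

// todaysBatch is assumed to be an array of items carrying contentId, scorecard,
// and the requiresHumanReview flag set by the pipeline
var toReview = selectForReview(todaysBatch, {
  bottomCount: 5,     // always review the 5 lowest-scoring pieces
  randomPercent: 0.1  // plus a 10% random sample of passing content
});

toReview.forEach(function(item) {
  console.log(item.contentId + " -> " + item.reviewReason);
});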

Content Versioning and Rollback

When you discover that a batch of content has quality problems, you need the ability to roll back to a previous version or pull content from production entirely.

function ContentVersionStore(db) {
  this.collection = db.collection("content_versions");
}

ContentVersionStore.prototype.save = function(contentId, content, metadata) {
  return this.collection.findOne(
    { contentId: contentId },
    { sort: { version: -1 } }
  ).then(function(latest) {
    var nextVersion = latest ? latest.version + 1 : 1;
    return this.collection.insertOne({
      contentId: contentId,
      version: nextVersion,
      content: content,
      metadata: metadata,
      scorecard: metadata.scorecard,
      createdAt: new Date(),
      status: "active"
    });
  }.bind(this));
};

ContentVersionStore.prototype.rollback = function(contentId, targetVersion) {
  var self = this;
  return self.collection.updateMany(
    { contentId: contentId, version: { $gt: targetVersion } },
    { $set: { status: "rolled_back", rolledBackAt: new Date() } }
  ).then(function() {
    return self.collection.findOne({ contentId: contentId, version: targetVersion });
  });
};

ContentVersionStore.prototype.bulkRollback = function(filter, reason) {
  return this.collection.updateMany(
    filter,
    { $set: { status: "rolled_back", rollbackReason: reason, rolledBackAt: new Date() } }
  );
};

Version everything. Store the full content, the scorecard at the time of generation, and the model version that produced it. When a model update causes a regression, you can bulk-rollback every piece generated after a specific date by that specific model version.
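
For example, a bulk rollback targeting one model snapshot after a given date might look like this; the snapshot identifier and cutoff date are placeholders, and the field paths follow the save() calls above:

// The snapshot identifier and cutoff date are placeholders
versionStore.bulkRollback(
  {
    "metadata.model": "gpt-4o-2024-08-06",
    createdAt: { $gte: new Date("2024-09-01") },
    status: "active"
  },
  "Quality regression after model update"
).then(function(result) {
  console.log("Rolled back " + result.modifiedCount + " content versions");
});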

Complete Working Example

Here is the full Express.js application that ties everything together into a working content QA service with API endpoints for pipeline execution, review queues, and quality dashboards.

var express = require("express");
var bodyParser = require("body-parser");
var MongoClient = require("mongodb").MongoClient;
var axios = require("axios");

var app = express();
app.use(bodyParser.json({ limit: "5mb" }));

var db;
var pipeline;
var reviewQueue;
var qualityTracker;
var versionStore;

// Initialize database and services
MongoClient.connect(process.env.DB_MONGO).then(function(client) {
  db = client.db(process.env.DATABASE || "content_qa");
  reviewQueue = new ReviewQueue(db);
  qualityTracker = new QualityTracker(db);
  versionStore = new ContentVersionStore(db);

  pipeline = new ContentPipeline({
    structureRules: {
      minWords: 500,
      maxWords: 10000,
      requiredSections: ["overview", "prerequisites"],
      minCodeBlocks: 1
    },
    scoring: {
      judgeModel: "gpt-4o",
      criteria: [
        { name: "accuracy", weight: 3, description: "Technical accuracy of claims and code" },
        { name: "depth", weight: 2, description: "Goes beyond surface-level explanation" },
        { name: "clarity", weight: 2, description: "Well-organized, clear writing" },
        { name: "practical_value", weight: 2, description: "Reader can apply this directly" },
        { name: "completeness", weight: 1, description: "Covers the topic adequately" }
      ]
    },
    guardrails: {
      blockedTopics: ["gambling", "weapons", "illegal activities"],
      requiredDisclaimers: []
    }
  });

  console.log("Content QA service initialized");
}).catch(function(err) {
  console.error("Failed to connect to database:", err.message);
  process.exit(1);
});

// Helper to call LLM
function callLLM(prompt, options) {
  var model = (options && options.model) || "gpt-4o";
  var temperature = (options && options.temperature !== undefined) ? options.temperature : 0.1;

  return axios.post("https://api.openai.com/v1/chat/completions", {
    model: model,
    messages: [{ role: "user", content: prompt }],
    temperature: temperature
  }, {
    headers: {
      "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
      "Content-Type": "application/json"
    }
  }).then(function(response) {
    return response.data.choices[0].message.content;
  });
}

// POST /qa/evaluate - Run full QA pipeline on content
app.post("/qa/evaluate", function(req, res) {
  var content = req.body.content;
  var metadata = req.body.metadata || {};

  if (!content) {
    return res.status(400).json({ error: "content field is required" });
  }

  pipeline.run(content, metadata).then(function(results) {
    // Record quality metrics
    return qualityTracker.record(metadata.contentId || "unknown", results.scorecard, metadata)
      .then(function() {
        // Save version
        if (metadata.contentId) {
          return versionStore.save(metadata.contentId, content, {
            scorecard: results.scorecard,
            model: metadata.model,
            promptVersion: metadata.promptVersion
          });
        }
      })
      .then(function() {
        // Route to human review if needed
        if (results.requiresHumanReview && metadata.contentId) {
          return reviewQueue.submit(metadata.contentId, content, results);
        }
      })
      .then(function() {
        res.json({
          passed: results.passed,
          requiresHumanReview: results.requiresHumanReview,
          scorecard: results.scorecard,
          stages: results.stages
        });
      });
  }).catch(function(err) {
    console.error("Pipeline error:", err);
    res.status(500).json({ error: "Pipeline evaluation failed", details: err.message });
  });
});

// GET /qa/review/next - Get next item for human review
app.get("/qa/review/next", function(req, res) {
  var reviewerId = req.query.reviewer;
  if (!reviewerId) {
    return res.status(400).json({ error: "reviewer query parameter required" });
  }

  reviewQueue.getNextForReview(reviewerId).then(function(item) {
    if (!item || !item.value) {
      return res.json({ message: "No items pending review" });
    }
    res.json(item.value);
  }).catch(function(err) {
    res.status(500).json({ error: err.message });
  });
});

// POST /qa/review/:contentId/decision - Submit review decision
app.post("/qa/review/:contentId/decision", function(req, res) {
  var contentId = req.params.contentId;
  var decision = req.body.decision;
  var notes = req.body.notes || "";
  var reviewerId = req.body.reviewerId;

  if (!decision || !reviewerId) {
    return res.status(400).json({ error: "decision and reviewerId are required" });
  }

  if (["approved", "rejected", "revision_needed"].indexOf(decision) === -1) {
    return res.status(400).json({ error: "decision must be approved, rejected, or revision_needed" });
  }

  reviewQueue.submitDecision(contentId, reviewerId, decision, notes).then(function() {
    res.json({ status: "recorded", contentId: contentId, decision: decision });
  }).catch(function(err) {
    res.status(500).json({ error: err.message });
  });
});

// GET /qa/dashboard - Quality metrics dashboard data
app.get("/qa/dashboard", function(req, res) {
  var days = parseInt(req.query.days) || 30;

  Promise.all([
    qualityTracker.getAverageScores(days, "model"),
    qualityTracker.detectRegression(14, 7),
    reviewQueue.collection.aggregate([
      { $group: { _id: "$status", count: { $sum: 1 } } }
    ]).toArray()
  ]).then(function(results) {
    res.json({
      period: days + " days",
      averageScoresByModel: results[0],
      regressionCheck: results[1],
      reviewQueueStatus: results[2]
    });
  }).catch(function(err) {
    res.status(500).json({ error: err.message });
  });
});

// POST /qa/rollback - Bulk rollback content by criteria
app.post("/qa/rollback", function(req, res) {
  var filter = req.body.filter;
  var reason = req.body.reason;

  if (!filter || !reason) {
    return res.status(400).json({ error: "filter and reason are required" });
  }

  versionStore.bulkRollback(filter, reason).then(function(result) {
    res.json({
      rolledBack: result.modifiedCount,
      reason: reason
    });
  }).catch(function(err) {
    res.status(500).json({ error: err.message });
  });
});

// GET /qa/regression-test - Run regression test suite
app.get("/qa/regression-test", function(req, res) {
  var model = req.query.model || "gpt-4o";

  // Load saved test prompts and baselines from database
  db.collection("regression_baselines").find({}).toArray().then(function(baselines) {
    if (baselines.length === 0) {
      return res.json({ message: "No regression baselines configured" });
    }

    var prompts = baselines.map(function(b) { return { name: b.name, prompt: b.prompt }; });
    var scores = baselines.map(function(b) { return { weightedScore: b.baselineScore }; });

    return regressionTest(prompts, scores, model).then(function(results) {
      res.json(results);
    });
  }).catch(function(err) {
    res.status(500).json({ error: err.message });
  });
});

var PORT = process.env.PORT || 3200;
app.listen(PORT, function() {
  console.log("Content QA service running on port " + PORT);
});

This service provides the complete workflow: content goes in through /qa/evaluate, gets scored and optionally routed to human review, quality metrics accumulate over time, and the dashboard endpoint gives you a real-time view of content quality health. The rollback endpoint lets you pull bad content from production when a model update causes regressions.
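
A client-side sketch of submitting a freshly generated draft to the service; generatedMarkdown and the metadata values are placeholders:

var axios = require("axios");

// generatedMarkdown and the metadata values are placeholders
axios.post("http://localhost:3200/qa/evaluate", {
  content: generatedMarkdown,
  metadata: {
    contentId: "example-article-slug",
    model: "gpt-4o",
    promptVersion: "v12"
  }
}).then(function(response) {
  var body = response.data;
  if (!body.passed) {
    console.log("Rejected with grade " + body.scorecard.grade);
  } else if (body.requiresHumanReview) {
    console.log("Queued for human review");
  } else {
    console.log("Approved for publication");
  }
});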

Common Issues and Troubleshooting

1. LLM judge returns invalid JSON

SyntaxError: Unexpected token 'T' at position 0
  at JSON.parse (<anonymous>)

The judge model sometimes wraps its JSON in markdown code fences or adds explanatory text before the JSON. Fix this by stripping non-JSON content before parsing:

function parseJSONResponse(text) {
  // Strip markdown code fences
  var cleaned = text.replace(/```json\n?/g, "").replace(/```\n?/g, "").trim();
  // Find first { and last }
  var start = cleaned.indexOf("{");
  var end = cleaned.lastIndexOf("}");
  if (start === -1 || end === -1) {
    throw new Error("No JSON object found in response: " + text.substring(0, 100));
  }
  return JSON.parse(cleaned.substring(start, end + 1));
}

2. Quality scores are inconsistent across runs

Run 1: weightedScore: 7.8
Run 2: weightedScore: 5.2
Run 3: weightedScore: 8.1

Even with temperature 0, LLM outputs are not fully deterministic. Run the judge three times and take the median score; this smooths out variance significantly:

function stableScore(content, options, runs) {
  var promises = [];
  for (var i = 0; i < (runs || 3); i++) {
    promises.push(scoreContent(content, options));
  }
  return Promise.all(promises).then(function(results) {
    // Sort numerically by weighted score and return the median result
    var sorted = results.slice().sort(function(a, b) {
      return a.weightedScore - b.weightedScore;
    });
    return sorted[Math.floor(sorted.length / 2)];
  });
}

3. Pipeline times out on large content

Error: timeout of 30000ms exceeded
  at createError (node_modules/axios/lib/core/createError.js:16:15)

Large articles take longer to evaluate. Increase the axios timeout and consider running non-blocking stages in parallel instead of sequentially:

var axiosWithTimeout = axios.create({ timeout: 120000 }); // 2 minutes

// Run non-blocking stages in parallel
var nonBlocking = stages.filter(function(s) { return !s.blocking; });
Promise.all(nonBlocking.map(function(stage) {
  return stage.fn(contentObj);
}));

4. MongoDB connection pool exhaustion during batch evaluation

MongoServerError: connection pool was cleared, rerun the operation

When evaluating many content pieces concurrently, each piece triggers multiple database writes. Use a concurrency limiter:

function evaluateBatch(items, concurrency) {
  var index = 0;
  var results = [];

  function next() {
    if (index >= items.length) return Promise.resolve();
    var current = index++;
    return pipeline.run(items[current].content, items[current].metadata)
      .then(function(result) {
        results[current] = result;
        return next();
      });
  }

  var workers = [];
  for (var i = 0; i < Math.min(concurrency, items.length); i++) {
    workers.push(next());
  }
  return Promise.all(workers).then(function() { return results; });
}

// Evaluate 100 items with max 5 concurrent
evaluateBatch(contentItems, 5);

5. Hallucination detector produces false positives on novel content

When the generated content covers material not present in reference documents, the detector flags it as "unsupported." Add a confidence threshold and distinguish between "contradicted" (definitely wrong) and "unsupported" (might be fine, needs human verification):

// Only auto-reject contradictions; flag unsupported statements for human review.
// "result" is the object returned by detectHallucinations().
var autoReject = result.hallucinations.filter(function(h) { return h.type === "contradiction"; });
var flagForReview = result.hallucinations.filter(function(h) { return h.type === "unsupported"; });

Best Practices

  • Use a different model for judging than for generating. Self-evaluation bias is measurable and significant. If you generate with GPT-4o, judge with Claude, or vice versa. Cross-model evaluation produces more honest quality scores.

  • Version your prompts alongside your content. When quality regresses, you need to know whether the cause was a prompt change or a model change. Store a prompt version hash with every piece of generated content so you can isolate variables during debugging (see the hashing sketch after this list).

  • Set your quality thresholds conservatively at first, then relax. Start by auto-rejecting anything below 8/10, let the human review queue absorb the workload, and gradually lower the threshold as you calibrate your scoring rubric against human judgment.

  • Build guardrails as the outermost layer, not an afterthought. Blocked topics, required disclaimers, and PII checks should run before expensive LLM-based evaluation. They are cheap, fast, and catch the most dangerous failures -- the ones that create legal or reputational risk.

  • Log everything, including the raw LLM judge responses. When you disagree with an automated score, you need to see the judge's reasoning to understand whether the rubric needs adjustment or the judge hallucinated its own evaluation.

  • Treat content QA as a feedback loop, not a gate. The data from your QA pipeline should feed back into prompt engineering. If the "clarity" score is consistently low, improve your generation prompts to emphasize structure and readability. If hallucination rates spike for certain topic areas, add more reference material to those generation pipelines.

  • Invest in human review calibration. If multiple reviewers have different quality standards, your human review data is noisy and your automated scoring cannot learn from it. Run calibration sessions where reviewers score the same content independently and discuss disagreements.

  • Never trust a single quality signal. The whole point of a multi-dimensional scorecard is that no single check is reliable enough on its own. Structural checks miss subtle quality issues. LLM judges miss formatting problems. Fact checkers miss style issues. The pipeline is only as strong as the diversity of its checks.
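
Referring back to the prompt-versioning point above, a simple way to derive a stable version identifier is to hash the prompt template; articlePromptTemplate here is a placeholder for your stored template:

var crypto = require("crypto");

function promptVersionHash(promptTemplate) {
  return crypto.createHash("sha256").update(promptTemplate).digest("hex").substring(0, 12);
}

// articlePromptTemplate is a placeholder for your stored prompt template
var metadata = {
  model: "gpt-4o",
  promptVersion: promptVersionHash(articlePromptTemplate)
};

Pass that hash in the metadata you already send to /qa/evaluate so every scorecard can be traced back to the exact prompt that produced the content.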
