User Feedback Loops for AI Quality

Build user feedback systems for AI quality improvement with collection, analysis, dashboards, and prompt optimization in Node.js.

Overview

AI models do not know when they are wrong. They generate confident-sounding answers regardless of whether the output is accurate, helpful, or completely fabricated. The only reliable signal for production AI quality comes from the people actually using it. Building a structured feedback loop -- where users rate AI responses, you store that data with full context, analyze patterns, and use those patterns to improve prompts and model selection -- is the single most impactful thing you can do to improve AI quality over time. This article walks through building that entire pipeline in Node.js with PostgreSQL, from feedback collection to prompt optimization.

Prerequisites

  • Node.js v18+ and npm installed
  • PostgreSQL 14+ running locally or remotely
  • Basic familiarity with Express.js and SQL
  • An AI-powered feature in production (or a prototype you want to improve)
  • Understanding of how prompts and model parameters affect LLM output

Why User Feedback Is Essential for AI Quality

Automated evaluation metrics like BLEU scores, perplexity, and cosine similarity tell you whether a model's output is statistically similar to a reference. They do not tell you whether the output actually helped someone. I have seen systems with great automated scores produce answers that confuse users, miss context, or technically answer the question while being practically useless.

User feedback fills this gap. When a user clicks a thumbs-down on an AI-generated summary, that is a direct signal that something went wrong. When they edit an AI draft and submit a corrected version, you now have a labeled training example. When feedback scores drop after a prompt change, you know the change was a regression.

The key insight is that LLMs fail in ways that are hard to predict. A prompt that works well for 95% of inputs can catastrophically fail on 5% of edge cases. You will not find those edge cases in testing. Users will find them for you -- but only if you give them a way to tell you.

Designing Feedback UI Patterns

The feedback mechanism you choose directly affects your data quality. Here are the patterns I have used in production, ranked by signal quality:

Binary feedback (thumbs up/down) is the easiest to implement and gets the highest participation rate. Users spend less than a second on it. The downside is low granularity -- you know something was bad but not what or why.

Star ratings (1-5) give you more granularity but lower participation. Users often default to extremes (1 or 5), which makes the middle ratings unreliable. I use these for features where quality differences are subtle, like text summarization.

Text corrections are the highest-quality signal. When a user edits an AI-generated response, you get an implicit before/after pair. This is gold for prompt engineering. The downside is that only power users will do this.

Categorical feedback (checkboxes like "inaccurate", "incomplete", "off-topic", "too long") combines the ease of binary feedback with the specificity of text. I recommend this as the default choice for most AI features.

Inline annotations let users highlight specific parts of a response that are wrong. This is the highest-effort pattern and only makes sense for high-value use cases like medical or legal AI.

In practice, I layer these. Every AI response gets a thumbs up/down. The thumbs-down expands to show categorical options and an optional text field. This gets you broad coverage with optional depth.
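
For concreteness, here is the kind of payload the layered UI ends up sending when a user clicks thumbs-down and fills in the optional details (field names match the collection endpoint defined later; the values are illustrative):

{
  "interaction_id": "int_9f2c1a",
  "feature_name": "article-summarizer",
  "rating_type": "binary",
  "rating_value": 0,
  "rating_categories": ["inaccurate", "too_long"],
  "feedback_comment": "The summary dropped the article's main conclusion."
}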

Setting Up the Database Schema

The feedback table needs to store the full context of each interaction -- the prompt, the response, the model, and the feedback itself. Without context, feedback data is useless for debugging.

CREATE TABLE IF NOT EXISTS ai_feedback (
    id SERIAL PRIMARY KEY,
    interaction_id VARCHAR(64) NOT NULL,
    feature_name VARCHAR(128) NOT NULL,
    user_id VARCHAR(128),
    session_id VARCHAR(128),
    prompt_text TEXT NOT NULL,
    prompt_template VARCHAR(256),
    prompt_version VARCHAR(32),
    response_text TEXT NOT NULL,
    model_name VARCHAR(64) NOT NULL,
    model_params JSONB DEFAULT '{}',
    rating_type VARCHAR(32) NOT NULL,
    rating_value INTEGER NOT NULL,
    rating_categories TEXT[] DEFAULT '{}',
    correction_text TEXT,
    feedback_comment TEXT,
    response_latency_ms INTEGER,
    token_count INTEGER,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

CREATE INDEX idx_feedback_feature ON ai_feedback(feature_name);
CREATE INDEX idx_feedback_created ON ai_feedback(created_at);
CREATE INDEX idx_feedback_rating ON ai_feedback(feature_name, rating_value);
CREATE INDEX idx_feedback_interaction ON ai_feedback(interaction_id);
CREATE INDEX idx_feedback_prompt_version ON ai_feedback(prompt_template, prompt_version);

CREATE TABLE IF NOT EXISTS ai_prompt_versions (
    id SERIAL PRIMARY KEY,
    feature_name VARCHAR(128) NOT NULL,
    template_name VARCHAR(256) NOT NULL,
    version VARCHAR(32) NOT NULL,
    prompt_text TEXT NOT NULL,
    is_active BOOLEAN DEFAULT false,
    avg_rating NUMERIC(3,2) DEFAULT 0,
    total_feedback INTEGER DEFAULT 0,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    activated_at TIMESTAMP WITH TIME ZONE,
    UNIQUE(template_name, version)
);

CREATE TABLE IF NOT EXISTS ai_quality_scores (
    id SERIAL PRIMARY KEY,
    feature_name VARCHAR(128) NOT NULL,
    period_start TIMESTAMP WITH TIME ZONE NOT NULL,
    period_end TIMESTAMP WITH TIME ZONE NOT NULL,
    total_interactions INTEGER DEFAULT 0,
    feedback_count INTEGER DEFAULT 0,
    feedback_rate NUMERIC(5,4) DEFAULT 0,
    avg_rating NUMERIC(3,2) DEFAULT 0,
    positive_rate NUMERIC(5,4) DEFAULT 0,
    negative_rate NUMERIC(5,4) DEFAULT 0,
    top_categories TEXT[] DEFAULT '{}',
    model_name VARCHAR(64),
    prompt_version VARCHAR(32),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

CREATE INDEX idx_quality_feature_period ON ai_quality_scores(feature_name, period_start);

The interaction_id is critical. Every AI call should generate a unique ID that travels from the backend to the frontend and back with the feedback. This lets you tie feedback to the exact prompt/response pair.
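
Here is one way to generate and propagate that ID (a sketch: callAiModel stands in for whatever AI client you use, and the returned object shape is an assumption):

var crypto = require("crypto");

// Wrap the AI call so every response carries an interaction ID and timing data
// that the frontend can echo back with its feedback submission.
// callAiModel is a placeholder for your actual AI client call.
function runAiFeature(featureName, promptText, callAiModel) {
    var interactionId = "int_" + crypto.randomUUID();
    var startedAt = Date.now();

    return callAiModel(promptText).then(function(responseText) {
        return {
            interaction_id: interactionId,
            feature_name: featureName,
            response_text: responseText,
            response_latency_ms: Date.now() - startedAt
        };
    });
}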

Implementing Feedback Collection Endpoints

Here is the Express.js router for collecting feedback. It validates input, stores the feedback, and triggers async analysis.

var express = require("express");
var router = express.Router();
var { Pool } = require("pg");
var { updateQualityScores } = require("./feedback-analysis");

var pool = new Pool({
    connectionString: process.env.POSTGRES_CONNECTION_STRING
});

// Collect feedback for an AI interaction
router.post("/api/feedback", function(req, res) {
    var body = req.body;

    if (!body.interaction_id || !body.feature_name || !body.rating_type) {
        return res.status(400).json({
            error: "Missing required fields: interaction_id, feature_name, rating_type"
        });
    }

    var validRatingTypes = ["binary", "stars", "correction"];
    if (validRatingTypes.indexOf(body.rating_type) === -1) {
        return res.status(400).json({
            error: "rating_type must be one of: " + validRatingTypes.join(", ")
        });
    }

    var ratingValue = parseInt(body.rating_value, 10);
    if (isNaN(ratingValue)) {
        return res.status(400).json({ error: "rating_value must be a number" });
    }
    if (body.rating_type === "binary" && ratingValue !== 0 && ratingValue !== 1) {
        return res.status(400).json({ error: "Binary rating must be 0 or 1" });
    }
    if (body.rating_type === "stars" && (ratingValue < 1 || ratingValue > 5)) {
        return res.status(400).json({ error: "Star rating must be between 1 and 5" });
    }

    var query = [
        "INSERT INTO ai_feedback",
        "(interaction_id, feature_name, user_id, session_id,",
        "prompt_text, prompt_template, prompt_version,",
        "response_text, model_name, model_params,",
        "rating_type, rating_value, rating_categories,",
        "correction_text, feedback_comment,",
        "response_latency_ms, token_count, metadata)",
        "VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18)",
        "RETURNING id, created_at"
    ].join(" ");

    var params = [
        body.interaction_id,
        body.feature_name,
        body.user_id || null,
        body.session_id || null,
        body.prompt_text || "",
        body.prompt_template || null,
        body.prompt_version || null,
        body.response_text || "",
        body.model_name || "unknown",
        JSON.stringify(body.model_params || {}),
        body.rating_type,
        ratingValue,
        body.rating_categories || [],
        body.correction_text || null,
        body.feedback_comment || null,
        body.response_latency_ms || null,
        body.token_count || null,
        JSON.stringify(body.metadata || {})
    ];

    pool.query(query, params, function(err, result) {
        if (err) {
            console.error("Feedback insert error:", err.message);
            return res.status(500).json({ error: "Failed to store feedback" });
        }

        var row = result.rows[0];
        res.status(201).json({
            feedback_id: row.id,
            created_at: row.created_at,
            message: "Feedback recorded. Thank you."
        });

        // Fire-and-forget: update quality scores
        updateQualityScores(body.feature_name).catch(function(e) {
            console.error("Quality score update failed:", e.message);
        });
    });
});

// Retrieve feedback for a specific interaction
router.get("/api/feedback/:interactionId", function(req, res) {
    var query = "SELECT * FROM ai_feedback WHERE interaction_id = $1 ORDER BY created_at DESC";

    pool.query(query, [req.params.interactionId], function(err, result) {
        if (err) {
            return res.status(500).json({ error: "Failed to retrieve feedback" });
        }
        res.json({ feedback: result.rows });
    });
});

// Get aggregated feedback stats for a feature
router.get("/api/feedback/stats/:featureName", function(req, res) {
    var days = parseInt(req.query.days, 10) || 30;
    var since = new Date(Date.now() - days * 24 * 60 * 60 * 1000).toISOString();

    var query = [
        "SELECT",
        "  COUNT(*) as total_feedback,",
        "  AVG(rating_value) as avg_rating,",
        "  COUNT(CASE WHEN rating_type = 'binary' AND rating_value = 1 THEN 1 END) as positive_count,",
        "  COUNT(CASE WHEN rating_type = 'binary' AND rating_value = 0 THEN 1 END) as negative_count,",
        "  COUNT(CASE WHEN correction_text IS NOT NULL THEN 1 END) as correction_count,",
        "  model_name,",
        "  prompt_version",
        "FROM ai_feedback",
        "WHERE feature_name = $1 AND created_at >= $2",
        "GROUP BY model_name, prompt_version",
        "ORDER BY total_feedback DESC"
    ].join(" ");

    pool.query(query, [req.params.featureName, since], function(err, result) {
        if (err) {
            return res.status(500).json({ error: "Failed to retrieve stats" });
        }
        res.json({ feature: req.params.featureName, period_days: days, stats: result.rows });
    });
});

module.exports = router;

Feedback Analysis Pipeline

Raw feedback data is not useful until you aggregate it. The analysis pipeline runs on a schedule (I use a cron job every hour) and computes quality scores per feature, per model, and per prompt version.

var { Pool } = require("pg");

var pool = new Pool({
    connectionString: process.env.POSTGRES_CONNECTION_STRING
});

function updateQualityScores(featureName) {
    var periodEnd = new Date();
    var periodStart = new Date(periodEnd.getTime() - 24 * 60 * 60 * 1000);

    // Aggregate categories in a correlated subquery so rows with multiple
    // categories do not inflate feedback_count or skew avg_rating.
    var query = [
        "SELECT",
        "  f.feature_name,",
        "  f.model_name,",
        "  f.prompt_version,",
        "  COUNT(*) as feedback_count,",
        "  AVG(f.rating_value)::numeric(3,2) as avg_rating,",
        "  COUNT(CASE WHEN f.rating_value >= 1 AND f.rating_type = 'binary' THEN 1 END)::numeric /",
        "    NULLIF(COUNT(CASE WHEN f.rating_type = 'binary' THEN 1 END), 0) as positive_rate,",
        "  (SELECT array_agg(DISTINCT cat)",
        "     FROM ai_feedback f2, unnest(f2.rating_categories) AS cat",
        "     WHERE f2.feature_name = f.feature_name",
        "       AND f2.model_name = f.model_name",
        "       AND f2.prompt_version IS NOT DISTINCT FROM f.prompt_version",
        "       AND f2.created_at >= $2 AND f2.created_at < $3) as top_categories",
        "FROM ai_feedback f",
        "WHERE f.feature_name = $1",
        "  AND f.created_at >= $2",
        "  AND f.created_at < $3",
        "GROUP BY f.feature_name, f.model_name, f.prompt_version"
    ].join(" ");

    return pool.query(query, [featureName, periodStart.toISOString(), periodEnd.toISOString()])
        .then(function(result) {
            var inserts = result.rows.map(function(row) {
                return pool.query(
                    [
                        "INSERT INTO ai_quality_scores",
                        "(feature_name, period_start, period_end, feedback_count,",
                        "avg_rating, positive_rate, top_categories, model_name, prompt_version)",
                        "VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)"
                    ].join(" "),
                    [
                        row.feature_name,
                        periodStart.toISOString(),
                        periodEnd.toISOString(),
                        row.feedback_count,
                        row.avg_rating,
                        row.positive_rate || 0,
                        row.top_categories || [],
                        row.model_name,
                        row.prompt_version
                    ]
                );
            });
            return Promise.all(inserts);
        });
}

function identifyProblematicPrompts(threshold) {
    threshold = threshold || 0.5;

    var query = [
        "SELECT",
        "  prompt_template,",
        "  prompt_version,",
        "  feature_name,",
        "  COUNT(*) as feedback_count,",
        "  AVG(rating_value)::numeric(3,2) as avg_rating,",
        "  COUNT(CASE WHEN rating_value = 0 AND rating_type = 'binary' THEN 1 END) as negative_count,",
        "  array_agg(DISTINCT feedback_comment) FILTER (WHERE feedback_comment IS NOT NULL) as comments",
        "FROM ai_feedback",
        "WHERE created_at >= NOW() - INTERVAL '7 days'",
        "  AND prompt_template IS NOT NULL",
        "GROUP BY prompt_template, prompt_version, feature_name",
        "HAVING AVG(rating_value) < $1 AND COUNT(*) >= 10",
        "ORDER BY avg_rating ASC"
    ].join(" ");

    return pool.query(query, [threshold]).then(function(result) {
        return result.rows;
    });
}

function getCorrections(featureName, limit) {
    limit = limit || 50;

    var query = [
        "SELECT",
        "  prompt_text,",
        "  response_text,",
        "  correction_text,",
        "  feedback_comment,",
        "  model_name,",
        "  prompt_version",
        "FROM ai_feedback",
        "WHERE feature_name = $1",
        "  AND correction_text IS NOT NULL",
        "  AND created_at >= NOW() - INTERVAL '30 days'",
        "ORDER BY created_at DESC",
        "LIMIT $2"
    ].join(" ");

    return pool.query(query, [featureName, limit]).then(function(result) {
        return result.rows;
    });
}

module.exports = {
    updateQualityScores: updateQualityScores,
    identifyProblematicPrompts: identifyProblematicPrompts,
    getCorrections: getCorrections
};
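
To run this on a schedule, a minimal node-cron wiring might look like the following (the feature list is an assumption; in practice derive it from your config or a DISTINCT query):

var cron = require("node-cron");
var analysis = require("./feedback-analysis");

// Features to aggregate -- derive this list from configuration
// or from SELECT DISTINCT feature_name FROM ai_feedback.
var FEATURES = ["article-summarizer"];

// Hourly: refresh quality scores for each feature
cron.schedule("0 * * * *", function() {
    FEATURES.forEach(function(feature) {
        analysis.updateQualityScores(feature).catch(function(err) {
            console.error("updateQualityScores failed for %s: %s", feature, err.message);
        });
    });
});

// Weekly (Monday 09:00): flag prompts that need attention
cron.schedule("0 9 * * 1", function() {
    analysis.identifyProblematicPrompts(0.5).then(function(rows) {
        rows.forEach(function(row) {
            console.warn("[PROMPT REVIEW] %s %s@%s avg=%s", row.feature_name, row.prompt_template, row.prompt_version, row.avg_rating);
        });
    }).catch(function(err) {
        console.error("identifyProblematicPrompts failed:", err.message);
    });
});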

Using Feedback to Identify Problematic Prompts

Once you have aggregated scores, the next step is finding which prompts are causing problems. I run this analysis weekly and it surfaces the exact prompt templates that need attention.

The query in identifyProblematicPrompts above filters for prompts with at least 10 feedback entries and an average rating below your threshold. The comments array gives you qualitative context. When I see the same comment appearing multiple times -- "answer is outdated", "missed the main point", "too generic" -- that tells me exactly what to fix.

Here is how I use the corrections data to build better prompts:

var analysis = require("./feedback-analysis");

function generatePromptImprovementReport(featureName) {
    return Promise.all([
        analysis.identifyProblematicPrompts(0.5),
        analysis.getCorrections(featureName, 100)
    ]).then(function(results) {
        var problematic = results[0].filter(function(p) {
            return p.feature_name === featureName;
        });
        var corrections = results[1];

        var report = {
            feature: featureName,
            generated_at: new Date().toISOString(),
            problematic_prompts: problematic,
            correction_patterns: extractPatterns(corrections),
            recommendations: []
        };

        // Analyze correction patterns
        var patterns = report.correction_patterns;
        if (patterns.too_long > corrections.length * 0.3) {
            report.recommendations.push(
                "Add explicit length constraints to prompt. 30%+ of corrections shorten the response."
            );
        }
        if (patterns.factual_changes > corrections.length * 0.2) {
            report.recommendations.push(
                "Add stronger grounding instructions. 20%+ of corrections fix factual claims."
            );
        }
        if (patterns.format_changes > corrections.length * 0.15) {
            report.recommendations.push(
                "Add output format examples to prompt. 15%+ of corrections change formatting."
            );
        }

        return report;
    });
}

function extractPatterns(corrections) {
    var patterns = {
        too_long: 0,
        too_short: 0,
        factual_changes: 0,
        format_changes: 0,
        tone_changes: 0,
        total: corrections.length
    };

    corrections.forEach(function(c) {
        var originalLen = (c.response_text || "").length;
        var correctedLen = (c.correction_text || "").length;

        if (correctedLen < originalLen * 0.7) {
            patterns.too_long++;
        } else if (correctedLen > originalLen * 1.3) {
            patterns.too_short++;
        }

        var comment = (c.feedback_comment || "").toLowerCase();
        if (comment.indexOf("wrong") !== -1 || comment.indexOf("incorrect") !== -1 || comment.indexOf("inaccurate") !== -1) {
            patterns.factual_changes++;
        }
        if (comment.indexOf("format") !== -1 || comment.indexOf("layout") !== -1 || comment.indexOf("structure") !== -1) {
            patterns.format_changes++;
        }
    });

    return patterns;
}

module.exports = {
    generatePromptImprovementReport: generatePromptImprovementReport
};

Implementing a Feedback-Driven Prompt Improvement Workflow

The improvement workflow is where feedback becomes action. Here is the process I follow:

  1. Identify -- The analysis pipeline flags prompts with low scores
  2. Diagnose -- Review corrections and comments to understand what is wrong
  3. Draft -- Write a new prompt version addressing the issues
  4. Test -- Deploy the new version alongside the old one (A/B test)
  5. Evaluate -- Compare feedback scores between versions
  6. Promote -- Make the winning version the default

The prompt manager below covers the versioning side of that workflow: creating versions, activating a winner, and comparing two versions against feedback data.

var { Pool } = require("pg");
var pool = new Pool({ connectionString: process.env.POSTGRES_CONNECTION_STRING });

function createPromptVersion(featureName, templateName, promptText) {
    // Generate version string from timestamp
    var version = "v" + Date.now().toString(36);

    var query = [
        "INSERT INTO ai_prompt_versions",
        "(feature_name, template_name, version, prompt_text)",
        "VALUES ($1, $2, $3, $4)",
        "RETURNING id, version, created_at"
    ].join(" ");

    return pool.query(query, [featureName, templateName, version, promptText])
        .then(function(result) {
            return result.rows[0];
        });
}

function activatePromptVersion(templateName, version) {
    // Deactivate all versions of this template, then activate the specified one
    return pool.query(
        "UPDATE ai_prompt_versions SET is_active = false WHERE template_name = $1",
        [templateName]
    ).then(function() {
        return pool.query(
            "UPDATE ai_prompt_versions SET is_active = true, activated_at = NOW() WHERE template_name = $1 AND version = $2 RETURNING *",
            [templateName, version]
        );
    }).then(function(result) {
        if (result.rows.length === 0) {
            throw new Error("Version not found: " + templateName + "@" + version);
        }
        return result.rows[0];
    });
}

function getActivePrompt(templateName) {
    var query = "SELECT * FROM ai_prompt_versions WHERE template_name = $1 AND is_active = true LIMIT 1";
    return pool.query(query, [templateName]).then(function(result) {
        return result.rows[0] || null;
    });
}

function compareVersions(templateName, versionA, versionB, days) {
    days = days || 7;

    var query = [
        "SELECT",
        "  prompt_version,",
        "  COUNT(*) as feedback_count,",
        "  AVG(rating_value)::numeric(3,2) as avg_rating,",
        "  COUNT(CASE WHEN rating_type = 'binary' AND rating_value = 1 THEN 1 END)::numeric /",
        "    NULLIF(COUNT(CASE WHEN rating_type = 'binary' THEN 1 END), 0) as positive_rate,",
        "  AVG(response_latency_ms)::integer as avg_latency_ms",
        "FROM ai_feedback",
        "WHERE prompt_template = $1",
        "  AND prompt_version IN ($2, $3)",
        "  AND created_at >= NOW() - ($4 || ' days')::interval",
        "GROUP BY prompt_version"
    ].join(" ");

    return pool.query(query, [templateName, versionA, versionB, days.toString()])
        .then(function(result) {
            var comparison = {
                template: templateName,
                period_days: days,
                versions: {}
            };

            result.rows.forEach(function(row) {
                comparison.versions[row.prompt_version] = {
                    feedback_count: parseInt(row.feedback_count, 10),
                    avg_rating: parseFloat(row.avg_rating),
                    positive_rate: parseFloat(row.positive_rate || 0),
                    avg_latency_ms: row.avg_latency_ms
                };
            });

            // Determine winner
            var keys = Object.keys(comparison.versions);
            if (keys.length === 2) {
                var a = comparison.versions[keys[0]];
                var b = comparison.versions[keys[1]];
                if (a.feedback_count >= 30 && b.feedback_count >= 30) {
                    comparison.winner = a.avg_rating > b.avg_rating ? keys[0] : keys[1];
                    comparison.confidence = Math.abs(a.avg_rating - b.avg_rating) > 0.2 ? "high" : "low";
                } else {
                    comparison.winner = null;
                    comparison.confidence = "insufficient_data";
                }
            }

            return comparison;
        });
}

module.exports = {
    createPromptVersion: createPromptVersion,
    activatePromptVersion: activatePromptVersion,
    getActivePrompt: getActivePrompt,
    compareVersions: compareVersions
};
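
To close out steps 5 and 6, a small promotion helper can sit on top of these functions. This is a sketch that assumes the module above is saved as prompt-manager.js; the 14-day window and the high-confidence requirement are choices you should tune:

var promptManager = require("./prompt-manager");

// Promote the candidate version only when the comparison has enough data
// and a clear winner; otherwise keep collecting feedback.
function promoteIfWinner(templateName, versionA, versionB) {
    return promptManager.compareVersions(templateName, versionA, versionB, 14)
        .then(function(cmp) {
            if (cmp.winner && cmp.confidence === "high") {
                return promptManager.activatePromptVersion(templateName, cmp.winner);
            }
            return null;
        });
}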

A/B Testing Improvements Based on Feedback Data

A/B testing AI prompts is different from A/B testing UI changes. The variance in LLM output is higher, so you need more samples to reach statistical significance. I use a simple traffic-splitting approach at the application layer:

var crypto = require("crypto");
var { Pool } = require("pg");
var promptManager = require("./prompt-manager");

var pool = new Pool({ connectionString: process.env.POSTGRES_CONNECTION_STRING });

function selectPromptVersion(templateName, userId) {
    // Use consistent hashing so the same user always gets the same version
    var hash = crypto.createHash("md5").update(userId + templateName).digest("hex");
    var bucket = parseInt(hash.substring(0, 8), 16) % 100;

    return getExperimentConfig(templateName).then(function(config) {
        if (!config || !config.is_running) {
            return promptManager.getActivePrompt(templateName);
        }

        // Split traffic according to config
        if (bucket < config.control_percentage) {
            return promptManager.getActivePrompt(templateName);
        } else {
            return getPromptByVersion(templateName, config.candidate_version);
        }
    });
}

function getExperimentConfig(templateName) {
    // In production, store this in the database or a config service
    var experiments = {
        "summarize-article": {
            is_running: true,
            control_percentage: 50,
            candidate_version: "vlq8k2f"
        }
    };
    return Promise.resolve(experiments[templateName] || null);
}

function getPromptByVersion(templateName, version) {
    // Reuse the module-level pool instead of creating a new one per call,
    // which would leak connections under load.
    var query = "SELECT * FROM ai_prompt_versions WHERE template_name = $1 AND version = $2 LIMIT 1";
    return pool.query(query, [templateName, version]).then(function(result) {
        return result.rows[0] || null;
    });
}

module.exports = {
    selectPromptVersion: selectPromptVersion
};

The consistent hashing on userId + templateName ensures a user always sees the same version for the same feature. This avoids the jarring experience of quality fluctuating between requests.
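
If you want something firmer than the 0.2-point heuristic in compareVersions, a two-proportion z-test on the binary positive rates is a reasonable upgrade (a sketch; 1.96 is the critical value for roughly 95% confidence, and n1/n2 should ideally be the counts of binary ratings rather than all feedback):

// Two-proportion z-test on the positive rates of two prompt versions.
// a and b are entries from compareVersions(...).versions, e.g.
// { positive_rate: 0.62, feedback_count: 180 }.
function isSignificantDifference(a, b) {
    var p1 = a.positive_rate, n1 = a.feedback_count;
    var p2 = b.positive_rate, n2 = b.feedback_count;

    // Pooled proportion under the null hypothesis that both rates are equal
    var pooled = (p1 * n1 + p2 * n2) / (n1 + n2);
    var se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
    if (se === 0) return false;

    var z = Math.abs(p1 - p2) / se;
    return z > 1.96; // ~95% confidence, two-tailed
}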

Automated Quality Scoring from Feedback Trends

Instead of manually checking dashboards, set up automated alerts when quality degrades. This cron job runs every hour and flags features that have dropped below acceptable thresholds:

var cron = require("node-cron");
var { Pool } = require("pg");
var pool = new Pool({ connectionString: process.env.POSTGRES_CONNECTION_STRING });

function checkQualityAlerts() {
    var query = [
        "WITH recent AS (",
        "  SELECT feature_name,",
        "    AVG(rating_value)::numeric(3,2) as current_avg,",
        "    COUNT(*) as sample_size",
        "  FROM ai_feedback",
        "  WHERE created_at >= NOW() - INTERVAL '24 hours'",
        "  GROUP BY feature_name",
        "  HAVING COUNT(*) >= 10",
        "),",
        "baseline AS (",
        "  SELECT feature_name,",
        "    AVG(rating_value)::numeric(3,2) as baseline_avg",
        "  FROM ai_feedback",
        "  WHERE created_at >= NOW() - INTERVAL '30 days'",
        "    AND created_at < NOW() - INTERVAL '24 hours'",
        "  GROUP BY feature_name",
        ")",
        "SELECT r.feature_name, r.current_avg, b.baseline_avg, r.sample_size,",
        "  (r.current_avg - b.baseline_avg)::numeric(3,2) as delta",
        "FROM recent r",
        "JOIN baseline b ON r.feature_name = b.feature_name",
        "WHERE r.current_avg < b.baseline_avg - 0.3",
        "ORDER BY delta ASC"
    ].join(" ");

    return pool.query(query).then(function(result) {
        result.rows.forEach(function(alert) {
            console.error(
                "[QUALITY ALERT] %s dropped from %s to %s (delta: %s, n=%d)",
                alert.feature_name,
                alert.baseline_avg,
                alert.current_avg,
                alert.delta,
                alert.sample_size
            );
            // In production: send to Slack, PagerDuty, etc.
            sendAlert(alert);
        });
        return result.rows;
    });
}

function sendAlert(alert) {
    // Placeholder for your alerting integration
    console.log("Alert sent for feature:", alert.feature_name);
}

// Run every hour
cron.schedule("0 * * * *", function() {
    checkQualityAlerts().catch(function(err) {
        console.error("Quality alert check failed:", err.message);
    });
});

module.exports = { checkQualityAlerts: checkQualityAlerts };

Building Feedback Dashboards

The dashboard API serves data for a frontend that product managers and engineers can use to monitor AI quality. Here are the key endpoints:

var express = require("express");
var router = express.Router();
var { Pool } = require("pg");
var pool = new Pool({ connectionString: process.env.POSTGRES_CONNECTION_STRING });

// Quality trend over time for a feature
router.get("/api/dashboard/trends/:featureName", function(req, res) {
    var days = parseInt(req.query.days, 10) || 30;
    var query = [
        "SELECT",
        "  DATE_TRUNC('day', created_at) as day,",
        "  COUNT(*) as feedback_count,",
        "  AVG(rating_value)::numeric(3,2) as avg_rating,",
        "  COUNT(CASE WHEN rating_type = 'binary' AND rating_value = 1 THEN 1 END) as positive,",
        "  COUNT(CASE WHEN rating_type = 'binary' AND rating_value = 0 THEN 1 END) as negative",
        "FROM ai_feedback",
        "WHERE feature_name = $1 AND created_at >= NOW() - ($2 || ' days')::interval",
        "GROUP BY DATE_TRUNC('day', created_at)",
        "ORDER BY day ASC"
    ].join(" ");

    pool.query(query, [req.params.featureName, days.toString()], function(err, result) {
        if (err) {
            return res.status(500).json({ error: "Query failed" });
        }
        res.json({ feature: req.params.featureName, trends: result.rows });
    });
});

// Feature comparison - all features ranked by quality
router.get("/api/dashboard/features", function(req, res) {
    var query = [
        "SELECT",
        "  feature_name,",
        "  COUNT(*) as total_feedback,",
        "  AVG(rating_value)::numeric(3,2) as avg_rating,",
        "  (COUNT(CASE WHEN rating_type = 'binary' AND rating_value = 0 THEN 1 END)::numeric /",
        "    NULLIF(COUNT(CASE WHEN rating_type = 'binary' THEN 1 END), 0))::numeric(5,4) as negative_rate",
        "FROM ai_feedback",
        "WHERE created_at >= NOW() - INTERVAL '30 days'",
        "GROUP BY feature_name",
        "HAVING COUNT(*) >= 5",
        "ORDER BY avg_rating ASC"
    ].join(" ");

    pool.query(query, function(err, result) {
        if (err) {
            return res.status(500).json({ error: "Query failed" });
        }
        res.json({ features: result.rows });
    });
});

// Most common negative feedback categories
router.get("/api/dashboard/categories/:featureName", function(req, res) {
    var query = [
        "SELECT",
        "  unnest(rating_categories) as category,",
        "  COUNT(*) as count",
        "FROM ai_feedback",
        "WHERE feature_name = $1",
        "  AND rating_value = 0",
        "  AND created_at >= NOW() - INTERVAL '30 days'",
        "GROUP BY unnest(rating_categories)",
        "ORDER BY count DESC",
        "LIMIT 10"
    ].join(" ");

    pool.query(query, [req.params.featureName], function(err, result) {
        if (err) {
            return res.status(500).json({ error: "Query failed" });
        }
        res.json({ feature: req.params.featureName, categories: result.rows });
    });
});

module.exports = router;

Closing the Loop: Showing Users Their Impact

Users who provide feedback are more likely to keep providing feedback if they see it making a difference. I add a simple notification when a feature they rated has improved:

function getUserImpactSummary(userId) {
    var query = [
        "WITH user_negative AS (",
        "  SELECT DISTINCT feature_name, MIN(created_at) as first_feedback",
        "  FROM ai_feedback",
        "  WHERE user_id = $1 AND rating_value = 0 AND rating_type = 'binary'",
        "  GROUP BY feature_name",
        "),",
        "improvements AS (",
        "  SELECT un.feature_name,",
        "    un.first_feedback,",
        "    qs_before.avg_rating as rating_before,",
        "    qs_after.avg_rating as rating_after",
        "  FROM user_negative un",
        "  LEFT JOIN LATERAL (",
        "    SELECT avg_rating FROM ai_quality_scores",
        "    WHERE feature_name = un.feature_name AND period_end <= un.first_feedback",
        "    ORDER BY period_end DESC LIMIT 1",
        "  ) qs_before ON true",
        "  LEFT JOIN LATERAL (",
        "    SELECT avg_rating FROM ai_quality_scores",
        "    WHERE feature_name = un.feature_name AND period_start >= un.first_feedback",
        "    ORDER BY period_start DESC LIMIT 1",
        "  ) qs_after ON true",
        "  WHERE qs_after.avg_rating > qs_before.avg_rating",
        ")",
        "SELECT * FROM improvements ORDER BY (rating_after - rating_before) DESC"
    ].join(" ");

    var { Pool } = require("pg");
    var pool = new Pool({ connectionString: process.env.POSTGRES_CONNECTION_STRING });

    return pool.query(query, [userId]).then(function(result) {
        return result.rows.map(function(row) {
            return {
                feature: row.feature_name,
                improvement: (parseFloat(row.rating_after) - parseFloat(row.rating_before)).toFixed(2),
                message: "Your feedback helped improve " + row.feature_name + " quality by " +
                    ((parseFloat(row.rating_after) - parseFloat(row.rating_before)) * 100).toFixed(0) + "%"
            };
        });
    });
}

Handling Feedback Bias

User feedback is inherently biased. People are far more likely to leave negative feedback than positive. In my experience, the ratio is roughly 3:1 negative to positive when using binary feedback. Here is how to account for that:

Normalize for negativity bias. If your overall positive rate is 25%, a feature with a 30% positive rate is actually performing above average, even though 70% of feedback is negative. Always compare features against each other, not against an absolute standard.

Track feedback rate, not just feedback score. If a feature gets AI responses 10,000 times per day but only 50 feedback submissions, your sample is 0.5%. That is normal -- but it means your data is heavily skewed toward users who had strong reactions. Monitor the feedback rate alongside the scores.
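
Computing the feedback rate requires knowing how many AI responses were served, not just how many were rated. A sketch, assuming you also log every AI call to an ai_interactions table (which is not part of the schema above):

// Assumes a hypothetical ai_interactions table with one row per AI call
// (feature_name, created_at) -- it is not defined in the schema earlier
// in this article, so add it if you want true feedback rates.
function getFeedbackRate(pool, featureName, days) {
    days = days || 30;
    var query = [
        "SELECT",
        "  (SELECT COUNT(*) FROM ai_interactions",
        "     WHERE feature_name = $1 AND created_at >= NOW() - ($2 || ' days')::interval) AS total_interactions,",
        "  (SELECT COUNT(*) FROM ai_feedback",
        "     WHERE feature_name = $1 AND created_at >= NOW() - ($2 || ' days')::interval) AS feedback_count"
    ].join(" ");

    return pool.query(query, [featureName, days.toString()]).then(function(result) {
        var total = parseInt(result.rows[0].total_interactions, 10);
        var count = parseInt(result.rows[0].feedback_count, 10);
        return {
            total_interactions: total,
            feedback_count: count,
            feedback_rate: total > 0 ? count / total : null
        };
    });
}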

Weight recent feedback more heavily. Stale feedback from before a prompt change distorts your averages. Use a decaying weight or sliding window.

function getWeightedRating(featureName, halfLifeDays) {
    halfLifeDays = halfLifeDays || 14;
    var { Pool } = require("pg");
    var pool = new Pool({ connectionString: process.env.POSTGRES_CONNECTION_STRING });

    var query = [
        "SELECT",
        "  (SUM(rating_value * EXP(-0.693 * EXTRACT(EPOCH FROM NOW() - created_at) / ($2 * 86400))) /",
        "   NULLIF(SUM(EXP(-0.693 * EXTRACT(EPOCH FROM NOW() - created_at) / ($2 * 86400))), 0))::numeric(3,2) as weighted_avg",
        "FROM ai_feedback",
        "WHERE feature_name = $1 AND rating_type = 'binary'",
        "  AND created_at >= NOW() - INTERVAL '90 days'"
    ].join(" ");

    return pool.query(query, [featureName, halfLifeDays]).then(function(result) {
        return parseFloat(result.rows[0].weighted_avg);
    });
}

Integrating Feedback with Model Selection

When feedback consistently shows one model outperforming another for a specific feature, automate the switch. This function checks recent feedback and recommends (or automatically applies) model changes:

function evaluateModelPerformance(featureName) {
    var { Pool } = require("pg");
    var pool = new Pool({ connectionString: process.env.POSTGRES_CONNECTION_STRING });

    var query = [
        "SELECT",
        "  model_name,",
        "  COUNT(*) as feedback_count,",
        "  AVG(rating_value)::numeric(3,2) as avg_rating,",
        "  AVG(response_latency_ms)::integer as avg_latency,",
        "  AVG(token_count)::integer as avg_tokens",
        "FROM ai_feedback",
        "WHERE feature_name = $1",
        "  AND created_at >= NOW() - INTERVAL '14 days'",
        "GROUP BY model_name",
        "HAVING COUNT(*) >= 20",
        "ORDER BY avg_rating DESC"
    ].join(" ");

    return pool.query(query, [featureName]).then(function(result) {
        var models = result.rows;
        if (models.length < 2) {
            return { recommendation: "insufficient_data", models: models };
        }

        var best = models[0];
        // models is ordered by avg_rating DESC, so the last entry is the
        // weakest performer over the window -- treated here as the one to switch away from
        var current = models[models.length - 1];

        var ratingDiff = parseFloat(best.avg_rating) - parseFloat(current.avg_rating);
        if (ratingDiff > 0.2) {
            return {
                recommendation: "switch",
                from: current.model_name,
                to: best.model_name,
                rating_improvement: ratingDiff.toFixed(2),
                latency_impact: best.avg_latency - current.avg_latency,
                models: models
            };
        }

        return { recommendation: "keep_current", models: models };
    });
}

Privacy Considerations for Feedback Data

Feedback data contains prompts, responses, and potentially user-identifiable information. Handle it carefully:

  • Hash user IDs before storing them in the feedback table if you only need them for deduplication and consistency (A/B testing). Use a one-way hash with a salt.
  • Redact PII from prompt/response text before storage. Run a regex pass to strip emails, phone numbers, and common ID patterns.
  • Set retention policies. I keep detailed feedback for 90 days, then aggregate it into quality scores and delete the raw rows. The aggregated data has no PII.
  • Separate storage concerns. The feedback database should not be the same database that stores user accounts. If the feedback database is compromised, the attacker should not get user credentials.

Here is a sanitizer that applies the hashing and redaction rules above before anything is written:

var crypto = require("crypto");

function sanitizeFeedbackData(data) {
    var sanitized = Object.assign({}, data);

    // Hash user ID
    if (sanitized.user_id) {
        var salt = process.env.FEEDBACK_HASH_SALT || "default-change-me";
        sanitized.user_id = crypto.createHash("sha256")
            .update(sanitized.user_id + salt)
            .digest("hex")
            .substring(0, 32);
    }

    // Redact common PII patterns
    var piiPatterns = [
        /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,  // emails
        /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g,                       // phone numbers
        /\b\d{3}-\d{2}-\d{4}\b/g,                                // SSN
        /\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b/g          // credit cards
    ];

    var textFields = ["prompt_text", "response_text", "correction_text", "feedback_comment"];
    textFields.forEach(function(field) {
        if (sanitized[field]) {
            piiPatterns.forEach(function(pattern) {
                sanitized[field] = sanitized[field].replace(pattern, "[REDACTED]");
            });
        }
    });

    return sanitized;
}

Complete Working Example

Here is the complete Express.js application that ties everything together. This gives you a working feedback system with collection, analysis, dashboards, and prompt management.

// app.js - Complete AI Feedback System
var express = require("express");
var bodyParser = require("body-parser");
var { Pool } = require("pg");
var crypto = require("crypto");
var cron = require("node-cron");

var app = express();
app.use(bodyParser.json({ limit: "1mb" }));

var pool = new Pool({
    connectionString: process.env.POSTGRES_CONNECTION_STRING
});

// --- Middleware: sanitize feedback data ---
function sanitizeMiddleware(req, res, next) {
    if (req.body && req.body.user_id) {
        var salt = process.env.FEEDBACK_HASH_SALT || "change-this-salt";
        req.body.user_id = crypto.createHash("sha256")
            .update(req.body.user_id + salt)
            .digest("hex")
            .substring(0, 32);
    }
    next();
}

// --- POST /api/feedback - Collect user feedback ---
app.post("/api/feedback", sanitizeMiddleware, function(req, res) {
    var body = req.body;

    if (!body.interaction_id || !body.feature_name || !body.rating_type) {
        return res.status(400).json({
            error: "Required: interaction_id, feature_name, rating_type"
        });
    }

    var ratingValue = parseInt(body.rating_value, 10);
    if (isNaN(ratingValue)) {
        return res.status(400).json({ error: "rating_value must be a number" });
    }

    var query = [
        "INSERT INTO ai_feedback",
        "(interaction_id, feature_name, user_id, session_id,",
        "prompt_text, prompt_template, prompt_version,",
        "response_text, model_name, model_params,",
        "rating_type, rating_value, rating_categories,",
        "correction_text, feedback_comment,",
        "response_latency_ms, token_count, metadata)",
        "VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18)",
        "RETURNING id, created_at"
    ].join(" ");

    var params = [
        body.interaction_id, body.feature_name,
        body.user_id || null, body.session_id || null,
        body.prompt_text || "", body.prompt_template || null,
        body.prompt_version || null, body.response_text || "",
        body.model_name || "unknown", JSON.stringify(body.model_params || {}),
        body.rating_type, ratingValue, body.rating_categories || [],
        body.correction_text || null, body.feedback_comment || null,
        body.response_latency_ms || null, body.token_count || null,
        JSON.stringify(body.metadata || {})
    ];

    pool.query(query, params, function(err, result) {
        if (err) {
            console.error("Feedback insert error:", err.message);
            return res.status(500).json({ error: "Failed to store feedback" });
        }
        res.status(201).json({
            feedback_id: result.rows[0].id,
            message: "Feedback recorded"
        });
    });
});

// --- GET /api/feedback/stats/:feature - Feature quality stats ---
app.get("/api/feedback/stats/:feature", function(req, res) {
    var days = parseInt(req.query.days, 10) || 30;

    var query = [
        "SELECT COUNT(*) as total,",
        "  AVG(rating_value)::numeric(3,2) as avg_rating,",
        "  COUNT(CASE WHEN rating_value = 1 AND rating_type='binary' THEN 1 END) as thumbs_up,",
        "  COUNT(CASE WHEN rating_value = 0 AND rating_type='binary' THEN 1 END) as thumbs_down,",
        "  model_name, prompt_version",
        "FROM ai_feedback",
        "WHERE feature_name = $1 AND created_at >= NOW() - ($2 || ' days')::interval",
        "GROUP BY model_name, prompt_version ORDER BY total DESC"
    ].join(" ");

    pool.query(query, [req.params.feature, days.toString()], function(err, result) {
        if (err) return res.status(500).json({ error: err.message });
        res.json({ feature: req.params.feature, days: days, stats: result.rows });
    });
});

// --- GET /api/dashboard/trends/:feature - Daily quality trend ---
app.get("/api/dashboard/trends/:feature", function(req, res) {
    var days = parseInt(req.query.days, 10) || 30;

    var query = [
        "SELECT DATE_TRUNC('day', created_at) as day,",
        "  COUNT(*) as count,",
        "  AVG(rating_value)::numeric(3,2) as avg_rating",
        "FROM ai_feedback",
        "WHERE feature_name = $1 AND created_at >= NOW() - ($2 || ' days')::interval",
        "GROUP BY DATE_TRUNC('day', created_at) ORDER BY day"
    ].join(" ");

    pool.query(query, [req.params.feature, days.toString()], function(err, result) {
        if (err) return res.status(500).json({ error: err.message });
        res.json({ trends: result.rows });
    });
});

// --- GET /api/dashboard/problems - Problematic prompts ---
app.get("/api/dashboard/problems", function(req, res) {
    var threshold = parseFloat(req.query.threshold) || 0.5;

    var query = [
        "SELECT prompt_template, prompt_version, feature_name,",
        "  COUNT(*) as n, AVG(rating_value)::numeric(3,2) as avg_rating",
        "FROM ai_feedback",
        "WHERE created_at >= NOW() - INTERVAL '7 days'",
        "  AND prompt_template IS NOT NULL",
        "GROUP BY prompt_template, prompt_version, feature_name",
        "HAVING AVG(rating_value) < $1 AND COUNT(*) >= 10",
        "ORDER BY avg_rating"
    ].join(" ");

    pool.query(query, [threshold], function(err, result) {
        if (err) return res.status(500).json({ error: err.message });
        res.json({ problematic_prompts: result.rows });
    });
});

// --- GET /api/dashboard/models/:feature - Model comparison ---
app.get("/api/dashboard/models/:feature", function(req, res) {
    var query = [
        "SELECT model_name,",
        "  COUNT(*) as n,",
        "  AVG(rating_value)::numeric(3,2) as avg_rating,",
        "  AVG(response_latency_ms)::integer as avg_latency",
        "FROM ai_feedback",
        "WHERE feature_name = $1 AND created_at >= NOW() - INTERVAL '14 days'",
        "GROUP BY model_name HAVING COUNT(*) >= 10 ORDER BY avg_rating DESC"
    ].join(" ");

    pool.query(query, [req.params.feature], function(err, result) {
        if (err) return res.status(500).json({ error: err.message });
        res.json({ models: result.rows });
    });
});

// --- Cron: Hourly quality check ---
cron.schedule("0 * * * *", function() {
    var query = [
        "WITH recent AS (",
        "  SELECT feature_name, AVG(rating_value)::numeric(3,2) as now_avg, COUNT(*) as n",
        "  FROM ai_feedback WHERE created_at >= NOW() - INTERVAL '24 hours'",
        "  GROUP BY feature_name HAVING COUNT(*) >= 10",
        "),",
        "baseline AS (",
        "  SELECT feature_name, AVG(rating_value)::numeric(3,2) as base_avg",
        "  FROM ai_feedback WHERE created_at BETWEEN NOW() - INTERVAL '30 days' AND NOW() - INTERVAL '24 hours'",
        "  GROUP BY feature_name",
        ")",
        "SELECT r.feature_name, r.now_avg, b.base_avg, (r.now_avg - b.base_avg)::numeric(3,2) as delta",
        "FROM recent r JOIN baseline b ON r.feature_name = b.feature_name",
        "WHERE r.now_avg < b.base_avg - 0.3"
    ].join(" ");

    pool.query(query).then(function(result) {
        result.rows.forEach(function(row) {
            console.error("[QUALITY ALERT] %s: %s -> %s (delta %s)", row.feature_name, row.base_avg, row.now_avg, row.delta);
        });
    }).catch(function(err) {
        console.error("Quality check failed:", err.message);
    });
});

// --- Start server ---
var PORT = process.env.PORT || 3500;
app.listen(PORT, function() {
    console.log("AI Feedback System running on port " + PORT);
});

Run it:

npm install express body-parser pg node-cron
export POSTGRES_CONNECTION_STRING="postgresql://user:pass@localhost:5432/ai_feedback"
node app.js

Test it:

# Submit feedback
curl -X POST http://localhost:3500/api/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "interaction_id": "int_abc123",
    "feature_name": "article-summarizer",
    "rating_type": "binary",
    "rating_value": 0,
    "rating_categories": ["inaccurate", "too_long"],
    "feedback_comment": "Summary missed the main point of the article",
    "prompt_text": "Summarize the following article...",
    "response_text": "This article discusses various topics...",
    "model_name": "gpt-4o-mini",
    "prompt_template": "summarize-article",
    "prompt_version": "v1"
  }'

# Check stats
curl http://localhost:3500/api/feedback/stats/article-summarizer?days=7

# View trends
curl http://localhost:3500/api/dashboard/trends/article-summarizer?days=30

# Find problematic prompts
curl http://localhost:3500/api/dashboard/problems?threshold=0.5

# Compare models
curl http://localhost:3500/api/dashboard/models/article-summarizer

Expected output for the feedback submission:

{
  "feedback_id": 1,
  "message": "Feedback recorded"
}

Expected output for stats:

{
  "feature": "article-summarizer",
  "days": 7,
  "stats": [
    {
      "total": "47",
      "avg_rating": "0.38",
      "thumbs_up": "18",
      "thumbs_down": "29",
      "model_name": "gpt-4o-mini",
      "prompt_version": "v1"
    }
  ]
}

Common Issues and Troubleshooting

1. Feedback inserts fail with array type mismatch

error: malformed array literal: "inaccurate"
DETAIL: Array value must start with "{" or dimension information

PostgreSQL expects arrays in a specific format. When passing a JavaScript array through pg, make sure it is a proper array object, not a JSON string. The pg driver handles native arrays correctly, but if you are manually constructing the query with string interpolation, you will hit this. Always use parameterized queries and pass actual JavaScript arrays.
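
For example (a sketch; the row id is illustrative):

// Correct: pass a real JavaScript array -- the pg driver serializes it to TEXT[]
pool.query(
    "UPDATE ai_feedback SET rating_categories = $1 WHERE id = $2",
    [["inaccurate", "too_long"], 42]
);

// Wrong: a bare string (or a JSON-stringified array) is just text,
// and PostgreSQL rejects it as a malformed array literal
pool.query(
    "UPDATE ai_feedback SET rating_categories = $1 WHERE id = $2",
    ["inaccurate", 42]
);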

2. Quality scores show NaN after division

avg_rating: NaN

This happens when you divide by zero in aggregation queries. Use NULLIF around any denominator that could be zero:

-- Wrong
COUNT(positive)::numeric / COUNT(total) as rate

-- Correct
COUNT(positive)::numeric / NULLIF(COUNT(total), 0) as rate

With a NULL denominator, the division returns NULL instead of raising a division-by-zero error, and your application code should treat NULL as "insufficient data".

3. Connection pool exhaustion under load

Error: too many clients already
TimeoutError: Connection terminated due to connection timeout

Each feedback submission opens a database connection. Under high traffic, the default pool size (10) is not enough. Increase it and add connection timeout handling:

var pool = new Pool({
    connectionString: process.env.POSTGRES_CONNECTION_STRING,
    max: 30,
    idleTimeoutMillis: 30000,
    connectionTimeoutMillis: 5000
});

Also consider batching feedback writes. Buffer submissions in memory for 5 seconds, then flush them in a single multi-row INSERT. This reduces connection pressure dramatically.
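
A minimal batching sketch (the column list is trimmed to the NOT NULL columns from the schema, and error handling and backpressure are left out):

var { Pool } = require("pg");

var pool = new Pool({
    connectionString: process.env.POSTGRES_CONNECTION_STRING,
    max: 30
});

var buffer = [];
var FLUSH_INTERVAL_MS = 5000;

function enqueueFeedback(row) {
    buffer.push(row);
}

function flushFeedback() {
    if (buffer.length === 0) return Promise.resolve();
    var rows = buffer.splice(0, buffer.length);

    // NOT NULL columns only; extend the list to cover the full schema
    var cols = ["interaction_id", "feature_name", "prompt_text", "response_text",
                "model_name", "rating_type", "rating_value"];
    var params = [];
    var placeholders = rows.map(function(row, i) {
        var slots = cols.map(function(col, j) {
            params.push(row[col]);
            return "$" + (i * cols.length + j + 1);
        });
        return "(" + slots.join(",") + ")";
    });

    var query = "INSERT INTO ai_feedback (" + cols.join(", ") + ") VALUES " + placeholders.join(", ");
    return pool.query(query, params);
}

// Flush the buffer every few seconds instead of writing once per request
setInterval(function() {
    flushFeedback().catch(function(err) {
        console.error("Feedback flush failed:", err.message);
    });
}, FLUSH_INTERVAL_MS);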

4. A/B test results are inconsistent between page loads

User reports seeing different AI quality on each refresh

This means your hashing function is not deterministic. Check that you are using the same salt and that the user ID is consistent (not a session ID that changes). The MD5 hash of userId + templateName should always produce the same bucket assignment. If you are using anonymous users, fall back to a cookie-based ID that persists across sessions.

5. Feedback timestamps are wrong across time zones

Feedback shows as submitted "tomorrow" for users in timezones ahead of UTC

Always store timestamps as TIMESTAMP WITH TIME ZONE in PostgreSQL and use NOW() for server-generated timestamps. When displaying to users, convert to their local timezone in the frontend, never on the server. If you see future-dated feedback, your server's timezone is misconfigured. Set timezone = 'UTC' in postgresql.conf or use SET timezone = 'UTC' at the session level.

Best Practices

  • Always store full context with feedback. The prompt text, response text, model name, and version are essential for debugging. Feedback without context is just a number -- you cannot act on it.

  • Use interaction IDs to link feedback to AI calls. Generate a UUID when you make the AI call, pass it to the frontend, and require it on the feedback endpoint. This creates an unbreakable chain from prompt to response to user reaction.

  • Set minimum sample sizes before making decisions. Do not change a prompt based on 5 negative reviews. I use a minimum of 30 feedback entries with at least 10 from unique users before considering a prompt change. Small samples are noise, not signal.

  • Track feedback rate alongside feedback scores. A feature with a 90% positive rate but only 0.1% feedback rate is not necessarily good -- it might just mean nobody bothers to rate it. If feedback rate drops, your feedback UI might be broken or invisible.

  • Implement feedback deduplication. Users sometimes click the thumbs-down button multiple times, or submit feedback for the same interaction from different tabs. Deduplicate on interaction_id + user_id and keep only the most recent submission (a sketch follows after this list).

  • Version your prompts explicitly. Never modify a prompt in place. Create a new version, A/B test it, and promote the winner. This gives you a clear audit trail and the ability to roll back when an "improvement" turns out to be a regression.

  • Separate the feedback database from production data. Feedback writes should never compete with your main application for database connections. Use a separate connection pool or a separate database entirely. If the feedback system goes down, the main application should keep working.

  • Build retention policies from day one. Feedback data grows fast. Plan for 90-day raw data retention with permanent aggregated scores. This keeps your database manageable while preserving long-term trends. Run a weekly job that aggregates old feedback into ai_quality_scores and deletes the raw rows.

  • Close the loop with users. When feedback leads to an improvement, tell the users who provided that feedback. This increases future feedback rates dramatically. Even a simple "We improved article summaries based on feedback from users like you" goes a long way.
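
For the deduplication point above, one option is a unique index plus an upsert. This is a sketch, since the schema earlier does not define this constraint; you may also want rating_type in the key if you accept both ratings and corrections for the same interaction:

-- Not part of the schema above; add it if you want database-level dedup.
-- Note: rows with a NULL user_id (anonymous feedback) will not be deduplicated.
CREATE UNIQUE INDEX IF NOT EXISTS idx_feedback_dedup
    ON ai_feedback(interaction_id, user_id);

-- Then turn the insert into an upsert that keeps the latest submission
INSERT INTO ai_feedback (interaction_id, feature_name, user_id, prompt_text,
    response_text, model_name, rating_type, rating_value, rating_categories, feedback_comment)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
ON CONFLICT (interaction_id, user_id) DO UPDATE SET
    rating_type = EXCLUDED.rating_type,
    rating_value = EXCLUDED.rating_value,
    rating_categories = EXCLUDED.rating_categories,
    feedback_comment = EXCLUDED.feedback_comment,
    created_at = NOW();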
