Flaky Test Detection and Management

Detect, quarantine, and resolve flaky tests in Azure DevOps with statistical analysis, retry strategies, and automated tracking

Flaky tests are the silent killers of developer productivity. A test that passes one run and fails the next without any code change erodes trust in your entire test suite, trains developers to ignore failures, and grinds CI pipelines to a halt. This article covers how to systematically detect, quarantine, and eliminate flaky tests using Azure DevOps, statistical analysis, and battle-tested retry strategies in Node.js.

Prerequisites

  • Node.js 18 or later installed
  • An Azure DevOps organization with Test Plans enabled
  • A personal access token (PAT) with Test Management read/write permissions
  • Familiarity with Jest or Mocha test frameworks
  • Basic understanding of CI/CD pipelines in Azure DevOps

What Makes a Test Flaky

A flaky test is any test that produces inconsistent results against the same codebase. The definition sounds simple, but the root causes are anything but. A test is flaky when its outcome depends on something other than the code under test. That dependency could be time, ordering, external state, concurrency, or environment configuration.

The critical distinction is between a genuinely flaky test and a test that is correctly detecting an intermittent bug. Before you quarantine a test, make sure you are not masking a real defect. I have seen teams quarantine tests that were actually catching race conditions in production code. The test was not flaky. The code was.

Common Causes of Flaky Tests in Node.js

Timing and Async Issues

Node.js is inherently asynchronous, and timing-dependent tests are the number one source of flakiness. Tests that rely on setTimeout, setInterval, or assume a particular execution order of promises will eventually fail.

// FLAKY: depends on timing
it("should update cache after delay", function (done) {
  cache.set("key", "value");
  setTimeout(function () {
    // This might execute before the internal cache refresh
    assert.equal(cache.get("key"), "updated-value");
    done();
  }, 100);
});

// DETERMINISTIC: wait for the actual event
it("should update cache after delay", function (done) {
  cache.set("key", "value");
  cache.on("refreshed", function () {
    assert.equal(cache.get("key"), "updated-value");
    done();
  });
});

Shared Mutable State

Tests that share state through module-level variables, databases, or file systems will interfere with each other. The classic symptom is tests that pass in isolation but fail when run together.

// FLAKY: shared state between tests
var userCount = 0;

describe("User creation", function () {
  it("should create first user", function () {
    userCount++;
    assert.equal(userCount, 1);
  });

  it("should create second user", function () {
    userCount++;
    assert.equal(userCount, 2); // Fails if test order changes
  });
});

// DETERMINISTIC: isolated state
describe("User creation", function () {
  var userCount;

  beforeEach(function () {
    userCount = 0;
  });

  it("should increment from zero", function () {
    userCount++;
    assert.equal(userCount, 1);
  });
});

Network Dependencies

Tests that call real APIs or external services are inherently unreliable. DNS resolution, rate limiting, SSL handshake timeouts, and transient network errors all introduce non-determinism.
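
The same FLAKY/DETERMINISTIC contrast applies here. As a minimal sketch (ratesClient and createRatesClient are hypothetical names, not from a real library), inject a stub transport so the test never leaves the process:

// FLAKY: hits the real service, so DNS failures, rate limits, and transient
// 5xx responses all leak into the test result
it("should fetch exchange rates", function () {
  return ratesClient.fetchLatest().then(function (rates) {
    assert.ok(rates.USD > 0);
  });
});

// DETERMINISTIC: inject a stub HTTP client with a canned response
it("should fetch exchange rates", function () {
  var stubHttp = {
    get: function () {
      return Promise.resolve({ USD: 1.08, GBP: 0.92 });
    },
  };
  var client = createRatesClient(stubHttp);
  return client.fetchLatest().then(function (rates) {
    assert.equal(rates.USD, 1.08);
  });
});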

Test Order Dependencies

Some test frameworks do not guarantee execution order. If test B depends on a side effect of test A, you have a time bomb in your suite.

Resource Contention

Tests that bind to specific ports, write to shared files, or compete for database connections will fail when run in parallel. This is especially common in Node.js integration tests that spin up Express servers.

// FLAKY: hardcoded port causes EADDRINUSE in parallel runs
var server = app.listen(3000);

// DETERMINISTIC: use port 0 for random available port
var server = app.listen(0, function () {
  var port = server.address().port;
  console.log("Test server on port " + port);
});

Azure DevOps Built-in Flaky Test Detection

Azure DevOps has first-class support for flaky test detection. The system analyzes test results across multiple pipeline runs and automatically flags tests with inconsistent outcomes.

How It Works

Azure DevOps tracks every test result published to a pipeline. When the same test produces both pass and fail results across recent runs on the same branch without code changes to the test file, the system marks it as flaky. This detection uses a sliding window analysis, typically looking at the last 10 to 20 runs.
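
To make the idea concrete, here is a minimal sketch of a sliding-window check; it illustrates the concept only and is not the exact algorithm Azure DevOps uses:

// Flag a test as flaky when both outcomes appear within the recent window
function isFlakyInWindow(outcomes, windowSize) {
  // outcomes: array of "Passed" / "Failed" strings, oldest first
  var recent = outcomes.slice(-windowSize);
  var hasPass = recent.indexOf("Passed") !== -1;
  var hasFail = recent.indexOf("Failed") !== -1;
  return hasPass && hasFail;
}

isFlakyInWindow(["Passed", "Passed", "Failed", "Passed"], 10); // true
isFlakyInWindow(["Failed", "Failed", "Failed"], 10); // false - consistently failing, not flaky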

Enabling Flaky Test Detection

Navigate to Project Settings > Test Management > Flaky Test Detection and enable the feature. You have two main options:

  1. System-detected flaky tests - Azure DevOps automatically identifies flaky tests based on historical results
  2. Custom detection - You mark tests as flaky manually or through the API

When a test is marked flaky, Azure DevOps can optionally exclude it from the pass/fail calculation of your pipeline. This means a flaky test failure will not block your build, but it will still be tracked and visible in reports.

Configuring Flaky Test Settings

In your pipeline YAML, you can control how flaky tests interact with build results:

# azure-pipelines.yml
trigger:
  - main

pool:
  vmImage: "ubuntu-latest"

steps:
  - task: NodeTool@0
    inputs:
      versionSpec: "18.x"

  - script: npm install
    displayName: "Install dependencies"

  - script: npm test -- --ci --reporters=default --reporters=jest-junit
    displayName: "Run tests"
    env:
      JEST_JUNIT_OUTPUT_DIR: "$(System.DefaultWorkingDirectory)/test-results"

  - task: PublishTestResults@2
    inputs:
      testResultsFormat: "JUnit"
      testResultsFiles: "**/test-results/*.xml"
      failTaskOnFailedTests: true
      # Fail the task if results cannot be published at all. Whether flaky
      # tests are excluded from the pass/fail calculation is controlled by the
      # project-level Test Management settings, not by this task input.
      failTaskOnFailureToPublishResults: true
    condition: always()

Analyzing Flaky Test Trends

Raw counts are not enough. You need to understand the flakiness rate over time, which tests are getting worse, and which are improving. Azure DevOps provides analytics through the Test Results Trend widget, but for deeper analysis you should pull data through the API.

var https = require("https");

function getTestResultTrend(organization, project, pipelineId, token) {
  var url =
    "https://dev.azure.com/" +
    organization +
    "/" +
    project +
    "/_apis/test/resulttrendbybuild?buildId=" +
    pipelineId +
    "&api-version=7.1";

  var options = {
    headers: {
      Authorization:
        "Basic " + Buffer.from(":" + token).toString("base64"),
      "Content-Type": "application/json",
    },
  };

  return new Promise(function (resolve, reject) {
    https
      .get(url, options, function (res) {
        var data = "";
        res.on("data", function (chunk) {
          data += chunk;
        });
        res.on("end", function () {
          resolve(JSON.parse(data));
        });
      })
      .on("error", reject); // network errors are emitted on the request, not the response
  });
}

Quarantine Strategies

Quarantining a flaky test means isolating it so it does not block the pipeline while still keeping it visible for resolution. There are several approaches, each with tradeoffs.

Tag-Based Quarantine

The simplest approach is tagging flaky tests and conditionally skipping them in CI:

// Jest: skip quarantined tests in CI
var isCI = process.env.CI === "true";

describe("Payment processing", function () {
  var testFn = isCI ? it.skip : it;

  testFn("should process webhook within timeout", function () {
    // This test is flaky due to external payment gateway timing
  });
});

Separate Quarantine Suite

A better approach is running quarantined tests in a separate pipeline stage that does not block deployment:

stages:
  - stage: MainTests
    jobs:
      - job: RunTests
        steps:
          - script: npm test -- --testPathIgnorePatterns="quarantine"
            displayName: "Run stable tests"

  - stage: QuarantineTests
    dependsOn: []  # Run in parallel, do not block
    jobs:
      - job: RunQuarantineTests
        continueOnError: true
        steps:
          - script: npm test -- --testPathPattern="quarantine"
            displayName: "Run quarantined tests"

Time-Boxed Quarantine

Every quarantined test should have an expiration date. If a test sits in quarantine for more than two weeks without being fixed, escalate it. Either fix the test, delete it, or accept the risk and remove it permanently. Indefinite quarantine is just deletion with extra steps.
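
One way to enforce the time box is a small script in CI that reads a quarantine manifest and fails the build once an entry expires. This is a sketch under the assumption of a quarantine.json file you maintain yourself; it is not an Azure DevOps feature:

// enforce-quarantine.js - fail the build when a quarantine entry has expired.
// Assumed manifest format: [{ "test": "<name>", "expires": "<ISO date>" }]
var fs = require("fs");

var entries = JSON.parse(fs.readFileSync("quarantine.json", "utf-8"));
var expired = entries.filter(function (entry) {
  return new Date(entry.expires).getTime() < Date.now();
});

if (expired.length > 0) {
  expired.forEach(function (entry) {
    console.error("Quarantine expired, fix or delete: " + entry.test);
  });
  process.exit(1);
}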

Automatic Retry Mechanisms

Retries are a pragmatic band-aid. They do not fix flaky tests, but they reduce their impact on developer experience while you work on permanent fixes.

Jest Retry Configuration

Jest has built-in retry support through jest.retryTimes:

// jest.setup.js
jest.retryTimes(2, { logErrorsBeforeRetry: true });

Or per-test:

describe("External API integration", function () {
  beforeAll(function () {
    jest.retryTimes(3);
  });

  it("should fetch user profile", function () {
    return api.getProfile("user-123").then(function (profile) {
      expect(profile.name).toBeDefined();
    });
  });
});

Jest has no retry option in jest.config.js itself (there is no testRetries key); to apply retries globally, load the setup file above through setupFilesAfterEach and configure reporters alongside it. Note that jest.retryTimes only works with the default jest-circus runner:

// jest.config.js
module.exports = {
  // Runs jest.setup.js (which calls jest.retryTimes) before each test file
  setupFilesAfterEach: ["./jest.setup.js"],
  reporters: [
    "default",
    [
      "jest-junit",
      {
        outputDirectory: "./test-results",
        includeConsoleOutput: true,
        reportTestSuiteErrors: true,
      },
    ],
  ],
};

Mocha Retry Patterns

Mocha supports retries natively at both the suite and test level:

describe("Database operations", function () {
  this.retries(3);

  beforeEach(function () {
    return db.connect();
  });

  afterEach(function () {
    return db.disconnect();
  });

  it("should insert and retrieve a document", function () {
    var doc = { name: "test", timestamp: Date.now() };
    return db
      .insert(doc)
      .then(function () {
        return db.findOne({ name: "test" });
      })
      .then(function (result) {
        assert.equal(result.name, "test");
      });
  });
});

You can also set retries per test:

it("should handle concurrent writes", function () {
  this.retries(5);
  // Test body
});

Important: Log every retry. A test that needs retries to pass is a test that needs fixing. Use retry data to prioritize your flaky test backlog.
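
In Mocha, a global afterEach hook can surface every retry in the build log; a minimal sketch:

// Log retried tests so the retry data feeds the flaky-test backlog.
// Mocha exposes the current attempt number via currentRetry().
afterEach(function () {
  var attempt = this.currentTest.currentRetry();
  if (attempt > 0) {
    console.warn(
      "[flaky-retry] " +
        this.currentTest.fullTitle() +
        " needed attempt " +
        (attempt + 1) +
        " (" +
        this.currentTest.state +
        ")"
    );
  }
});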

Detecting Flaky Tests Programmatically

Beyond Azure DevOps built-in detection, you can build your own detection logic. The core idea is to run the same test multiple times and check for inconsistent results.

var childProcess = require("child_process");

// Fold one run's Jest --json output into the shared results map
function recordRun(output, results) {
  var parsed = JSON.parse(output);
  parsed.testResults.forEach(function (suite) {
    suite.testResults.forEach(function (test) {
      var key = suite.testFilePath + "::" + test.fullName;
      if (!results[key]) {
        results[key] = { passes: 0, failures: 0, name: test.fullName };
      }
      if (test.status === "passed") {
        results[key].passes++;
      } else if (test.status === "failed") {
        results[key].failures++;
      }
    });
  });
}

function detectFlakyTests(testCommand, runs) {
  var results = {};

  for (var i = 0; i < runs; i++) {
    try {
      var output = childProcess.execSync(testCommand + " --json", {
        encoding: "utf-8",
        timeout: 300000,
      });
      recordRun(output, results);
    } catch (err) {
      // execSync throws whenever any test fails, but Jest still writes its
      // JSON report to stdout, so parse err.stdout to keep the failing run.
      if (err.stdout) {
        try {
          recordRun(err.stdout, results);
        } catch (parseErr) {
          console.error("Run " + (i + 1) + ": could not parse test output");
        }
      } else {
        console.error("Run " + (i + 1) + " failed: " + err.message);
      }
    }
  }

  var flakyTests = [];
  Object.keys(results).forEach(function (key) {
    var r = results[key];
    if (r.passes > 0 && r.failures > 0) {
      r.flakinessRate = r.failures / (r.passes + r.failures);
      flakyTests.push(r);
    }
  });

  return flakyTests.sort(function (a, b) {
    return b.flakinessRate - a.flakinessRate;
  });
}

Flaky Test Scoring and Prioritization

Not all flaky tests are equally harmful. A flaky test that fails 50% of the time on a critical pipeline is far more damaging than one that fails 2% of the time on a nightly build. Use a scoring system to prioritize fixes.

function calculateFlakyScore(test) {
  // Failure rate: fraction of recorded runs that failed
  var flakinessRate = test.failures / (test.passes + test.failures);

  // Impact: how many pipelines does this test block
  var pipelineImpact = test.affectedPipelines.length;

  // Recency: more recent flakiness is more urgent
  var daysSinceLastFlake = Math.floor(
    (Date.now() - new Date(test.lastFlakeDate).getTime()) / 86400000
  );
  var recencyScore = Math.max(0, 1 - daysSinceLastFlake / 30);

  // Frequency: how often does this test run
  var runsPerDay = test.totalRuns / 30;
  var frequencyScore = Math.min(runsPerDay / 10, 1);

  var score =
    flakinessRate * 0.4 +
    (pipelineImpact / 10) * 0.2 +
    recencyScore * 0.2 +
    frequencyScore * 0.2;

  return Math.round(score * 100);
}

Tests with a score above 70 should be fixed immediately. Between 40 and 70, schedule them in the current sprint. Below 40, add them to the backlog and monitor.
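
For example, a hypothetical test record that fails 20% of the time, touches two pipelines, flaked two days ago, and runs about five times a day scores roughly 41, which puts it in the current sprint:

// Hypothetical record shaped the way calculateFlakyScore expects
var sample = {
  passes: 40,
  failures: 10, // 20% failure rate
  affectedPipelines: ["ci", "nightly"],
  lastFlakeDate: new Date(Date.now() - 2 * 86400000).toISOString(),
  totalRuns: 150, // about 5 runs per day over 30 days
};

console.log(calculateFlakyScore(sample)); // ~41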

Preventing Flaky Tests

The best flaky test is the one you never write. Here are deterministic patterns that eliminate common sources of flakiness.

Use Fake Timers

describe("Token expiration", function () {
  beforeEach(function () {
    jest.useFakeTimers();
  });

  afterEach(function () {
    jest.useRealTimers();
  });

  it("should expire token after 1 hour", function () {
    var token = createToken();
    expect(token.isValid()).toBe(true);

    jest.advanceTimersByTime(3600001);
    expect(token.isValid()).toBe(false);
  });
});

Isolate External Dependencies

var nock = require("nock");

describe("GitHub API client", function () {
  afterEach(function () {
    nock.cleanAll();
  });

  it("should fetch repository details", function () {
    nock("https://api.github.com")
      .get("/repos/owner/repo")
      .reply(200, {
        id: 12345,
        name: "repo",
        full_name: "owner/repo",
      });

    return github.getRepo("owner", "repo").then(function (repo) {
      expect(repo.name).toBe("repo");
    });
  });
});

Deterministic Data Generation

var seedrandom = require("seedrandom");

function createTestUser(seed) {
  var rng = seedrandom(seed);
  var id = Math.floor(rng() * 1000000);
  return {
    id: id,
    email: "user-" + id + "@test.example.com",
    name: "Test User " + id,
  };
}

// Same seed always produces the same user
var user = createTestUser("test-case-42");

Wait for Conditions, Not Time

function waitFor(conditionFn, timeout, interval) {
  timeout = timeout || 5000;
  interval = interval || 100;

  return new Promise(function (resolve, reject) {
    var startTime = Date.now();

    function check() {
      var result = conditionFn();
      if (result) {
        resolve(result);
      } else if (Date.now() - startTime > timeout) {
        reject(new Error("Condition not met within " + timeout + "ms"));
      } else {
        setTimeout(check, interval);
      }
    }

    check();
  });
}

// Usage in tests
it("should process queue item", function () {
  queue.push({ id: 1, data: "test" });

  return waitFor(function () {
    return db.findOne({ id: 1 });
  }).then(function (result) {
    expect(result.processed).toBe(true);
  });
});

CI Pipeline Strategies for Flaky Tests

Selective Retry in Azure Pipelines

steps:
  - script: |
      set -o pipefail  # otherwise $? reports tee's exit code, not npm's
      npm test -- --ci 2>&1 | tee test-output.log
      TEST_EXIT=$?
      if [ $TEST_EXIT -ne 0 ]; then
        echo "##[warning]Tests failed. Retrying failed tests..."
        npm test -- --ci --onlyFailures 2>&1 | tee retry-output.log
        RETRY_EXIT=$?
        if [ $RETRY_EXIT -eq 0 ]; then
          echo "##[warning]Tests passed on retry - possible flaky tests detected"
          echo "##vso[task.setvariable variable=FLAKY_DETECTED]true"
        fi
        exit $RETRY_EXIT
      fi
    displayName: "Run tests with retry"

  - script: |
      node scripts/report-flaky-tests.js
    condition: eq(variables.FLAKY_DETECTED, 'true')
    displayName: "Report flaky tests"

Parallel Test Sharding

Running tests in parallel exposes flakiness caused by shared state. Azure DevOps supports parallel test execution:

strategy:
  parallel: 4

steps:
  - script: |
      npm test -- --ci --shard=$(System.JobPositionInPhase)/$(System.TotalJobsInPhase)
    displayName: "Run test shard"

Building a Flaky Test Dashboard with Node.js

A dedicated dashboard gives your team visibility into test health. Here is a minimal Express-based dashboard that pulls data from Azure DevOps:

var express = require("express");
var https = require("https");

var app = express();
var PORT = process.env.PORT || 3200;

var ORGANIZATION = process.env.ADO_ORG;
var PROJECT = process.env.ADO_PROJECT;
var TOKEN = process.env.ADO_PAT;

function adoRequest(path) {
  var url =
    "https://dev.azure.com/" +
    ORGANIZATION +
    "/" +
    PROJECT +
    "/_apis" +
    path;

  var options = {
    headers: {
      Authorization: "Basic " + Buffer.from(":" + TOKEN).toString("base64"),
      "Content-Type": "application/json",
    },
  };

  return new Promise(function (resolve, reject) {
    https
      .get(url, options, function (res) {
        var data = "";
        res.on("data", function (chunk) {
          data += chunk;
        });
        res.on("end", function () {
          try {
            resolve(JSON.parse(data));
          } catch (e) {
            reject(new Error("Failed to parse response: " + data));
          }
        });
      })
      .on("error", reject); // network errors are emitted on the request
  });
}

app.get("/api/flaky-tests", function (req, res) {
  var days = parseInt(req.query.days) || 14;
  var minDate = new Date(Date.now() - days * 86400000).toISOString();

  adoRequest(
    "/test/runs?minLastUpdatedDate=" +
      minDate +
      "&api-version=7.1"
  )
    .then(function (runsData) {
      var runs = runsData.value || [];
      var testResultPromises = runs.map(function (run) {
        return adoRequest(
          "/test/runs/" + run.id + "/results?api-version=7.1"
        );
      });
      return Promise.all(testResultPromises);
    })
    .then(function (allResults) {
      var testMap = {};

      allResults.forEach(function (resultSet) {
        var results = resultSet.value || [];
        results.forEach(function (result) {
          var key = result.automatedTestName;
          if (!testMap[key]) {
            testMap[key] = {
              name: result.automatedTestName,
              testCaseTitle: result.testCaseTitle,
              passes: 0,
              failures: 0,
              results: [],
            };
          }
          if (result.outcome === "Passed") {
            testMap[key].passes++;
          } else if (result.outcome === "Failed") {
            testMap[key].failures++;
          }
          testMap[key].results.push({
            outcome: result.outcome,
            date: result.completedDate,
            runId: result.testRun ? result.testRun.id : null,
            duration: result.durationInMs,
            errorMessage: result.errorMessage,
          });
        });
      });

      // Filter to only tests with mixed results
      var flakyTests = [];
      Object.keys(testMap).forEach(function (key) {
        var test = testMap[key];
        if (test.passes > 0 && test.failures > 0) {
          test.flakinessRate =
            test.failures / (test.passes + test.failures);
          test.totalRuns = test.passes + test.failures;
          flakyTests.push(test);
        }
      });

      flakyTests.sort(function (a, b) {
        return b.flakinessRate - a.flakinessRate;
      });

      res.json({
        period: days + " days",
        totalFlakyTests: flakyTests.length,
        tests: flakyTests,
      });
    })
    .catch(function (err) {
      res.status(500).json({ error: err.message });
    });
});

app.listen(PORT, function () {
  console.log("Flaky test dashboard running on port " + PORT);
});

Complete Working Example

The following tool combines everything into a single Node.js script that analyzes Azure DevOps test results, identifies flaky tests using statistical analysis, generates quarantine recommendations, and tracks resolution progress.

var https = require("https");
var fs = require("fs");
var path = require("path");

// Configuration
var CONFIG = {
  organization: process.env.ADO_ORG,
  project: process.env.ADO_PROJECT,
  token: process.env.ADO_PAT,
  analysisPeriodDays: 14,
  minRunsForAnalysis: 5,
  flakinessThreshold: 0.1, // 10% failure rate triggers investigation
  quarantineThreshold: 0.3, // 30% failure rate recommends quarantine
  dataFile: path.join(__dirname, "flaky-test-data.json"),
};

// Azure DevOps API client
function adoApi(urlPath, method, body) {
  method = method || "GET";
  var base =
    "https://dev.azure.com/" +
    CONFIG.organization +
    "/" +
    CONFIG.project +
    "/_apis";

  return new Promise(function (resolve, reject) {
    var parsedUrl = new URL(base + urlPath);
    var options = {
      hostname: parsedUrl.hostname,
      path: parsedUrl.pathname + parsedUrl.search,
      method: method,
      headers: {
        Authorization:
          "Basic " + Buffer.from(":" + CONFIG.token).toString("base64"),
        "Content-Type": "application/json",
      },
    };

    var req = https.request(options, function (res) {
      var data = "";
      res.on("data", function (chunk) {
        data += chunk;
      });
      res.on("end", function () {
        try {
          resolve(JSON.parse(data));
        } catch (e) {
          reject(new Error("Parse error: " + data.substring(0, 200)));
        }
      });
    });

    req.on("error", reject);

    if (body) {
      req.write(JSON.stringify(body));
    }
    req.end();
  });
}

// Fetch test runs from the specified period
function fetchTestRuns() {
  var minDate = new Date(
    Date.now() - CONFIG.analysisPeriodDays * 86400000
  ).toISOString();

  return adoApi(
    "/test/runs?minLastUpdatedDate=" +
      encodeURIComponent(minDate) +
      "&api-version=7.1"
  ).then(function (data) {
    return data.value || [];
  });
}

// Fetch results for a specific test run
function fetchTestResults(runId) {
  return adoApi(
    "/test/runs/" + runId + "/results?$top=1000&api-version=7.1"
  ).then(function (data) {
    return data.value || [];
  });
}

// Statistical analysis of test results
function analyzeFlakiness(testHistory) {
  var total = testHistory.passes + testHistory.failures;
  if (total < CONFIG.minRunsForAnalysis) {
    return null; // Not enough data
  }

  var failureRate = testHistory.failures / total;

  // Calculate variance in recent results (sliding window)
  var recentResults = testHistory.results
    .sort(function (a, b) {
      return new Date(b.date) - new Date(a.date);
    })
    .slice(0, 20);

  var transitions = 0;
  for (var i = 1; i < recentResults.length; i++) {
    if (recentResults[i].outcome !== recentResults[i - 1].outcome) {
      transitions++;
    }
  }

  // High transition count relative to runs indicates flakiness
  var transitionRate =
    recentResults.length > 1 ? transitions / (recentResults.length - 1) : 0;

  // Calculate average duration and standard deviation
  var durations = recentResults
    .map(function (r) {
      return r.duration || 0;
    })
    .filter(function (d) {
      return d > 0;
    });

  var avgDuration = 0;
  var durationStdDev = 0;

  if (durations.length > 0) {
    avgDuration =
      durations.reduce(function (sum, d) {
        return sum + d;
      }, 0) / durations.length;

    var variance =
      durations.reduce(function (sum, d) {
        return sum + Math.pow(d - avgDuration, 2);
      }, 0) / durations.length;

    durationStdDev = Math.sqrt(variance);
  }

  return {
    name: testHistory.name,
    totalRuns: total,
    passes: testHistory.passes,
    failures: testHistory.failures,
    failureRate: Math.round(failureRate * 1000) / 1000,
    transitionRate: Math.round(transitionRate * 1000) / 1000,
    avgDurationMs: Math.round(avgDuration),
    durationStdDevMs: Math.round(durationStdDev),
    lastFailure: findLastFailure(testHistory.results),
    lastPass: findLastPass(testHistory.results),
    commonErrors: extractCommonErrors(testHistory.results),
    recommendation: generateRecommendation(failureRate, transitionRate, total),
  };
}

function findLastFailure(results) {
  for (var i = 0; i < results.length; i++) {
    if (results[i].outcome === "Failed") {
      return results[i].date;
    }
  }
  return null;
}

function findLastPass(results) {
  for (var i = 0; i < results.length; i++) {
    if (results[i].outcome === "Passed") {
      return results[i].date;
    }
  }
  return null;
}

function extractCommonErrors(results) {
  var errorCounts = {};
  results.forEach(function (r) {
    if (r.errorMessage) {
      // Normalize error messages by removing variable parts
      var normalized = r.errorMessage
        .replace(/[0-9a-f]{8,}/gi, "HASH") // strip hashes before digits, or they never match
        .replace(/\d+/g, "N")
        .substring(0, 200);

      if (!errorCounts[normalized]) {
        errorCounts[normalized] = { count: 0, sample: r.errorMessage };
      }
      errorCounts[normalized].count++;
    }
  });

  return Object.keys(errorCounts)
    .map(function (key) {
      return errorCounts[key];
    })
    .sort(function (a, b) {
      return b.count - a.count;
    })
    .slice(0, 3);
}

function generateRecommendation(failureRate, transitionRate, totalRuns) {
  if (failureRate >= CONFIG.quarantineThreshold && transitionRate > 0.3) {
    return {
      action: "QUARANTINE",
      priority: "HIGH",
      reason:
        "High failure rate (" +
        Math.round(failureRate * 100) +
        "%) with frequent pass/fail transitions",
    };
  }

  if (failureRate >= CONFIG.flakinessThreshold && transitionRate > 0.2) {
    return {
      action: "INVESTIGATE",
      priority: "MEDIUM",
      reason:
        "Moderate flakiness detected (" +
        Math.round(failureRate * 100) +
        "% failure rate)",
    };
  }

  if (transitionRate > 0.4) {
    return {
      action: "INVESTIGATE",
      priority: "MEDIUM",
      reason:
        "High transition rate (" +
        Math.round(transitionRate * 100) +
        "%) suggests environment sensitivity",
    };
  }

  if (failureRate >= CONFIG.flakinessThreshold) {
    return {
      action: "MONITOR",
      priority: "LOW",
      reason:
        "Elevated failure rate but low transition rate - may be a real bug",
    };
  }

  return {
    action: "OK",
    priority: "NONE",
    reason: "Test appears stable",
  };
}

// Track resolution progress
function loadTrackingData() {
  try {
    var data = fs.readFileSync(CONFIG.dataFile, "utf-8");
    return JSON.parse(data);
  } catch (e) {
    return { tests: {}, history: [] };
  }
}

function saveTrackingData(data) {
  fs.writeFileSync(CONFIG.dataFile, JSON.stringify(data, null, 2));
}

function updateTracking(analysisResults) {
  var tracking = loadTrackingData();
  var timestamp = new Date().toISOString();

  analysisResults.forEach(function (result) {
    if (!tracking.tests[result.name]) {
      tracking.tests[result.name] = {
        firstDetected: timestamp,
        status: "NEW",
        assignee: null,
        notes: [],
      };
    }

    var entry = tracking.tests[result.name];
    entry.lastAnalysis = timestamp;
    entry.currentFailureRate = result.failureRate;
    entry.recommendation = result.recommendation;

    // Auto-update status based on trends
    if (result.recommendation.action === "OK" && entry.status !== "RESOLVED") {
      entry.status = "RESOLVED";
      entry.resolvedDate = timestamp;
    } else if (
      result.recommendation.action === "QUARANTINE" &&
      entry.status === "NEW"
    ) {
      entry.status = "QUARANTINED";
    }
  });

  tracking.history.push({
    date: timestamp,
    totalFlaky: analysisResults.length,
    quarantined: analysisResults.filter(function (r) {
      return r.recommendation.action === "QUARANTINE";
    }).length,
    investigating: analysisResults.filter(function (r) {
      return r.recommendation.action === "INVESTIGATE";
    }).length,
  });

  // Keep only 90 days of history
  var cutoff = Date.now() - 90 * 86400000;
  tracking.history = tracking.history.filter(function (entry) {
    return new Date(entry.date).getTime() > cutoff;
  });

  saveTrackingData(tracking);
  return tracking;
}

// Generate report
function generateReport(analysisResults, tracking) {
  var report = [];
  report.push("=== Flaky Test Analysis Report ===");
  report.push("Date: " + new Date().toISOString());
  report.push(
    "Period: " + CONFIG.analysisPeriodDays + " days"
  );
  report.push("Total flaky tests detected: " + analysisResults.length);
  report.push("");

  var grouped = {
    QUARANTINE: [],
    INVESTIGATE: [],
    MONITOR: [],
  };

  analysisResults.forEach(function (result) {
    var action = result.recommendation.action;
    if (grouped[action]) {
      grouped[action].push(result);
    }
  });

  if (grouped.QUARANTINE.length > 0) {
    report.push("--- QUARANTINE RECOMMENDED (" + grouped.QUARANTINE.length + ") ---");
    grouped.QUARANTINE.forEach(function (t) {
      report.push("  " + t.name);
      report.push(
        "    Failure rate: " +
          Math.round(t.failureRate * 100) +
          "% | Runs: " +
          t.totalRuns +
          " | Transitions: " +
          Math.round(t.transitionRate * 100) +
          "%"
      );
      if (t.commonErrors.length > 0) {
        report.push(
          "    Common error: " + t.commonErrors[0].sample.substring(0, 120)
        );
      }
    });
    report.push("");
  }

  if (grouped.INVESTIGATE.length > 0) {
    report.push(
      "--- INVESTIGATION NEEDED (" + grouped.INVESTIGATE.length + ") ---"
    );
    grouped.INVESTIGATE.forEach(function (t) {
      report.push("  " + t.name);
      report.push(
        "    Failure rate: " +
          Math.round(t.failureRate * 100) +
          "% | Runs: " +
          t.totalRuns
      );
    });
    report.push("");
  }

  if (grouped.MONITOR.length > 0) {
    report.push("--- MONITORING (" + grouped.MONITOR.length + ") ---");
    grouped.MONITOR.forEach(function (t) {
      report.push("  " + t.name);
      report.push(
        "    Failure rate: " +
          Math.round(t.failureRate * 100) +
          "% | Runs: " +
          t.totalRuns
      );
    });
    report.push("");
  }

  // Trend summary
  if (tracking.history.length > 1) {
    var latest = tracking.history[tracking.history.length - 1];
    var previous = tracking.history[tracking.history.length - 2];
    var delta = latest.totalFlaky - previous.totalFlaky;
    report.push("--- TREND ---");
    report.push(
      "  Flaky tests: " +
        latest.totalFlaky +
        " (" +
        (delta >= 0 ? "+" : "") +
        delta +
        " since last analysis)"
    );
  }

  return report.join("\n");
}

// Main execution
function main() {
  console.log("Starting flaky test analysis...");

  fetchTestRuns()
    .then(function (runs) {
      console.log("Found " + runs.length + " test runs");

      var resultPromises = runs.map(function (run) {
        return fetchTestResults(run.id).then(function (results) {
          return { runId: run.id, results: results };
        });
      });

      return Promise.all(resultPromises);
    })
    .then(function (allRunResults) {
      // Aggregate results by test name
      var testMap = {};

      allRunResults.forEach(function (run) {
        run.results.forEach(function (result) {
          var key = result.automatedTestName;
          if (!key) return;

          if (!testMap[key]) {
            testMap[key] = {
              name: key,
              passes: 0,
              failures: 0,
              results: [],
            };
          }

          if (result.outcome === "Passed") {
            testMap[key].passes++;
          } else if (result.outcome === "Failed") {
            testMap[key].failures++;
          }

          testMap[key].results.push({
            outcome: result.outcome,
            date: result.completedDate,
            duration: result.durationInMs,
            errorMessage: result.errorMessage || null,
          });
        });
      });

      // Analyze each test with mixed results
      var analysisResults = [];
      Object.keys(testMap).forEach(function (key) {
        var test = testMap[key];
        if (test.passes > 0 && test.failures > 0) {
          var analysis = analyzeFlakiness(test);
          if (analysis && analysis.recommendation.action !== "OK") {
            analysisResults.push(analysis);
          }
        }
      });

      analysisResults.sort(function (a, b) {
        var priorityOrder = { HIGH: 0, MEDIUM: 1, LOW: 2, NONE: 3 };
        return (
          priorityOrder[a.recommendation.priority] -
          priorityOrder[b.recommendation.priority]
        );
      });

      // Update tracking and generate report
      var tracking = updateTracking(analysisResults);
      var report = generateReport(analysisResults, tracking);

      console.log("\n" + report);

      // Output JSON for pipeline consumption
      var outputPath = path.join(__dirname, "flaky-test-report.json");
      fs.writeFileSync(
        outputPath,
        JSON.stringify(
          {
            generated: new Date().toISOString(),
            period: CONFIG.analysisPeriodDays,
            results: analysisResults,
            summary: {
              total: analysisResults.length,
              quarantine: analysisResults.filter(function (r) {
                return r.recommendation.action === "QUARANTINE";
              }).length,
              investigate: analysisResults.filter(function (r) {
                return r.recommendation.action === "INVESTIGATE";
              }).length,
              monitor: analysisResults.filter(function (r) {
                return r.recommendation.action === "MONITOR";
              }).length,
            },
          },
          null,
          2
        )
      );

      console.log("\nReport saved to " + outputPath);
    })
    .catch(function (err) {
      console.error("Analysis failed: " + err.message);
      process.exit(1);
    });
}

main();

Run the tool with your Azure DevOps credentials:

ADO_ORG=myorg ADO_PROJECT=myproject ADO_PAT=your-pat-here node flaky-analyzer.js

Common Issues and Troubleshooting

Tests Pass Locally but Fail in CI

This is almost always caused by environment differences. CI agents have different CPU speeds, memory limits, and network configurations. Check for hardcoded timeouts that are too tight for slower CI machines. Increase timeouts for integration tests and use waitFor patterns instead of fixed delays. Also verify that your CI agent has the same Node.js version as your local environment.
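
A simple mitigation is to scale timeouts on CI instead of hardcoding values tuned for a fast local machine. A minimal sketch for Jest (the 3x multiplier is an arbitrary starting point, not a recommendation from Azure DevOps):

// jest.setup.js - give slower CI agents more headroom per test
var BASE_TIMEOUT_MS = 5000;
var multiplier = process.env.CI === "true" ? 3 : 1;

jest.setTimeout(BASE_TIMEOUT_MS * multiplier);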

Quarantined Tests Never Get Fixed

This happens when there is no ownership or accountability for quarantined tests. Set a maximum quarantine duration (two weeks is reasonable) and assign every quarantined test to a specific engineer. Add a CI check that fails the build if any test has been quarantined longer than the maximum period. Treat quarantine as a temporary measure, not a permanent solution.

Retry Masks Real Failures

When tests pass on retry, developers stop investigating. The danger is that some of those retried failures are real bugs that manifest intermittently. Always log retries separately from first-run results. If a test requires retries to pass more than 20% of the time, it needs to be fixed, not retried. Track the retry rate independently from the pass rate.

Flaky Test Detection Reports False Positives

Azure DevOps may flag tests as flaky when they fail due to legitimate code changes that were later reverted, or when infrastructure issues cause widespread failures. Filter your analysis to exclude runs where more than 50% of tests failed (indicating an infrastructure problem rather than individual test flakiness). Also exclude the first failure after a code change to the test file itself.
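
A simple guard in the aggregation step, shaped for the per-run results arrays returned by fetchTestResults in the complete example above, might look like this:

// Treat a run as an infrastructure failure (and skip it) when more than
// half of its results failed; such runs say little about individual tests.
function looksLikeInfrastructureFailure(runResults) {
  if (runResults.length === 0) {
    return false;
  }
  var failed = runResults.filter(function (r) {
    return r.outcome === "Failed";
  }).length;
  return failed / runResults.length > 0.5;
}

// In main(), before aggregating:
// allRunResults = allRunResults.filter(function (run) {
//   return !looksLikeInfrastructureFailure(run.results);
// });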

Database Tests Fail Under Parallel Execution

When multiple test suites run in parallel against the same database, they corrupt each other's test data. Use unique database names or schemas per test worker. In PostgreSQL, create a temporary schema per test run. In MongoDB, use a unique database name that includes the worker ID:

var dbName = "testdb_worker_" + (process.env.JEST_WORKER_ID || "0");

Best Practices

  • Track flakiness as a metric. Measure it weekly. If your flakiness rate is climbing, stop adding features and fix tests. A healthy test suite should have less than 1% flaky tests.

  • Never ignore a flaky test. Every flaky test you ignore teaches your team that test failures are acceptable. This cultural rot spreads quickly and is expensive to reverse.

  • Make quarantine visible. Display quarantined tests on a dashboard, in Slack notifications, and in sprint planning. Hidden quarantine lists grow without bound.

  • Fix the root cause, not the symptom. Adding a retry or increasing a timeout is a band-aid. Identify whether the flakiness is in the test, the code under test, or the test infrastructure, and fix the actual problem.

  • Run flaky test detection on every PR. Catch new flaky tests before they merge. Run the test suite multiple times against the PR branch and flag any tests with inconsistent results, as sketched after this list.

  • Use deterministic patterns by default. Fake timers, mocked HTTP, seeded random data, and isolated state should be your defaults, not your exceptions. Only use real dependencies in a small set of clearly labeled integration tests.

  • Assign ownership for every flaky test. Unowned tests never get fixed. When a test is flagged as flaky, assign it to the engineer who last modified it or the team that owns the feature.

  • Budget time for test maintenance. Allocate 10 to 15% of each sprint for test infrastructure and flaky test remediation. This is not overhead. It is an investment in development velocity.
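
A hedged sketch of the PR-only detection step referenced above, assuming a hypothetical scripts/detect-flaky.js wrapper around the detectFlakyTests function shown earlier that exits non-zero when inconsistent results appear:

# Runs only for pull request builds
- script: node scripts/detect-flaky.js "npm test -- --ci" 5
  displayName: "Detect new flaky tests (PR only)"
  condition: eq(variables['Build.Reason'], 'PullRequest')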
