Flaky Test Detection and Management

A comprehensive guide to detecting and managing flaky tests in Azure DevOps, covering built-in flaky test detection, root cause analysis patterns, quarantine strategies, retry mechanisms, flaky test dashboards, and automated remediation workflows.

Overview

A flaky test is a test that produces different outcomes -- pass or fail -- without any code changes between runs. It passes on Monday, fails on Tuesday, passes again on Wednesday, all on the same commit. Flaky tests are one of the most corrosive problems in CI/CD because they erode trust in the entire test suite. When developers see a failing test and their first thought is "probably just flaky," you have lost the signal that tests are supposed to provide. Azure DevOps has built-in flaky test detection that identifies these tests automatically based on result history, but detection is only the first step. You also need strategies for quarantine, root cause analysis, and remediation.

I have worked on codebases where 15% of the test suite was flaky. The team had learned to re-run the pipeline once or twice before investigating failures, wasting hours of compute time per week and training themselves to ignore real failures. Getting that number below 2% required a deliberate, tracked effort -- identifying root causes, fixing or quarantining the worst offenders, and preventing new flaky tests from being introduced. This article covers the complete lifecycle from detection through remediation.

Prerequisites

  • An Azure DevOps organization with Azure Pipelines running automated tests
  • Test results published via PublishTestResults@2 (at least 2 weeks of history for pattern detection)
  • Azure DevOps Analytics enabled (default for Azure DevOps Services)
  • Node.js 18+ for the management scripts
  • Pipeline admin access for configuring flaky test settings
  • Familiarity with test runner retry mechanisms (Jest, pytest, xUnit)

How Azure DevOps Detects Flaky Tests

Azure DevOps tracks test results across pipeline runs and automatically identifies tests whose outcomes flip between pass and fail. The detection algorithm considers:

  1. Same test, same branch: A test that passes on run #100 and fails on run #101 of the same branch, with the same code commit, is potentially flaky.
  2. Result flip frequency: Tests that flip multiple times within a short window (e.g., 3 flips in 10 runs) are flagged with higher confidence.
  3. Retry outcomes: If a test fails on the first attempt but passes on retry within the same pipeline run, it is flagged as flaky.
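
Azure DevOps does not publish the exact algorithm, but the flip-counting idea behind the first two signals can be sketched roughly like this (the window size and threshold are illustrative, not the real values):

// Rough sketch of flip counting only -- not Azure DevOps' actual implementation.
// outcomes: most recent results for one test on one branch, oldest first.
function looksFlaky(outcomes, windowSize, flipThreshold) {
  var recent = outcomes.slice(-windowSize);
  var flips = 0;
  for (var i = 1; i < recent.length; i++) {
    if (recent[i] !== recent[i - 1]) { flips++; }
  }
  return flips >= flipThreshold;
}

looksFlaky(["Passed", "Failed", "Passed", "Passed", "Failed", "Passed"], 10, 3); // true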

Enabling Flaky Test Detection

Navigate to Project Settings > Pipelines > Test Management:

  1. Toggle "Flaky test detection" to On
  2. Choose a detection mode:
    • System detected: Azure DevOps uses its algorithm to identify flaky tests
    • Custom: You manually flag tests as flaky via the API or UI
  3. Choose a resolution mode:
    • Report only: Flaky tests are flagged in the UI but still affect the pipeline outcome
    • Auto-resolve: Flaky test failures are resolved to "passed" with a flaky tag. The pipeline succeeds even if flaky tests fail

Important: Start with "Report only" mode. Auto-resolve masks real failures when a previously flaky test starts failing consistently due to a real bug. Use auto-resolve only after you have a reliable process for investigating flaky flags.

Viewing Flaky Tests

In the pipeline Tests tab, flaky tests appear with a special icon. Filter by "Flaky" to see all tests currently flagged. Each flaky test shows:

  • Number of times it flipped in the detection window
  • Last pass and fail dates
  • Which pipeline runs were affected
  • The test's full history of outcomes

Root Cause Categories

Flaky tests are symptoms, not causes. Understanding the root cause determines the fix.

Timing Dependencies

The most common root cause. Tests that depend on specific timing -- waiting for an animation to complete, expecting a response within a timeout, checking that a value changes within a time window -- fail when the system is slower than expected.

// FLAKY: Depends on timing
test("notification disappears after 3 seconds", function () {
  showNotification("Test message");
  return sleep(3100).then(function () {
    // Fails if system is slow and notification takes 3.5 seconds
    var el = document.querySelector(".notification");
    expect(el).toBeNull();
  });
});

// STABLE: Waits for the actual condition
test("notification disappears after timeout", function () {
  showNotification("Test message");
  return waitForCondition(function () {
    return document.querySelector(".notification") === null;
  }, 10000).then(function (disappeared) {
    expect(disappeared).toBe(true);
  });
});
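
The waitForCondition helper used above is assumed rather than provided by any particular framework; a minimal polling implementation might look like this:

function waitForCondition(check, timeoutMs) {
  var deadline = Date.now() + timeoutMs;
  return new Promise(function (resolve) {
    function poll() {
      if (check()) {
        resolve(true);          // condition met
      } else if (Date.now() > deadline) {
        resolve(false);         // gave up: condition never became true
      } else {
        setTimeout(poll, 100);  // re-check every 100 ms
      }
    }
    poll();
  });
}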

Shared State

Tests that read or write shared state (databases, files, global variables, caches) can interfere with each other when run in parallel or in unpredictable order.

// FLAKY: Shared global state
var counter = 0;

test("first test increments counter", function () {
  counter++;
  expect(counter).toBe(1); // Fails if another test modified counter first
});

// STABLE: Isolated state
test("increment function works", function () {
  var localCounter = 0;
  localCounter++;
  expect(localCounter).toBe(1);
});
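
For state that genuinely has to live outside the test (a database, a temp directory), per-test setup and teardown keeps tests isolated. A sketch with assumed helpers (createTestDatabase and dropTestDatabase are placeholders, not a specific library):

var db;

beforeEach(function () {
  return createTestDatabase().then(function (instance) { db = instance; });
});

afterEach(function () {
  return dropTestDatabase(db);
});

test("saves and reads back a record", function () {
  return db.insert({ id: "user-1", name: "Test" }).then(function () {
    return db.find("user-1");
  }).then(function (record) {
    expect(record.name).toBe("Test");
  });
});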

External Service Dependencies

Tests that call external APIs, databases, or services without mocking are subject to network latency, service outages, and rate limiting.

// FLAKY: Depends on external API availability
test("fetches weather data", function () {
  return fetch("https://api.weather.com/current").then(function (res) {
    expect(res.status).toBe(200); // Fails if API is slow or rate-limited
  });
});

// STABLE: Mocked external dependency
test("processes weather data", function () {
  var mockData = { temperature: 72, condition: "sunny" };
  var result = processWeatherData(mockData);
  expect(result.display).toBe("72°F - Sunny");
});
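
If the request path itself needs coverage, the network call can be stubbed instead of removed. A hedged sketch using Jest's mock API and Node 18's global fetch (fetchCurrentWeather is a hypothetical wrapper being tested):

test("handles the current-conditions response", function () {
  jest.spyOn(global, "fetch").mockResolvedValue({
    status: 200,
    json: function () {
      return Promise.resolve({ temperature: 72, condition: "sunny" });
    },
  });

  return fetchCurrentWeather().then(function (weather) {
    expect(weather.condition).toBe("sunny");
  });
});

afterEach(function () {
  jest.restoreAllMocks(); // put the real fetch back for other tests
});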

Test Order Dependencies

Tests that depend on running after another specific test -- because the first test creates data the second test reads -- fail when the test runner changes execution order.

// FLAKY: Depends on test order
test("create user", function () {
  return createUser({ name: "Test" }).then(function (user) {
    global.testUserId = user.id; // Shared state between tests
  });
});

test("delete user", function () {
  // Fails if "create user" hasn't run yet
  return deleteUser(global.testUserId).then(function (result) {
    expect(result.deleted).toBe(true);
  });
});

// STABLE: Self-contained
test("create and delete user", function () {
  return createUser({ name: "Test" }).then(function (user) {
    return deleteUser(user.id);
  }).then(function (result) {
    expect(result.deleted).toBe(true);
  });
});

Resource Contention

Tests that compete for limited resources (ports, file locks, database connections) fail intermittently under parallel execution.

// FLAKY: Hard-coded port
test("server starts correctly", function () {
  var server = createServer();
  return server.listen(3000).then(function () {
    // Fails if another test or process is using port 3000
    expect(server.isRunning()).toBe(true);
    return server.close();
  });
});

// STABLE: Dynamic port
test("server starts correctly", function () {
  var server = createServer();
  return server.listen(0).then(function () {
    // Port 0 = OS assigns an available port
    expect(server.isRunning()).toBe(true);
    return server.close();
  });
});

Quarantine Strategies

While you fix the root cause, quarantine prevents flaky tests from disrupting the team.

Tag-Based Quarantine

Add a tag to flaky test cases and filter them out in CI:

// jest.config.js
module.exports = {
  testPathIgnorePatterns: process.env.SKIP_FLAKY === "true"
    ? ["/quarantine/"]
    : [],
};

Move flaky test files to a quarantine/ directory. Run quarantined tests in a separate pipeline job that does not block deployment:

jobs:
  - job: MainTests
    steps:
      - script: SKIP_FLAKY=true npx jest
        displayName: Run stable tests

  - job: QuarantinedTests
    steps:
      - script: npx jest --testPathPattern=quarantine
        displayName: Run quarantined tests
        continueOnError: true  # Does not block pipeline

Retry-Based Quarantine

Configure the test runner to retry failed tests. Tests that pass on retry are flagged but do not fail the build:

Jest (using the built-in jest.retryTimes, which requires the default jest-circus runner):

// jest.setup.js -- registered via setupFilesAfterEach in jest.config.js
if (process.env.CI) {
  jest.retryTimes(2);
}

pytest (using the pytest-rerunfailures plugin):

# pytest.ini
[pytest]
addopts = --reruns 2 --reruns-delay 1

Playwright:

// playwright.config.js
module.exports = {
  retries: process.env.CI ? 2 : 0,
};
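
For .NET test suites run through the VSTest@2 task, the task itself can rerun failed tests. A minimal sketch of the relevant inputs (the assembly pattern is an assumption about the repository layout):

- task: VSTest@2
  inputs:
    testSelector: testAssemblies
    testAssemblyVer2: |
      **\*Tests.dll
      !**\obj\**
    rerunFailedTests: true
    rerunMaxAttempts: 2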

Time-Boxed Quarantine

Every quarantined test gets a deadline. If it is not fixed within 2 sprints, it gets deleted or rewritten. Quarantine without deadlines becomes a permanent dumping ground.

Quarantine Tracker:
| Test Name | Quarantined Date | Owner | Deadline | Root Cause |
|-----------|-----------------|-------|----------|------------|
| checkout.timeout | 2026-01-15 | Shane | 2026-02-12 | Timing dependency |
| upload.parallel | 2026-01-22 | Maria | 2026-02-19 | File lock contention |
| api.rateLimit | 2026-02-01 | Alex | 2026-02-28 | External service mock needed |
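
To keep each quarantined test visible on the board, a tracking work item can be opened through the work item REST API. A minimal sketch using Node 18's global fetch (the Bug type, title, and tag values here are illustrative assumptions):

// Hedged sketch: open a tracking Bug for a quarantined test.
// Uses Node 18+ global fetch and a PAT in AZURE_DEVOPS_PAT.
var patch = [
  { op: "add", path: "/fields/System.Title", value: "Fix flaky test: checkout.timeout" },
  { op: "add", path: "/fields/System.Tags", value: "flaky-test; quarantined" },
];

fetch("https://dev.azure.com/my-organization/my-project/_apis/wit/workitems/$Bug?api-version=7.1", {
  method: "POST",
  headers: {
    "Content-Type": "application/json-patch+json",
    Authorization: "Basic " + Buffer.from(":" + process.env.AZURE_DEVOPS_PAT).toString("base64"),
  },
  body: JSON.stringify(patch),
}).then(function (res) { return res.json(); })
  .then(function (workItem) { console.log("Created work item #" + workItem.id); });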

Building a Flaky Test Dashboard

Track flaky tests as a quality metric alongside pass rate and coverage.

Key Metrics

Flaky Test Health:
  Total automated tests: 450
  Currently flagged as flaky: 12 (2.7%)
  Quarantined: 5
  Under investigation: 4
  Fixed this sprint: 3

  Flaky Rate Trend:
    Sprint 45: 4.2% (19 flaky tests)
    Sprint 46: 3.6% (16 flaky tests)
    Sprint 47: 2.7% (12 flaky tests)  ← improving

  Top Flaky Tests (by flip count, last 14 days):
    1. checkout.paymentTimeout — 8 flips
    2. dashboard.chartRender — 6 flips
    3. upload.largeFile — 5 flips
    4. search.elasticsearch — 4 flips
    5. auth.sessionExpiry — 3 flips

Complete Working Example

A Node.js tool that queries Azure DevOps for recent test runs, identifies flaky tests by counting outcome flips, and generates console and CSV reports.

var https = require("https");
var url = require("url");

var ORG = "my-organization";
var PROJECT = "my-project";
var PAT = process.env.AZURE_DEVOPS_PAT;
var API_VERSION = "7.1";

var BASE_URL = "https://dev.azure.com/" + ORG + "/" + PROJECT;
var ANALYTICS_URL = "https://analytics.dev.azure.com/" + ORG + "/" + PROJECT;
var AUTH = "Basic " + Buffer.from(":" + PAT).toString("base64");

function makeRequest(method, fullUrl) {
  return new Promise(function (resolve, reject) {
    var parsed = url.parse(fullUrl);
    var options = {
      hostname: parsed.hostname,
      path: parsed.path,
      method: method,
      headers: {
        Authorization: AUTH,
        Accept: "application/json",
      },
    };

    var req = https.request(options, function (res) {
      var data = "";
      res.on("data", function (chunk) { data += chunk; });
      res.on("end", function () {
        if (res.statusCode >= 200 && res.statusCode < 300) {
          resolve(data ? JSON.parse(data) : null);
        } else {
          reject(new Error(res.statusCode + " " + data.slice(0, 200)));
        }
      });
    });

    req.on("error", reject);
    req.end();
  });
}

function getRecentTestRuns(days) {
  var since = new Date();
  since.setDate(since.getDate() - days);
  var sinceStr = since.toISOString();

  var apiUrl = BASE_URL + "/_apis/test/runs?minLastUpdatedDate=" +
    encodeURIComponent(sinceStr) + "&api-version=" + API_VERSION;

  return makeRequest("GET", apiUrl).then(function (response) {
    return response.value || [];
  });
}

function getTestRunResults(runId) {
  var apiUrl = BASE_URL + "/_apis/test/runs/" + runId +
    "/results?api-version=" + API_VERSION + "&$top=1000";

  return makeRequest("GET", apiUrl).then(function (response) {
    return response.value || [];
  });
}

function analyzeFlakiness(days) {
  console.log("Analyzing test results from the last " + days + " day(s)...\n");

  var analyzedRunCount = 0;

  return getRecentTestRuns(days).then(function (runs) {
    console.log("Found " + runs.length + " test run(s)");

    // Collect results from all runs sequentially
    var chain = Promise.resolve();
    var allResults = [];

    // Limit to the 50 most recent runs for performance
    var recentRuns = runs.slice(0, 50);
    analyzedRunCount = recentRuns.length;

    recentRuns.forEach(function (run) {
      chain = chain.then(function () {
        return getTestRunResults(run.id).then(function (results) {
          results.forEach(function (r) {
            allResults.push({
              testName: r.automatedTestName || r.testCaseTitle,
              outcome: r.outcome,
              runId: run.id,
              runDate: run.completedDate || run.startedDate,
              duration: r.durationInMs,
            });
          });
        });
      });
    });

    return chain.then(function () {
      return allResults;
    });
  }).then(function (results) {
    // Group results by test name
    var testMap = {};
    results.forEach(function (r) {
      if (!testMap[r.testName]) {
        testMap[r.testName] = {
          name: r.testName,
          outcomes: [],
          totalRuns: 0,
          passes: 0,
          failures: 0,
          flips: 0,
        };
      }
      var test = testMap[r.testName];
      test.outcomes.push(r.outcome);
      test.totalRuns++;
      if (r.outcome === "Passed") { test.passes++; }
      if (r.outcome === "Failed") { test.failures++; }
    });

    // Calculate flips (outcome changes between consecutive runs)
    Object.keys(testMap).forEach(function (name) {
      var test = testMap[name];
      for (var i = 1; i < test.outcomes.length; i++) {
        if (test.outcomes[i] !== test.outcomes[i - 1] &&
            test.outcomes[i] !== "NotExecuted" &&
            test.outcomes[i - 1] !== "NotExecuted") {
          test.flips++;
        }
      }
    });

    // Identify flaky tests (tests with at least 2 flips and mixed outcomes)
    var flakyTests = Object.keys(testMap)
      .map(function (name) { return testMap[name]; })
      .filter(function (test) {
        return test.flips >= 2 && test.passes > 0 && test.failures > 0;
      })
      .sort(function (a, b) { return b.flips - a.flips; });

    var totalTests = Object.keys(testMap).length;
    var flakyCount = flakyTests.length;
    var flakyRate = totalTests > 0 ? ((flakyCount / totalTests) * 100).toFixed(1) : "0.0";

    console.log("\n=== Flaky Test Report ===");
    console.log("Analysis period: " + days + " days");
    console.log("Test runs analyzed: " + Math.min(results.length > 0 ? 50 : 0, 50));
    console.log("Unique tests: " + totalTests);
    console.log("Flaky tests: " + flakyCount + " (" + flakyRate + "%)");
    console.log("");

    if (flakyTests.length > 0) {
      console.log("Top Flaky Tests (by flip count):");
      console.log("-".repeat(90));
      flakyTests.slice(0, 20).forEach(function (test, index) {
        var passRate = ((test.passes / test.totalRuns) * 100).toFixed(0);
        console.log(
          "  " + (index + 1) + ". " + test.name.slice(0, 60)
        );
        console.log(
          "     Flips: " + test.flips +
          " | Runs: " + test.totalRuns +
          " | Pass rate: " + passRate + "%" +
          " | Pattern: " + test.outcomes.slice(-10).map(function (o) {
            return o === "Passed" ? "P" : o === "Failed" ? "F" : "-";
          }).join("")
        );
      });
    } else {
      console.log("No flaky tests detected. Test suite is stable.");
    }

    return { totalTests: totalTests, flakyTests: flakyTests, flakyRate: flakyRate };
  });
}

function generateCsvReport(days) {
  return analyzeFlakiness(days).then(function (data) {
    if (data.flakyTests.length === 0) { return; }

    var csv = "Test Name,Flips,Total Runs,Pass Rate,Last 10 Outcomes\n";
    data.flakyTests.forEach(function (test) {
      var passRate = ((test.passes / test.totalRuns) * 100).toFixed(1);
      var pattern = test.outcomes.slice(-10).map(function (o) {
        return o === "Passed" ? "P" : o === "Failed" ? "F" : "-";
      }).join("");
      csv += '"' + test.name + '",' + test.flips + "," + test.totalRuns + "," +
        passRate + "%," + pattern + "\n";
    });

    var fs = require("fs");
    var filename = "flaky-test-report-" + new Date().toISOString().slice(0, 10) + ".csv";
    fs.writeFileSync(filename, csv);
    console.log("\nCSV report saved: " + filename);
  });
}

// Main execution
var action = process.argv[2] || "report";
var days = parseInt(process.argv[3]) || 14;

if (action === "report") {
  analyzeFlakiness(days).catch(function (err) {
    console.error("Error: " + err.message);
    process.exit(1);
  });
} else if (action === "csv") {
  generateCsvReport(days).catch(function (err) {
    console.error("Error: " + err.message);
    process.exit(1);
  });
} else {
  console.log("Usage: node flaky-test-manager.js [report|csv] [days]");
  console.log("  report  Generate flaky test report (default)");
  console.log("  csv     Generate CSV export");
  console.log("  days    Analysis period in days (default: 14)");
}

Running the report:

$ node flaky-test-manager.js report 14
Analyzing test results from the last 14 day(s)...

Found 42 test run(s)

=== Flaky Test Report ===
Analysis period: 14 days
Test runs analyzed: 42
Unique tests: 380
Flaky tests: 9 (2.4%)

Top Flaky Tests (by flip count):
------------------------------------------------------------------------------------------
  1. checkout.test.js > Payment timeout handling
     Flips: 7 | Runs: 42 | Pass rate: 81% | Pattern: PPFPPFPFPP
  2. dashboard.test.js > Chart render with large dataset
     Flips: 5 | Runs: 38 | Pass rate: 87% | Pattern: PPPFPPFPPP
  3. upload.test.js > Large file upload progress
     Flips: 4 | Runs: 40 | Pass rate: 90% | Pattern: PPPPFPPPFP
  4. search.test.js > Elasticsearch query timeout
     Flips: 4 | Runs: 35 | Pass rate: 77% | Pattern: FPFPPFPPFP
  5. auth.test.js > Session expiry redirect
     Flips: 3 | Runs: 42 | Pass rate: 93% | Pattern: PPPPPPFPPP

Common Issues and Troubleshooting

Auto-Resolve Masking Real Failures

When flaky test detection is set to "Auto-resolve," a test that was historically flaky but is now consistently failing due to a real bug gets auto-resolved as passed. The pipeline succeeds while a real regression exists. Monitor auto-resolved tests: if a test is auto-resolved more than 3 consecutive times, it should be investigated. Switch to "Report only" mode if your team does not actively monitor auto-resolved results.

Flaky Detection Flagging Legitimate Failures

A test that consistently fails on one branch and passes on another is not flaky -- it is detecting a real difference between branches. Azure DevOps may flag this as flaky if it tracks results across branches. Configure flaky detection to scope to specific branches (typically main/master) to avoid false positives from feature branch failures.

Retry Count Inflating Pass Rates

When tests retry on failure, the retry passes are counted in the total results. A test that fails once and passes on retry appears as 2 results: 1 fail + 1 pass = 50% pass rate at the run level, but the test is reported as "passed" overall. This can make pass rate dashboards misleading. Track "first-attempt pass rate" separately from "final outcome pass rate" to understand the true reliability.
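
A hedged sketch of tracking both rates, assuming each test's attempts within one pipeline run have already been grouped into an array of outcomes (the data shape here is hypothetical):

// runs: one entry per (test, pipeline run), e.g. { attempts: ["Failed", "Passed"] }
function passRates(runs) {
  var firstAttemptPasses = 0;
  var finalPasses = 0;
  runs.forEach(function (run) {
    if (run.attempts[0] === "Passed") { firstAttemptPasses++; }
    if (run.attempts[run.attempts.length - 1] === "Passed") { finalPasses++; }
  });
  return {
    firstAttemptPassRate: (firstAttemptPasses / runs.length) * 100,
    finalOutcomePassRate: (finalPasses / runs.length) * 100,
  };
}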

Flaky Tests Consuming Excessive Pipeline Time

A test suite with 20 flaky tests and 2 retries each adds up to 40 extra test executions per run. If each test takes 30 seconds, that is 20 minutes of wasted pipeline time. Prioritize fixing the top 5 flaky tests by flip count -- they account for most of the retry waste. Calculate the time cost: flaky_tests × avg_duration × retry_count × runs_per_day.
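
A quick worked example of that formula (runs_per_day of 10 is an assumption):

var flakyTests = 20;      // currently flagged tests
var avgDurationSec = 30;  // average duration of one execution
var retryCount = 2;       // retries configured per failing test
var runsPerDay = 10;      // assumed pipeline runs per day

var wastedSecondsPerDay = flakyTests * avgDurationSec * retryCount * runsPerDay;
console.log((wastedSecondsPerDay / 3600).toFixed(1) + " agent-hours per day"); // 3.3 agent-hours per day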

Flaky Test Data Not Appearing in Analytics

Azure DevOps flaky test data requires the Analytics extension and needs 5+ runs with flip data before flagging a test. If you enabled flaky detection recently, wait for a week of pipeline runs before expecting results. Also verify that tests are published with consistent automatedTestName values -- if the name changes between runs (e.g., includes a timestamp), Azure DevOps treats each variation as a separate test.
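
A short illustration of the naming problem (runSyncJob is a placeholder for the code under test):

// FLAKY IDENTITY: the name changes every run, so no result history can accumulate
test("sync job " + Date.now() + " completes", function () {
  return runSyncJob().then(function (result) { expect(result.ok).toBe(true); });
});

// STABLE IDENTITY: fixed name; Azure DevOps can correlate results across runs
test("sync job completes", function () {
  return runSyncJob().then(function (result) { expect(result.ok).toBe(true); });
});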

Best Practices

  • Start with "Report only" detection mode. Auto-resolve is convenient but dangerous without active monitoring. Report mode gives you visibility into flaky tests without masking potential real failures. Switch to auto-resolve only when you have a reliable triage process.

  • Track flaky test rate as a team metric. Display the flaky rate on your quality dashboard alongside pass rate and coverage. A target of under 2% is achievable for most teams. Review the top flaky tests weekly and assign owners.

  • Fix the root cause, do not just add retries. Retries are a symptom treatment, not a cure. Every retried test is a test with a latent defect in its design. Add retries to keep the pipeline flowing while you investigate, but track retried tests and fix the underlying issue.

  • Quarantine with deadlines. Every quarantined test gets an owner and a 2-sprint deadline. If not fixed by the deadline, the test is either deleted (if the feature has changed) or rewritten (if the test is still needed). Permanent quarantine is test deletion with extra steps.

  • Investigate timing flakes with explicit waits. Replace all sleep() calls and implicit waits with explicit condition checks. If a test waits for an element to appear, wait for the specific element with a generous timeout rather than sleeping for a fixed duration.

  • Isolate test state completely. Each test should set up its own data and tear it down after. Use transactions that roll back, unique identifiers per test, and separate database schemas per test worker. Shared state between tests is the second most common flaky test root cause after timing.

  • Run flaky test analysis before marking a release. As part of your release checklist, generate a flaky test report for the last 14 days. A rising flaky rate signals instability that should be addressed before release, not after.

  • Prevent new flaky tests with CI checks. Add a pipeline step that compares the current flaky test count against the last known count. If the count increases, add a warning or fail the build. This creates pressure to fix existing flaky tests and prevents new ones from being normalized.
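
A hedged sketch of such a step, assuming the report script from this article is checked into the repository, flaky-baseline.txt holds the last accepted count, FLAKY_PAT is a secret pipeline variable, and the agent is Linux (for grep -P):

- script: |
    node flaky-test-manager.js report 14 | tee flaky-report.txt
    current=$(grep -oP 'Flaky tests: \K[0-9]+' flaky-report.txt)
    baseline=$(cat flaky-baseline.txt)
    if [ "$current" -gt "$baseline" ]; then
      echo "##vso[task.logissue type=warning]Flaky test count rose from $baseline to $current"
    fi
  displayName: Check flaky test trend
  env:
    AZURE_DEVOPS_PAT: $(FLAKY_PAT)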
