
Pipeline Debugging: Diagnosing Failed Builds

A practical guide to debugging Azure DevOps pipeline failures, covering log analysis, system.debug mode, YAML validation, agent diagnostics, template expansion, and programmatic build querying.


Overview

Pipeline failures are inevitable. What separates a team that ships from a team that stalls is how fast they can diagnose and fix a broken build. Azure DevOps gives you a surprisingly deep set of debugging tools -- verbose logs, diagnostic mode, REST APIs, template expansion previews -- but most engineers only ever stare at the red X and scroll through the summary page. This article covers every debugging technique I use in production, from reading raw logs effectively to querying build details programmatically.

Prerequisites

  • An Azure DevOps organization and project with Pipelines enabled
  • Working familiarity with YAML pipeline syntax (triggers, stages, jobs, steps)
  • Access to edit pipeline variables and queue new runs
  • Basic comfort with REST APIs and command-line tools (Azure CLI or PowerShell)
  • An understanding of your agent pool configuration (Microsoft-hosted or self-hosted)

Reading Pipeline Logs Effectively

The most underrated debugging skill is reading logs properly. Most engineers glance at the error summary and start guessing. Stop guessing. The logs tell you exactly what happened.

Expanding Failed Tasks

When a pipeline fails, Azure DevOps collapses successful tasks and highlights the failed one. Click on it. But do not stop at the summary view. The summary truncates output and sometimes hides the actual root cause behind a generic error message.

Click the individual task that failed. You will see a timeline on the left and the log output on the right. The first error is usually the real one. Everything after it is cascading failures.

Downloading Raw Logs

The web UI truncates long lines and sometimes swallows output that arrived during a timeout. Always download the raw logs for complex failures.

  1. Open the failed pipeline run
  2. Click the three-dot menu (top right of the run summary)
  3. Select Download logs

You get a ZIP file with one log file per task. These are the complete, untruncated logs. I grep through these locally when the web UI is not giving me enough context.

# Unzip and search for the actual error
unzip 20260208.1-logs.zip -d build-logs
cd build-logs

# Find the real error -- ignore the noise
grep -rn "error" --include="*.txt" | grep -v "0 error(s)" | head -20

# Search for a specific failing test or module
grep -rn "FAILED" --include="*.txt"
grep -rn "exit code" --include="*.txt"

Timestamps Matter

Every log line has a timestamp. Use them. If a step took 45 minutes when it normally takes 3, that is your clue. Network timeouts, hung processes, and resource exhaustion all show up as unusually long task durations before the actual error appears.
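A rough way to find the slow task in a downloaded log bundle is to diff the first and last timestamp of each file. A sketch, assuming GNU date and that every log line starts with an ISO-8601 timestamp (the agent's default format):

# Print the ten longest-running tasks from the unzipped raw logs.
find build-logs -name '*.txt' | while read -r f; do
  start=$(head -1 "$f" | cut -d' ' -f1)
  end=$(tail -1 "$f" | cut -d' ' -f1)
  printf '%6ss  %s\n' "$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))" "$f"
done | sort -rn | head -10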


The system.debug Variable

This is the single most useful debugging tool in Azure DevOps Pipelines, and half the engineers I work with have never used it.

Setting system.debug to true enables verbose logging for every task in the pipeline. Tasks emit detailed diagnostic information that is suppressed during normal runs: HTTP request/response details, environment variable resolution, file system operations, and internal decision points.

Enabling system.debug

Option 1: Queue with variable override

When you manually queue a run, click Variables before hitting Run. Add system.debug with value true. This enables verbose logging for that single run without modifying your YAML.

Option 2: Pipeline variable

Add it as a pipeline variable in the UI under Pipelines > Edit > Variables. Set it to true temporarily while debugging, then remove it.

Option 3: In your YAML

variables:
  system.debug: true

I do not recommend leaving this in your YAML permanently. Verbose logs are 5-10x larger and slow down the log viewer. Use Option 1 for one-off debugging.

What system.debug Reveals

With debug enabled, you will see output like this in every task:

##[debug]Evaluating condition for step: 'Build solution'
##[debug]Evaluating: succeeded()
##[debug]Evaluating succeeded:
##[debug]=> (Boolean)True
##[debug]Result: True
##[debug]Starting: Build solution
##[debug]Loading inputs
##[debug]Loading env
##[debug]Agent.BuildDirectory=D:\a\1
##[debug]Agent.HomeDirectory=D:\a
##[debug]Agent.RootDirectory=D:\a
##[debug]Agent.TempDirectory=D:\a\_temp
##[debug]Agent.ToolsDirectory=D:\a\_tool
##[debug]Agent.WorkFolder=D:\a

This is gold when you are trying to figure out why a condition evaluated the way it did, which directory a task is running in, or what environment variables were available at execution time.


The Pipeline Run UI

The pipeline run detail page has several tabs that most people ignore. Each one serves a specific debugging purpose.

Timeline Tab

Shows every stage, job, and step with duration and status. Look for:

  • Steps that took abnormally long (potential timeout or resource issues)
  • Steps that were skipped (condition evaluated to false -- why?)
  • The gap between steps (agent provisioning delays)

Tasks Tab

Flat list of all tasks with their status. Useful when you have a multi-stage pipeline and want to see all failures at a glance without expanding each stage.

Artifacts Tab

Shows published artifacts. If your build succeeded but your deployment failed because the artifact was wrong, check here. Common issues: empty artifacts, missing files, wrong artifact name.

Tests Tab

If you publish test results, they appear here with pass/fail counts, individual test details, and failure messages. This is often more useful than scrolling through raw test runner output.


Common Failure Categories

After debugging thousands of pipeline failures, I have found they fall into a handful of categories. Knowing the category immediately narrows your search.

Build Errors

The code does not compile or the bundler fails. These are usually straightforward -- the error message tells you the file and line number. The tricky ones involve version mismatches between your local environment and the build agent.

error TS2307: Cannot find module '@company/shared-lib' or its corresponding type declarations.

This almost always means your package.json references a private package and the agent does not have access to the private feed. Check your .npmrc configuration and the pipeline's feed authentication step.
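When that error appears, a quick feed-access check as a script step before npm ci narrows it down. A minimal sketch -- it assumes the feed URL lives in a checked-in .npmrc, and not every private feed implements npm ping:

cat .npmrc                       # which feed does the project point at?
npm config get registry          # which registry will npm actually use?
npm ping || echo "registry unreachable or authentication failed"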

Test Failures

Tests pass locally but fail in CI. The usual suspects:

  • Hardcoded paths: Tests reference /Users/shane/project/... instead of relative paths
  • Timing-dependent tests: Async operations that pass on your fast machine but timeout on a shared agent
  • Environment differences: Missing environment variables, different OS, different tool versions
  • Database state: Tests depend on data from a previous test run that does not exist on a clean agent

Infrastructure Issues

The agent itself has problems. Disk full, out of memory, network connectivity lost, Docker daemon not running. These show up as cryptic errors:

##[error]We stopped hearing from agent <agent name>. Verify the agent machine is running and has a healthy network connection.

Or:

##[error]No space left on device

For self-hosted agents, check the machine health. For Microsoft-hosted agents, retry the build -- you will get a fresh VM.
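For a self-hosted Linux agent, a quick health pass on the machine usually finds the culprit. A sketch -- the systemd unit name follows the agent's default vsts.agent.<org>.<pool>.<host> pattern and will differ per install:

df -h                                             # disk pressure in the work folder?
free -m                                           # memory headroom?
docker info > /dev/null 2>&1 && echo "docker OK" || echo "docker daemon not reachable"
systemctl --type=service | grep vsts.agent        # is the agent service registered and running?
journalctl -u 'vsts.agent.*' --since "1 hour ago" | tail -50   # recent agent service logs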

Permission Errors

Service connections, feed access, deployment targets -- anything that requires authentication can fail silently or with unhelpful error messages.

##[error]Error: unable to get local issuer certificate
##[error]VS30063: You are not authorized to access https://dev.azure.com/org/project

These require checking service connection configurations, PAT expiration dates, and managed identity assignments.
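A quick sanity check for a PAT is to hit a lightweight API with it and inspect the status code; anything other than 200 (commonly a 401 or a redirect to the sign-in page) points at an expired or under-scoped token. A sketch, assuming the PAT is exported as PAT:

# 200 -> the PAT authenticates against the organization; 401/403 -> expired or missing scope
curl -s -o /dev/null -w "%{http_code}\n" -u ":$PAT" \
  "https://dev.azure.com/{org}/_apis/projects?api-version=7.1"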


Diagnosing Agent Issues

When a pipeline is stuck in the queue or fails immediately with agent errors, the problem is in the agent pool configuration.

Agent Capabilities and Demands

Every agent advertises capabilities (installed software, environment variables, system properties). Every pipeline job can specify demands. If no agent satisfies all demands, the job sits in the queue forever.

pool:
  name: 'Self-Hosted Linux'
  demands:
    - docker
    - Agent.OS -equals Linux
    - node -gtVersion 18.0.0

To check capabilities: go to Organization Settings > Agent Pools > [Your Pool] > Agents > [Agent Name] > Capabilities.

If a job is stuck queuing, check the demands in the YAML against the capabilities of the agents in the pool. The mismatch is usually obvious once you look.
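You can pull the same capability data from the command line, which is easier to diff against the demands in your YAML. A sketch using the azure-devops CLI extension and jq -- pool ID 1 is a placeholder, and the capability field names follow the Agents REST payload:

# List each agent in the pool with its status and the names of its system capabilities
az pipelines agent list --pool-id 1 --include-capabilities \
  --organization https://dev.azure.com/myorg --output json \
  | jq '.[] | {name: .name, status: .status, capabilities: (.systemCapabilities // {} | keys)}'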

Pool Problems

Common pool issues:

  • All agents offline: Self-hosted agent service crashed or VM deallocated
  • All agents busy: Too many concurrent builds for the pool size
  • Parallelism limit: Free-tier organizations get at most one Microsoft-hosted parallel job (and only after the free grant is approved -- see Issue 1 under Common Issues below); everything else queues
  • Maintenance mode: An agent is online but marked for maintenance

# Check agent status via Azure CLI
az pipelines agent list --pool-id 1 --organization https://dev.azure.com/myorg --output table

Debugging YAML Syntax Errors

YAML is whitespace-sensitive and unforgiving. A single indentation error can break your pipeline in ways that produce unhelpful error messages.

Pre-Validation

Azure DevOps validates YAML when you save it in the web editor. Use this. Before committing a YAML change through a PR, paste it into the pipeline editor and click Validate. This catches:

  • Indentation errors
  • Invalid property names
  • Missing required fields
  • Invalid template references

Schema Linting Locally

Install the Azure Pipelines VS Code extension or use a YAML linter with the Azure Pipelines schema:

# Install yamllint
pip install yamllint

# Lint your pipeline file
yamllint -d "{extends: default, rules: {line-length: {max: 200}}}" azure-pipelines.yml

For deeper validation -- including template expansion -- you can have the server compile the pipeline without queuing a run and return either the final YAML or the compile errors.
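A sketch of that check with curl, assuming the Pipelines Preview REST endpoint and a PAT exported as PAT ({pipelineId} is the pipeline's definition ID):

# Compile the pipeline server-side: templates expanded, nothing queued.
# The JSON response contains either the fully expanded YAML or the list of errors.
curl -s -u ":$PAT" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"previewRun": true}' \
  "https://dev.azure.com/{org}/{project}/_apis/pipelines/{pipelineId}/preview?api-version=7.1-preview.1" \
  | python -m json.tool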

Common YAML Mistakes

# WRONG: displayName over-indented (or indented with a tab instead of spaces)
steps:
  - script: echo hello
      displayName: 'Say Hello'   # must align with 'script' -- this breaks parsing

# WRONG: Incorrect indentation of multi-line script
steps:
  - script: |
    echo "line 1"
    echo "line 2"
  # The script lines need to be indented further:

# RIGHT:
steps:
  - script: |
      echo "line 1"
      echo "line 2"
    displayName: 'Multi-line script'

Variable Resolution Debugging

Variable problems are among the most common and most frustrating pipeline failures. Variables can come from the YAML file, pipeline settings, variable groups, key vault, template parameters, and runtime expressions. When the wrong value shows up (or no value at all), you need a systematic approach.

Printing Variables

Add a diagnostic step that dumps all the variables you care about:

steps:
  - script: |
      echo "Build.SourceBranch: $(Build.SourceBranch)"
      echo "Build.Reason: $(Build.Reason)"
      echo "System.PullRequest.TargetBranch: $(System.PullRequest.TargetBranch)"
      echo "MY_CUSTOM_VAR: $(MY_CUSTOM_VAR)"
      echo "Build.BuildNumber: $(Build.BuildNumber)"
    displayName: 'Debug: Print variables'

For secret variables, you cannot print them directly (Azure DevOps masks them with ***). But you can verify they are set:

steps:
  - script: |
      if [ -z "$MY_SECRET" ]; then
        echo "MY_SECRET is empty or not set!"
        exit 1
      else
        echo "MY_SECRET is set (length: ${#MY_SECRET})"
      fi
    displayName: 'Debug: Verify secret is set'
    env:
      MY_SECRET: $(MySecretVariable)

Compile-Time vs. Runtime Expressions

This trips up even experienced engineers. There are three variable syntaxes:

  • ${{ variables.myVar }} -- evaluated at compile time (template expansion)
  • $(myVar) -- evaluated at run time (macro expansion)
  • $[variables.myVar] -- evaluated at run time (runtime expression)

If you use ${{ }} for a variable that is only set at runtime (like an output variable from a previous job), it will be empty. This is not a bug. It is the evaluation order.

# WRONG: This will be empty because outputVar is set at runtime
steps:
  - script: echo ${{ variables.outputVar }}

# RIGHT: Use runtime syntax
steps:
  - script: echo $(outputVar)

Task Version Issues and Pinning

Tasks in Azure DevOps are versioned. When you write task: NodeTool@0, you are using the major version 0 of the Node.js Tool Installer task. Minor and patch versions are applied automatically by Microsoft.

This means your pipeline can break without any change to your code or YAML. Microsoft pushes a patch to a task, and suddenly your build fails.

Identifying Task Version Issues

When a pipeline breaks and nobody changed the code or YAML, check the task versions:

  1. Open the failed run
  2. Click the failed task
  3. Look for the version string in the log header:

##[section]Starting: Use Node 18.x
==============================================================================
Task         : Node.js tool installer
Description  : Finds or downloads and caches the specified version spec of Node.js
Version      : 0.238.1
Author       : Microsoft Corporation
==============================================================================

Compare this version with a previous successful run. If the version changed, that is your culprit.
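The Build Timeline REST API records each step's task reference, which is faster than clicking through logs when you need to compare many runs. A hedged sketch with curl and jq -- it assumes the timeline records expose the task's name and version, and a PAT exported as PAT:

# Dump task name and version for every step of a run; capture this for the last
# good build and the first bad build, then diff the two outputs.
curl -s -u ":$PAT" \
  "https://dev.azure.com/{org}/{project}/_apis/build/builds/{buildId}/timeline?api-version=7.1" \
  | jq -r '.records[] | select(.task != null) | "\(.task.name)\t\(.task.version)"' \
  | sort -u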

Pinning Tasks

For critical pipelines, pin the task to a specific major version and file issues when minor version updates break things. You cannot pin minor versions in YAML -- Azure DevOps does not support it. But you can monitor changes through the task changelog.

# This pins to major version 2 -- minor/patch updates still apply
- task: DotNetCoreCLI@2
  inputs:
    command: 'build'

# If a task update breaks you, the workaround is to use
# a script step instead of the built-in task
- script: dotnet build --configuration Release
  displayName: 'Build (pinned behavior)'

Debugging Template Expansion

When you use templates extensively, the YAML that actually runs is different from the YAML in your repository. Template parameters get substituted, conditional blocks get resolved, loops get unrolled. Debugging a failure in the expanded YAML requires seeing what was actually generated.

Viewing the Expanded YAML

Azure DevOps shows you the expanded YAML for any run:

  1. Open the pipeline run
  2. Click the three-dot menu
  3. Select Download full YAML (or View YAML depending on your version)

This gives you the fully resolved YAML after all template expansion. It is often hundreds of lines longer than your source file. Search through it to verify that template parameters were substituted correctly and conditional blocks resolved the way you expected.

Common Template Expansion Issues

# Template parameter type mismatch
# If the template declares a boolean parameter and you pass a string,
# the condition might not evaluate correctly

# In template:
parameters:
  - name: runTests
    type: boolean
    default: true

# WRONG: Passing a string instead of boolean
jobs:
  - template: build.yml
    parameters:
      runTests: 'true'   # This is a string, not a boolean

# RIGHT:
jobs:
  - template: build.yml
    parameters:
      runTests: true

Artifact and Caching Problems

Pipeline artifacts and caching are common sources of subtle failures. The build succeeds, but the artifact is incomplete, or a stale cache causes the wrong version to deploy.

Artifact Debugging

# Add a step to inspect the artifact contents before publishing
- script: |
    echo "=== Artifact contents ==="
    find $(Build.ArtifactStagingDirectory) -type f | head -50
    echo "=== Total size ==="
    du -sh $(Build.ArtifactStagingDirectory)
  displayName: 'Debug: Inspect artifact'

- task: PublishBuildArtifacts@1
  inputs:
    pathToPublish: '$(Build.ArtifactStagingDirectory)'
    artifactName: 'drop'

Cache Key Mismatches

The Cache task uses a key to determine whether to restore a cached directory. If the key does not change when it should (or changes when it should not), you get stale or missing caches.

- task: Cache@2
  inputs:
    key: 'npm | "$(Agent.OS)" | package-lock.json'
    path: '$(Pipeline.Workspace)/.npm'
  displayName: 'Cache npm packages'

# If package-lock.json changed but the cache was not invalidated,
# check that the file path is correct relative to the repository root.
# A common mistake is specifying package-lock.json when it is
# actually in a subdirectory like src/package-lock.json.

Timeout and Resource Limit Issues

Job Timeouts

The default job timeout is 60 minutes. If your build takes longer, it gets killed with no useful error message.

jobs:
  - job: Build
    timeoutInMinutes: 120    # Increase for long builds
    cancelTimeoutInMinutes: 5 # Grace period for cleanup on cancel
    steps:
      - script: npm run build

Agent Resource Limits

Microsoft-hosted agents have fixed resources (typically 2 CPUs, 7 GB RAM, 14 GB SSD). If your build exceeds these, you get out-of-memory kills or disk space errors.

# Add diagnostic steps to monitor resource usage
df -h            # Disk space
free -m          # Memory (Linux agents)
nproc            # CPU count

If you consistently hit resource limits on Microsoft-hosted agents, switch to self-hosted agents where you control the hardware.
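When you suspect memory or disk pressure but the failure point moves around, a lightweight background sampler makes the trend visible in the log. A minimal sketch for a Linux agent, run inside a single script step -- the interval, log path, and npm run build placeholder are illustrative:

# Sample memory and disk every 30 seconds while the real build runs
(
  while true; do
    echo "=== $(date -u +%H:%M:%S) ==="
    free -m | sed -n '2p'       # memory: total/used/free
    df -h . | sed -n '2p'       # disk usage of the working directory
    sleep 30
  done
) > /tmp/resource-samples.log 2>&1 &
SAMPLER_PID=$!

npm run build                   # the real work goes here

kill "$SAMPLER_PID"
cat /tmp/resource-samples.log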


Service Connection and Authentication Failures

Service connections are the most common source of "it worked yesterday" failures. Tokens expire, certificates rotate, managed identity permissions change.

Diagnosing Service Connection Issues

# Test a service connection by making a simple API call
- task: AzureCLI@2
  inputs:
    azureSubscription: 'MyServiceConnection'
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: |
      echo "Logged in as:"
      az account show --query "{name:name, id:id, tenantId:tenantId}" -o table
      echo "Subscription access verified."
  displayName: 'Debug: Verify Azure service connection'

Common service connection failures:

  • Secret expired: Service principal client secrets have expiration dates. Check the App Registration in Azure AD (or run the CLI check after this list).
  • Certificate rotated: If using certificate-based auth, the certificate might have been renewed without updating the service connection.
  • Permissions changed: The service principal lost access to the target resource group or subscription.
  • Federated credential misconfigured: Workload identity federation requires the correct subject claim matching the pipeline's branch and environment.
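To check client-secret expiry from the command line, a sketch with the Azure CLI -- it assumes you know the service principal's application (client) ID from the service connection details, and the field names follow the current Microsoft Graph-backed CLI (older versions call the field endDate):

# List every client secret on the app registration with its expiry date
az ad app credential list --id <application-client-id> \
  --query "[].{keyId:keyId, expires:endDateTime}" -o table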

Using the REST API to Query Build Details

When the UI is not enough, use the Azure DevOps REST API to query build details programmatically. This is particularly useful for automated monitoring and for debugging issues across many builds.

# Get the details of a specific build
curl -s -u ":$(PAT)" \
  "https://dev.azure.com/{org}/{project}/_apis/build/builds/{buildId}?api-version=7.1" \
  | python -m json.tool

# Get the timeline (all tasks, durations, statuses) for a build
curl -s -u ":$(PAT)" \
  "https://dev.azure.com/{org}/{project}/_apis/build/builds/{buildId}/timeline?api-version=7.1" \
  | python -m json.tool

# List recent failed builds for a specific pipeline
curl -s -u ":$(PAT)" \
  "https://dev.azure.com/{org}/{project}/_apis/build/builds?definitions={definitionId}&statusFilter=failed&\$top=10&api-version=7.1" \
  | python -m json.tool

You can also use Node.js to build a monitoring script:

const https = require("https");

const org = "myorg";
const project = "myproject";
const pat = process.env.AZURE_DEVOPS_PAT;
const definitionId = 42;

const options = {
  hostname: "dev.azure.com",
  path: `/${org}/${project}/_apis/build/builds?definitions=${definitionId}&statusFilter=failed&$top=5&api-version=7.1`,
  headers: {
    Authorization: "Basic " + Buffer.from(":" + pat).toString("base64")
  }
};

https.get(options, (res) => {
  let body = "";
  res.on("data", (chunk) => { body += chunk; });
  res.on("end", () => {
    const data = JSON.parse(body);
    data.value.forEach((build) => {
      console.log(`Build ${build.buildNumber} failed at ${build.finishTime}`);
      console.log(`  Reason: ${build.reason}`);
      console.log(`  Source: ${build.sourceBranch}`);
      console.log(`  URL: ${build._links.web.href}`);
      console.log("");
    });
  });
});

Retry vs. Re-Run

Azure DevOps gives you two options when a build fails: Retry and Re-run. They are not the same.

Retry (Rerun Failed Jobs)

Retries only the failed jobs. Successful jobs are not re-executed. This is useful when:

  • The failure was transient (network timeout, flaky test, agent hiccup)
  • Earlier stages (build, unit tests) passed and you do not want to waste time rebuilding
  • You want to preserve the same source commit and variables

Re-Run (Run New)

Queues an entirely new pipeline run. Every stage, job, and step runs from scratch. Use this when:

  • You made a YAML change and want to pick it up
  • You changed pipeline variables
  • The failure might have been caused by a stale workspace or cached state
  • You want a clean run with fresh agent VMs

My rule of thumb: retry once for transient failures. If the retry fails, re-run from scratch. If the re-run fails, it is a real problem that requires investigation.
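Retry is also available programmatically, which helps when a monitoring script spots a transient failure. A hedged sketch against the Build Stages update call -- it assumes the stage reference name (Build here), that only failed jobs should rerun, and a PAT exported as PAT:

# Re-run only the failed jobs of the 'Build' stage in an existing run
curl -s -u ":$PAT" \
  -X PATCH \
  -H "Content-Type: application/json" \
  -d '{"state": "retry", "forceRetryAllJobs": false}' \
  "https://dev.azure.com/{org}/{project}/_apis/build/builds/{buildId}/stages/Build?api-version=7.1-preview.1"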


Pipeline Diagnostic Mode

Azure DevOps has a built-in diagnostic mode that goes beyond system.debug. You can enable it by setting multiple diagnostic variables:

variables:
  system.debug: true
  agent.diagnostic: true

Or when queuing a run manually, set these variables:

  • system.debug -- Verbose task output, condition evaluation details
  • agent.diagnostic -- Agent-level diagnostics, capability resolution, job dispatch details
  • Agent.TempDirectory cleanup logs -- Shows what gets cleaned between jobs

With both enabled, you get a complete picture of what the agent did, from receiving the job to cleaning up the workspace.


Complete Working Example

Here is a debug-friendly pipeline template that includes diagnostic steps. These steps activate only when system.debug is true, so they add zero overhead to normal runs.

# azure-pipelines.yml
# Debug-friendly pipeline with conditional diagnostics

trigger:
  branches:
    include:
      - main
      - release/*

variables:
  buildConfiguration: 'Release'
  nodeVersion: '18.x'

stages:
  - stage: Build
    displayName: 'Build and Test'
    jobs:
      - job: BuildJob
        displayName: 'Build Application'
        pool:
          vmImage: 'ubuntu-latest'
        timeoutInMinutes: 30
        steps:
          # ============================================
          # DIAGNOSTIC STEPS (only when system.debug=true)
          # ============================================
          - script: |
              echo "============================================"
              echo "  DIAGNOSTIC: System Information"
              echo "============================================"
              echo ""
              echo "--- OS Info ---"
              uname -a
              cat /etc/os-release
              echo ""
              echo "--- CPU ---"
              nproc
              lscpu | head -15
              echo ""
              echo "--- Memory ---"
              free -h
              echo ""
              echo "--- Disk ---"
              df -h
              echo ""
              echo "--- Network ---"
              hostname -I
              curl -s ifconfig.me && echo ""
              echo ""
              echo "--- Docker ---"
              docker --version 2>/dev/null || echo "Docker not available"
              echo ""
              echo "--- Current User ---"
              whoami
              id
            displayName: 'Diagnostic: System info'
            condition: eq(variables['system.debug'], 'true')

          - script: |
              echo "============================================"
              echo "  DIAGNOSTIC: Tool Versions"
              echo "============================================"
              echo ""
              echo "Node.js: $(node --version 2>/dev/null || echo 'not installed')"
              echo "npm: $(npm --version 2>/dev/null || echo 'not installed')"
              echo "Python: $(python3 --version 2>/dev/null || echo 'not installed')"
              echo "Java: $(java -version 2>&1 | head -1 || echo 'not installed')"
              echo "dotnet: $(dotnet --version 2>/dev/null || echo 'not installed')"
              echo "az cli: $(az --version 2>/dev/null | head -1 || echo 'not installed')"
              echo "git: $(git --version)"
              echo "curl: $(curl --version | head -1)"
            displayName: 'Diagnostic: Tool versions'
            condition: eq(variables['system.debug'], 'true')

          - script: |
              echo "============================================"
              echo "  DIAGNOSTIC: Pipeline Variables"
              echo "============================================"
              echo ""
              echo "Build.SourceBranch: $(Build.SourceBranch)"
              echo "Build.SourceBranchName: $(Build.SourceBranchName)"
              echo "Build.SourceVersion: $(Build.SourceVersion)"
              echo "Build.Reason: $(Build.Reason)"
              echo "Build.BuildNumber: $(Build.BuildNumber)"
              echo "Build.BuildId: $(Build.BuildId)"
              echo "Build.Repository.Name: $(Build.Repository.Name)"
              echo "Build.DefinitionName: $(Build.DefinitionName)"
              echo "Agent.Name: $(Agent.Name)"
              echo "Agent.MachineName: $(Agent.MachineName)"
              echo "Agent.OS: $(Agent.OS)"
              echo "Agent.OSArchitecture: $(Agent.OSArchitecture)"
              echo "Agent.Version: $(Agent.Version)"
              echo "Agent.BuildDirectory: $(Agent.BuildDirectory)"
              echo "Agent.WorkFolder: $(Agent.WorkFolder)"
              echo "Agent.TempDirectory: $(Agent.TempDirectory)"
              echo "Agent.ToolsDirectory: $(Agent.ToolsDirectory)"
              echo "System.DefaultWorkingDirectory: $(System.DefaultWorkingDirectory)"
              echo "Pipeline.Workspace: $(Pipeline.Workspace)"
            displayName: 'Diagnostic: Pipeline variables'
            condition: eq(variables['system.debug'], 'true')

          - script: |
              echo "============================================"
              echo "  DIAGNOSTIC: Workspace Contents"
              echo "============================================"
              echo ""
              echo "--- Source directory ---"
              ls -la $(Build.SourcesDirectory) | head -30
              echo ""
              echo "--- Working directory ---"
              ls -la $(System.DefaultWorkingDirectory) | head -30
              echo ""
              echo "--- Staging directory ---"
              ls -la $(Build.ArtifactStagingDirectory) 2>/dev/null || echo "(empty or not created)"
            displayName: 'Diagnostic: Workspace contents'
            condition: eq(variables['system.debug'], 'true')

          # ============================================
          # ACTUAL BUILD STEPS
          # ============================================
          - task: NodeTool@0
            displayName: 'Install Node.js $(nodeVersion)'
            inputs:
              versionSpec: '$(nodeVersion)'

          - script: |
              echo "Installing dependencies..."
              npm ci
              echo ""
              echo "Installed $(npm ls --depth=0 2>/dev/null | wc -l) top-level packages"
            displayName: 'Install dependencies'
            workingDirectory: '$(Build.SourcesDirectory)'

          - script: |
              npm run lint 2>&1 || true
            displayName: 'Run linter'
            workingDirectory: '$(Build.SourcesDirectory)'

          - script: |
              npm test -- --reporter mocha-junit-reporter \
                --reporter-options mochaFile=$(Common.TestResultsDirectory)/test-results.xml
            displayName: 'Run tests'
            workingDirectory: '$(Build.SourcesDirectory)'
            continueOnError: false

          - task: PublishTestResults@2
            displayName: 'Publish test results'
            inputs:
              testResultsFormat: 'JUnit'
              testResultsFiles: '**/test-results.xml'
              searchFolder: '$(Common.TestResultsDirectory)'
            condition: always()

          - script: |
              npm run build -- --configuration $(buildConfiguration)
            displayName: 'Build application'
            workingDirectory: '$(Build.SourcesDirectory)'

          # ============================================
          # POST-BUILD DIAGNOSTICS (only when system.debug=true)
          # ============================================
          - script: |
              echo "============================================"
              echo "  DIAGNOSTIC: Post-Build State"
              echo "============================================"
              echo ""
              echo "--- Build output size ---"
              du -sh $(Build.SourcesDirectory)/dist 2>/dev/null || echo "No dist directory"
              du -sh $(Build.SourcesDirectory)/build 2>/dev/null || echo "No build directory"
              echo ""
              echo "--- Disk usage after build ---"
              df -h
              echo ""
              echo "--- Memory after build ---"
              free -h
            displayName: 'Diagnostic: Post-build state'
            condition: eq(variables['system.debug'], 'true')

          # ============================================
          # PUBLISH ARTIFACTS
          # ============================================
          - script: |
              cp -r dist/* $(Build.ArtifactStagingDirectory)/ 2>/dev/null || \
              cp -r build/* $(Build.ArtifactStagingDirectory)/ 2>/dev/null || \
              echo "No build output directory found"
            displayName: 'Stage artifacts'
            workingDirectory: '$(Build.SourcesDirectory)'

          - script: |
              echo "=== Artifact contents ==="
              find $(Build.ArtifactStagingDirectory) -type f | head -30
              echo ""
              echo "=== Total artifact size ==="
              du -sh $(Build.ArtifactStagingDirectory)
            displayName: 'Diagnostic: Verify artifacts'
            condition: eq(variables['system.debug'], 'true')

          - task: PublishBuildArtifacts@1
            displayName: 'Publish artifacts'
            inputs:
              pathToPublish: '$(Build.ArtifactStagingDirectory)'
              artifactName: 'drop'

This template gives you zero-cost diagnostics. During normal runs, every diagnostic step is skipped (the condition fails). When something breaks, set system.debug to true, rerun, and get a complete picture of the environment, tool versions, variables, and workspace state -- all in a single run.


Common Issues and Troubleshooting

Issue 1: "No hosted parallelism has been purchased or granted"

Error message:

##[error]No hosted parallelism has been purchased or granted. To request a free parallelism grant, please fill out the following form https://aka.ms/azpipelines-parallelism-request

Cause: New Azure DevOps organizations must request free pipeline parallelism. Microsoft disabled automatic grants to prevent crypto mining abuse.

Fix: Submit the parallelism request form at the URL in the error message. Approval typically takes 2-3 business days. In the meantime, you can set up a self-hosted agent to unblock your team.

Issue 2: "The pipeline is not valid. Job BuildJob: Step DotNetCoreCLI input command: invalid value 'run'"

Error message:

/azure-pipelines.yml (Line: 42, Col: 9): The pipeline is not valid.
Job BuildJob: Step DotNetCoreCLI input command: invalid value 'run'.
Valid values: build, push, pack, publish, restore, test, custom

Cause: You used an invalid command value for the DotNetCoreCLI task. The task only supports specific commands. For dotnet run, use a script step instead.

Fix:

# Instead of this (invalid):
- task: DotNetCoreCLI@2
  inputs:
    command: 'run'
    projects: 'MyApp.csproj'

# Use this:
- script: dotnet run --project MyApp.csproj
  displayName: 'Run application'

Issue 3: "TF401019: The Git repository with name or identifier does not exist"

Error message:

remote: TF401019: The Git repository with name or identifier MyRepo does not exist
or you do not have permissions for the operation you are attempting.
fatal: repository 'https://dev.azure.com/org/project/_git/MyRepo/' not found
##[error]Git fetch failed with exit code: 128

Cause: The pipeline is trying to check out a repository that either does not exist, has been renamed, or the build service account lacks read permissions.

Fix:

  1. Verify the repository name in your resources.repositories section
  2. Go to Project Settings > Repositories > [Repo] > Security
  3. Grant "Read" permission to [Project Name] Build Service (org name)
  4. If using a multi-repo checkout, verify the ref matches an existing branch

Issue 4: "ERROR: Failed to download task: NuGetCommand version 2.x"

Error message:

##[error]Error: Failed to download task: NuGetCommand version 2.238.1
(7e2f3c4a-2e39-4355-8675-82a527570693). Verify the task and version are correct
and the agent can reach the server.
Could not reach the server https://vstsagenttools.blob.core.windows.net/

Cause: The self-hosted agent cannot reach Azure DevOps' task download servers. This happens behind corporate firewalls or proxies that block *.blob.core.windows.net.

Fix:

  1. Whitelist *.blob.core.windows.net and *.vsassets.io in your firewall
  2. If behind a proxy, configure the agent's .env file with VSTS_HTTP_PROXY (see the example after this list)
  3. Alternatively, pre-cache tasks by running the agent once on a network with full access
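For reference, a sketch of the proxy entries in the agent's .env file (in the agent's root directory; restart the agent service afterwards) -- the proxy URL is a placeholder:

# <agent-root>/.env
VSTS_HTTP_PROXY=http://proxy.internal.example:8080
VSTS_HTTPS_PROXY=http://proxy.internal.example:8080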

Issue 5: "Exit code 137" (Out of Memory Kill)

Error message:

##[error]Bash exited with code '137'.

Cause: Exit code 137 means the process was killed by the Linux OOM (Out of Memory) killer. The build or test step consumed more memory than available on the agent.

Fix:

  1. For Node.js builds, increase the heap size: NODE_OPTIONS=--max-old-space-size=4096
  2. Microsoft-hosted Linux and Windows agents have about 7 GB of RAM regardless of image; if that is not enough, switch to a self-hosted agent with more memory
  3. Reduce parallelism in test runners (--maxWorkers=2 for Jest)
  4. Split large builds into multiple jobs that run on separate agents

For example, raising the Node.js heap limit in the build step:

steps:
  - script: |
      export NODE_OPTIONS="--max-old-space-size=4096"
      npm run build
    displayName: 'Build with increased memory'

Best Practices

  • Enable system.debug as your first debugging step. Before adding random echo statements to your pipeline, queue a run with system.debug: true. Nine times out of ten, the verbose output tells you exactly what went wrong without any YAML changes.

  • Download raw logs for complex failures. The web UI truncates output and sometimes loses lines during timeouts. Raw log files are the source of truth. Keep them for post-mortem analysis.

  • Add conditional diagnostic steps to every pipeline. Use the pattern from the complete example above. Wrap diagnostic steps in condition: eq(variables['system.debug'], 'true') so they cost nothing during normal runs but give you full environmental context when debugging.

  • Pin your tool versions explicitly. Do not rely on latest for Node.js, Python, Java, or .NET versions. When Microsoft updates the default toolset on hosted agents, your pipeline will break at the worst possible time. Always specify exact versions: versionSpec: '18.19.0'.

  • Monitor task version changes. Subscribe to the Azure DevOps release notes or check the task changelogs periodically. When a built-in task gets a minor version bump, test your critical pipelines immediately rather than discovering the break during a production deploy.

  • Use the REST API for fleet-wide debugging. If you manage dozens of pipelines, do not click through the UI for each one. Script the REST API to find patterns: which pipelines are failing, on which agents, with which error messages. Automate this into a monitoring dashboard.

  • Check service connection expiry dates proactively. Set calendar reminders for when service principal secrets and certificates expire. The absolute worst time to discover an expired credential is during an emergency hotfix deployment.

  • Treat "retry and it works" as a bug, not a fix. If a pipeline fails intermittently and passes on retry, there is a flaky test, a race condition, or an infrastructure instability. Track these occurrences and fix the root cause. Intermittent failures erode trust in the pipeline and train developers to ignore real failures.

  • Keep pipeline YAML in version control with PR reviews. Pipeline changes should go through the same review process as application code. A bad YAML change can break every developer's workflow. Require at least one approval for changes to pipeline templates.

