Pipeline Debugging: Diagnosing Failed Builds
A practical guide to debugging Azure DevOps pipeline failures, covering log analysis, system.debug mode, YAML validation, agent diagnostics, template expansion, and programmatic build querying.
Overview
Pipeline failures are inevitable. What separates a team that ships from a team that stalls is how fast they can diagnose and fix a broken build. Azure DevOps gives you a surprisingly deep set of debugging tools -- verbose logs, diagnostic mode, REST APIs, template expansion previews -- but most engineers only ever stare at the red X and scroll through the summary page. This article covers every debugging technique I use in production, from reading raw logs effectively to querying build details programmatically.
Prerequisites
- An Azure DevOps organization and project with Pipelines enabled
- Working familiarity with YAML pipeline syntax (triggers, stages, jobs, steps)
- Access to edit pipeline variables and queue new runs
- Basic comfort with REST APIs and command-line tools (Azure CLI or PowerShell)
- An understanding of your agent pool configuration (Microsoft-hosted or self-hosted)
Reading Pipeline Logs Effectively
The most underrated debugging skill is reading logs properly. Most engineers glance at the error summary and start guessing. Stop guessing. The logs tell you exactly what happened.
Expanding Failed Tasks
When a pipeline fails, Azure DevOps collapses successful tasks and highlights the failed one. Click on it. But do not stop at the summary view. The summary truncates output and sometimes hides the actual root cause behind a generic error message.
Click the individual task that failed. You will see a timeline on the left and the log output on the right. The first error is usually the real one. Everything after it is cascading failures.
Downloading Raw Logs
The web UI truncates long lines and sometimes swallows output that arrived during a timeout. Always download the raw logs for complex failures.
- Open the failed pipeline run
- Click the three-dot menu (top right of the run summary)
- Select Download logs
You get a ZIP file with one log file per task. These are the complete, untruncated logs. I grep through these locally when the web UI is not giving me enough context.
# Unzip and search for the actual error
unzip 20260208.1-logs.zip -d build-logs
cd build-logs
# Find the real error -- ignore the noise
grep -rn "error" --include="*.txt" | grep -v "0 error(s)" | head -20
# Search for a specific failing test or module
grep -rn "FAILED" --include="*.txt"
grep -rn "exit code" --include="*.txt"
Timestamps Matter
Every log line has a timestamp. Use them. If a step took 45 minutes when it normally takes 3, that is your clue. Network timeouts, hung processes, and resource exhaustion all show up as unusually long task durations before the actual error appears.
The system.debug Variable
This is the single most useful debugging tool in Azure DevOps Pipelines, and half the engineers I work with have never used it.
Setting system.debug to true enables verbose logging for every task in the pipeline. Tasks emit detailed diagnostic information that is suppressed during normal runs: HTTP request/response details, environment variable resolution, file system operations, and internal decision points.
Enabling system.debug
Option 1: Queue with variable override
When you manually queue a run, click Variables before hitting Run. Add system.debug with value true. This enables verbose logging for that single run without modifying your YAML.
Option 2: Pipeline variable
Add it as a pipeline variable in the UI under Pipelines > Edit > Variables. Set it to true temporarily while debugging, then remove it.
Option 3: In your YAML
variables:
system.debug: true
I do not recommend leaving this in your YAML permanently. Verbose logs are 5-10x larger and slow down the log viewer. Use Option 1 for one-off debugging.
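If you find yourself toggling it often, one optional pattern (a sketch, not something Azure DevOps requires) is to surface the flag as a runtime parameter, so manual runs get a checkbox instead of a hand-typed variable:
# Sketch: expose verbose logging as a checkbox on manual runs
parameters:
- name: enableDebug
  displayName: 'Enable verbose (system.debug) logging'
  type: boolean
  default: false

variables:
  # lower() turns the boolean into the 'true'/'false' string the agent expects
  system.debug: ${{ lower(parameters.enableDebug) }}
Normal CI runs keep the default (false), so there is no log bloat unless someone explicitly ticks the box.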
What system.debug Reveals
With debug enabled, you will see output like this in every task:
##[debug]Evaluating condition for step: 'Build solution'
##[debug]Evaluating: succeeded()
##[debug]Evaluating succeeded:
##[debug]=> (Boolean)True
##[debug]Result: True
##[debug]Starting: Build solution
##[debug]Loading inputs
##[debug]Loading env
##[debug]Agent.BuildDirectory=D:\a\1
##[debug]Agent.HomeDirectory=D:\a
##[debug]Agent.RootDirectory=D:\a
##[debug]Agent.TempDirectory=D:\a\_temp
##[debug]Agent.ToolsDirectory=D:\a\_tool
##[debug]Agent.WorkFolder=D:\a
This is gold when you are trying to figure out why a condition evaluated the way it did, which directory a task is running in, or what environment variables were available at execution time.
The Pipeline Run UI
The pipeline run detail page has several tabs that most people ignore. Each one serves a specific debugging purpose.
Timeline Tab
Shows every stage, job, and step with duration and status. Look for:
- Steps that took abnormally long (potential timeout or resource issues)
- Steps that were skipped (condition evaluated to false -- why?)
- The gap between steps (agent provisioning delays)
Tasks Tab
Flat list of all tasks with their status. Useful when you have a multi-stage pipeline and want to see all failures at a glance without expanding each stage.
Artifacts Tab
Shows published artifacts. If your build succeeded but your deployment failed because the artifact was wrong, check here. Common issues: empty artifacts, missing files, wrong artifact name.
Tests Tab
If you publish test results, they appear here with pass/fail counts, individual test details, and failure messages. This is often more useful than scrolling through raw test runner output.
Common Failure Categories
After debugging thousands of pipeline failures, I have found they fall into a handful of categories. Knowing the category immediately narrows your search.
Build Errors
The code does not compile or the bundler fails. These are usually straightforward -- the error message tells you the file and line number. The tricky ones involve version mismatches between your local environment and the build agent.
error TS2307: Cannot find module '@company/shared-lib' or its corresponding type declarations.
This almost always means your package.json references a private package and the agent does not have access to the private feed. Check your .npmrc configuration and the pipeline's feed authentication step.
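If the package lives in an Azure Artifacts feed, a minimal sketch of the fix (assuming your repository already has a .npmrc pointing at the feed) is to authenticate before installing:
# Sketch: authenticate the agent against the private feed before npm ci
steps:
- task: npmAuthenticate@0
  displayName: 'Authenticate .npmrc against the private feed'
  inputs:
    workingFile: '.npmrc'   # path to the .npmrc that references the feed

- script: npm ci
  displayName: 'Install dependencies'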
Test Failures
Tests pass locally but fail in CI. The usual suspects:
- Hardcoded paths: Tests reference /Users/shane/project/... instead of relative paths
- Timing-dependent tests: Async operations that pass on your fast machine but time out on a shared agent
- Environment differences: Missing environment variables, different OS, different tool versions (a capture step is sketched after this list)
- Database state: Tests depend on data from a previous test run that does not exist on a clean agent
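A cheap way to pin down environment differences is to record the CI environment right next to the test run and diff it against your machine -- a sketch:
# Sketch: capture the CI environment alongside the test run
steps:
- script: |
    echo "--- Tool versions ---"
    node --version
    npm --version
    echo "--- Locale and timezone ---"
    locale
    date
    echo "--- Environment variable names (values omitted) ---"
    env | cut -d= -f1 | sort
  displayName: 'Debug: Capture test environment'

- script: npm test
  displayName: 'Run tests'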
Infrastructure Issues
The agent itself has problems. Disk full, out of memory, network connectivity lost, Docker daemon not running. These show up as cryptic errors:
##[error]The runner has received a shutdown signal. This can happen when the machine is being shutdown or the service is being stopped.
Or:
##[error]No space left on device
For self-hosted agents, check the machine health. For Microsoft-hosted agents, retry the build -- you will get a fresh VM.
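For self-hosted Linux agents, it can also help to warn before the build starts chewing through a nearly full disk -- a sketch of a guard step (the 10 GB threshold is arbitrary):
# Sketch: warn early when the agent is low on disk space
steps:
- script: |
    # AGENT_WORKFOLDER is the environment form of Agent.WorkFolder
    FREE_GB=$(df -BG --output=avail "$AGENT_WORKFOLDER" | tail -1 | tr -dc '0-9')
    echo "Free space in agent work folder: ${FREE_GB} GB"
    if [ "${FREE_GB:-0}" -lt 10 ]; then
      echo "##vso[task.logissue type=warning]Agent has less than 10 GB free disk space"
    fi
  displayName: 'Debug: Check agent disk space'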
Permission Errors
Service connections, feed access, deployment targets -- anything that requires authentication can fail silently or with unhelpful error messages.
##[error]Error: unable to get local issuer certificate
##[error]VS30063: You are not authorized to access https://dev.azure.com/org/project
These require checking service connection configurations, PAT expiration dates, and managed identity assignments.
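For VS30063-style errors, it is worth confirming what the pipeline's own identity can reach before blaming the task. A sketch using the job access token (assumes the build service identity has at least project-level read access):
# Sketch: probe the Azure DevOps REST API with the job's own token
steps:
- script: |
    STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
      -u ":${SYSTEM_ACCESSTOKEN}" \
      "$(System.CollectionUri)_apis/projects?api-version=7.1")
    echo "Project list returned HTTP ${STATUS}"
  displayName: 'Debug: Verify build identity access'
  env:
    SYSTEM_ACCESSTOKEN: $(System.AccessToken)   # secret, must be mapped explicitly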
Diagnosing Agent Issues
When a pipeline is stuck in the queue or fails immediately with agent errors, the problem is in the agent pool configuration.
Agent Capabilities and Demands
Every agent advertises capabilities (installed software, environment variables, system properties). Every pipeline job can specify demands. If no agent satisfies all demands, the job sits in the queue forever.
pool:
name: 'Self-Hosted Linux'
demands:
- docker
- Agent.OS -equals Linux
- node -gtVersion 18.0.0
To check capabilities: go to Organization Settings > Agent Pools > [Your Pool] > Agents > [Agent Name] > Capabilities.
If a job is stuck queuing, check the demands in the YAML against the capabilities of the agents in the pool. The mismatch is usually obvious once you look.
Pool Problems
Common pool issues:
- All agents offline: Self-hosted agent service crashed or VM deallocated
- All agents busy: Too many concurrent builds for the pool size
- Parallelism limit: Free-tier accounts get one parallel job; everything else queues
- Maintenance mode: An agent is online but marked for maintenance
# Check agent status via Azure CLI
az pipelines agent list --pool-id 1 --organization https://dev.azure.com/myorg --output table
Debugging YAML Syntax Errors
YAML is whitespace-sensitive and unforgiving. A single indentation error can break your pipeline in ways that produce unhelpful error messages.
Pre-Validation
Azure DevOps validates YAML when you save it in the web editor. Use this. Before committing a YAML change through a PR, paste it into the pipeline editor and click Validate. This catches:
- Indentation errors
- Invalid property names
- Missing required fields
- Invalid template references
Schema Linting Locally
Install the Azure Pipelines VS Code extension or use a YAML linter with the Azure Pipelines schema:
# Install yamllint
pip install yamllint
# Lint your pipeline file
yamllint -d "{extends: default, rules: {line-length: {max: 200}}}" azure-pipelines.yml
For deeper validation, the Azure DevOps REST API has a preview endpoint that compiles the pipeline -- templates included -- and returns the fully expanded YAML without queuing a run (the az pipelines CLI itself does not offer a dry-run flag):
# Preview a pipeline run without executing it; the response contains the final YAML
POST https://dev.azure.com/{org}/{project}/_apis/pipelines/{pipelineId}/preview?api-version=7.1-preview.1
{ "previewRun": true }
Common YAML Mistakes
# WRONG: Mixing tabs and spaces
steps:
- script: echo hello
	displayName: 'Say Hello'   # This line uses a tab -- will break

# WRONG: Incorrect indentation of multi-line script
steps:
- script: |
echo "line 1"
echo "line 2"
# The script lines need to be indented further:

# RIGHT:
steps:
- script: |
    echo "line 1"
    echo "line 2"
  displayName: 'Multi-line script'
Variable Resolution Debugging
Variable problems are among the most common and most frustrating pipeline failures. Variables can come from the YAML file, pipeline settings, variable groups, key vault, template parameters, and runtime expressions. When the wrong value shows up (or no value at all), you need a systematic approach.
Printing Variables
Add a diagnostic step that dumps all the variables you care about:
steps:
- script: |
echo "Build.SourceBranch: $(Build.SourceBranch)"
echo "Build.Reason: $(Build.Reason)"
echo "System.PullRequest.TargetBranch: $(System.PullRequest.TargetBranch)"
echo "MY_CUSTOM_VAR: $(MY_CUSTOM_VAR)"
echo "Build.BuildNumber: $(Build.BuildNumber)"
displayName: 'Debug: Print variables'
For secret variables, you cannot print them directly (Azure DevOps masks them with ***). But you can verify they are set:
steps:
- script: |
if [ -z "$MY_SECRET" ]; then
echo "MY_SECRET is empty or not set!"
exit 1
else
echo "MY_SECRET is set (length: ${#MY_SECRET})"
fi
displayName: 'Debug: Verify secret is set'
env:
MY_SECRET: $(MySecretVariable)
Compile-Time vs. Runtime Expressions
This trips up even experienced engineers. There are three expression syntaxes:
- ${{ variables.myVar }} -- evaluated at compile time (template expansion)
- $(myVar) -- evaluated at run time (macro expansion)
- $[variables.myVar] -- evaluated at run time (runtime expression)
If you use ${{ }} for a variable that is only set at runtime (like an output variable from a previous job), it will be empty. This is not a bug. It is the evaluation order.
# WRONG: This will be empty because outputVar is set at runtime
steps:
- script: echo ${{ variables.outputVar }}
# RIGHT: Use runtime syntax
steps:
- script: echo $(outputVar)
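If the value is an output variable from a previous job, the macro syntax alone is not enough either -- you have to map it into the consuming job with a runtime expression. A minimal sketch:
# Sketch: consuming a job output variable in a later job
jobs:
- job: A
  steps:
  - script: echo "##vso[task.setvariable variable=buildTag;isOutput=true]v1.2.3"
    name: setTag   # the step name is part of the output variable path

- job: B
  dependsOn: A
  variables:
    buildTag: $[ dependencies.A.outputs['setTag.buildTag'] ]
  steps:
  - script: echo "Tag produced by job A: $(buildTag)"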
Task Version Issues and Pinning
Tasks in Azure DevOps are versioned. When you write task: NodeTool@0, you are using the major version 0 of the Node.js Tool Installer task. Minor and patch versions are applied automatically by Microsoft.
This means your pipeline can break without any change to your code or YAML. Microsoft pushes a patch to a task, and suddenly your build fails.
Identifying Task Version Issues
When a pipeline breaks and nobody changed the code or YAML, check the task versions:
- Open the failed run
- Click the failed task
- Look for the version string in the log header:
##[section]Starting: Use Node 18.x
==============================================================================
Task : Node.js tool installer
Description : Finds or downloads and caches the specified version spec of Node.js
Version : 0.238.1
Author : Microsoft Corporation
==============================================================================
Compare this version with a previous successful run. If the version changed, that is your culprit.
Pinning Tasks
For critical pipelines, pin the task to a specific major version and file issues when minor version updates break things. You cannot pin minor versions in YAML -- Azure DevOps does not support it. But you can monitor changes through the task changelog.
# This pins to major version 2 -- minor/patch updates still apply
- task: DotNetCoreCLI@2
inputs:
command: 'build'
# If a task update breaks you, the workaround is to use
# a script step instead of the built-in task
- script: dotnet build --configuration Release
displayName: 'Build (pinned behavior)'
Debugging Template Expansion
When you use templates extensively, the YAML that actually runs is different from the YAML in your repository. Template parameters get substituted, conditional blocks get resolved, loops get unrolled. Debugging a failure in the expanded YAML requires seeing what was actually generated.
Viewing the Expanded YAML
Azure DevOps shows you the expanded YAML for any run:
- Open the pipeline run
- Click the three-dot menu
- Select Download full YAML (or View YAML depending on your version)
This gives you the fully resolved YAML after all template expansion. It is often hundreds of lines longer than your source file. Search through it to verify that template parameters were substituted correctly and conditional blocks resolved the way you expected.
Common Template Expansion Issues
# Template parameter type mismatch
# If the template declares a boolean parameter and you pass a string,
# the condition might not evaluate correctly
# In template:
parameters:
- name: runTests
type: boolean
default: true
# WRONG: Passing a string instead of boolean
jobs:
- template: build.yml
parameters:
runTests: 'true' # This is a string, not a boolean
# RIGHT:
jobs:
- template: build.yml
parameters:
runTests: true
Artifact and Caching Problems
Pipeline artifacts and caching are common sources of subtle failures. The build succeeds, but the artifact is incomplete, or a stale cache causes the wrong version to deploy.
Artifact Debugging
# Add a step to inspect the artifact contents before publishing
- script: |
echo "=== Artifact contents ==="
find $(Build.ArtifactStagingDirectory) -type f | head -50
echo "=== Total size ==="
du -sh $(Build.ArtifactStagingDirectory)
displayName: 'Debug: Inspect artifact'
- task: PublishBuildArtifacts@1
inputs:
pathToPublish: '$(Build.ArtifactStagingDirectory)'
artifactName: 'drop'
Cache Key Mismatches
The Cache task uses a key to determine whether to restore a cached directory. If the key does not change when it should (or changes when it should not), you get stale or missing caches.
- task: Cache@2
inputs:
key: 'npm | "$(Agent.OS)" | package-lock.json'
path: '$(Pipeline.Workspace)/.npm'
displayName: 'Cache npm packages'
# If package-lock.json changed but the cache was not invalidated,
# check that the file path is correct relative to the repository root.
# A common mistake is specifying package-lock.json when it is
# actually in a subdirectory like src/package-lock.json.
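To make cache behavior visible in the logs, the Cache task can record whether it got an exact hit in a variable of your choosing -- a sketch:
# Sketch: surface cache hit/miss in the logs via cacheHitVar
- task: Cache@2
  inputs:
    key: 'npm | "$(Agent.OS)" | package-lock.json'
    path: '$(Pipeline.Workspace)/.npm'
    cacheHitVar: 'NPM_CACHE_RESTORED'
  displayName: 'Cache npm packages'

- script: echo "Exact cache hit: $(NPM_CACHE_RESTORED)"
  displayName: 'Debug: Report cache status'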
Timeout and Resource Limit Issues
Job Timeouts
The default job timeout is 60 minutes. If your build takes longer, it gets killed with no useful error message.
jobs:
- job: Build
timeoutInMinutes: 120 # Increase for long builds
cancelTimeoutInMinutes: 5 # Grace period for cleanup on cancel
steps:
- script: npm run build
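Individual steps also accept timeoutInMinutes, which is often more useful for pinning down which operation hangs instead of waiting for the whole job to be killed -- a sketch (the script name is a placeholder):
steps:
- script: npm run integration-tests   # hypothetical long-running step
  displayName: 'Integration tests'
  timeoutInMinutes: 20   # kill just this step if it exceeds 20 minutes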
Agent Resource Limits
Microsoft-hosted agents have fixed resources (typically 2 CPUs, 7 GB RAM, 14 GB SSD). If your build exceeds these, you get out-of-memory kills or disk space errors.
# Add diagnostic steps to monitor resource usage
df -h # Disk space
free -m # Memory (Linux agents)
nproc # CPU count
If you consistently hit resource limits on Microsoft-hosted agents, switch to self-hosted agents where you control the hardware.
Service Connection and Authentication Failures
Service connections are the most common source of "it worked yesterday" failures. Tokens expire, certificates rotate, managed identity permissions change.
Diagnosing Service Connection Issues
# Test a service connection by making a simple API call
- task: AzureCLI@2
inputs:
azureSubscription: 'MyServiceConnection'
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
echo "Logged in as:"
az account show --query "{name:name, id:id, tenantId:tenantId}" -o table
echo "Subscription access verified."
displayName: 'Debug: Verify Azure service connection'
Common service connection failures:
- Secret expired: Service principal client secrets have expiration dates. Check the App Registration in Azure AD (a sketch for checking this from a pipeline follows this list).
- Certificate rotated: If using certificate-based auth, the certificate might have been renewed without updating the service connection.
- Permissions changed: The service principal lost access to the target resource group or subscription.
- Federated credential misconfigured: Workload identity federation requires the federated credential's issuer and subject claim to match exactly what Azure DevOps presents for the service connection.
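One hedged way to check secret expiry directly from a pipeline -- a sketch that assumes the connection uses a client secret and that its identity (or you, running the same command locally after az login) is allowed to read the app registration in Entra ID:
# Sketch: list credential expiry dates for the connection's service principal
- task: AzureCLI@2
  displayName: 'Debug: Check service principal secret expiry'
  inputs:
    azureSubscription: 'MyServiceConnection'
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    addSpnToEnvironment: true      # exposes servicePrincipalId to the script
    inlineScript: |
      # Requires permission to read the app registration; otherwise run this locally
      az ad app credential list --id "$servicePrincipalId" \
        --query "[].endDateTime" -o tsv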
Using the REST API to Query Build Details
When the UI is not enough, use the Azure DevOps REST API to query build details programmatically. This is particularly useful for automated monitoring and for debugging issues across many builds.
# Get the details of a specific build
curl -s -u ":$(PAT)" \
"https://dev.azure.com/{org}/{project}/_apis/build/builds/{buildId}?api-version=7.1" \
| python -m json.tool
# Get the timeline (all tasks, durations, statuses) for a build
curl -s -u ":$(PAT)" \
"https://dev.azure.com/{org}/{project}/_apis/build/builds/{buildId}/timeline?api-version=7.1" \
| python -m json.tool
# List recent failed builds for a specific pipeline
curl -s -u ":$(PAT)" \
"https://dev.azure.com/{org}/{project}/_apis/build/builds?definitions={definitionId}&statusFilter=failed&\$top=10&api-version=7.1" \
| python -m json.tool
You can also use Node.js to build a monitoring script:
var https = require("https");
var org = "myorg";
var project = "myproject";
var pat = process.env.AZURE_DEVOPS_PAT;
var definitionId = 42;
var options = {
hostname: "dev.azure.com",
path: "/" + org + "/" + project + "/_apis/build/builds?definitions=" + definitionId + "&statusFilter=failed&$top=5&api-version=7.1",
headers: {
"Authorization": "Basic " + Buffer.from(":" + pat).toString("base64")
}
};
https.get(options, function(res) {
var body = "";
res.on("data", function(chunk) {
body += chunk;
});
res.on("end", function() {
var data = JSON.parse(body);
data.value.forEach(function(build) {
console.log("Build " + build.buildNumber + " failed at " + build.finishTime);
console.log(" Reason: " + build.reason);
console.log(" Source: " + build.sourceBranch);
console.log(" URL: " + build._links.web.href);
console.log("");
});
});
});
Retry vs. Re-Run
Azure DevOps gives you two options when a build fails: Retry and Re-run. They are not the same.
Retry (Rerun Failed Jobs)
Retries only the failed jobs. Successful jobs are not re-executed. This is useful when:
- The failure was transient (network timeout, flaky test, agent hiccup)
- Earlier stages (build, unit tests) passed and you do not want to waste time rebuilding
- You want to preserve the same source commit and variables
Re-Run (Run New)
Queues an entirely new pipeline run. Every stage, job, and step runs from scratch. Use this when:
- You made a YAML change and want to pick it up
- You changed pipeline variables
- The failure might have been caused by a stale workspace or cached state
- You want a clean run with fresh agent VMs
My rule of thumb: retry once for transient failures. If the retry fails, re-run from scratch. If the re-run fails, it is a real problem that requires investigation.
Pipeline Diagnostic Mode
Azure DevOps has a built-in diagnostic mode that goes beyond system.debug. You can enable it by setting multiple diagnostic variables:
variables:
system.debug: true
agent.diagnostic: true
Or when queuing a run manually, set these variables:
| Variable | Effect |
|---|---|
| system.debug | Verbose task output, condition evaluation details |
| agent.diagnostic | Agent-level diagnostics, capability resolution, job dispatch details |
| Agent.TempDirectory cleanup logs | Shows what gets cleaned between jobs |
With both enabled, you get a complete picture of what the agent did, from receiving the job to cleaning up the workspace.
Complete Working Example
Here is a debug-friendly pipeline template that includes diagnostic steps. These steps activate only when system.debug is true, so they add zero overhead to normal runs.
# azure-pipelines.yml
# Debug-friendly pipeline with conditional diagnostics
trigger:
branches:
include:
- main
- release/*
variables:
buildConfiguration: 'Release'
nodeVersion: '18.x'
stages:
- stage: Build
displayName: 'Build and Test'
jobs:
- job: BuildJob
displayName: 'Build Application'
pool:
vmImage: 'ubuntu-latest'
timeoutInMinutes: 30
steps:
# ============================================
# DIAGNOSTIC STEPS (only when system.debug=true)
# ============================================
- script: |
echo "============================================"
echo " DIAGNOSTIC: System Information"
echo "============================================"
echo ""
echo "--- OS Info ---"
uname -a
cat /etc/os-release
echo ""
echo "--- CPU ---"
nproc
lscpu | head -15
echo ""
echo "--- Memory ---"
free -h
echo ""
echo "--- Disk ---"
df -h
echo ""
echo "--- Network ---"
hostname -I
curl -s ifconfig.me && echo ""
echo ""
echo "--- Docker ---"
docker --version 2>/dev/null || echo "Docker not available"
echo ""
echo "--- Current User ---"
whoami
id
displayName: 'Diagnostic: System info'
condition: eq(variables['system.debug'], 'true')
- script: |
echo "============================================"
echo " DIAGNOSTIC: Tool Versions"
echo "============================================"
echo ""
echo "Node.js: $(node --version 2>/dev/null || echo 'not installed')"
echo "npm: $(npm --version 2>/dev/null || echo 'not installed')"
echo "Python: $(python3 --version 2>/dev/null || echo 'not installed')"
echo "Java: $(java -version 2>&1 | head -1 || echo 'not installed')"
echo "dotnet: $(dotnet --version 2>/dev/null || echo 'not installed')"
echo "az cli: $(az --version 2>/dev/null | head -1 || echo 'not installed')"
echo "git: $(git --version)"
echo "curl: $(curl --version | head -1)"
displayName: 'Diagnostic: Tool versions'
condition: eq(variables['system.debug'], 'true')
- script: |
echo "============================================"
echo " DIAGNOSTIC: Pipeline Variables"
echo "============================================"
echo ""
echo "Build.SourceBranch: $(Build.SourceBranch)"
echo "Build.SourceBranchName: $(Build.SourceBranchName)"
echo "Build.SourceVersion: $(Build.SourceVersion)"
echo "Build.Reason: $(Build.Reason)"
echo "Build.BuildNumber: $(Build.BuildNumber)"
echo "Build.BuildId: $(Build.BuildId)"
echo "Build.Repository.Name: $(Build.Repository.Name)"
echo "Build.DefinitionName: $(Build.DefinitionName)"
echo "Agent.Name: $(Agent.Name)"
echo "Agent.MachineName: $(Agent.MachineName)"
echo "Agent.OS: $(Agent.OS)"
echo "Agent.OSArchitecture: $(Agent.OSArchitecture)"
echo "Agent.Version: $(Agent.Version)"
echo "Agent.BuildDirectory: $(Agent.BuildDirectory)"
echo "Agent.WorkFolder: $(Agent.WorkFolder)"
echo "Agent.TempDirectory: $(Agent.TempDirectory)"
echo "Agent.ToolsDirectory: $(Agent.ToolsDirectory)"
echo "System.DefaultWorkingDirectory: $(System.DefaultWorkingDirectory)"
echo "Pipeline.Workspace: $(Pipeline.Workspace)"
displayName: 'Diagnostic: Pipeline variables'
condition: eq(variables['system.debug'], 'true')
- script: |
echo "============================================"
echo " DIAGNOSTIC: Workspace Contents"
echo "============================================"
echo ""
echo "--- Source directory ---"
ls -la $(Build.SourcesDirectory) | head -30
echo ""
echo "--- Working directory ---"
ls -la $(System.DefaultWorkingDirectory) | head -30
echo ""
echo "--- Staging directory ---"
ls -la $(Build.ArtifactStagingDirectory) 2>/dev/null || echo "(empty or not created)"
displayName: 'Diagnostic: Workspace contents'
condition: eq(variables['system.debug'], 'true')
# ============================================
# ACTUAL BUILD STEPS
# ============================================
- task: NodeTool@0
displayName: 'Install Node.js $(nodeVersion)'
inputs:
versionSpec: '$(nodeVersion)'
- script: |
echo "Installing dependencies..."
npm ci
echo ""
echo "Installed $(npm ls --depth=0 2>/dev/null | wc -l) top-level packages"
displayName: 'Install dependencies'
workingDirectory: '$(Build.SourcesDirectory)'
- script: |
npm run lint 2>&1 || true
displayName: 'Run linter'
workingDirectory: '$(Build.SourcesDirectory)'
- script: |
npm test -- --reporter mocha-junit-reporter \
--reporter-options mochaFile=$(Common.TestResultsDirectory)/test-results.xml
displayName: 'Run tests'
workingDirectory: '$(Build.SourcesDirectory)'
continueOnError: false
- task: PublishTestResults@2
displayName: 'Publish test results'
inputs:
testResultsFormat: 'JUnit'
testResultsFiles: '**/test-results.xml'
searchFolder: '$(Common.TestResultsDirectory)'
condition: always()
- script: |
npm run build -- --configuration $(buildConfiguration)
displayName: 'Build application'
workingDirectory: '$(Build.SourcesDirectory)'
# ============================================
# POST-BUILD DIAGNOSTICS (only when system.debug=true)
# ============================================
- script: |
echo "============================================"
echo " DIAGNOSTIC: Post-Build State"
echo "============================================"
echo ""
echo "--- Build output size ---"
du -sh $(Build.SourcesDirectory)/dist 2>/dev/null || echo "No dist directory"
du -sh $(Build.SourcesDirectory)/build 2>/dev/null || echo "No build directory"
echo ""
echo "--- Disk usage after build ---"
df -h
echo ""
echo "--- Memory after build ---"
free -h
displayName: 'Diagnostic: Post-build state'
condition: eq(variables['system.debug'], 'true')
# ============================================
# PUBLISH ARTIFACTS
# ============================================
- script: |
cp -r dist/* $(Build.ArtifactStagingDirectory)/ 2>/dev/null || \
cp -r build/* $(Build.ArtifactStagingDirectory)/ 2>/dev/null || \
echo "No build output directory found"
displayName: 'Stage artifacts'
workingDirectory: '$(Build.SourcesDirectory)'
- script: |
echo "=== Artifact contents ==="
find $(Build.ArtifactStagingDirectory) -type f | head -30
echo ""
echo "=== Total artifact size ==="
du -sh $(Build.ArtifactStagingDirectory)
displayName: 'Diagnostic: Verify artifacts'
condition: eq(variables['system.debug'], 'true')
- task: PublishBuildArtifacts@1
displayName: 'Publish artifacts'
inputs:
pathToPublish: '$(Build.ArtifactStagingDirectory)'
artifactName: 'drop'
This template gives you zero-cost diagnostics. During normal runs, every diagnostic step is skipped (the condition fails). When something breaks, set system.debug to true, rerun, and get a complete picture of the environment, tool versions, variables, and workspace state -- all in a single run.
Common Issues and Troubleshooting
Issue 1: "No hosted parallelism has been purchased or granted"
Error message:
##[error]No hosted parallelism has been purchased or granted. To request a free parallelism grant, please fill out the following form https://aka.ms/azpipelines-parallelism-request
Cause: New Azure DevOps organizations must request free pipeline parallelism. Microsoft disabled automatic grants to prevent crypto mining abuse.
Fix: Submit the parallelism request form at the URL in the error message. Approval typically takes 2-3 business days. In the meantime, you can set up a self-hosted agent to unblock your team.
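While the request is pending, pointing the pipeline at a self-hosted pool is enough to keep builds running -- a sketch (the pool name is whatever you registered the agent into):
pool:
  name: 'Default'   # hypothetical self-hosted pool with at least one registered agent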
Issue 2: "The pipeline is not valid. Job BuildJob: Step DotNetCoreCLI input command: invalid value 'run'"
Error message:
/azure-pipelines.yml (Line: 42, Col: 9): The pipeline is not valid.
Job BuildJob: Step DotNetCoreCLI input command: invalid value 'run'.
Valid values: build, push, pack, publish, restore, test, custom
Cause: You used an invalid command value for the DotNetCoreCLI task. The task only supports specific commands. For dotnet run, use a script step instead.
Fix:
# Instead of this (invalid):
- task: DotNetCoreCLI@2
inputs:
command: 'run'
projects: 'MyApp.csproj'
# Use this:
- script: dotnet run --project MyApp.csproj
displayName: 'Run application'
Issue 3: "TF401019: The Git repository with name or identifier does not exist"
Error message:
remote: TF401019: The Git repository with name or identifier MyRepo does not exist
or you do not have permissions for the operation you are attempting.
fatal: repository 'https://dev.azure.com/org/project/_git/MyRepo/' not found
##[error]Git fetch failed with exit code: 128
Cause: The pipeline is trying to check out a repository that either does not exist, has been renamed, or the build service account lacks read permissions.
Fix:
- Verify the repository name in your resources.repositories section (a minimal sketch follows this list)
- Go to Project Settings > Repositories > [Repo] > Security
- Grant "Read" permission to [Project Name] Build Service (org name)
- If using a multi-repo checkout, verify the ref matches an existing branch
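For reference, a minimal multi-repo checkout sketch (all names are placeholders) where the alias, project, and repository name must line up exactly:
resources:
  repositories:
  - repository: shared              # alias used by the checkout step below
    type: git                       # Azure Repos Git repository
    name: MyProject/MyRepo          # project/repository -- must match exactly
    ref: refs/heads/main            # must be an existing branch

steps:
- checkout: self
- checkout: shared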
Issue 4: "ERROR: Failed to download task: NuGetCommand version 2.x"
Error message:
##[error]Error: Failed to download task: NuGetCommand version 2.238.1
(7e2f3c4a-2e39-4355-8675-82a527570693). Verify the task and version are correct
and the agent can reach the server.
Could not reach the server https://vstsagenttools.blob.core.windows.net/
Cause: The self-hosted agent cannot reach Azure DevOps' task download servers. This happens behind corporate firewalls or proxies that block *.blob.core.windows.net.
Fix:
- Whitelist *.blob.core.windows.net and *.vsassets.io in your firewall
- If behind a proxy, configure the agent's .env file with VSTS_HTTP_PROXY
- Alternatively, pre-cache tasks by running the agent once on a network with full access
Issue 5: "Exit code 137" (Out of Memory Kill)
Error message:
##[error]Bash exited with code '137'.
Cause: Exit code 137 means the process was killed by the Linux OOM (Out of Memory) killer. The build or test step consumed more memory than available on the agent.
Fix:
- For Node.js builds, increase the heap size: NODE_OPTIONS=--max-old-space-size=4096
- For Microsoft-hosted agents, consider using ubuntu-latest (7 GB RAM) or switching to self-hosted with more RAM
- Reduce parallelism in test runners (--maxWorkers=2 for Jest)
- Split large builds into multiple jobs that run on separate agents
steps:
- script: |
export NODE_OPTIONS="--max-old-space-size=4096"
npm run build
displayName: 'Build with increased memory'
Best Practices
Enable system.debug as your first debugging step. Before adding random echo statements to your pipeline, queue a run with system.debug: true. Nine times out of ten, the verbose output tells you exactly what went wrong without any YAML changes.
Download raw logs for complex failures. The web UI truncates output and sometimes loses lines during timeouts. Raw log files are the source of truth. Keep them for post-mortem analysis.
Add conditional diagnostic steps to every pipeline. Use the pattern from the complete example above. Wrap diagnostic steps in condition: eq(variables['system.debug'], 'true') so they cost nothing during normal runs but give you full environmental context when debugging.
Pin your tool versions explicitly. Do not rely on latest for Node.js, Python, Java, or .NET versions. When Microsoft updates the default toolset on hosted agents, your pipeline will break at the worst possible time. Always specify exact versions: versionSpec: '18.19.0'.
Monitor task version changes. Subscribe to the Azure DevOps release notes or check the task changelogs periodically. When a built-in task gets a minor version bump, test your critical pipelines immediately rather than discovering the break during a production deploy.
Use the REST API for fleet-wide debugging. If you manage dozens of pipelines, do not click through the UI for each one. Script the REST API to find patterns: which pipelines are failing, on which agents, with which error messages. Automate this into a monitoring dashboard.
Check service connection expiry dates proactively. Set calendar reminders for when service principal secrets and certificates expire. The absolute worst time to discover an expired credential is during an emergency hotfix deployment.
Treat "retry and it works" as a bug, not a fix. If a pipeline fails intermittently and passes on retry, there is a flaky test, a race condition, or an infrastructure instability. Track these occurrences and fix the root cause. Intermittent failures erode trust in the pipeline and train developers to ignore real failures.
Keep pipeline YAML in version control with PR reviews. Pipeline changes should go through the same review process as application code. A bad YAML change can break every developer's workflow. Require at least one approval for changes to pipeline templates.
