Pipeline Performance: Caching and Optimization Techniques
A practical guide to optimizing Azure DevOps pipeline performance with dependency caching, Docker layer caching, parallel execution, shallow checkout, and build analytics.
Overview
Slow pipelines are not just an annoyance — they are a direct tax on engineering velocity, developer focus, and infrastructure cost. Azure DevOps provides multiple mechanisms for caching dependencies, parallelizing work, reducing checkout time, and analyzing bottlenecks, but most teams only scratch the surface. This article covers every major optimization technique I have used in production to cut pipeline durations from 25+ minutes down to under 6 minutes on real Node.js and .NET projects.
Prerequisites
- An Azure DevOps organization with Pipelines enabled
- Familiarity with YAML pipeline syntax (triggers, stages, jobs, steps)
- A project with dependency management (npm, NuGet, pip, or similar)
- Basic understanding of Docker builds (for the Docker caching sections)
- Access to pipeline analytics in Azure DevOps (the Analytics tab on a pipeline's page under Pipelines)
Why Pipeline Speed Matters
I have seen teams shrug off a 20-minute pipeline as "just how CI works." That attitude costs more than people realize.
Developer feedback loops. A developer pushes a commit and opens a pull request. If the pipeline takes 18 minutes, they context-switch to something else. When the results come back — maybe a test failure — they have to reload the original context. Research consistently shows that context switching adds 15-25 minutes of recovery time. A fast pipeline that returns results in 4 minutes keeps developers in flow.
Cost. Microsoft-hosted agents bill by the minute. If you are running 200 pipeline executions per day at 18 minutes each, that is 3,600 agent-minutes daily. Cut that to 5 minutes and you drop to 1,000 agent-minutes — a 72% cost reduction. On self-hosted agents, faster pipelines mean fewer agents to maintain.
Merge velocity. Slow pipelines create PR bottlenecks. When the build queue backs up, developers stack PRs and merge conflicts multiply. Fast pipelines keep the queue moving.
The goal is not to obsess over shaving seconds. The goal is to make the pipeline fast enough that developers never have a reason to skip it.
The Cache Task
Azure DevOps provides the Cache@2 task for persisting directories between pipeline runs. The task stores a compressed archive of the specified path, keyed by a hash you define. On subsequent runs, if the key matches, the cached directory is restored instead of being rebuilt from scratch.
steps:
- task: Cache@2
displayName: 'Cache npm dependencies'
inputs:
key: 'npm | "$(Agent.OS)" | package-lock.json'
restoreKeys: |
npm | "$(Agent.OS)"
path: '$(Pipeline.Workspace)/.npm'
cacheHitVar: 'CACHE_RESTORED'
# CACHE_RESTORED is set to 'true' on a cache hit (requires the cacheHitVar input above)
The cache is scoped to the pipeline and branch by default. A cache created on the default branch (main) is available as a fallback to feature branches, but a cache created on a feature branch is not available to main or to other branches. This prevents feature branch experiments from polluting the main cache.
Key things to know:
- Cache size limit: 10 GB per cache entry. Most dependency caches are well under this.
- Retention: Caches expire after 7 days of not being accessed.
- Storage: Caches are stored in Azure Artifacts storage in your organization's region.
- Scope: Caches are scoped to project + pipeline + branch (with fallback to the default branch).
Cache Keys and Restore Keys
The cache key is a pipe-delimited string. Each segment is either a literal string or a file path that gets hashed. This is where most people get the design wrong.
Hash-Based Keys
# Good: precise key based on lockfile content
key: 'npm | "$(Agent.OS)" | package-lock.json'
# Also good: multiple lockfiles in a monorepo
key: 'npm | "$(Agent.OS)" | packages/**/package-lock.json'
# Bad: no file hash means the cache never invalidates
key: 'npm | "$(Agent.OS)"'
When you include a file path like package-lock.json, Azure DevOps hashes the file contents and uses that hash as part of the key. When the lockfile changes (new dependencies added), the hash changes, and the old cache is bypassed.
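A literal segment can also act as a manual version salt. This is a minimal sketch of that pattern, reusing the npm cache path from above; the "v2" string is arbitrary, and bumping it forces a completely fresh cache (useful after a corrupted cache or a Node upgrade).

- task: Cache@2
  inputs:
    key: 'npm | v2 | "$(Agent.OS)" | package-lock.json'
    restoreKeys: |
      npm | v2 | "$(Agent.OS)"
    path: '$(Pipeline.Workspace)/.npm'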
Fallback with Restore Keys
Restore keys provide fallback behavior when the exact key does not match. The pipeline tries the exact key first, then each restore key in order, using the most recently created cache that matches.
- task: Cache@2
inputs:
key: 'npm | "$(Agent.OS)" | package-lock.json'
restoreKeys: |
npm | "$(Agent.OS)"
npm
path: '$(Pipeline.Workspace)/.npm'
With this configuration:
- First, try to find a cache matching the exact hash of package-lock.json
- If not found, try any cache matching npm | "$(Agent.OS)" (same OS, any lockfile version)
- If not found, try any cache matching npm (any OS, any lockfile version)
- If nothing matches, proceed without cache
A partial cache hit with a restore key is still valuable. Restoring 95% of your dependencies from cache and installing the remaining 5% is far faster than installing everything from scratch.
Caching npm Dependencies
There are two strategies for caching npm, and the choice matters.
Strategy 1: Cache the npm Cache Folder
This caches npm's internal cache directory, not node_modules itself. npm still runs npm ci on every build, but it pulls packages from the local cache instead of the registry.
variables:
npm_config_cache: $(Pipeline.Workspace)/.npm
steps:
- task: Cache@2
displayName: 'Cache npm'
inputs:
key: 'npm | "$(Agent.OS)" | package-lock.json'
restoreKeys: |
npm | "$(Agent.OS)"
path: $(npm_config_cache)
- script: npm ci
displayName: 'Install dependencies'
Typical savings: npm ci drops from ~45 seconds to ~15 seconds on a medium project (150 dependencies).
Strategy 2: Cache node_modules Directly
This skips the install step entirely when the cache hits. It is faster but carries a risk — if node_modules gets into a corrupted state, the corruption persists until the cache key changes.
steps:
- task: Cache@2
displayName: 'Cache node_modules'
inputs:
key: 'node_modules | "$(Agent.OS)" | package-lock.json'
path: 'node_modules'
cacheHitVar: 'CACHE_RESTORED'
- script: npm ci
displayName: 'Install dependencies'
condition: ne(variables.CACHE_RESTORED, 'true')
Typical savings: The entire install step is skipped. Saves 30-60 seconds depending on project size.
I recommend Strategy 1 for most teams. Strategy 2 is faster but harder to debug when things go wrong. If you do use Strategy 2, always use npm ci (not npm install) when the cache misses, because npm ci deletes node_modules and does a clean install from the lockfile.
Caching NuGet Packages
The pattern for NuGet is similar. Cache the global packages folder and let dotnet restore pull from it.
variables:
NUGET_PACKAGES: $(Pipeline.Workspace)/.nuget/packages
steps:
- task: Cache@2
displayName: 'Cache NuGet packages'
inputs:
key: 'nuget | "$(Agent.OS)" | **/packages.lock.json'
restoreKeys: |
nuget | "$(Agent.OS)"
path: $(NUGET_PACKAGES)
- task: DotNetCoreCLI@2
displayName: 'Restore'
inputs:
command: 'restore'
projects: '**/*.csproj'
One gotcha: you need to enable NuGet lock files for this to work. Add <RestorePackagesWithLockFile>true</RestorePackagesWithLockFile> to your Directory.Build.props or individual .csproj files. Without lock files, you do not have a stable file to hash for the cache key.
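Once lock files are enabled, you can optionally enforce them at restore time so the file your cache key hashes cannot drift silently. A small sketch, assuming the dotnet CLI's --locked-mode flag fits your workflow:

- script: dotnet restore --locked-mode
  displayName: 'Restore (fail if the lock file is out of date)'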
Typical savings: Restore drops from ~30 seconds to ~8 seconds on a solution with 40+ packages.
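The same pattern carries over to the other package managers mentioned in the prerequisites. A sketch for pip, assuming a requirements.txt at the repository root and pip's cache redirected into the workspace via the PIP_CACHE_DIR environment variable:

variables:
  PIP_CACHE_DIR: $(Pipeline.Workspace)/.pip

steps:
  - task: Cache@2
    displayName: 'Cache pip downloads'
    inputs:
      key: 'pip | "$(Agent.OS)" | requirements.txt'
      restoreKeys: |
        pip | "$(Agent.OS)"
      path: $(PIP_CACHE_DIR)
  - script: pip install -r requirements.txt
    displayName: 'Install dependencies'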
Caching Docker Layers
Docker layer caching is the single biggest optimization for pipelines that build container images. A typical Node.js Docker build without caching takes 90-180 seconds. With proper layer caching, rebuilds that only changed application code take 15-25 seconds.
BuildKit Inline Cache
Docker BuildKit supports embedding cache metadata directly in the image. You push the image with cache metadata, and subsequent builds can pull from it.
steps:
- task: Docker@2
displayName: 'Build and push with cache'
inputs:
containerRegistry: 'myACR'
repository: 'myapp'
command: 'build'
Dockerfile: 'Dockerfile'
arguments: |
--build-arg BUILDKIT_INLINE_CACHE=1
--cache-from $(containerRegistry)/myapp:latest
--tag $(containerRegistry)/myapp:$(Build.BuildId)
--tag $(containerRegistry)/myapp:latest
env:
DOCKER_BUILDKIT: 1
Registry-Backed Cache
For more advanced scenarios, BuildKit can push cache layers to a separate registry reference:
steps:
- script: |
# The registry cache exporter is not supported by the default docker driver,
# so create (or reuse) a docker-container builder first
docker buildx create --use --name ci-builder 2>/dev/null || true
docker buildx build \
--cache-from type=registry,ref=$(containerRegistry)/myapp:cache \
--cache-to type=registry,ref=$(containerRegistry)/myapp:cache,mode=max \
--tag $(containerRegistry)/myapp:$(Build.BuildId) \
--push \
.
displayName: 'Build with registry cache'
env:
DOCKER_BUILDKIT: 1
The mode=max flag exports all layers, not just the final image layers. This maximizes cache hits on multi-stage builds.
Dockerfile Optimization for Caching
Layer ordering in your Dockerfile has a direct impact on cache hit rates. Put things that change least frequently first:
FROM node:20-alpine AS build
WORKDIR /app
# Layer 1: package files (changes occasionally)
COPY package.json package-lock.json ./
# Layer 2: install (only reruns when package files change)
RUN npm ci   # devDependencies are typically needed for the build step below
# Layer 3: application code (changes every build)
COPY . .
RUN npm run build
If you put COPY . . before RUN npm ci, every code change invalidates the install layer and you lose the cache entirely.
Caching Build Outputs Between Stages
Multi-stage pipelines need to pass build artifacts between stages. You have two options: pipeline artifacts and the cache task.
Pipeline Artifacts
Use PublishPipelineArtifact and DownloadPipelineArtifact when the output is the final deliverable (a compiled binary, a Docker image tag, a deployment package).
stages:
- stage: Build
jobs:
- job: BuildApp
steps:
- script: npm run build
- task: PublishPipelineArtifact@1
inputs:
targetPath: '$(System.DefaultWorkingDirectory)/dist'
artifact: 'app-dist'
- stage: Deploy
dependsOn: Build
jobs:
- job: DeployApp
steps:
- task: DownloadPipelineArtifact@2
inputs:
artifact: 'app-dist'
path: '$(Pipeline.Workspace)/dist'
Cache Task for Intermediate Outputs
Use the cache task when the output is an intermediate artifact that speeds up rebuilds but is not a deliverable — compiled object files, TypeScript output, Webpack bundles.
steps:
- task: Cache@2
inputs:
key: 'build-output | "$(Agent.OS)" | src/**/*.ts'
path: 'dist'
cacheHitVar: 'CACHE_RESTORED'
- script: npm run build
condition: ne(variables.CACHE_RESTORED, 'true')
Pipeline artifacts are guaranteed to be available in downstream stages. Cache restores are best-effort and only happen when the key matches (and keys typically include the agent OS). Use artifacts for cross-stage data flow and caching for within-job speed improvements.
Optimizing Checkout
The default checkout: self step clones the entire repository history. For large repositories, this is a significant time sink.
Shallow Fetch
Fetch only the latest commit instead of the full history:
steps:
- checkout: self
fetchDepth: 1
displayName: 'Shallow checkout'
Typical savings on a repo with 10,000 commits: Checkout drops from ~35 seconds to ~5 seconds.
Setting fetchDepth: 1 fetches only the tip commit. If your build needs recent history (for changelog generation, for example), use fetchDepth: 50 or whatever depth covers your needs.
Sparse Checkout
For monorepos, you can cut down how much Git transfers. A blobless partial clone (fetchFilter: 'blob:none') defers downloading file contents until they are actually needed:
steps:
- checkout: self
fetchDepth: 1
fetchFilter: 'blob:none'
displayName: 'Blobless partial clone'
If you need only specific paths, combine the checkout step with git sparse-checkout in a script step:
steps:
- checkout: self
fetchDepth: 1
- script: |
git sparse-checkout init --cone
git sparse-checkout set packages/api packages/shared
displayName: 'Sparse checkout - API only'
Skip Fetching Tags
Depending on how the pipeline was created, the checkout step may fetch every Git tag along with the branch refs. On repositories that accumulate many tags, this adds unnecessary checkout time, so disable it explicitly:
steps:
- checkout: self
fetchDepth: 1
fetchTags: false
Parallel Job Strategies
Running jobs in parallel is the most effective way to reduce wall-clock time. A pipeline with 4 test suites that each take 5 minutes runs in 5 minutes when parallelized, instead of 20.
Matrix Strategy
jobs:
- job: Test
strategy:
matrix:
unit:
TEST_SUITE: 'unit'
integration:
TEST_SUITE: 'integration'
e2e:
TEST_SUITE: 'e2e'
maxParallel: 3
steps:
- checkout: self
fetchDepth: 1
- script: npm ci
- script: npm run test:$(TEST_SUITE)
displayName: 'Run $(TEST_SUITE) tests'
Fan-Out for Monorepo Packages
jobs:
- job: Test
strategy:
matrix:
api:
PACKAGE: 'packages/api'
web:
PACKAGE: 'packages/web'
shared:
PACKAGE: 'packages/shared'
steps:
- checkout: self
fetchDepth: 1
- script: |
cd $(PACKAGE)
npm ci
npm test
displayName: 'Test $(PACKAGE)'
Dynamic Parallelism with Each
For large test suites, split tests across N agents dynamically:
parameters:
- name: testShards
type: object
default:
- shard: 1
total: 4
- shard: 2
total: 4
- shard: 3
total: 4
- shard: 4
total: 4
jobs:
- ${{ each testShard in parameters.testShards }}:
- job: Test_Shard_${{ testShard.shard }}
steps:
- script: |
npx jest --shard=${{ testShard.shard }}/${{ testShard.total }}
displayName: 'Run test shard ${{ testShard.shard }}/${{ testShard.total }}'
Incremental Builds
Running every build step on every commit is wasteful when only a few files changed. Incremental builds detect what changed and skip unnecessary work.
Path-Based Triggers
Limit which paths trigger the pipeline at all:
trigger:
branches:
include:
- main
paths:
include:
- packages/api/**
exclude:
- '**/*.md'
- docs/**
Detecting Changed Files in Scripts
For monorepos where you want one pipeline but conditional execution:
#!/bin/bash
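# Note: HEAD~1 needs at least two commits of history, so pair this script with
# fetchDepth: 2 (or 0) on the checkout step rather than fetchDepth: 1.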
CHANGED_FILES=$(git diff --name-only HEAD~1 HEAD)
if echo "$CHANGED_FILES" | grep -q "^packages/api/"; then
echo "##vso[task.setvariable variable=BUILD_API]true"
fi
if echo "$CHANGED_FILES" | grep -q "^packages/web/"; then
echo "##vso[task.setvariable variable=BUILD_WEB]true"
fi
steps:
- script: ./scripts/detect-changes.sh
displayName: 'Detect changes'
- script: |
cd packages/api
npm ci && npm run build && npm test
displayName: 'Build and test API'
condition: eq(variables.BUILD_API, 'true')
- script: |
cd packages/web
npm ci && npm run build && npm test
displayName: 'Build and test Web'
condition: eq(variables.BUILD_WEB, 'true')
Optimizing Test Execution
Tests typically consume 40-60% of total pipeline time. Optimizing them has outsized impact.
Test Splitting
Jest supports sharding natively:
steps:
- script: npx jest --shard=1/4
displayName: 'Test shard 1'
For other frameworks, split by file count:
// scripts/get-test-files.js
// Assigns every Nth test file to this shard so each agent runs a distinct subset.
var glob = require('glob');
// System.JobPositionInPhase is 1-based, so shift to a 0-based index for the modulo
var shardIndex = parseInt(process.env.SHARD_INDEX, 10) - 1;
var shardTotal = parseInt(process.env.SHARD_TOTAL, 10);
var testFiles = glob.sync('**/*.test.js', { ignore: 'node_modules/**' });
testFiles.sort();
var shardFiles = testFiles.filter(function(file, index) {
return index % shardTotal === shardIndex;
});
console.log(shardFiles.join(' '));
steps:
- script: |
TEST_FILES=$(node scripts/get-test-files.js)
npx jest $TEST_FILES
env:
SHARD_INDEX: $(System.JobPositionInPhase)
SHARD_TOTAL: $(System.TotalJobsInPhase)
displayName: 'Run test shard'
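For System.JobPositionInPhase and System.TotalJobsInPhase to take values other than 1, the job has to be sliced across multiple agents. A minimal sketch of that slicing configuration (the job name and script path are illustrative):

jobs:
  - job: Test
    strategy:
      parallel: 4   # run 4 copies of this job; each copy gets a different position
    steps:
      - script: |
          TEST_FILES=$(node scripts/get-test-files.js)
          npx jest $TEST_FILES
        env:
          SHARD_INDEX: $(System.JobPositionInPhase)
          SHARD_TOTAL: $(System.TotalJobsInPhase)
        displayName: 'Run test slice'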
Running Only Affected Tests
If you use Jest, the --changedSince flag runs only tests related to changed files:
steps:
- checkout: self
fetchDepth: 0 # Need full history for diff
- script: npx jest --changedSince=origin/main
displayName: 'Run affected tests'
This can reduce test execution from 8 minutes to 30 seconds on a large codebase when only a few files changed. It is ideal for PR validation pipelines where you know the base branch.
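On PR builds the base branch is not always main, so it helps to diff against the PR's actual target. A sketch, assuming Jest's --changedSince flag and the System.PullRequest.TargetBranch predefined variable (its value may be a full ref such as refs/heads/main depending on the repository type):

steps:
  - checkout: self
    fetchDepth: 0  # full history so the merge base can be found
  - script: |
      # Normalize refs/heads/main -> main so git and jest can resolve it
      TARGET=$(echo "$(System.PullRequest.TargetBranch)" | sed 's#refs/heads/##')
      git fetch origin "$TARGET"
      npx jest --changedSince="origin/$TARGET"
    displayName: 'Run tests affected by this PR'
    condition: eq(variables['Build.Reason'], 'PullRequest')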
Agent Selection and Sizing
The agent you run on has a direct impact on pipeline performance.
Microsoft-Hosted vs Self-Hosted
| Factor | Microsoft-Hosted | Self-Hosted |
|---|---|---|
| Startup time | 15-45 seconds (cold start) | Near-instant |
| Disk I/O | Moderate (standard SSD) | Configurable (NVMe) |
| CPU | 2 vCPU standard | Configurable |
| Cache persistence | None (clean VM every run) | Persistent across runs |
| Docker layer cache | None (rebuilt every time) | Persists between builds |
| Maintenance | Zero | You manage updates |
For build-heavy pipelines, self-hosted agents with NVMe storage and persistent Docker caches can cut build times by 50-70%. The tradeoff is operational overhead.
Right-Sizing Self-Hosted Agents
Profile your builds to match agent specs to workload:
# For compile-heavy workloads (C++, Rust, large TypeScript projects)
# Use: 8 vCPU, 16 GB RAM, NVMe SSD
# For test-heavy workloads (Node.js, Python)
# Use: 4 vCPU, 8 GB RAM, standard SSD
# For Docker builds
# Use: 4 vCPU, 8 GB RAM, 100 GB+ SSD for layer cache
A self-hosted agent running on a Standard_D8s_v5 Azure VM (8 vCPU, 32 GB) compiled a TypeScript monorepo with 200,000+ lines in 45 seconds. The same build on a Microsoft-hosted agent took 3 minutes 20 seconds. The VM costs $280/month. The time savings across 50 daily builds paid for it within the first week.
Reducing Task Overhead
Every task in a pipeline has startup overhead — typically 2-5 seconds for downloading the task, initializing the runtime, and reporting telemetry. On a pipeline with 25 tasks, that is 50-125 seconds of pure overhead.
Combine Script Steps
# Bad: 4 separate steps = ~15 seconds overhead
steps:
- script: echo "Setting up environment"
- script: export NODE_ENV=production
- script: npm run lint
- script: npm run build
# Good: 1 step = ~4 seconds overhead
steps:
- script: |
export NODE_ENV=production
npm run lint
npm run build
displayName: 'Lint and build'
Skip Unnecessary Tasks
Disable features you do not use:
steps:
- checkout: self
fetchDepth: 1
fetchTags: false
lfs: false
submodules: false
clean: false
Setting clean: false reuses the working directory from the previous run on self-hosted agents. Combined with caching, this means the agent already has node_modules populated. The pipeline skips both the cache restore and the install step.
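A sketch of what that looks like in practice on a self-hosted agent: skip npm ci entirely when the lockfile has not changed since the last run on that agent. The .last-install marker file is my own convention here, not a built-in feature.

steps:
  - checkout: self
    clean: false
  - script: |
      HASH=$(sha256sum package-lock.json | cut -d' ' -f1)
      if [ -d node_modules ] && [ "$(cat .last-install 2>/dev/null)" = "$HASH" ]; then
        echo "Lockfile unchanged; reusing node_modules from the previous run"
      else
        npm ci
        echo "$HASH" > .last-install
      fi
    displayName: 'Install dependencies (incremental)'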
Avoid Post-Job Cleanup on Self-Hosted
pool:
name: 'MySelfHostedPool'
workspace:
clean: outputs # Only clean build outputs, not sources or dependencies
Pipeline Analytics for Identifying Bottlenecks
Azure DevOps provides built-in analytics that most teams never look at. Navigate to Pipelines > Analytics to access them.
Key Metrics to Track
- P50 and P95 duration: The median tells you typical performance. The 95th percentile tells you about flaky slow runs.
- Task duration breakdown: Identifies which tasks consume the most time.
- Queue time: How long builds wait for an available agent. High queue times mean you need more agents or faster builds.
- Pass rate: Low pass rates mean wasted compute on failing builds. Fix flaky tests before optimizing speed.
Querying Analytics via API
// scripts/pipeline-metrics.js
var https = require('https');
var org = process.env.ADO_ORG;
var project = process.env.ADO_PROJECT;
var pat = process.env.ADO_PAT;
// Encode the OData query values so https.get does not reject unescaped spaces
var query = [
  '$filter=' + encodeURIComponent("PipelineName eq 'MyPipeline' and CompletedDate ge 2026-01-01Z"),
  '$select=PipelineRunId,TotalDurationSeconds,QueueDurationSeconds,RunResult',
  '$orderby=' + encodeURIComponent('CompletedDate desc'),
  '$top=100'
].join('&');
var url = 'https://analytics.dev.azure.com/' + org + '/' + project +
  '/_odata/v3.0-preview/PipelineRuns?' + query;
var options = {
headers: {
'Authorization': 'Basic ' + Buffer.from(':' + pat).toString('base64')
}
};
https.get(url, options, function(res) {
var data = '';
res.on('data', function(chunk) { data += chunk; });
res.on('end', function() {
var runs = JSON.parse(data).value;
var durations = runs.map(function(r) { return r.TotalDurationSeconds; });
durations.sort(function(a, b) { return a - b; });
var p50 = durations[Math.floor(durations.length * 0.5)];
var p95 = durations[Math.floor(durations.length * 0.95)];
console.log('P50 duration: ' + Math.round(p50 / 60) + ' minutes');
console.log('P95 duration: ' + Math.round(p95 / 60) + ' minutes');
console.log('Total runs: ' + runs.length);
});
});
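Wiring the script into a scheduled pipeline takes one step. A usage sketch, assuming the script lives at scripts/pipeline-metrics.js and the PAT is stored as a secret pipeline variable named AnalyticsPat (org, project, and variable names are placeholders):

- script: node scripts/pipeline-metrics.js
  displayName: 'Report pipeline duration percentiles'
  env:
    ADO_ORG: 'myorg'
    ADO_PROJECT: 'myproject'
    ADO_PAT: $(AnalyticsPat)  # secret variables are not exposed to scripts unless mapped here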
Build Output Artifact Optimization
Large artifacts slow down uploads and downloads between stages. Compress and filter them.
Exclude Unnecessary Files
steps:
# Before publishing, copy only what you need into the staging directory
- script: |
mkdir -p $(Build.ArtifactStagingDirectory)/app
cp -r dist/ $(Build.ArtifactStagingDirectory)/app/
cp package.json $(Build.ArtifactStagingDirectory)/app/
cp package-lock.json $(Build.ArtifactStagingDirectory)/app/
displayName: 'Stage artifacts'
- task: PublishPipelineArtifact@1
inputs:
targetPath: '$(Build.ArtifactStagingDirectory)'
artifact: 'drop'
Compress Before Upload
steps:
- script: |
tar -czf $(Build.ArtifactStagingDirectory)/app.tar.gz \
-C dist .
displayName: 'Compress build output'
- task: PublishPipelineArtifact@1
inputs:
targetPath: '$(Build.ArtifactStagingDirectory)/app.tar.gz'
artifact: 'app-compressed'
A 450 MB uncompressed dist folder containing JavaScript bundles and source maps compresses to ~85 MB. Upload time drops from 40 seconds to 8 seconds on a typical agent connection.
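The consuming stage then downloads the archive and unpacks it. A sketch of the matching download side, using the artifact and file names from the publish step above:

steps:
  - task: DownloadPipelineArtifact@2
    inputs:
      artifact: 'app-compressed'
      path: '$(Pipeline.Workspace)/artifacts'
  - script: |
      mkdir -p $(Pipeline.Workspace)/app
      tar -xzf $(Pipeline.Workspace)/artifacts/app.tar.gz -C $(Pipeline.Workspace)/app
    displayName: 'Extract build output'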
Complete Working Example
Here is a production pipeline for a Node.js monorepo with three packages. It incorporates npm caching, Docker layer caching, parallel test execution, shallow checkout, and timing instrumentation.
# azure-pipelines.yml
trigger:
branches:
include:
- main
paths:
exclude:
- '**/*.md'
- docs/**
pr:
branches:
include:
- main
variables:
npm_config_cache: $(Pipeline.Workspace)/.npm
DOCKER_BUILDKIT: 1
containerRegistry: 'myacr.azurecr.io'
stages:
- stage: Build
displayName: 'Build & Test'
jobs:
# ============================================
# Shared: Install + Lint + Build
# ============================================
- job: BuildAll
displayName: 'Install, Lint, Build'
pool:
vmImage: 'ubuntu-latest'
steps:
- checkout: self
fetchDepth: 1
fetchTags: false
displayName: 'Shallow checkout'
- task: Cache@2
displayName: 'Restore npm cache'
inputs:
key: 'npm | "$(Agent.OS)" | package-lock.json | packages/**/package-lock.json'
restoreKeys: |
npm | "$(Agent.OS)"
path: $(npm_config_cache)
- script: |
echo "##vso[task.setvariable variable=INSTALL_START]$(date +%s)"
npm ci
echo "##vso[task.setvariable variable=INSTALL_END]$(date +%s)"
displayName: 'Install dependencies'
- script: |
npm run lint --workspaces
displayName: 'Lint all packages'
- script: |
npm run build --workspaces
displayName: 'Build all packages'
- task: PublishPipelineArtifact@1
inputs:
targetPath: 'packages/api/dist'
artifact: 'api-dist'
- task: PublishPipelineArtifact@1
inputs:
targetPath: 'packages/web/dist'
artifact: 'web-dist'
# ============================================
# Parallel Test Execution (3 shards)
# ============================================
- job: Test
displayName: 'Test'
dependsOn: [] # Run in parallel with BuildAll
pool:
vmImage: 'ubuntu-latest'
strategy:
matrix:
shard_1:
SHARD: '1/3'
shard_2:
SHARD: '2/3'
shard_3:
SHARD: '3/3'
maxParallel: 3
steps:
- checkout: self
fetchDepth: 1
fetchTags: false
- task: Cache@2
displayName: 'Restore npm cache'
inputs:
key: 'npm | "$(Agent.OS)" | package-lock.json | packages/**/package-lock.json'
restoreKeys: |
npm | "$(Agent.OS)"
path: $(npm_config_cache)
- script: npm ci
displayName: 'Install dependencies'
- script: |
echo "Test shard $(SHARD) starting at $(date)"
npx jest --shard=$(SHARD) \
--ci \
--coverage \
--reporters=default \
--reporters=jest-junit
displayName: 'Run test shard $(SHARD)'
env:
JEST_JUNIT_OUTPUT_DIR: $(System.DefaultWorkingDirectory)/test-results
- task: PublishTestResults@2
inputs:
testResultsFormat: 'JUnit'
testResultsFiles: 'test-results/*.xml'
mergeTestResults: true
condition: always()
# ============================================
# Docker Build with Layer Caching
# ============================================
- stage: Docker
displayName: 'Build Docker Image'
dependsOn: Build
jobs:
- job: DockerBuild
pool:
vmImage: 'ubuntu-latest'
steps:
- checkout: self
fetchDepth: 1
fetchTags: false
- task: DownloadPipelineArtifact@2
inputs:
artifact: 'api-dist'
path: '$(Pipeline.Workspace)/api-dist'
- task: Docker@2
displayName: 'Login to ACR'
inputs:
containerRegistry: 'myACR'
command: 'login'
- script: |
echo "Docker build starting at $(date)"
START=$(date +%s)
# Registry cache export requires a docker-container builder (not the default driver)
docker buildx create --use --name ci-builder 2>/dev/null || true
docker buildx build \
--cache-from type=registry,ref=$(containerRegistry)/api:cache \
--cache-to type=registry,ref=$(containerRegistry)/api:cache,mode=max \
--tag $(containerRegistry)/api:$(Build.BuildId) \
--tag $(containerRegistry)/api:latest \
--push \
--file packages/api/Dockerfile \
.
END=$(date +%s)
echo "Docker build completed in $((END - START)) seconds"
displayName: 'Build and push with layer cache'
# ============================================
# Deploy (gated)
# ============================================
- stage: Deploy
displayName: 'Deploy to Production'
dependsOn: Docker
condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
jobs:
- deployment: DeployProd
environment: 'production'
pool:
vmImage: 'ubuntu-latest'
strategy:
runOnce:
deploy:
steps:
- script: |
echo "Deploying $(containerRegistry)/api:$(Build.BuildId)"
# Your deployment commands here
displayName: 'Deploy to production'
Before/After Timing Comparison
| Step | Before Optimization | After Optimization | Savings |
|---|---|---|---|
| Checkout | 35s | 5s | 86% |
| npm install | 48s | 12s (cached) | 75% |
| Lint | 25s | 25s | 0% |
| Build | 40s | 40s | 0% |
| Tests (total) | 12m 30s | 4m 10s (3 shards) | 67% |
| Docker build | 2m 45s | 28s (layer cache) | 83% |
| Artifact upload | 38s | 9s (compressed) | 76% |
| Total wall-clock | 18m 41s | 5m 29s | 71% |
The total savings come from three techniques stacked together: caching eliminates redundant downloads, parallelism divides test time by the shard count, and shallow checkout eliminates unnecessary Git history.
Common Issues & Troubleshooting
Issue 1: Cache Not Being Restored
Error: No matching cache found. The task proceeds without a cache hit, and installs take full time every run.
##[debug]Cache miss - no cache found matching key: npm | "Linux" | a4c8f32b...
##[debug]Attempted restore keys:
##[debug] npm | "Linux"
##[debug]No matching restore key found.
Cause: The cache key includes a hash of a file that does not exist or whose path is wrong. Common when using glob patterns like **/package-lock.json in a repo where some packages do not have lockfiles.
Fix: Verify the file paths in your cache key. Run find . -name "package-lock.json" locally to confirm the patterns match. Also check that the cache was populated by a previous successful run on the same branch or the default branch.
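A throwaway diagnostic step can show exactly what the agent sees. A sketch that lists the lockfiles your glob should match, so you can compare them against the key segments in the debug log:

- script: |
    echo "Lockfiles visible to the agent:"
    find "$(Build.SourcesDirectory)" -name "package-lock.json" -not -path "*/node_modules/*"
  displayName: 'Debug: list lockfiles for the cache key'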
Issue 2: Docker Cache Pull Fails
Error:
ERROR: failed to solve: failed to fetch cache from myacr.azurecr.io/api:cache:
unexpected status: 401 Unauthorized
Cause: Either the cache image reference does not exist yet (always the case on the first run), or the credentials established by the Docker login step are not visible to the builder that performs the --cache-from pull.
Fix: Ensure the Docker login happens before the docker buildx build command. For the first run, the cache image does not exist yet — the build will proceed without cache on the initial run and populate the cache for subsequent runs. Add || true to ignore cache pull failures:
- script: |
docker pull $(containerRegistry)/api:cache || true
docker buildx build \
--cache-from type=registry,ref=$(containerRegistry)/api:cache \
...
Issue 3: Test Sharding Produces Uneven Distribution
Symptom: Shard 1 takes 6 minutes while shard 3 takes 45 seconds. Total time is still 6 minutes because the slowest shard is the bottleneck.
Cause: Jest's default sharding splits by file count, not by execution time. A shard that gets the integration test files takes much longer.
Fix: Use --shard with --testPathPattern to manually distribute slow test files, or use a test timing file:
- script: |
npx jest --shard=$(SHARD) \
--json \
--outputFile=test-timing.json
displayName: 'Run tests with timing'
# Upload timing data as artifact for future rebalancing
- task: PublishPipelineArtifact@1
inputs:
targetPath: 'test-timing.json'
artifact: 'test-timing-$(System.JobPositionInPhase)'
Then write a script that reads historical timing data and produces balanced shard assignments.
Issue 4: Cache Keeps Growing and Hits Size Limit
Error:
##[error]Cache upload failed: The cache entry size (11534336000 bytes) exceeds
the maximum allowed size (10737418240 bytes).
Cause: Caching node_modules directly on a project with native modules (sharp, bcrypt, canvas) produces a cache that exceeds 10 GB.
Fix: Switch from caching node_modules to caching the npm cache folder (~/.npm or the npm_config_cache path). The npm cache stores compressed tarballs, which are significantly smaller than the extracted node_modules. For a project where node_modules is 4.2 GB, the npm cache is typically 800 MB.
Issue 5: Self-Hosted Agent Disk Full After Cache Accumulation
Error:
##[error]No space left on device
Cause: On a self-hosted agent, the work directory accumulates sources, restored caches, and build outputs across runs. Combined with Docker layer caches and dangling images, the disk fills up.
Fix: Schedule a maintenance job that cleans old caches and Docker images:
# maintenance-pipeline.yml
schedules:
- cron: '0 2 * * 0' # Weekly at 2 AM Sunday
displayName: 'Weekly cleanup'
branches:
include:
- main
steps:
- script: |
docker system prune -af --volumes
# Agent.WorkFolder already points at the agent's _work directory
rm -rf $(Agent.WorkFolder)/_temp/*
rm -rf $(Agent.WorkFolder)/*/s/node_modules
displayName: 'Clean agent disk'
Best Practices
- Always use fetchDepth: 1 unless you need Git history. This is the single easiest optimization and it applies to every pipeline. The only exception is when you run git log or git diff against historical commits during the build.
- Cache the package manager cache, not node_modules. Caching node_modules directly is faster on hit but fragile. Corrupted caches cause mysterious build failures that are hard to diagnose. The npm/NuGet cache folder approach is slower by a few seconds but far more reliable.
- Parallelize tests before optimizing individual test speed. Splitting a 12-minute test suite across 4 shards gives you a 3-minute suite. No amount of individual test optimization will match that improvement for the same effort.
- Profile before optimizing. Use Pipeline Analytics to identify the actual bottleneck. I have seen teams spend a week optimizing Docker builds when 60% of their pipeline time was in test execution. Check the data first.
- Use condition: ne(variables.CACHE_RESTORED, 'true') to skip install steps on cache hit. This turns a 45-second install into a 0-second skip. It works with the Cache@2 task when you set cacheHitVar: CACHE_RESTORED on the task.
- Set clean: false on self-hosted agents for incremental builds. Combined with caching, the working directory persists between runs. Source files are already checked out, dependencies are already installed, and only changed files need processing.
- Compress artifacts before publishing. A tar.gz of your build output is typically 5-8x smaller than the raw directory. This directly reduces upload and download time between stages.
- Monitor cache hit rates over time. A cache that hits 95% of the time is working well. A cache that hits 40% of the time probably has a key that is too specific or a branch-scoping issue. Track this metric monthly.
- Do not cache everything. Caching has overhead: the restore step itself takes 5-15 seconds depending on cache size. If your install step only takes 10 seconds, caching it adds complexity for zero gain. Only cache operations that take more than 20-30 seconds.
