Self-Hosted Agents: Setup, Scaling, and Management
Complete guide to setting up, scaling, and managing self-hosted Azure DevOps agents, covering Docker-based agents, auto-scaling with KEDA, security hardening, and cost optimization.
Overview
Self-hosted agents give you full control over the build environment your Azure DevOps pipelines run on. Instead of relying on Microsoft-managed VMs that spin up fresh for every job, you run the agent software on your own hardware — a physical server, a VM, a container, whatever you want. This matters when you need specific toolchains pre-installed, access to private network resources, or when you are burning through enough pipeline minutes that the Microsoft-hosted agent bill starts to sting. If you have ever waited eight minutes for a Microsoft-hosted agent to install dependencies that never change between builds, self-hosted agents are the fix.
Prerequisites
- An Azure DevOps organization with at least one project
- A Personal Access Token (PAT) with Agent Pools (Read & Manage) scope
- A machine running Linux (Ubuntu 20.04+), Windows Server 2019+, or macOS 12+
- Node.js 18+ installed (for the orchestrator example)
- Docker installed (for containerized agent examples)
- Basic familiarity with Azure Pipelines YAML syntax
Microsoft-Hosted vs Self-Hosted Agents
Before you invest time in self-hosted infrastructure, understand the tradeoffs.
| Factor | Microsoft-Hosted | Self-Hosted |
|---|---|---|
| Setup time | Zero | 30 min to several hours |
| Maintenance | None | You own it |
| Cold start | 30-90 seconds | Near-instant (agent already running) |
| Pre-installed tools | Standard image | Whatever you install |
| Network access | Public internet only | Private VNets, on-prem resources |
| Cost (1 parallel job) | ~$40/month | Your infrastructure cost |
| Cost (10 parallel jobs) | ~$400/month | Often much less |
| Build caching | Lost every job | Persists between jobs |
| Max job duration | 60 minutes (free) / 360 min | Unlimited |
The breakeven point is usually around 3-4 parallel jobs. Below that, Microsoft-hosted agents are less hassle. Above that, self-hosted agents start saving real money — especially when you factor in build speed improvements from warm caches and pre-installed dependencies.
Installing the Agent
The Azure Pipelines agent is a cross-platform .NET application. Installation is straightforward on every OS.
Linux (Ubuntu/Debian)
# Create a directory for the agent
mkdir ~/azagent && cd ~/azagent
# Download the agent package (pin the current version from the releases page)
curl -LO https://vstsagentpackage.azureedge.net/agent/3.248.0/vsts-agent-linux-x64-3.248.0.tar.gz
# Extract
tar xzf vsts-agent-linux-x64-3.248.0.tar.gz
# Configure the agent
./config.sh \
--url https://dev.azure.com/your-organization \
--auth pat \
--token YOUR_PAT_TOKEN \
--pool "Linux-Pool" \
--agent "build-agent-01" \
--acceptTeeEula \
--unattended
# Install and start as a systemd service
sudo ./svc.sh install
sudo ./svc.sh start
Output from a successful configuration:
>> Connect:
Connecting to server ...
>> Register Agent:
Scanning for tool capabilities.
Connecting to the server.
Successfully added the agent
Testing agent connection.
2026-02-08 14:23:01Z: Settings Saved.
Windows
# Create directory
mkdir C:\azagent ; cd C:\azagent
# Download agent
Invoke-WebRequest -Uri "https://vstsagentpackage.azureedge.net/agent/3.248.0/vsts-agent-win-x64-3.248.0.zip" -OutFile agent.zip
# Extract
Expand-Archive -Path agent.zip -DestinationPath .
# Configure
.\config.cmd --url https://dev.azure.com/your-organization `
--auth pat `
--token YOUR_PAT_TOKEN `
--pool "Windows-Pool" `
--agent "win-build-01" `
--runAsService `
--windowsLogonAccount "NT AUTHORITY\NETWORK SERVICE" `
--unattended
macOS
mkdir ~/azagent && cd ~/azagent
curl -LO https://vstsagentpackage.azureedge.net/agent/3.248.0/vsts-agent-osx-arm64-3.248.0.tar.gz
tar xzf vsts-agent-osx-arm64-3.248.0.tar.gz
./config.sh \
--url https://dev.azure.com/your-organization \
--auth pat \
--token YOUR_PAT_TOKEN \
--pool "Mac-Pool" \
--agent "mac-build-01" \
--unattended
# Install as a LaunchAgent (runs on user login)
./svc.sh install
./svc.sh start
Agent Pools and Capabilities
Agent pools are logical groupings. Every self-hosted agent belongs to exactly one pool. Pipelines target pools, not individual agents.
The default pool is called Default, but do not dump everything into it. Create pools that reflect your workload topology:
├── Linux-Build # General Linux CI
├── Linux-GPU # ML model training jobs
├── Windows-Build # .NET Framework builds
├── Docker-Ephemeral # Auto-scaling container agents
└── Mac-Signing # iOS/macOS code signing
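You can create pools in the portal (Organization Settings > Agent Pools > Add Pool) or script them. A minimal sketch against the REST Pools endpoint, assuming your PAT has the Agent Pools (Read & Manage) scope and that autoProvision should create matching queues in every project:
# Create the pools above via the REST API
ORG="your-organization"
for POOL in Linux-Build Linux-GPU Windows-Build Docker-Ephemeral Mac-Signing; do
  curl -s -u ":${AZP_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"name\": \"${POOL}\", \"autoProvision\": true}" \
    "https://dev.azure.com/${ORG}/_apis/distributedtask/pools?api-version=7.1" \
    | jq -r '"Created pool: \(.name) (id \(.id))"'
done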
System Capabilities vs User Capabilities
Every agent automatically reports system capabilities — environment variables, installed tool versions, OS info. You can also add user capabilities as key-value pairs in the Azure DevOps portal or during configuration.
Pipelines use demands to match agents:
pool:
  name: 'Linux-Build'
  demands:
  - Agent.OS -equals Linux
  - docker
  - node.js
  - gpu # user capability you added manually
This is how you route GPU-heavy training jobs to beefy machines while keeping regular builds on smaller instances.
Running as a Service vs Interactive
Service mode is what you want for production. The agent starts on boot, runs in the background, restarts automatically on crash. On Linux this is a systemd unit; on Windows it is a Windows Service; on macOS it is a launchd plist.
Interactive mode (./run.sh or .\run.cmd) is useful for debugging. You see real-time console output. The agent stops when you close the terminal.
One scenario where interactive mode matters: builds that require a desktop session. If your pipeline runs UI tests with a browser, the agent needs access to a display. On Linux, you would need Xvfb or a similar virtual framebuffer. On Windows, the service account must have "Log on as a service" rights and you may need to configure the service to interact with the desktop.
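On Linux the usual trick is xvfb-run, which starts a virtual display and points DISPLAY at it for the agent and everything it spawns. A sketch, assuming the xvfb package is available:
# Install the virtual framebuffer (Debian/Ubuntu)
sudo apt-get install -y xvfb
# Run the agent interactively under a virtual display so browser-based UI tests work
cd ~/azagent
xvfb-run --auto-servernum ./run.sh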
# Check agent service status on Linux
sudo systemctl status vsts.agent.your-org.Linux-Build.build-agent-01
# Expected output:
# ● vsts.agent.your-org.Linux-Build.build-agent-01.service
# Loaded: loaded (/etc/systemd/system/vsts.agent...service; enabled)
# Active: active (running) since Thu 2026-02-08 10:15:32 UTC; 3h ago
# Main PID: 4821 (runsvc.sh)
Docker-Based Agents
Containerized agents are the sweet spot for most teams. You get reproducible environments, fast scaling, and trivial cleanup. Here is a production-grade Dockerfile:
FROM ubuntu:22.04
# Prevent interactive prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive
# Install base dependencies
RUN apt-get update && apt-get install -y \
curl \
git \
jq \
unzip \
zip \
wget \
apt-transport-https \
ca-certificates \
gnupg \
lsb-release \
software-properties-common \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Install Docker CLI (for Docker-in-Docker or Docker-outside-of-Docker)
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg \
&& echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" > /etc/apt/sources.list.d/docker.list \
&& apt-get update && apt-get install -y docker-ce-cli && rm -rf /var/lib/apt/lists/*
# Install Node.js 20
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
&& apt-get install -y nodejs \
&& rm -rf /var/lib/apt/lists/*
# Install .NET SDK 8.0
RUN wget https://dot.net/v1/dotnet-install.sh -O dotnet-install.sh \
&& chmod +x dotnet-install.sh \
&& ./dotnet-install.sh --channel 8.0 --install-dir /usr/share/dotnet \
&& ln -s /usr/share/dotnet/dotnet /usr/bin/dotnet \
&& rm dotnet-install.sh
# Install Azure DevOps agent
ENV AGENT_VERSION=3.248.0
WORKDIR /azp
RUN curl -LO "https://vstsagentpackage.azureedge.net/agent/${AGENT_VERSION}/vsts-agent-linux-x64-${AGENT_VERSION}.tar.gz" \
&& tar xzf "vsts-agent-linux-x64-${AGENT_VERSION}.tar.gz" \
&& rm "vsts-agent-linux-x64-${AGENT_VERSION}.tar.gz" \
&& ./bin/installdependencies.sh
# Create a non-root user for the agent
RUN useradd -m -d /home/agentuser agentuser \
&& chown -R agentuser:agentuser /azp
# Startup script
COPY start.sh /azp/start.sh
RUN chmod +x /azp/start.sh
USER agentuser
ENTRYPOINT ["/azp/start.sh"]
The start.sh script handles configuration and graceful shutdown:
#!/bin/bash
set -euo pipefail
# Validate required environment variables
if [ -z "${AZP_URL:-}" ]; then
echo "error: AZP_URL is not set"
exit 1
fi
if [ -z "${AZP_TOKEN:-}" ]; then
echo "error: AZP_TOKEN is not set"
exit 1
fi
AZP_POOL="${AZP_POOL:-Default}"
AZP_AGENT_NAME="${AZP_AGENT_NAME:-$(hostname)}"
echo "Configuring agent ${AZP_AGENT_NAME} in pool ${AZP_POOL}..."
cd /azp
# Configure the agent
./config.sh \
--url "${AZP_URL}" \
--auth pat \
--token "${AZP_TOKEN}" \
--pool "${AZP_POOL}" \
--agent "${AZP_AGENT_NAME}" \
--replace \
--acceptTeeEula \
--unattended
# Trap SIGTERM/SIGINT so `docker stop` deregisters the agent instead of
# leaving a stale offline entry in the pool
cleanup() {
echo "Received shutdown signal. Draining agent..."
./config.sh remove --auth pat --token "${AZP_TOKEN}"
}
trap cleanup SIGTERM SIGINT
# Run the agent in the background; wait returns when a trapped signal arrives
./run.sh &
wait $!
Build and run:
docker build -t azdo-agent:latest .
docker run -d \
--name build-agent-01 \
-e AZP_URL="https://dev.azure.com/your-organization" \
-e AZP_TOKEN="your-pat-token" \
-e AZP_POOL="Docker-Ephemeral" \
-e AZP_AGENT_NAME="docker-agent-01" \
-v /var/run/docker.sock:/var/run/docker.sock \
azdo-agent:latest
The -v /var/run/docker.sock mount gives the agent access to the host Docker daemon. This is the Docker-outside-of-Docker (DooD) pattern. It is simpler and faster than Docker-in-Docker (DinD) and sufficient for most build scenarios.
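A quick sanity check that the DooD wiring works: the docker CLI inside the agent container should report the host daemon, and containers your builds start will show up as siblings of the agent container, not children:
# From the host: exec into the agent container and query the host daemon
docker exec build-agent-01 docker version
# Build containers appear here alongside the agent itself
docker exec build-agent-01 docker ps
Keep the sibling behavior in mind when pipelines mount volumes: paths are resolved on the host, not inside the agent container.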
Auto-Scaling Agent Pools
Static agent pools waste money. When no builds are queued, agents sit idle. When the queue spikes, developers wait. You need auto-scaling.
Option 1: Azure VMSS (Virtual Machine Scale Sets)
Azure DevOps has native VMSS integration. You create a scale set, point an agent pool at it, and Azure DevOps handles provisioning and deprovisioning.
# Create a VMSS with the agent pre-baked into a custom image
az vmss create \
--resource-group rg-build-agents \
--name vmss-build-agents \
--image "your-custom-agent-image-id" \
--vm-sku Standard_D4s_v5 \
--instance-count 0 \
--upgrade-policy-mode manual \
--admin-username azureuser \
--authentication-type ssh \
--ssh-key-values ~/.ssh/id_rsa.pub
Then in Azure DevOps, go to Organization Settings > Agent Pools > Add Pool and select Azure virtual machine scale set. Set your min/max instance counts and idle timeout.
Option 2: Kubernetes with KEDA
KEDA (Kubernetes Event-Driven Autoscaling) can scale agent pods based on Azure DevOps queue depth. This is the most elegant solution for teams already running Kubernetes.
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: azdo-agent-scaler
  namespace: build-agents
spec:
  scaleTargetRef:
    name: azdo-agent-deployment
  pollingInterval: 15
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
  - type: azure-pipelines
    metadata:
      organizationURLFromEnv: "AZP_URL"
      personalAccessTokenFromEnv: "AZP_TOKEN"
      poolName: "Kubernetes-Pool"
      # Scale one agent per pending job
      targetPipelinesQueueLength: "1"
The corresponding Kubernetes deployment:
# agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: azdo-agent-deployment
  namespace: build-agents
spec:
  replicas: 0 # KEDA controls this
  selector:
    matchLabels:
      app: azdo-agent
  template:
    metadata:
      labels:
        app: azdo-agent
    spec:
      terminationGracePeriodSeconds: 600 # Allow running jobs to finish
      containers:
      - name: agent
        image: your-registry.azurecr.io/azdo-agent:latest
        env:
        - name: AZP_URL
          valueFrom:
            secretKeyRef:
              name: azdo-agent-secret
              key: url
        - name: AZP_TOKEN
          valueFrom:
            secretKeyRef:
              name: azdo-agent-secret
              key: token
        - name: AZP_POOL
          value: "Kubernetes-Pool"
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
        volumeMounts:
        - name: docker-sock
          mountPath: /var/run/docker.sock
      volumes:
      - name: docker-sock
        hostPath:
          path: /var/run/docker.sock
Apply with:
kubectl create namespace build-agents
kubectl create secret generic azdo-agent-secret \
--namespace build-agents \
--from-literal=url="https://dev.azure.com/your-organization" \
--from-literal=token="YOUR_PAT_TOKEN"
kubectl apply -f agent-deployment.yaml
kubectl apply -f keda-scaledobject.yaml
KEDA polls the Azure DevOps API every 15 seconds. When jobs queue up, pods scale out. After 5 minutes of idle (the cooldownPeriod), pods scale back to zero. You pay nothing when nothing is building.
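KEDA creates a regular HorizontalPodAutoscaler behind each ScaledObject, so you can watch the scaling with plain kubectl:
# Inspect the scaler and the HPA KEDA generates for it
kubectl get scaledobject -n build-agents
kubectl get hpa -n build-agents
# Watch agent pods appear as jobs queue and disappear after the cooldown
kubectl get pods -n build-agents -w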
Agent Maintenance and Updates
Self-hosted agents do not auto-update by default. When Microsoft releases a new agent version, your pipelines will warn you with:
##[warning] The agent version 3.220.0 is out of date. Please update to the latest version 3.248.0.
Automate updates. Here is a script you can run on a cron schedule:
#!/bin/bash
# update-agent.sh — checks for new agent version and updates if needed
cd /azp
# Ask the listener binary for its version rather than parsing config files
CURRENT_VERSION=$(./bin/Agent.Listener --version)
LATEST_VERSION=$(curl -s https://api.github.com/repos/microsoft/azure-pipelines-agent/releases/latest | jq -r '.tag_name' | sed 's/v//')
echo "Current: ${CURRENT_VERSION}, Latest: ${LATEST_VERSION}"
if [ "${CURRENT_VERSION}" != "${LATEST_VERSION}" ]; then
echo "Updating agent from ${CURRENT_VERSION} to ${LATEST_VERSION}..."
cd /azp
sudo ./svc.sh stop
curl -LO "https://vstsagentpackage.azureedge.net/agent/${LATEST_VERSION}/vsts-agent-linux-x64-${LATEST_VERSION}.tar.gz"
tar xzf "vsts-agent-linux-x64-${LATEST_VERSION}.tar.gz"
rm "vsts-agent-linux-x64-${LATEST_VERSION}.tar.gz"
sudo ./svc.sh start
echo "Agent updated to ${LATEST_VERSION}"
else
echo "Agent is up to date."
fi
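Schedule it like the cleanup script later in this guide, e.g. weekly during a quiet window:
# Check for agent updates every Sunday at 3 AM
0 3 * * 0 /azp/update-agent.sh >> /var/log/agent-update.log 2>&1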
For Docker-based agents, just rebuild the image with the new version and redeploy. With Kubernetes, a rolling update handles this cleanly.
Security Considerations
Self-hosted agents run your code. Treat them like production servers.
Service accounts: Never run agents under your personal account or as root. Create a dedicated service account with the minimum permissions required.
# Create a locked-down service account on Linux
sudo useradd -r -m -s /usr/sbin/nologin azagent
sudo chown -R azagent:azagent /azp
PAT token rotation: PATs expire. Use short-lived tokens (90 days max) and automate rotation. Better yet, use a Service Principal with OIDC for Kubernetes-based agents.
Network isolation: Place agents in a dedicated subnet or VLAN. Use network security groups to restrict outbound access to only what is needed:
- dev.azure.com (HTTPS/443)
- vstsagentpackage.azureedge.net (agent downloads)
- Your package registries (npm, NuGet, PyPI)
- Your artifact storage
Block everything else.
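A sketch of those rules with the Azure CLI, assuming an existing NSG named nsg-build-agents. The AzureDevOps service tag covers the dev.azure.com endpoints; you would add Allow rules for your registries and artifact storage at priorities below the final deny:
# Allow outbound HTTPS to Azure DevOps (service tag), deny everything else
az network nsg rule create \
  --resource-group rg-build-agents \
  --nsg-name nsg-build-agents \
  --name AllowAzureDevOpsOutbound \
  --priority 100 \
  --direction Outbound \
  --access Allow \
  --protocol Tcp \
  --destination-address-prefixes AzureDevOps \
  --destination-port-ranges 443
az network nsg rule create \
  --resource-group rg-build-agents \
  --nsg-name nsg-build-agents \
  --name DenyAllOutbound \
  --priority 4000 \
  --direction Outbound \
  --access Deny \
  --protocol "*" \
  --destination-address-prefixes "*" \
  --destination-port-ranges "*"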
Pipeline permissions: In Azure DevOps, restrict which pipelines can use your self-hosted pools. Go to Agent Pool > Security and grant access only to specific projects or pipelines.
Ephemeral agents: The gold standard. Every job gets a fresh container. After the job completes, the container is destroyed. No persistent state, no credential leakage between builds, no supply chain poisoning from a compromised workspace.
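The agent supports this pattern natively: run.sh accepts a --once flag that processes a single job and then exits. A sketch of the change to the start.sh shown earlier:
# Replace the run step in start.sh: take exactly one job, then deregister and exit.
# Pair this with an orchestrator or restart policy that launches a fresh container per job.
./run.sh --once
./config.sh remove --auth pat --token "${AZP_TOKEN}"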
Managing Agent Capabilities and Demands
When you have a diverse fleet — some agents have GPU, some have specific SDKs, some are high-memory — capabilities and demands keep everything sane.
Add custom capabilities to agents:
# During configuration
./config.sh \
--url https://dev.azure.com/your-org \
--auth pat \
--token YOUR_PAT \
--pool "Linux-Build" \
--agent "gpu-agent-01" \
--unattended
# Then in Azure DevOps portal, add user capabilities:
# gpu = true
# cuda_version = 12.2
# memory_gb = 64
Use demands in your pipeline YAML:
pool:
  name: 'Linux-Build'
  demands:
  - gpu -equals true
  - cuda_version -equals 12.2

steps:
- script: nvidia-smi
  displayName: 'Verify GPU access'
- script: python train_model.py
  displayName: 'Train ML model'
If no agent matches all demands, the job fails with a "No agent found in pool which satisfies the specified demands" error, or waits in the queue when a matching agent exists but is offline. Always check that at least one agent satisfies your demands before pushing a pipeline change.
Monitoring Agent Health and Utilization
You need visibility into your agent fleet. Azure DevOps provides some data through the API, but you should supplement it with your own monitoring.
Query the agent pool API to get agent status:
var http = require("https");
var orgUrl = "dev.azure.com";
var organization = "your-organization";
var poolId = 12;
var pat = process.env.AZP_TOKEN;
var auth = Buffer.from(":" + pat).toString("base64");
var options = {
hostname: orgUrl,
path: "/" + organization + "/_apis/distributedtask/pools/" + poolId + "/agents?includeAssignedRequest=true&api-version=7.1",
method: "GET",
headers: {
"Authorization": "Basic " + auth,
"Content-Type": "application/json"
}
};
var req = https.request(options, function(res) {
var data = "";
res.on("data", function(chunk) {
data += chunk;
});
res.on("end", function() {
var agents = JSON.parse(data).value;
agents.forEach(function(agent) {
// REST API 7.1 serializes the status enum as a string; tolerate numeric too
var status = (agent.status === "online" || agent.status === 1) ? "Online" : "Offline";
var busy = agent.assignedRequest ? "BUSY" : "Idle";
var lastJob = agent.assignedRequest
? agent.assignedRequest.definition.name
: "none";
console.log(
"[" + status + "] " + agent.name +
" | " + busy +
" | Current job: " + lastJob +
" | OS: " + agent.osDescription
);
});
});
});
req.on("error", function(err) {
console.error("Error fetching agent status:", err.message);
});
req.end();
Example output:
[Online] docker-agent-01 | BUSY | Current job: api-service-ci | OS: Linux 5.15.0-92-generic
[Online] docker-agent-02 | Idle | Current job: none | OS: Linux 5.15.0-92-generic
[Online] docker-agent-03 | BUSY | Current job: web-app-ci | OS: Linux 5.15.0-92-generic
[Offline] docker-agent-04 | Idle | Current job: none | OS: Linux 5.15.0-92-generic
For deeper monitoring, export metrics to Prometheus or Datadog. Track:
- Queue wait time — how long jobs wait for an available agent
- Agent utilization — percentage of time agents are running jobs vs idle
- Job duration trends — detect performance regressions in your build pipeline
- Agent failure rate — how often agents go offline or jobs fail due to agent issues
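Queue wait time is derivable from the same jobrequests API the orchestrator below uses: the delta between queueTime and assignTime. A rough sketch with jq (the sub() strips fractional seconds so fromdateiso8601 can parse the timestamps):
# Print how long each assigned job waited in the queue
curl -s -u ":${AZP_TOKEN}" \
  "https://dev.azure.com/${ORG}/_apis/distributedtask/pools/${POOL_ID}/jobrequests?api-version=7.1" \
  | jq -r '.value[]
      | select(.assignTime != null)
      | "\(.definition.name): \((.assignTime | sub("\\.[0-9]+"; "") | fromdateiso8601) - (.queueTime | sub("\\.[0-9]+"; "") | fromdateiso8601))s queue wait"'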
Cost Analysis
Let me break down a real scenario. Say you have a team of 20 developers pushing code throughout the day, generating roughly 80 pipeline runs per day, each averaging 12 minutes.
Microsoft-hosted agents:
Parallel jobs needed: 4 (to keep queue wait under 5 min)
Total build minutes: 80 × 12 = 960 min/day = ~20,000 min/month
Free tier: 1,800 min/month on the single free parallel job, nowhere near enough
Cost: 4 × $40/month = $160/month (paid parallel jobs have no per-minute metering)
Total: $160/month
Self-hosted agents (Docker on a single VM):
VM: Standard_D4s_v5 (4 vCPU, 16 GB RAM)
Cost: ~$140/month (Linux, pay-as-you-go)
Runs 4 containerized agents simultaneously
Build speed improvement: ~30% (warm caches, pre-installed deps)
Effective build time: 80 × 8.4 = 672 min/day
Total: $140/month
Self-hosted agents (Kubernetes with KEDA, scale-to-zero):
AKS cluster: Standard_D4s_v5 node pool, spot instances
Average cost: ~$65/month (agents scale to zero during nights/weekends)
Total: ~$65/month (plus ~$73/month if you run the cluster on the paid AKS Standard tier; the Free tier has no management fee)
The Kubernetes approach cuts the bill by roughly 40-60% compared to Microsoft-hosted agents, depending on the AKS tier, and builds finish ~30% faster because dependencies are baked into the image.
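If your numbers differ, the arithmetic is trivial to redo. A throwaway sketch using the figures above (adjust the constants to your team):
#!/bin/bash
# Rough monthly cost comparison: Microsoft-hosted vs one self-hosted VM
RUNS_PER_DAY=80
MINUTES_PER_RUN=12
PARALLEL_JOBS=4
WORKING_DAYS=21
HOSTED=$(( PARALLEL_JOBS * 40 ))   # $40 per paid parallel job, unlimited minutes
VM=140                             # e.g. Standard_D4s_v5, pay-as-you-go Linux
echo "Build minutes/month: $(( RUNS_PER_DAY * MINUTES_PER_RUN * WORKING_DAYS ))"
echo "Microsoft-hosted:    \$${HOSTED}/month"
echo "Self-hosted VM:      \$${VM}/month"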
Complete Working Example: Node.js Agent Orchestrator
Here is a Node.js application that monitors your Azure DevOps queue depth and scales Docker-based agents up and down. This is a simpler alternative to KEDA for teams not running Kubernetes.
// orchestrator.js
// Monitors Azure DevOps pipeline queue and scales Docker agents accordingly
var https = require("https");
var childProcess = require("child_process");
var os = require("os");
var CONFIG = {
organization: process.env.AZP_ORG || "your-organization",
pat: process.env.AZP_TOKEN,
poolName: process.env.AZP_POOL || "Docker-Ephemeral",
poolId: parseInt(process.env.AZP_POOL_ID || "12", 10),
agentImage: process.env.AGENT_IMAGE || "azdo-agent:latest",
minAgents: parseInt(process.env.MIN_AGENTS || "0", 10),
maxAgents: parseInt(process.env.MAX_AGENTS || "8", 10),
pollIntervalMs: parseInt(process.env.POLL_INTERVAL || "15000", 10),
cooldownMs: parseInt(process.env.COOLDOWN || "300000", 10), // 5 minutes
agentPrefix: "ephemeral-agent"
};
var lastScaleDownTime = 0;
var managedAgents = {}; // agentName -> { containerId, createdAt, status }
function log(message) {
var timestamp = new Date().toISOString();
console.log("[" + timestamp + "] " + message);
}
function azureDevOpsRequest(path, callback) {
var auth = Buffer.from(":" + CONFIG.pat).toString("base64");
var options = {
hostname: "dev.azure.com",
path: "/" + CONFIG.organization + path,
method: "GET",
headers: {
"Authorization": "Basic " + auth,
"Content-Type": "application/json"
}
};
var req = https.request(options, function(res) {
var data = "";
res.on("data", function(chunk) { data += chunk; });
res.on("end", function() {
try {
callback(null, JSON.parse(data));
} catch (err) {
callback(new Error("Failed to parse response: " + data));
}
});
});
req.on("error", callback);
req.end();
}
function getQueuedJobs(callback) {
var path = "/_apis/distributedtask/pools/" + CONFIG.poolId +
"/jobrequests?api-version=7.1";
azureDevOpsRequest(path, function(err, data) {
if (err) return callback(err);
var queued = (data.value || []).filter(function(job) {
return !job.assignTime; // Not yet assigned to an agent
});
var running = (data.value || []).filter(function(job) {
return job.assignTime && !job.finishTime;
});
callback(null, {
queuedCount: queued.length,
runningCount: running.length,
totalPending: queued.length
});
});
}
function getOnlineAgents(callback) {
var path = "/_apis/distributedtask/pools/" + CONFIG.poolId +
"/agents?includeAssignedRequest=true&api-version=7.1";
azureDevOpsRequest(path, function(err, data) {
if (err) return callback(err);
var agents = (data.value || []).filter(function(agent) {
return agent.status === "online" || agent.status === 1; // REST API returns the enum as a string; tolerate numeric too
});
var idle = agents.filter(function(agent) {
return !agent.assignedRequest;
});
var busy = agents.filter(function(agent) {
return !!agent.assignedRequest;
});
callback(null, {
total: agents.length,
idle: idle.length,
busy: busy.length,
agents: agents
});
});
}
function startAgent(agentName, callback) {
log("Starting agent: " + agentName);
var cmd = [
"docker", "run", "-d",
"--name", agentName,
"-e", "AZP_URL=https://dev.azure.com/" + CONFIG.organization,
"-e", "AZP_TOKEN=" + CONFIG.pat,
"-e", "AZP_POOL=" + CONFIG.poolName,
"-e", "AZP_AGENT_NAME=" + agentName,
"-v", "/var/run/docker.sock:/var/run/docker.sock",
CONFIG.agentImage
].join(" ");
childProcess.exec(cmd, function(err, stdout, stderr) {
if (err) {
log("ERROR starting agent " + agentName + ": " + stderr);
return callback(err);
}
var containerId = stdout.trim().substring(0, 12);
managedAgents[agentName] = {
containerId: containerId,
createdAt: Date.now(),
status: "running"
};
log("Started agent " + agentName + " (container: " + containerId + ")");
callback(null, containerId);
});
}
function stopAgent(agentName, callback) {
log("Stopping agent: " + agentName);
var cmd = "docker stop " + agentName + " && docker rm " + agentName;
childProcess.exec(cmd, function(err, stdout, stderr) {
if (err) {
log("WARN: Error stopping " + agentName + ": " + stderr);
}
delete managedAgents[agentName];
log("Stopped and removed agent: " + agentName);
callback(null);
});
}
function generateAgentName() {
var suffix = Date.now().toString(36) + Math.random().toString(36).substring(2, 6);
return CONFIG.agentPrefix + "-" + suffix;
}
function scaleUp(count, callback) {
var started = 0;
var errors = [];
if (count === 0) return callback(null, 0);
log("Scaling UP by " + count + " agent(s)");
for (var i = 0; i < count; i++) {
(function() {
var name = generateAgentName();
startAgent(name, function(err) {
if (err) errors.push(err);
started++;
if (started === count) {
callback(errors.length > 0 ? errors[0] : null, count - errors.length);
}
});
})();
}
}
function scaleDown(count, callback) {
var now = Date.now();
if (now - lastScaleDownTime < CONFIG.cooldownMs) {
log("Cooldown active, skipping scale down");
return callback(null, 0);
}
var agentNames = Object.keys(managedAgents);
var removed = 0;
var toRemove = Math.min(count, agentNames.length);
if (toRemove === 0) return callback(null, 0);
log("Scaling DOWN by " + toRemove + " agent(s)");
lastScaleDownTime = now;
var completed = 0;
for (var i = 0; i < toRemove; i++) {
stopAgent(agentNames[i], function() {
completed++;
removed++;
if (completed === toRemove) {
callback(null, removed);
}
});
}
}
function evaluateScaling() {
getQueuedJobs(function(err, queue) {
if (err) {
log("ERROR fetching queue: " + err.message);
return;
}
getOnlineAgents(function(err, agents) {
if (err) {
log("ERROR fetching agents: " + err.message);
return;
}
var managedCount = Object.keys(managedAgents).length;
log(
"Queue: " + queue.queuedCount + " pending | " +
"Agents: " + agents.total + " online (" +
agents.busy + " busy, " + agents.idle + " idle) | " +
"Managed: " + managedCount
);
// Scale up: one new agent per queued job, up to max
if (queue.queuedCount > 0) {
var needed = Math.min(
queue.queuedCount,
CONFIG.maxAgents - managedCount
);
if (needed > 0) {
scaleUp(needed, function(err, count) {
if (err) log("Scale up error: " + err.message);
else log("Successfully started " + count + " new agent(s)");
});
} else {
log("At max capacity (" + CONFIG.maxAgents + "), cannot scale up");
}
}
// Scale down: remove idle agents beyond minimum.
// Cap by the idle count so we never stop more agents than are actually idle.
if (queue.queuedCount === 0 && agents.idle > CONFIG.minAgents) {
var excess = Math.min(agents.idle - CONFIG.minAgents, managedCount);
if (excess > 0) {
scaleDown(excess, function(err, count) {
if (err) log("Scale down error: " + err.message);
else log("Removed " + count + " idle agent(s)");
});
}
}
});
});
}
// --- Main ---
if (!CONFIG.pat) {
console.error("ERROR: AZP_TOKEN environment variable is required");
process.exit(1);
}
log("Agent Orchestrator starting...");
log("Organization: " + CONFIG.organization);
log("Pool: " + CONFIG.poolName + " (ID: " + CONFIG.poolId + ")");
log("Min agents: " + CONFIG.minAgents + ", Max agents: " + CONFIG.maxAgents);
log("Poll interval: " + (CONFIG.pollIntervalMs / 1000) + "s");
log("Cooldown: " + (CONFIG.cooldownMs / 1000) + "s");
log("---");
// Run immediately, then on interval
evaluateScaling();
setInterval(evaluateScaling, CONFIG.pollIntervalMs);
// Graceful shutdown
process.on("SIGTERM", function() {
log("Received SIGTERM. Shutting down orchestrator...");
var agentNames = Object.keys(managedAgents);
var remaining = agentNames.length;
if (remaining === 0) {
log("No managed agents to clean up. Exiting.");
process.exit(0);
}
log("Cleaning up " + remaining + " managed agent(s)...");
agentNames.forEach(function(name) {
stopAgent(name, function() {
remaining--;
if (remaining === 0) {
log("All agents cleaned up. Exiting.");
process.exit(0);
}
});
});
});
Run the orchestrator:
export AZP_TOKEN="your-pat-token"
export AZP_ORG="your-organization"
export AZP_POOL="Docker-Ephemeral"
export AZP_POOL_ID="12"
export MIN_AGENTS="1"
export MAX_AGENTS="6"
export POLL_INTERVAL="15000"
node orchestrator.js
Expected output during a busy period:
[2026-02-08T14:30:00.123Z] Agent Orchestrator starting...
[2026-02-08T14:30:00.124Z] Organization: your-organization
[2026-02-08T14:30:00.124Z] Pool: Docker-Ephemeral (ID: 12)
[2026-02-08T14:30:00.124Z] Min agents: 1, Max agents: 6
[2026-02-08T14:30:00.124Z] Poll interval: 15s
[2026-02-08T14:30:00.125Z] Cooldown: 300s
[2026-02-08T14:30:00.125Z] ---
[2026-02-08T14:30:01.456Z] Queue: 3 pending | Agents: 1 online (1 busy, 0 idle) | Managed: 1
[2026-02-08T14:30:01.457Z] Scaling UP by 3 agent(s)
[2026-02-08T14:30:03.891Z] Started agent ephemeral-agent-m4x7k2ab (container: a3f2b1c9d4e5)
[2026-02-08T14:30:04.102Z] Started agent ephemeral-agent-m4x7k8cd (container: b7e3c2d1f6a8)
[2026-02-08T14:30:04.334Z] Started agent ephemeral-agent-m4x7kef9 (container: c1d4e5f2a3b7)
[2026-02-08T14:30:04.335Z] Successfully started 3 new agent(s)
[2026-02-08T14:30:15.789Z] Queue: 0 pending | Agents: 4 online (4 busy, 0 idle) | Managed: 4
[2026-02-08T14:35:30.123Z] Queue: 0 pending | Agents: 4 online (0 busy, 4 idle) | Managed: 4
[2026-02-08T14:35:30.124Z] Scaling DOWN by 3 agent(s)
[2026-02-08T14:35:32.456Z] Stopped and removed agent: ephemeral-agent-m4x7k2ab
[2026-02-08T14:35:33.789Z] Stopped and removed agent: ephemeral-agent-m4x7k8cd
[2026-02-08T14:35:34.012Z] Stopped and removed agent: ephemeral-agent-m4x7kef9
[2026-02-08T14:35:34.013Z] Removed 3 idle agent(s)
Common Issues and Troubleshooting
1. Agent Goes Offline Immediately After Starting
##[error] The agent did not connect within the allotted time of 60 seconds.
Agent listener started, but is unable to connect to the server.
Cause: Network connectivity issue. The agent cannot reach dev.azure.com on port 443. This happens frequently in corporate environments with proxy servers or restrictive firewalls.
Fix: Check DNS resolution and outbound HTTPS connectivity:
# Test connectivity
curl -v https://dev.azure.com/your-organization/_apis/connectionData
# If behind a proxy, configure the agent
export http_proxy=http://proxy.corp.com:8080
export https_proxy=http://proxy.corp.com:8080
export no_proxy=localhost,127.0.0.1
# Then reconfigure the agent
./config.sh --url ... --proxyurl http://proxy.corp.com:8080
2. Permission Denied Accessing Docker Socket
Got permission denied while trying to connect to the Docker daemon socket at
unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/json":
dial unix /var/run/docker.sock: connect: permission denied
Cause: The agent user is not in the docker group, or the Docker socket permissions are too restrictive.
Fix:
# Add the agent user to the docker group
sudo usermod -aG docker azagent
# Restart the agent service
sudo ./svc.sh stop
sudo ./svc.sh start
# For containerized agents, ensure the GID matches the host docker group
docker run -d \
--group-add $(getent group docker | cut -d: -f3) \
-v /var/run/docker.sock:/var/run/docker.sock \
azdo-agent:latest
3. PAT Token Expired — Agent Cannot Authenticate
VS30063: You are not authorized to access https://dev.azure.com/your-organization.
Failed to get an agent based on the provided agent id. Agent id: 847.
Cause: The Personal Access Token used during agent configuration has expired.
Fix: Generate a new PAT and reconfigure the agent:
cd /azp
sudo ./svc.sh stop
# Remove old configuration
./config.sh remove --auth pat --token OLD_EXPIRED_TOKEN
# If the old token is already expired and removal fails:
./config.sh remove --auth pat --token NEW_PAT_TOKEN
# Reconfigure with new token
./config.sh \
--url https://dev.azure.com/your-organization \
--auth pat \
--token NEW_PAT_TOKEN \
--pool "Linux-Build" \
--agent "build-agent-01" \
--replace \
--unattended
sudo ./svc.sh start
Set a calendar reminder 2 weeks before your PAT expires. Or better, automate rotation with a script that generates a new PAT via the Azure DevOps REST API and reconfigures agents.
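If you go the automation route, the PAT Lifecycle Management API handles the generation step. It does not accept a PAT for authentication; you need an Azure AD (Entra ID) access token. A sketch assuming an Azure CLI login (499b84ac-1321-427f-aa17-267ca6975798 is the Azure DevOps resource ID):
# Get an Entra ID access token for Azure DevOps
AAD_TOKEN=$(az account get-access-token \
  --resource 499b84ac-1321-427f-aa17-267ca6975798 \
  --query accessToken -o tsv)
# Mint a 60-day PAT scoped to agent pool management
curl -s -X POST \
  -H "Authorization: Bearer ${AAD_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{
    \"displayName\": \"agent-rotation-$(date +%Y%m%d)\",
    \"scope\": \"vso.agentpools_manage\",
    \"validTo\": \"$(date -u -d '+60 days' +%Y-%m-%dT%H:%M:%SZ)\",
    \"allOrgs\": false
  }" \
  "https://vssps.dev.azure.com/${ORG}/_apis/tokens/pats?api-version=7.1-preview.1" \
  | jq -r '.patToken.token'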
4. Disk Space Exhausted on Long-Running Agents
##[error] No space left on device
ENOSPC: no space left on device, write '/azp/_work/42/s/node_modules/.package-lock.json'
Cause: Build artifacts, Docker images, npm caches, and workspace directories accumulate over time. A busy agent can fill a 50 GB disk in under a week.
Fix: Implement automated cleanup. Add a cron job (or a pipeline maintenance task):
#!/bin/bash
# cleanup-agent.sh — run nightly via cron
echo "=== Agent Disk Cleanup ==="
echo "Before: $(df -h / | tail -1 | awk '{print $4}') free"
# Clean old workspaces (keep last 5)
cd /azp/_work
ls -dt */ | tail -n +6 | xargs rm -rf 2>/dev/null
# Prune Docker (if applicable)
docker system prune -af --filter "until=48h" 2>/dev/null
# Clear npm cache
npm cache clean --force 2>/dev/null
# Clear NuGet cache
dotnet nuget locals all --clear 2>/dev/null
echo "After: $(df -h / | tail -1 | awk '{print $4}') free"
Add to crontab:
# Run cleanup at 2 AM daily
0 2 * * * /azp/cleanup-agent.sh >> /var/log/agent-cleanup.log 2>&1
5. Jobs Stuck in "Waiting for Agent" State
Pool: Linux-Build
Demands:
Agent.OS -equals Linux
npm
dotnet
No agent found in pool which satisfies the specified demands.
Cause: No online agent in the pool has all the capabilities (demands) your pipeline requires. Either the agents are offline, or they lack a specific capability.
Fix: Check which agents are in the pool and what capabilities they report:
# List agents and their capabilities via the API
curl -s -u ":${AZP_TOKEN}" \
"https://dev.azure.com/${ORG}/_apis/distributedtask/pools/${POOL_ID}/agents?includeCapabilities=true&api-version=7.1" \
| jq '.value[] | {name: .name, status: .status, capabilities: .systemCapabilities | keys}'
Then either install the missing tool on the agent or remove the demand from your pipeline.
Best Practices
Use ephemeral (disposable) agents for all CI builds. Every job gets a fresh environment. This eliminates "works on my agent" problems and prevents credential leakage between builds. Docker containers or KEDA-scaled Kubernetes pods are ideal for this.
Bake dependencies into the agent image, not into your pipeline. If every build starts with apt-get install or npm install -g, you are wasting minutes per build. Put those installations in your Dockerfile and rebuild the image weekly.
Set both minimum AND maximum agent counts. A minimum of 1 keeps one warm agent ready for fast feedback on the first build of the day. A maximum prevents runaway scaling from eating your infrastructure budget if someone accidentally triggers 200 pipeline runs.
Separate pools by workload type. Do not mix GPU training jobs with web app CI in the same pool. A 10-minute training job blocking a 2-minute build creates unnecessary wait times. Create dedicated pools with appropriate hardware.
Monitor queue wait time as a team health metric. If developers regularly wait more than 2 minutes for an agent, you need more capacity. Queue wait time directly correlates with developer frustration and context switching.
Rotate PAT tokens on a schedule and store them in a vault. Do not hardcode tokens in scripts or Dockerfiles. Use Azure Key Vault, HashiCorp Vault, or Kubernetes secrets. Set up automated rotation every 60-90 days.
Run agent updates on a schedule, not ad-hoc. Pin your agent version in your Dockerfile or update script. Test new agent versions in a staging pool before rolling out to production pools. Microsoft occasionally ships breaking changes.
Tag agents with metadata using user capabilities. Include the agent image version, the date it was built, the hardware specs, and any special software installed. This makes debugging "why did this job fail on agent X but not agent Y" much easier.
Implement disk cleanup automation from day one. Do not wait until agents start failing with ENOSPC errors. Proactive cleanup is trivial to set up and saves hours of debugging.
Use the --replace flag when configuring ephemeral agents. This ensures a new agent can register with the same name as a previously terminated one without manual cleanup in the Azure DevOps portal.
