Building Resilient Agents That Don't Break When APIs Change Overnight

It was a Tuesday morning. I was drinking coffee and watching moose tracks fill with fresh snow outside my cabin window when my phone started buzzing. The job fetcher on Grizzly Peak Software had failed overnight. Not a timeout, not a rate limit — a 200 OK response with a completely different JSON schema than what it returned the day before.

The upstream API had changed its response format without warning, without a version bump, without a deprecation notice. My agent code parsed the new response, found none of the fields it expected, and silently produced zero results. The scheduler marked the run as successful because technically nothing threw an error. It just did nothing.

This is the reality of building AI agents that depend on external APIs. They break. Not when you expect them to. Not gracefully. And usually at the worst possible time.

I've been building agent systems for AutoDetective.ai and the job board on Grizzly Peak Software for the better part of a year now. Every single one of them has broken due to an upstream API change at some point. What I've learned is that resilience isn't something you add later — it's an architectural decision you make from day one.


Why Agents Are Uniquely Fragile

Traditional web applications are fragile too, but agents have a special kind of fragility that makes them harder to manage.

A traditional application calls an API, gets a response, renders a page. If the API breaks, the user sees an error. It's obvious, it's immediate, and someone files a bug report.

An agent calls an API, processes the response through an LLM, makes decisions based on the output, and then takes actions — possibly calling more APIs. If any link in that chain breaks, the failure can cascade in unpredictable ways. Worse, the LLM in the middle might paper over the problem. Feed a language model malformed data and it won't crash — it'll hallucinate something plausible and keep going. Your agent continues executing, making decisions based on garbage data, and you don't find out until the damage is done.

I learned this the hard way when one of my agent pipelines started classifying every job posting as "DevOps" for an entire day. The upstream API had changed a field name from job_title to title, and my extraction code was passing undefined to the classifier. The LLM didn't complain about getting undefined — it just defaulted to the most common category in its training data. Nobody noticed until a user emailed asking why every job on the site was DevOps.

That's the fundamental challenge: agents fail silently in ways that look like they're working.


Pattern 1: Schema Validation at Every Boundary

The first and most important pattern is validating the shape of data at every boundary in your agent system. Not just the external API boundary — every boundary. Between the API response and your processing logic. Between your processing logic and the LLM. Between the LLM output and your action layer.

Here's what this looks like in practice:

var Ajv = require("ajv");
var addFormats = require("ajv-formats"); // required for format: "uri" on Ajv v8+
var ajv = new Ajv({ allErrors: true });
addFormats(ajv);

var jobListingSchema = {
    type: "object",
    required: ["id", "title", "company", "url"],
    properties: {
        id: { type: ["string", "number"] },
        title: { type: "string", minLength: 1 },
        company: { type: "string", minLength: 1 },
        url: { type: "string", format: "uri" },
        salary_min: { type: ["number", "null"] },
        salary_max: { type: ["number", "null"] },
        location: { type: "string" },
        description: { type: "string" }
    },
    additionalProperties: true
};

var validateJobListing = ajv.compile(jobListingSchema);

function parseJobListings(apiResponse) {
    var listings = [];
    var errors = [];

    if (!Array.isArray(apiResponse)) {
        return {
            listings: [],
            errors: [{ message: "API response is not an array", raw: typeof apiResponse }]
        };
    }

    apiResponse.forEach(function(item, index) {
        var valid = validateJobListing(item);
        if (valid) {
            listings.push(item);
        } else {
            errors.push({
                index: index,
                errors: validateJobListing.errors,
                raw: JSON.stringify(item).substring(0, 200)
            });
        }
    });

    return { listings: listings, errors: errors };
}

The key detail here is that I'm not just validating and moving on. I'm collecting the validation errors and the raw data that failed validation. When an API changes its schema, I want to know exactly what changed. Did they rename a field? Add a new required field? Change a type from string to number? The error collection tells me that, and it gives me the raw data I need to update my schema mapping.

I also set additionalProperties: true deliberately. I don't want to reject a response just because the API added a new field I don't use yet. I only care that the fields I need are present and correctly typed.
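The same discipline applies at the LLM boundary. The misclassification incident above would have been caught by a constraint as simple as this (the category list and function name are illustrative, not my production code):

```javascript
var ALLOWED_CATEGORIES = ["devops", "frontend", "backend", "data", "design", "other"];

// Validate LLM classifier output before it reaches the action layer.
// Returns the normalized category, or null when the model's answer
// is not one of the labels we asked for.
function validateCategory(llmOutput) {
    if (typeof llmOutput !== "string") return null;
    var normalized = llmOutput.trim().toLowerCase();
    return ALLOWED_CATEGORIES.indexOf(normalized) !== -1 ? normalized : null;
}
```

A null here is a signal to retry the classification or flag the item for review — anything but silently accepting whatever string the model produced.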


Pattern 2: Fallback Chains

When you depend on multiple data sources, build a fallback chain. If source A fails, try source B. If source B fails, try source C. This sounds obvious, but the implementation details matter.

var sources = [
    { name: "remoteok", fetcher: require("./fetchers/remoteok") },
    { name: "remotive", fetcher: require("./fetchers/remotive") },
    { name: "arbeitnow", fetcher: require("./fetchers/arbeitnow") }
];

function fetchJobsWithFallback(category, options) {
    var results = {
        jobs: [],
        sourcesAttempted: [],
        sourcesSucceeded: [],
        sourcesFailed: []
    };

    return sources.reduce(function(chain, source) {
        return chain.then(function() {
            if (results.jobs.length >= (options.minResults || 50)) {
                return results;
            }

            results.sourcesAttempted.push(source.name);

            return fetchWithTimeout(source.fetcher, category, options.timeout || 10000)
                .then(function(jobs) {
                    var validated = parseJobListings(jobs);

                    if (validated.errors.length > 0) {
                        console.warn(
                            source.name + ": " + validated.errors.length +
                            " items failed validation out of " +
                            (validated.listings.length + validated.errors.length)
                        );
                    }

                    results.jobs = results.jobs.concat(validated.listings);
                    results.sourcesSucceeded.push(source.name);
                })
                .catch(function(err) {
                    console.error(source.name + " failed: " + err.message);
                    results.sourcesFailed.push({
                        name: source.name,
                        error: err.message
                    });
                });
        });
    }, Promise.resolve()).then(function() {
        return results;
    });
}

A few things to notice here. First, the fallback chain doesn't stop at the first successful source — it keeps going until it has enough results. This is important because a source might succeed but return fewer results than expected. Maybe the API is partially degraded, or maybe a schema change broke 80% of the listings but 20% still validate. You want to supplement from other sources.

Second, every source that fails gets logged with the error details. I pipe these logs into a monitoring system that alerts me when a source fails for more than two consecutive runs. One failure might be a network blip. Two failures means something changed and I need to investigate.

Third, the fetchWithTimeout wrapper is critical. Without it, a hung API connection will block your entire chain. I set aggressive timeouts — 10 seconds for job APIs that should respond in under 2 seconds. If it takes longer than that, something's wrong and I'd rather move on to the next source.
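The fetchWithTimeout wrapper isn't shown above; one minimal way to build it, assuming the fetcher returns a promise, is to race it against a timer:

```javascript
// Race the fetcher against a timer so a hung connection can't
// block the whole fallback chain. Clears the timer in both the
// success and failure paths so the process can exit cleanly.
function fetchWithTimeout(fetcher, category, timeoutMs) {
    var timer;
    var timeout = new Promise(function(resolve, reject) {
        timer = setTimeout(function() {
            reject(new Error("Timed out after " + timeoutMs + "ms"));
        }, timeoutMs);
    });

    return Promise.race([fetcher(category), timeout]).then(function(result) {
        clearTimeout(timer);
        return result;
    }, function(err) {
        clearTimeout(timer);
        throw err;
    });
}
```

Note that Promise.race doesn't cancel the underlying request — the losing promise is simply ignored — so this bounds how long the chain waits, not how long the socket stays open.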


Pattern 3: Retry with Exponential Backoff and Jitter

Retries are table stakes, but most people implement them wrong. Here's a pattern that actually works in production:

function retryWithBackoff(fn, options) {
    var maxRetries = options.maxRetries || 3;
    var baseDelay = options.baseDelay || 1000;
    var maxDelay = options.maxDelay || 30000;
    var retryOn = options.retryOn || function(err) {
        return err.statusCode >= 500 || err.code === "ECONNRESET" || err.code === "ETIMEDOUT";
    };

    function attempt(retryCount) {
        return fn().catch(function(err) {
            if (retryCount >= maxRetries || !retryOn(err)) {
                throw err;
            }

            var delay = Math.min(
                baseDelay * Math.pow(2, retryCount),
                maxDelay
            );

            // Add jitter: +/- 25% of the delay
            var jitter = delay * 0.25 * (Math.random() * 2 - 1);
            delay = Math.floor(delay + jitter);

            console.log(
                "Retry " + (retryCount + 1) + "/" + maxRetries +
                " after " + delay + "ms: " + err.message
            );

            return new Promise(function(resolve) {
                setTimeout(resolve, delay);
            }).then(function() {
                return attempt(retryCount + 1);
            });
        });
    }

    return attempt(0);
}

The jitter is the important part that most implementations skip. Without jitter, if your agent crashes and restarts, all the retries happen at exactly the same intervals, which means they all hit the API at the same time. With 10 agent instances retrying simultaneously, you create a thundering herd that makes the problem worse. Random jitter spreads the retries out.

The retryOn function is equally important. You should only retry on transient errors — 500s, connection resets, timeouts. If you get a 400 Bad Request or a 401 Unauthorized, retrying won't help. You'll just burn through your retry budget and delay the real error reporting.

I also want to call out what this function does NOT do: it doesn't retry on schema validation failures. If the API returns a 200 OK with a response that doesn't match your schema, that's not a transient error. That's a contract change. Retrying will give you the same broken response every time. You need a different response to that situation — an alert, a fallback to cached data, or a graceful degradation.
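One way to keep schema failures out of the retry path is to give them their own error type and exclude it in retryOn — a sketch with illustrative names, not the exact code I run:

```javascript
// Schema failures get their own error type so the retry layer can
// tell a contract change apart from a transient fault.
function SchemaChangeError(source, details) {
    var err = new Error("Schema mismatch from " + source);
    err.name = "SchemaChangeError";
    err.schemaErrors = details;
    return err;
}

// A retryOn predicate that refuses to retry contract changes:
// only server errors and connection-level faults are transient.
function isTransient(err) {
    if (err.name === "SchemaChangeError") return false;
    return err.statusCode >= 500 || err.code === "ECONNRESET" || err.code === "ETIMEDOUT";
}
```

Throwing a SchemaChangeError when parseJobListings reports too many validation failures lets the same error object carry the evidence — the schemaErrors payload — straight into your alerting.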


Pattern 4: Circuit Breakers

When an API is down, you don't want your agent hammering it with requests. Circuit breakers solve this:

function createCircuitBreaker(options) {
    var failureThreshold = options.failureThreshold || 5;
    var resetTimeout = options.resetTimeout || 60000;
    var state = "CLOSED";
    var failureCount = 0;
    var lastFailureTime = null;

    return {
        execute: function(fn) {
            if (state === "OPEN") {
                var timeSinceLastFailure = Date.now() - lastFailureTime;
                if (timeSinceLastFailure > resetTimeout) {
                    state = "HALF_OPEN";
                } else {
                    return Promise.reject(new Error(
                        "Circuit breaker OPEN. Resets in " +
                        Math.ceil((resetTimeout - timeSinceLastFailure) / 1000) + "s"
                    ));
                }
            }

            return fn().then(function(result) {
                if (state === "HALF_OPEN") {
                    state = "CLOSED";
                    failureCount = 0;
                }
                return result;
            }).catch(function(err) {
                failureCount++;
                lastFailureTime = Date.now();

                if (failureCount >= failureThreshold) {
                    state = "OPEN";
                    console.error(
                        "Circuit breaker OPENED after " + failureCount +
                        " failures. Will retry in " + (resetTimeout / 1000) + "s"
                    );
                }

                throw err;
            });
        },

        getState: function() {
            return {
                state: state,
                failureCount: failureCount,
                lastFailureTime: lastFailureTime
            };
        }
    };
}

// Usage
var remoteOkBreaker = createCircuitBreaker({
    failureThreshold: 3,
    resetTimeout: 300000 // 5 minutes
});

function fetchFromRemoteOk(category) {
    return remoteOkBreaker.execute(function() {
        return retryWithBackoff(function() {
            return makeApiCall("https://remoteok.com/api", { tag: category });
        }, { maxRetries: 2 });
    });
}

Notice how the circuit breaker wraps the retry logic, not the other way around. If the circuit is open, we don't even attempt the retries. This prevents your agent from wasting time and API quota on a service that's known to be down.

The HALF_OPEN state is crucial. After the reset timeout, the breaker allows one request through. If it succeeds, the circuit closes and normal operation resumes. If it fails, the circuit opens again for another timeout period. This gives the API time to recover without your agent flooding it the moment it comes back up.


Pattern 5: Cached Fallbacks and Graceful Degradation

Sometimes every source fails. Your fallback chain is exhausted, your circuit breakers are all open, and your agent has no fresh data. What do you do?

You serve stale data. It's not ideal, but it's better than serving nothing.

var fs = require("fs");
var path = require("path");

var CACHE_DIR = path.join(__dirname, ".cache");

function cacheResults(sourceId, data) {
    var cachePath = path.join(CACHE_DIR, sourceId + ".json");
    var cacheEntry = {
        timestamp: Date.now(),
        data: data
    };

    try {
        if (!fs.existsSync(CACHE_DIR)) {
            fs.mkdirSync(CACHE_DIR, { recursive: true });
        }
        fs.writeFileSync(cachePath, JSON.stringify(cacheEntry));
    } catch (err) {
        console.error("Failed to write cache for " + sourceId + ": " + err.message);
    }
}

function getCachedResults(sourceId, maxAge) {
    var cachePath = path.join(CACHE_DIR, sourceId + ".json");
    maxAge = maxAge || 86400000; // Default: 24 hours

    try {
        if (!fs.existsSync(cachePath)) return null;

        var cacheEntry = JSON.parse(fs.readFileSync(cachePath, "utf8"));
        var age = Date.now() - cacheEntry.timestamp;

        if (age > maxAge) {
            console.warn(
                "Cache for " + sourceId + " is " +
                Math.round(age / 3600000) + " hours old (max: " +
                Math.round(maxAge / 3600000) + "h)"
            );
            return null;
        }

        console.log(
            "Using cached data for " + sourceId +
            " (age: " + Math.round(age / 60000) + " minutes)"
        );
        return cacheEntry.data;
    } catch (err) {
        console.error("Failed to read cache for " + sourceId + ": " + err.message);
        return null;
    }
}

The important decision here is how old is too old. For job listings, 24-hour-old data is fine — jobs don't change that frequently. For stock prices, 24-second-old data might be too stale. You have to make that call based on your domain.

I also use the cache proactively, not just as a last resort. Every successful fetch writes to the cache. This means my cache is always warm, always recent, and always available. If the APIs all go down at 2 AM, my scheduler still has yesterday's data to work with.
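One way to wire this into the fallback chain is a thin glue layer with the fetch and cache functions injected, which keeps the layers independent. The function name and shape here are illustrative:

```javascript
// Try the live fallback chain first, keep the cache warm on every
// successful run, and fall back to cached data only when every
// source comes back empty. fetchFn, readCache and writeCache are
// injected so this layer knows nothing about their implementations.
function withCacheFallback(fetchFn, readCache, writeCache) {
    return function(category, options) {
        return fetchFn(category, options).then(function(results) {
            if (results.jobs.length > 0) {
                writeCache("jobs-" + category, results.jobs);
            } else {
                var cached = readCache("jobs-" + category);
                if (cached) {
                    results.jobs = cached;
                    results.fromCache = true;
                }
            }
            return results;
        });
    };
}
```

The fromCache flag matters downstream: anything rendering the data can label it as stale, and the monitor can count cache fallbacks as a degradation signal.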


Pattern 6: Health Monitoring and Alerting

All of these patterns are useless if you don't know when they're firing. Build observability into your agent from the start:

function createAgentMonitor(agentName) {
    var stats = {
        totalRuns: 0,
        successfulRuns: 0,
        failedRuns: 0,
        retriesUsed: 0,
        circuitBreakerTrips: 0,
        cacheFallbacks: 0,
        lastRunTime: null,
        lastError: null,
        sourceStats: {}
    };

    return {
        recordRun: function(result) {
            stats.totalRuns++;
            stats.lastRunTime = new Date().toISOString();

            if (result.sourcesAttempted.length > 0 &&
                result.sourcesFailed.length === result.sourcesAttempted.length) {
                stats.failedRuns++;
                stats.lastError = result.sourcesFailed[0].error;
            } else {
                stats.successfulRuns++;
            }

            result.sourcesSucceeded.forEach(function(name) {
                if (!stats.sourceStats[name]) {
                    stats.sourceStats[name] = { success: 0, failure: 0 };
                }
                stats.sourceStats[name].success++;
            });

            result.sourcesFailed.forEach(function(source) {
                if (!stats.sourceStats[source.name]) {
                    stats.sourceStats[source.name] = { success: 0, failure: 0 };
                }
                stats.sourceStats[source.name].failure++;
            });
        },

        getReport: function() {
            var successRate = stats.totalRuns > 0
                ? (stats.successfulRuns / stats.totalRuns * 100).toFixed(1)
                : "N/A";

            return {
                agent: agentName,
                successRate: successRate + "%",
                stats: stats
            };
        },

        shouldAlert: function() {
            if (stats.failedRuns >= 3 && stats.successfulRuns === 0) return true;
            var recentFailureRate = stats.totalRuns > 10
                ? stats.failedRuns / stats.totalRuns
                : 0;
            return recentFailureRate > 0.5;
        }
    };
}

I run a health check endpoint on my agent systems that exposes these stats. A simple cron job pings the endpoint every hour and sends me a notification if the failure rate exceeds 50%. It's not sophisticated monitoring — no Grafana dashboards, no PagerDuty integration — but it's caught every significant outage before users noticed.


Putting It All Together

The resilient agent architecture looks like this:

  1. Request layer: Circuit breaker wraps retry logic wraps the actual API call
  2. Validation layer: Schema validation on every response, collecting errors for debugging
  3. Fallback layer: Multiple sources tried in sequence, supplemented with cached data when needed
  4. Processing layer: LLM receives only validated, well-structured data
  5. Monitoring layer: Every decision point logged, health stats exposed, alerts on degradation

Each layer is independent. You can swap out the retry logic without touching the circuit breaker. You can add a new data source without changing the validation layer. You can replace the LLM without modifying the fallback chain.

This is not over-engineering. I know it looks like a lot of code for what should be a simple "call an API and process the result" operation. But I've been doing this long enough to know that the API will change. The service will go down. The LLM will hallucinate if you give it bad data. These aren't edge cases — they're the normal operating conditions of any system that depends on external services.


Lessons from a Year of Broken Agents

Let me leave you with the things I wish someone had told me a year ago:

1. APIs will change without notice. Even the ones with version numbers. Even the ones with SLAs. Even the ones run by companies you trust. Build for it.

2. LLMs make silent failures worse. A traditional system crashes when it gets bad data. An LLM improvises. That improvisation looks like correct output until you check the results. Always validate LLM output against known constraints.

3. Retries without circuit breakers are dangerous. You'll DDoS a struggling service and get your IP banned. Ask me how I know.

4. Cache everything, even when you don't think you need to. The cache isn't just for performance — it's your insurance policy against total source failure.

5. Monitor the health of your agent, not just the health of the APIs. An API returning 200 OK with a changed schema is a healthy API and a broken agent. Your monitoring needs to detect both.

6. Test your failure paths. Once a month, I intentionally break one of my data sources to verify that the fallback chain works correctly. The first time I did this, I found that my fallback logic had a bug that caused it to return an empty array instead of falling through to the cache. Testing failure paths is just as important as testing happy paths.
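That monthly drill can itself be automated. Here's a self-contained sketch of the idea — simulate an outage on the primary source and assert that the fallback actually produces data. tryInOrder is a simplified stand-in for the real fallback chain:

```javascript
// Simplified fallback: try each fetcher in order, moving on when
// one rejects. Stands in for the full chain from Pattern 2.
function tryInOrder(fetchers) {
    return fetchers.reduce(function(chain, fetcher) {
        return chain.catch(function() { return fetcher(); });
    }, Promise.reject(new Error("start")));
}

// Failure-path smoke test: primary source down, backup healthy.
// The assertion is that the chain never returns an empty result --
// exactly the bug my first fallback implementation had.
function failurePathTest() {
    var broken = function() { return Promise.reject(new Error("simulated outage")); };
    var backup = function() { return Promise.resolve([{ id: 1, title: "Backend Engineer" }]); };

    return tryInOrder([broken, backup]).then(function(jobs) {
        if (!Array.isArray(jobs) || jobs.length === 0) {
            throw new Error("Fallback returned no data");
        }
        return jobs;
    });
}
```

Run something like this in CI and the empty-array bug gets caught by a machine instead of a user email.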

Building resilient agents is more work upfront. Significantly more work. But the alternative is getting woken up at 3 AM by angry users because an API you don't control decided to rename a field from job_title to title and your entire pipeline silently broke.

I'd rather write the circuit breaker.


Shane Larson is a software engineer and founder of Grizzly Peak Software. He writes about AI, APIs, and building real things from his cabin in Caswell Lakes, Alaska. His book on training and fine-tuning LLMs is available on Amazon.
