Software Engineering

The Grizzly Mindset for Resilient Software Architecture

I lost power for three days last January.

Not a rolling brownout, not a brief interruption — three full days of nothing. The temperature outside was negative twenty-two. The wood stove was running full-time. My generator kicked on for the chest freezer and the well pump, but the internet was gone, the development machine was dark, and there was nothing to do about it except wait.

And the funny thing is, everything kept working. Not me — the software. Every system I've built over the last decade that's running in production kept serving requests, handling users, processing data. Because a long time ago I started designing systems the way Alaska teaches you to think about survival: assume the worst will happen, prepare for it, and build so that when things go sideways, the damage stays contained.

I call it the grizzly mindset. And it's the most important thing I've ever learned about software architecture.


What a Grizzly Actually Does

People romanticize bears. They're majestic, they're powerful, they're symbols of the wild. Fine. But if you actually watch a grizzly operate — and I watch them every year from my cabin near Caswell Lakes — what strikes you isn't the power. It's the pragmatism.

A grizzly doesn't waste energy. It doesn't fight battles it can avoid. It doesn't chase prey that's faster than it unless the calorie math works out. It stores fat when food is available, hibernates when it isn't, and it builds redundancy into every aspect of its survival strategy. Multiple food sources. Multiple denning options. The ability to swim, climb, dig, and run.

That's not raw power. That's architecture.

Resilient software works the same way. It's not about building the most powerful system — it's about building the one that survives. The one that keeps working when a dependency goes down, when traffic spikes ten-fold, when someone deploys a bad config at 2 AM on a Friday.


Principle 1: Never Depend on a Single Food Source

A grizzly eats berries, salmon, roots, insects, carrion, and whatever else it can find. If the salmon run is bad, it doesn't starve — it eats more berries. If the berry crop fails, there's always something else.

Most of the systems I've seen fail catastrophically did so because they had a single critical dependency with no fallback. One database. One API provider. One authentication service. One deployment path.

// The fragile way — single dependency, no fallback
async function getUserProfile(userId) {
  var response = await fetch('https://api.primary-provider.com/users/' + userId);
  if (!response.ok) {
    throw new Error('Failed to fetch user profile');
  }
  return response.json();
}

// The grizzly way — multiple food sources
function createProfileFetcher(options) {
  var providers = options.providers || [];
  var cache = options.cache || null;
  var cacheTTL = options.cacheTTL || 300000; // 5 minutes

  return async function getUserProfile(userId) {
    // Try cache first — the stored fat
    if (cache) {
      var cached = await cache.get('profile:' + userId);
      if (cached) {
        return JSON.parse(cached);
      }
    }

    // Try each provider in order — multiple food sources
    var lastError = null;
    for (var i = 0; i < providers.length; i++) {
      try {
        var profile = await providers[i].fetch(userId);

        // Cache the result for next time
        if (cache && profile) {
          await cache.set('profile:' + userId, JSON.stringify(profile), cacheTTL);
        }

        return profile;
      } catch (err) {
        lastError = err;
        console.warn('Provider ' + providers[i].name + ' failed: ' + err.message);
      }
    }

    throw new Error('All providers failed.' + (lastError ? ' Last error: ' + lastError.message : ' No providers configured.'));
  };
}

var getProfile = createProfileFetcher({
  providers: [
    { name: 'primary-api', fetch: function(id) { return fetchFromPrimary(id); } },
    { name: 'replica-db', fetch: function(id) { return fetchFromReplica(id); } },
    { name: 'local-cache', fetch: function(id) { return fetchFromLocalStore(id); } }
  ],
  cache: redisClient
});

The key insight isn't just "have backups." It's that your system should treat switching between data sources as a normal operation, not an emergency procedure. The grizzly doesn't panic when the salmon aren't running. It just eats something else.
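One way to make that switch feel routine, sketched below with illustrative names (createAdaptiveOrder isn't part of the fetcher above, just a companion idea): remember which provider succeeded most recently and try it first next time, so a failover reshuffles the list instead of raising an alarm.

```javascript
// Sketch: keep the provider list ordered by recent success, so switching
// sources is just "the list changed", not an emergency. Provider objects
// have the same { name, fetch } shape as the array above.
function createAdaptiveOrder(providers) {
  var order = providers.slice();

  return {
    list: function() {
      return order.slice();
    },
    promote: function(name) {
      // Move the provider that just succeeded to the front of the line
      for (var i = 1; i < order.length; i++) {
        if (order[i].name === name) {
          order.unshift(order.splice(i, 1)[0]);
          break;
        }
      }
    }
  };
}
```

Each successful fetch calls promote(provider.name); the next request starts with whatever worked last, and a provider that comes back healthy earns its spot back the same way.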


Principle 2: Hibernate, Don't Crash

I've written about circuit breakers before, but the grizzly mindset goes deeper than that. Hibernation isn't failure — it's a survival strategy. The bear reduces its metabolic rate by seventy-five percent, drops its heart rate to eight beats per minute, and enters a state where it can survive for months on stored resources.

Your systems need hibernation modes too. Not just circuit breakers that cut off failing dependencies, but genuine reduced-operation states where the system continues providing value at reduced capacity.

// Service with graduated degradation levels
function createResilientService(config) {
  var degradationLevel = 0; // 0 = full, 1 = reduced, 2 = minimal, 3 = hibernating
  var healthCheckInterval = config.healthCheckInterval || 30000;
  var metrics = { errors: 0, requests: 0, lastCheck: Date.now() };

  function assessHealth() {
    var errorRate = metrics.requests > 0
      ? metrics.errors / metrics.requests
      : 0;

    if (errorRate > 0.5) {
      degradationLevel = 3; // Hibernate
      console.log('Service entering hibernation — error rate: ' + (errorRate * 100).toFixed(1) + '%');
    } else if (errorRate > 0.25) {
      degradationLevel = 2; // Minimal operations
    } else if (errorRate > 0.1) {
      degradationLevel = 1; // Reduced operations
    } else {
      degradationLevel = 0; // Full operations
    }

    // Reset counters for next window
    metrics.errors = 0;
    metrics.requests = 0;
    metrics.lastCheck = Date.now();
  }

  setInterval(assessHealth, healthCheckInterval);

  return {
    handleRequest: function(req) {
      metrics.requests++;

      switch (degradationLevel) {
        case 3: // Hibernating — serve only cached/static responses
          return serveCachedResponse(req);

        case 2: // Minimal — skip non-essential features
          return serveMinimal(req);

        case 1: // Reduced — disable background jobs, reduce freshness
          return serveReduced(req);

        default: // Full operations
          return serveFull(req);
      }
    },

    recordError: function() {
      metrics.errors++;
    },

    getStatus: function() {
      var levels = ['full', 'reduced', 'minimal', 'hibernating'];
      return levels[degradationLevel];
    }
  };
}

I ran a SaaS product a few years back — AutoDetective.ai for automotive diagnostics — and the most valuable architecture decision I made was building three distinct operation modes. Full mode hit the OpenAI API, pulled real-time data, the whole thing. Reduced mode used cached results and simpler models. Minimal mode served pre-computed answers for common queries. When OpenAI had outages (and they did, multiple times), my users still got value. Degraded value, but value.

The alternative — a loading spinner and an error message — isn't architecture. It's hoping.


Principle 3: Territory Awareness

A grizzly knows its territory intimately. Every stream, every berry patch, every game trail. It knows where other bears are, where the boundaries are, and what resources are available at any given time.

In software, this translates to observability — but not the dashboard-full-of-metrics kind that most teams implement. I mean genuine awareness of what your system is doing right now, what resources are available, and where the boundaries are.

// Territory-aware service registry
function createTerritoryMap() {
  var services = {};
  var boundaries = {};

  return {
    register: function(name, config) {
      services[name] = {
        endpoint: config.endpoint,
        healthUrl: config.healthUrl,
        lastKnownStatus: 'unknown',
        lastChecked: null,
        responseTimeMs: [],
        capacityPercent: 100
      };
    },

    setBoundary: function(name, limits) {
      boundaries[name] = {
        maxRPS: limits.maxRPS || 100,
        maxLatencyMs: limits.maxLatencyMs || 5000,
        maxErrorRate: limits.maxErrorRate || 0.05,
        currentRPS: 0
      };
    },

    canAccess: function(name) {
      var service = services[name];
      var boundary = boundaries[name];

      if (!service || service.lastKnownStatus === 'down') {
        return { allowed: false, reason: 'service unavailable' };
      }

      if (boundary && boundary.currentRPS >= boundary.maxRPS) {
        return { allowed: false, reason: 'rate limit reached' };
      }

      var avgLatency = 0;
      if (service.responseTimeMs.length > 0) {
        var sum = service.responseTimeMs.reduce(function(a, b) { return a + b; }, 0);
        avgLatency = sum / service.responseTimeMs.length;
      }

      if (boundary && avgLatency > boundary.maxLatencyMs) {
        return { allowed: false, reason: 'latency too high: ' + avgLatency + 'ms' };
      }

      return { allowed: true };
    },

    recordAccess: function(name, latencyMs, success) {
      var service = services[name];
      if (!service) return;

      // The map remembers whether the last call succeeded — this is what
      // canAccess checks against
      service.lastKnownStatus = success ? 'up' : 'down';
      service.lastChecked = Date.now();
      service.responseTimeMs.push(latencyMs);
      if (service.responseTimeMs.length > 100) {
        service.responseTimeMs.shift(); // Keep rolling window
      }

      if (boundaries[name]) {
        boundaries[name].currentRPS++;
        setTimeout(function() {
          boundaries[name].currentRPS--;
        }, 1000);
      }
    }
  };
}

I've watched teams spend months building elaborate monitoring dashboards that nobody looks at. The grizzly doesn't consult a dashboard — it knows its territory because it lives there. Your system needs to know its own territory the same way: automatically, continuously, and with built-in responses to what it finds.
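To close the loop between observing and responding, every outbound call can be routed through the map. A minimal sketch, assuming a territory object with the canAccess/recordAccess shape above (guardedCall is my name for the wrapper, not an API):

```javascript
// Sketch: consult the territory map before a call, report back after it,
// so the map stays current without anyone looking at a dashboard.
function guardedCall(territory, name, fn) {
  var check = territory.canAccess(name);
  if (!check.allowed) {
    return Promise.reject(new Error('Call to ' + name + ' blocked: ' + check.reason));
  }

  var start = Date.now();
  return Promise.resolve()
    .then(fn)
    .then(function(result) {
      territory.recordAccess(name, Date.now() - start, true);
      return result;
    })
    .catch(function(err) {
      territory.recordAccess(name, Date.now() - start, false);
      throw err;
    });
}
```

The built-in response comes for free: once enough calls fail, canAccess starts saying no, and the wrapper stops sending traffic into a dead zone.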


Principle 4: Fat Reserves — Build Slack Into Everything

Before hibernation, a grizzly will gain up to three pounds per day. It's not being greedy — it's building reserves that will sustain it through months of zero intake.

Every resilient system I've ever built has had intentional slack. Not waste — slack. The difference matters. Waste is unused capacity that serves no purpose. Slack is unused capacity that serves the purpose of handling the unexpected.

This shows up in three places:

Compute slack. Don't run your servers at 80% CPU and call it efficient. Run them at 50% and call it prepared. When that traffic spike hits — and it will — you'll have headroom to absorb it while your auto-scaling catches up.

Time slack. If your batch job takes four hours and your window is six hours, that's not a two-hour waste. That's a two-hour buffer for the day it takes five hours because the database is slow, or the network is congested, or someone's running a migration in parallel.

Data slack. Keep more state than you think you need. Cache aggressively. Store intermediate results. When a downstream service goes down, that cached data is the difference between degraded service and no service.
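Data slack can be as simple as refusing to evict. Here's a sketch of the idea with illustrative names (createStaleCache is not a library API): entries outlive their freshness window so a stale answer can stand in when the source is down.

```javascript
// Sketch: a cache that never throws an entry away just because it went stale.
// Fresh entries are preferred; stale ones are the fallback of last resort.
function createStaleCache(freshMs) {
  var store = {};

  return {
    put: function(key, value, now) {
      store[key] = { value: value, storedAt: now };
    },
    getFresh: function(key, now) {
      var entry = store[key];
      return entry && (now - entry.storedAt) <= freshMs ? entry.value : null;
    },
    getStale: function(key) {
      // Anything beats nothing when the downstream service is dark
      var entry = store[key];
      return entry ? entry.value : null;
    }
  };
}

// Usage: fresh cache first, then the source, then stale cache
async function getWithSlack(cache, key, now, fetchFn) {
  var fresh = cache.getFresh(key, now);
  if (fresh !== null) return fresh;
  try {
    var value = await fetchFn(key);
    cache.put(key, value, now);
    return value;
  } catch (err) {
    var stale = cache.getStale(key);
    if (stale !== null) return stale; // degraded service, not no service
    throw err;
  }
}
```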

// Request queue with built-in slack
function createSlackQueue(options) {
  var maxConcurrent = options.maxConcurrent || 10;
  var slackFactor = options.slackFactor || 0.3; // 30% slack
  var effectiveMax = Math.floor(maxConcurrent * (1 - slackFactor));
  var active = 0;
  var queue = [];
  var burstActive = false;

  function processNext() {
    var currentLimit = burstActive ? maxConcurrent : effectiveMax;

    while (queue.length > 0 && active < currentLimit) {
      var item = queue.shift();
      active++;

      // Promise.resolve() guards against fn throwing synchronously,
      // which would otherwise leave the active counter stuck
      Promise.resolve()
        .then(item.fn)
        .then(item.resolve)
        .catch(item.reject)
        .finally(function() {
          active--;
          processNext();
        });
    }
  }

  return {
    enqueue: function(fn, priority) {
      return new Promise(function(resolve, reject) {
        var item = { fn: fn, resolve: resolve, reject: reject, priority: priority || 0 };

        if (priority > 0) {
          // High priority items go to front
          queue.unshift(item);
        } else {
          queue.push(item);
        }

        processNext();
      });
    },

    enableBurst: function() {
      burstActive = true;
      console.log('Burst mode enabled — using slack capacity');
      processNext();
    },

    disableBurst: function() {
      burstActive = false;
    },

    getStats: function() {
      return {
        active: active,
        queued: queue.length,
        effectiveLimit: burstActive ? maxConcurrent : effectiveMax,
        slackAvailable: !burstActive ? (maxConcurrent - effectiveMax) : 0
      };
    }
  };
}

The thirty percent slack factor might look wasteful on a capacity planning spreadsheet. It's not. It's the reason your system survives Black Friday, or a viral Reddit post, or whatever unexpected load event hits you. I've been on both sides of this. Systems with slack survive. Systems without it make the incident channel very exciting.


Principle 5: Scars Are Data

Grizzly bears accumulate scars. Fights with other bears, encounters with porcupines, failed fishing attempts on rocky streambeds. Those scars aren't damage — they're data. Every scar represents a lesson the bear survived and integrated.

Your production systems should work the same way. Every incident, every outage, every near-miss should leave a permanent mark in your architecture — not just in a post-mortem document that nobody reads six months later, but in the code itself.

// Incident-driven configuration
var LEARNED_LIMITS = {
  // 2024-03-15: Database connection pool exhaustion under load
  // We ran at 50 connections and hit the wall at 200 concurrent users
  maxDbConnections: 100, // Doubled from original 50

  // 2024-07-22: Memory leak from unclosed streams on file upload
  // Streams weren't being destroyed on client disconnect
  uploadStreamTimeout: 30000, // Kill uploads after 30s

  // 2024-11-03: Third-party geocoding API went down for 6 hours
  // No fallback, entire location feature was dead
  geocodingProviders: ['mapbox', 'google', 'openstreetmap'], // Was just ['mapbox']

  // 2025-02-18: Log aggregator couldn't handle our volume during sale event
  // Lost 4 hours of logs, couldn't debug a payment issue
  localLogRetentionHours: 72, // Keep local logs even when shipping to aggregator

  // 2025-08-30: DNS resolution failure cascaded through all services
  // Every service was resolving DNS on every request
  dnsCacheTTL: 300, // Cache DNS lookups for 5 minutes
};

// Every number in that object has a story. Every story has a scar.

I keep comments like these in my production configs. New engineers on a project can read the history of everything that went wrong and why the system is configured the way it is. It's more valuable than any architecture document I've ever written because it's grounded in reality, not theory.


Principle 6: Avoid Fights You Don't Need

A grizzly will bluff-charge before it commits to a real fight. It'll make noise, stand tall, put on a show. Actual combat is a last resort because every fight carries risk — even ones you win.

In architecture, this means avoiding unnecessary complexity, unnecessary dependencies, and unnecessary technology choices. Every external service you integrate is a potential fight. Every microservice boundary is a potential fight. Every clever abstraction is a potential fight.

I build with Node.js and Express. PostgreSQL and MongoDB. Pug templates and vanilla JavaScript on the front end. It's not exciting. It doesn't look impressive on a conference slide. But I've been running production systems with this stack for years, and the failure modes are well-understood, the debugging is straightforward, and the operational burden is minimal.

The teams I see getting into trouble are the ones adding Kubernetes when Docker Compose would do, or implementing event sourcing when a PostgreSQL table would work, or building a microservices architecture when they have three developers and one product.

Choose boring technology. Avoid fights you don't need. Save your energy for the fights that actually matter.


The Mindset in Practice

I rebuilt the backend for grizzlypeaksoftware.com last year with all of these principles in mind. It's a content site — articles, a job board, an ad system. Not exactly mission-critical infrastructure. But I applied the grizzly mindset anyway:

  • Multiple data sources (Contentful for articles, PostgreSQL for jobs and ads, MongoDB for contacts)
  • Graceful degradation (the site works even if PostgreSQL is down — you just don't see jobs)
  • Built-in slack (rate limiting with generous headroom, connection pooling with reserves)
  • Scar-driven config (every timeout value, every retry count has a reason behind it)
  • Minimal complexity (Express, Pug templates, server-rendered HTML, progressive enhancement)
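The graceful-degradation bullet looks roughly like this in practice. This is a sketch of the pattern, not the actual site code; renderHomepage and the source names are illustrative:

```javascript
// Sketch: assemble the page from whatever sections succeed. Articles are
// the core content; jobs are optional and degrade to "not shown" when
// their store is down.
async function renderHomepage(sources) {
  var page = { articles: [], jobs: [], jobsUnavailable: false };

  // Core content: if this fails, the page genuinely can't render
  page.articles = await sources.fetchArticles();

  // Optional content: a database outage costs the jobs section, not the site
  try {
    page.jobs = await sources.fetchJobs();
  } catch (err) {
    page.jobsUnavailable = true;
  }

  return page;
}
```

The decision that matters is made per section, not per page: each data source gets classified as load-bearing or optional, and only the load-bearing ones are allowed to take the page down with them.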

It's not going to win any architecture awards. But it runs. It keeps running. And when something goes wrong — because something always goes wrong — the blast radius is contained and the recovery is automatic.

That's the grizzly mindset. Not the most powerful system. Not the most sophisticated. The one that survives.


The Bear Doesn't Read Hacker News

One last thought. The grizzly doesn't optimize for what other bears think about its survival strategy. It doesn't adopt new foraging techniques because they're trending. It doesn't abandon a working den because someone published a paper about a theoretically better one.

Build for survival. Build for the conditions you actually face, not the ones you imagine. Test your systems under failure, not just under ideal conditions. Keep slack in your resources, scars in your configs, and multiple food sources in your architecture.

And when the power goes out for three days in January, your systems will keep running without you.

That's resilience. That's the grizzly mindset.


Shane Larson is a software engineer and the founder of Grizzly Peak Software. He writes code, builds products, and watches grizzly bears from a cabin in Caswell Lakes, Alaska. His book on training large language models is available on Amazon.
