MongoDB Backup Strategies and Disaster Recovery

A comprehensive guide to MongoDB backup strategies covering mongodump, point-in-time recovery, automated S3 backups, retention policies, and disaster recovery planning.

I have seen production databases disappear. Not in some dramatic, Hollywood-style explosion — just a single mistyped command, an unfiltered deleteMany({}), or a failed disk that nobody noticed until it was too late. The conversation that follows is always the same: "Do we have backups?" If the answer is no, or if the backups have never been tested, you are in for the worst week of your career.

Backups are not optional infrastructure. They are the last line of defense against data corruption, human error, ransomware, hardware failure, and bad deployments. This guide covers everything you need to build a real, tested, automated MongoDB backup strategy — from basic mongodump commands through to a fully automated Node.js backup system with S3 storage, retention policies, and verified restore procedures.

Why Backups Matter More Than You Think

Most teams understand backups in theory but underestimate the threat surface in practice. Here are the real scenarios that will eventually hit your database:

  • Human error — A developer runs a migration script against production instead of staging. An admin drops a collection. Someone pushes a code change that silently corrupts documents over hours before anyone notices.
  • Data corruption — Unclean shutdowns, storage subsystem bugs, or replication lag can leave your data in an inconsistent state.
  • Ransomware and security breaches — Publicly exposed MongoDB instances have been mass-targeted. Attackers wipe your data and leave a ransom note in a new collection.
  • Hardware failure — Disks fail. Entire availability zones go down. Even with replica sets, a bug that corrupts data will replicate that corruption to every secondary.

Replica sets are not backups. They protect against hardware failure, but they replicate every destructive operation instantly. If a bad write hits the primary, it hits every secondary within milliseconds.

RPO and RTO: Define Your Requirements First

Before choosing a backup strategy, you need to define two numbers:

Recovery Point Objective (RPO) — How much data can you afford to lose? If your RPO is one hour, you need backups at least every hour. If it is zero, you need continuous replication to a separate system.

Recovery Time Objective (RTO) — How long can your application be down during a restore? A 500 GB database restored from mongodump takes significantly longer than restoring from a filesystem snapshot.

These numbers drive every decision that follows. A startup with a 10 GB database and a 24-hour RPO has very different requirements than a fintech company with a 2 TB database and a 5-minute RPO.

mongodump and mongorestore: The Foundation

The mongodump and mongorestore utilities are the most straightforward backup tools in the MongoDB ecosystem. They produce a binary export (BSON format) of your data that can be restored on any compatible MongoDB instance.

Basic Backup

# Full database backup
mongodump --uri="mongodb://user:password@localhost:27017/myapp" --out=/backups/2026-02-13

# Single collection backup
mongodump --uri="mongodb://user:password@localhost:27017/myapp" --collection=users --out=/backups/users

# Backup with gzip compression (reduces size 60-80%)
mongodump --uri="mongodb://user:password@localhost:27017/myapp" --gzip --archive=/backups/myapp-2026-02-13.archive

# Backup from a replica set, reading from a secondary
mongodump --uri="mongodb://user:[email protected]:27017,rs2.example.com:27017/myapp?replicaSet=rs0&readPreference=secondary" --gzip --archive=/backups/myapp-full.archive

Basic Restore

# Restore from directory
mongorestore --uri="mongodb://user:password@localhost:27017" --drop /backups/2026-02-13

# Restore from compressed archive
mongorestore --uri="mongodb://user:password@localhost:27017" --gzip --archive=/backups/myapp-2026-02-13.archive --drop

# Restore a single collection
mongorestore --uri="mongodb://user:password@localhost:27017/myapp" --collection=users --drop /backups/2026-02-13/myapp/users.bson

The --drop flag drops each collection before restoring it. Without it, mongorestore will attempt to insert documents and fail on duplicate _id values.

Limitations of mongodump

mongodump is not a point-in-time snapshot unless you use the --oplog flag with a replica set. Without it, the backup may contain documents written at different points during the dump process. For large databases, this inconsistency window can be significant.

Additionally, mongodump reads every document through mongod, which competes with your normal workload for cache and I/O and can put meaningful pressure on the server. Always run it against a secondary in a replica set, never against the primary during peak traffic.

Point-in-Time Recovery with the Oplog

The oplog (operations log) is a capped collection that records every write operation on a replica set. By combining a full mongodump with oplog entries, you can restore your database to any specific second in time.

# Backup with oplog (replica set only). Note: --oplog requires a full-instance
# dump, so the URI must not name a single database.
mongodump --uri="mongodb://user:password@localhost:27017/?replicaSet=rs0&authSource=admin" --oplog --gzip --archive=/backups/myapp-with-oplog.archive

# Restore to a specific point in time
mongorestore --gzip --archive=/backups/myapp-with-oplog.archive --oplogReplay --oplogLimit="1707800400:1"

The --oplogLimit parameter is a BSON timestamp. The first number is Unix epoch seconds, the second is the ordinal within that second. This lets you replay operations up to the exact moment before a destructive event occurred.
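
The epoch-seconds half of that value is an ordinary Unix timestamp, so it is easy to derive from a wall-clock time. A minimal sketch in Node.js (the cutoff datetime is an example):

// Convert a wall-clock cutoff to the value expected by --oplogLimit.
// The ":1" ordinal selects the first operation within that second.
var cutoff = new Date("2026-02-13T02:00:00Z");   // moment just before the destructive write
var epochSeconds = Math.floor(cutoff.getTime() / 1000);
console.log("--oplogLimit=" + epochSeconds + ":1");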

Point-in-time recovery is what separates a basic backup strategy from a real disaster recovery plan. It is the difference between losing 24 hours of data and losing 30 seconds.

Filesystem Snapshots

For large databases where mongodump takes too long, filesystem snapshots offer near-instantaneous backups. This works with LVM snapshots on Linux or EBS snapshots on AWS.

# Lock writes, take snapshot, unlock
mongosh --eval "db.fsyncLock()"
# Take LVM or EBS snapshot here
lvcreate --size 10G --snapshot --name mongo-snap /dev/vg0/mongo-data
mongosh --eval "db.fsyncUnlock()"

The write lock window is typically under a second. On AWS, EBS snapshots are incremental at the block level, making them fast and storage-efficient regardless of database size.

Critical requirement: Your MongoDB data directory and journal must be on the same filesystem/volume. If they are split across volumes, you need to snapshot both simultaneously, which most cloud providers do not support atomically.
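
If you script the lock/snapshot/unlock sequence, keep the lock window tight and always unlock, even when the snapshot call fails. A minimal Node.js sketch, assuming the official mongodb driver, the aws-sdk v2 EC2 client, and a known EBS volume ID (the region and volume ID are placeholders):

var MongoClient = require("mongodb").MongoClient;
var AWS = require("aws-sdk");

async function snapshotBackup(volumeId) {
  var client = await MongoClient.connect(process.env.MONGO_URI);
  var admin = client.db("admin");
  var ec2 = new AWS.EC2({ region: "us-east-1" });

  await admin.command({ fsync: 1, lock: true });        // flush to disk and block writes
  try {
    await ec2.createSnapshot({
      VolumeId: volumeId,
      Description: "mongo-data " + new Date().toISOString()
    }).promise();                                        // returns once the snapshot is initiated
  } finally {
    await admin.command({ fsyncUnlock: 1 });             // always release the write lock
    await client.close();
  }
}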

MongoDB Atlas Backup

If you are running MongoDB Atlas, backups are built in:

  • Continuous backup with point-in-time restore to any second within your retention window
  • Cloud provider snapshots taken every 6, 8, 12, or 24 hours depending on your tier
  • Queryable snapshots that let you query backup data without a full restore
  • Cross-region snapshot distribution for disaster recovery

Atlas handles all the complexity of oplog-based recovery, snapshot scheduling, and retention. For most teams, this is the right answer unless cost or compliance requirements push you to self-managed infrastructure.

Backup Strategies: Full, Incremental, Continuous

Full backups capture everything every time. Simple but expensive for large databases. Suitable for databases under 50 GB where the backup completes in minutes.

Incremental backups capture only changes since the last backup. MongoDB's oplog enables this — take a full backup weekly and capture oplog entries continuously. Restoring requires replaying the full backup plus all incremental segments.

Continuous replication mirrors writes to a separate system in real time. This gives you an RPO near zero but requires a dedicated secondary or a tool like MongoDB's change streams.
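
As a rough illustration of the change stream approach, the sketch below tails writes on the source cluster and applies them to a second cluster. It assumes the official mongodb Node.js driver, a hypothetical BACKUP_URI target, and a single database named myapp; resume-token persistence and error handling are omitted:

var MongoClient = require("mongodb").MongoClient;

async function mirrorWrites() {
  var source = await MongoClient.connect(process.env.MONGO_URI);
  var target = await MongoClient.connect(process.env.BACKUP_URI);
  var changes = source.db("myapp").watch([], { fullDocument: "updateLookup" });

  changes.on("change", async function (event) {
    var coll = target.db("myapp").collection(event.ns.coll);
    if (event.operationType === "delete") {
      await coll.deleteOne({ _id: event.documentKey._id });
    } else if (event.fullDocument) {
      // insert, replace, and (looked-up) update events carry the full document
      await coll.replaceOne({ _id: event.documentKey._id }, event.fullDocument, { upsert: true });
    }
    // A production version would persist event._id (the resume token) so the
    // stream can resume where it left off after a restart.
  });
}

mirrorWrites().catch(console.error);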

Grandfather-Father-Son Rotation

The GFS rotation scheme balances storage cost against recovery flexibility:

  • Daily (Son): Keep the last 7 daily backups
  • Weekly (Father): Keep the last 4 weekly backups (every Sunday)
  • Monthly (Grandfather): Keep the last 12 monthly backups (first of each month)

This gives you fine-grained recovery for recent events and progressively coarser recovery for older events, without storing 365 daily backups.
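
One way to express GFS in code is a keep-or-discard check applied to each backup's date. A sketch, assuming both arguments are Date objects, Sundays anchor the weekly tier, and the first of the month anchors the monthly tier:

// Returns true if a backup taken at backupDate should be kept under GFS rules.
function shouldKeep(backupDate, now) {
  var ageDays = (now - backupDate) / (1000 * 60 * 60 * 24);
  if (ageDays <= 7) return true;                                   // Son: last 7 daily backups
  if (backupDate.getDay() === 0 && ageDays <= 28) return true;     // Father: Sundays, last 4 weeks
  if (backupDate.getDate() === 1 && ageDays <= 366) return true;   // Grandfather: 1st of month, last 12 months
  return false;
}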

Complete Working Example: Automated Backup System

Here is a production-grade backup system built with Node.js. It runs mongodump on schedule, compresses the output, uploads to S3, verifies backup integrity, applies a retention policy, and sends email notifications on failure.

// backup-system.js
var childProcess = require("child_process");
var fs = require("fs");
var path = require("path");
var AWS = require("aws-sdk");
var cron = require("node-cron");
var nodemailer = require("nodemailer");

var config = {
  mongoUri: process.env.MONGO_URI || "mongodb://localhost:27017",  // no database in the URI: --oplog requires a full-instance dump
  s3Bucket: process.env.BACKUP_S3_BUCKET || "myapp-backups",
  s3Region: process.env.AWS_REGION || "us-east-1",
  s3Prefix: process.env.BACKUP_S3_PREFIX || "mongodb/",
  localBackupDir: process.env.BACKUP_LOCAL_DIR || "/tmp/mongo-backups",
  retentionDays: parseInt(process.env.BACKUP_RETENTION_DAYS, 10) || 30,
  schedule: process.env.BACKUP_SCHEDULE || "0 2 * * *",  // 2 AM daily
  notifyEmail: process.env.BACKUP_NOTIFY_EMAIL || "",
  smtpHost: process.env.SMTP_HOST || "smtp.gmail.com",
  smtpPort: parseInt(process.env.SMTP_PORT, 10) || 587,
  smtpUser: process.env.SMTP_USER || "",
  smtpPass: process.env.SMTP_PASS || ""
};

var s3 = new AWS.S3({ region: config.s3Region });

function getTimestamp() {
  var now = new Date();
  var year = now.getFullYear();
  var month = String(now.getMonth() + 1).padStart(2, "0");
  var day = String(now.getDate()).padStart(2, "0");
  var hour = String(now.getHours()).padStart(2, "0");
  var minute = String(now.getMinutes()).padStart(2, "0");
  return year + "-" + month + "-" + day + "_" + hour + "-" + minute;
}

function ensureDir(dir) {
  if (!fs.existsSync(dir)) {
    fs.mkdirSync(dir, { recursive: true });
  }
}

function runMongodump(outputPath, callback) {
  var args = [
    "--uri=" + config.mongoUri,
    "--gzip",
    "--archive=" + outputPath,
    "--oplog"
  ];

  console.log("[backup] Starting mongodump...");
  var startTime = Date.now();

  var proc = childProcess.spawn("mongodump", args, { stdio: "pipe" });
  var stderr = "";

  proc.stderr.on("data", function (data) {
    stderr += data.toString();
  });

  proc.on("close", function (code) {
    var elapsed = ((Date.now() - startTime) / 1000).toFixed(1);

    if (code !== 0) {
      return callback(new Error("mongodump exited with code " + code + ": " + stderr));
    }

    var stats = fs.statSync(outputPath);
    var sizeMB = (stats.size / (1024 * 1024)).toFixed(2);
    console.log("[backup] mongodump completed in " + elapsed + "s, size: " + sizeMB + " MB");

    callback(null, { path: outputPath, size: stats.size, elapsed: elapsed });
  });
}

function uploadToS3(filePath, callback) {
  var fileName = path.basename(filePath);
  var s3Key = config.s3Prefix + fileName;
  var fileStream = fs.createReadStream(filePath);

  console.log("[backup] Uploading to s3://" + config.s3Bucket + "/" + s3Key);

  var params = {
    Bucket: config.s3Bucket,
    Key: s3Key,
    Body: fileStream,
    ServerSideEncryption: "AES256",
    StorageClass: "STANDARD_IA"
  };

  s3.upload(params, function (err, data) {
    if (err) {
      return callback(new Error("S3 upload failed: " + err.message));
    }

    console.log("[backup] Upload complete: " + data.Location);
    callback(null, { key: s3Key, location: data.Location });
  });
}

function verifyBackup(filePath, callback) {
  // Verify the archive is a valid gzip file by reading the header
  console.log("[backup] Verifying backup integrity...");

  var fd = fs.openSync(filePath, "r");
  var header = Buffer.alloc(2);
  fs.readSync(fd, header, 0, 2, 0);
  fs.closeSync(fd);

  // Gzip magic number: 0x1f 0x8b
  if (header[0] !== 0x1f || header[1] !== 0x8b) {
    return callback(new Error("Backup file is not a valid gzip archive"));
  }

  // Run a dry-run restore to verify BSON integrity
  var proc = childProcess.spawn("mongorestore", [
    "--gzip",
    "--archive=" + filePath,
    "--dryRun",
    "--oplogReplay"
  ], { stdio: "pipe" });

  var stderr = "";
  proc.stderr.on("data", function (data) {
    stderr += data.toString();
  });

  proc.on("close", function (code) {
    if (code !== 0) {
      return callback(new Error("Backup verification failed: " + stderr));
    }
    console.log("[backup] Backup integrity verified");
    callback(null);
  });
}

function applyRetentionPolicy(callback) {
  console.log("[backup] Applying retention policy (" + config.retentionDays + " days)...");

  var cutoffDate = new Date();
  cutoffDate.setDate(cutoffDate.getDate() - config.retentionDays);

  var params = {
    Bucket: config.s3Bucket,
    Prefix: config.s3Prefix
  };

  s3.listObjectsV2(params, function (err, data) {
    if (err) {
      return callback(new Error("Failed to list S3 objects: " + err.message));
    }

    var toDelete = [];
    data.Contents.forEach(function (obj) {
      if (obj.LastModified < cutoffDate) {
        toDelete.push({ Key: obj.Key });
      }
    });

    if (toDelete.length === 0) {
      console.log("[backup] No expired backups to remove");
      return callback(null, 0);
    }

    var deleteParams = {
      Bucket: config.s3Bucket,
      Delete: { Objects: toDelete }
    };

    s3.deleteObjects(deleteParams, function (err) {
      if (err) {
        return callback(new Error("Failed to delete expired backups: " + err.message));
      }
      console.log("[backup] Removed " + toDelete.length + " expired backup(s)");
      callback(null, toDelete.length);
    });
  });
}

function sendNotification(subject, body, callback) {
  if (!config.notifyEmail || !config.smtpUser) {
    console.log("[backup] No notification email configured, skipping");
    return callback(null);
  }

  var transporter = nodemailer.createTransport({
    host: config.smtpHost,
    port: config.smtpPort,
    secure: config.smtpPort === 465,
    auth: {
      user: config.smtpUser,
      pass: config.smtpPass
    }
  });

  var mailOptions = {
    from: config.smtpUser,
    to: config.notifyEmail,
    subject: subject,
    text: body
  };

  transporter.sendMail(mailOptions, function (err) {
    if (err) {
      console.error("[backup] Failed to send notification: " + err.message);
    }
    callback(null);
  });
}

function cleanupLocal(filePath) {
  try {
    if (fs.existsSync(filePath)) {
      fs.unlinkSync(filePath);
      console.log("[backup] Cleaned up local file: " + filePath);
    }
  } catch (e) {
    console.error("[backup] Failed to clean up " + filePath + ": " + e.message);
  }
}

function runBackup() {
  var timestamp = getTimestamp();
  var fileName = "backup-" + timestamp + ".archive.gz";
  var filePath = path.join(config.localBackupDir, fileName);

  ensureDir(config.localBackupDir);

  console.log("[backup] === Starting backup: " + timestamp + " ===");
  var startTime = Date.now();

  runMongodump(filePath, function (err, dumpResult) {
    if (err) {
      console.error("[backup] FAILED: " + err.message);
      sendNotification(
        "[ALERT] MongoDB Backup Failed",
        "Backup failed at " + new Date().toISOString() + "\n\nError: " + err.message,
        function () {}
      );
      return;
    }

    verifyBackup(filePath, function (err) {
      if (err) {
        console.error("[backup] VERIFICATION FAILED: " + err.message);
        cleanupLocal(filePath);
        sendNotification(
          "[ALERT] MongoDB Backup Verification Failed",
          "Backup verification failed at " + new Date().toISOString() + "\n\nError: " + err.message,
          function () {}
        );
        return;
      }

      uploadToS3(filePath, function (err, uploadResult) {
        cleanupLocal(filePath);

        if (err) {
          console.error("[backup] UPLOAD FAILED: " + err.message);
          sendNotification(
            "[ALERT] MongoDB Backup Upload Failed",
            "S3 upload failed at " + new Date().toISOString() + "\n\nError: " + err.message,
            function () {}
          );
          return;
        }

        applyRetentionPolicy(function (err, removedCount) {
          var totalElapsed = ((Date.now() - startTime) / 1000).toFixed(1);
          if (err) {
            console.error("[backup] Retention policy error: " + err.message);
          }

          console.log("[backup] === Backup complete in " + totalElapsed + "s ===");

          sendNotification(
            "[OK] MongoDB Backup Successful",
            "Backup completed successfully at " + new Date().toISOString() +
            "\n\nFile: " + uploadResult.key +
            "\nSize: " + (dumpResult.size / (1024 * 1024)).toFixed(2) + " MB" +
            "\nDuration: " + totalElapsed + "s" +
            "\nExpired backups removed: " + (removedCount || 0),
            function () {}
          );
        });
      });
    });
  });
}

// Schedule automated backups
console.log("[backup] Scheduling backups with cron: " + config.schedule);
cron.schedule(config.schedule, function () {
  runBackup();
});

// Allow manual trigger
module.exports = { runBackup: runBackup };

// Run immediately if called directly
if (require.main === module) {
  runBackup();
}

Restore Scripts

Full restore from the latest S3 backup:

// restore.js
var childProcess = require("child_process");
var fs = require("fs");
var path = require("path");
var AWS = require("aws-sdk");

var config = {
  mongoUri: process.env.MONGO_URI || "mongodb://localhost:27017",  // full-instance URI; --oplogReplay cannot target a single database
  s3Bucket: process.env.BACKUP_S3_BUCKET || "myapp-backups",
  s3Region: process.env.AWS_REGION || "us-east-1",
  s3Prefix: process.env.BACKUP_S3_PREFIX || "mongodb/",
  localRestoreDir: "/tmp/mongo-restore"
};

var s3 = new AWS.S3({ region: config.s3Region });

function getLatestBackup(callback) {
  var params = {
    Bucket: config.s3Bucket,
    Prefix: config.s3Prefix
  };

  s3.listObjectsV2(params, function (err, data) {
    if (err) return callback(err);

    if (!data.Contents || data.Contents.length === 0) {
      return callback(new Error("No backups found in bucket"));
    }

    // Sort by last modified descending
    data.Contents.sort(function (a, b) {
      return b.LastModified - a.LastModified;
    });

    callback(null, data.Contents[0]);
  });
}

function downloadBackup(s3Key, localPath, callback) {
  console.log("[restore] Downloading " + s3Key + "...");

  var params = {
    Bucket: config.s3Bucket,
    Key: s3Key
  };

  var file = fs.createWriteStream(localPath);
  var stream = s3.getObject(params).createReadStream();

  stream.pipe(file);

  file.on("finish", function () {
    file.close();
    var stats = fs.statSync(localPath);
    console.log("[restore] Downloaded " + (stats.size / (1024 * 1024)).toFixed(2) + " MB");
    callback(null);
  });

  stream.on("error", function (err) {
    callback(err);
  });
}

function runRestore(archivePath, oplogLimit, callback) {
  var args = [
    "--uri=" + config.mongoUri,
    "--gzip",
    "--archive=" + archivePath,
    "--drop",
    "--oplogReplay"
  ];

  if (oplogLimit) {
    args.push("--oplogLimit=" + oplogLimit);
    console.log("[restore] Point-in-time restore to: " + oplogLimit);
  }

  console.log("[restore] Running mongorestore...");
  var startTime = Date.now();

  var proc = childProcess.spawn("mongorestore", args, { stdio: "inherit" });

  proc.on("close", function (code) {
    var elapsed = ((Date.now() - startTime) / 1000).toFixed(1);

    if (code !== 0) {
      return callback(new Error("mongorestore exited with code " + code));
    }

    console.log("[restore] Restore completed in " + elapsed + "s");
    callback(null);
  });
}

// Parse CLI arguments
var args = process.argv.slice(2);
var oplogLimit = null;
var specificBackup = null;

args.forEach(function (arg) {
  if (arg.indexOf("--oplog-limit=") === 0) {
    oplogLimit = arg.split("=")[1];
  }
  if (arg.indexOf("--backup=") === 0) {
    specificBackup = arg.split("=")[1];
  }
});

// Confirmation prompt
var readline = require("readline");
var rl = readline.createInterface({ input: process.stdin, output: process.stdout });

rl.question("WARNING: This will DROP existing data. Type 'RESTORE' to confirm: ", function (answer) {
  rl.close();

  if (answer !== "RESTORE") {
    console.log("[restore] Aborted.");
    process.exit(1);
  }

  if (!fs.existsSync(config.localRestoreDir)) {
    fs.mkdirSync(config.localRestoreDir, { recursive: true });
  }

  function doRestore(s3Key) {
    var localPath = path.join(config.localRestoreDir, path.basename(s3Key));

    downloadBackup(s3Key, localPath, function (err) {
      if (err) {
        console.error("[restore] Download failed: " + err.message);
        process.exit(1);
      }

      runRestore(localPath, oplogLimit, function (err) {
        // Cleanup
        try { fs.unlinkSync(localPath); } catch (e) { /* ignore */ }

        if (err) {
          console.error("[restore] FAILED: " + err.message);
          process.exit(1);
        }

        console.log("[restore] === Restore successful ===");
        process.exit(0);
      });
    });
  }

  if (specificBackup) {
    doRestore(config.s3Prefix + specificBackup);
  } else {
    getLatestBackup(function (err, latest) {
      if (err) {
        console.error("[restore] " + err.message);
        process.exit(1);
      }
      console.log("[restore] Using latest backup: " + latest.Key);
      doRestore(latest.Key);
    });
  }
});

Usage:

# Restore latest backup
node restore.js

# Restore specific backup
node restore.js --backup=backup-2026-02-13_02-00.archive.gz

# Point-in-time restore (replay oplog up to specific timestamp)
node restore.js --oplog-limit="1739404800:1"

Backing Up Replica Sets vs. Standalone

For replica sets, always run backups against a secondary member. This avoids any performance impact on the primary and does not interfere with client operations.

# Connect to secondary explicitly
mongodump --host rs0/secondary1.example.com:27017 --readPreference=secondary --oplog --gzip --archive=/backups/rs-backup.archive.gz

The --oplog flag is only available with replica sets. It captures a consistent snapshot by recording operations that occur during the dump and replaying them on restore.

For standalone instances, you do not have access to the oplog. Your only options for consistency are:

  1. Lock the database with db.fsyncLock() during the dump
  2. Accept the possibility of minor inconsistencies during the backup window
  3. Convert your standalone to a single-node replica set (recommended — this is free and enables oplog-based backups)
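
The conversion in option 3 is small. A sketch, assuming you can restart mongod with a replication stanza in its config file:

// mongod.conf needs:
//   replication:
//     replSetName: "rs0"
// After restarting mongod, initiate the set once from mongosh:
rs.initiate({ _id: "rs0", members: [{ _id: 0, host: "localhost:27017" }] });
rs.status().ok;   // 1 once the single-node set is up; mongodump --oplog now works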

Sharded Cluster Backup Considerations

Backing up sharded clusters is significantly more complex. You need a consistent snapshot across all shards and the config servers simultaneously.

Do not simply run mongodump against each shard independently. The data will be inconsistent across shards because each dump starts and finishes at different times.

Instead:

  1. Stop the balancer before starting the backup: sh.stopBalancer()
  2. Back up the config server replica set first — this contains shard metadata and chunk distribution
  3. Back up each shard replica set using --oplog on their secondaries
  4. Restart the balancer after all backups complete: sh.startBalancer()
  5. Restore in order: config servers first, then each shard
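
In mongosh connected to a mongos, the balancer bracketing looks like this (a sketch; the per-shard dumps run as in the replica set examples above):

sh.stopBalancer();       // waits for in-flight chunk migrations to finish
sh.getBalancerState();   // should report false before any dump starts
// ...back up the config server replica set, then each shard, here...
sh.startBalancer();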

For Atlas-managed sharded clusters, this is all handled automatically. For self-managed clusters, seriously consider whether the operational complexity is worth it versus using Atlas or a third-party backup tool.

Backup Encryption

Backups contain your data in plaintext (albeit binary BSON format). Encrypt them before storing:

# Encrypt with OpenSSL after mongodump
mongodump --uri="$MONGO_URI" --gzip --archive | openssl enc -aes-256-cbc -salt -pbkdf2 -pass file:/etc/backup-key -out /backups/encrypted-backup.archive.gz.enc

# Decrypt before restore
openssl enc -aes-256-cbc -d -pbkdf2 -pass file:/etc/backup-key -in /backups/encrypted-backup.archive.gz.enc | mongorestore --gzip --archive --drop

When using S3, enable server-side encryption with ServerSideEncryption: "AES256" (as shown in the backup system above) or use KMS-managed keys for envelope encryption.
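
If you prefer a KMS-managed key, only the upload parameters change. A sketch based on the uploadToS3 function above (the key alias is a placeholder):

var params = {
  Bucket: config.s3Bucket,
  Key: s3Key,
  Body: fileStream,
  ServerSideEncryption: "aws:kms",
  SSEKMSKeyId: "alias/myapp-backup-key",
  StorageClass: "STANDARD_IA"
};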

Monitoring Backup Health

A backup you do not monitor is a backup you cannot trust. Implement these checks:

  • Backup recency — Alert if the most recent backup in S3 is older than your backup interval plus a margin (e.g., 26 hours for daily backups)
  • Backup size — Track sizes over time. A sudden drop may indicate an incomplete backup. A sudden spike may indicate a data explosion worth investigating.
  • Verification results — Log dry-run restore outcomes and alert on failures.
  • Backup duration — If backups take progressively longer, your data growth may eventually push backups past your maintenance window.

A small script can cover the recency and size checks:

// health-check.js — Run via cron or monitoring system
var AWS = require("aws-sdk");
var s3 = new AWS.S3({ region: "us-east-1" });

function checkBackupHealth(callback) {
  s3.listObjectsV2({
    Bucket: "myapp-backups",
    Prefix: "mongodb/"
  }, function (err, data) {
    if (err) return callback(err);

    if (!data.Contents || data.Contents.length === 0) {
      return callback(new Error("CRITICAL: No backups found"));
    }

    data.Contents.sort(function (a, b) {
      return b.LastModified - a.LastModified;
    });

    var latest = data.Contents[0];
    var ageHours = (Date.now() - latest.LastModified.getTime()) / (1000 * 60 * 60);
    var sizeMB = (latest.Size / (1024 * 1024)).toFixed(2);

    var status = {
      latestBackup: latest.Key,
      ageHours: ageHours.toFixed(1),
      sizeMB: sizeMB,
      totalBackups: data.Contents.length,
      healthy: ageHours < 26 && latest.Size > 1024
    };

    if (!status.healthy) {
      return callback(new Error("WARN: Latest backup is " + status.ageHours + " hours old, " + sizeMB + " MB"));
    }

    callback(null, status);
  });
}

checkBackupHealth(function (err, status) {
  if (err) {
    console.error(err.message);
    process.exit(1);
  }
  console.log("Backup health OK:", JSON.stringify(status, null, 2));
});

Testing Restore Procedures

The single most important thing about your backup strategy is whether you have actually tested a restore. I cannot overstate this. An untested backup is not a backup — it is a hope.

Schedule regular restore tests. At minimum, monthly:

  1. Spin up a temporary MongoDB instance (Docker makes this trivial)
  2. Download the latest backup from S3
  3. Run a full restore
  4. Run validation queries to confirm document counts and data integrity
  5. Tear down the temporary instance

# Monthly restore test using Docker
docker run -d --name mongo-restore-test -p 27018:27017 mongo:7
sleep 5  # give mongod a moment to start accepting connections
mongorestore --uri="mongodb://localhost:27018" --gzip --archive=latest-backup.archive.gz --drop
# Run validation queries against the restored database
mongosh "mongodb://localhost:27018/myapp" --eval "db.users.countDocuments({})"
docker stop mongo-restore-test && docker rm mongo-restore-test

Automate this. If restoring from your backup fails silently for six months because of a configuration change, you will only find out when you desperately need that restore to work.

Common Issues and Troubleshooting

1. mongodump fails with "not authorized"

Your backup user needs the backup role or, at minimum, find on all databases plus find on local.oplog.rs for oplog access. Create a dedicated backup user:

// Run this in mongosh while connected to the admin database:
db.createUser({
  user: "backup_agent",
  pwd: "secure-password-here",
  roles: [{ role: "backup", db: "admin" }]
});

2. Backup archive is 0 bytes or truncated

This usually means mongodump ran out of disk space on the local filesystem. Check /tmp or wherever your backup directory is located. The backup system above cleans up local files after upload, but if uploads fail repeatedly, stale files can accumulate.

3. mongorestore fails with "cannot restore to a non-empty database"

Use the --drop flag to drop collections before restoring. Without it, mongorestore tries to insert documents that may already exist, causing duplicate key errors.

4. Restore takes hours for a moderately sized database

mongorestore rebuilds indexes after restoring data. For databases with many indexes, this can take longer than the data restore itself. Use --noIndexRestore to skip index rebuilding and recreate indexes manually afterward if you need to minimize downtime:

mongorestore --gzip --archive=backup.archive.gz --drop --noIndexRestore
# Then recreate indexes
mongosh myapp --eval "db.users.createIndex({ email: 1 }, { unique: true })"

5. Point-in-time restore replays too many or too few operations

The --oplogLimit timestamp must be a BSON timestamp, not a JavaScript date. Query the oplog, which lives in the local database on replica set members, to find recent timestamps and see the format.
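
For example, in mongosh (db.getSiblingDB avoids switching databases; the projection just keeps the output readable):

db.getSiblingDB("local").oplog.rs.find({}, { ts: 1, op: 1, ns: 1 }).sort({ ts: -1 }).limit(5)
// Each ts is a Timestamp(<epoch seconds>, <ordinal>); pass those two numbers
// to mongorestore as --oplogLimit="<seconds>:<ordinal>".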

Best Practices

  1. Run backups against secondaries. Never run mongodump against your primary during production hours. Direct read traffic to a secondary member dedicated to backup operations.

  2. Test restores regularly. Schedule monthly automated restore tests. Document the procedure so any team member can execute it under pressure at 3 AM.

  3. Encrypt backups at rest and in transit. Use S3 server-side encryption and TLS for MongoDB connections. Treat backup files with the same security posture as your production database.

  4. Monitor backup freshness and size. Set up alerts for missing backups, unexpected size changes, and failed verification. The moment you stop watching is the moment something breaks.

  5. Keep at least one backup off-site. Cross-region S3 replication, a different cloud provider, or even a local copy. If your primary region goes down and your backups are in the same region, you have two problems.

  6. Version your backup and restore scripts. Check them into source control. Include the exact commands needed to restore in your runbook. When disaster strikes, nobody wants to reverse-engineer the restore process from memory.

  7. Convert standalone instances to single-node replica sets. This is free, takes five minutes, and enables oplog-based point-in-time recovery. There is no reason not to do this.

  8. Document your RPO and RTO. Make sure the entire team knows the targets and the backup strategy that supports them. Review these numbers quarterly as your data grows.

Disaster Recovery Planning

A backup strategy is one component of a disaster recovery plan. The full plan should cover:

  • Communication: Who gets notified when a disaster is detected? What is the escalation path?
  • Decision authority: Who authorizes a production restore? This should be decided in advance, not during the incident.
  • Runbook: Step-by-step instructions for every recovery scenario, tested and up to date.
  • Data validation: After restore, how do you verify data integrity? Document the queries and expected results.
  • Post-mortem: After every restore event, conduct a post-mortem. Update the runbook with lessons learned.

The best disaster recovery plan is one that has been practiced. Run a full DR drill at least once a year — simulate a database loss, execute the recovery plan, and measure your actual RTO against the target.
