MongoDB

MongoDB Aggregation Pipeline Performance Optimization

A practical guide to optimizing MongoDB aggregation pipelines covering stage ordering, index usage, memory management, and building performant analytics queries with Node.js.

MongoDB Aggregation Pipeline Performance Optimization

The MongoDB aggregation framework is one of the most powerful features in the database, but it is also one of the easiest to misuse. I have seen pipelines that should run in milliseconds take minutes because of poor stage ordering, missing indexes, or bloated intermediate documents. After years of building analytics backends on MongoDB, I have a clear picture of what separates a fast pipeline from a slow one. This guide covers the fundamentals, the optimization strategies, and a complete working example you can drop into a Node.js project today.

Aggregation Pipeline Fundamentals

An aggregation pipeline is a sequence of stages. Documents flow through each stage and are transformed along the way. The key stages you will use constantly are:

  • $match — Filters documents. Think of it as a WHERE clause.
  • $group — Groups documents by a key and applies accumulators like $sum, $avg, $push.
  • $sort — Orders documents.
  • $project / $addFields — Reshapes documents by including, excluding, or computing fields.
  • $lookup — Performs a left outer join to another collection.
  • $unwind — Deconstructs an array field into individual documents.
  • $facet — Runs multiple sub-pipelines in parallel on the same input.
  • $bucket / $bucketAuto — Groups documents into histogram-style ranges.
  • $merge / $out — Writes results to a collection for materialized views.

The order you place these stages in matters enormously. MongoDB's query planner does some automatic optimization, but it cannot fix a fundamentally bad pipeline design.
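
Before worrying about ordering, it helps to see a pipeline end to end. Here is a minimal sketch against the orders collection used throughout this guide — filter, group, then sort the groups:

// Filter to completed orders, produce one document per category,
// and order the categories by revenue.
var basicPipeline = [
  { $match: { status: "completed" } },
  { $group: { _id: "$category", total: { $sum: "$amount" } } },
  { $sort: { total: -1 } }
];

db.orders.aggregate(basicPipeline);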

Pipeline Stage Ordering for Performance

The single most important optimization rule is: filter early, reshape early, join late. Every document that passes through a stage costs CPU and memory. Reducing the document count and document size as early as possible has a compounding effect on every downstream stage.

Here is the ideal ordering pattern:

  1. $match — Filter out as many documents as possible first.
  2. $sort — Sort while the result set is smallest and indexes can be used.
  3. $project / $addFields — Strip fields you do not need before expensive operations.
  4. $group — Aggregate the reduced dataset.
  5. $lookup — Join only after grouping has reduced the document count.
  6. $unwind — Deconstruct arrays only when necessary and as late as possible.

A common mistake is putting $lookup before $match. If you join 10,000 orders with a products collection and then filter down to the 500 orders from last week, you just joined 9,500 documents for nothing.

// BAD: lookup before match
var badPipeline = [
  { $lookup: { from: "products", localField: "productId", foreignField: "_id", as: "product" } },
  { $unwind: "$product" },
  { $match: { orderDate: { $gte: lastWeek } } }
];

// GOOD: match before lookup
var goodPipeline = [
  { $match: { orderDate: { $gte: lastWeek } } },
  { $lookup: { from: "products", localField: "productId", foreignField: "_id", as: "product" } },
  { $unwind: "$product" }
];

Index Usage in Aggregation Pipelines

MongoDB can use indexes in aggregation pipelines, but only for specific stages and only when those stages appear at the beginning of the pipeline.

$match uses indexes when it is the first stage (or follows only other $match stages). Compound indexes work here just like they do in find() queries. If your pipeline starts with { $match: { status: "completed", orderDate: { $gte: lastMonth } } }, you want a compound index on { status: 1, orderDate: -1 }.

$sort uses indexes when it immediately follows a $match that used the same index, or when it is the first stage. A $sort buried after a $group cannot use any index because the documents have been transformed.

$geoNear must always be the first stage and always uses a geospatial index.

The critical insight is that once a stage transforms documents, all subsequent stages lose index access. This is why $match and $sort must come first.

// Create compound index for this pipeline
// db.orders.createIndex({ status: 1, orderDate: -1 })

var pipeline = [
  { $match: { status: "completed", orderDate: { $gte: new Date("2026-01-01") } } },
  { $sort: { orderDate: -1 } },
  { $group: {
    _id: "$category",
    totalRevenue: { $sum: "$amount" },
    orderCount: { $sum: 1 }
  }}
];

Use explain() to verify index usage:

var explanation = db.orders.aggregate(pipeline, { explain: true });

Look for IXSCAN in the first stage. If you see COLLSCAN, your index is not being used.

$lookup Optimization

$lookup is the most expensive common stage. There are two forms: the simple equality form and the expressive pipeline form.

// Simple form — faster, uses index on foreign field
{ $lookup: {
  from: "products",
  localField: "productId",
  foreignField: "_id",
  as: "product"
}}

// Pipeline form — more flexible but slower
{ $lookup: {
  from: "products",
  let: { pid: "$productId" },
  pipeline: [
    { $match: { $expr: { $eq: ["$_id", "$$pid"] } } },
    { $project: { name: 1, price: 1 } }
  ],
  as: "product"
}}

Always prefer the simple form unless you need to filter or reshape the joined documents. The simple form can use an index on the foreign field directly. The pipeline form runs a sub-pipeline for every input document, which is dramatically slower on large datasets.

When you must use the pipeline form, ensure the foreign collection has an index on the fields referenced in your $match. And always include a $project in the sub-pipeline to limit the data pulled in.
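
As a sketch of that advice, suppose the join key were a SKU rather than _id (the sku and productSku fields are hypothetical, used only to show a non-_id join): index the field the sub-pipeline matches on, and keep the $project tight.

// Hypothetical variant joining on a SKU field instead of _id.
// Index the foreign field so each per-document sub-pipeline is an index seek:
// db.products.createIndex({ sku: 1 })
{ $lookup: {
  from: "products",
  let: { sku: "$productSku" },
  pipeline: [
    { $match: { $expr: { $eq: ["$sku", "$$sku"] } } },
    { $project: { name: 1, price: 1 } }
  ],
  as: "product"
}}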

Reduce the number of documents entering a $lookup by placing $match and $group stages before it. If you need product details for a sales report grouped by category, group first and then look up category names — not the other way around.
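
A sketch of that pattern for the sales report (the categories collection and categoryId field are assumptions for illustration — adapt them to your schema): the $lookup runs once per category instead of once per order.

var reportPipeline = [
  { $match: { orderDate: { $gte: lastWeek } } },
  // Group first, so the join below touches a handful of documents.
  { $group: { _id: "$categoryId", revenue: { $sum: "$amount" } } },
  { $lookup: {
    from: "categories",
    localField: "_id",
    foreignField: "_id",
    as: "category"
  }},
  { $unwind: "$category" },
  { $project: { _id: 0, category: "$category.name", revenue: 1 } }
];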

Memory Limits and allowDiskUse

By default, each aggregation pipeline stage has a 100MB memory limit. When a stage exceeds this, the pipeline fails with an error. You have two options:

  1. Optimize the pipeline so stages process less data (preferred).
  2. Use allowDiskUse to let MongoDB spill to disk.

db.orders.aggregate(pipeline, { allowDiskUse: true });

In Node.js with the native driver:

var cursor = collection.aggregate(pipeline, { allowDiskUse: true });

allowDiskUse is a safety net, not a strategy. If you need it regularly, your pipeline design needs work. Disk I/O is orders of magnitude slower than memory. The goal is to reduce intermediate document sizes with early $project stages and early $match filtering so you never hit the limit.

Tip: $sort is the most common stage to exceed the memory limit. Sort on indexed fields at the start of the pipeline, or reduce the dataset with $match before sorting.

$facet for Parallel Pipelines

$facet lets you run multiple aggregation sub-pipelines on the same set of input documents in a single pass. This is ideal for dashboard queries where you need several different aggregations from the same base data.

var dashboardPipeline = [
  { $match: { orderDate: { $gte: startOfMonth } } },
  { $facet: {
    revenueByCategory: [
      { $group: { _id: "$category", total: { $sum: "$amount" } } },
      { $sort: { total: -1 } }
    ],
    dailyTrends: [
      { $group: {
        _id: { $dateToString: { format: "%Y-%m-%d", date: "$orderDate" } },
        dailyTotal: { $sum: "$amount" },
        orderCount: { $sum: 1 }
      }},
      { $sort: { _id: 1 } }
    ],
    topCustomers: [
      { $group: { _id: "$customerId", spent: { $sum: "$amount" } } },
      { $sort: { spent: -1 } },
      { $limit: 10 }
    ]
  }}
];

Be aware that $facet output is a single document. If the combined output of all sub-pipelines exceeds 16MB (the BSON document size limit), the query fails. Keep sub-pipeline results bounded with $limit.

$bucket and $bucketAuto for Histograms

When you need to group numeric data into ranges — price tiers, age brackets, response time buckets — use $bucket or $bucketAuto.

// Fixed boundaries
{ $bucket: {
  groupBy: "$amount",
  boundaries: [0, 50, 100, 250, 500, 1000, Infinity],
  default: "other",
  output: {
    count: { $sum: 1 },
    avgAmount: { $avg: "$amount" }
  }
}}

// Automatic boundaries
{ $bucketAuto: {
  groupBy: "$amount",
  buckets: 10,
  output: {
    count: { $sum: 1 },
    avgAmount: { $avg: "$amount" }
  }
}}

$bucketAuto is useful for exploratory analysis when you do not know the data distribution. $bucket is better for production dashboards where consistent boundaries matter.

$merge and $out for Materialized Views

For expensive pipelines that do not need real-time results, write the output to a collection and query that collection instead. This is a materialized view pattern.

// $out replaces the target collection entirely
var rollupPipeline = [
  { $match: { orderDate: { $gte: startOfMonth } } },
  { $group: {
    _id: { category: "$category", day: { $dateToString: { format: "%Y-%m-%d", date: "$orderDate" } } },
    revenue: { $sum: "$amount" },
    orders: { $sum: 1 }
  }},
  { $out: "monthly_sales_rollup" }
];

// $merge is more flexible — can upsert, merge, or replace
var incrementalRollup = [
  { $match: { orderDate: { $gte: yesterday } } },
  { $group: {
    _id: { category: "$category", day: { $dateToString: { format: "%Y-%m-%d", date: "$orderDate" } } },
    revenue: { $sum: "$amount" },
    orders: { $sum: 1 }
  }},
  { $merge: {
    into: "daily_sales_rollup",
    on: "_id",
    whenMatched: "merge",
    whenNotMatched: "insert"
  }}
];

$merge is the better choice for incremental updates because it does not wipe the target collection. Run these pipelines on a schedule (via cron or a background job) and serve the pre-computed results to your API.
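
Here is a minimal sketch of the scheduling side in Node.js, using a plain setInterval rather than a specific cron library; db is an already-connected Db handle like the one in the complete example further down.

// Refresh the rollup every hour; reads go to daily_sales_rollup, not raw orders.
var ONE_HOUR = 60 * 60 * 1000;

setInterval(function() {
  var yesterday = new Date(Date.now() - 24 * 60 * 60 * 1000);
  var rollup = [
    { $match: { orderDate: { $gte: yesterday } } },
    { $group: {
      _id: { category: "$category", day: { $dateToString: { format: "%Y-%m-%d", date: "$orderDate" } } },
      revenue: { $sum: "$amount" },
      orders: { $sum: 1 }
    }},
    { $merge: { into: "daily_sales_rollup", on: "_id", whenMatched: "merge", whenNotMatched: "insert" } }
  ];

  // Exhausting the cursor is what triggers the $merge write.
  db.collection("orders").aggregate(rollup).toArray(function(err) {
    if (err) console.error("Rollup failed:", err.message);
  });
}, ONE_HOUR);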

Accumulator Operators

Understanding the accumulator operators available in $group is essential for writing efficient pipelines.

  • $sum — Sums numeric values, or counts documents with $sum: 1.
  • $avg — Calculates the mean.
  • $min / $max — Find the extreme values.
  • $push — Collects values into an array.
  • $addToSet — Collects unique values into an array.
  • $first / $last — First or last value in the group (order-dependent).
  • $stdDevPop / $stdDevSamp — Population or sample standard deviation.

Watch out for $push and $addToSet — these build arrays in memory and can cause OOM issues if the group has many members. If you need the top N items per group, use $sort before $group and then $first, or use $topN (MongoDB 5.2+).

// Instead of $push all items then slice
// Use $topN to limit during accumulation
{ $group: {
  _id: "$category",
  topProducts: {
    $topN: {
      n: 5,
      sortBy: { revenue: -1 },
      output: { name: "$productName", revenue: "$amount" }
    }
  }
}}

$expr and $redact

$expr allows you to use aggregation expressions inside $match, enabling comparisons between fields in the same document.

// Find orders where discount exceeds 50% of the amount
{ $match: {
  $expr: { $gt: ["$discount", { $multiply: ["$amount", 0.5] }] }
}}

$redact controls document access at a field level. It walks through the document structure and applies $$DESCEND, $$PRUNE, or $$KEEP at each level. It is useful for row-level security but is niche. In most cases, $match combined with $project achieves the same result more readably.
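
For completeness, a sketch of $redact gating sub-documents by a clearance level — the accessLevel field and userClearance value are hypothetical, not part of the orders schema used elsewhere in this guide.

// Keep levels whose accessLevel is within the caller's clearance,
// prune everything below them.
var userClearance = 3;

{ $redact: {
  $cond: {
    if: { $lte: [{ $ifNull: ["$accessLevel", 0] }, userClearance] },
    then: "$$DESCEND",
    else: "$$PRUNE"
  }
}}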

Optimizing $group Stages

$group is often the bottleneck. Here are specific techniques to make it faster:

  1. Reduce input size first. A $project that strips 20 unused fields before $group means less data shuffled in memory.
  2. Use simple _id expressions. Grouping by { $dateToString: { format: "%Y-%m-%d", date: "$orderDate" } } is slower than grouping by a pre-computed date field (see the sketch after this list).
  3. Avoid grouping by high-cardinality fields unless you need to. Grouping by userId when you have 10 million users creates 10 million groups.
  4. Pre-aggregate where possible. If you are rolling up hourly data into daily data, store hourly rollups and aggregate those instead of raw events.
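
A sketch of the pre-computed date field from point 2 (orderDay is a hypothetical field written at insert time; collection is a connected Collection handle): $group then compares plain strings instead of evaluating $dateToString for every document.

var now = new Date();

// Store the day alongside the raw timestamp at write time.
collection.insertOne({
  customerId: "c-1001",
  amount: 129.99,
  orderDate: now,
  orderDay: now.toISOString().slice(0, 10)  // e.g. "2026-01-15"
}, function(err) {
  if (err) console.error("Insert failed:", err.message);
});

// The daily rollup groups by the stored string.
var dailyPipeline = [
  { $match: { status: "completed" } },
  { $group: { _id: "$orderDay", revenue: { $sum: "$amount" } } }
];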

Pipeline explain() and Profiling

Always verify your pipeline performance with explain().

var MongoClient = require("mongodb").MongoClient;

MongoClient.connect(process.env.DB_MONGO, function(err, client) {
  var db = client.db("analytics");
  var orders = db.collection("orders");

  orders.aggregate(pipeline, { explain: true }).toArray(function(err, result) {
    console.log(JSON.stringify(result, null, 2));
    client.close();
  });
});

In the explain output, look for:

  • queryPlanner.winningPlan.stage: Should show IXSCAN for indexed queries, not COLLSCAN.
  • executionStats.totalDocsExamined: Compare to nReturned. A large ratio means inefficient filtering.
  • executionStats.executionTimeMillis: Your primary performance metric.

Also enable the MongoDB profiler for slow queries:

db.setProfilingLevel(1, { slowms: 100 });

Then inspect db.system.profile for pipelines exceeding your threshold.
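
For example, to pull the ten slowest operations recorded since profiling was enabled (100 here matches the slowms threshold above):

db.system.profile.find({ millis: { $gt: 100 } })
  .sort({ millis: -1 })
  .limit(10);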

Complete Working Example: Sales Analytics API

Here is a full Express.js API that demonstrates optimized aggregation pipelines for a sales analytics dashboard.

var express = require("express");
var MongoClient = require("mongodb").MongoClient;
var app = express();

var dbUrl = process.env.DB_MONGO || "mongodb://localhost:27017";
var dbName = process.env.DATABASE || "salesdb";
var db;

MongoClient.connect(dbUrl, function(err, client) {
  if (err) {
    console.error("Failed to connect to MongoDB:", err.message);
    process.exit(1);
  }
  db = client.db(dbName);
  console.log("Connected to MongoDB");

  // Create indexes for aggregation performance
  db.collection("orders").createIndex({ status: 1, orderDate: -1 }, function(err) {
    if (err) console.error("Index creation failed:", err.message);
  });
  db.collection("orders").createIndex({ category: 1, orderDate: -1 }, function(err) {
    if (err) console.error("Index creation failed:", err.message);
  });
  db.collection("orders").createIndex({ customerId: 1 }, function(err) {
    if (err) console.error("Index creation failed:", err.message);
  });
});

// Revenue by category
app.get("/api/analytics/revenue-by-category", function(req, res) {
  var startDate = req.query.start ? new Date(req.query.start) : new Date("2026-01-01");
  var endDate = req.query.end ? new Date(req.query.end) : new Date();

  var pipeline = [
    { $match: {
      status: "completed",
      orderDate: { $gte: startDate, $lte: endDate }
    }},
    { $project: {
      category: 1,
      amount: 1,
      orderDate: 1
    }},
    { $group: {
      _id: "$category",
      totalRevenue: { $sum: "$amount" },
      averageOrder: { $avg: "$amount" },
      orderCount: { $sum: 1 },
      minOrder: { $min: "$amount" },
      maxOrder: { $max: "$amount" }
    }},
    { $sort: { totalRevenue: -1 } },
    { $project: {
      _id: 0,
      category: "$_id",
      totalRevenue: { $round: ["$totalRevenue", 2] },
      averageOrder: { $round: ["$averageOrder", 2] },
      orderCount: 1,
      minOrder: 1,
      maxOrder: 1
    }}
  ];

  db.collection("orders").aggregate(pipeline).toArray(function(err, results) {
    if (err) return res.status(500).json({ error: err.message });
    res.json({ data: results, period: { start: startDate, end: endDate } });
  });
});

// Daily sales trends
app.get("/api/analytics/daily-trends", function(req, res) {
  var days = parseInt(req.query.days) || 30;
  var startDate = new Date();
  startDate.setDate(startDate.getDate() - days);

  var pipeline = [
    { $match: {
      status: "completed",
      orderDate: { $gte: startDate }
    }},
    { $project: {
      amount: 1,
      orderDate: 1
    }},
    { $group: {
      _id: { $dateToString: { format: "%Y-%m-%d", date: "$orderDate" } },
      revenue: { $sum: "$amount" },
      orders: { $sum: 1 },
      avgOrderValue: { $avg: "$amount" }
    }},
    { $sort: { _id: 1 } },
    { $project: {
      _id: 0,
      date: "$_id",
      revenue: { $round: ["$revenue", 2] },
      orders: 1,
      avgOrderValue: { $round: ["$avgOrderValue", 2] }
    }}
  ];

  db.collection("orders").aggregate(pipeline).toArray(function(err, results) {
    if (err) return res.status(500).json({ error: err.message });
    res.json({ data: results, days: days });
  });
});

// Top customers
app.get("/api/analytics/top-customers", function(req, res) {
  var limit = parseInt(req.query.limit) || 10;
  var startDate = req.query.start ? new Date(req.query.start) : new Date("2026-01-01");

  var pipeline = [
    { $match: {
      status: "completed",
      orderDate: { $gte: startDate }
    }},
    { $project: {
      customerId: 1,
      amount: 1,
      orderDate: 1
    }},
    { $group: {
      _id: "$customerId",
      totalSpent: { $sum: "$amount" },
      orderCount: { $sum: 1 },
      avgOrder: { $avg: "$amount" },
      firstOrder: { $min: "$orderDate" },
      lastOrder: { $max: "$orderDate" }
    }},
    { $sort: { totalSpent: -1 } },
    { $limit: limit },
    { $lookup: {
      from: "customers",
      localField: "_id",
      foreignField: "_id",
      as: "customer"
    }},
    { $unwind: { path: "$customer", preserveNullAndEmptyArrays: true } },
    { $project: {
      _id: 0,
      customerId: "$_id",
      name: { $ifNull: ["$customer.name", "Unknown"] },
      email: { $ifNull: ["$customer.email", ""] },
      totalSpent: { $round: ["$totalSpent", 2] },
      orderCount: 1,
      avgOrder: { $round: ["$avgOrder", 2] },
      firstOrder: 1,
      lastOrder: 1
    }}
  ];

  db.collection("orders").aggregate(pipeline).toArray(function(err, results) {
    if (err) return res.status(500).json({ error: err.message });
    res.json({ data: results });
  });
});

// Product performance with revenue buckets
app.get("/api/analytics/product-performance", function(req, res) {
  var startDate = req.query.start ? new Date(req.query.start) : new Date("2026-01-01");

  var pipeline = [
    { $match: {
      status: "completed",
      orderDate: { $gte: startDate }
    }},
    { $project: {
      productId: 1,
      productName: 1,
      amount: 1,
      quantity: 1
    }},
    { $group: {
      _id: "$productId",
      productName: { $first: "$productName" },
      totalRevenue: { $sum: "$amount" },
      totalUnits: { $sum: "$quantity" },
      orderCount: { $sum: 1 },
      avgPrice: { $avg: "$amount" }
    }},
    { $sort: { totalRevenue: -1 } },
    { $project: {
      _id: 0,
      productId: "$_id",
      productName: 1,
      totalRevenue: { $round: ["$totalRevenue", 2] },
      totalUnits: 1,
      orderCount: 1,
      avgPrice: { $round: ["$avgPrice", 2] },
      revenuePerUnit: {
        $round: [{ $cond: [{ $eq: ["$totalUnits", 0] }, 0, { $divide: ["$totalRevenue", "$totalUnits"] }] }, 2]
      }
    }}
  ];

  db.collection("orders").aggregate(pipeline).toArray(function(err, results) {
    if (err) return res.status(500).json({ error: err.message });
    res.json({ data: results });
  });
});

// Full dashboard using $facet
app.get("/api/analytics/dashboard", function(req, res) {
  var startDate = req.query.start ? new Date(req.query.start) : new Date("2026-01-01");
  var endDate = req.query.end ? new Date(req.query.end) : new Date();

  var pipeline = [
    { $match: {
      status: "completed",
      orderDate: { $gte: startDate, $lte: endDate }
    }},
    { $project: {
      category: 1,
      customerId: 1,
      amount: 1,
      orderDate: 1
    }},
    { $facet: {
      summary: [
        { $group: {
          _id: null,
          totalRevenue: { $sum: "$amount" },
          totalOrders: { $sum: 1 },
          avgOrderValue: { $avg: "$amount" }
        }},
        { $project: { _id: 0 } }
      ],
      byCategory: [
        { $group: {
          _id: "$category",
          revenue: { $sum: "$amount" },
          orders: { $sum: 1 }
        }},
        { $sort: { revenue: -1 } },
        { $limit: 10 }
      ],
      dailyTrend: [
        { $group: {
          _id: { $dateToString: { format: "%Y-%m-%d", date: "$orderDate" } },
          revenue: { $sum: "$amount" },
          orders: { $sum: 1 }
        }},
        { $sort: { _id: 1 } }
      ],
      topCustomers: [
        { $group: {
          _id: "$customerId",
          spent: { $sum: "$amount" }
        }},
        { $sort: { spent: -1 } },
        { $limit: 5 }
      ]
    }}
  ];

  db.collection("orders").aggregate(pipeline, { allowDiskUse: true }).toArray(function(err, results) {
    if (err) return res.status(500).json({ error: err.message });
    var data = results[0];
    data.summary = data.summary[0] || { totalRevenue: 0, totalOrders: 0, avgOrderValue: 0 };
    res.json({ data: data, period: { start: startDate, end: endDate } });
  });
});

// Explain endpoint for debugging pipeline performance
app.get("/api/analytics/explain", function(req, res) {
  var pipeline = [
    { $match: {
      status: "completed",
      orderDate: { $gte: new Date("2026-01-01") }
    }},
    { $sort: { orderDate: -1 } },
    { $group: {
      _id: "$category",
      totalRevenue: { $sum: "$amount" },
      orderCount: { $sum: 1 }
    }}
  ];

  db.collection("orders").aggregate(pipeline, { explain: true }).toArray(function(err, result) {
    if (err) return res.status(500).json({ error: err.message });
    res.json(result);
  });
});

var port = process.env.PORT || 3000;
app.listen(port, function() {
  console.log("Sales analytics API running on port " + port);
});

When you hit the /api/analytics/explain endpoint, you will get output showing the query plan. The key things to verify:

  • The $match stage shows IXSCAN on the { status: 1, orderDate: -1 } index.
  • totalDocsExamined is close to nReturned, indicating efficient filtering.
  • No stages report exceeding the memory limit.

Common Issues and Troubleshooting

1. COLLSCAN on $match Stage

Your $match is not using an index. This happens when the $match is not the first stage, when you are matching on fields not covered by an index, or when you are using $expr in $match (which has limited index support). Fix it by moving $match to the front and creating a covering index.

2. Exceeding 100MB Memory Limit

The $sort or $group stage is processing too much data. Solutions: add $match before the expensive stage to reduce input, add $project to strip unnecessary fields, or use allowDiskUse: true as a last resort. If you are consistently hitting this on $group, consider pre-aggregating data or using $merge to create materialized views.

3. Slow $lookup Operations

The foreign collection lacks an index on the join field. Always ensure the foreignField in a simple $lookup has an index. For pipeline-form $lookup, make sure the first $match inside the sub-pipeline can use an index. Also, reduce the number of documents entering the $lookup by placing $match and $group before it.

4. $facet Exceeding 16MB BSON Limit

The combined output of all facet sub-pipelines exceeds the BSON document size limit. Add $limit to each sub-pipeline. If you genuinely need more data, split the $facet into separate aggregation calls or use $merge to write intermediate results to a collection.

5. $unwind Causing Document Explosion

Unwinding an array with 1,000 elements turns one document into 1,000. If this feeds into a $group or $lookup, performance collapses. Only $unwind when necessary. Consider using array operators like $filter, $map, or $reduce inside $project to work with arrays without unwinding them.
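
As a sketch of that alternative, assume each order embeds an items array (a hypothetical field): the totals are computed inside $project and the document count never multiplies.

// Work with the embedded array directly instead of $unwind-ing it.
{ $project: {
  orderDate: 1,
  expensiveItems: {
    $filter: { input: "$items", as: "item", cond: { $gte: ["$$item.price", 100] } }
  },
  itemTotal: {
    $sum: {
      $map: { input: "$items", as: "item", in: { $multiply: ["$$item.price", "$$item.qty"] } }
    }
  }
}}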

Best Practices

  1. Always start with $match. The earlier you filter, the less work every subsequent stage does. This is the single highest-impact optimization in any pipeline.

  2. Create compound indexes that match your pipeline's $match and $sort. A pipeline that matches on status and sorts by orderDate needs { status: 1, orderDate: -1 }, not separate indexes on each field.

  3. Use $project early to reduce document size. If your documents have 50 fields but you only need 3 for aggregation, strip the other 47 before $group and $lookup. This reduces memory usage and speeds up every stage.

  4. Prefer the simple $lookup form over the pipeline form. The simple equality join is significantly faster because MongoDB can batch the lookups and use indexes directly. Only use the pipeline form when you need to filter or project the foreign documents.

  5. Use $merge for expensive analytics queries. Run the pipeline on a schedule, write results to a rollup collection, and serve the pre-computed data from your API. Your users get sub-millisecond response times and your database is not hammered by repeated heavy aggregations.

  6. Profile your pipelines in production. Use explain() during development and the MongoDB profiler (db.setProfilingLevel(1, { slowms: 100 })) in production. Log slow aggregations and review them regularly. Pipeline performance degrades as data grows — what worked with 100K documents may fail at 10M.

  7. Avoid $group on high-cardinality fields without $limit. Grouping by a field with millions of unique values creates millions of groups in memory. If you need the top N results, sort and limit before or use $topN accumulators.

  8. Use allowDiskUse as a safety net, not a design choice. If your pipeline regularly needs disk, redesign the pipeline or pre-aggregate the data. Disk-spilling stages are 10-100x slower than in-memory stages.
