AWS Step Functions for Workflow Orchestration

Build resilient workflow orchestration with AWS Step Functions using state machines, error handling, and parallel execution patterns

AWS Step Functions lets you coordinate distributed components into resilient, observable workflows using state machines defined in JSON. If you have ever wired together Lambda functions with SQS queues, DynamoDB streams, and a prayer, Step Functions replaces that fragile glue with a first-class orchestration service. It is the right tool whenever your workflow has branching logic, retries, parallel fan-out, human approval steps, or any sequence that spans more than two services.

Prerequisites

  • An AWS account with permissions for Step Functions, Lambda, IAM, and CloudWatch
  • Node.js 18+ installed locally
  • AWS CLI v2 configured with credentials
  • The AWS SDK for JavaScript v3 (@aws-sdk/client-sfn)
  • Basic familiarity with AWS Lambda and IAM roles

State Machine Concepts

A Step Functions workflow is a state machine: a directed graph of states connected by transitions. Every execution starts at the designated start state, follows the transition each state declares, and ends when it reaches a Succeed or Fail state or any state marked "End": true.

States, Transitions, and Data Flow

Each state receives a JSON input, performs work, and produces a JSON output that becomes the input for the next state. This input/output chain is the backbone of data flow through the entire workflow. You control what data enters and leaves each state using filters — more on that later.

A minimal state machine has one state:

{
  "Comment": "Minimal state machine",
  "StartAt": "DoWork",
  "States": {
    "DoWork": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:doWork",
      "End": true
    }
  }
}

The StartAt field points to the first state. Each state declares its Type and either a Next field pointing to the subsequent state or "End": true to terminate the execution.
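Because transitions are plain state names, a typo in StartAt or Next only surfaces when the definition is validated or deployed. The sketch below shows one way to check a definition locally. The helper name findDanglingTransitions is hypothetical (it is not part of any AWS SDK), and it only inspects top-level states, not Parallel branches or Map processors.

```javascript
// Sketch: report transitions that point at a state name that does not exist.
// findDanglingTransitions is a hypothetical helper, not an AWS API.
function findDanglingTransitions(definition) {
  var states = definition.States || {};
  var problems = [];

  function check(from, target) {
    if (target !== undefined && !Object.prototype.hasOwnProperty.call(states, target)) {
      problems.push(from + " -> " + target);
    }
  }

  check("StartAt", definition.StartAt);

  Object.keys(states).forEach(function(name) {
    var state = states[name];
    check(name, state.Next);
    check(name, state.Default);
    (state.Choices || []).forEach(function(rule) { check(name, rule.Next); });
    (state.Catch || []).forEach(function(catcher) { check(name, catcher.Next); });
  });

  return problems;
}

var minimal = {
  StartAt: "DoWork",
  States: {
    DoWork: {
      Type: "Task",
      Resource: "arn:aws:lambda:us-east-1:123456789012:function:doWork",
      End: true
    }
  }
};

console.log(findDanglingTransitions(minimal).length); // 0
```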

State Types

Step Functions defines eight state types. Understanding each one is essential.

Task

The workhorse. A Task state invokes a resource — typically a Lambda function, but also ECS tasks, SQS queues, DynamoDB operations, SNS topics, or any of the 200+ AWS SDK integrations.

{
  "ProcessPayment": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
      "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:processPayment",
      "Payload.$": "$"
    },
    "ResultSelector": {
      "transactionId.$": "$.Payload.transactionId",
      "status.$": "$.Payload.status"
    },
    "Next": "CheckResult"
  }
}

There are three integration patterns for Task states:

  • Request-Response (default): Step Functions calls the service and moves to the next state immediately after getting the API response.
  • Run a Job (.sync): Step Functions waits for the job to complete. Append .sync to the resource ARN.
  • Wait for Callback (.waitForTaskToken): Step Functions pauses until your application calls SendTaskSuccess or SendTaskFailure with a task token.

Choice

Branching logic. A Choice state evaluates conditions against the input and transitions to different states based on the result. There is no Next or End on a Choice state itself — transitions come from the choice rules.

{
  "RouteByStatus": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.paymentStatus",
        "StringEquals": "approved",
        "Next": "FulfillOrder"
      },
      {
        "Variable": "$.paymentStatus",
        "StringEquals": "declined",
        "Next": "NotifyDecline"
      },
      {
        "Variable": "$.amount",
        "NumericGreaterThan": 10000,
        "Next": "ManualReview"
      }
    ],
    "Default": "HandleUnknownStatus"
  }
}

Choice supports string, numeric, boolean, and timestamp comparisons, as well as And, Or, and Not compound conditions.
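Choice rules are tested from top to bottom and the first match wins, so rule order matters. The following sketch imitates that behaviour in plain JavaScript for the RouteByStatus example. It models only the two operators used there and only top-level paths; pickNext is a hypothetical name, not a Step Functions API.

```javascript
// Sketch of Choice semantics: first matching rule wins, otherwise Default.
// pickNext is a hypothetical helper that models StringEquals and NumericGreaterThan only.
function pickNext(choiceState, input) {
  for (var i = 0; i < choiceState.Choices.length; i++) {
    var rule = choiceState.Choices[i];
    // Supports only top-level paths such as "$.paymentStatus".
    var value = input[rule.Variable.replace(/^\$\./, "")];

    if ("StringEquals" in rule && value === rule.StringEquals) {
      return rule.Next;
    }
    if ("NumericGreaterThan" in rule && typeof value === "number" && value > rule.NumericGreaterThan) {
      return rule.Next;
    }
  }
  return choiceState.Default;
}

var routeByStatus = {
  Type: "Choice",
  Choices: [
    { Variable: "$.paymentStatus", StringEquals: "approved", Next: "FulfillOrder" },
    { Variable: "$.paymentStatus", StringEquals: "declined", Next: "NotifyDecline" },
    { Variable: "$.amount", NumericGreaterThan: 10000, Next: "ManualReview" }
  ],
  Default: "HandleUnknownStatus"
};

console.log(pickNext(routeByStatus, { paymentStatus: "approved", amount: 25000 })); // FulfillOrder
console.log(pickNext(routeByStatus, { paymentStatus: "pending", amount: 25000 }));  // ManualReview
console.log(pickNext(routeByStatus, { paymentStatus: "pending", amount: 50 }));     // HandleUnknownStatus
```

Note that an approved 25,000 order never reaches ManualReview with this ordering. If large orders should always be reviewed, the amount rule has to come first.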

Parallel

Executes multiple branches concurrently. Each branch is a complete sub-state-machine. The Parallel state waits for all branches to complete and collects their outputs into an array.

{
  "SendNotifications": {
    "Type": "Parallel",
    "Branches": [
      {
        "StartAt": "SendEmail",
        "States": {
          "SendEmail": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:sendEmail",
            "End": true
          }
        }
      },
      {
        "StartAt": "SendSMS",
        "States": {
          "SendSMS": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:sendSMS",
            "End": true
          }
        }
      }
    ],
    "Next": "Complete"
  }
}

The output of this Parallel state would be an array: [emailResult, smsResult].

Map

Iterates over an array in the input, running a sub-state-machine for each element. Map comes in two modes:

  • Inline mode: For small arrays (up to 40 concurrent iterations).
  • Distributed mode: For large-scale processing (up to 10,000 concurrent iterations) using S3 as the data source.

An inline-mode Map state looks like this:

{
  "ProcessLineItems": {
    "Type": "Map",
    "ItemsPath": "$.orderItems",
    "ItemProcessor": {
      "ProcessorConfig": {
        "Mode": "INLINE"
      },
      "StartAt": "ReserveItem",
      "States": {
        "ReserveItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:reserveItem",
          "End": true
        }
      }
    },
    "MaxConcurrency": 10,
    "Next": "AllItemsReserved"
  }
}

Wait

Pauses execution for a fixed duration or until a specific timestamp.

{
  "WaitForShipping": {
    "Type": "Wait",
    "Seconds": 3600,
    "Next": "CheckShippingStatus"
  }
}

You can also use "Timestamp", "SecondsPath", or "TimestampPath" for dynamic waits.

Pass

A no-op state useful for injecting fixed data or transforming input without invoking an external resource.

{
  "SetDefaults": {
    "Type": "Pass",
    "Result": {
      "retryCount": 0,
      "region": "us-east-1"
    },
    "ResultPath": "$.defaults",
    "Next": "ProcessOrder"
  }
}

Succeed and Fail

Terminal states. Succeed marks the execution as successful. Fail marks it as failed and accepts an error name and cause.

{
  "OrderFailed": {
    "Type": "Fail",
    "Error": "OrderProcessingError",
    "Cause": "Payment was declined and compensation failed"
  }
}

Amazon States Language (ASL)

All state machine definitions are written in Amazon States Language, a JSON-based specification. ASL is declarative — you describe what the workflow does, not how to execute it. The key top-level fields are:

  • Comment — documentation string
  • StartAt — the name of the first state
  • States — an object where each key is a state name and each value is the state definition
  • TimeoutSeconds — maximum execution duration (optional)
  • Version — ASL version, currently "1.0" (optional)

ASL supports intrinsic functions for data manipulation within the definition:

{
  "Parameters": {
    "orderId.$": "States.Format('ORD-{}-{}', $.region, $.sequenceNumber)",
    "timestamp.$": "$$.State.EnteredTime",
    "items.$": "States.ArrayPartition($.allItems, 10)"
  }
}

The $$ prefix accesses the context object, which contains metadata about the execution — execution ARN, state name, entry time, task token, and map iteration index.
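For intuition, the two intrinsic functions used above behave roughly like the plain-JavaScript functions below. This is an approximation for illustration only: statesFormat and statesArrayPartition are hypothetical names, and the real States.Format also handles escaped characters, which is not modeled here.

```javascript
// Approximation of States.Format: each {} is replaced by the next argument.
function statesFormat(template) {
  var args = Array.prototype.slice.call(arguments, 1);
  var index = 0;
  return template.replace(/\{\}/g, function() {
    return String(args[index++]);
  });
}

// Approximation of States.ArrayPartition: split an array into chunks of at most chunkSize.
function statesArrayPartition(items, chunkSize) {
  var chunks = [];
  for (var i = 0; i < items.length; i += chunkSize) {
    chunks.push(items.slice(i, i + chunkSize));
  }
  return chunks;
}

console.log(statesFormat("ORD-{}-{}", "us-east-1", 42));             // ORD-us-east-1-42
console.log(JSON.stringify(statesArrayPartition([1, 2, 3, 4, 5], 2))); // [[1,2],[3,4],[5]]
```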

Lambda Task Integration

Lambda is the most common integration target. Here is a Lambda handler designed for Step Functions:

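// handler.js
// Uses the AWS SDK for JavaScript v3. The Node.js 18+ Lambda runtimes bundle v3,
// not the older v2 "aws-sdk" package.
var DynamoDBClient = require("@aws-sdk/client-dynamodb").DynamoDBClient;
var DynamoDBDocumentClient = require("@aws-sdk/lib-dynamodb").DynamoDBDocumentClient;
var PutCommand = require("@aws-sdk/lib-dynamodb").PutCommand;

var dynamodb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

exports.processPayment = function(event, context, callback) {
  var orderId = event.orderId;
  var amount = event.amount;
  var paymentMethod = event.paymentMethod;

  console.log("Processing payment for order:", orderId, "amount:", amount);

  // Simulate a payment gateway call by recording the payment
  var params = {
    TableName: "Payments",
    Item: {
      orderId: orderId,
      amount: amount,
      paymentMethod: paymentMethod,
      status: "approved",
      transactionId: "txn-" + Date.now(),
      processedAt: new Date().toISOString()
    }
  };

  dynamodb.send(new PutCommand(params)).then(function() {
    callback(null, {
      orderId: orderId,
      transactionId: params.Item.transactionId,
      paymentStatus: "approved",
      amount: amount
    });
  }, function(err) {
    console.error("Payment failed:", err);
    // Retry and Catch match on the error's name, not its message,
    // so set the name explicitly.
    var error = new Error("Failed to record payment for order " + orderId);
    error.name = "PaymentProcessingError";
    callback(error);
  });
};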

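When Step Functions invokes this Lambda directly by its function ARN, the state's input JSON becomes the event parameter, and the return value (or callback result) becomes the state's result. With the arn:aws:states:::lambda:invoke integration shown earlier, the event is whatever you pass as Payload, and the function's return value arrives wrapped under $.Payload. If the function fails, the name property of the error is the error name that Retry and Catch rules match against.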

Error Handling and Retry Policies

Step Functions has built-in error handling that eliminates the need for custom retry logic in your Lambda code.

Retry

The Retry field accepts an array of retry policies matched by error name:

{
  "ProcessPayment": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:processPayment",
    "Retry": [
      {
        "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 6,
        "BackoffRate": 2.0,
        "MaxDelaySeconds": 60
      },
      {
        "ErrorEquals": ["PaymentGatewayTimeout"],
        "IntervalSeconds": 5,
        "MaxAttempts": 3,
        "BackoffRate": 1.5
      },
      {
        "ErrorEquals": ["States.ALL"],
        "IntervalSeconds": 1,
        "MaxAttempts": 2,
        "BackoffRate": 2.0
      }
    ],
    "Catch": [
      {
        "ErrorEquals": ["States.ALL"],
        "Next": "HandlePaymentFailure",
        "ResultPath": "$.error"
      }
    ],
    "Next": "ConfirmPayment"
  }
}

Retry policies are evaluated in order. The first matching policy handles the error. BackoffRate multiplies the interval after each retry. MaxDelaySeconds caps the exponential growth.
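To see what a policy means in wall-clock terms, the wait before each retry can be computed by hand. The sketch below does that for a single policy object. retryDelays is a hypothetical helper, and it ignores the optional JitterStrategy field.

```javascript
// Sketch: list the wait (in seconds) before each retry attempt of one Retry policy.
// retryDelays is a hypothetical helper; jitter is not modeled.
function retryDelays(policy) {
  var delays = [];
  var delay = policy.IntervalSeconds;

  for (var attempt = 0; attempt < policy.MaxAttempts; attempt++) {
    var capped = policy.MaxDelaySeconds !== undefined
      ? Math.min(delay, policy.MaxDelaySeconds)
      : delay;
    delays.push(capped);
    delay = delay * policy.BackoffRate; // grow the interval for the next retry
  }

  return delays;
}

var lambdaPolicy = { IntervalSeconds: 2, MaxAttempts: 6, BackoffRate: 2.0, MaxDelaySeconds: 60 };
var gatewayPolicy = { IntervalSeconds: 5, MaxAttempts: 3, BackoffRate: 1.5 };

console.log(retryDelays(lambdaPolicy).join(", "));  // 2, 4, 8, 16, 32, 60
console.log(retryDelays(gatewayPolicy).join(", ")); // 5, 7.5, 11.25
```

The first policy from the example therefore spends about two minutes in total waiting before Step Functions gives up and hands the error to Catch.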

Catch

If all retries are exhausted (or there is no matching retry policy), Catch kicks in. It routes the execution to a fallback state. The error information is placed at the path specified by ResultPath.

Reserved error names:

  • States.ALL: matches any known error name, except States.Runtime and States.DataLimitExceeded, which cannot be caught
  • States.Timeout: the state exceeded its TimeoutSeconds
  • States.TaskFailed: the task returned a failure
  • States.Permissions: insufficient IAM permissions
  • States.ResultPathMatchFailure: the state's ResultPath could not be applied to the input
  • States.HeartbeatTimeout: the task did not send a heartbeat in time
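The matching rule for both Retry and Catch can be summarised as "first entry whose ErrorEquals contains the error name, with States.ALL as a wildcard". The sketch below models that rule, including the two error names that States.ALL does not match. findHandler is a hypothetical helper, and the broader matching behaviour of States.TaskFailed is not modeled.

```javascript
// Error names that the States.ALL wildcard does not match.
var NOT_MATCHED_BY_ALL = ["States.DataLimitExceeded", "States.Runtime"];

// Sketch: return the first Retry or Catch entry that matches errorName, or null.
// findHandler is a hypothetical helper, not an AWS API.
function findHandler(handlers, errorName) {
  for (var i = 0; i < handlers.length; i++) {
    var names = handlers[i].ErrorEquals;
    var exact = names.indexOf(errorName) !== -1;
    var wildcard = names.indexOf("States.ALL") !== -1 &&
      NOT_MATCHED_BY_ALL.indexOf(errorName) === -1;
    if (exact || wildcard) {
      return handlers[i];
    }
  }
  return null;
}

var catchers = [
  { ErrorEquals: ["PaymentGatewayTimeout"], Next: "RetryLater" },
  { ErrorEquals: ["States.ALL"], Next: "HandlePaymentFailure" }
];

console.log(findHandler(catchers, "PaymentGatewayTimeout").Next); // RetryLater
console.log(findHandler(catchers, "SomeOtherError").Next);        // HandlePaymentFailure
console.log(findHandler(catchers, "States.Runtime"));             // null
```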

Input/Output Processing

This is the most confusing part of Step Functions, and also the most powerful. A Task state can apply up to five filters, in this order:

  1. InputPath — Selects a subset of the raw input to pass to the state's work.
  2. Parameters — Constructs a new JSON object as input, using static values and JSONPath references.
  3. ResultSelector — After the work completes, selects a subset of the result.
  4. ResultPath — Determines where in the original input to place the result.
  5. OutputPath — Selects a subset of the combined output to pass to the next state.

A Task state that uses all five:

{
  "LookupCustomer": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:lookupCustomer",
    "InputPath": "$.customerInfo",
    "Parameters": {
      "customerId.$": "$.id",
      "lookupType": "full"
    },
    "ResultSelector": {
      "name.$": "$.customerName",
      "email.$": "$.contactEmail",
      "tier.$": "$.loyaltyTier"
    },
    "ResultPath": "$.customer",
    "OutputPath": "$",
    "Next": "ProcessOrder"
  }
}

In this example:

  • InputPath narrows the full execution input to just the customerInfo object.
  • Parameters builds a new payload with the customer ID and a static lookup type.
  • ResultSelector cherry-picks three fields from the Lambda response.
  • ResultPath nests the selected result under $.customer in the original input (not the filtered input).
  • OutputPath passes everything to the next state.

If you set ResultPath to null, the task's result is discarded and the original input passes through unchanged. This is useful for side-effect-only tasks like sending notifications.
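The filter pipeline is easier to follow when you can run it. The sketch below re-implements the five steps in plain JavaScript under heavy simplifying assumptions: paths are limited to "$" and dotted keys, ResultPath may only target a top-level key, and templates are not nested. The names getPath, applyTemplate and runState are hypothetical, and the Lambda is replaced by a local function.

```javascript
// Resolve "$" or a dotted path such as "$.customerInfo.id" against data.
function getPath(data, path) {
  if (path === "$") return data;
  return path.replace(/^\$\./, "").split(".").reduce(function(node, key) {
    return node[key];
  }, data);
}

// Build an object from a Parameters/ResultSelector template (flat templates only).
function applyTemplate(template, data) {
  var out = {};
  Object.keys(template).forEach(function(key) {
    if (key.slice(-2) === ".$") {
      out[key.slice(0, -2)] = getPath(data, template[key]);
    } else {
      out[key] = template[key];
    }
  });
  return out;
}

// Apply InputPath, Parameters, the task, ResultSelector, ResultPath, OutputPath in order.
function runState(state, rawInput, task) {
  var filtered = getPath(rawInput, state.InputPath || "$");
  var taskInput = state.Parameters ? applyTemplate(state.Parameters, filtered) : filtered;
  var result = task(taskInput);
  var selected = state.ResultSelector ? applyTemplate(state.ResultSelector, result) : result;

  var combined;
  if (state.ResultPath === null) {
    combined = rawInput;                       // discard the result
  } else if (!state.ResultPath || state.ResultPath === "$") {
    combined = selected;                       // result replaces the input
  } else {
    combined = Object.assign({}, rawInput);    // merge into the ORIGINAL input
    combined[state.ResultPath.replace(/^\$\./, "")] = selected;
  }
  return getPath(combined, state.OutputPath || "$");
}

var lookupCustomer = {
  InputPath: "$.customerInfo",
  Parameters: { "customerId.$": "$.id", lookupType: "full" },
  ResultSelector: { "name.$": "$.customerName", "email.$": "$.contactEmail", "tier.$": "$.loyaltyTier" },
  ResultPath: "$.customer",
  OutputPath: "$"
};

var output = runState(
  lookupCustomer,
  { orderId: "ORD-1", customerInfo: { id: "CUST-500" } },
  function fakeLambda(input) {
    // Stand-in for the lookupCustomer Lambda; the field values are made up.
    return { customerName: "Ada", contactEmail: "ada@example.com", loyaltyTier: "gold", internalNotes: "dropped" };
  }
);

console.log(JSON.stringify(output, null, 2));
```

The output keeps orderId and customerInfo from the original input and adds a customer object with only the three selected fields.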

Express vs Standard Workflows

Step Functions offers two workflow types:

Feature           | Standard             | Express
Max duration      | 1 year               | 5 minutes
Execution model   | Exactly-once         | At-least-once (async) or at-most-once (sync)
Pricing           | Per state transition | Per execution + duration
Execution history | 90 days in console   | CloudWatch Logs only
Max start rate    | 2,000/sec            | 100,000/sec

Use Standard for long-running workflows, human approvals, and processes that require exactly-once semantics. Use Express for high-volume, short-duration event processing — API Gateway backends, IoT data processing, streaming transformations.

Express workflows can be invoked synchronously (you get the result in the HTTP response) or asynchronously (fire-and-forget). Standard workflows are always asynchronous.

SDK Integration

Use the AWS SDK to start executions, poll for results, and manage state machines programmatically.

Starting and Monitoring Executions

// stepfunctions-client.js
var SFNClient = require("@aws-sdk/client-sfn").SFNClient;
var StartExecutionCommand = require("@aws-sdk/client-sfn").StartExecutionCommand;
var DescribeExecutionCommand = require("@aws-sdk/client-sfn").DescribeExecutionCommand;
var GetExecutionHistoryCommand = require("@aws-sdk/client-sfn").GetExecutionHistoryCommand;

var client = new SFNClient({ region: "us-east-1" });

var STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:OrderProcessing";

function startOrderExecution(orderData) {
  var params = {
    stateMachineArn: STATE_MACHINE_ARN,
    name: "order-" + orderData.orderId + "-" + Date.now(),
    input: JSON.stringify(orderData)
  };

  var command = new StartExecutionCommand(params);
  return client.send(command).then(function(result) {
    console.log("Execution started:", result.executionArn);
    return result.executionArn;
  });
}

function waitForCompletion(executionArn, intervalMs) {
  intervalMs = intervalMs || 2000;

  return new Promise(function(resolve, reject) {
    var poll = setInterval(function() {
      var command = new DescribeExecutionCommand({ executionArn: executionArn });

      client.send(command).then(function(result) {
        console.log("Status:", result.status);

        if (result.status === "SUCCEEDED") {
          clearInterval(poll);
          resolve(JSON.parse(result.output));
        } else if (result.status === "FAILED" || result.status === "TIMED_OUT" || result.status === "ABORTED") {
          clearInterval(poll);
          reject(new Error("Execution " + result.status + ": " + (result.cause || "Unknown")));
        }
      }).catch(function(err) {
        clearInterval(poll);
        reject(err);
      });
    }, intervalMs);
  });
}

function getExecutionHistory(executionArn) {
  var command = new GetExecutionHistoryCommand({
    executionArn: executionArn,
    maxResults: 100,
    reverseOrder: true
  });

  return client.send(command).then(function(result) {
    return result.events.map(function(event) {
      return {
        id: event.id,
        type: event.type,
        timestamp: event.timestamp,
        details: event.stateEnteredEventDetails || event.stateExitedEventDetails || null
      };
    });
  });
}

// Usage
var order = {
  orderId: "ORD-2026-001",
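  customerId: "CUST-500",
  customerEmail: "customer@example.com", // placeholder; read by SendNotifications in the workflow below
  customerPhone: "+15555550100",         // placeholder; read by SendNotifications in the workflow below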
  items: [
    { sku: "WIDGET-A", quantity: 2, price: 29.99 },
    { sku: "GADGET-B", quantity: 1, price: 49.99 }
  ],
  paymentMethod: "credit_card",
  shippingAddress: {
    street: "123 Main St",
    city: "Denver",
    state: "CO",
    zip: "80202"
  }
};

startOrderExecution(order)
  .then(function(arn) {
    return waitForCompletion(arn);
  })
  .then(function(result) {
    console.log("Order completed:", result);
  })
  .catch(function(err) {
    console.error("Order failed:", err.message);
  });

Callback Patterns with Task Tokens

For workflows that need to wait for external events — human approval, third-party API callbacks, manual QA — use the .waitForTaskToken integration pattern. Step Functions pauses the execution and generates a unique task token. Your code sends this token to an external system, which eventually calls back to resume the workflow.

{
  "WaitForApproval": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
      "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approval-queue",
      "MessageBody": {
        "taskToken.$": "$$.Task.Token",
        "orderId.$": "$.orderId",
        "amount.$": "$.amount",
        "message": "Order requires manual approval"
      }
    },
    "HeartbeatSeconds": 3600,
    "TimeoutSeconds": 86400,
    "Next": "FulfillOrder"
  }
}

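The approval service processes the SQS message and calls back. Because this task sets HeartbeatSeconds to 3600, the service must also call SendTaskHeartbeat with the same token at least once an hour while the approval is pending; otherwise the task times out long before the 24-hour TimeoutSeconds. Remove HeartbeatSeconds if the approver cannot send heartbeats. The callback itself looks like this: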

// approval-handler.js
var SFNClient = require("@aws-sdk/client-sfn").SFNClient;
var SendTaskSuccessCommand = require("@aws-sdk/client-sfn").SendTaskSuccessCommand;
var SendTaskFailureCommand = require("@aws-sdk/client-sfn").SendTaskFailureCommand;

var client = new SFNClient({ region: "us-east-1" });

function approveOrder(taskToken, approverName) {
  var command = new SendTaskSuccessCommand({
    taskToken: taskToken,
    output: JSON.stringify({
      approved: true,
      approvedBy: approverName,
      approvedAt: new Date().toISOString()
    })
  });

  return client.send(command).then(function() {
    console.log("Task resumed with approval");
  });
}

function rejectOrder(taskToken, reason) {
  var command = new SendTaskFailureCommand({
    taskToken: taskToken,
    error: "OrderRejected",
    cause: reason
  });

  return client.send(command).then(function() {
    console.log("Task resumed with rejection");
  });
}
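Before approveOrder or rejectOrder can be called, the approval service has to recover the task token from the queue message. The sketch below assumes the service is a Lambda function subscribed to the approval queue, which the workflow above does not specify, so treat the event shape and the extractApprovalRequests name as assumptions. Step Functions serialises the MessageBody object to a JSON string, so each record body is parsed first.

```javascript
// Sketch: pull task tokens out of an SQS-triggered Lambda event.
// extractApprovalRequests is a hypothetical helper; the event shape assumes
// a Lambda event source mapping on the approval queue.
function extractApprovalRequests(sqsEvent) {
  return sqsEvent.Records.map(function(record) {
    var body = JSON.parse(record.body);

    if (!body.taskToken) {
      throw new Error("Message " + record.messageId + " has no taskToken");
    }

    return {
      taskToken: body.taskToken,
      orderId: body.orderId,
      amount: body.amount
    };
  });
}

// Example event with a made-up token, mirroring the MessageBody defined in WaitForApproval.
var sampleEvent = {
  Records: [
    {
      messageId: "m-1",
      body: JSON.stringify({
        taskToken: "example-token",
        orderId: "ORD-2026-001",
        amount: 109.97,
        message: "Order requires manual approval"
      })
    }
  ]
};

console.log(extractApprovalRequests(sampleEvent)[0].orderId); // ORD-2026-001
```

Each returned taskToken can then be passed to approveOrder or rejectOrder once a decision has been made.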

Nested Workflows

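Complex workflows benefit from decomposition. You can invoke a child state machine from a parent using the arn:aws:states:::states:startExecution.sync:2 resource. The .sync suffix tells Step Functions to wait for the child execution to finish. The :2 variant returns the child's Output as parsed JSON, whereas plain .sync returns it as an escaped JSON string. In both cases the child's output sits under $.Output in the task result, alongside the execution metadata.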

{
  "RunPaymentSubflow": {
    "Type": "Task",
    "Resource": "arn:aws:states:::states:startExecution.sync:2",
    "Parameters": {
      "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:PaymentProcessing",
      "Input": {
        "orderId.$": "$.orderId",
        "amount.$": "$.totalAmount",
        "paymentMethod.$": "$.paymentMethod",
        "AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id"
      }
    },
    "Next": "HandlePaymentResult"
  }
}

The AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID field is a convention that allows you to trace the lineage of nested executions. Always include it.

Testing with Step Functions Local

AWS provides a Docker image for running Step Functions locally. This is invaluable for testing state machine logic without deploying to AWS.

# Pull and run Step Functions Local
docker pull amazon/aws-stepfunctions-local
docker run -p 8083:8083 amazon/aws-stepfunctions-local

# Create a state machine against the local endpoint
aws stepfunctions create-state-machine \
  --endpoint-url http://localhost:8083 \
  --definition file://state-machine.json \
  --name "OrderProcessingLocal" \
  --role-arn "arn:aws:iam::123456789012:role/dummy"

# Start an execution
aws stepfunctions start-execution \
  --endpoint-url http://localhost:8083 \
  --state-machine-arn "arn:aws:states:us-east-1:123456789012:stateMachine:OrderProcessingLocal" \
  --input '{"orderId":"TEST-001","amount":99.99}'

For testing with mock Lambda responses, create a MockConfigFile.json:

{
  "StateMachines": {
    "OrderProcessingLocal": {
      "TestCases": {
        "HappyPath": {
          "ProcessPayment": "MockPaymentSuccess",
          "ReserveItem": "MockInventorySuccess",
          "SendConfirmationEmail": "MockNotificationSuccess",
          "SendSMSNotification": "MockNotificationSuccess"
        },
        "PaymentDeclined": {
          "ProcessPayment": "MockPaymentDeclined"
        }
      }
    }
  },
  "MockedResponses": {
    "MockPaymentSuccess": {
      "0": {
        "Return": {
          "transactionId": "mock-txn-001",
          "paymentStatus": "approved",
          "amount": 99.99
        }
      }
    },
    "MockPaymentDeclined": {
      "0": {
        "Return": {
          "transactionId": "mock-txn-002",
          "paymentStatus": "declined",
          "reason": "Insufficient funds"
        }
      }
    },
    "MockInventorySuccess": {
      "0": {
        "Return": {
          "reservationId": "mock-res-001",
          "status": "reserved"
        }
      }
    },
    "MockNotificationSuccess": {
      "0": {
        "Return": {
          "emailSent": true,
          "smsSent": true
        }
      }
    }
  }
}

Launch with the mock config:

docker run -p 8083:8083 \
  -v $(pwd)/MockConfigFile.json:/home/StepFunctionsLocal/MockConfigFile.json \
  -e SFN_MOCK_CONFIG="/home/StepFunctionsLocal/MockConfigFile.json" \
  amazon/aws-stepfunctions-local

Then run a test case by appending #<TestCaseName> to the state machine ARN:

aws stepfunctions start-execution \
  --endpoint-url http://localhost:8083 \
  --state-machine-arn "arn:aws:states:us-east-1:123456789012:stateMachine:OrderProcessingLocal#HappyPath" \
  --name "test-happy-path" \
  --input '{}'

Complete Working Example: Order Processing Workflow

This brings everything together. The workflow validates payment, reserves inventory, sends notifications in parallel, and handles failures with compensation logic (saga pattern).

State Machine Definition

{
  "Comment": "Order processing workflow with compensation",
  "StartAt": "ValidateOrder",
  "TimeoutSeconds": 300,
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validateOrder",
      "ResultPath": "$.validation",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["ValidationError"],
          "Next": "OrderValidationFailed",
          "ResultPath": "$.error"
        }
      ],
      "Next": "ProcessPayment"
    },

    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:processPayment",
      "InputPath": "$",
      "Parameters": {
        "orderId.$": "$.orderId",
        "amount.$": "$.validation.totalAmount",
        "paymentMethod.$": "$.paymentMethod",
        "customerId.$": "$.customerId"
      },
      "ResultSelector": {
        "transactionId.$": "$.transactionId",
        "paymentStatus.$": "$.paymentStatus",
        "chargedAmount.$": "$.amount"
      },
      "ResultPath": "$.payment",
      "Retry": [
        {
          "ErrorEquals": ["PaymentGatewayTimeout"],
          "IntervalSeconds": 5,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["PaymentDeclined", "PaymentError"],
          "Next": "OrderPaymentFailed",
          "ResultPath": "$.error"
        }
      ],
      "Next": "CheckPaymentStatus"
    },

    "CheckPaymentStatus": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.payment.paymentStatus",
          "StringEquals": "approved",
          "Next": "ReserveInventory"
        },
        {
          "Variable": "$.payment.paymentStatus",
          "StringEquals": "pending_review",
          "Next": "WaitForPaymentReview"
        }
      ],
      "Default": "OrderPaymentFailed"
    },

    "WaitForPaymentReview": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "ReserveInventory"
    },

    "ReserveInventory": {
      "Type": "Map",
      "ItemsPath": "$.items",
      "ItemProcessor": {
        "ProcessorConfig": {
          "Mode": "INLINE"
        },
        "StartAt": "ReserveItem",
        "States": {
          "ReserveItem": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:reserveInventory",
            "Retry": [
              {
                "ErrorEquals": ["InventoryLockTimeout"],
                "IntervalSeconds": 1,
                "MaxAttempts": 5,
                "BackoffRate": 1.5
              }
            ],
            "End": true
          }
        }
      },
      "MaxConcurrency": 5,
      "ResultPath": "$.reservations",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "CompensatePayment",
          "ResultPath": "$.error"
        }
      ],
      "Next": "SendNotifications"
    },

    "SendNotifications": {
      "Type": "Parallel",
      "ResultPath": "$.notifications",
      "Branches": [
        {
          "StartAt": "SendConfirmationEmail",
          "States": {
            "SendConfirmationEmail": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:sendEmail",
              "Parameters": {
                "to.$": "$.customerEmail",
                "template": "order_confirmation",
                "data": {
                  "orderId.$": "$.orderId",
                  "amount.$": "$.payment.chargedAmount"
                }
              },
              "Retry": [
                {
                  "ErrorEquals": ["States.ALL"],
                  "IntervalSeconds": 2,
                  "MaxAttempts": 3,
                  "BackoffRate": 2.0
                }
              ],
              "End": true
            }
          }
        },
        {
          "StartAt": "SendSMSNotification",
          "States": {
            "SendSMSNotification": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:sendSMS",
              "Parameters": {
                "phone.$": "$.customerPhone",
                "message.$": "States.Format('Order {} confirmed. Total: ${}', $.orderId, $.payment.chargedAmount)"
              },
              "Retry": [
                {
                  "ErrorEquals": ["States.ALL"],
                  "IntervalSeconds": 2,
                  "MaxAttempts": 2,
                  "BackoffRate": 2.0
                }
              ],
              "End": true
            }
          }
        },
        {
          "StartAt": "UpdateOrderDashboard",
          "States": {
            "UpdateOrderDashboard": {
              "Type": "Task",
              "Resource": "arn:aws:states:::dynamodb:putItem",
              "Parameters": {
                "TableName": "OrderDashboard",
                "Item": {
                  "orderId": { "S.$": "$.orderId" },
                  "status": { "S": "confirmed" },
                  "transactionId": { "S.$": "$.payment.transactionId" },
                  "processedAt": { "S.$": "$$.State.EnteredTime" }
                }
              },
              "End": true
            }
          }
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "NotificationWarning",
          "ResultPath": "$.notificationError"
        }
      ],
      "Next": "OrderSucceeded"
    },

    "NotificationWarning": {
      "Type": "Pass",
      "Comment": "Notifications failed but order is still valid",
      "Result": "Notifications failed - manual follow-up required",
      "ResultPath": "$.notificationWarning",
      "Next": "OrderSucceeded"
    },

    "OrderSucceeded": {
      "Type": "Succeed"
    },

    "CompensatePayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:refundPayment",
      "Parameters": {
        "transactionId.$": "$.payment.transactionId",
        "amount.$": "$.payment.chargedAmount",
        "reason": "Inventory reservation failed"
      },
      "ResultPath": "$.refund",
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 5,
          "MaxAttempts": 5,
          "BackoffRate": 2.0
        }
      ],
      "Next": "OrderInventoryFailed"
    },

    "OrderValidationFailed": {
      "Type": "Fail",
      "Error": "OrderValidationError",
      "Cause": "Order failed validation checks"
    },

    "OrderPaymentFailed": {
      "Type": "Fail",
      "Error": "OrderPaymentError",
      "Cause": "Payment processing failed"
    },

    "OrderInventoryFailed": {
      "Type": "Fail",
      "Error": "OrderInventoryError",
      "Cause": "Inventory reservation failed, payment refunded"
    }
  }
}

Lambda Handlers

// lambdas/validateOrder.js
exports.handler = function(event, context, callback) {
  var errors = [];

  if (!event.orderId) {
    errors.push("Missing orderId");
  }
  if (!event.items || event.items.length === 0) {
    errors.push("Order must contain at least one item");
  }
  if (!event.paymentMethod) {
    errors.push("Missing payment method");
  }

  var totalAmount = 0;
  if (event.items) {
    event.items.forEach(function(item) {
      if (!item.sku || !item.quantity || !item.price) {
        errors.push("Invalid item: " + JSON.stringify(item));
      }
      totalAmount += (item.price * item.quantity);
    });
  }

  if (errors.length > 0) {
    var error = new Error("Validation failed: " + errors.join(", "));
    error.name = "ValidationError";
    callback(error);
    return;
  }

  callback(null, {
    valid: true,
    totalAmount: Math.round(totalAmount * 100) / 100,
    itemCount: event.items.length
  });
};

// lambdas/processPayment.js
var crypto = require("crypto");

exports.handler = function(event, context, callback) {
  console.log("Processing payment:", event.orderId, "amount:", event.amount);

  // Simulate payment processing
  var transactionId = "txn-" + crypto.randomBytes(8).toString("hex");

  // In production, call Stripe/PayPal/etc.
  var result = {
    transactionId: transactionId,
    paymentStatus: "approved",
    amount: event.amount,
    processedAt: new Date().toISOString()
  };

  callback(null, result);
};

// lambdas/reserveInventory.js
exports.handler = function(event, context, callback) {
  var sku = event.sku;
  var quantity = event.quantity;

  console.log("Reserving inventory:", sku, "quantity:", quantity);

  // In production, check stock and create a hold in your inventory system
  var reservationId = "res-" + sku + "-" + Date.now();

  callback(null, {
    reservationId: reservationId,
    sku: sku,
    quantity: quantity,
    status: "reserved",
    expiresAt: new Date(Date.now() + 30 * 60 * 1000).toISOString()
  });
};

// lambdas/refundPayment.js
exports.handler = function(event, context, callback) {
  console.log("Refunding payment:", event.transactionId, "amount:", event.amount);

  callback(null, {
    refundId: "ref-" + Date.now(),
    originalTransactionId: event.transactionId,
    amount: event.amount,
    status: "refunded",
    reason: event.reason
  });
};

Deploying with AWS CLI

# Create the IAM role for Step Functions
aws iam create-role \
  --role-name OrderProcessingStepFunctionRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "states.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach policies for Lambda invocation and DynamoDB access
aws iam put-role-policy \
  --role-name OrderProcessingStepFunctionRole \
  --policy-name InvokeLambdas \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": "lambda:InvokeFunction",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:*"
      },
      {
        "Effect": "Allow",
        "Action": "dynamodb:PutItem",
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/OrderDashboard"
      }
    ]
  }'

# Create the state machine
aws stepfunctions create-state-machine \
  --name OrderProcessing \
  --definition file://order-processing.asl.json \
  --role-arn arn:aws:iam::123456789012:role/OrderProcessingStepFunctionRole \
  --type STANDARD

Common Issues and Troubleshooting

1. ResultPath Overwrites Entire Input

Symptom: Downstream states are missing fields that were present earlier in the workflow.

Cause: Omitting ResultPath means the task result replaces the entire state input. If your Lambda returns { "status": "ok" }, the order ID, customer data, and everything else vanishes.
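The difference is easy to reproduce locally. The sketch below is a simplified model of how a task result is combined with the state input; `applyResultPath` is a hypothetical helper, not an AWS API, and it only understands `$` and plain dotted paths such as `$.paymentResult`.

```javascript
// Simplified local model of ResultPath (hypothetical helper, not an AWS API).
// Supports only "$" and dotted paths such as "$.paymentResult".
function applyResultPath(input, result, resultPath) {
  // "$" is also the default when ResultPath is omitted: the result replaces the input.
  if (resultPath === undefined || resultPath === "$") {
    return result;
  }
  // null discards the result and passes the input through unchanged.
  if (resultPath === null) {
    return input;
  }
  var keys = resultPath.replace(/^\$\./, "").split(".");
  var output = JSON.parse(JSON.stringify(input)); // deep copy, input stays untouched
  var node = output;
  for (var i = 0; i < keys.length - 1; i++) {
    if (typeof node[keys[i]] !== "object" || node[keys[i]] === null) {
      node[keys[i]] = {}; // create intermediate objects as needed
    }
    node = node[keys[i]];
  }
  node[keys[keys.length - 1]] = result;
  return output;
}

var input = { orderId: "ORD123", customer: { id: "C1" } };
var result = { status: "ok" };

var replaced = applyResultPath(input, result);                  // ResultPath omitted
var nested = applyResultPath(input, result, "$.paymentResult"); // explicit path

console.log(JSON.stringify(replaced)); // {"status":"ok"}
console.log(JSON.stringify(nested));   // {"orderId":"ORD123","customer":{"id":"C1"},"paymentResult":{"status":"ok"}}
```

With the path omitted, `orderId` and `customer` are gone after a single task. With an explicit path, the result sits next to the original fields.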

Fix: Always set ResultPath explicitly to nest the result:

"ResultPath": "$.paymentResult"

2. States.Runtime Error from an Unresolvable JSONPath

Error message:

An error occurred while executing the state 'ProcessPayment'.
The JSONPath '$.items[0].sku' specified for the field 'sku.$'
could not be found in the input.

Cause: The referenced path does not exist in the state's input. This often happens when a previous state's OutputPath filtered out the field, or when ResultPath moved data to a different location than expected.

Fix: Trace the data flow state by state. In the Step Functions console, open the failed execution and inspect the input and output of each state. If the shape is still unclear, temporarily add a Pass state before the failing state; a Pass state copies its input to its output, so the exact shape shows up in the execution history.
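Once you have copied a state's input out of the execution history, a small script can tell you where the path stops resolving. This is a minimal sketch: `findMissingSegment` is a hypothetical debugging helper that handles only property names and numeric indexes, not filters or wildcards.

```javascript
// Hypothetical debugging helper: returns the first prefix of a simple JSONPath
// (for example "$.items[0].sku") that does not resolve, or null if it all resolves.
function findMissingSegment(input, path) {
  // Split "$.items[0].sku" into ["items", "0", "sku"].
  var segments = path
    .replace(/^\$\.?/, "")
    .split(/\.|\[|\]/)
    .filter(function (s) { return s !== ""; });

  var node = input;
  var walked = "$";
  for (var i = 0; i < segments.length; i++) {
    var key = segments[i];
    var isIndex = /^\d+$/.test(key);
    walked += isIndex ? "[" + key + "]" : "." + key;
    if (node === null || typeof node !== "object" || !(key in node)) {
      return walked; // first prefix that is absent from the input
    }
    node = node[key];
  }
  return null;
}

// The previous state nested the order under "order", so "$.items" no longer exists.
var stateInput = { order: { items: [{ sku: "A1", quantity: 2 }] } };

console.log(findMissingSegment(stateInput, "$.items[0].sku"));       // $.items
console.log(findMissingSegment(stateInput, "$.order.items[0].sku")); // null
```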

3. State Payload Size Limit

Error message:

States.DataLimitExceeded - The state/task returned a result with a size
exceeding the maximum number of bytes service limit.

Cause: The payload passed between states exceeds 256 KB, the limit for both Standard and Express workflows. This commonly happens when Lambda functions return large database query results or file contents.

Fix: Store large payloads in S3 or DynamoDB and pass only the reference (key or ID) through the state machine. For Map states processing large datasets, use Distributed mode with an S3 data source.
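A Lambda handler can apply that rule itself by measuring its result before returning it. The sketch below assumes the 256 KB limit is counted in bytes of UTF-8 encoded JSON. `toStatePayload` and the `store` callback are hypothetical names, and the in-memory store stands in for a real S3 or DynamoDB write.

```javascript
// Assumption: 256 KB is counted as 256 * 1024 bytes of UTF-8 encoded JSON.
var STATE_PAYLOAD_LIMIT_BYTES = 256 * 1024;

// Size of a value once serialized to JSON, in bytes rather than characters.
function payloadSizeBytes(value) {
  return Buffer.byteLength(JSON.stringify(value), "utf8");
}

// Returns the result itself when it fits, or a small reference object when it does not.
// `store` is a hypothetical callback that persists the payload and returns its key.
function toStatePayload(result, store, limitBytes) {
  var limit = limitBytes || STATE_PAYLOAD_LIMIT_BYTES;
  var size = payloadSizeBytes(result);
  if (size <= limit) {
    return result;
  }
  return { payloadRef: store(result), payloadBytes: size };
}

// In-memory stand-in for S3, for illustration only.
var saved = {};
function inMemoryStore(payload) {
  var key = "payloads/" + Object.keys(saved).length + ".json";
  saved[key] = JSON.stringify(payload);
  return key;
}

var small = toStatePayload({ status: "ok" }, inMemoryStore);
var large = toStatePayload({ rows: "x".repeat(300 * 1024) }, inMemoryStore);

console.log(small);            // passed through unchanged
console.log(large.payloadRef); // only the reference travels through the state machine
```

In practice you would pick a limit somewhat below the hard cap, since the state input accumulated under other keys counts toward the same total.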

4. Execution Stuck in Running State

Symptom: The execution shows status RUNNING indefinitely but no state transitions are occurring.

Cause: A task with .waitForTaskToken is waiting for a callback that never arrives. The external system may have lost the task token or crashed before calling SendTaskSuccess.

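Fix: Always set TimeoutSeconds on callback tasks. Add HeartbeatSeconds (shorter than TimeoutSeconds) when the worker can call SendTaskHeartbeat at least that often; otherwise the task times out at the first missed heartbeat: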

{
  "WaitForApproval": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "TimeoutSeconds": 86400,
    "HeartbeatSeconds": 3600,
    "Parameters": { ... }
  }
}

Store task tokens in a durable store (DynamoDB) so they survive process restarts. Implement a dead-letter queue on the SQS queue to catch unprocessable messages.
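On the worker side, every stored token should eventually be answered with either SendTaskSuccess or SendTaskFailure. The sketch below only builds the request parameters so it runs without AWS credentials. The shape of the approval record (`taskToken`, `approved`, `reviewer`, `reason`) and the `ApprovalRejected` error name are assumptions made for illustration.

```javascript
// Builds request parameters for the SendTaskSuccess or SendTaskFailure API call.
// The approval record shape is hypothetical; adapt it to whatever you persist.
function buildCallbackParams(approval) {
  if (!approval.taskToken) {
    // Without the token the waiting task can only end by timing out.
    throw new Error("Missing task token for approval record");
  }
  if (approval.approved) {
    return {
      action: "SendTaskSuccess",
      params: {
        taskToken: approval.taskToken,
        // output must be a JSON string, not an object
        output: JSON.stringify({ approved: true, reviewer: approval.reviewer })
      }
    };
  }
  return {
    action: "SendTaskFailure",
    params: {
      taskToken: approval.taskToken,
      error: "ApprovalRejected",
      cause: approval.reason || "Rejected by " + approval.reviewer
    }
  };
}

var approved = buildCallbackParams({ taskToken: "token-abc", approved: true, reviewer: "alice" });
var rejected = buildCallbackParams({ taskToken: "token-def", approved: false, reviewer: "bob" });

console.log(approved.action, approved.params.output);
console.log(rejected.action, rejected.params.cause);
```

The `params` objects can be passed to `SendTaskSuccessCommand` and `SendTaskFailureCommand` from `@aws-sdk/client-sfn`.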

5. IAM Permission Errors on SDK Integrations

Error message:

States.TaskFailed - User: arn:aws:sts::123456789012:assumed-role/StepFunctionRole/...
is not authorized to perform: dynamodb:PutItem on resource:
arn:aws:dynamodb:us-east-1:123456789012:table/OrderDashboard

Cause: The Step Functions execution role lacks permissions for the AWS service you are calling directly (not via Lambda). When you use arn:aws:states:::dynamodb:putItem, Step Functions itself makes the DynamoDB call and needs the IAM permission.

Fix: Add the specific AWS service permissions to the Step Functions execution role. Lambda invocations only need lambda:InvokeFunction, but direct SDK integrations need the target service's permissions.

Best Practices

  • Always set ResultPath explicitly. Never let a task result silently overwrite the entire state input. Nest results under a descriptive key like $.payment or $.validation. This preserves the accumulated context that downstream states need.

  • Use TimeoutSeconds on every Task state. Lambda has its own timeout, but Step Functions does not know about it. Set a timeout on the state that is slightly longer than the Lambda timeout to ensure the state machine does not hang if Lambda fails to respond cleanly.

  • Prefer direct SDK integrations over Lambda wrappers. If all your Lambda does is call dynamodb.putItem() or sns.publish(), eliminate the Lambda and use Step Functions' built-in service integrations. Fewer moving parts, lower latency, lower cost.

  • Design for idempotency. Retry policies re-invoke tasks, and asynchronous Express workflows run with at-least-once semantics, so a task may execute more than once for the same input. Every task should produce the same result if called twice with the same input. Use idempotency keys in your Lambda handlers and conditional writes in DynamoDB.

  • Implement the Saga pattern for distributed transactions. When a step fails after previous steps have committed changes, use Catch blocks to route to compensation states that undo the prior work. The order processing example above demonstrates this with the CompensatePayment state.

  • Keep state machine definitions in version control. Treat your ASL JSON files like code. Use CloudFormation, CDK, or SAM to deploy them. Never edit state machines in the console for production workloads.

  • Use execution name conventions for traceability. Include the business entity ID and a timestamp in the execution name (e.g., order-ORD123-1706900000). This makes it trivial to find executions in the console and correlate them with business events.

  • Decompose large workflows into nested state machines. If your state machine exceeds 20-25 states, break it into child workflows invoked via .sync:2. Each workflow should have a single responsibility — payment processing, inventory management, notification dispatch.

  • Monitor with CloudWatch Alarms. Set alarms on ExecutionsFailed, ExecutionsTimedOut, and ExecutionThrottled metrics. Step Functions emits these automatically — you just need to set thresholds and notification targets.

  • Test locally before deploying. Use Step Functions Local with mock configurations to validate your state machine logic, branching paths, and error handling without incurring AWS costs or requiring network access.
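The execution-naming convention from the list above could be sketched as follows. `buildExecutionName` is a hypothetical helper. It assumes execution names are capped at 80 characters and are safest when limited to letters, digits, hyphens, and underscores, so it replaces anything else and truncates the entity ID rather than the timestamp.

```javascript
// Hypothetical helper for names of the form <prefix>-<entityId>-<unix seconds>.
// Assumptions: 80-character limit; only letters, digits, "-" and "_" are kept.
function buildExecutionName(prefix, entityId, date) {
  var seconds = Math.floor(date.getTime() / 1000);
  var safeId = String(entityId).replace(/[^A-Za-z0-9_-]/g, "_");
  var suffix = "-" + seconds;
  // Truncate the front part so the timestamp always survives.
  var head = (prefix + "-" + safeId).slice(0, 80 - suffix.length);
  return head + suffix;
}

var name = buildExecutionName("order", "ORD123", new Date(1706900000 * 1000));
console.log(name); // order-ORD123-1706900000
```

Pass the resulting string as the `name` parameter when starting the execution.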
