Step Functions

Rationale

We use Step Functions for orchestrating workflow execution in the cloud. The main reasons why we chose it over other alternatives are the following:
  1. It is SaaS (software as a service), so we do not need to manage any infrastructure directly.
  2. It is inexpensive: we only pay for state transitions and for the underlying compute resources (like Lambda functions, EC2 instances, or Batch jobs) that execute workflow steps.
  3. It complies with several certifications from ISO and CSA. Many of these certifications focus on ensuring that the entity follows best practices for secure cloud-based environments and information security.
  4. We can monitor workflow execution and state transitions using CloudWatch.
  5. The workflows are highly resilient, with built-in error handling and retry mechanisms. This feature is very important when workflows orchestrate long-running processes that take several hours or days to finish.
  6. It supports visual workflow design and provides a visual representation of workflow execution, making it easier to understand and debug complex workflows.
  7. All its settings can be written as code using Terraform.
  8. It supports parallel execution of workflow branches, allowing us to run multiple tasks concurrently and improve overall workflow performance.
  9. It supports conditional branching and dynamic routing, enabling complex workflow logic based on data conditions.
  10. It supports automatic retries with configurable retry policies for handling transient failures.
  11. It integrates seamlessly with over 200 AWS services, including Lambda, ECS, Batch, DynamoDB, SNS, SQS, and many others.
  12. It integrates with Identity and Access Management (IAM), allowing us to keep a least privilege approach regarding authentication and authorization.
  13. It maintains execution history for up to 90 days (for Standard workflows), providing visibility into workflow execution and making it easier to troubleshoot issues.
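
As a rough illustration of the retry, branching, and parallel features listed above, the following sketch builds a small Amazon States Language definition as a Python dictionary. The Lambda ARN, the state names, and the `$.status` field are hypothetical and not taken from our actual workflows. The `retry_delays` helper is also an assumption added for illustration: it only shows how `IntervalSeconds` and `BackoffRate` combine, ignoring optional settings such as `MaxDelaySeconds` and jitter.

```python
import json

# Hypothetical Lambda ARN, used only for illustration.
PROCESS_FN = "arn:aws:lambda:us-east-1:123456789012:function:process-item"

definition = {
    "Comment": "Illustrative sketch: retries, branching, and parallelism",
    "StartAt": "Process",
    "States": {
        "Process": {
            "Type": "Task",
            "Resource": PROCESS_FN,
            # Retry transient task failures with exponential backoff.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            # Route any error that survives the retries to a failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Failed"}],
            "Next": "CheckResult",
        },
        "CheckResult": {
            # Conditional branching based on a field of the state input.
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.status",
                    "StringEquals": "changed",
                    "Next": "FanOut",
                }
            ],
            "Default": "Done",
        },
        "FanOut": {
            # Both branches run concurrently.
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "BranchA",
                 "States": {"BranchA": {"Type": "Pass", "End": True}}},
                {"StartAt": "BranchB",
                 "States": {"BranchB": {"Type": "Pass", "End": True}}},
            ],
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
        "Failed": {"Type": "Fail", "Error": "ProcessFailed"},
    },
}


def retry_delays(retrier):
    """Seconds waited before each retry: IntervalSeconds * BackoffRate ** n."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


if __name__ == "__main__":
    print(json.dumps(definition, indent=2))
    print(retry_delays(definition["States"]["Process"]["Retry"][0]))
```

The resulting JSON could be passed as the definition of a state machine, for example through Terraform. With the values above, the three retries would wait roughly 5, 10, and 20 seconds.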

Alternatives

AWS Batch

We use Batch for running batch processing jobs. While Batch is excellent for executing individual jobs, Step Functions provides superior orchestration capabilities when we need to coordinate multiple steps, handle complex branching logic, or integrate with various AWS services in a workflow.

When we use Step Functions over Batch:
  1. Orchestrating multi-step workflows with dependencies
  2. Needing conditional logic and dynamic routing
  3. Integrating multiple AWS services in a single workflow
  4. Requiring visual workflow representation and debugging
When we use Batch over Step Functions:
  1. Executing single, long-running batch processing jobs
  2. Needing compute resource management rather than workflow orchestration
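
The two services can also work together: a Step Functions task state can submit a Batch job and wait for it to finish through the `batch:submitJob.sync` service integration. The sketch below builds such a state in Python. The helper name `batch_task_state` and the job name, queue, and job definition values are hypothetical.

```python
import json


def batch_task_state(job_name, job_queue, job_definition, next_state=None):
    """Build a Task state that submits a Batch job and waits for completion."""
    state = {
        "Type": "Task",
        # The ".sync" suffix makes Step Functions wait until the job ends.
        "Resource": "arn:aws:states:::batch:submitJob.sync",
        "Parameters": {
            "JobName": job_name,
            "JobQueue": job_queue,
            "JobDefinition": job_definition,
        },
    }
    # A state must either point to the next state or end the workflow.
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    state = batch_task_state("example-job", "example-queue", "example-definition")
    print(json.dumps(state, indent=2))
```

In this arrangement Batch keeps handling compute provisioning for the job, while Step Functions decides when the job runs and what happens after it succeeds or fails.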

Apache Airflow

Apache Airflow is a popular open-source workflow orchestration platform.

Pros:
  1. Highly flexible and extensible
  2. Rich ecosystem of operators and integrations
  3. Strong community support
Cons:
  1. Requires infrastructure management and maintenance
  2. More complex setup and configuration
  3. Higher operational overhead
  4. Less native integration with AWS services compared to Step Functions

Temporal

Temporal is a modern workflow orchestration platform.

Pros:
  1. Excellent developer experience with SDKs in multiple languages
  2. Strong support for long-running workflows
  3. Good observability and debugging tools
  4. Can handle complex workflow patterns
Cons:
  1. Requires infrastructure management (unless using Temporal Cloud, which is paid)
  2. Less native integration with AWS services
  3. Steeper learning curve for teams unfamiliar with Temporal concepts

Custom orchestration logic

Before implementing Step Functions, we used a custom orchestration approach built with Python code, DynamoDB, and Batch. This approach involved writing imperative Python code to determine workflow steps, manually managing dependencies, and using DynamoDB as a job queue.

Pros:
  1. Can implement complex business logic directly in code
  2. No additional service costs for orchestration (only DynamoDB and Batch costs)
  3. Works well with existing Python codebase and patterns
  4. Can easily integrate with existing data loaders and business logic
Cons:
  1. Workflow logic is imperative and scattered across code, making it harder to understand end-to-end workflows
  2. No visual representation of workflows, making debugging and onboarding more difficult
  3. Manual dependency management requires careful code maintenance
  4. Harder to reason about complex multi-step workflows
  5. No built-in retry policies or error handling at the orchestration level
  6. Workflow changes require code changes and deployments
  7. Limited observability into workflow execution state and history
  8. Workflow logic is difficult to test in isolation

Usage

We use Step Functions for orchestrating:
  1. Repository synchronization for our clients' repositories, including cloning, post-processing and vulnerability reporting.
  2. Production background schedules for some of our components.
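
A minimal sketch of how a sequential workflow such as the repository synchronization above could be expressed declaratively, assuming three hypothetical state names (`Clone`, `PostProcess`, `ReportVulnerabilities`) and placeholder `Pass` states instead of real tasks. It is not our production definition. The `execution_order` helper is likewise illustrative: it walks the `Next` pointers to show that the end-to-end order can be read directly from the definition.

```python
def linear_workflow(step_names):
    """Build a state machine definition that runs the steps one after another."""
    states = {}
    for index, name in enumerate(step_names):
        # Pass states are placeholders; real steps would be Task states.
        state = {"Type": "Pass"}
        if index + 1 < len(step_names):
            state["Next"] = step_names[index + 1]
        else:
            state["End"] = True
        states[name] = state
    return {"StartAt": step_names[0], "States": states}


def execution_order(definition):
    """Follow the Next pointers from StartAt and return the state names."""
    order = []
    current = definition["StartAt"]
    while current is not None:
        order.append(current)
        current = definition["States"][current].get("Next")
    return order


if __name__ == "__main__":
    sync = linear_workflow(["Clone", "PostProcess", "ReportVulnerabilities"])
    print(execution_order(sync))
```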