
Compute

Compute is the component of Runs in charge of providing out-of-band processing. It can run jobs both on demand and on a schedule.

Public Oath

Fluid Attacks will constantly look for out-of-band computing solutions that balance:
  1. Cost
  2. Security
  3. Scalability
  4. Speed
  5. Traceability
Such solutions must also be:
  1. Cloud based
  2. Integrable with the rest of our stack

Architecture

[Diagram: Compute architecture]

  1. The module is managed as code using Terraform.
  2. Batch jobs use AWS EC2 Spot machines.
  3. Spot machines have Internet access.
  4. Spot machines are of aarch64-linux architecture.
  5. Batch jobs run only for as long as an EC2 Spot instance lasts, so design them with idempotency and retry mechanisms in mind.
  6. Jobs can be sent to batch in two ways:
    1. Using curl, boto3, or any other tool that allows interacting with the AWS API (see the sketch after this list).
    2. Defining a schedule, which periodically submits a job to a queue.
  7. AWS EventBridge is used to trigger scheduled jobs.
  8. On failure, an email is sent to development@fluidattacks.com.
  9. Batch machines come in two sizes:
    1. small, with 1 vCPU and 8 GiB of memory.
    2. large, with 2 vCPUs and 16 GiB of memory.
  10. All runners have internal solid-state drives for maximum performance.
  11. A special compute environment called warp, meant for cloning repositories via Cloudflare WARP, uses machines with 2 vCPUs and 4 GiB of memory on an x86_64-linux architecture.
  12. Compute environments use subnets on all availability zones within us-east-1 for maximum spot availability.
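
For illustration, here is a minimal boto3 sketch of submitting an on-demand job through the AWS API. The queue, job definition, and command below are hypothetical placeholders, not the actual Compute resources:

  # Minimal sketch: submit an on-demand Batch job via boto3.
  # Queue, job definition, and command are hypothetical placeholders.
  import boto3

  batch = boto3.client("batch", region_name="us-east-1")

  response = batch.submit_job(
      jobName="example-on-demand-job",
      jobQueue="example-queue",
      jobDefinition="example-job-definition",
      containerOverrides={
          "command": ["echo", "hello-from-compute"],
          "resourceRequirements": [
              {"type": "VCPU", "value": "1"},
              {"type": "MEMORY", "value": "8192"},  # MiB, matching the small size
          ],
      },
  )
  print(response["jobId"])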

Contributing

Please read the contributing page first.

General

  1. You can access the Batch console after authenticating to AWS via Okta.
  2. If a scheduled job takes longer than six hours, it should generally run in Batch; otherwise, you can use the CI.

Schedules

Schedules are a powerful way to run tasks periodically.

You can find all schedules in the GitLab project.

Creating a new schedule

We highly advise you to take a look at the currently existing schedules to get an idea of what is required.

Some special considerations are:
  1. The scheduleExpression option follows the AWS schedule expression syntax.
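
For reference, AWS schedule expressions take either a cron or a rate form. Below is a minimal boto3 sketch of registering one as an EventBridge rule; the rule name is a hypothetical placeholder and does not correspond to an actual schedule:

  # Minimal sketch: register an EventBridge rule with a schedule expression.
  # The rule name is a hypothetical placeholder.
  import boto3

  events = boto3.client("events", region_name="us-east-1")

  events.put_rule(
      Name="example-schedule",
      ScheduleExpression="cron(0 5 * * ? *)",  # every day at 05:00 UTC
      State="ENABLED",
  )
  # Rate expressions such as "rate(6 hours)" are also valid.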

Testing the schedules

Schedules are tested by two Makes jobs:
  1. runs-compute-core schedule-test, which ensures that:
    1. all schedules comply with a given schema;
    2. all schedules have at least one maintainer with access to the universe repository;
    3. every schedule is reviewed by a maintainer on a monthly basis (a rough sketch of these checks follows this list).
  2. runs-compute-infra apply, which tests the infrastructure that will be deployed when new schedules are created.
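
To illustrate the kind of verification schedule-test performs, here is a rough sketch; the field names (scheduleExpression, maintainers, lastReview) are hypothetical and do not necessarily match the real schema:

  # Rough sketch of the checks described above; field names are hypothetical.
  from datetime import date, timedelta

  def check_schedule(schedule: dict) -> None:
      # The schedule must declare an AWS schedule expression.
      assert schedule["scheduleExpression"], "A schedule expression is required"
      # At least one maintainer must be declared.
      assert schedule["maintainers"], "At least one maintainer is required"
      # The last review must be at most a month old.
      last_review = date.fromisoformat(schedule["lastReview"])
      assert date.today() - last_review <= timedelta(days=31), (
          "Schedules must be reviewed by a maintainer monthly"
      )

  check_schedule(
      {
          "scheduleExpression": "rate(1 day)",
          "maintainers": ["example-maintainer"],
          "lastReview": date.today().isoformat(),  # placeholder value
      }
  )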

Deploying schedules to production

Once a schedule reaches production, the infrastructure required to run it is created.

Technical details can be found in our GitLab project.

Local reproducibility in schedules

Once a new schedule is declared, a Makes job with the format computeOnAwsBatch/schedule_<name> is created for local reproducibility.

Generally, to run any schedule, all you need to do is export the UNIVERSE_API_TOKEN variable. Bear in mind that data.nix is the single source of truth for schedules: everything is defined there, with only a few exceptions.
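
As a rough sketch, running a schedule locally could look like the following; the schedule name example is a hypothetical placeholder, and it assumes the Makes launcher (m) is available on your PATH. In practice, you would normally export the variable in your shell and invoke the Makes job directly:

  # Rough sketch: run a schedule's Makes job locally.
  # The schedule name and the token value are placeholders.
  import os
  import subprocess

  os.environ["UNIVERSE_API_TOKEN"] = "<your-token>"  # never hard-code real tokens
  subprocess.run(
      ["m", ".", "/computeOnAwsBatch/schedule_example"],
      check=True,
  )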

Testing compute environments

Testing compute environments is hard for multiple reasons:
  1. Environments use init data that is critical for properly provisioning machines.
  2. Environments require AWS AMIs that are especially optimized for ECS.
  3. When upgrading an AMI, many things within the machines change, including the cloud-init version (cloud-init is the software that initializes the machine using the provided init data), the GLIBC version, and many others.
  4. There is no convenient way to test this locally or in CI, which forces us to rely on live test environments.

test is the compute environment for testing. It uses the runs/compute/infra/init/test init data.

Below is a step-by-step guide to testing environments.
Caution
We do this to make sure that production environments will work properly with a given change, which means that parity between test and production environments should be as high as possible.
  1. Change the test environment with whatever changes you want to test.
  2. Use direnv to assume the AWS prod_common role.
  3. Export CACHIX_AUTH_TOKEN on your environment. You can find this variable in GitLab’s CI/CD variables. If you do not have access to this, ask a maintainer.
  4. Deploy changes made to the environment you want to test with runs-compute-infra apply.

    Caution
    Keep in mind that other deployments to production can overwrite your local deployment. You can avoid this by re-running the command and never saying yes to the prompt.

  5. Queue compute test jobs with:

    runs-compute-core schedule-job runs_compute_test_environment_default

    or

    runs-compute-core schedule-job runs_compute_test_environment_warp

    Tip
    The tests executed by these compute jobs are located in runs/compute/test/environment; feel free to modify them as you see fit.

  6. Review that jobs are running properly on the test environment (see the sketch after this list).
  7. Extend your changes to the production environments.
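
Besides the Batch console, a quick way to review the test environment is to list its jobs through the AWS API. A minimal boto3 sketch, with a hypothetical queue name:

  # Minimal sketch: list jobs on the test queue to verify they run properly.
  # The queue name is a hypothetical placeholder.
  import boto3

  batch = boto3.client("batch", region_name="us-east-1")

  for status in ("RUNNABLE", "STARTING", "RUNNING", "SUCCEEDED", "FAILED"):
      jobs = batch.list_jobs(jobQueue="example-test-queue", jobStatus=status)
      for job in jobs["jobSummaryList"]:
          print(status, job["jobName"], job["jobId"])
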
Tip
Have an idea to simplify our architecture or noticed docs that could use some love? Don't hesitate to open an issue or submit improvements.