Compute is the component of Runs in charge of providing out-of-band processing. It can run jobs both on demand and on a schedule.
## Public Oath
Fluid Attacks will constantly look for out-of-band computing solutions that balance:
- Cost
- Security
- Scalability
- Speed
- Traceability
Such solutions must also be:
- Cloud based
- Integrable with the rest of our stack
- The module is managed as code using Terraform.
- Batch jobs use AWS EC2 Spot machines.
- Spot machines have Internet access.
- Spot machines use the `aarch64-linux` architecture.
- Batch jobs can run workloads, but only for as long as an EC2 Spot instance lasts, so design with idempotency and retry mechanisms in mind.
- Jobs can be sent to batch in two ways:
- Using `curl`, `boto3`, or any other tool that can interact with the AWS API.
- Defining a schedule, which periodically submits a job to a queue.
- AWS EventBridge is used to trigger scheduled jobs.
- On failure, an email is sent to development@fluidattacks.com.
- Batch machines come in two sizes:
  - `small` with 1 vCPU and 8 GiB of memory.
  - `large` with 2 vCPUs and 16 GiB of memory.
- All runners have internal solid-state drives for maximum performance.
- A special compute environment called `warp`, meant for cloning repositories via Cloudflare WARP, uses machines with 2 vCPUs and 4 GiB of memory on the `x86_64-linux` architecture.
- Compute environments use subnets in all availability zones within `us-east-1` for maximum Spot availability.
## Contributing
### General
- You can access the Batch console after authenticating to AWS via Okta.
- If a scheduled job takes longer than six hours, it should generally run in Batch; otherwise, you can use the CI.
### Schedules
Schedules are a powerful way to run tasks periodically.
#### Creating a new schedule
We highly advise you to look at the existing schedules to get an idea of what is required.
Some special considerations are:
- The `scheduleExpression` option follows the AWS schedule expression syntax.
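For illustration, AWS EventBridge accepts two expression forms: `cron(...)` with six fields and `rate(...)`. The expressions below are examples of the syntax only, not actual schedules from `data.nix`.

```python
# Examples of the AWS EventBridge schedule expression syntax.
# cron fields: minutes hours day-of-month month day-of-week year.
examples = {
    "cron(0 6 * * ? *)": "every day at 06:00 UTC",
    "rate(12 hours)": "every 12 hours",
}
for expression, meaning in examples.items():
    print(f"{expression}  -> {meaning}")
```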
#### Testing the schedules
Schedules are tested by two Makes jobs:

- `runs-compute-core schedule-test` checks that:
  - all schedules comply with a given schema;
  - all schedules have at least one maintainer with access to the universe repository;
  - every schedule is reviewed by a maintainer on a monthly basis.
- `runs-compute-infra apply` tests the infrastructure that will be deployed when new schedules are created.
#### Deploying schedules to production
Once a schedule reaches production, the infrastructure required to run it is created.
#### Local reproducibility in schedules
Once a new schedule is declared, a Makes job with the format `computeOnAwsBatch/schedule_<name>` is created for local reproducibility.
Generally, to run any schedule, all that is necessary is to export the `UNIVERSE_API_TOKEN` variable. Bear in mind that `data.nix` is the single source of truth regarding schedules: everything is defined there, with a few exceptions.
### Testing compute environments
Testing compute environments is hard for multiple reasons:
- Environments use init data that is critical for properly provisioning machines.
- Environments require AWS AMIs that are specially optimized for ECS.
- When upgrading an AMI, many things within the machines change, including the `cloud-init` version (the software that initializes the machine using the provided init data), the GLIBC version, among many others.
- There is no comfortable way to test this locally or in CI, which forces us to rely on productive test environments.
Below is a step-by-step guide to testing environments.
> **Caution:** We do this to make sure that production environments will work properly with a given change, which means that parity between test and production environments should be as high as possible.
- Change the `test` environment with whatever changes you want to test.
- Use `direnv` to assume the AWS `prod_common` role.
- Export `CACHIX_AUTH_TOKEN` in your environment. You can find this variable in GitLab's CI/CD variables. If you do not have access to it, ask a maintainer.
- Deploy the changes made to the environment you want to test with `runs-compute-infra apply`.
> **Caution:** Keep in mind that other deployments to production can overwrite your local deployment. You can avoid this by re-running the command and never answering `yes` to the prompt.
- Queue compute test jobs with `runs-compute-core schedule-job runs_compute_test_environment_default` or `runs-compute-core schedule-job runs_compute_test_environment_warp`.
- Review that jobs are running properly in the `test` environment.
- Extend your changes to the production environments.