CI is in charge of providing the infrastructure for our Continuous Integration and Continuous Delivery (CI/CD) system.
As CI/CD is the backbone of our technological operation, we constantly look for new technologies that improve our development, testing, and release processes.
Contributing
General
- Any changes to the CI pipelines must be done via Merge Requests.
- Any changes to the AWS autoscaler infrastructure must be done via Merge Requests by modifying its Terraform module.
- If a scheduled job takes longer than six hours, it should generally run in Compute; otherwise, it can run in GitLab CI.
Components
We use:
- terraform-aws-gitlab-module for defining our CI as code.
- GitLab Bot for listening to GitLab webhooks and triggering actions like canceling unnecessary pipelines, rebasing MRs, and merging MRs.
- Terraform state locking to avoid race conditions.
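State locking of this kind is typically declared in the Terraform backend configuration. A minimal sketch, assuming an S3 backend with a DynamoDB lock table (the bucket name and state key here are hypothetical; the table name matches the one referenced later in this page):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-states" # hypothetical bucket name
    key            = "ci/terraform.tfstate"     # hypothetical state key
    region         = "us-east-1"
    dynamodb_table = "terraform_state_lock"     # DynamoDB table used for state locking
    encrypt        = true
  }
}
```

With this in place, Terraform acquires a lock item in the table before any plan or apply, so concurrent runs wait instead of corrupting the state.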
Make the lambda review my merge request
The lambda reviews all currently opened merge requests when:
- A new merge request is created.
- An existing merge request is updated, approved (by all required approvers), unapproved, merged, or closed.
- An individual user adds or removes their approval on an existing merge request.
- All threads are resolved on the merge request.
If you want the lambda to rebase or merge your merge request, you can perform one of the previously mentioned actions on any of the currently opened merge requests.
Tuning the CI
One of the most important values is the `idle-count`, as it:
- Specifies how many idle machines should be waiting for new jobs. The more jobs a product pipeline has, the more idle machines it should have. You can take the integrates runner as a reference.
- Dictates the rate at which the CI turns on new machines; that is, if a pipeline with 100 jobs is triggered for a CI with `idle-count = 8`, it will turn on new machines in batches of 8 until it stabilizes.
- More information about how the autoscaling algorithm works can be found here.
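On the runner itself, this value surfaces in the autoscaling section of `/etc/gitlab-runner/config.toml` (mentioned below under bastion debugging). A sketch of the relevant fragment, with illustrative values:

```toml
# Fragment of /etc/gitlab-runner/config.toml on a bastion (values are illustrative).
[[runners]]
  name = "ci-worker-example"
  executor = "docker+machine"
  [runners.machine]
    IdleCount = 8   # idle machines kept waiting for new jobs
    IdleTime  = 600 # seconds a machine may stay idle before being removed
```

Raising `IdleCount` trades a higher baseline EC2 cost for faster pipeline startup, since jobs can be scheduled without waiting for new machines to boot.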
Sometimes, Terraform CI jobs get stuck in a failed state due to a locked state file.
The Terraform state file stores information about our infrastructure configuration, which is used to determine the changes that must be made in the real world (`terraform apply`). This state file is shared among team members to ensure consistency; however, if it is not properly locked, concurrent modifications can lead to data loss, conflicts, and state file corruption.
In case of conflicts with the state file, please follow the steps below:
- Obtain the state lock ID from the failed job.
- Access the `terraform_state_lock` table in DynamoDB by going to AWS - production in Okta (requires prod_integrates role).
- Search for the ID in the Info attribute and delete the `.tfstate` item.
- Attempt to rerun the job that failed.
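If you have the lock ID at hand, the same cleanup can be done from the command line. A sketch, assuming credentials for the production AWS account; the lock ID and state path are placeholders taken from the failed job's output:

```shell
# Option 1: let Terraform remove the lock (run from the affected module's directory).
terraform force-unlock <LOCK_ID>

# Option 2: delete the lock item directly from DynamoDB.
# "LockID" is the key attribute Terraform uses in its lock table; its value is the
# full path of the .tfstate item shown in the Info attribute.
aws dynamodb delete-item \
  --table-name terraform_state_lock \
  --key '{"LockID": {"S": "<bucket>/<path>.tfstate"}}'
```

`terraform force-unlock` is the safer of the two, as it refuses to remove locks it cannot match to the current state.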
Debugging
Review GitLab CI/CD Settings
If you're an admin in GitLab, you can visit the CI/CD Settings to validate that bastions are communicating properly.
Inspect infrastructure
Connect to bastions or workers
Just go to the AWS EC2 console, select the instance you want to connect to, click on Connect, and start a Session Manager session.
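If you prefer the terminal over the console, the same connection can be made with the AWS CLI, assuming the Session Manager plugin is installed and you have credentials for the account (the instance ID below is hypothetical):

```shell
# Open an interactive Session Manager session on a bastion or worker.
aws ssm start-session --target i-0123456789abcdef0
```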
Debugging the bastion
Typical things you want to look at when debugging a bastion are:
- `docker-machine` commands. These allow you to inspect and access workers with commands like `docker-machine ls`, `docker-machine inspect <worker>`, and `docker-machine ssh <worker>`.
- `/var/log/messages` for relevant logs from the gitlab-runner service.
- `/etc/gitlab-runner/config.toml` for bastion configurations.
Debugging a specific CI job
You can tell which machine ran a job by looking at its logs.
Consider an example job whose log displays the line `Running on runner-cabqrx3c-project-20741933-concurrent-0 via runner-cabqrx3c-ci-worker-skims-small-0-1677537770-87c5ed70...`. This tells us that the worker named `runner-cabqrx3c-ci-worker-skims-small-0-1677537770-87c5ed70` was the one that ran it.
From there you can access the bastion and run memory or disk debugging.
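Extracting the worker name from such a log line is mechanical; a minimal sketch using shell parameter expansion, with the log line from the example above:

```shell
# Keep the text after " via " and strip the trailing ellipsis to get the worker name.
line='Running on runner-cabqrx3c-project-20741933-concurrent-0 via runner-cabqrx3c-ci-worker-skims-small-0-1677537770-87c5ed70...'
worker="${line##* via }" # drop everything up to and including " via "
worker="${worker%...}"   # drop the trailing "..."
echo "$worker"           # prints runner-cabqrx3c-ci-worker-skims-small-0-1677537770-87c5ed70
```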
We've developed tooling specifically for monitoring job performance in CI. These tools interact directly with GitLab's GraphQL API, extracting and analyzing data to generate both plots and tabulated data in CSV format on your local machine.
Tip
All tools are accessible via the command line interface (CLI). Use the --help flag for detailed information on available arguments.
General analytics
Retrieve and visualize general job, pipeline, or merge request analytics using the CLI tool with its default workflow.
For fetching, parsing and plotting data, run:
runs-ci-analytics run default --resource=[RESOURCE]
Optional arguments:
options:
-h, --help show this help message and exit
--pages PAGES Number of pages to process (default: 100)
--time_scale {hour,day,week,month}
Time scale for grouping (default: day)
--cache Use cached data for producing figures instead of fetching new data
--resource {jobs,pipelines,pipeline_insights,merge_requests}
Target resource to fetch (default: jobs)
--target_job TARGET_JOB
Target job to fetch
--product PRODUCT Target product for pipeline analysis
--match_type {contains,starts_with,ends_with,exact}
Match type for target resource filtering
--ref REF Commit reference to fetch
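For instance, a pipeline analysis grouped by week over the 50 most recent pages might look like the following (the flags come from the help text above; the product name is illustrative):

```shell
# Fetch, parse, and plot pipeline analytics grouped by week.
runs-ci-analytics run default \
  --resource=pipelines \
  --product=integrates \
  --time_scale=week \
  --pages=50
```

Add `--cache` on subsequent runs to regenerate figures from already-fetched data instead of hitting the GraphQL API again.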