Fluid Attacks' platform ETL
This ETL process (a.k.a. dynamo_etl) extracts data from Fluid Attacks' platform database (dynamo) and sends it to the warehouse (Snowflake).
Note
Local execution and job submission require you to set up your environment and to have access to the prod_observes role and the CACHIX_AUTH_TOKEN environment variable.
Architecture
The ETL has two core procedures:
- Data-schema determination:
Where the schema of the data is inferred.
- Data refresh (the ETL):
Where all data is updated from DynamoDB to the warehouse.
The ETL has four phases (see the sketch after this list):
- Segment ETL (dynamo -> s3):
Where the ETL is executed over a segment of the dynamo data and the result is saved on s3.
- Preparation:
Where a pristine staging (a.k.a. loading) warehouse-schema is created for temporary storage of the new data.
- Upload ETL (s3 -> warehouse):
Where the encoded s3 data is uploaded to the corresponding tables on the staging warehouse-schema.
- Replacement:
Where the staging schema becomes the new source of truth.
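To make the ordering concrete, here is a minimal sketch of how the four phases chain together. Every function below is an empty placeholder (none of these names exist in the real codebase), and the segment count is an arbitrary assumption.

```python
# Hypothetical orchestration sketch of the four phases; placeholder names only.
TOTAL_SEGMENTS = 16  # assumed number of parallel-scan segments


def segment_etl(segment: int, total_segments: int) -> None:
    """Phase 1: extract one dynamo segment, transform it and save csv files on s3."""


def prepare_staging_schema() -> None:
    """Phase 2: create a pristine staging (loading) warehouse-schema."""


def upload_etl() -> None:
    """Phase 3: load the s3 data into the tables of the staging warehouse-schema."""


def replace_schema() -> None:
    """Phase 4: promote the staging schema to be the new source of truth."""


def run_phases() -> None:
    for segment in range(TOTAL_SEGMENTS):
        segment_etl(segment, TOTAL_SEGMENTS)
    prepare_staging_schema()
    upload_etl()
    replace_schema()
```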
Caution
Do not confuse the two: a warehouse-schema is an entity that groups a collection of tables (like a folder), whereas a data-schema is the metadata of some data (e.g. its column names and types).
ETL phases in detail
- Segment ETL (see the sketch after this list):
  - Segment extraction: The data is extracted using a parallel scan over one specific segment.
  - Data transform: The data is adjusted using the auto-generated data-schemas.
  - S3 upload: The data is transformed into a csv file (one for each data-schema) and uploaded to the observes.etl-data bucket.
  Data is uploaded first to s3 and then to the warehouse for performance reasons: loading from s3 is more efficient than direct upload queries.
- Upload ETL:
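The sketch below illustrates the Segment ETL phase described above. Only the parallel-scan parameters and the observes.etl-data bucket come from this documentation; the table name, the s3 key layout, and the handling of nested attributes (omitted here) are assumptions for illustration.

```python
# Hedged sketch of the Segment ETL phase (dynamo -> s3).
# The table name and key layout are hypothetical; nested-attribute handling is omitted.
import csv
import io

import boto3

BUCKET = "observes.etl-data"  # bucket named in the docs
TABLE = "some_platform_table"  # hypothetical table name


def scan_segment(segment: int, total_segments: int) -> list[dict]:
    """Parallel scan over one specific segment of the dynamo table."""
    table = boto3.resource("dynamodb").Table(TABLE)
    response = table.scan(Segment=segment, TotalSegments=total_segments)
    items = response["Items"]
    while "LastEvaluatedKey" in response:
        response = table.scan(
            Segment=segment,
            TotalSegments=total_segments,
            ExclusiveStartKey=response["LastEvaluatedKey"],
        )
        items.extend(response["Items"])
    return items


def upload_segment(segment: int, total_segments: int, columns: list[str]) -> None:
    """Adjust items to a data-schema (here: a fixed column list) and upload a csv to s3."""
    items = scan_segment(segment, total_segments)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(items)
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=f"segment-{segment}/{TABLE}.csv",  # hypothetical key layout
        Body=buffer.getvalue().encode("utf-8"),
    )
```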
Data-schema determination
This process infers the data-schemas from raw data and stores them in the observes.cache s3 bucket, which serves as a cache.
It is triggered by a schedule that runs once a week.
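As a rough illustration of that inference step, the sketch below derives column names and types from raw items and caches the result in the observes.cache bucket. The type mapping and the cache key layout are assumptions, not the real implementation.

```python
# Hedged sketch of data-schema determination and caching.
# The type mapping and cache key are assumptions for illustration.
import json
from decimal import Decimal

import boto3

CACHE_BUCKET = "observes.cache"  # bucket named in the docs


def infer_data_schema(items: list[dict]) -> dict[str, str]:
    """Infer column names and types from raw items."""
    type_map = {str: "VARCHAR", bool: "BOOLEAN", Decimal: "NUMERIC"}  # assumed mapping
    schema: dict[str, str] = {}
    for item in items:
        for column, value in item.items():
            schema.setdefault(column, type_map.get(type(value), "VARCHAR"))
    return schema


def cache_data_schema(name: str, schema: dict[str, str]) -> None:
    """Store the determined data-schema in the cache bucket."""
    boto3.client("s3").put_object(
        Bucket=CACHE_BUCKET,
        Key=f"data-schemas/{name}.json",  # hypothetical key
        Body=json.dumps(schema).encode("utf-8"),
    )
```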
Remote and local execution
The main interface for execution is through m . /observes/etl/dynamo/bin; from now on we call it dynamo-etl for simplicity. With it you can trigger the execution of a phase locally (on the current machine) or remotely (on AWS Batch).
- Locally with dynamo-etl local <phase-X>.
- Remotely with dynamo-etl remote <phase-X>.
Where <phase-X> is a placeholder. For more information use the --help flag.
The pipeline
The ETL pipeline consists of a chain of aws-batch jobs that are coupled, i.e. when one phase ends the next one starts.
The pipeline is triggered by a schedule.
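One common way to couple AWS Batch jobs in a chain is the dependsOn parameter of submit_job; whether the real pipeline uses this mechanism is not stated here, so treat the sketch below as an assumption. The job queue, job definition, and most phase names are hypothetical placeholders.

```python
# Hedged sketch: chaining AWS Batch jobs so each phase starts after the previous one.
# The queue, job definition, and phase names (other than phase-1) are placeholders.
import boto3

JOB_QUEUE = "observes-queue"  # hypothetical queue name
JOB_DEFINITION = "dynamo-etl"  # hypothetical job definition
PHASES = ["phase-1", "phase-2", "phase-3", "phase-4"]  # placeholder phase names


def submit_chained_jobs() -> list[str]:
    """Submit one AWS Batch job per phase; each job depends on the previous one."""
    batch = boto3.client("batch")
    job_ids: list[str] = []
    for phase in PHASES:
        depends_on = [{"jobId": job_ids[-1]}] if job_ids else []
        response = batch.submit_job(
            jobName=f"dynamo-etl-{phase}",
            jobQueue=JOB_QUEUE,
            jobDefinition=JOB_DEFINITION,
            dependsOn=depends_on,
        )
        job_ids.append(response["jobId"])
    return job_ids
```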
Tip
If you want to start the pipeline from a phase other than phase-1, you have to modify the pipeline definition observesDynamoPipeline at /observes/batch/dynamo.nix.
Other ETLs
- Google Sheets ETL
- Zoho CRM ETL
Tip
Have an idea to simplify our architecture or noticed docs that could use some love? Don't hesitate to
open an issue or submit improvements.