Fluid Attacks' platform ETL
This ETL process (a.k.a. dynamo_etl) extracts data from Fluid Attacks' platform database (dynamo) and sends it to the warehouse (Snowflake).
Note
Local execution and job submission require you to set up your environment and to have access to the prod_observes role and the CACHIX_AUTH_TOKEN environment variable.
Architecture
The ETL has two core procedures:
- Data-schema determination:
Where the schema of the data is inferred.
- Data refresh (the ETL):
Where all data is updated from DynamoDB to the warehouse.
The ETL has four phases (see the sketch after this list):
- Segment ETL (dynamo -> s3):
Where the ETL is executed over a segment of the dynamo data and the result is saved on s3.
- Preparation:
Where a pristine staging (a.k.a. loading) warehouse-schema is created for temporary storage of the new data.
- Upload ETL (s3 -> warehouse):
Where the encoded s3 data is uploaded to the corresponding tables on the staging warehouse-schema.
- Replacement:
Where the staging schema becomes the new source of truth.
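To make the ordering concrete, here is a minimal sketch of how the four phases chain together. Every function below is an empty placeholder (none of these names exist in the real codebase), and the segment count is an arbitrary assumption.

```python
# Hypothetical orchestration sketch of the four phases; placeholder names only.
TOTAL_SEGMENTS = 16  # assumed number of parallel-scan segments


def segment_etl(segment: int, total_segments: int) -> None:
    """Phase 1: extract one dynamo segment, transform it and save csv files on s3."""


def prepare_staging_schema() -> None:
    """Phase 2: create a pristine staging (loading) warehouse-schema."""


def upload_etl() -> None:
    """Phase 3: load the s3 data into the tables of the staging warehouse-schema."""


def replace_schema() -> None:
    """Phase 4: promote the staging schema to be the new source of truth."""


def run_phases() -> None:
    for segment in range(TOTAL_SEGMENTS):
        segment_etl(segment, TOTAL_SEGMENTS)
    prepare_staging_schema()
    upload_etl()
    replace_schema()
```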
Caution
Do not confuse the two: a warehouse-schema is an entity that groups a collection of tables (like a folder), whereas a data-schema is the metadata of some data (e.g. its column names and types).
ETL phases in detail
- Segment ETL (see the sketch after this list):
  - Segment extraction: The data is extracted using a parallel scan over one specific segment.
  - Data transform: The data is adjusted using the auto-generated data-schemas.
  - S3 upload: The data is transformed into a csv file (one for each data-schema) and uploaded to the observes.etl-data bucket.
  Data is uploaded first to s3 and then to the warehouse for performance reasons: loading from s3 is more efficient than direct upload queries.
- Upload ETL:
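The sketch below illustrates the Segment ETL phase described above. Only the parallel-scan parameters and the observes.etl-data bucket come from this documentation; the table name, the s3 key layout, and the handling of nested attributes (omitted here) are assumptions for illustration.

```python
# Hedged sketch of the Segment ETL phase (dynamo -> s3).
# The table name and key layout are hypothetical; nested-attribute handling is omitted.
import csv
import io

import boto3

BUCKET = "observes.etl-data"  # bucket named in the docs
TABLE = "some_platform_table"  # hypothetical table name


def scan_segment(segment: int, total_segments: int) -> list[dict]:
    """Parallel scan over one specific segment of the dynamo table."""
    table = boto3.resource("dynamodb").Table(TABLE)
    response = table.scan(Segment=segment, TotalSegments=total_segments)
    items = response["Items"]
    while "LastEvaluatedKey" in response:
        response = table.scan(
            Segment=segment,
            TotalSegments=total_segments,
            ExclusiveStartKey=response["LastEvaluatedKey"],
        )
        items.extend(response["Items"])
    return items


def upload_segment(segment: int, total_segments: int, columns: list[str]) -> None:
    """Adjust items to a data-schema (here: a fixed column list) and upload a csv to s3."""
    items = scan_segment(segment, total_segments)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(items)
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=f"segment-{segment}/{TABLE}.csv",  # hypothetical key layout
        Body=buffer.getvalue().encode("utf-8"),
    )
```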
Data-schema determination
This process infers the data-schemas from raw data and stores them in the observes.cache s3 bucket, which serves as a cache.
It is triggered by a schedule that runs once a week.
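As a rough illustration of that inference step, the sketch below derives column names and types from raw items and caches the result in the observes.cache bucket. The type mapping and the cache key layout are assumptions, not the real implementation.

```python
# Hedged sketch of data-schema determination and caching.
# The type mapping and cache key are assumptions for illustration.
import json
from decimal import Decimal

import boto3

CACHE_BUCKET = "observes.cache"  # bucket named in the docs


def infer_data_schema(items: list[dict]) -> dict[str, str]:
    """Infer column names and types from raw items."""
    type_map = {str: "VARCHAR", bool: "BOOLEAN", Decimal: "NUMERIC"}  # assumed mapping
    schema: dict[str, str] = {}
    for item in items:
        for column, value in item.items():
            schema.setdefault(column, type_map.get(type(value), "VARCHAR"))
    return schema


def cache_data_schema(name: str, schema: dict[str, str]) -> None:
    """Store the determined data-schema in the cache bucket."""
    boto3.client("s3").put_object(
        Bucket=CACHE_BUCKET,
        Key=f"data-schemas/{name}.json",  # hypothetical key
        Body=json.dumps(schema).encode("utf-8"),
    )
```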
Remote and local execution
The main interface for execution is through m . /observes/etl/dynamo/bin; from now on we call it dynamo-etl for simplicity. With it you can trigger the execution of a phase locally (on the current machine) or remotely (on AWS Batch).
- Locally with dynamo-etl local <phase-X>.
- Remotely with dynamo-etl remote <phase-X>.
Where <phase-X> is a placeholder. For more information use the --help flag.
The pipeline
The ETL pipeline consists of a chain of aws-batch jobs that are coupled, i.e. when one phase ends the next one starts.
The pipeline is triggered by a schedule.
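One common way to couple AWS Batch jobs in a chain is the dependsOn parameter of submit_job; whether the real pipeline uses this mechanism is not stated here, so treat the sketch below as an assumption. The job queue, job definition, and most phase names are hypothetical placeholders.

```python
# Hedged sketch: chaining AWS Batch jobs so each phase starts after the previous one.
# The queue, job definition, and phase names (other than phase-1) are placeholders.
import boto3

JOB_QUEUE = "observes-queue"  # hypothetical queue name
JOB_DEFINITION = "dynamo-etl"  # hypothetical job definition
PHASES = ["phase-1", "phase-2", "phase-3", "phase-4"]  # placeholder phase names


def submit_chained_jobs() -> list[str]:
    """Submit one AWS Batch job per phase; each job depends on the previous one."""
    batch = boto3.client("batch")
    job_ids: list[str] = []
    for phase in PHASES:
        depends_on = [{"jobId": job_ids[-1]}] if job_ids else []
        response = batch.submit_job(
            jobName=f"dynamo-etl-{phase}",
            jobQueue=JOB_QUEUE,
            jobDefinition=JOB_DEFINITION,
            dependsOn=depends_on,
        )
        job_ids.append(response["jobId"])
    return job_ids
```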
Tip
If you want to start the pipeline from a phase other than phase-1, you have to modify the pipeline definition observesDynamoPipeline at /observes/batch/dynamo.nix.
Other ETLs
- Google Sheets ETL
- Zoho CRM ETL
Tip
Have an idea to simplify our architecture or noticed docs that could use some love? Don't hesitate to
open an issue or submit improvements.