Set up minio for integration tests against python_etl

RESOLVED WONTFIX

Status

Product: Data Platform and Tools
Component: Datasets: General
Priority: P2
Severity: normal
Status: RESOLVED WONTFIX
Opened: a year ago
Closed: 10 months ago

People

Reporter: amiyaguchi
Assignee: Unassigned

Description

a year ago
Minio is an open-source, S3-compatible object store that can be run locally. Much of our data pipeline relies on S3 for data storage in production. Minio could isolate and simplify testing of various components in the data pipeline.

python_etl is a good candidate for this kind of infrastructure because S3 is both the input and the output location for many ETL jobs that are scheduled in Airflow, e.g. churn to churn_to_csv. The repository also has tests that depend on a local installation of Spark, and that installation is already set up in continuous integration.

There are a few problems with using minio. The proprietary Hadoop binaries on EMR have an incompatible notion of the S3 URI prefix, so tests would need to use s3a:// consistently while production switches over to s3://. There is also the problem of requiring a Hadoop binary of version 2.8 or above, which is not currently distributed as a prepackaged bundle. [1]
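
One way to handle the prefix difference, as a minimal sketch: build every path through a single helper, so that tests and production differ only in a flag. The helper name `storage_uri` and the `use_s3a` flag are hypothetical and not part of python_etl; the bucket and key are placeholders.

```python
def storage_uri(bucket, key, use_s3a=False):
    """Build an object-store URI for a bucket and key.

    Hypothetical helper: use_s3a=True yields the s3a:// prefix for tests
    against a local server, the default yields s3:// as used on EMR.
    """
    scheme = "s3a" if use_s3a else "s3"
    # Drop leading slashes so the key never produces a double slash.
    return "{}://{}/{}".format(scheme, bucket, key.lstrip("/"))


if __name__ == "__main__":
    # Placeholder bucket and key, for illustration only.
    print(storage_uri("example-bucket", "churn/v2", use_s3a=True))
    print(storage_uri("example-bucket", "churn/v2"))
```

Keeping the scheme in one place means a job never hard-codes either prefix.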

There should be at least one integration test against churn and churn_to_csv that can demonstrate a broader use of minio in validating our infrastructure.
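
As a rough sketch of what such a test would need, the settings below point Spark's s3a connector at a locally running minio server. The property names are standard Hadoop s3a options; the function name, endpoint, and credentials are placeholders chosen for illustration, and nothing here has been run against python_etl.

```python
def local_s3a_conf(endpoint, access_key, secret_key):
    """Return Spark settings that direct s3a:// traffic to a local server.

    Hypothetical helper: the caller supplies the endpoint and credentials
    of whatever S3-compatible server the test has started.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Address buckets as http://host/bucket rather than http://bucket.host.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # A local test server is assumed to speak plain HTTP.
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
    }


if __name__ == "__main__":
    # Placeholder endpoint and credentials, not real values.
    conf = local_s3a_conf("http://127.0.0.1:9000", "test-access", "test-secret")
    for name, value in sorted(conf.items()):
        print(name, "=", value)
```

Each pair could be passed to `SparkSession.builder.config(key, value)` before the session is created. As far as I know, the path-style access option is the one that needs Hadoop 2.8 or above.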


[1] https://github.com/minio/minio/issues/2965
(Reporter)

Updated

a year ago
Points: --- → 3
Priority: -- → P2
Minio might be overkill for this. I would suggest using moto's stand-alone server mode [1], which is what we use to test our telemetry APIs.

[1] https://github.com/spulec/moto#stand-alone-server-mode

Updated

a year ago
Component: Metrics: Pipeline → Datasets: General
Product: Cloud Services → Data Platform and Tools
(Reporter)

Updated

10 months ago
Status: NEW → RESOLVED
Last Resolved: 10 months ago
Resolution: --- → WONTFIX