Closed Bug 1356699 Opened 7 years ago Closed 7 years ago

Set up minio for integration tests against python_etl

Categories

(Data Platform and Tools :: General, enhancement, P2)

Points:
3

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: amiyaguchi, Unassigned)

Details

Minio is an open-source, S3-compatible object store that can be run locally. Much of our data pipeline relies on S3 for data storage in production. Minio could isolate and simplify testing of various components in the data pipeline.

python_etl is a good candidate for this kind of infrastructure because S3 is both the input and the output location for many ETL jobs scheduled in Airflow, e.g. churn to churn_to_csv. The repository also has tests that depend on a local installation of Spark, and these already run in continuous integration.

There are a few problems with using Minio. The proprietary Hadoop binaries on EMR have an incompatible notion of the S3 URI prefix, so tests would have to use s3a:// consistently while production switches over to s3://. There is also the problem of requiring a Hadoop binary of 2.8 or above, which is not currently distributed as a prepackaged bundle. [1]
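One way to keep the prefix difference out of the job code is to rewrite the scheme at the boundary. The following is a minimal sketch, not something python_etl provides: the helper name `with_scheme` and the bucket names in the usage note are hypothetical, and it assumes the jobs pass locations around as plain URI strings.

```python
from urllib.parse import urlsplit, urlunsplit

# Schemes treated as interchangeable S3 prefixes (an assumption of this sketch).
S3_SCHEMES = {"s3", "s3a", "s3n"}


def with_scheme(uri, scheme):
    """Return `uri` with its S3 scheme replaced by `scheme`.

    URIs that do not use an S3 scheme (e.g. local paths) are returned unchanged.
    """
    parts = urlsplit(uri)
    if parts.scheme not in S3_SCHEMES:
        return uri
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))
```

A test could then call `with_scheme("s3://example-bucket/churn/v2", "s3a")` before handing the path to Spark, while production code leaves the s3:// form untouched.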

There should be at least one integration test against churn and churn_to_csv that demonstrates how Minio could be used more broadly to validate our infrastructure.
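For such a test, Spark would need to be told where the local object store lives. The sketch below builds the relevant properties as a plain dictionary. The function name `minio_spark_conf` is hypothetical, and the endpoint and credentials in the usage note are placeholders; the `fs.s3a.*` keys are standard Hadoop S3A settings, though path-style access is, as far as I know, only honored by the newer Hadoop versions mentioned above.

```python
def minio_spark_conf(endpoint, access_key, secret_key):
    """Build Spark properties that point the s3a:// filesystem at a local endpoint."""
    hadoop_props = {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Local servers are normally addressed by path, not by bucket subdomain.
        "fs.s3a.path.style.access": "true",
        # Only enable TLS when the endpoint is actually served over https.
        "fs.s3a.connection.ssl.enabled": str(endpoint.startswith("https://")).lower(),
    }
    # Spark copies every "spark.hadoop.*" property into the Hadoop configuration.
    return {"spark.hadoop." + key: value for key, value in hadoop_props.items()}
```

Each pair returned by, say, `minio_spark_conf("http://localhost:9000", "test-key", "test-secret")` can be applied with `SparkSession.builder.config(key, value)` in a test fixture, which keeps the production configuration untouched.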


[1] https://github.com/minio/minio/issues/2965
Points: --- → 3
Priority: -- → P2
Minio might be overkill for this. I would suggest using moto's stand-alone server mode [1], which is what we use to test our telemetry APIs.

[1] https://github.com/spulec/moto#stand-alone-server-mode
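A rough sketch of how a test could talk to such a stand-alone server through boto3 follows. The environment variable `S3_ENDPOINT_URL`, the helper names, and the dummy credentials are assumptions made for illustration; only `boto3.client` and its `endpoint_url` argument come from the library itself.

```python
import os


def s3_client_kwargs(env=os.environ):
    """Keyword arguments for boto3.client("s3"), with an optional endpoint override."""
    kwargs = {"region_name": env.get("AWS_DEFAULT_REGION", "us-east-1")}
    endpoint = env.get("S3_ENDPOINT_URL")  # hypothetical variable name
    if endpoint:
        kwargs["endpoint_url"] = endpoint
        # Dummy credentials so that real ones are never sent to a local server.
        kwargs["aws_access_key_id"] = "testing"
        kwargs["aws_secret_access_key"] = "testing"
    return kwargs


def make_s3_client(env=os.environ):
    """Create an S3 client that targets the local server when one is configured."""
    import boto3  # imported lazily so the helper above has no third-party dependency

    return boto3.client("s3", **s3_client_kwargs(env))
```

With the server started locally and `S3_ENDPOINT_URL` set to its address (for example `http://localhost:5000`), `make_s3_client()` returns a client whose `create_bucket`, `put_object`, and `get_object` calls stay on the local machine. With the variable unset, the same code talks to real S3.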
Component: Metrics: Pipeline → Datasets: General
Product: Cloud Services → Data Platform and Tools
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Component: Datasets: General → General