Bug 1340595 Opened 8 years ago Closed 8 years ago

Create repository for Python ETL jobs

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: harter)

References

Details

User Story

1. set up a repo (or use telemetry-batch-view?);
2. write an example ETL job with tests (a minimal sketch follows this list);
3. ensure ATMO & Airflow can schedule Python jobs defined within that repo.
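
For illustration, a minimal sketch of what step 2 might look like, assuming the job is split into a pure transform plus a pytest-style test (all names here are hypothetical, not code from any existing repo):

    # etl_job.py -- hypothetical minimal job: a pure transform with no I/O
    def transform(ping):
        """Pull the fields we care about out of a raw telemetry ping."""
        system = ping.get("environment", {}).get("system", {})
        return {
            "client_id": ping.get("clientId"),
            "os": system.get("os", {}).get("name"),
        }

    # test_etl_job.py -- runnable with pytest
    from etl_job import transform

    def test_transform_extracts_client_id_and_os():
        ping = {"clientId": "abc",
                "environment": {"system": {"os": {"name": "Linux"}}}}
        assert transform(ping) == {"client_id": "abc", "os": "Linux"}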
No description provided.
See Also: → 1336617
This sounds like a great idea! I did the initial test setup for python_moztelemetry; I could do the same for the new ETL library. I don't see any advantage in using the same repository as telemetry-batch-view, and we probably want to test and release them independently.
Assignee: nobody → rvitillo
Priority: -- → P2
I set up an example notebook [0] to load and run an ETL job from a git repo. I have it working on ATMO, and it shouldn't be difficult to get it running on Airflow as well. That doesn't address testing at all, but I plan on using the betl repo [1] to host useful code snippets for future ETL work. Maybe it makes sense to host useful testing utilities there as well.

[0] https://github.com/harterrt/betl/blob/master/notebooks/load_and_execute.ipynb
[1] https://github.com/harterrt/betl/
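
I haven't spelled out the notebook's exact contents here, but the load-and-execute pattern presumably looks something like this sketch (the module and entry point inside the repo are hypothetical; the repo URL is real):

    # Clone the job repo onto the cluster and run its entry point.
    import subprocess
    import sys

    subprocess.check_call(
        ["git", "clone", "https://github.com/harterrt/betl.git", "/tmp/betl"])
    sys.path.insert(0, "/tmp/betl")

    # Hypothetical module name; the real repo's layout may differ.
    from betl import example_job
    example_job.main()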
Here's an example library for testing a simple ETL job; comments and questions are appreciated: https://github.com/harterrt/cookiecutter-python-etl

I'm going to work on making this even easier by getting it to work with cookiecutter: https://github.com/audreyr/cookiecutter
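
To give a flavor of the approach, here's the kind of test such a library encourages: keep the transformation logic pure so it can be exercised without a cluster. The function below is an illustrative stand-in, not code from the repo:

    def aggregate_counts(records):
        """Count records per channel -- a stand-in for real ETL logic."""
        counts = {}
        for record in records:
            counts[record["channel"]] = counts.get(record["channel"], 0) + 1
        return counts

    def test_aggregate_counts():
        records = [{"channel": "release"},
                   {"channel": "beta"},
                   {"channel": "release"}]
        assert aggregate_counts(records) == {"release": 2, "beta": 1}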
Assignee: rvitillo → rharter
This is coming along nicely. At Roberto's suggestion, I started a new python_etl repository [0]. ETL checked into this repository should come with tests and should be reviewed by a peer. ETL in this repository will be considered to have "graduated" into production and will be scheduled to run on Airflow. However, I expect we'll still want to schedule some short-lived or incubating jobs on ATMO. To make it easier to test and deploy these jobs, I've refactored the example ETL job [1] to use the Python utility `cookiecutter`. You can now start a new ETL repository by calling `cookiecutter gh:harterrt/cookiecutter-python-etl`, which generates all of the boilerplate, including example tests, deploy scripts, licenses, etc.

Taking a step back, here's how I hope these tools are used:
* An analyst begins exploring and prototyping an ETL job in a Jupyter notebook.
* When the analyst has a loose plan of what the data looks like and how it will fit together, they should start writing some example tests to speed up their development. This is a good time to generate a new cookiecutter repo. The analyst has the option of deploying this script as-is to ATMO.
* If the script will be running for a while, needs additional review, or is important enough to share, the analyst should start a pull request against python_etl. This gets the code into a centralized repository and gives the analyst a second set of eyes on their code.

Moving forward, I need to:
* schedule the python_etl library on Airflow (a rough sketch follows this comment)
* add an existing ETL job (probably bug 1346480) to python_etl to knock off any rough edges

[0] https://github.com/mozilla/python_etl
[1] https://github.com/harterrt/cookiecutter-python-etl
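
As mentioned above, a rough sketch of the Airflow side, using a plain BashOperator as a stand-in for however the job actually gets launched (the DAG name, owner, and spark-submit command are all assumptions, not the real deployment):

    # Sketch of scheduling a python_etl job from Airflow. Mozilla's Airflow
    # setup may use custom EMR operators instead of BashOperator.
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "owner": "harter@mozilla.com",  # illustrative
        "retries": 2,
        "retry_delay": timedelta(minutes=30),
    }

    dag = DAG("example_python_etl", default_args=default_args,
              start_date=datetime(2017, 3, 1), schedule_interval="@daily")

    run_job = BashOperator(
        task_id="run_example_job",
        # Hypothetical job path within the python_etl repo.
        bash_command="spark-submit --deploy-mode client python_etl/example_job.py",
        dag=dag,
    )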
I'll file new bugs for next actions. This bug is completed with:

[0] https://github.com/mozilla/python_etl
[1] https://github.com/harterrt/cookiecutter-python-etl
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard