Closed
Bug 1340595
Opened 8 years ago
Closed 8 years ago
Create repository for Python ETL jobs
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)
Cloud Services Graveyard
Metrics: Pipeline
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rvitillo, Assigned: harter)
References
Details
User Story
1. set up a repo (or use telemetry-batch-view?); 2. write an example ETL job with tests; 3. ensure ATMO & Airflow can schedule Python jobs defined within that repo.
No description provided.
Comment 2•8 years ago
This sounds like a great idea! I did the initial test setup for python_moztelemetry, and I could do the same for the new ETL library. I don't see any advantage in using the same repository as telemetry-batch-view, and we probably want to test and release the two independently.
Reporter
Updated•8 years ago
Assignee: nobody → rvitillo
Priority: -- → P2
Assignee
Comment 3•8 years ago
I set up an example ipynb [0] to load and run an ETL job from a git repo. I have it working on ATMO, and it shouldn't be difficult to get it running on Airflow as well.
That doesn't address testing at all, but I plan on using the betl repo [1] to host useful code snippets for future ETL work. It may make sense to host shared testing utilities there as well.
[0] https://github.com/harterrt/betl/blob/master/notebooks/load_and_execute.ipynb
[1] https://github.com/harterrt/betl/
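For readers who don't want to open the notebook, here's a minimal sketch of the load-and-execute pattern it demonstrates. The repo URL is real, but the package and entry-point names are assumptions for illustration:

    import subprocess
    import sys

    # Clone the ETL repo and install it into the notebook's environment.
    subprocess.check_call(["git", "clone", "https://github.com/harterrt/betl.git"])
    subprocess.check_call([sys.executable, "-m", "pip", "install", "./betl"])

    # Import and run the job. `betl` is the real repo name, but `main` is
    # an assumed entry point; on ATMO the job would typically be handed the
    # notebook's existing SparkContext.
    import betl
    betl.main()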
Assignee
Comment 4•8 years ago
Here's an example library for testing a simple ETL job. Comments and questions are appreciated.
https://github.com/harterrt/cookiecutter-python-etl
I'm going to work on making this even easier by integrating it with cookiecutter:
https://github.com/audreyr/cookiecutter
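To give a flavor of the approach, here's the kind of self-contained test this setup makes cheap to write. Both transform() and the ping fixture below are hypothetical, not code from the repo:

    # test_transform.py -- a hypothetical example; run with `pytest`

    def transform(ping):
        """Extract the fields we care about from a raw telemetry ping."""
        return {
            "client_id": ping["clientId"],
            "os": ping["environment"]["system"]["os"]["name"],
        }

    def test_transform():
        # A hand-built ping stands in for real telemetry data, so the
        # test runs instantly with no cluster or network access.
        ping = {
            "clientId": "abc-123",
            "environment": {"system": {"os": {"name": "Linux"}}},
        }
        assert transform(ping) == {"client_id": "abc-123", "os": "Linux"}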
Reporter
Updated•8 years ago
Assignee: rvitillo → rharter
Assignee
Comment 5•8 years ago
This is coming along nicely.
At Roberto's suggestion, I started a new python_etl repository [0]. ETL jobs checked into this repository should come with tests and be reviewed by a peer; jobs in this repo are considered to have "graduated" into production and will be scheduled to run on Airflow.
However, I expect we'll still want to schedule some short-lived or incubating jobs on Airflow. To make it easier to test and deploy these jobs, I've refactored the example ETL job [1] to use the Python utility `cookiecutter`. You can now start a new ETL repository by calling `cookiecutter gh:harterrt/cookiecutter-python-etl`, which generates all of the boilerplate including example tests, deploy scripts, licenses, etc.
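For reference, the generated project looks roughly like this (an assumed layout; the exact file names depend on the current template):

    my_etl_job/
        my_etl_job/
            __init__.py
            main.py
        tests/
            test_main.py
        setup.py
        README.md
        LICENSE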
Taking a step back, here's how I hope these tools are used:
* An analyst begins exploring and prototyping an ETL job in a Jupyter notebook.
* When the analyst has a loose plan of what the data looks like and how it will fit together, they should start writing some example tests to speed up their development. This is a good time to generate a new cookiecutter repo. The analyst has the option of deploying this script as-is to ATMO.
* If the script will be running for a while, needs additional review, or is important enough to share, the analyst should start a pull request against python_etl. This will get the code into a centralized repository and give the analyst a second set of eyes on their code.
Moving forward, I need to:
* schedule the python_etl library on Airflow (a rough sketch of what that could look like is included below)
* add an existing ETL job (probably bug 1346480) to python_etl to knock off any rough edges
[0] https://github.com/mozilla/python_etl
[1] https://github.com/harterrt/cookiecutter-python-etl
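For the first item, here's a rough sketch of an Airflow 1.x DAG wrapping a python_etl job. The DAG id, owner, schedule, and job entry point are all assumptions, not the actual scheduling config:

    # A hypothetical DAG for running a python_etl job daily on Airflow.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "owner": "harter@mozilla.com",         # assumed owner
        "start_date": datetime(2017, 3, 1),
        "retries": 2,
        "retry_delay": timedelta(minutes=30),
    }

    dag = DAG("example_python_etl", default_args=default_args,
              schedule_interval="@daily")

    run_etl = BashOperator(
        task_id="run_example_job",
        bash_command=(
            "pip install git+https://github.com/mozilla/python_etl.git && "
            "python -m python_etl.example_job"  # assumed module path
        ),
        dag=dag,
    )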
Assignee
Comment 6•8 years ago
I'll file new bugs for the next actions. This bug is complete with:
[0] https://github.com/mozilla/python_etl
[1] https://github.com/harterrt/cookiecutter-python-etl
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard