Bug 1340595 Opened 8 years ago Closed 8 years ago

Create repository for Python ETL jobs

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: harter)

References

Details

User Story

1. set up a repo (or use telemetry-batch-view?);
2. write an example ETL job with tests (a minimal sketch follows this list);
3. ensure ATMO & Airflow can schedule Python jobs defined within that repo.
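
For illustration, a minimal sketch of what step 2 might look like, assuming the job is split into a pure transform plus a pytest-style test (all names here are hypothetical, not code from any existing repo):

    # etl_job.py -- hypothetical minimal job: a pure transform with no I/O
    def transform(ping):
        """Pull the fields we care about out of a raw telemetry ping."""
        system = ping.get("environment", {}).get("system", {})
        return {
            "client_id": ping.get("clientId"),
            "os": system.get("os", {}).get("name"),
        }

    # test_etl_job.py -- runnable with pytest
    from etl_job import transform

    def test_transform_extracts_client_id_and_os():
        ping = {"clientId": "abc",
                "environment": {"system": {"os": {"name": "Linux"}}}}
        assert transform(ping) == {"client_id": "abc", "os": "Linux"}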
No description provided.
See Also: → 1336617
This sounds like a great idea! I did the initial test setup for python_moztelemetry; I could do the same for the new ETL library. I don't see any advantage in using the same repository as telemetry-batch-view, and we probably want to test and release them independently.
Assignee: nobody → rvitillo
Priority: -- → P2
I set up an example notebook [0] to load and run an ETL job from a git repo. I have it working on ATMO, and it shouldn't be difficult to get it running on Airflow as well. That doesn't address testing at all, but I plan on using the betl repo [1] to host useful code snippets for future ETL work. Maybe it makes sense to host useful testing utilities there as well.

[0] https://github.com/harterrt/betl/blob/master/notebooks/load_and_execute.ipynb
[1] https://github.com/harterrt/betl/
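
I haven't spelled out the notebook's exact contents here, but the load-and-execute pattern presumably looks something like this sketch (the module and entry point inside the repo are hypothetical; the repo URL is real):

    # Clone the job repo onto the cluster and run its entry point.
    import subprocess
    import sys

    subprocess.check_call(
        ["git", "clone", "https://github.com/harterrt/betl.git", "/tmp/betl"])
    sys.path.insert(0, "/tmp/betl")

    # Hypothetical module name; the real repo's layout may differ.
    from betl import example_job
    example_job.main()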
Here's an example library for testing a simple ETL job; comments and questions are appreciated: https://github.com/harterrt/cookiecutter-python-etl

I'm going to work on making this even easier by getting it to work with cookiecutter: https://github.com/audreyr/cookiecutter
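
To give a flavor of the approach, here's the kind of test such a library encourages: keep the transformation logic pure so it can be exercised without a cluster. The function below is an illustrative stand-in, not code from the repo:

    def aggregate_counts(records):
        """Count records per channel -- a stand-in for real ETL logic."""
        counts = {}
        for record in records:
            counts[record["channel"]] = counts.get(record["channel"], 0) + 1
        return counts

    def test_aggregate_counts():
        records = [{"channel": "release"},
                   {"channel": "beta"},
                   {"channel": "release"}]
        assert aggregate_counts(records) == {"release": 2, "beta": 1}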
Assignee: rvitillo → rharter
This is coming along nicely. At Roberto's suggestion, I started a new python_etl repository [0]. ETL checked into this repository should come with tests and should be reviewed by a peer. ETL in this repository will be considered to have "graduated" into production and will be scheduled to run on Airflow. However, I expect we'll still want to schedule some short-lived or incubating jobs on ATMO. To make it easier to test and deploy these jobs, I've refactored the example ETL job [1] to use the Python utility `cookiecutter`. You can now start a new ETL repository by calling `cookiecutter gh:harterrt/cookiecutter-python-etl`, which generates all of the boilerplate, including example tests, deploy scripts, licenses, etc.

Taking a step back, here's how I hope these tools are used:
* An analyst begins exploring and prototyping an ETL job in a Jupyter notebook.
* When the analyst has a loose plan of what the data looks like and how it will fit together, they should start writing some example tests to speed up their development. This is a good time to generate a new cookiecutter repo. The analyst has the option of deploying this script as-is to ATMO.
* If the script will be running for a while, needs additional review, or is important enough to share, the analyst should start a pull request against python_etl. This gets the code into a centralized repository and gives the analyst a second set of eyes on their code.

Moving forward, I need to:
* schedule the python_etl library on Airflow (a rough sketch follows this comment)
* add an existing ETL job (probably bug 1346480) to python_etl to knock off any rough edges

[0] https://github.com/mozilla/python_etl
[1] https://github.com/harterrt/cookiecutter-python-etl
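
As mentioned above, a rough sketch of the Airflow side, using a plain BashOperator as a stand-in for however the job actually gets launched (the DAG name, owner, and spark-submit command are all assumptions, not the real deployment):

    # Sketch of scheduling a python_etl job from Airflow. Mozilla's Airflow
    # setup may use custom EMR operators instead of BashOperator.
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "owner": "harter@mozilla.com",  # illustrative
        "retries": 2,
        "retry_delay": timedelta(minutes=30),
    }

    dag = DAG("example_python_etl", default_args=default_args,
              start_date=datetime(2017, 3, 1), schedule_interval="@daily")

    run_job = BashOperator(
        task_id="run_example_job",
        # Hypothetical job path within the python_etl repo.
        bash_command="spark-submit --deploy-mode client python_etl/example_job.py",
        dag=dag,
    )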
I'll file new bugs for next actions. This bug is completed with:

[0] https://github.com/mozilla/python_etl
[1] https://github.com/harterrt/cookiecutter-python-etl
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard