Determine WTMO deployment procedure

RESOLVED FIXED

Status

Data Platform and Tools
Scheduling
P2
normal
RESOLVED FIXED
a year ago
5 months ago

People

(Reporter: whd, Assigned: Harold Woo)

Tracking

(Blocks: 1 bug)

Details

Attachments

(1 attachment)

(Reporter)

Description

a year ago
Now that WTMO has been migrated to Dockerflow, we need to determine its deployment cadence. An email thread surfaced some competing notions on how to proceed, so I'm filing this bug to reach a resolution.

The models being considered are essentially fully automated deployment or two-step production deployment. I started to enumerate some technical details around them, but decided I wanted to spend a tractable amount of time writing this bug so I'm just going to point at https://github.com/mozilla-services/cloudops-deployment/pull/664, https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/data/puppet/yaml/app/data.prod.wtmo.yaml#L10-L12, and https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/data/puppet/yaml/app/data.stage.wtmo.yaml#L10-L12, which contain, in my opinion, the operative parameters for discussion.

My original understanding of the deployment requirements was that DAG modifications (the majority of changes to our Airflow container) should be deployed automatically to both staging and production. Notably, there is little or no testing in this case, but as I understand it most prior Airflow issues stemmed not from DAG changes but from operational problems with the service that should now be resolved. In this model, issues caused by a DAG deploy can be addressed quickly by merging the fix to master, which is then auto-deployed. There's a technical wrinkle around worker/scheduler replacement that could be resolved in at least four ways:

1) Social convention (don't merge to master while jobs are running or about to run).
2) The new EMR operator / sensor mechanisms that are the subject of bug #1325393.
3) Some kind of operational instrumentation (e.g. of the worker queue) that exposes to the deployment pipeline whether and when it is safe to redeploy.
4) Some mechanism, such as mounting an external volume into the Docker container, that facilitates DAG "deploys" without requiring a rebuild of the container.
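
To make the operational-instrumentation idea concrete, here is a minimal sketch of the kind of check a deployment pipeline could consult before replacing workers. Everything in it is an assumption for illustration: the function name, the inputs (counts of running and queued tasks, a list of upcoming run times), and the 30-minute quiet window are hypothetical and not taken from the WTMO setup. How those inputs would be collected from the worker queue is left open.

```python
from datetime import datetime, timedelta
from typing import Iterable


def safe_to_redeploy(
    running_tasks: int,
    queued_tasks: int,
    upcoming_runs: Iterable[datetime],
    now: datetime,
    quiet_window: timedelta = timedelta(minutes=30),  # hypothetical default
) -> bool:
    """Return True only if nothing is in flight and nothing starts soon."""
    # Any running or queued task means a redeploy could kill active work.
    if running_tasks > 0 or queued_tasks > 0:
        return False
    # Ignore runs already in the past; block if a future run starts
    # before the quiet window has elapsed.
    return all(run < now or run - now >= quiet_window for run in upcoming_runs)
```

A pipeline step could poll this and defer the redeploy until it returns True, which would be an automated version of the "don't merge while jobs are running" convention.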

Two-step deployment can be accomplished in myriad ways using the aforementioned parameters, depending on where we want to put the verification procedures. In the current configuration we have a full staging environment at https://data-wtmo.stage.mozaws.net/admin/ that has the same permissions as the production environment, but has all DAGs paused and is configured to dump analysis data to telemetry-test-bucket instead of our production data buckets.

I have no particular preference on how we proceed, as I believe whichever method we decide on can be implemented in such a way that no operator involvement is required.

Updated

11 months ago
Assignee: nobody → whd
Points: --- → 2
Priority: -- → P2
(Reporter)

Comment 1

11 months ago
From email discussion, we're going to move forward with the following:

1) Split Airflow between the web service and the DAGs.
2) Use two-step-deployment for new Airflow versions and operator changes.
3) Continuously deploy individual DAGs and their tasks.

I'll work out the implementation of this next sprint.
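
As a rough illustration of the split, the sketch below classifies a change set by the files it touches: DAG-only changes go down the continuous path, and anything else (Airflow version, operators, container config) goes through two-step deployment. The folder name `dags`, the function name, and the returned labels are assumptions made for the example, not names from the actual pipeline.

```python
from pathlib import PurePosixPath
from typing import Iterable

DAG_DIR = "dags"  # assumed name of the DAG folder in the repository


def deployment_path(changed_files: Iterable[str], dag_dir: str = DAG_DIR) -> str:
    """Pick a deployment path from the repo-relative paths a change touches."""
    files = [PurePosixPath(f) for f in changed_files]
    if not files:
        return "none"
    # Only changes confined to the DAG folder qualify for continuous deploy.
    if all(f.parts and f.parts[0] == dag_dir for f in files):
        return "continuous"
    # Anything touching the service itself goes through stage first.
    return "two-step"
```

For example, `deployment_path(["dags/example.py"])` selects the continuous path, while adding a `Dockerfile` change to the same set selects two-step.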
(Assignee)

Comment 2

5 months ago
For my own reference:

- Cfn/Ansible template for creating EFS in staging/prod
- Modify app.yml (https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/data/ansible/templates/wtmo/app.yml) userdata to mount the EFS volume on the webapp/worker instances
- Mount the EFS folder into the containers
- Modify telemetry-airflow/airflow.cfg to point to the EFS-mounted folder? This may break the current deployment?
- Add a cronjob on the webapp/worker instances to keep EFS and GitHub in sync (git pull), in userdata as well?
- Add CircleCI tests for DAG syntax to the telemetry-airflow repo
- Modify the CircleCI build so that changes to the DAG folder do not build and deploy new containers
- The Airflow scheduler on the worker instance needs log rotation (https://bugzilla.mozilla.org/show_bug.cgi?id=1392310)
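
A DAG syntax test for CI could start as small as the sketch below, which parses every Python file under a DAG folder with the standard library and reports syntax errors. This is an assumption about what such a test might look like, not the test that was actually added; the function names and the default folder name `dags` are hypothetical. It only catches syntax errors, so failed imports or invalid DAG definitions would need a check that loads the DAGs with Airflow itself.

```python
import ast
from pathlib import Path
from typing import List, Tuple


def find_syntax_errors(dag_dir: str) -> List[Tuple[str, str]]:
    """Return (path, message) for every .py file that fails to parse."""
    errors = []
    for path in sorted(Path(dag_dir).rglob("*.py")):
        try:
            # Parse only; the DAG file is never executed or imported.
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors.append((str(path), f"line {exc.lineno}: {exc.msg}"))
    return errors


def main(argv: List[str]) -> int:
    """Print failures and return a CI-friendly exit code (0 = clean)."""
    dag_dir = argv[1] if len(argv) > 1 else "dags"  # assumed default folder
    failures = find_syntax_errors(dag_dir)
    for path, message in failures:
        print(f"{path}: {message}")
    return 1 if failures else 0
```

A CI step could run this via `sys.exit(main(sys.argv))`, so that a broken DAG fails the build before the sync job pulls it onto the shared volume.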
(Assignee)

Updated

5 months ago
Assignee: whd → hwoo
(Assignee)

Updated

5 months ago
Status: NEW → RESOLVED
Last Resolved: 5 months ago
Resolution: --- → FIXED