ATMO v2: Ensure that a deploy does not impact running clusters or scheduled jobs

RESOLVED FIXED

Status

Product: Cloud Services
Component: Metrics: Pipeline
Priority: P2
Severity: normal
Status: RESOLVED FIXED
Reported: a year ago
Modified: a year ago

People

(Reporter: mreid, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

a year ago
A problem with ATMO v1 was that a deploy of the service during a job run could cause the job to be interrupted before it completed.

Since users can specify that jobs run at any time, we should ensure that deploying new code does not impact running clusters or jobs.
Blocks: 1248688
Mark, can you elaborate on how the jobs were interrupted when ATMO v1 was deployed? Did the deploy somehow reset the jobs?
Flags: needinfo?(mreid)
(Reporter)

Comment 2

a year ago
Since jobs were actually launched from the webserver node (via cron), shutting that node down would stop the monitoring of any running jobs, so detection of job success or failure wouldn't work. I believe it would also force-stop any old-style non-Spark jobs, but that shouldn't be a concern anymore.

Also, it was possible for the scheduler to "miss" jobs if their execution time happened after the previous instance was torn down, but before the new instance was fully spun up. That meant whoever was doing the deploy had to take care not to do it right around the time when jobs were scheduled to launch.
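To make the "missed" window concrete, here is a minimal sketch (not ATMO code) that lists the scheduled run times falling between teardown of the old instance and readiness of the new one. The function name `missed_runs`, its parameters, and the example date are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def missed_runs(scheduled_times, torn_down_at, ready_at):
    """Return the scheduled times that fall inside the deploy gap.

    The gap starts when the old instance is torn down and ends when the
    new instance is fully spun up; nothing is running to launch jobs then.
    """
    return [t for t in scheduled_times if torn_down_at <= t < ready_at]


# Hypothetical example: hourly jobs and a deploy that spans 10:00.
day = datetime(2016, 10, 20)
hourly = [day + timedelta(hours=h) for h in range(24)]
gap = missed_runs(
    hourly,
    torn_down_at=day + timedelta(hours=9, minutes=58),
    ready_at=day + timedelta(hours=10, minutes=3),
)
```

A scheduler that stores its schedule and last-run state outside the process could run a check like this on startup and launch whatever was skipped, instead of relying on the person deploying to avoid the window.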
Flags: needinfo?(mreid)

Updated

a year ago
Points: --- → 2
Priority: -- → P2
As long as the processes receive a SIGTERM on termination, everything should be fine:
gunicorn: http://docs.gunicorn.org/en/stable/signals.html#master-process
rq worker: http://python-rq.org/docs/workers/
rq scheduler: https://github.com/ui/rq-scheduler/blob/396efadda8610548b474e680507b278676fc2262/rq_scheduler/scheduler.py#L52-L67
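For reference, a minimal sketch of the general pattern those links describe, as I understand them: on SIGTERM, finish the work in progress and stop picking up new work. `GracefulWorker` and its methods are hypothetical names for illustration and are not part of gunicorn, rq, or rq-scheduler.

```python
import signal


class GracefulWorker:
    """Toy work loop that stops between jobs once a stop is requested."""

    def __init__(self):
        self.stop_requested = False
        self.completed = []

    def request_stop(self, signum=None, frame=None):
        # Only record the request; the loop decides when it is safe to exit.
        self.stop_requested = True

    def install(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.request_stop)

    def run(self, jobs):
        for job in jobs:
            if self.stop_requested:
                break  # do not start new work after SIGTERM
            job()  # the job in progress always runs to completion
            self.completed.append(job.__name__)
```

The handler itself does no work, so a job that is already running is never cut short by the signal; it is only the next job that does not start.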

:robotblake, do you know if that's the case in the dockerflow environment?
Flags: needinfo?(bimsland)
Created attachment 8803843 [details] [review]
[telemetry-analysis-service] mozilla:bug1309688 > mozilla:master
I'll do some testing but I believe that this is doable (and may work already?).
Flags: needinfo?(bimsland)
It appears that the process currently receives a SIGTERM, followed approximately 30 seconds later by a SIGKILL if it is still alive.
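The sequence described here can be sketched from the supervisor's side with the standard library. This is not how the deployment environment is implemented, only an illustration of the same behavior; `stop_with_grace` is a hypothetical helper, and the 30-second default mirrors the approximate figure observed above.

```python
import subprocess


def stop_with_grace(proc, grace_seconds=30.0):
    """Send SIGTERM, wait up to grace_seconds, then SIGKILL if still alive.

    `proc` is a subprocess.Popen object. Returns "terminated" if the
    process exited within the grace period, "killed" otherwise.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()  # reap the process
        return "killed"
```

The practical consequence is that any cleanup a process does on SIGTERM has to finish within the grace period, since anything still running after that is stopped unconditionally.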
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED