A problem with ATMO v1 was that a deploy of the service during a job run could cause the job to be interrupted before it completed. Since users can specify that jobs run at any time, we should ensure that deploying new code does not impact running clusters or jobs.
Mark, can you elaborate on how the jobs were interrupted when ATMO v1 was deployed? Did the deploy somehow reset running jobs?
Since jobs were actually launched from the webserver node (via cron), a shutdown would stop monitoring of any running jobs, so detection of job success/failure wouldn't work. I believe it would also force-stop any old-style non-Spark jobs, but that shouldn't be a concern anymore. It was also possible for the scheduler to "miss" jobs whose execution time fell after the previous instance was torn down but before the new instance was fully spun up. That meant whoever was doing the deploy had to take care not to do it right around the time jobs were scheduled to launch.
As long as the processes receive a SIGTERM for termination, everything should be fine:
- gunicorn: http://docs.gunicorn.org/en/stable/signals.html#master-process
- rq worker: http://python-rq.org/docs/workers/
- rq scheduler: https://github.com/ui/rq-scheduler/blob/396efadda8610548b474e680507b278676fc2262/rq_scheduler/scheduler.py#L52-L67

:robotblake, do you know if that's the case in the Dockerflow environment?
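To illustrate the "warm shutdown" behavior the linked docs describe (this is a minimal sketch, not rq's or gunicorn's actual implementation): on SIGTERM the worker sets a flag, lets the current job run to completion, and only then stops picking up new work.

```python
import signal


class GracefulWorker:
    """Sketch of a worker that finishes its current job on SIGTERM
    instead of dying mid-job (hypothetical, for illustration only)."""

    def __init__(self):
        self.shutdown_requested = False
        # Install the handler; SIGTERM now sets a flag instead of killing us.
        signal.signal(signal.SIGTERM, self._request_shutdown)

    def _request_shutdown(self, signum, frame):
        self.shutdown_requested = True

    def run(self, jobs):
        done = []
        for job in jobs:
            if self.shutdown_requested:
                break  # stop *between* jobs, never in the middle of one
            done.append(job())  # the in-flight job always runs to completion
        return done
```

A deploy that sends SIGTERM to such a worker interrupts nothing that is already running; it only prevents new jobs from starting.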
Created attachment 8803843 [details] [review] [telemetry-analysis-service] mozilla:bug1309688 > mozilla:master
I'll do some testing but I believe that this is doable (and may work already?).
It appears that currently the process receives a SIGTERM, followed approximately 30 seconds later (assuming it's still alive) by a SIGKILL.
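That SIGTERM-then-SIGKILL sequence is the standard supervisor pattern; a minimal sketch of the supervising side (hypothetical helper, with the grace period as a parameter rather than the observed ~30 seconds) looks like this:

```python
import signal
import subprocess


def stop_gracefully(proc, grace=30.0):
    """Send SIGTERM to `proc`; escalate to SIGKILL if it is still
    alive after `grace` seconds. Returns the process's exit code."""
    proc.terminate()  # SIGTERM: give the process a chance to clean up
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: hard stop after the grace period
        proc.wait()
    return proc.returncode
```

As long as the worker exits within the grace period, it is never force-killed, so job monitoring can shut down cleanly during a deploy.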