release runner doesn't die properly when supervisord kills it

RESOLVED FIXED

Status

Release Engineering
Release Automation: Other
5 years ago
5 years ago

People

(Reporter: bhearsum, Assigned: rail)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

5 years ago
It seems to get hung, at least when using "supervisorctl stop". Not sure if this happens when using "restart". We end up with process state like this:
root     13268  0.0  0.0 150328  5204 ?        Ss   Jan30   2:44 /usr/bin/python26 /usr/bin/supervisord
cltbld    6902  0.0  0.0  63864  1128 ?        S    06:30   0:00  \_ /bin/bash /home/cltbld/release-runner/tools/buildfarm/release/release-runner.sh
cltbld    6911  0.5  0.2 165284 13356 ?        S    06:30   0:00      \_ python release-runner.py -c /home/cltbld/.release-runner.ini
cltbld    6801  0.1  0.2 169132 16988 ?        S    06:28   0:00 python release-runner.py -c /home/cltbld/.release-runner.ini

...which is bad because we could have two release runners racing on a release.

One idea for fixing this:
09:54 < jhopkins> bhearsum: try using an exec call to launch python from the shell script. this will result in only one pid beneath supervisord
Assignee: nobody → rail
Unfortunately we can't easily replace the current invocation method with "exec" because we check the exit status of the child script and send an email if it exits non-zero...
(Reporter)

Comment 2

5 years ago
A more invasive approach would be to get rid of the shell wrapper. The only thing it does is send failure e-mail AFAIK. We could probably get better failure mail with an extra LogHandler in Python, anyways...
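For illustration, the "extra LogHandler" idea could look roughly like the sketch below. It is not taken from release-runner.py: the helper names (make_failure_mailer, main), the subject line, and the addresses passed in are all hypothetical. It only assumes the standard library's logging.handlers.SMTPHandler, attached at ERROR level so that a crash logged with a traceback produces the failure mail the shell wrapper sends today.

```python
import logging
import logging.handlers


def make_failure_mailer(mailhost, fromaddr, toaddrs,
                        subject="release-runner failure"):
    """Return an SMTPHandler that mails only ERROR-and-above records.

    All arguments are placeholders; nothing here comes from the real config.
    """
    handler = logging.handlers.SMTPHandler(
        mailhost=mailhost,
        fromaddr=fromaddr,
        toaddrs=toaddrs,
        subject=subject,
    )
    handler.setLevel(logging.ERROR)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    return handler


def main(run, log):
    """Call run(); log any crash with its traceback and return an exit status."""
    try:
        run()
    except Exception:
        # log.exception records at ERROR level and attaches the traceback,
        # so any SMTPHandler on this logger would mail it.
        log.exception("release runner failed")
        return 1
    return 0
```

With the mail sent from Python, the shell wrapper would no longer need to inspect the exit status, so supervisord could launch the Python process directly and there would be a single pid to stop.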
We need "stopasgroup" which was introduced in 3.0b1 (we use 3.0a9).
Depends on: 883693
(Reporter)

Comment 4

5 years ago
(In reply to Rail Aliiev [:rail] from comment #3)
> We need "stopasgroup" which was introduced in 3.0b1 (we use 3.0a9).

Sounds like a good opportunity to move release runner to a more modern machine, and puppetize it...
(Reporter)

Updated

5 years ago
Duplicate of this bug: 885422
I just verified that 3.0b2 fixes the problem. Since bm36 is not managed by puppet and will die soon I left this version installed.
This should be resolved now. I manually upgraded the supervisor package on bm36 (which is not managed by puppet), added stopasgroup/killasgroup to its config and verified that supervisor kills the subprocess properly.

stopasgroup/killasgroup will be set by default in bug 836289 for future deployments.
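A rough sketch of why signalling the whole process group fixes the orphan seen in the description. This is not supervisor's code; it only assumes a Unix host with /bin/bash, uses "sleep 300" as a stand-in for release-runner.py, and all function names are made up for the example. The wrapper is started in its own process group, then stopped either alone (the old behaviour) or as a group (roughly what stopasgroup asks supervisor to do).

```python
import os
import select
import signal
import subprocess


def spawn_wrapper():
    """Start a bash wrapper with one child; return (wrapper Popen, child pid)."""
    wrapper = subprocess.Popen(
        # "sleep 300" is a stand-in for the real Python process.
        ["/bin/bash", "-c", "sleep 300 & echo $!; wait"],
        stdout=subprocess.PIPE,
        start_new_session=True,  # new process group led by the wrapper
    )
    child_pid = int(wrapper.stdout.readline())
    return wrapper, child_pid


def stop(wrapper, as_group):
    """Send SIGTERM to the wrapper only, or to its whole process group."""
    if as_group:
        os.killpg(os.getpgid(wrapper.pid), signal.SIGTERM)
    else:
        os.kill(wrapper.pid, signal.SIGTERM)
    wrapper.wait()


def child_survived(wrapper, timeout=1.0):
    """True if some process still holds the wrapper's stdout pipe open."""
    ready, _, _ = select.select([wrapper.stdout], [], [], timeout)
    # The pipe reaches EOF only once every writer (wrapper and child) is gone.
    return not (ready and wrapper.stdout.read(1) == b"")


if __name__ == "__main__":
    for as_group in (False, True):
        wrapper, child_pid = spawn_wrapper()
        stop(wrapper, as_group)
        survived = child_survived(wrapper)
        print("as_group=%s child_survived=%s" % (as_group, survived))
        if survived:
            os.kill(child_pid, signal.SIGTERM)  # clean up the orphan
```

In the wrapper-only case the child keeps running after bash exits, which matches the stray "python release-runner.py" process in the ps output above; in the group case both processes receive the signal.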
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering