Fix the deploy script, so the celery workers do not become idle/disconnected

Status: RESOLVED FIXED
Component: Tree Management :: Treeherder: Infrastructure
Priority: P1
Severity: critical
Reported: 3 years ago
Last modified: 3 years ago

People: (Reporter: emorley, Assigned: mdoglio)

Attachments: (2 attachments)

(Reporter)

Description

3 years ago
Filing this as a separate bug, since this will be just about the tree closure, and so might be resolved fixed by just waiting for parsing to catch up.

Bug 1074927 tracks the various longer term solutions (incl more workers, which are on the way).

All trunk integration trees closed.
(Reporter)

Comment 1

3 years ago
Though TBPL is having issues too, so this might be more infra related.

TBPL-dev (where, unlike prod, we have access to the import logs) hasn't imported successfully since 06-Oct-2014 19:02 (UTC+0) - see the zero-byte logs here:
https://tbpl-dev.allizom.org/cache/

And TBPL prod is missing completed jobs (only grey symbols for recent pushes, none in green/other colours):
https://tbpl.mozilla.org/
(Reporter)

Comment 2

3 years ago
Filed bug 1079279 for TBPL.
(Reporter)

Comment 3

3 years ago
15:31 <•mdoglio> fubar: can you please check the queues size on the admin node?
15:32 <•mdoglio> I'm particularly interested in the 2 log_parser queues
15:37 <fubar> Listing queues ...
15:37 <fubar> ...done.
15:37 <fubar> >.<
15:39 <•mdoglio> fubar: I would try restarting the celery services
15:40 <•mdoglio> fubar: I can see a few celery processes on the admin node consuming the default queue
15:44 <fubar> aha
15:45 <fubar> log_parser 20446
15:45 <fubar> log_parser_fail 634
15:45 <fubar> really? tabs? tsk.
15:45 <•mdoglio> fubar: I was expecting something like that
15:45 <•mdoglio> fubar: can you please kill the celery processes on the processor machines?
15:45 <•mdoglio> they will restart automatically
15:48 <fubar> mdoglio: done
15:48 <•mdoglio> thanks fubar
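
The queue sizes fubar pasted above come from listing the RabbitMQ queues on the admin node; the "Listing queues ..." / "...done." lines match rabbitmqctl's output, so the check was presumably something like the following (the exact invocation is an assumption, the queue names come from the log above):

rabbitmqctl list_queues name messages
# one queue per line, e.g.:
# log_parser        20446
# log_parser_fail   634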
(Reporter)

Comment 4

3 years ago
I'm temporarily unable to connect to irccloud, but here is the latest update from #treeherder:


15:26	mdoglio	fubar: sorry to bug you again, can you please re-check the queues size?
15:26	fubar	log_parser 19593
15:26	fubar	log_parser_fail 20
15:27	fubar	default's at 20477, fwiw
15:27	mdoglio	fubar: interesting, default shouldn't be that big
15:28	mdoglio	but it's not a big deal
15:40	mdoglio	camd: https://treeherder.mozilla.org/admin/djcelery/periodictask/
15:47	mdoglio	fubar: I'm gonna clear the default queue on the admin node. there are only periodic tasks in it, there's no value in its content
15:47	fubar	cool
15:52	mdoglio	fubar: done, can you please tell me the size of the queues for (hopefully) the last time?
15:53	fubar	default 30
15:53	fubar	log_parser 17995
15:53	fubar	log_parser_fail 2
15:58	mdoglio	python manage.py celery worker -Q default --purge
16:00	camd	RyanVM: edmorley|sheriffduty : when the log parsing backlog catches up and the trees can reopen, we'll push new treeherder code with fixes for several things (like toggling unclassified)
16:00	camd	just an fyi
16:02	edmorley|sheriffduty	camd: sounds good, thank you :-)

Note these timestamps need an hour added to them to make them comparable to those in comment 0, since this excerpt came from http://logs.glob.uno/?c=mozilla%23treeherder
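
For reference, the default queue was cleared with the worker's --purge flag quoted at 15:58 above, which discards all waiting tasks on the queue before the worker starts consuming. An equivalent one-off purge of a single queue, without starting a worker, can also be done through celery's amqp admin command (a hedged sketch; broker connection details are assumed to come from the normal project settings):

python manage.py celery amqp queue.purge default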
(Reporter)

Comment 5

3 years ago
Ryan, could you take over? (The IRCCloud impromptu server restart is not quite complete: https://twitter.com/IRCCloud/status/519527547294720001). Coordinate with Cameron in #treeherder re queue lengths :-)
Flags: needinfo?(ryanvm)
(Reporter)

Comment 6

3 years ago
(TBPL is still not working either and no one from IT has yet replied in bug 1079279 :-()
Sure. Remind me again why people like irccloud so much?
Flags: needinfo?(ryanvm)
Comment hidden (typo)
Comment hidden (typo)
(Reporter)

Comment 10

3 years ago
Since the tree closure has come and gone, morphing this bug to be about investigating the root cause of the backlog and, hopefully, avoiding it in future / improving alerting.

19:17 <•edmorley> mdoglio: did we work out the reason for the queues getting backed up last night?
19:17 <•edmorley> (prior to the prod push)
19:25 <•mdoglio> edmorley: there was only one log parser running, the other one got disconnected from the queue
19:26 <•mdoglio> we need to find a way to detect that situation and take (automatic) action when it happens

It seems like we need to:
1) If possible, figure out why the worker became disconnected from the queue.
2) Report not only the queue counts to New Relic, but also (if possible) the number of active workers, so we can then set up email alerts there - this comes under bug 1059325. (A hedged sketch of how the worker count could be checked follows below.)
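
One quick way to see how many workers are actually alive, and which queues they are consuming from, is celery's inspect commands; a minimal sketch, assuming the same django-celery manage.py passthrough quoted elsewhere in this bug (the exact invocation on the Treeherder nodes is an assumption):

python manage.py celery inspect ping
# every live worker replies with "pong"; a worker that has silently dropped
# off the broker simply does not appear in the output
python manage.py celery inspect active_queues
# lists, per worker, the queues it is currently consuming from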
Severity: blocker → critical
Summary: Log parsing getting behind on production - trunk trees closed → Investigate why a log parser worker became disconnected from the queue
(Reporter)

Comment 11

3 years ago
3) If a worker becomes disconnected, make it automatically reconnect.
(Assignee)

Comment 12

3 years ago
The last comment here makes me think it's a bug in the celery gevent-based worker.
There are a couple of options if I can't find a quick fix for this:
- use a multiprocessing-based worker pool (almost no code changes, slightly worse performance)
- use the eventlet-based worker pool (a few code changes, same performance as now)
Also, the versions of both celery and kombu are outdated (see bug 1016117); upgrading them could possibly solve this.
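
Both fallbacks come down to changing the worker's execution pool, which is selected with the -P/--pool flag when the worker is started. A hedged illustration only - the queue name comes from this bug, while the concurrency values and the exact invocation used in the supervisord configs are assumptions:

# current setup: gevent-based pool
python manage.py celery worker -Q log_parser -P gevent -c 50
# option 1: multiprocessing-based pool (the default; called "prefork" in newer Celery)
python manage.py celery worker -Q log_parser -P processes -c 4
# option 2: eventlet-based pool (requires eventlet monkey-patching at startup)
python manage.py celery worker -Q log_parser -P eventlet -c 50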
(Reporter)

Updated

3 years ago
Blocks: 1080757
No longer blocks: 1059400
(Reporter)

Updated

3 years ago
Blocks: 1074927
No longer depends on: 1074927
(Assignee)

Comment 13

3 years ago
I can't reproduce the bug in my local environment, and it's intermittent on production. I think it's better to debug this on stage once we have a production mirror there.
(Assignee)

Updated

3 years ago
Blocks: 1072437
(Reporter)

Updated

3 years ago
No longer blocks: 1080757
Component: Treeherder → Treeherder: Data Ingestion
(Reporter)

Updated

3 years ago
Component: Treeherder: Data Ingestion → Treeherder: Infrastructure
QA Contact: laura
Summary: Investigate why a log parser worker became disconnected from the queue → Fix the deploy script, so workers do not become idle/disconnected
(Reporter)

Updated

3 years ago
Summary: Fix the deploy script, so workers do not become idle/disconnected → Fix the deploy script, so the celery workers do not become idle/disconnected
(Reporter)

Updated

3 years ago
Depends on: 1121494
Assignee: nobody → mdoglio
(Assignee)

Comment 14

3 years ago
Created attachment 8555155 [details] [review]
Github PR #355 on treeherder-service
Attachment #8555155 - Flags: review?(emorley)
(Assignee)

Updated

3 years ago
Status: NEW → ASSIGNED
(Reporter)

Comment 15

3 years ago
Comment on attachment 8555155 [details] [review]
Github PR #355 on treeherder-service

Looks good to me!

We'll likely never do this (given the possible move of everything to AWS and the binning of Chief as well), but were we to set up dev on the same infra and use Chief, we'd need to adjust the restart-jobs line appropriately.
Attachment #8555155 - Flags: review?(emorley) → review+
(Assignee)

Comment 16

3 years ago
yeah, I considered that as a pretty remote possibility

Comment 17

3 years ago
Commit pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/d88791068f2b44955473efa831a3f3e92252af51
Bug 1079270 - Use new restart command on deploy
(Assignee)

Updated

3 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(Reporter)

Comment 18

3 years ago
When I deployed to stage, it hung on the restart-jobs step for 15 mins, and then I got a "502 Proxy Error" from chief:

"""
Proxy Error

The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request POST /chief/treeherder.stage.

Reason: Error reading from remote server
"""

The log:
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.stage/logs/master.1422386387

[2015-01-27 19:19:57] Running deploy_log
[2015-01-27 19:19:57] [treeherder-processor1.stage.private.scl3.mozilla.com] running: /data/bin/update-www.sh
[2015-01-27 19:19:57] [treeherder-processor2.stage.private.scl3.mozilla.com] running: /data/bin/update-www.sh
[2015-01-27 19:19:57] [treeherder-processor3.stage.private.scl3.mozilla.com] running: /data/bin/update-www.sh
[2015-01-27 19:19:57] [treeherder-processor3.stage.private.scl3.mozilla.com] finished: /data/bin/update-www.sh (0.387s)
[2015-01-27 19:19:57] [treeherder-processor2.stage.private.scl3.mozilla.com] finished: /data/bin/update-www.sh (0.427s)
[2015-01-27 19:19:57] [treeherder-processor1.stage.private.scl3.mozilla.com] finished: /data/bin/update-www.sh (0.427s)
[2015-01-27 19:19:57] Finished deploy_log (0.429s)
[2015-01-27 19:19:57] [localhost] running: /root/bin/restart-jobs -s all

Should we split up the restart-jobs calls, and do one for each type, or is there something else happening?
(Reporter)

Comment 19

3 years ago
Running it manually only took 2 mins:

[root@treeherderadm.private.scl3 ~]# /root/bin/restart-jobs -s all
[2015-01-27 19:39:44] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celerybeat
[2015-01-27 19:39:56] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celerybeat (11.804s)
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celerybeat: stopped
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celerybeat: started
[2015-01-27 19:39:56] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker
[2015-01-27 19:40:10] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker (13.893s)
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celery_worker: stopped
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celery_worker: started
[2015-01-27 19:40:10] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_hp
[2015-01-27 19:40:24] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_hp (13.948s)
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celery_worker_hp: stopped
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celery_worker_hp: started
[2015-01-27 19:40:24] [treeherder-etl1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_buildapi
[2015-01-27 19:40:24] [treeherder-etl2.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_buildapi
[2015-01-27 19:40:38] [treeherder-etl2.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_buildapi (13.454s)
[treeherder-etl2.stage.private.scl3.mozilla.com] out: run_celery_worker_buildapi: stopped
[treeherder-etl2.stage.private.scl3.mozilla.com] out: run_celery_worker_buildapi: started
[2015-01-27 19:40:39] [treeherder-etl1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_buildapi (14.844s)
[treeherder-etl1.stage.private.scl3.mozilla.com] out: run_celery_worker_buildapi: stopped
[treeherder-etl1.stage.private.scl3.mozilla.com] out: run_celery_worker_buildapi: started
[2015-01-27 19:40:39] [treeherder-etl1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_pushlog
[2015-01-27 19:40:39] [treeherder-etl2.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_pushlog
[2015-01-27 19:40:53] [treeherder-etl1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_pushlog (13.676s)
[treeherder-etl1.stage.private.scl3.mozilla.com] out: run_celery_worker_pushlog: stopped
[treeherder-etl1.stage.private.scl3.mozilla.com] out: run_celery_worker_pushlog: started
[2015-01-27 19:40:53] [treeherder-etl2.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_pushlog (13.840s)
[treeherder-etl2.stage.private.scl3.mozilla.com] out: run_celery_worker_pushlog: stopped
[treeherder-etl2.stage.private.scl3.mozilla.com] out: run_celery_worker_pushlog: started
[2015-01-27 19:40:53] [treeherder-processor1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_gevent
[2015-01-27 19:40:53] [treeherder-processor2.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_gevent
[2015-01-27 19:40:53] [treeherder-processor3.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_gevent
[2015-01-27 19:41:12] [treeherder-processor3.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_gevent (19.534s)
[treeherder-processor3.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: stopped
[treeherder-processor3.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: started
[2015-01-27 19:41:15] [treeherder-processor2.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_gevent (21.797s)
[treeherder-processor2.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: stopped
[treeherder-processor2.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: started
[2015-01-27 19:41:16] [treeherder-processor1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_gevent (22.785s)
[treeherder-processor1.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: stopped
[treeherder-processor1.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: started
[2015-01-27 19:41:16] [treeherder1.stage.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[2015-01-27 19:41:16] [treeherder2.stage.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[2015-01-27 19:41:16] [treeherder3.stage.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[2015-01-27 19:41:26] [treeherder3.stage.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (10.674s)

[treeherder3.stage.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[treeherder3.stage.webapp.scl3.mozilla.com] out: run_gunicorn: started
[2015-01-27 19:41:26] [treeherder2.stage.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (10.732s)

[treeherder2.stage.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[treeherder2.stage.webapp.scl3.mozilla.com] out: run_gunicorn: started
[2015-01-27 19:41:27] [treeherder1.stage.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (10.801s)

[treeherder1.stage.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[treeherder1.stage.webapp.scl3.mozilla.com] out: run_gunicorn: started

Comment 20

3 years ago
As mentioned in IRC today, supervisorctl waits on the celery job to stop and restart, which can take a highly variable amount of time.

Also, eek: the restart-jobs script is a hack job to make it easier to manually restart things, and wasn't intended to be used by the update script. :-/

Puppet configs have been updated for staging and prod to have supervisord use SIGKILL to stop celery jobs.
(Reporter)

Comment 21

3 years ago
<•edmorley> mdoglio: fubar it sounds like we shouldn't be using KILL, or at least not without trying TERM first and also catching child processes http://celery.readthedocs.org/en/latest/userguide/workers.html#stopping-the-worker
<fubar> my immediate kneejerk reaction is "yes", but the better answer might actually be keep it.
<fubar> because then I can roll in bits to drain/undrain the web nodes from zlb to prevent outages during restarts
<•edmorley> fubar: we currently don't have proper handling for interrupted workers for at least objectstore processes (bug 1125476), and possibly more
<•edmorley> discovered as a side note during friday's db issue debugging
<fubar> hm
<fubar> then I think I need to change the stopwait to something > 60, since supervisor will send a KILL anyways if a stop takes longer than that
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Comment 22

3 years ago
That said, we need to assume workers can get killed at any time, and have the appropriate handling built-in, since deploy script aside, they can get killed for other reasons. Once we have bug 1125476 (and any others) then KILL seems fine - so long as we also take note of (from the docs link in comment 21):

"Also as processes can’t override the KILL signal, the worker will not be able to reap its children, so make sure to do so manually. This command usually does the trick: ..."

Comment 23

3 years ago
Updated supervisor configs to use TERM instead of KILL.
(Assignee)

Comment 24

3 years ago
Created attachment 8556410 [details] [review]
Github PR #360 on treeherder-service
Attachment #8556410 - Flags: review?(emorley)
(Reporter)

Comment 25

3 years ago
Comment on attachment 8556410 [details] [review]
Github PR #360 on treeherder-service

This will run the command multiple times, once for each host, since the hostgroups decorator does some magic.

Also, the order of restarts is different from when the "all" option is used. restart-jobs does:

# put rabbit first for all, for celerybeat
if [ "$_RAB" == "1" ]; then
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}rabbit supervisorctl restart run_celerybeat
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}rabbit supervisorctl restart run_celery_worker
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}rabbit supervisorctl restart run_celery_worker_hp
fi
if [ "$_ETL" == "1" ]; then
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}etl supervisorctl restart run_celery_worker_buildapi
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}etl supervisorctl restart run_celery_worker_pushlog
fi
if [ "$_LOG" == "1" ]; then
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}processors supervisorctl restart run_celery_worker_gevent
fi
if [ "$_WEB" == "1" ]; then
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}web supervisorctl restart run_gunicorn
fi


...I don't know which order is "right", or if it doesn't matter, but just wanted to point it out :-)
Attachment #8556410 - Flags: review?(emorley) → review-
(Assignee)

Comment 26

3 years ago
I can't think of a reason to restart one worker before the other. I kept the same ordering that was used before. Restarting celerybeat before or after the workers doesn't sound like a problem to me.
(Assignee)

Updated

3 years ago
Attachment #8556410 - Flags: review- → review?(klibby)

Comment 27

3 years ago
Comment on attachment 8556410 [details] [review]
Github PR #360 on treeherder-service

lgtm!

I had put the celerybeat restart first as there were a couple of probably coincidental cases where workers appeared to lose their minds if restarted first.
Attachment #8556410 - Flags: review?(klibby) → review+

Comment 28

3 years ago
Commit pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/864148d099d93b6d10d0a9a7c5a36b892f416732
Bug 1079270 - Restart workers individually

Comment 29

3 years ago
ffs, I missed an important problem:

issue-multi-command behaves differently depending on whether or not there is a controlling terminal. If there IS NOT, it will not read command line arguments: they must be piped in. restart-jobs was written assuming that it was run manually, so update.py calling it means all of the restart jobs hang waiting on input.

Updated

3 years ago
Depends on: 1128963

Comment 30

3 years ago
restart-jobs script updated to check for terminal and change how issue-multi gets run. also added framework for un/draining web nodes from zlb, but that needs the flow in bug 1128963 before it can be enabled.
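
The terminal check described here might look roughly like the following. This is a hedged sketch only - restart-jobs is an IT-maintained script, and the variable names (reused from the excerpt in comment 25) and the convention of piping the command into issue-multi-command are assumptions based on comment 29:

if [ -t 0 ]; then
    # interactive run: issue-multi-command reads the command from its arguments
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}web supervisorctl restart run_gunicorn
else
    # no controlling terminal (e.g. invoked from update.py): pipe the command in
    echo "supervisorctl restart run_gunicorn" | /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}web
fi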
(Reporter)

Comment 31

3 years ago
(In reply to Kendall Libby [:fubar] from comment #30)
> also added framework for un/draining web nodes from zlb, but that
> needs the flow in bug 1128963 before it can be enabled.

That's awesome - once enabled should take care of bug 1088226 :-)
(Though I guess we may also need to handle the case of the UI knowing it needs to refresh when assets are updated, but that's probably not the kind of notification bug 1088226 had in mind)
(Reporter)

Comment 32

3 years ago
(In reply to Ed Morley [:edmorley] from comment #25)
> Comment on attachment 8556410 [details] [review]
> Github PR #360 on treeherder-service
> 
> This will run the command multiple times, once for each host, since the
> hostgroups decorator does some magic.

I think this is happening after all:

http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/master.1423226819

[2015-02-06 12:47:42] [localhost] running: /root/bin/restart-jobs -p web
[2015-02-06 12:47:42] [localhost] running: /root/bin/restart-jobs -p web
[2015-02-06 12:47:42] [localhost] running: /root/bin/restart-jobs -p web
[2015-02-06 12:47:54] [localhost] finished: /root/bin/restart-jobs -p web (11.524s)
[localhost] err: [2015-02-06 12:47:42] [treeherder1.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder2.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder3.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:43] [treeherder2.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (0.499s)
[localhost] err: [treeherder2.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (not running)
[localhost] err: [treeherder2.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (abnormal termination)
[localhost] err: [2015-02-06 12:47:53] [treeherder3.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (10.516s)
[localhost] err: [treeherder3.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (not running)
[localhost] err: [treeherder3.webapp.scl3.mozilla.com] out: run_gunicorn: started
[localhost] err: [2015-02-06 12:47:54] [treeherder1.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (11.489s)
[localhost] err: [treeherder1.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[localhost] err: [treeherder1.webapp.scl3.mozilla.com] out: run_gunicorn: started
[2015-02-06 12:47:54] [localhost] finished: /root/bin/restart-jobs -p web (11.903s)
[localhost] err: [2015-02-06 12:47:42] [treeherder1.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder2.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder3.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder1.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (0.255s)
[localhost] err: [treeherder1.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (not running)
[localhost] err: [treeherder1.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (abnormal termination)
[localhost] err: [2015-02-06 12:47:42] [treeherder3.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (0.390s)
[localhost] err: [treeherder3.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (not running)
[localhost] err: [treeherder3.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (abnormal termination)
[localhost] err: [2015-02-06 12:47:54] [treeherder2.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (11.865s)
[localhost] err: [treeherder2.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[localhost] err: [treeherder2.webapp.scl3.mozilla.com] out: run_gunicorn: started

Comment 33

3 years ago
If it's running restart-jobs once per host in the host group, then it's going to step all over itself - doubly so for the web group, since it's draining nodes from the zlb. restart-jobs is designed to be run once per group, not per host.
(Reporter)

Updated

3 years ago
Depends on: 1130408
(Reporter)

Comment 34

3 years ago
I've broken this out to bug 1130408 :-)
(Reporter)

Updated

3 years ago
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED