Bug 1079270 (Closed) - Fix the deploy script, so the celery workers do not become idle/disconnected
Opened 10 years ago · Closed 10 years ago
Categories: Tree Management :: Treeherder: Infrastructure, defect, P1
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: emorley, Assigned: mdoglio)
Attachments: (2 files)
Filing this as a separate bug, since this will be just about the tree closure, and so might be resolved fixed by just waiting for parsing to catch up.
Bug 1074927 tracks the various longer term solutions (incl more workers, which are on the way).
All trunk integration trees closed.
Comment 1 (Reporter) • 10 years ago
Though TBPL is having issues too, so this might be more infra-related.
TBPL-dev (on which we have access to import logs, unlike prod) hasn't imported successfully since 06-Oct-2014 19:02 (UTC+0) - see the zero-byte logs here:
https://tbpl-dev.allizom.org/cache/
And TBPL prod is missing completed jobs (only grey symbols for recent pushes, none in green/other colours):
https://tbpl.mozilla.org/
Comment 2 (Reporter) • 10 years ago
Filed bug 1079279 for TBPL.
Comment 3 (Reporter) • 10 years ago
15:31 <•mdoglio> fubar: can you please check the queues size on the admin node?
15:32 <•mdoglio> I'm particularly interested in the 2 log_parser queues
15:37 <fubar> Listing queues ...
15:37 <fubar> ...done.
15:37 <fubar> >.<
15:39 <•mdoglio> fubar: I would try restarting the celery services
15:40 <•mdoglio> fubar: I can see a few celery processes on the admin node consuming the default queue
15:44 <fubar> aha
15:45 <fubar> log_parser 20446
15:45 <fubar> log_parser_fail 634
15:45 <fubar> really? tabs? tsk.
15:45 <•mdoglio> fubar: I was expecting something like that
15:45 <•mdoglio> fubar: can you please kill the celery processes on the processor machines?
15:45 <•mdoglio> they will restart automatically
15:48 <fubar> mdoglio: done
15:48 <•mdoglio> thanks fubar
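The exact commands run on the admin node aren't quoted above; a typical way to get those queue depths, assuming rabbitmqctl is available on the RabbitMQ/admin host, would be something like this (the tab-separated output is presumably what prompted the "tabs? tsk." remark):

    # list each queue with its ready-message count (run on the RabbitMQ node)
    rabbitmqctl list_queues name messages
    # or just the queues of interest:
    rabbitmqctl list_queues name messages | grep -E '^(log_parser|log_parser_fail|default)\b'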
Comment 4 (Reporter) • 10 years ago
I'm temporarily unable to connect to irccloud, but here was the latest update from #treeherder:
15:26 mdoglio fubar: sorry to bug you again, can you please re-check the queues size?
15:26 fubar log_parser 19593
15:26 fubar log_parser_fail 20
15:27 fubar default's at 20477, fwiw
15:27 mdoglio fubar: interesting, default shouldn't be that big
15:28 mdoglio but it's not a big deal
15:40 mdoglio camd: https://treeherder.mozilla.org/admin/djcelery/periodictask/
15:47 mdoglio fubar: I'm gonna clear the default queue on the admin node. there are only periodic tasks in it, there's no value in its content
15:47 fubar cool
15:52 mdoglio fubar: done, can you please tell me the size of the queues for (hopefully) the last time?
15:53 fubar default 30
15:53 fubar log_parser 17995
15:53 fubar log_parser_fail 2
15:58 mdoglio python manage.py celery worker -Q default --purge
16:00 camd RyanVM: edmorley|sheriffduty : when the log parsing backlog catches up and the trees can reopen, we'll push new treeherder code with fixes for several things (like toggling unclassified)
16:00 camd just an fyi
16:02 edmorley|sheriffduty camd: sounds good, thank you :-)
Note: these timestamps need an hour added to them to make them comparable to those in comment 0, since this excerpt came from http://logs.glob.uno/?c=mozilla%23treeherder
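For reference, the --purge flag in the command above makes the started worker drain its queues before it begins consuming. A one-off purge without starting a worker should also be possible through celery's amqp subcommand (assuming celery 3.x driven through django-celery, as elsewhere in this bug; "default" is the queue name from the log above):

    python manage.py celery amqp queue.purge default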
Comment 5 (Reporter) • 10 years ago
Ryan, could you take over? (The IRCCloud impromptu server restart is not quite complete: https://twitter.com/IRCCloud/status/519527547294720001). Coordinate with Cameron in #treeherder re queue lengths :-)
Flags: needinfo?(ryanvm)
Comment 6 (Reporter) • 10 years ago
(TBPL is still not working either and no one from IT has yet replied in bug 1079279 :-()
Comment 7 • 10 years ago
Sure. Remind me again why people like irccloud so much?
Flags: needinfo?(ryanvm)
Comment hidden (typo)
Comment hidden (typo)
Comment 10 (Reporter) • 10 years ago
Since the tree closure has come and gone, I'm morphing this bug to be about investigating the root cause of the backlog, and hopefully preventing a recurrence and improving alerting.
19:17 <•edmorley> mdoglio: did we work out the reason for the queues getting backed up last night?
19:17 <•edmorley> (prior to the prod push)
19:25 <•mdoglio> edmorley: there was only one log parser running, the other one got disconnected from the queue
19:26 <•mdoglio> we need to find a way to detect that situation and take (automatic) action when it happens
It seems like we need to:
1) If possible, figure out why the worker became disconnected from the queue.
2) Report not only the queue counts to New Relic, but also (if possible) the number of active workers, so that we can set up email alerts there - this comes under bug 1059325.
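Neither the detection nor the New Relic reporting exists yet at this point. As a rough sketch only, a cron-able liveness probe could be built on celery's "inspect ping" subcommand; the expected worker count, alert address, and use of mail(1) below are all hypothetical:

    #!/usr/bin/env bash
    # Hypothetical worker-liveness probe; assumes celery 3.x via django-celery,
    # where "inspect ping" prints a pong reply line for each live worker.
    EXPECTED_WORKERS=3                       # assumed fleet size
    ALERT_ADDR="treeherder-ops@example.com"  # hypothetical address

    PONGS=$(python manage.py celery inspect ping --timeout=5 2>/dev/null | grep -ci pong)

    if [ "$PONGS" -lt "$EXPECTED_WORKERS" ]; then
        echo "Only $PONGS of $EXPECTED_WORKERS celery workers answered ping" \
            | mail -s "celery worker alert" "$ALERT_ADDR"
    fi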
Severity: blocker → critical
Summary: Log parsing getting behind on production - trunk trees closed → Investigate why a log parser worker became disconnected from the queue
Comment 11 (Reporter) • 10 years ago
3) If a worker becomes disconnected, make it automatically reconnect.
Comment 12 (Assignee) • 10 years ago
The last comment here makes me think it's a bug in the celery gevent-based worker.
There are a couple of options if I can't find a quick fix for this:
- use a multiprocessing-based worker pool (almost no code changes, slightly worse performance)
- use the eventlet-based worker pool (a few code changes, the same performance as now)
Also, the versions of both celery and kombu are outdated (see bug 1016117); upgrading them could possibly solve this.
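For context, the pool implementation is selected with celery's -P/--pool flag; the three options mentioned would look roughly like this (the queue name and concurrency values are illustrative, not taken from the Treeherder configs):

    # current: gevent-based pool (the one suspected of the disconnect bug)
    python manage.py celery worker -Q log_parser -P gevent -c 50
    # option 1: multiprocessing (prefork) pool - almost no code changes
    python manage.py celery worker -Q log_parser -P prefork -c 8
    # option 2: eventlet pool - a few code changes, comparable throughput
    python manage.py celery worker -Q log_parser -P eventlet -c 50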
Updated (Reporter) • 10 years ago
Updated (Reporter) • 10 years ago
Comment 13 (Assignee) • 10 years ago
I can't reproduce the bug in my local environment, and it's intermittent on production. I think it's better to debug this on stage once we have a production mirror there.
Updated (Reporter) • 10 years ago
No longer blocks: 1080757
Component: Treeherder → Treeherder: Data Ingestion
Updated (Reporter) • 10 years ago
Component: Treeherder: Data Ingestion → Treeherder: Infrastructure
QA Contact: laura
Summary: Investigate why a log parser worker became disconnected from the queue → Fix the deploy script, so workers do not become idle/disconnected
Updated (Reporter) • 10 years ago
Summary: Fix the deploy script, so workers do not become idle/disconnected → Fix the deploy script, so the celery workers do not become idle/disconnected
Updated • 10 years ago
Assignee: nobody → mdoglio
Comment 14 (Assignee) • 10 years ago
Attachment #8555155 - Flags: review?(emorley)
Updated (Assignee) • 10 years ago
Status: NEW → ASSIGNED
Comment 15 (Reporter) • 10 years ago
Comment on attachment 8555155 [details] [review]
Github PR #355 on treeherder-service
Looks good to me!
We'll likely never do this (given the possible move of everything to AWS, and the binning of Chief as well), but were we to set up dev on the same infra and use Chief, we'd need to adjust the restart-jobs line appropriately.
Attachment #8555155 - Flags: review?(emorley) → review+
Comment 16 (Assignee) • 10 years ago
Yeah, I considered that a pretty remote possibility.
Comment 17 • 10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder-service
https://github.com/mozilla/treeherder-service/commit/d88791068f2b44955473efa831a3f3e92252af51
Bug 1079270 - Use new restart command on deploy
Updated (Assignee) • 10 years ago
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 18 (Reporter) • 10 years ago
When I deployed to stage, it hung on the restart-jobs step for 15 mins, and then I got a "502 Proxy Error" from chief:
"""
Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request POST /chief/treeherder.stage.
Reason: Error reading from remote server
"""
The log:
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.stage/logs/master.1422386387
[2015-01-27 19:19:57] Running deploy_log
[2015-01-27 19:19:57] [treeherder-processor1.stage.private.scl3.mozilla.com] running: /data/bin/update-www.sh
[2015-01-27 19:19:57] [treeherder-processor2.stage.private.scl3.mozilla.com] running: /data/bin/update-www.sh
[2015-01-27 19:19:57] [treeherder-processor3.stage.private.scl3.mozilla.com] running: /data/bin/update-www.sh
[2015-01-27 19:19:57] [treeherder-processor3.stage.private.scl3.mozilla.com] finished: /data/bin/update-www.sh (0.387s)
[2015-01-27 19:19:57] [treeherder-processor2.stage.private.scl3.mozilla.com] finished: /data/bin/update-www.sh (0.427s)
[2015-01-27 19:19:57] [treeherder-processor1.stage.private.scl3.mozilla.com] finished: /data/bin/update-www.sh (0.427s)
[2015-01-27 19:19:57] Finished deploy_log (0.429s)
[2015-01-27 19:19:57] [localhost] running: /root/bin/restart-jobs -s all
Should we split up the restart-jobs calls, and do one for each type, or is there something else happening?
Comment 19 (Reporter) • 10 years ago
Running it manually only took 2 mins:
[root@treeherderadm.private.scl3 ~]# /root/bin/restart-jobs -s all
[2015-01-27 19:39:44] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celerybeat
[2015-01-27 19:39:56] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celerybeat (11.804s)
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celerybeat: stopped
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celerybeat: started
[2015-01-27 19:39:56] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker
[2015-01-27 19:40:10] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker (13.893s)
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celery_worker: stopped
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celery_worker: started
[2015-01-27 19:40:10] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_hp
[2015-01-27 19:40:24] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_hp (13.948s)
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celery_worker_hp: stopped
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celery_worker_hp: started
[2015-01-27 19:40:24] [treeherder-etl1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_buildapi
[2015-01-27 19:40:24] [treeherder-etl2.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_buildapi
[2015-01-27 19:40:38] [treeherder-etl2.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_buildapi (13.454s)
[treeherder-etl2.stage.private.scl3.mozilla.com] out: run_celery_worker_buildapi: stopped
[treeherder-etl2.stage.private.scl3.mozilla.com] out: run_celery_worker_buildapi: started
[2015-01-27 19:40:39] [treeherder-etl1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_buildapi (14.844s)
[treeherder-etl1.stage.private.scl3.mozilla.com] out: run_celery_worker_buildapi: stopped
[treeherder-etl1.stage.private.scl3.mozilla.com] out: run_celery_worker_buildapi: started
[2015-01-27 19:40:39] [treeherder-etl1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_pushlog
[2015-01-27 19:40:39] [treeherder-etl2.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_pushlog
[2015-01-27 19:40:53] [treeherder-etl1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_pushlog (13.676s)
[treeherder-etl1.stage.private.scl3.mozilla.com] out: run_celery_worker_pushlog: stopped
[treeherder-etl1.stage.private.scl3.mozilla.com] out: run_celery_worker_pushlog: started
[2015-01-27 19:40:53] [treeherder-etl2.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_pushlog (13.840s)
[treeherder-etl2.stage.private.scl3.mozilla.com] out: run_celery_worker_pushlog: stopped
[treeherder-etl2.stage.private.scl3.mozilla.com] out: run_celery_worker_pushlog: started
[2015-01-27 19:40:53] [treeherder-processor1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_gevent
[2015-01-27 19:40:53] [treeherder-processor2.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_gevent
[2015-01-27 19:40:53] [treeherder-processor3.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_gevent
[2015-01-27 19:41:12] [treeherder-processor3.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_gevent (19.534s)
[treeherder-processor3.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: stopped
[treeherder-processor3.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: started
[2015-01-27 19:41:15] [treeherder-processor2.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_gevent (21.797s)
[treeherder-processor2.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: stopped
[treeherder-processor2.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: started
[2015-01-27 19:41:16] [treeherder-processor1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_gevent (22.785s)
[treeherder-processor1.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: stopped
[treeherder-processor1.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: started
[2015-01-27 19:41:16] [treeherder1.stage.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[2015-01-27 19:41:16] [treeherder2.stage.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[2015-01-27 19:41:16] [treeherder3.stage.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[2015-01-27 19:41:26] [treeherder3.stage.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (10.674s)
[treeherder3.stage.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[treeherder3.stage.webapp.scl3.mozilla.com] out: run_gunicorn: started
[2015-01-27 19:41:26] [treeherder2.stage.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (10.732s)
[treeherder2.stage.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[treeherder2.stage.webapp.scl3.mozilla.com] out: run_gunicorn: started
[2015-01-27 19:41:27] [treeherder1.stage.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (10.801s)
[treeherder1.stage.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[treeherder1.stage.webapp.scl3.mozilla.com] out: run_gunicorn: started
Comment 20 • 10 years ago
As mentioned on IRC today, supervisorctl waits for the celery job to stop and restart, which can take a highly variable amount of time.
Also, eek: the restart-jobs script is a hack job meant to make it easier to manually restart things, and wasn't intended to be used by the update script. :-/
The puppet configs have been updated for staging and prod to have supervisord use SIGKILL to stop celery jobs.
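The puppet change itself isn't quoted; in supervisord terms it amounts to setting the stop signal on each celery program stanza, roughly like this (the file path and command line are placeholders; the program name is taken from the logs above):

    cat > /etc/supervisor.d/run_celery_worker_gevent.ini <<'EOF'
    [program:run_celery_worker_gevent]
    ; placeholder command line - the real one lives in the puppet configs
    command=python manage.py celery worker -Q log_parser -P gevent
    ; the change described above (later switched back to TERM, see comment 23)
    stopsignal=KILL
    ; supervisord escalates to SIGKILL if a stop takes longer than this anyway
    stopwaitsecs=60
    EOF
    supervisorctl reread && supervisorctl update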
Comment 21 (Reporter) • 10 years ago
<•edmorley> mdoglio: fubar it sounds like we shouldn't be using KILL, or at least not without trying TERM first and also catching child processes http://celery.readthedocs.org/en/latest/userguide/workers.html#stopping-the-worker
<fubar> my immediate kneejerk reaction is "yes", but the better answer might actually be keep it.
<fubar> because then I can roll in bits to drain/undrain the web nodes from zlb to prevent outages during restarts
<•edmorley> fubar: we currently don't have proper handling for interrupted workers for at least objectstore processes (bug 1125476), and possibly more
<•edmorley> discovered as a side note during friday's db issue debugging
<fubar> hm
<fubar> then I think I need to change the stopwait to something > 60, since supervisor will send a KILL anyways if a stop takes longer than that
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 22 (Reporter) • 10 years ago
That said, we need to assume workers can get killed at any time, and have the appropriate handling built-in, since deploy script aside, they can get killed for other reasons. Once we have bug 1125476 (and any others) then KILL seems fine - so long as we also take note of (from the docs link in comment 21):
"Also as processes can’t override the KILL signal, the worker will not be able to reap its children, so make sure to do so manually. This command usually does the trick: ..."
Comment 23 • 10 years ago
Updated supervisor configs to use TERM instead of KILL.
Comment 24 (Assignee) • 10 years ago
Attachment #8556410 - Flags: review?(emorley)
Comment 25 (Reporter) • 10 years ago
Comment on attachment 8556410 [details] [review]
Github PR #360 on treeherder-service
This will run the command multiple times, once for each host, since the hostgroups decorator does some magic.
Also, the order of restarts is different from when the "all" option is used. restart-jobs does:

    # put rabbit first for all, for celerybeat
    if [ "$_RAB" == "1" ]; then
        /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}rabbit supervisorctl restart run_celerybeat
        /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}rabbit supervisorctl restart run_celery_worker
        /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}rabbit supervisorctl restart run_celery_worker_hp
    fi
    if [ "$_ETL" == "1" ]; then
        /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}etl supervisorctl restart run_celery_worker_buildapi
        /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}etl supervisorctl restart run_celery_worker_pushlog
    fi
    if [ "$_LOG" == "1" ]; then
        /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}processors supervisorctl restart run_celery_worker_gevent
    fi
    if [ "$_WEB" == "1" ]; then
        /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}web supervisorctl restart run_gunicorn
    fi

...I don't know which order is "right", or if it doesn't matter, but just wanted to point it out :-)
Attachment #8556410 - Flags: review?(emorley) → review-
Comment 26 (Assignee) • 10 years ago
I can't think of a reason to restart one worker before the other. I kept the same ordering that was used before. Restarting celerybeat before or after the workers doesn't sound like a problem to me.
Updated (Assignee) • 10 years ago
Attachment #8556410 - Flags: review- → review?(klibby)
Comment 27 • 10 years ago
Comment on attachment 8556410 [details] [review]
Github PR #360 on treeherder-service
lgtm!
I had put the celerybeat restart first as there were a couple of probably coincidental cases where workers appeared to lose their minds if restarted first.
Attachment #8556410 - Flags: review?(klibby) → review+
Comment 28 • 10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder-service
https://github.com/mozilla/treeherder-service/commit/864148d099d93b6d10d0a9a7c5a36b892f416732
Bug 1079270 - Restart workers individually
Comment 29 • 10 years ago
ffs, I missed an important problem:
issue-multi-command behaves differently depending on whether or not there is a controlling terminal. If there IS NOT, it will not read command-line arguments: they must be piped in. restart-jobs was written assuming it would be run manually, so update.py calling it means all of the restart jobs hang waiting on input.
Comment 30 • 10 years ago
restart-jobs script updated to check for terminal and change how issue-multi gets run. also added framework for un/draining web nodes from zlb, but that needs the flow in bug 1128963 before it can be enabled.
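The updated script isn't quoted here, but the terminal check presumably looks something like this (a sketch only: the variable names are made up, and the piped-mode invocation of issue-multi-command is an assumption based on comment 29):

    if [ -t 0 ]; then
        # interactive: issue-multi-command reads the command from its arguments
        /usr/bin/issue-multi-command -i "$SSHKEY" "$HOSTGROUP" supervisorctl restart "$JOB"
    else
        # no controlling terminal (e.g. run from update.py): pipe the command in
        echo "supervisorctl restart $JOB" | /usr/bin/issue-multi-command -i "$SSHKEY" "$HOSTGROUP"
    fi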
Comment 31 (Reporter) • 10 years ago
(In reply to Kendall Libby [:fubar] from comment #30)
> also added framework for un/draining web nodes from zlb, but that
> needs the flow in bug 1128963 before it can be enabled.
That's awesome - once enabled should take care of bug 1088226 :-)
(Though I guess we may also need to handle the case of the UI knowing it needs to refresh when assets are updated, but that's probably not the kind of notification bug 1088226 had in mind.)
Comment 32 (Reporter) • 10 years ago
(In reply to Ed Morley [:edmorley] from comment #25)
> Comment on attachment 8556410 [details] [review]
> Github PR #360 on treeherder-service
>
> This will run the command multiple times, once for each host, since the
> hostgroups decorator does some magic.
I think this is happening after all:
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/master.1423226819
[2015-02-06 12:47:42] [localhost] running: /root/bin/restart-jobs -p web
[2015-02-06 12:47:42] [localhost] running: /root/bin/restart-jobs -p web
[2015-02-06 12:47:42] [localhost] running: /root/bin/restart-jobs -p web
[2015-02-06 12:47:54] [localhost] finished: /root/bin/restart-jobs -p web (11.524s)
[localhost] err: [2015-02-06 12:47:42] [treeherder1.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder2.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder3.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:43] [treeherder2.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (0.499s)
[localhost] err: [treeherder2.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (not running)
[localhost] err: [treeherder2.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (abnormal termination)
[localhost] err: [2015-02-06 12:47:53] [treeherder3.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (10.516s)
[localhost] err: [treeherder3.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (not running)
[localhost] err: [treeherder3.webapp.scl3.mozilla.com] out: run_gunicorn: started
[localhost] err: [2015-02-06 12:47:54] [treeherder1.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (11.489s)
[localhost] err: [treeherder1.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[localhost] err: [treeherder1.webapp.scl3.mozilla.com] out: run_gunicorn: started
[2015-02-06 12:47:54] [localhost] finished: /root/bin/restart-jobs -p web (11.903s)
[localhost] err: [2015-02-06 12:47:42] [treeherder1.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder2.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder3.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder1.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (0.255s)
[localhost] err: [treeherder1.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (not running)
[localhost] err: [treeherder1.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (abnormal termination)
[localhost] err: [2015-02-06 12:47:42] [treeherder3.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (0.390s)
[localhost] err: [treeherder3.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (not running)
[localhost] err: [treeherder3.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (abnormal termination)
[localhost] err: [2015-02-06 12:47:54] [treeherder2.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (11.865s)
[localhost] err: [treeherder2.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[localhost] err: [treeherder2.webapp.scl3.mozilla.com] out: run_gunicorn: started
Comment 33 • 10 years ago
If it's running restart-jobs once per host in the host group, then it's going to step all over itself - doubly so for the web group, since it's draining nodes from the zlb. restart-jobs is designed to be run once per group, not once per host.
Comment 34 (Reporter) • 10 years ago
I've broken this out to bug 1130408 :-)
Updated (Reporter) • 10 years ago
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED