Closed Bug 1079270 Opened 10 years ago Closed 10 years ago

Fix the deploy script, so the celery workers do not become idle/disconnected

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: mdoglio)

References

Details

Attachments

(2 files)

Filing this as a separate bug, since this will be just about the tree closure, and so might be resolved fixed by just waiting for parsing to catch up. Bug 1074927 tracks the various longer term solutions (incl more workers, which are on the way). All trunk integration trees closed.
Though TBPL is having issues too, so this might be more infra-related. TBPL-dev (on which we have access to import logs, unlike prod) hasn't imported successfully since 06-Oct-2014 19:02 (UTC+0) - see the zero-byte logs here: https://tbpl-dev.allizom.org/cache/ And TBPL prod is missing completed jobs (only grey symbols for recent pushes, none in green/other colours): https://tbpl.mozilla.org/
Filed bug 1079279 for TBPL.
15:31 <•mdoglio> fubar: can you plese check the queues size on the admin node?
15:32 <•mdoglio> I'm particularly interested in the 2 log_parser queues
15:37 <fubar> Listing queues ...
15:37 <fubar> ...done.
15:37 <fubar> >.<
15:39 <•mdoglio> fubar: I would try restarting the celery services
15:40 <•mdoglio> fubar: I can see a few celery processes on the admin node consuming the default queue
15:44 <fubar> aha
15:45 <fubar> log_parser 20446
15:45 <fubar> log_parser_fail 634
15:45 <fubar> really? tabs? tsk.
15:45 <•mdoglio> fubar: I was expecting something like that
15:45 <•mdoglio> fubar: can you please kill the celery processes on the processor machines?
15:45 <•mdoglio> they will restart automatically
15:48 <fubar> mdoglio: done
15:48 <•mdoglio> thanks fubar
I'm temporarily unable to connect to irccloud (), but here was the latest update from #treeherder:

15:26 mdoglio fubar: sorry to bug you again, can you please re-check the queues size?
15:26 fubar log_parser 19593
15:26 fubar log_parser_fail 20
15:27 fubar default's at 20477, fwiw
15:27 mdoglio fubar: interesting, default shouldn't be that big
15:28 mdoglio but it's not a big deal
15:40 mdoglio camd: https://treeherder.mozilla.org/admin/djcelery/periodictask/
15:47 mdoglio fubar: I'm gonna clear the default queue on the admin node. there are only periodic tasks in it, there's no value in its content
15:47 fubar cool
15:52 mdoglio fubar: done, can you please tell me the size of the queues for (hopefully) the last time?
15:53 fubar default 30
15:53 fubar log_parser 17995
15:53 fubar log_parser_fail 2
15:58 mdoglio python manage.py celery worker -Q default --purge
16:00 camd RyanVM: edmorley|sheriffduty : when the log parsing backlog catches up and the trees can reopen, we'll push new treeherder code with fixes for several things (like toggling unclassified)
16:00 camd just an fyi
16:02 edmorley|sheriffduty camd: sounds good, thank you :-)

Note these timestamps need an hour adding to them, to make them comparable to those in comment 0, since this lot came from http://logs.glob.uno/?c=mozilla%23treeherder
Ryan, could you take over? (The IRCCloud impromptu server restart is not quite complete: https://twitter.com/IRCCloud/status/519527547294720001). Coordinate with Cameron in #treeherder re queue lengths :-)
Flags: needinfo?(ryanvm)
(TBPL is still not working either and no one from IT has yet replied in bug 1079279 :-()
Sure. Remind me again why people like irccloud so much?
Flags: needinfo?(ryanvm)
Since the tree closure has been and gone, morphing this bug to be about investigating the root cause of the backlog & hopefully avoiding the cause / improving alerting.

19:17 <•edmorley> mdoglio: did we work out the reason for the queues getting backed up last night?
19:17 <•edmorley> (prior to the prod push)
19:25 <•mdoglio> edmorley: there was only one log parser running, the other one got disconnected from the queue
19:26 <•mdoglio> we need to find a way to detect that situation and take (automatic) action when it happens

It seems like we need to:
1) If possible, figure out why the worker became disconnected from the queue
2) Report not only the queue counts to New Relic, but also (if possible) the number of active workers, so that we can then set up email alerts - this comes under bug 1059325.
Severity: blocker → critical
Summary: Log parsing getting behind on production - trunk trees closed → Investigate why a log parser worker became disconnected from the queue
3) If a worker becomes disconnected, make it automatically reconnect.
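A minimal sketch of the kind of check items 2 and 3 describe, assuming we shell out to rabbitmqctl and django-celery's inspect command on the admin node (the EXPECTED_WORKERS threshold and the alert/restart action are hypothetical placeholders, not what production currently runs):

#!/bin/bash
# Report queue depths; rabbitmqctl prints "<name> <messages>" per queue.
rabbitmqctl list_queues name messages | grep -E 'log_parser|default'
# Count workers that answer a ping; each live worker replies with a "pong" line.
ACTIVE=$(python manage.py celery inspect ping 2>/dev/null | grep -c pong)
echo "active_workers=${ACTIVE}"
# Hypothetical threshold: if a worker has gone idle/disconnected, alert and/or restart it.
if [ "${ACTIVE}" -lt "${EXPECTED_WORKERS:-2}" ]; then
    echo "fewer workers than expected - report to New Relic / restart the idle worker here"
fi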
The last comment here makes me think it's a bug in the celery gevent-based worker. There are a couple of options if I can't find a quick fix for this:
- use a multiprocessing-based worker pool (almost no code changes, slightly worse performance)
- use the eventlet-based worker pool (a few code changes, same performance as now)
Also, the versions of both celery and kombu are outdated (see bug 1016117); upgrading them could possibly solve this.
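For reference, the pool implementation is selected with the worker's -P option, so the options above are mostly a supervisor-config change; the queue names and concurrency values below are illustrative, not copied from the production config:

# current: gevent pool
python manage.py celery worker -Q log_parser,log_parser_fail -P gevent -c 50
# option 1: multiprocessing (prefork) pool - almost no code changes, lower concurrency
python manage.py celery worker -Q log_parser,log_parser_fail -P prefork -c 8
# option 2: eventlet pool - needs the eventlet package, similar concurrency to gevent
python manage.py celery worker -Q log_parser,log_parser_fail -P eventlet -c 50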
Blocks: 1080757
No longer blocks: treeherder-dev-transition
Blocks: 1074927
No longer depends on: 1074927
I can't reproduce the bug in my local environment, and it's intermittent on production. I think it's better to debug this on stage once we have a production mirror there.
Blocks: 1072437
No longer blocks: 1080757
Component: Treeherder → Treeherder: Data Ingestion
Component: Treeherder: Data Ingestion → Treeherder: Infrastructure
QA Contact: laura
Summary: Investigate why a log parser worker became disconnected from the queue → Fix the deploy script, so workers do not become idle/disconnected
Summary: Fix the deploy script, so workers do not become idle/disconnected → Fix the deploy script, so the celery workers do not become idle/disconnected
Depends on: 1121494
Assignee: nobody → mdoglio
Attachment #8555155 - Flags: review?(emorley)
Status: NEW → ASSIGNED
Comment on attachment 8555155 [details] [review]
Github PR #355 on treeherder-service

Looks good to me! We'll likely never do this (given the possible move of everything to AWS, and the binning of Chief as well), but were we to set up dev on the same infra and use Chief, we'd need to adjust the restart-jobs line appropriately.
Attachment #8555155 - Flags: review?(emorley) → review+
yeah, I considered that as a pretty remote possibility
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
When I deployed to stage, it hung on the restart-jobs step for 15 mins, and then I got a "502 Proxy Error" from chief:

"""
Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request POST /chief/treeherder.stage.
Reason: Error reading from remote server
"""

The log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.stage/logs/master.1422386387

[2015-01-27 19:19:57] Running deploy_log
[2015-01-27 19:19:57] [treeherder-processor1.stage.private.scl3.mozilla.com] running: /data/bin/update-www.sh
[2015-01-27 19:19:57] [treeherder-processor2.stage.private.scl3.mozilla.com] running: /data/bin/update-www.sh
[2015-01-27 19:19:57] [treeherder-processor3.stage.private.scl3.mozilla.com] running: /data/bin/update-www.sh
[2015-01-27 19:19:57] [treeherder-processor3.stage.private.scl3.mozilla.com] finished: /data/bin/update-www.sh (0.387s)
[2015-01-27 19:19:57] [treeherder-processor2.stage.private.scl3.mozilla.com] finished: /data/bin/update-www.sh (0.427s)
[2015-01-27 19:19:57] [treeherder-processor1.stage.private.scl3.mozilla.com] finished: /data/bin/update-www.sh (0.427s)
[2015-01-27 19:19:57] Finished deploy_log (0.429s)
[2015-01-27 19:19:57] [localhost] running: /root/bin/restart-jobs -s all

Should we split up the restart-jobs calls, and do one for each type, or is there something else happening?
Running it manually only took 2 mins:

[root@treeherderadm.private.scl3 ~]# /root/bin/restart-jobs -s all
[2015-01-27 19:39:44] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celerybeat
[2015-01-27 19:39:56] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celerybeat (11.804s)
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celerybeat: stopped
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celerybeat: started
[2015-01-27 19:39:56] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker
[2015-01-27 19:40:10] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker (13.893s)
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celery_worker: stopped
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celery_worker: started
[2015-01-27 19:40:10] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_hp
[2015-01-27 19:40:24] [treeherder-rabbitmq1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_hp (13.948s)
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celery_worker_hp: stopped
[treeherder-rabbitmq1.stage.private.scl3.mozilla.com] out: run_celery_worker_hp: started
[2015-01-27 19:40:24] [treeherder-etl1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_buildapi
[2015-01-27 19:40:24] [treeherder-etl2.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_buildapi
[2015-01-27 19:40:38] [treeherder-etl2.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_buildapi (13.454s)
[treeherder-etl2.stage.private.scl3.mozilla.com] out: run_celery_worker_buildapi: stopped
[treeherder-etl2.stage.private.scl3.mozilla.com] out: run_celery_worker_buildapi: started
[2015-01-27 19:40:39] [treeherder-etl1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_buildapi (14.844s)
[treeherder-etl1.stage.private.scl3.mozilla.com] out: run_celery_worker_buildapi: stopped
[treeherder-etl1.stage.private.scl3.mozilla.com] out: run_celery_worker_buildapi: started
[2015-01-27 19:40:39] [treeherder-etl1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_pushlog
[2015-01-27 19:40:39] [treeherder-etl2.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_pushlog
[2015-01-27 19:40:53] [treeherder-etl1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_pushlog (13.676s)
[treeherder-etl1.stage.private.scl3.mozilla.com] out: run_celery_worker_pushlog: stopped
[treeherder-etl1.stage.private.scl3.mozilla.com] out: run_celery_worker_pushlog: started
[2015-01-27 19:40:53] [treeherder-etl2.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_pushlog (13.840s)
[treeherder-etl2.stage.private.scl3.mozilla.com] out: run_celery_worker_pushlog: stopped
[treeherder-etl2.stage.private.scl3.mozilla.com] out: run_celery_worker_pushlog: started
[2015-01-27 19:40:53] [treeherder-processor1.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_gevent
[2015-01-27 19:40:53] [treeherder-processor2.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_gevent
[2015-01-27 19:40:53] [treeherder-processor3.stage.private.scl3.mozilla.com] running: supervisorctl restart run_celery_worker_gevent
[2015-01-27 19:41:12] [treeherder-processor3.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_gevent (19.534s)
[treeherder-processor3.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: stopped
[treeherder-processor3.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: started
[2015-01-27 19:41:15] [treeherder-processor2.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_gevent (21.797s)
[treeherder-processor2.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: stopped
[treeherder-processor2.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: started
[2015-01-27 19:41:16] [treeherder-processor1.stage.private.scl3.mozilla.com] finished: supervisorctl restart run_celery_worker_gevent (22.785s)
[treeherder-processor1.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: stopped
[treeherder-processor1.stage.private.scl3.mozilla.com] out: run_celery_worker_gevent: started
[2015-01-27 19:41:16] [treeherder1.stage.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[2015-01-27 19:41:16] [treeherder2.stage.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[2015-01-27 19:41:16] [treeherder3.stage.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[2015-01-27 19:41:26] [treeherder3.stage.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (10.674s)
[treeherder3.stage.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[treeherder3.stage.webapp.scl3.mozilla.com] out: run_gunicorn: started
[2015-01-27 19:41:26] [treeherder2.stage.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (10.732s)
[treeherder2.stage.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[treeherder2.stage.webapp.scl3.mozilla.com] out: run_gunicorn: started
[2015-01-27 19:41:27] [treeherder1.stage.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (10.801s)
[treeherder1.stage.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[treeherder1.stage.webapp.scl3.mozilla.com] out: run_gunicorn: started
As mentioned in IRC today, supervisorctl waits on the celery job to stop and restart, which can take a highly variable amount of time. Also, eek: the restart-jobs script is a hack job to make it easier to manually restart things, and wasn't intended to be used by the update script. :-/

Puppet configs have been updated for staging and prod to have supervisord use SIGKILL to stop celery jobs.
<•edmorley> mdoglio: fubar it sounds like we shouldn't be using KILL, or at least not without trying TERM first and also catching child processes http://celery.readthedocs.org/en/latest/userguide/workers.html#stopping-the-worker
<fubar> my immediate kneejerk reaction is "yes", but the better answer might actually be keep it.
<fubar> because then I can roll in bits to drain/undrain the web nodes from zlb to prevent outages during restarts
<•edmorley> fubar: we currently don't have proper handling for interrupted workers for at least objectstore processes (bug 1125476), and possibly more
<•edmorley> discovered as a side note during friday's db issue debugging
<fubar> hm
<fubar> then I think I need to change the stopwait to something > 60, since supervisor will send a KILL anyways if a stop takes longer than that
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
That said, we need to assume workers can get killed at any time and have the appropriate handling built in, since, deploy script aside, they can get killed for other reasons. Once we have bug 1125476 (and any others) then KILL seems fine - so long as we also take note of this (from the docs link in comment 21):

"Also as processes can’t override the KILL signal, the worker will not be able to reap its children, so make sure to do so manually. This command usually does the trick: ..."
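For illustration, a generic "TERM first, then KILL, then clean up the children" sequence could look like the following; $WORKER_PID, the 60s timeout and the child clean-up are assumptions for the sketch, not the exact command the celery docs give:

# $WORKER_PID is however the main celery worker pid is obtained on the host
CHILDREN=$(pgrep -P "$WORKER_PID")            # note pool child processes before signalling
kill -TERM "$WORKER_PID"                      # warm shutdown: let it finish current tasks
for i in $(seq 60); do                        # wait up to 60s before giving up
    kill -0 "$WORKER_PID" 2>/dev/null || break
    sleep 1
done
kill -KILL "$WORKER_PID" 2>/dev/null          # force it if still alive
[ -n "$CHILDREN" ] && kill -KILL $CHILDREN 2>/dev/null   # reap orphaned pool children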
Updated supervisor configs to use TERM instead of KILL.
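Roughly, that translates into supervisord program settings along these lines (the program name, command and timeout here are illustrative; the real values live in the puppet configs):

[program:run_celery_worker_gevent]
command=python manage.py celery worker -Q log_parser ...  ; actual command comes from puppet
stopsignal=TERM        ; ask celery for a warm shutdown (finish current tasks)
stopwaitsecs=90        ; needs to exceed the longest expected task, or supervisord falls back to SIGKILL
killasgroup=true       ; make sure pool child processes are signalled too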
Attachment #8556410 - Flags: review?(emorley)
Comment on attachment 8556410 [details] [review]
Github PR #360 on treeherder-service

This will run the command multiple times, once for each host, since the hostgroups decorator does some magic.

Also, the order of restarts is different from when the "all" option is used. restart-jobs does:

# put rabbit first for all, for celerybeat
if [ "$_RAB" == "1" ]; then
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}rabbit supervisorctl restart run_celerybeat
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}rabbit supervisorctl restart run_celery_worker
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}rabbit supervisorctl restart run_celery_worker_hp
fi
if [ "$_ETL" == "1" ]; then
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}etl supervisorctl restart run_celery_worker_buildapi
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}etl supervisorctl restart run_celery_worker_pushlog
fi
if [ "$_LOG" == "1" ]; then
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}processors supervisorctl restart run_celery_worker_gevent
fi
if [ "$_WEB" == "1" ]; then
    /usr/bin/issue-multi-command -i $SSHKEY treeherder-${_ENV}web supervisorctl restart run_gunicorn
fi

...I don't know which order is "right", or if it doesn't matter, but just wanted to point it out :-)
Attachment #8556410 - Flags: review?(emorley) → review-
I can't think of a reason to restart one worker before the other. I kept the same ordering that was used before. Restarting celerybeat before or after the workers doesn't sound like a problem to me.
Attachment #8556410 - Flags: review- → review?(klibby)
Comment on attachment 8556410 [details] [review]
Github PR #360 on treeherder-service

lgtm! I had put the celerybeat restart first as there were a couple of probably coincidental cases where workers appeared to lose their minds if restarted first.
Attachment #8556410 - Flags: review?(klibby) → review+
ffs, I missed an important problem: issue-multi-command behaves differently depending on whether or not there is a controlling terminal. If there IS NOT, it will not read command line arguments: they must be piped in. restart-jobs was written assuming that it was run manually, so update.py calling it means all of the restart jobs hang waiting on input.
restart-jobs script updated to check for terminal and change how issue-multi gets run. also added framework for un/draining web nodes from zlb, but that needs the flow in bug 1128963 before it can be enabled.
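A minimal sketch of that terminal check, assuming issue-multi-command reads the command from stdin when there is no controlling terminal (variable names here are illustrative, not the actual restart-jobs code):

if [ -t 0 ]; then
    # interactive shell: pass the command as arguments, as before
    /usr/bin/issue-multi-command -i "$SSHKEY" "$HOSTGROUP" supervisorctl restart "$JOB"
else
    # no controlling terminal (e.g. called from update.py): pipe the command in
    echo "supervisorctl restart $JOB" | /usr/bin/issue-multi-command -i "$SSHKEY" "$HOSTGROUP"
fi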
(In reply to Kendall Libby [:fubar] from comment #30)
> also added framework for un/draining web nodes from zlb, but that
> needs the flow in bug 1128963 before it can be enabled.

That's awesome - once enabled it should take care of bug 1088226 :-)

(Though I guess we may also need to handle the case of the UI knowing it needs to refresh when assets are updated, but that's probably not the kind of notification bug 1088226 had in mind.)
(In reply to Ed Morley [:edmorley] from comment #25)
> Comment on attachment 8556410 [details] [review]
> Github PR #360 on treeherder-service
>
> This will run the command multiple times, once for each host, since the
> hostgroups decorator does some magic.

I think this is happening after all:
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/master.1423226819

[2015-02-06 12:47:42] [localhost] running: /root/bin/restart-jobs -p web
[2015-02-06 12:47:42] [localhost] running: /root/bin/restart-jobs -p web
[2015-02-06 12:47:42] [localhost] running: /root/bin/restart-jobs -p web
[2015-02-06 12:47:54] [localhost] finished: /root/bin/restart-jobs -p web (11.524s)
[localhost] err: [2015-02-06 12:47:42] [treeherder1.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder2.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder3.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:43] [treeherder2.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (0.499s)
[localhost] err: [treeherder2.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (not running)
[localhost] err: [treeherder2.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (abnormal termination)
[localhost] err: [2015-02-06 12:47:53] [treeherder3.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (10.516s)
[localhost] err: [treeherder3.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (not running)
[localhost] err: [treeherder3.webapp.scl3.mozilla.com] out: run_gunicorn: started
[localhost] err: [2015-02-06 12:47:54] [treeherder1.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (11.489s)
[localhost] err: [treeherder1.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[localhost] err: [treeherder1.webapp.scl3.mozilla.com] out: run_gunicorn: started
[2015-02-06 12:47:54] [localhost] finished: /root/bin/restart-jobs -p web (11.903s)
[localhost] err: [2015-02-06 12:47:42] [treeherder1.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder2.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder3.webapp.scl3.mozilla.com] running: supervisorctl restart run_gunicorn
[localhost] err: [2015-02-06 12:47:42] [treeherder1.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (0.255s)
[localhost] err: [treeherder1.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (not running)
[localhost] err: [treeherder1.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (abnormal termination)
[localhost] err: [2015-02-06 12:47:42] [treeherder3.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (0.390s)
[localhost] err: [treeherder3.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (not running)
[localhost] err: [treeherder3.webapp.scl3.mozilla.com] out: run_gunicorn: ERROR (abnormal termination)
[localhost] err: [2015-02-06 12:47:54] [treeherder2.webapp.scl3.mozilla.com] finished: supervisorctl restart run_gunicorn (11.865s)
[localhost] err: [treeherder2.webapp.scl3.mozilla.com] out: run_gunicorn: stopped
[localhost] err: [treeherder2.webapp.scl3.mozilla.com] out: run_gunicorn: started
if it's running restart-jobs once per host in the host group, then it's going to step all over itself - doubly so for the web group, since it's draining nodes from the zlb. restart-jobs is designed to be run once per group, not per host.
Depends on: 1130408
I've broken this out to bug 1130408 :-)
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED