Bug 1125124 (Closed): Opened 9 years ago, Closed 9 years ago

Figure out why the log processor nodes regularly stop handling tasks

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1123479

People

(Reporter: emorley, Unassigned)

Details

This morning the sheriffs reported many jobs with unparsed logs.

It turns out two of the prod log processors (processor2 and processor3) had stopped handling tasks - you can see the drop (until I restarted them) here:

https://rpm.newrelic.com/accounts/677903/servers/4303155?tw[end]=1422021783&tw[start]=1421978583
https://rpm.newrelic.com/accounts/677903/servers/5313432?tw[end]=1422021846&tw[start]=1421935446

One hadn't been taking jobs for ~7 hours, the other ~10 hours.

This resulted in a 40,000-job backlog for log parsing.

We have other bugs on file for (a) getting queue counts into New Relic, and thus alerts, and (b) improving log parser performance; but it would be great to figure out why the processors stop taking tasks like this.
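
For what it's worth, here's a rough sketch of what (a) could look like - polling the RabbitMQ management API and recording queue depths as New Relic custom metrics. This assumes the management plugin is enabled and the Python newrelic agent is installed; the URL, credentials and config path are placeholders, not our real values:

# Hypothetical sketch only - not what's deployed. Reports each queue's depth
# as a Custom/RabbitMQ/<queue> metric so New Relic could alert on a backlog.
import requests
import newrelic.agent

newrelic.agent.initialize("newrelic.ini")  # placeholder path to the agent config
application = newrelic.agent.register_application(timeout=10.0)

MGMT_URL = "http://treeherder-rabbitmq1.private.scl3.mozilla.com:15672/api/queues/treeherder"
AUTH = ("monitor", "secret")  # placeholder credentials

for queue in requests.get(MGMT_URL, auth=AUTH, timeout=10).json():
    # e.g. Custom/RabbitMQ/log_parser -> 24042
    newrelic.agent.record_custom_metric(
        "Custom/RabbitMQ/%s" % queue["name"],
        queue.get("messages", 0),
        application=application,
    )

newrelic.agent.shutdown_agent(timeout=10.0)
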
I should add - there hadn't been a deployment for 4.5 days, so it wasn't due to bug 1079270.
Three hours after restarting the log processors, the queues are going down, but still:

default 1852
high_priority   0
log_parser      24042
log_parser_fail 0
log_parser_hp   14109

I think we could get away with clearing the log_parser_hp queue, since as I understand it it's just on-demand log parse requests caused by people clicking around in the UI - all of which should duplicate the main queue - if my hunch about bug 1125088 is correct.

However, I can't clear the log_parser_hp queue: rabbitmqctl doesn't support clearing queues, I don't have the password for the web admin page, and the only other CLI option (rabbitmqadmin) isn't installed as far as I can see.
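
If rabbitmqadmin does get installed, purging should be something like |rabbitmqadmin --vhost=treeherder purge queue name=log_parser_hp|. Failing that, a rough pika sketch could purge the queue directly over AMQP - the host and credentials below are placeholders, and this is untested against prod:

# Hypothetical sketch only. queue_purge drops the ready messages in the queue;
# anything already reserved by a worker is untouched.
import pika

params = pika.ConnectionParameters(
    host="treeherder-rabbitmq1.private.scl3.mozilla.com",
    virtual_host="treeherder",
    credentials=pika.PlainCredentials("guest", "guest"),  # placeholder credentials
)
connection = pika.BlockingConnection(params)
channel = connection.channel()
result = channel.queue_purge(queue="log_parser_hp")
print("purged %d messages" % result.method.message_count)
connection.close()
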
treeherder-processor3 just became idle again, so I had to restart them all (I couldn't |sudo supervisorctl restart run_celery_worker_gevent| just on that node, since it prompted for a password when sudoing).
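
(For the record, if the nodes' supervisord instances exposed their XML-RPC interface, a per-node restart wouldn't need sudo at all. A rough sketch, assuming an [inet_http_server] on the default port 9001 with no auth, and that the program name matches the supervisorctl one above:)

# Hypothetical sketch only; supervisord's XML-RPC API has no restartProcess,
# so stop the program (if running) and start it again.
from xmlrpc.client import ServerProxy

PROGRAM = "run_celery_worker_gevent"  # may need a "group:name" form depending on config
server = ServerProxy("http://treeherder-processor3.private.scl3.mozilla.com:9001/RPC2")

if server.supervisor.getProcessInfo(PROGRAM)["statename"] == "RUNNING":
    server.supervisor.stopProcess(PROGRAM)
server.supervisor.startProcess(PROGRAM)
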
Slowly getting there:

default 2168
high_priority   0
log_parser      21430
log_parser_fail 11
log_parser_hp   9148

default 2490
log_parser      21219
log_parser_fail 17
log_parser_hp   5850

One step forward, two back:

default 3015
log_parser      31179
log_parser_fail 1
log_parser_hp   2092
We appear to have lost proc3 again:
https://rpm.newrelic.com/accounts/677903/servers/5313432?tw[end]=1422053068&tw[start]=1422049468

default 3569
log_parser      33399
log_parser_fail 53
log_parser_hp   516

Restarting.
Well, on the plus side, we're catching up:

default 4955
log_parser      14587
log_parser_fail 4
log_parser_hp   0

The only problem is that this is just because we're not ingesting any completed jobs (bug 1125410), so once that's resolved we'll have another backlog, yey \o/

default 834
log_parser      87
log_parser_fail 3
log_parser_hp   0

This happened again, so I'm bumping the priority - having no insight into the results makes sheriffing more complicated.
Severity: normal → critical
And again:

buildapi        0
celery@buildapi.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox  0
celery@buildapi.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox  0
celery@default.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox      0
celery@hp.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox   0
celery@log_parser.treeherder-processor1.private.scl3.mozilla.com.celery.pidbox  0
celery@log_parser.treeherder-processor2.private.scl3.mozilla.com.celery.pidbox  0
celery@log_parser.treeherder-processor3.private.scl3.mozilla.com.celery.pidbox  0
celery@pushlog.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox   0
celery@pushlog.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox   0
celeryev.1bda9ad7-fcff-429b-b520-5dcf58333982   0
celeryev.24f6138b-eafa-49d2-a6d5-b9c0388f47cc   0
celeryev.609cc011-de8a-4f17-9ae9-70d944116c1d   17110
celeryev.86309a15-671d-4a24-bc6d-242d4afced72   3606
celeryev.b5409c1a-2dd5-412d-871e-c091bac8c3c2   0
celeryev.c224d58c-744b-4f6a-9752-7382ee1cd299   0
celeryev.e2da783f-e4e6-46bd-b64b-dae217d28ffe   0
celeryev.e84ddfc4-8131-4ffd-9216-9cf81980c6fc   0
celeryev.ed36265a-b998-43b7-a0ae-80eed999710a   0
default 1220
high_priority   0
log_parser      4525
log_parser_fail 478
log_parser_hp   464
pushlog 0

treeherder-processor[23] have been idle for the last 60-90 mins.

I've restarted them.
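
A rough sketch of how the idle/disconnected workers could be detected automatically, rather than waiting for someone to notice the backlog - the broker URL is a placeholder, and the node names are the ones from the pidbox listing above:

# Hypothetical sketch only. Ping the expected Celery nodes and flag any that
# don't answer, or that answer but report no active tasks (the symptom seen
# on processor2/3 while the log_parser queue had a large backlog).
from celery import Celery

app = Celery(broker="amqp://guest:guest@treeherder-rabbitmq1/treeherder")  # placeholder creds

EXPECTED = {
    "celery@log_parser.treeherder-processor1.private.scl3.mozilla.com",
    "celery@log_parser.treeherder-processor2.private.scl3.mozilla.com",
    "celery@log_parser.treeherder-processor3.private.scl3.mozilla.com",
}

def check_log_parser_workers():
    replies = app.control.ping(timeout=5.0) or []
    alive = {name for reply in replies for name in reply}
    active = app.control.inspect(timeout=5.0).active() or {}

    for name in sorted(EXPECTED):
        if name not in alive:
            print("not responding: %s" % name)
        elif not active.get(name):
            print("idle: %s" % name)

if __name__ == "__main__":
    check_log_parser_workers()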

(In reply to Carsten Book [:Tomcat] from comment #10)
> This happened again, so I'm bumping the priority - having no insight into
> the results makes sheriffing more complicated.

Agreed, this is very high priority - but note that we use the priority field rather than the severity field, and the priority field is already at P1.

processor3 disconnected again; restarted. Worker log:
http://pastebin.mozilla.org/8378673 (the restart started at 13:08:59)

Queues are coming down now:

[emorley@treeherder-rabbitmq1.private.scl3 ~]$ sudo rabbitmqctl list_queues -p treeherder
Listing queues ...
buildapi        0
celery@buildapi.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox  0
celery@buildapi.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox  0
celery@default.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox      0
celery@hp.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox   0
celery@log_parser.treeherder-processor1.private.scl3.mozilla.com.celery.pidbox  0
celery@log_parser.treeherder-processor2.private.scl3.mozilla.com.celery.pidbox  0
celery@log_parser.treeherder-processor3.private.scl3.mozilla.com.celery.pidbox  0
celery@pushlog.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox   0
celery@pushlog.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox   0
celeryev.24f6138b-eafa-49d2-a6d5-b9c0388f47cc   0
celeryev.6104655e-5688-4e29-a364-d9f59831dc80   0
celeryev.b5409c1a-2dd5-412d-871e-c091bac8c3c2   0
celeryev.b6c24cfb-fe30-4067-b962-9fa9eee76633   0
celeryev.bf41b634-d32f-46da-ad0c-1c60cf48310e   0
celeryev.c224d58c-744b-4f6a-9752-7382ee1cd299   0
celeryev.e2da783f-e4e6-46bd-b64b-dae217d28ffe   0
celeryev.e84ddfc4-8131-4ffd-9216-9cf81980c6fc   0
celeryev.ed36265a-b998-43b7-a0ae-80eed999710a   0
default 1486
high_priority   0
log_parser      1443
log_parser_fail 6
log_parser_hp   1
pushlog 0
...done.
Summary: Figure out why two log processor nodes stopped handling tasks → Figure out why the log processor nodes regularly stop handling tasks
Going to call this fixed by bug 1123479 until proven otherwise :-)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE