Backlog of detect_intermittents and generate_perf_alerts tasks on SCL3 prod


Status

Product: Tree Management
Component: Treeherder: Infrastructure
Priority: P1
Severity: normal
Status: RESOLVED FIXED
Reported: 2 years ago
Last modified: 2 years ago

People

(Reporter: emorley, Assigned: emorley)


(Assignee)

Description

2 years ago
There is currently an increasing backlog of detect_intermittents and generate_perf_alerts tasks on SCL3 prod (5000+ of the former and 2000+ of the latter).

These tasks are run from the `bin/run_celery_worker` script:
https://github.com/mozilla/treeherder/blob/b040bb40455e2baac403813c882d143711262b49/bin/run_celery_worker#L18

That script runs on the treeherder-rabbitmq2.private.scl3.mozilla.com node.

However this node also runs:
bin/run_celerybeat
bin/run_celery_worker_hp

So between the above, the following queues are all served by this single node (see the illustrative routing sketch after the list):
default
cycle_data
calculate_durations
fetch_bugs
fetch_allthethings
generate_perf_alerts
detect_intermittents
classification_mirroring
publish_to_pulse
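
For illustration only, task-to-queue routing in a Celery 3.x-era Django settings module typically looks something like the sketch below. The task and queue names are taken from this bug, but the actual Treeherder configuration (and the exact flags used by bin/run_celery_worker) may differ.

# Hedged sketch of Celery 3.x-style routing; not Treeherder's real settings.
# Each named task is published to its own queue, and a single worker process
# can be told to consume several of those queues at once.
CELERY_ROUTES = {
    "detect-intermittents": {"queue": "detect_intermittents"},
    "generate-alerts": {"queue": "generate_perf_alerts"},
    "fetch-bugs": {"queue": "fetch_bugs"},
    "fetch-allthethings": {"queue": "fetch_allthethings"},
}

# The worker started by bin/run_celery_worker would then consume a
# comma-separated queue list, roughly equivalent to:
#   celery worker -Q default,cycle_data,calculate_durations,fetch_bugs,...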
(Assignee)

Comment 1

2 years ago
So for whatever reason, the run_celery_worker Python process stopped taking jobs:

[2016-07-06 00:04:20,483: INFO/MainProcess] Task fetch-bugs[21bb106d-f2f2-4eb9-bc66-906a74d97039] succeeded in 260.43417345s: None
[2016-07-06 00:04:20,486: INFO/MainProcess] Received task: detect-intermittents[c23814e2-87f2-4799-aeaf-b902e1bcd0ab]
[2016-07-06 00:04:20,489: INFO/MainProcess] Task detect-intermittents[b0279cea-838e-4f13-a27c-71f89f4d99b7] succeeded in 0.00416307151318s: None
[2016-07-06 00:08:28,471: INFO/MainProcess] Task fetch-allthethings[bc207f97-2844-4bed-a3e4-0dcec800ad94] succeeded in 508.419383425s: None
[2016-07-06 00:08:28,474: INFO/MainProcess] Task detect-intermittents[4fc9a4e4-43db-4ada-9dad-df86edf7d802] succeeded in 0.00268764048815s: None
[2016-07-06 00:08:28,477: INFO/MainProcess] Received task: generate-alerts[e04ab59e-b98e-4a54-b504-f085010a5214]
[2016-07-06 00:08:28,478: INFO/MainProcess] Task detect-intermittents[89a8b221-5fbb-4700-b76a-e0a4f965983c] succeeded in 0.00360686145723s: None
[2016-07-06 00:08:28,479: INFO/MainProcess] Received task: generate-alerts[bc3c2c53-b6e0-41a4-8643-850534129b2b]
[2016-07-06 00:08:28,481: INFO/MainProcess] Received task: generate-alerts[9a7ac7ec-34af-4aaa-ac5b-5152261355a5]
[2016-07-06 00:08:28,482: INFO/MainProcess] Task detect-intermittents[737cca87-372f-4664-9e1c-2011ff9a993a] succeeded in 0.00197583064437s: None
[2016-07-06 01:56:12,757: INFO/Worker-1] New Relic Python Agent (2.66.0.49)
[2016-07-06 01:56:12,804: INFO/Worker-2] New Relic Python Agent (2.66.0.49)
[2016-07-06 01:56:12,849: INFO/Worker-3] New Relic Python Agent (2.66.0.49)

After a `supervisorctl restart run_celery_worker` it started taking jobs again, and the queues are now down to zero:
https://rpm.newrelic.com/accounts/677903/dashboard/13318367/page/3?tw%5Bend%5D=1467797132&tw%5Bstart%5D=1467786332

It's probably not worth looking into this much more unless it occurs again.
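
If it does recur, one low-effort check is Celery's inspect API, which asks the running workers over the broker what they are actually doing. A minimal sketch, assuming direct access to the RabbitMQ broker (the URL below is a placeholder):

from celery import Celery

# Point at the same broker the workers use (placeholder URL).
app = Celery(broker="amqp://guest:guest@localhost:5672//")

insp = app.control.inspect()
print(insp.ping())      # which workers respond at all
print(insp.active())    # tasks currently executing on each worker
print(insp.reserved())  # tasks prefetched from the queue but not yet started

A worker that answers ping() but shows nothing in active()/reserved() while its queues keep growing would match the "stopped taking jobs" behaviour above.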

However I think we should definitely consider adjusting the times at which some of the long-running periodic tasks run, since several overlap on that machine, which may or may not have contributed to this. (Regardless, spreading them out would reduce DB load spikes.) I'll file another bug for this.
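
As a rough illustration of what "spreading them out" could look like, a Celery 3.x-style beat schedule can stagger the start minutes of the heavier tasks. The entry names and cadence below are invented, not Treeherder's real schedule:

from celery.schedules import crontab

# Hedged sketch: stagger the long-running periodic tasks so that e.g.
# fetch-bugs (~260s in the log above) and fetch-allthethings (~508s above)
# no longer start in the same window on the same node.
CELERYBEAT_SCHEDULE = {
    "fetch-bugs": {
        "task": "fetch-bugs",
        "schedule": crontab(minute=5),   # five past each hour
    },
    "fetch-allthethings": {
        "task": "fetch-allthethings",
        "schedule": crontab(minute=35),  # half an hour later
    },
}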
Assignee: nobody → emorley
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED