There is currently a growing backlog of detect_intermittents and generate_perf_alerts tasks on SCL3 prod (5000+ of the former and 2000+ of the latter).

These tasks are run from the `bin/run_celery_worker` script:
https://github.com/mozilla/treeherder/blob/b040bb40455e2baac403813c882d143711262b49/bin/run_celery_worker#L18
which runs on the treeherder-rabbitmq2.private.scl3.mozilla.com node.

However, this node also runs:
bin/run_celerybeat
bin/run_celery_worker_hp

So between the above, the following queues are served by this single node:
default
cycle_data
calculate_durations
fetch_bugs
fetch_allthethings
generate_perf_alerts
detect_intermittents
classification_mirroring
publish_to_pulse
So for whatever reason, the run_celery_worker Python process stopped taking jobs:

[2016-07-06 00:04:20,483: INFO/MainProcess] Task fetch-bugs[21bb106d-f2f2-4eb9-bc66-906a74d97039] succeeded in 260.43417345s: None
[2016-07-06 00:04:20,486: INFO/MainProcess] Received task: detect-intermittents[c23814e2-87f2-4799-aeaf-b902e1bcd0ab]
[2016-07-06 00:04:20,489: INFO/MainProcess] Task detect-intermittents[b0279cea-838e-4f13-a27c-71f89f4d99b7] succeeded in 0.00416307151318s: None
[2016-07-06 00:08:28,471: INFO/MainProcess] Task fetch-allthethings[bc207f97-2844-4bed-a3e4-0dcec800ad94] succeeded in 508.419383425s: None
[2016-07-06 00:08:28,474: INFO/MainProcess] Task detect-intermittents[4fc9a4e4-43db-4ada-9dad-df86edf7d802] succeeded in 0.00268764048815s: None
[2016-07-06 00:08:28,477: INFO/MainProcess] Received task: generate-alerts[e04ab59e-b98e-4a54-b504-f085010a5214]
[2016-07-06 00:08:28,478: INFO/MainProcess] Task detect-intermittents[89a8b221-5fbb-4700-b76a-e0a4f965983c] succeeded in 0.00360686145723s: None
[2016-07-06 00:08:28,479: INFO/MainProcess] Received task: generate-alerts[bc3c2c53-b6e0-41a4-8643-850534129b2b]
[2016-07-06 00:08:28,481: INFO/MainProcess] Received task: generate-alerts[9a7ac7ec-34af-4aaa-ac5b-5152261355a5]
[2016-07-06 00:08:28,482: INFO/MainProcess] Task detect-intermittents[737cca87-372f-4664-9e1c-2011ff9a993a] succeeded in 0.00197583064437s: None
[2016-07-06 01:56:12,757: INFO/Worker-1] New Relic Python Agent (184.108.40.206)
[2016-07-06 01:56:12,804: INFO/Worker-2] New Relic Python Agent (220.127.116.11)
[2016-07-06 01:56:12,849: INFO/Worker-3] New Relic Python Agent (18.104.22.168)

After a `supervisorctl restart run_celery_worker` it started taking jobs again, and the queues are now down to zero:
https://rpm.newrelic.com/accounts/677903/dashboard/13318367/page/3?tw%5Bend%5D=1467797132&tw%5Bstart%5D=1467786332

It's probably not worth looking into this much more unless it occurs again.
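If this does occur again, one cheap way to spot it is to look for long silences in the worker log. The following is only a rough sketch, assuming the timestamp format shown in the excerpt above; the `find_log_gaps` helper and the 30 minute threshold are hypothetical and not part of Treeherder. A quiet log can also just mean the queues were empty, so a gap is a hint rather than proof of a stalled worker.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp prefix used in the worker log, e.g.
# [2016-07-06 00:08:28,482: INFO/MainProcess] ...
TIMESTAMP_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+: ")


def find_log_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (previous, current) timestamp pairs further apart than threshold.

    Hypothetical helper: the threshold is an assumed value, not a tuned one.
    """
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match is None:
            continue  # skip lines without a timestamp, such as tracebacks
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


# Two consecutive lines taken from the log excerpt in this comment.
sample = [
    "[2016-07-06 00:08:28,482: INFO/MainProcess] Task detect-intermittents"
    "[737cca87-372f-4664-9e1c-2011ff9a993a] succeeded in 0.00197583064437s: None",
    "[2016-07-06 01:56:12,757: INFO/Worker-1] New Relic Python Agent (184.108.40.206)",
]

for start, end in find_log_gaps(sample):
    print(f"no log output between {start} and {end} ({end - start})")
```

Running something like this over the full log file, rather than the two sample lines, would show whether the worker has gone quiet before.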
However, I think we should definitely consider adjusting the times at which some of the long-running periodic tasks run, since several overlap on that machine, which may or may not have contributed to this. (Regardless, spreading them out would reduce DB load spikes.) I'll file another bug for this.
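As a rough illustration of what spreading the tasks out could look like, here is a minimal sketch that assigns each periodic task a different start minute within the hour. The `stagger_offsets` helper, the hourly window and the choice of tasks are all assumptions for illustration; the real schedules may use different intervals.

```python
def stagger_offsets(task_names, window_minutes=60):
    """Spread tasks evenly across a window; returns {task name: start minute}.

    Hypothetical helper, assuming every task runs once per window.
    """
    if not task_names:
        return {}
    step = window_minutes // len(task_names)
    return {name: index * step for index, name in enumerate(task_names)}


# Queue names taken from the list earlier in this bug; which of them are
# actually long-running (beyond the two seen in the log) is an assumption.
periodic_tasks = [
    "fetch_bugs",
    "fetch_allthethings",
    "cycle_data",
    "calculate_durations",
]
offsets = stagger_offsets(periodic_tasks)

# The longest run in the log excerpt was fetch-allthethings at about 508s,
# so each slot should be comfortably longer than that to avoid overlap.
longest_observed_seconds = 509
slot_seconds = (60 // len(periodic_tasks)) * 60
no_overlap_expected = slot_seconds > longest_observed_seconds

for name, minute in offsets.items():
    print(f"{name}: start at minute {minute} of each hour")
```

Each resulting minute could then be used as the `minute` argument of a `celery.schedules.crontab` entry in the beat schedule, instead of having several tasks start at the same time.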
Assignee: nobody → emorley
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED