Periodic cloudamqp alerts about Pulse job ingestion backlogs (store_pulse_jobs)

RESOLVED FIXED

Status

Tree Management
Treeherder: Data Ingestion
P1
normal
RESOLVED FIXED
5 months ago
2 months ago

People

(Reporter: emorley, Assigned: emorley)

Tracking

(Depends on: 1 bug)

Details

(Assignee)

Description

5 months ago
These have been happening on and off:

"""
Name 	treeherder-prod
Server 	REDACTED
Vhost 	REDACTED
Queue 	store_pulse_jobs
Current # messages 	1725
Alarm queue regexp 	.*
Alarm threshold 	1000
"""

Looking at New Relic for when these alerts occur show ingestion a single job can take as long as 5-13 seconds! (Compared to usually < 1 second)

Much of the profile is spent inserting into the job_details table, and is either due to hundreds of inserts (I seem to remember a job with 250 job details!), or else just very slow inserts (looking at the slow query log showed they were clashing with cycle_data deleting data from the table, plus the DB load was just very high due to massive stack trace blob inserts into failure_line).

It's hard to pinpoint which jobs are causing this, since the New Relic traces don't have job_id or similar - we should start by adding that.
(Assignee)

Updated

5 months ago
Assignee: nobody → emorley
(Assignee)

Updated

5 months ago
Depends on: 1342296
(Assignee)

Comment 1

5 months ago
In the meantime I've bumped the store_pulse_jobs workers on both stage and prod from 5 to 6 (P1s).
(Assignee)

Comment 2

2 months ago
Bumping the dyno count appears to have helped for now. Bug 1407377 will cover going through New Relic slow transactions low hanging fruit.
Status: NEW → RESOLVED
Last Resolved: 2 months ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.