Closed Bug 1384485 Opened 7 years ago Closed 7 years ago

Periodic cloudamqp alerts about Pulse job ingestion backlogs (store_pulse_jobs)

Categories

(Tree Management :: Treeherder: Data Ingestion, enhancement, P1)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

These have been happening on and off:

"""
Name 	treeherder-prod
Server 	REDACTED
Vhost 	REDACTED
Queue 	store_pulse_jobs
Current # messages 	1725
Alarm queue regexp 	.*
Alarm threshold 	1000
"""

Looking at New Relic for when these alerts occur show ingestion a single job can take as long as 5-13 seconds! (Compared to usually < 1 second)

Much of the profile is spent inserting into the job_details table, and is either due to hundreds of inserts (I seem to remember a job with 250 job details!), or else just very slow inserts (looking at the slow query log showed they were clashing with cycle_data deleting data from the table, plus the DB load was just very high due to massive stack trace blob inserts into failure_line).

It's hard to pinpoint which jobs are causing this, since the New Relic traces don't have job_id or similar - we should start by adding that.
Assignee: nobody → emorley
Depends on: 1342296
In the meantime I've bumped the store_pulse_jobs workers on both stage and prod from 5 to 6 (P1s).
Bumping the dyno count appears to have helped for now. Bug 1407377 will cover going through New Relic slow transactions low hanging fruit.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.