Closed
Bug 1384485
Opened 7 years ago
Closed 7 years ago
Periodic cloudamqp alerts about Pulse job ingestion backlogs (store_pulse_jobs)
Categories
(Tree Management :: Treeherder: Data Ingestion, enhancement, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: emorley)
These have been happening on and off:
"""
Name treeherder-prod
Server REDACTED
Vhost REDACTED
Queue store_pulse_jobs
Current # messages 1725
Alarm queue regexp .*
Alarm threshold 1000
"""
Looking at New Relic for the times when these alerts occur shows that ingesting a single job can take as long as 5-13 seconds! (Compared to usually < 1 second.)

Much of the profile is spent inserting into the job_details table. This is either due to hundreds of inserts (I seem to remember a job with 250 job details!), or else just very slow inserts (the slow query log showed they were clashing with cycle_data deleting data from the table, plus the DB load was just very high due to massive stack trace blob inserts into failure_line).

It's hard to pinpoint which jobs are causing this, since the New Relic traces don't have job_id or similar - we should start by adding that.
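As a rough illustration of the "add job_id to the traces" idea, here is a minimal sketch. It is not Treeherder's actual code: the function name `tag_job_for_tracing`, the payload keys `job_id` and `job_details`, and the `record_attribute` callable are all hypothetical. In a real deployment the recorder would presumably be the New Relic Python agent's custom attribute call (for example `newrelic.agent.add_custom_attribute`); a plain dict stands in for it here so the sketch runs without the agent installed.

```python
from typing import Any, Callable, Dict


def tag_job_for_tracing(
    job: Dict[str, Any],
    record_attribute: Callable[[str, Any], Any],
) -> None:
    """Attach identifying fields from a job payload to the current trace.

    `job` is a hypothetical payload shape; `record_attribute` is whatever
    function the tracing agent provides for custom attributes.
    """
    # Identify the job so a slow trace can be mapped back to it.
    record_attribute("job_id", job.get("job_id", "unknown"))
    # Record how many job_details rows this job would insert, since
    # hundreds of inserts is one of the suspected causes of slowness.
    record_attribute("job_details_count", len(job.get("job_details", [])))


# Stand-in recorder: collect attributes in a dict instead of a real agent.
recorded: Dict[str, Any] = {}
example_job = {"job_id": 12345, "job_details": [{}] * 250}  # made-up values
tag_job_for_tracing(example_job, recorded.__setitem__)
print(recorded)
```

With something like this in place, a slow store_pulse_jobs trace could be filtered by job_id or by job_details_count to tell the "hundreds of inserts" case apart from the "slow individual inserts" case.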
Assignee
Updated•7 years ago
Assignee: nobody → emorley
Assignee
Comment 1•7 years ago
In the meantime I've bumped the store_pulse_jobs workers on both stage and prod from 5 to 6 (P1s).
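For a sense of scale, here is a back-of-the-envelope sketch of what going from 5 to 6 workers buys. It assumes each worker processes one job at a time and ignores everything else (DB contention, prefetch, and so on), so the numbers are upper bounds, not measurements. The per-job times are the figures quoted in the description, with 1.0 standing in for "usually < 1 second"; the function name is made up for this example.

```python
def max_throughput(workers: int, seconds_per_job: float) -> float:
    """Upper bound on jobs/second, assuming one job at a time per worker."""
    return workers / seconds_per_job


# Per-job ingestion times from the description: ~1s normally, 5-13s when slow.
for seconds in (1.0, 5.0, 13.0):
    before = max_throughput(5, seconds)
    after = max_throughput(6, seconds)
    print(f"{seconds:>4.0f}s/job: {before:.2f} -> {after:.2f} jobs/s")
```

Under these assumptions the extra worker adds 20% capacity regardless of job duration, which gives more headroom against the alarm threshold but does not address the slow job_details inserts themselves.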
Assignee
Comment 2•7 years ago
Bumping the dyno count appears to have helped for now. Bug 1407377 will cover working through the low-hanging fruit among the New Relic slow transactions.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED