These have been happening on and off: """ Name treeherder-prod Server REDACTED Vhost REDACTED Queue store_pulse_jobs Current # messages 1725 Alarm queue regexp .* Alarm threshold 1000 """ Looking at New Relic for when these alerts occur show ingestion a single job can take as long as 5-13 seconds! (Compared to usually < 1 second) Much of the profile is spent inserting into the job_details table, and is either due to hundreds of inserts (I seem to remember a job with 250 job details!), or else just very slow inserts (looking at the slow query log showed they were clashing with cycle_data deleting data from the table, plus the DB load was just very high due to massive stack trace blob inserts into failure_line). It's hard to pinpoint which jobs are causing this, since the New Relic traces don't have job_id or similar - we should start by adding that.
In the meantime I've bumped the store_pulse_jobs workers on both stage and prod from 5 to 6 (P1s).
Bumping the dyno count appears to have helped for now. Bug 1407377 will cover going through New Relic slow transactions low hanging fruit.