Closed Bug 1384485 Opened 7 years ago Closed 7 years ago

Periodic cloudamqp alerts about Pulse job ingestion backlogs (store_pulse_jobs)

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

Ed Morley [:emorley]

Assignee

Description

7 years ago

These have been happening on and off:

"""
Name 	treeherder-prod
Server 	REDACTED
Vhost 	REDACTED
Queue 	store_pulse_jobs
Current # messages 	1725
Alarm queue regexp 	.*
Alarm threshold 	1000
"""

The New Relic traces from when these alerts occur show that ingesting a single job can take as long as 5-13 seconds! (Compared to usually < 1 second.)

Much of the profile is spent inserting into the job_details table. This is either due to hundreds of inserts (I seem to remember a job with 250 job details!) or to the individual inserts just being very slow: the slow query log showed them clashing with cycle_data deleting data from the table, and the DB load was very high in general due to massive stack trace blob inserts into failure_line.

It's hard to pinpoint which jobs are causing this, since the New Relic traces don't have job_id or similar - we should start by adding that.
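As a rough sketch of what tagging the traces could look like (not the actual Treeherder change): the New Relic Python agent lets you attach custom attributes to the current transaction via `newrelic.agent.add_custom_attribute` (called `add_custom_parameter` in older agent versions). The `process_job` function and the `job_id`/`project` keys below are hypothetical stand-ins for the real ingestion task and payload fields, and the local `recorded_attributes` list exists only so the sketch also runs where the agent isn't installed.

```python
try:
    import newrelic.agent as nr_agent
except ImportError:  # agent not installed, e.g. in local development
    nr_agent = None

# Local record of what was tagged; only here so the sketch can be
# exercised without a running New Relic agent.
recorded_attributes = []


def tag_current_transaction(key, value):
    """Attach a key/value pair to the current New Relic transaction, if any."""
    recorded_attributes.append((key, value))
    if nr_agent is None:
        return
    # Newer agents expose add_custom_attribute; older ones only have
    # add_custom_parameter. Use whichever is available.
    add_attribute = getattr(nr_agent, "add_custom_attribute", None) or getattr(
        nr_agent, "add_custom_parameter", None
    )
    if add_attribute is not None:
        add_attribute(key, value)


def process_job(pulse_job):
    """Hypothetical stand-in for the body of the store_pulse_jobs task."""
    # Tag first, so that a slow trace still carries the identifiers.
    tag_current_transaction("job_id", pulse_job["job_id"])
    tag_current_transaction("project", pulse_job["project"])
    # ... the actual ingestion work would happen here ...


process_job({"job_id": 123, "project": "mozilla-central"})
```

With something like this in place, a slow transaction trace in New Relic would carry the identifiers needed to look up the offending job.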

Ed Morley [:emorley]

Assignee

Updated

7 years ago

Assignee: nobody → emorley

Ed Morley [:emorley]

Assignee

Updated

7 years ago

Depends on: 1342296

Ed Morley [:emorley]

Assignee

Comment 1

7 years ago

In the meantime I've bumped the store_pulse_jobs workers on both stage and prod from 5 to 6 (P1s).

Ed Morley [:emorley]

Assignee

Comment 2

7 years ago

Bumping the dyno count appears to have helped for now. Bug 1407377 will cover working through the low-hanging fruit among the New Relic slow transactions.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Categories

(Tree Management :: Treeherder: Data Ingestion, enhancement, P1)
