Closed
Bug 1288170
Opened 8 years ago
Closed 8 years ago
Stage submitter is sending about 1/10 to stage since July 15
Categories
(Socorro :: Infra, task, P1)
Socorro
Infra
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: peterbe, Assigned: jschneider)
Details
If you look at Crashes per Day (see attached URL), you'll see that the numbers in the table (focus on the "Crashes" column for each version) have dropped to about a tenth of their previous level since the 15th of July.
Reporter
Updated•8 years ago
Assignee: nobody → jschneider
Assignee
Comment 1•8 years ago
I've restarted the submitter. No code changes have occurred there in some time, so this may fix it.
Comment 2•8 years ago
Fleshing out some data here so I can find it later after I've forgotten.
On July 14th at around 19:12, our monitor sum(last_5m):avg:crashmover.save_raw_crash{environment:stage} shows a dip. It then putters around between 50 and 70 until JP restarted the stage submitter, at which point it spikes upwards.
https://app.datadoghq.com/monitors#238219?group=all&from_ts=1468528801835&to_ts=1469039612645
It doesn't look like we have any monitors or graphs showing the performance of the stage submitter itself, only graphs showing the effects of the stage submitter on the -stage environment. Do we have anything monitoring the submitter directly?
Assignee
Comment 3•8 years ago
We are monitoring the # of submissions, but right now it has to drop to zero before it alerts. We can set a threshold instead, so that it alerts when the count stays below that number for five minutes. What do you think that number should be?
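To make the threshold idea concrete, here is a minimal Python sketch of a "sum over the last five minutes is below N" check. It is not the Datadog monitor itself; the class name `WindowedCountAlert`, its methods, and the sample threshold are hypothetical and exist only for illustration. It also assumes events are recorded in non-decreasing timestamp order.

```python
from collections import deque
from typing import Deque, Tuple


class WindowedCountAlert:
    """Hypothetical helper: alert when the sum of counts seen in the
    trailing window falls below a threshold."""

    def __init__(self, threshold: int, window_seconds: int = 300) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        # (timestamp, count) pairs, oldest first.
        self._events: Deque[Tuple[float, int]] = deque()

    def record(self, timestamp: float, count: int = 1) -> None:
        """Record `count` submissions observed at `timestamp` (seconds)."""
        self._events.append((timestamp, count))

    def should_alert(self, now: float) -> bool:
        """Return True if the windowed sum is below the threshold."""
        cutoff = now - self.window_seconds
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        total = sum(count for _, count in self._events)
        return total < self.threshold
```

With `threshold=1` this behaves like the current setup (it only fires when the count is zero); a higher threshold would also catch a partial drop like the one described in this bug.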
Comment 4•8 years ago
When you say "# of submissions", is that the "save_raw_crash" data I was mentioning? Or are you referring to a different graph/monitor?
Comment 5•8 years ago
Marking infra bugs that are important to get fixed asap as P1.
Priority: -- → P1
Comment 6•8 years ago
40 is the lowest 'normal' value I see in the last month on WillKG's metric. JP, what exactly is being monitored?
Flags: needinfo?(jschneider)
Assignee
Comment 7•8 years ago
We're monitoring that #, and can easily manage the threshold. Currently, it will alert us if there are 0 crash submissions to staging in the last 5 minutes.
Flags: needinfo?(jschneider)
Assignee
Comment 8•8 years ago
Executive decision made. 150 or fewer in 10 minutes is a warning, and fewer than 5 in 10 minutes is critical. This is how it usually looks: https://www.dropbox.com/s/h33fww33krdej3o/Screenshot%202016-08-25%2004.14.01.jpg
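For reference, the two thresholds can be written down as a small Python sketch. It only restates the rule from this comment; the names `Severity`, `classify`, and the two constants are hypothetical and not taken from the actual monitor configuration.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


# Thresholds as stated in this comment, per 10-minute window.
WARNING_AT_OR_BELOW = 150
CRITICAL_BELOW = 5


def classify(submissions_in_10m: int) -> Severity:
    """Map a 10-minute submission count to an alert severity."""
    if submissions_in_10m < 0:
        raise ValueError("submission count cannot be negative")
    # Check the stricter condition first so that very low counts
    # are reported as critical rather than as a warning.
    if submissions_in_10m < CRITICAL_BELOW:
        return Severity.CRITICAL
    if submissions_in_10m <= WARNING_AT_OR_BELOW:
        return Severity.WARNING
    return Severity.OK
```

Note the boundaries: a count of exactly 5 is a warning, not critical, and exactly 150 is still a warning.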
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED