Closed Bug 1288170 Opened 6 years ago Closed 6 years ago

Stage submitter is sending about 1/10 to stage since July 15

Categories

(Socorro :: Infra, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Assigned: jschneider)

Details

If you look at Crashes per Day (see attached URL), you'll see that the numbers in the table (focus on the "Crashes" column for each version) have dropped to about a tenth of their previous level since the 15th of July.
Assignee: nobody → jschneider
I've restarted the submitter.  No code changes have occurred there in some time, so this may fix it.
Fleshing out some data so I know where I can find it later on after I've forgotten.

On July 14th at around 19:12, the monitor we have, sum(last_5m):avg:crashmover.save_raw_crash{environment:stage}, shows a dip and then putters around between 50 and 70 until JP restarted the stage submitter, at which point it spikes upwards.

https://app.datadoghq.com/monitors#238219?group=all&from_ts=1468528801835&to_ts=1469039612645

It doesn't look like we have any monitors or graphs showing the performance of the stage submitter--only graphs showing the effects of the stage submitter on the -stage environment.

Do we have anything monitoring the submitter directly?
We are monitoring the # of submissions, but it needs to go to zero for it to alert.  We can specify a number that is the threshold, so when it's less than that # for five minutes it alerts.  What do you think that number should be?
When you say "# of submissions" is that the "save_raw_crash" data I was mentioning? Or are you referring to a different graph/monitor?
Marking infra bugs that are important to get fixed asap as P1.
Priority: -- → P1
See Also: → 1289772
Depends on: 1289783
40 is the lowest 'normal' behavior I see in the last month on WillKG's metric.

JP -- what exactly is being monitored?
Flags: needinfo?(jschneider)
We're monitoring that #, and can easily manage the threshold.  Currently, it will alert us if there are 0 crash submissions to staging in the last 5 minutes.
Flags: needinfo?(jschneider)
Executive decision made.

150 or fewer in 10 minutes is a warning, and fewer than 5 in 10 minutes is a critical.

This is how it usually looks:
https://www.dropbox.com/s/h33fww33krdej3o/Screenshot%202016-08-25%2004.14.01.jpg
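
For later reference, the threshold rule above can be sketched as a small function. This is only an illustration of the two cutoffs, not the actual monitor configuration; the function name, constant names, and the "ok" label are made up for the example.

```python
# Hypothetical sketch of the alert thresholds described in this bug.
# It is not the real monitor definition, only the same comparison logic.

WARNING_MAX = 150    # a 10-minute count at or below this is a warning
CRITICAL_BELOW = 5   # a 10-minute count below this is a critical


def classify_submissions(count_last_10m: int) -> str:
    """Return 'critical', 'warning', or 'ok' for a 10-minute submission count."""
    if count_last_10m < 0:
        raise ValueError("submission count cannot be negative")
    # Check the critical cutoff first, since every critical count
    # also falls inside the warning range.
    if count_last_10m < CRITICAL_BELOW:
        return "critical"
    if count_last_10m <= WARNING_MAX:
        return "warning"
    return "ok"
```

Note that a count of exactly 5 is a warning rather than a critical, and exactly 150 is still a warning, because the warning cutoff is inclusive and the critical one is not.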
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED