Stage submitter has been sending only about 1/10 of the usual crash volume to stage since July 15

RESOLVED FIXED

Status

Product: Socorro
Component: Infra
Priority: P1
Severity: normal
Status: RESOLVED FIXED
Reported: a year ago
Modified: a year ago

People

(Reporter: peterbe, Assigned: jp)

(Reporter)

Description

a year ago
If you look at the Crashes per Day report (see attached URL), you'll see that the numbers in the table (focus on the "Crashes" column for each version) have dropped to about a tenth of their usual level since the 15th of July.
(Reporter)

Updated

a year ago
Assignee: nobody → jschneider
(Assignee)

Comment 1

a year ago
I've restarted the submitter.  No code changes have occurred there in some time, so this may fix it.

Comment 2

Fleshing out some data so I know where I can find it later on after I've forgotten.

On July 14th at around 19:12, the monitor we have, sum(last_5m):avg:crashmover.save_raw_crash{environment:stage}, shows a dip and then hovers between 50 and 70 until JP restarted the stage submitter, at which point it spikes upward.

https://app.datadoghq.com/monitors#238219?group=all&from_ts=1468528801835&to_ts=1469039612645

It doesn't look like we have any monitors or graphs showing the performance of the stage submitter--only graphs showing the effects of the stage submitter on the -stage environment.

Do we have anything monitoring the submitter directly?
(Assignee)

Comment 3

a year ago
We are monitoring the number of submissions, but right now it has to drop to zero before it alerts. We can set a threshold instead, so that it alerts when the count stays below that number for five minutes. What do you think that number should be?
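
To make the threshold idea concrete, here is a rough sketch of the check described above: sum the submissions seen in a trailing five-minute window and alert when the total is below a threshold. It is plain Python for illustration only, not how the Datadog monitor is implemented; the function name below_threshold, the (timestamp, count) sample format, and the numbers in the usage lines are all assumptions.

```python
def below_threshold(samples, now, threshold, window_seconds=300):
    """Return True when the trailing-window total is below the threshold.

    samples: iterable of (unix_timestamp, count) pairs (hypothetical format).
    """
    window_start = now - window_seconds
    # Only samples inside the trailing window (window_start, now] count.
    total = sum(count for ts, count in samples if window_start < ts <= now)
    return total < threshold


# Made-up data: a trickle of 2 submissions per minute for ten minutes.
samples = [(60 * minute, 2) for minute in range(1, 11)]
alert_only_at_zero = below_threshold(samples, now=600, threshold=1)
alert_below_forty = below_threshold(samples, now=600, threshold=40)
```

With threshold=1 the check only fires once submissions stop entirely, so a partial drop like the trickle above goes unnoticed; a higher threshold catches it.
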

Comment 4

When you say "# of submissions", is that the "save_raw_crash" data I was mentioning? Or are you referring to a different graph/monitor?

Comment 5

Marking infra bugs that are important to get fixed asap as P1.
Priority: -- → P1
(Reporter)

Updated

a year ago
See Also: → bug 1289772
(Reporter)

Updated

a year ago
Depends on: 1289783

Comment 6

a year ago
40 is the lowest 'normal' value I've seen in the last month on WillKG's metric.

JP -- what exactly is being monitored?
Flags: needinfo?(jschneider)
(Assignee)

Comment 7

a year ago
We're monitoring that number, and can easily adjust the threshold. Currently, it will alert us if there are 0 crash submissions to staging in the last 5 minutes.
Flags: needinfo?(jschneider)
(Assignee)

Comment 8

a year ago
Executive decision made.

150 or fewer in 10 minutes is a warning, and fewer than 5 in 10 minutes is critical.
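
For reference, a minimal sketch of how these two thresholds classify a 10-minute submission count. This is illustrative Python only, not the actual monitor configuration; the function name classify_submissions and the "ok" level are assumptions made for the example.

```python
def classify_submissions(count_last_10m: int) -> str:
    """Map a 10-minute submission count to an alert level."""
    # Fewer than 5 submissions in 10 minutes: critical.
    if count_last_10m < 5:
        return "critical"
    # 150 or fewer (but at least 5): warning.
    if count_last_10m <= 150:
        return "warning"
    # Anything above 150 is treated as normal (assumed label).
    return "ok"
```

The critical check runs first because any count below 5 also satisfies the warning condition.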

This is how it usually looks:
https://www.dropbox.com/s/h33fww33krdej3o/Screenshot%202016-08-25%2004.14.01.jpg
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED