Closed Bug 573381 (opened 14 years ago, closed 14 years ago)

Provide current throttle configuration plus history of any recent changes

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: All
OS: Other
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: laura, Assigned: aravind)

References

Details

Attachments

(1 file)

Aravind, can you please:
1. attach a copy of the current throttling config to this bug
2. tell us, is this config version controlled?
3. If so, check the history of recent changes
4. If not, check for any recent bugs that might help us reconstruct changes, and cc me and lars on them

A throttle rule that did not correctly apply to 3.6.4 (or to some 3.6.4 builds) and blocked a random percentage of all 3.6.4 crashes would handily and entirely account for the spike.  (I can't believe we didn't think to check this already.)
Blocks: 571118
Assignee: server-ops → aravind
Severity: major → blocker
Since we store 100% of submitted crashes in HBase, we can do a scan and count how many 3.6.4 crashes were stored per day on the days in question.  Then we can compare that with the total number of 3.6.4 crashes.
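A minimal sketch of such a per-day scan and count, with hypothetical connection, table, column, and metadata field names (the authoritative counting code is the one referenced in the following comments):

import json
from collections import Counter

import happybase  # HBase Thrift client; assumes the Thrift gateway is reachable

# Hypothetical host, table, and column names for the crash store.
connection = happybase.Connection('hbase-thrift-gateway')
table = connection.table('crash_reports')

per_day = Counter()
for _row_key, data in table.scan(columns=[b'meta_data:json']):
    meta = json.loads(data[b'meta_data:json'])
    if meta.get('ProductName') == 'Firefox' and meta.get('Version') == '3.6.4':
        # Bucket by submission date, assuming an ISO-style submitted_timestamp field.
        per_day[meta.get('submitted_timestamp', 'unknown')[:10]] += 1

for day, count in sorted(per_day.items()):
    print(day, count)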
An important note: the ADU traffic reports we ran last week show that up until the 10th, a significant portion of the 3.6.4 traffic was falling under the release channel.  On the 9th and 10th there was a cutover to the beta channel.

If channel is a determining factor in the throttling rules, this could play a part.

Also note that for a new release, we sometimes see a lag of two days for ADU to reflect usage which might put the cutover on the 8th.
wrt comment #1, 
We already have code that counts the total number of Fx 3.6.4 crashes submitted to HBase (https://bugzilla.mozilla.org/show_bug.cgi?id=573093)

We can run it fairly quickly for the date range 6/6 to 6/19, lemme know.
This is the current configuration, and it hasn't changed in a long, long time, not since someone filed a bug asking that we process 100% of 3.6.4 crashes.


throttleConditions = cm.Option()
throttleConditions.default = [
  ('Version', '3.6.4', 100.0),
  ("Version", re.compile(r'\..*?[a-zA-Z]+'), 100), # 100% of all alpha, beta or special
  ("Comments", lambda x: x, 100), # 100% of crashes with comments
  ("ProductName", lambda x: x[0] == 'F' and x[-1] == 'x', 15), # 15% of Firefox - exluding someone's bogus "Firefox3" product
  ("ProductName", lambda x: x[0] in 'TSC', 100), # 100% of Thunderbird, SeaMonkey & Camino
  (None, True, 0) # reject everything else
]
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #3)
> wrt comment #1, 
> We already have code that counts the total number of Fx 3.6.4 crashes submitted
> to HBase (https://bugzilla.mozilla.org/show_bug.cgi?id=573093)
> 
> We can run it fairly quickly for the date range 6/6 to 6/19, lemme know.

Can we do this please, Aravind?
(In reply to comment #0)
> Aravind, can you please:

Still open:

> 2. tell us, is this config version controlled?
> 3. If so, check the history of recent changes
> 4. If not, check for any recent bugs that might help us reconstruct changes,
> and cc me and lars on them
> 

5. Can you point me at the bug you mentioned for 3.6.4 at 100%?
6. How is the config pushed out to all collectors?  Could they have had different configs via a failed git push for example?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This throttle configuration is correct.  Anything identifying itself as having exactly "3.6.4" as its version string will be passed through for processing at 100%.  Since it is the first rule in the list, it is applied first; no other rules come into play for such a crash.

I do not believe that the throttling mechanism has contributed to the spike.
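For reference, a minimal sketch of the first-match-wins interpretation described above (not the actual collector code; condition handling and the acceptance draw are simplified):

import random

def condition_matches(condition, value):
    # Compiled regex, e.g. the alpha/beta pattern.
    if hasattr(condition, 'search'):
        return condition.search(value) is not None
    # Callable, e.g. the ProductName lambdas.
    if callable(condition):
        return bool(condition(value))
    # Literal string, e.g. '3.6.4'.
    return condition == value

def accept_crash(raw_crash, rules):
    # Apply (field, condition, percentage) rules in order; the first match decides.
    for field, condition, percentage in rules:
        if field is None:
            # Catch-all rule, e.g. (None, True, 0): reject everything else.
            return random.random() * 100 < percentage
        value = raw_crash.get(field)
        if value is not None and condition_matches(condition, value):
            return random.random() * 100 < percentage
    return False

Under rules like the ones pasted above, a crash whose Version is exactly "3.6.4" hits the literal rule first and is accepted at 100%, so the later 15% Firefox rule never applies to it.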
Another thing to confirm on this bug is that we don't have a pile of 3.6.4 reports in 'deferred storage.'  I think we already checked that, but that is the ultimate way to figure out whether we had a configuration-file problem or some bug in the logic that interprets the config-file rules.
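A rough way to re-check that, sketched under the assumption that deferred storage is a date-partitioned directory tree of per-crash JSON metadata files (the path and layout are assumptions):

import json
import os
from collections import Counter

DEFERRED_ROOT = '/mnt/socorro/deferred'   # hypothetical mount point

per_day = Counter()
for dirpath, _dirnames, filenames in os.walk(DEFERRED_ROOT):
    for name in filenames:
        if not name.endswith('.json'):
            continue
        with open(os.path.join(dirpath, name)) as f:
            try:
                meta = json.load(f)
            except ValueError:
                continue                   # skip truncated or corrupt metadata
        if meta.get('Version') == '3.6.4':
            # Assume the first path component under the root is the date partition.
            day = os.path.relpath(dirpath, DEFERRED_ROOT).split(os.sep)[0]
            per_day[day] += 1

for day, count in sorted(per_day.items()):
    print(day, count)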
Sorry, here are the answers.

> > 2. tell us, is this config version controlled?

It is in git on the server itself, but I am not sure we keep the repo intact for long.  Other than that, it's not in a different version control system.

> > 3. If so, check the history of recent changes

Looks like these settings for 3.6.4 were put in on 2010-04-20 20:42:03 (from git blame)

> > 4. If not, check for any recent bugs that might help us reconstruct changes,
> > and cc me and lars on them
> > 

Not sure what this means.


> 
> 5. Can you point me at the bug you mentioned for 3.6.4 at 100%?

I can't find it in my bugzilla searches; maybe lars/beltzner/chofmann have the bug number.

> 6. How is the config pushed out to all collectors?  Could they have had
> different configs via a failed git push for example?

It's pushed out using our normal sync script, which is a git pull.  Yes, it's possible we had stuck locks, and I am not sure we have alerts/logs for those.

Need anything else?
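One way to check for stuck locks after the fact, sketched with the collector host names listed below and an assumed repository path (requires ssh access from an admin host):

import subprocess

COLLECTORS = ['pm-app-collector01', 'pm-app-collector02', 'pm-app-collector03']
REPO = '/data/socorro'   # assumed checkout path

for host in COLLECTORS:
    remote_cmd = (
        f'test -e {REPO}/.git/index.lock && echo STALE-LOCK || echo no-lock; '
        f'git --git-dir={REPO}/.git rev-parse --short HEAD'
    )
    result = subprocess.run(['ssh', host, remote_cmd], capture_output=True, text=True)
    print(host, result.stdout.split())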
(In reply to comment #9)
> Sorry, here are the answers.
> 
> > > 2. tell us, is this config version controlled?
> 
> It is in git on the server itself, but I am not sure we keep the repo intact
> for long.  Other than that, it's not in a different version control system.
> 
> > > 3. If so, check the history of recent changes
> 
> Looks like these settings for 3.6.4 were put in on 2010-04-20 20:42:03 (from
> git blame)
> 

Have we made any other changes to the config since?  If so, what were the dates and times?

> 
> > 6. How is the config pushed out to all collectors?  Could they have had
> > different configs via a failed git push for example?
> 
> It's pushed out using our normal sync script, which is a git pull.  Yes, it's
> possible we had stuck locks, and I am not sure we have alerts/logs for those.
> 
> 

Can you please check?

Could you also please manually check that all boxes have the correct config right now?
The output from git sync is unfortunately sent to /dev/null.  So I don't have any history of failures (on the git-syncs on the webheads).


This is the current status on the collectors.

pm-app-collector01
  ('Version', '3.6.4', 100.0),
  ("Version", re.compile(r'\..*?[a-zA-Z]+'), 100), # 100% of all alpha, beta or special
minimalVersionForUnderstandingRefusal = cm.Option()
minimalVersionForUnderstandingRefusal.default = { 'Firefox': '3.5.4' }
pm-app-collector02
  ('Version', '3.6.4', 100.0),
  ("Version", re.compile(r'\..*?[a-zA-Z]+'), 100), # 100% of all alpha, beta or special
minimalVersionForUnderstandingRefusal = cm.Option()
minimalVersionForUnderstandingRefusal.default = { 'Firefox': '3.5.4' }
pm-app-collector03
  ('Version', '3.6.4', 100.0),
  ("Version", re.compile(r'\..*?[a-zA-Z]+'), 100), # 100% of all alpha, beta or special
minimalVersionForUnderstandingRefusal = cm.Option()
minimalVersionForUnderstandingRefusal.default = { 'Firefox': '3.5.4' }
Anything else needed from IT for this bug?
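A scripted version of the manual check above, for future reference (the config path is an assumption):

import subprocess

COLLECTORS = ['pm-app-collector01', 'pm-app-collector02', 'pm-app-collector03']
CONFIG = '/data/socorro/application/scripts/config/collectorconfig.py'  # assumed path

snippets = {}
for host in COLLECTORS:
    result = subprocess.run(['ssh', host, f'grep -n "3.6.4" {CONFIG}'],
                            capture_output=True, text=True)
    snippets[host] = result.stdout.strip()
    print(f'--- {host}')
    print(snippets[host])

if len(set(snippets.values())) > 1:
    print('WARNING: throttle rules differ across collectors')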
I note that the next set of changes we made after adding the 100% rule for 3.6.4 was on June 7, between about 1 pm and 4 pm PT, that is, immediately before we observed the spike.

Aravind, do you recall purging stuck git locks at that time?
Just adding back of the envelope numbers:
If one collector was accepting at 100% and the other two were stuck on the old configuration at 15%, giving a total of ~23k crash reports per day before the spike, then with all three collectors at 100% we would expect an increase to roughly 53k.  This is a little higher than what we actually saw, which was an increase to somewhere between 44k and 55k crash reports per day in the week after the spike.
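Making that arithmetic explicit (assuming the three collectors receive equal traffic):

# Before the fix: one collector accepting at 100%, two stuck at 15%.
pre_spike_total = 23_000                                  # ~23k 3.6.4 reports/day observed
per_collector = pre_spike_total / (1.00 + 0.15 + 0.15)    # ~17.7k submissions per collector
post_fix_total = 3 * per_collector                        # all three collectors at 100%
print(round(post_fix_total))                              # ~53,000 per day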
(In reply to comment #8)
I agree checking deferred storage (starting with 6/7) seems like the next step.
I very well might have; unfortunately I don't recall what happened that day.  The changelog indicates a lot of back and forth for hbaseSubmissionRate.default, which suggests we were dealing with some hbase-related collector issues.  These are the last login records on the servers.

collector03:
root     pts/0        v81boris.mozilla Fri Jun 11 14:48 - 14:48  (00:00)
root     pts/0        v81boris.mozilla Mon Jun  7 16:11 - 11:09 (1+18:57)
root     pts/0        v81boris.mozilla Mon Jun  7 16:09 - 16:09  (00:00)
reboot   system boot  2.6.18-164.11.1. Sat Jan  2 06:08         (161+23:33)
root     pts/0        v81boris.mozilla Sun May 30 06:41 - 06:41  (00:00)

collector01:
root     pts/0        v81boris.mozilla Thu Jun 10 14:32 - crash (-159+-6:-53
root     pts/0        v81boris.mozilla Mon Jun  7 16:09 - 16:11  (00:02)
reboot   system boot  2.6.18-164.11.1. Sat Dec 26 06:03         (177+03:46)
root     pts/0        v81boris.mozilla Sat Jun  5 03:33 - 06:44  (03:11)


collector02:
root     pts/0        v81boris.mozilla Mon Jun  7 16:06 - 16:09  (00:02)
root     tty1                          Sun May 30 06:29 - 06:29  (00:00)
...
root     pts/0        v81boris.mozilla Fri Apr 30 02:28 - down   (00:27)
root     pts/0        v81boris.mozilla Sun Apr 25 06:37 - 06:37  (00:00)
reboot   system boot  2.6.18-164.11.1. Fri Apr 23 02:37         (7+00:18)
reboot   system boot  2.6.18-164.11.1. Tue Apr 20 08:34         (9+18:21)
root     pts/0        v81boris.mozilla Tue Apr 20 06:40 - down   (01:50)
root     pts/0        v81boris.mozilla Tue Apr 20 04:31 - 06:11  (01:40)


So if we started seeing the spike around 4:00 PM on the 7th, then this theory (a cleared git lock) holds water.
Aravind, can you take a look at deferred storage?  Are there a pile of 3.6.4 reports from April 10 - June 7 in there?
From Hbase via https://bugzilla.mozilla.org/show_bug.cgi?id=573556#c1:

For 6/6:
3.6.4: Total crashes: 40263
3.6.4: Total hang crashes: 19635
3.6.4: Total plugin crashes: 4076
3.6.3: Total crashes: 1641921
(In reply to comment #18)
> Aravind, can you take a look at deferred storage?  Are there a pile of 3.6.4
> reports from April 10 - June 7 in there?

The answer to this question was yes.

The spike was in fact due to collector misconfiguration caused by stuck git locks.  Now resolved; closing.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard