Closed Bug 573381 (opened 14 years ago, closed 14 years ago)

Provide current throttle configuration plus history of any recent changes

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: All
OS: Other
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: laura, Assigned: aravind)

References

Details

Attachments

(1 file)

Aravind, can you please:
1. attach a copy of the current throttling config to this bug
2. tell us, is this config version controlled?
3. If so, check the history of recent changes
4. If not, check for any recent bugs that might help us reconstruct changes, and cc me and lars on them

A throttle rule that did not correctly apply to 3.6.4 (or to some 3.6.4 builds) and blocked a random percentage of all 3.6.4 crashes would handily and entirely account for the spike.  (I can't believe we didn't think to check this already.)
Blocks: 571118
Assignee: server-ops → aravind
Severity: major → blocker
Since we store 100% of submitted crashes in HBase, we can do a scan and count how many 3.6.4 crashes were stored per day on the days in question.  Then we can compare that with the total number of 3.6.4 crashes.
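A minimal sketch of such a per-day scan and count, with hypothetical connection, table, column, and metadata field names (the authoritative counting code is the one referenced in the following comments):

import json
from collections import Counter

import happybase  # HBase Thrift client; assumes the Thrift gateway is reachable

# Hypothetical host, table, and column names for the crash store.
connection = happybase.Connection('hbase-thrift-gateway')
table = connection.table('crash_reports')

per_day = Counter()
for _row_key, data in table.scan(columns=[b'meta_data:json']):
    meta = json.loads(data[b'meta_data:json'])
    if meta.get('ProductName') == 'Firefox' and meta.get('Version') == '3.6.4':
        # Bucket by submission date, assuming an ISO-style submitted_timestamp field.
        per_day[meta.get('submitted_timestamp', 'unknown')[:10]] += 1

for day, count in sorted(per_day.items()):
    print(day, count)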
An important note: the ADU traffic reports we ran last week show that up until the 10th, a significant portion of the 3.6.4 traffic was falling under the release channel.  On the 9th and 10th there was a cutover to the beta channel.

If channel is a determining factor in the throttling rules, this could play a part.

Also note that for a new release, we sometimes see a lag of two days for ADU to reflect usage which might put the cutover on the 8th.
wrt comment #1, 
We already have code that counts the total number of Fx 3.6.4 crashes submitted to HBase (https://bugzilla.mozilla.org/show_bug.cgi?id=573093)

We can run it fairly quickly for the date range 6/6 to 6/19, lemme know.
This is the current configuration, and it hasn't changed in a long, long time, not since someone filed a bug asking that we process 100% of 3.6.4 crashes.


throttleConditions = cm.Option()
throttleConditions.default = [
  ('Version', '3.6.4', 100.0),
  ("Version", re.compile(r'\..*?[a-zA-Z]+'), 100), # 100% of all alpha, beta or special
  ("Comments", lambda x: x, 100), # 100% of crashes with comments
  ("ProductName", lambda x: x[0] == 'F' and x[-1] == 'x', 15), # 15% of Firefox - exluding someone's bogus "Firefox3" product
  ("ProductName", lambda x: x[0] in 'TSC', 100), # 100% of Thunderbird, SeaMonkey & Camino
  (None, True, 0) # reject everything else
]
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #3)
> wrt comment #1, 
> We already have code that counts the total number of Fx 3.6.4 crashes submitted
> to HBase (https://bugzilla.mozilla.org/show_bug.cgi?id=573093)
> 
> We can run it fairly quickly for the date range 6/6 to 6/19, lemme know.

Can we do this please, Aravind?
(In reply to comment #0)
> Aravind, can you please:

Still open:

> 2. tell us, is this config version controlled?
> 3. If so, check the history of recent changes
> 4. If not, check for any recent bugs that might help us reconstruct changes,
> and cc me and lars on them
> 

5. Can you point me at the bug you mentioned for 3.6.4 at 100%?
6. How is the config pushed out to all collectors?  Could they have had different configs via a failed git push for example?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This throttle configuration is correct.  Anything identifying itself as having exactly "3.6.4" as its version string will be passed through for processing at 100%.  Since it is the first rule in the list, it is applied first; no other rules come into play for such a crash.

I do not believe that the throttling mechanism has contributed to the spike.
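For reference, a minimal sketch of the first-match-wins interpretation described above (not the actual collector code; condition handling and the acceptance draw are simplified):

import random

def condition_matches(condition, value):
    # Compiled regex, e.g. the alpha/beta pattern.
    if hasattr(condition, 'search'):
        return condition.search(value) is not None
    # Callable, e.g. the ProductName lambdas.
    if callable(condition):
        return bool(condition(value))
    # Literal string, e.g. '3.6.4'.
    return condition == value

def accept_crash(raw_crash, rules):
    # Apply (field, condition, percentage) rules in order; the first match decides.
    for field, condition, percentage in rules:
        if field is None:
            # Catch-all rule, e.g. (None, True, 0): reject everything else.
            return random.random() * 100 < percentage
        value = raw_crash.get(field)
        if value is not None and condition_matches(condition, value):
            return random.random() * 100 < percentage
    return False

Under rules like the ones pasted above, a crash whose Version is exactly "3.6.4" hits the literal rule first and is accepted at 100%, so the later 15% Firefox rule never applies to it.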
Another thing to confirm on this bug is that we don't have a pile of 3.6.4 reports in 'deferred storage.'  I think we already checked that, but that is the ultimate way to figure out whether we had a configuration-file problem or some bug in the logic that interprets the config-file rules.
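A rough way to re-check that, sketched under the assumption that deferred storage is a date-partitioned directory tree of per-crash JSON metadata files (the path and layout are assumptions):

import json
import os
from collections import Counter

DEFERRED_ROOT = '/mnt/socorro/deferred'   # hypothetical mount point

per_day = Counter()
for dirpath, _dirnames, filenames in os.walk(DEFERRED_ROOT):
    for name in filenames:
        if not name.endswith('.json'):
            continue
        with open(os.path.join(dirpath, name)) as f:
            try:
                meta = json.load(f)
            except ValueError:
                continue                   # skip truncated or corrupt metadata
        if meta.get('Version') == '3.6.4':
            # Assume the first path component under the root is the date partition.
            day = os.path.relpath(dirpath, DEFERRED_ROOT).split(os.sep)[0]
            per_day[day] += 1

for day, count in sorted(per_day.items()):
    print(day, count)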
Sorry, here are the answers.

> > 2. tell us, is this config version controlled?

It is in git on the server itself, but I am not sure we keep the repo intact for long.  Other than that, it's not in a different version control system.

> > 3. If so, check the history of recent changes

Looks like these settings for 3.6.4 were put in on 2010-04-20 20:42:03 (from git blame)

> > 4. If not, check for any recent bugs that might help us reconstruct changes,
> > and cc me and lars on them
> > 

Not sure what this means.


> 
> 5. Can you point me at the bug you mentioned for 3.6.4 at 100%?

I can't find it in my bugzilla searches; maybe lars/beltzner/chofmann have the bug number.

> 6. How is the config pushed out to all collectors?  Could they have had
> different configs via a failed git push for example?

It's pushed out using our normal sync script, which is a git pull.  Yes, it's possible we had stuck locks, and I am not sure we have alerts/logs for those.

Need anything else?
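One way to check for stuck locks after the fact, sketched with the collector host names listed below and an assumed repository path (requires ssh access from an admin host):

import subprocess

COLLECTORS = ['pm-app-collector01', 'pm-app-collector02', 'pm-app-collector03']
REPO = '/data/socorro'   # assumed checkout path

for host in COLLECTORS:
    remote_cmd = (
        f'test -e {REPO}/.git/index.lock && echo STALE-LOCK || echo no-lock; '
        f'git --git-dir={REPO}/.git rev-parse --short HEAD'
    )
    result = subprocess.run(['ssh', host, remote_cmd], capture_output=True, text=True)
    print(host, result.stdout.split())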
(In reply to comment #9)
> Sorry, here are the answers.
> 
> > > 2. tell us, is this config version controlled?
> 
> It is in git on the server itself, but I am not sure we keep the repo intact
> for long.  Other than that, it's not in a different version control system.
> 
> > > 3. If so, check the history of recent changes
> 
> Looks like these settings for 3.6.4 were put in on 2010-04-20 20:42:03 (from
> git blame)
> 

Have we made any other changes to the config since?  If so, what were the dates and times?

> 
> > 6. How is the config pushed out to all collectors?  Could they have had
> > different configs via a failed git push for example?
> 
> It's pushed out using our normal sync script, which is a git pull.  Yes, it's
> possible we had stuck locks, and I am not sure we have alerts/logs for those.
> 
> 

Can you please check?

Could you also please manually check that all boxes have the correct config right now?
The output from git sync is unfortunately sent to /dev/null.  So I don't have any history of failures (on the git-syncs on the webheads).


This is the current status on the collectors.

pm-app-collector01
  ('Version', '3.6.4', 100.0),
  ("Version", re.compile(r'\..*?[a-zA-Z]+'), 100), # 100% of all alpha, beta or special
minimalVersionForUnderstandingRefusal = cm.Option()
minimalVersionForUnderstandingRefusal.default = { 'Firefox': '3.5.4' }
pm-app-collector02
  ('Version', '3.6.4', 100.0),
  ("Version", re.compile(r'\..*?[a-zA-Z]+'), 100), # 100% of all alpha, beta or special
minimalVersionForUnderstandingRefusal = cm.Option()
minimalVersionForUnderstandingRefusal.default = { 'Firefox': '3.5.4' }
pm-app-collector03
  ('Version', '3.6.4', 100.0),
  ("Version", re.compile(r'\..*?[a-zA-Z]+'), 100), # 100% of all alpha, beta or special
minimalVersionForUnderstandingRefusal = cm.Option()
minimalVersionForUnderstandingRefusal.default = { 'Firefox': '3.5.4' }
Anything else needed from IT for this bug?
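A scripted version of the manual check above, for future reference (the config path is an assumption):

import subprocess

COLLECTORS = ['pm-app-collector01', 'pm-app-collector02', 'pm-app-collector03']
CONFIG = '/data/socorro/application/scripts/config/collectorconfig.py'  # assumed path

snippets = {}
for host in COLLECTORS:
    result = subprocess.run(['ssh', host, f'grep -n "3.6.4" {CONFIG}'],
                            capture_output=True, text=True)
    snippets[host] = result.stdout.strip()
    print(f'--- {host}')
    print(snippets[host])

if len(set(snippets.values())) > 1:
    print('WARNING: throttle rules differ across collectors')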
I note that the next set of changes we made after adding the 100% rule for 3.6.4 was on June 7, between about 1 pm and 4 pm PT, that is, immediately before we observed the spike.

Aravind, do you recall purging stuck git locks at that time?
Just adding back of the envelope numbers:
If one collector was accepting at 100% and the other two were stuck on the old configuration at 15%, giving a total of ~23k crash reports per day before the spike, then with all three collectors at 100% we would expect an increase to roughly 53k.  This is a little higher than what we actually saw, which was an increase to somewhere between 44k and 55k crash reports per day in the week after the spike.
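Making that arithmetic explicit (assuming the three collectors receive equal traffic):

# Before the fix: one collector accepting at 100%, two stuck at 15%.
pre_spike_total = 23_000                                  # ~23k 3.6.4 reports/day observed
per_collector = pre_spike_total / (1.00 + 0.15 + 0.15)    # ~17.7k submissions per collector
post_fix_total = 3 * per_collector                        # all three collectors at 100%
print(round(post_fix_total))                              # ~53,000 per day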
(In reply to comment #8)
I agree checking deferred storage (starting with 6/7) seems like the next step.
I very well might have; unfortunately I don't recall what happened that day.  The changelog indicates a lot of back and forth for hbaseSubmissionRate.default, which suggests we were dealing with some hbase-related collector issues.  These are the last login records on the servers.

collector03:
root     pts/0        v81boris.mozilla Fri Jun 11 14:48 - 14:48  (00:00)
root     pts/0        v81boris.mozilla Mon Jun  7 16:11 - 11:09 (1+18:57)
root     pts/0        v81boris.mozilla Mon Jun  7 16:09 - 16:09  (00:00)
reboot   system boot  2.6.18-164.11.1. Sat Jan  2 06:08         (161+23:33)
root     pts/0        v81boris.mozilla Sun May 30 06:41 - 06:41  (00:00)

collector01:
root     pts/0        v81boris.mozilla Thu Jun 10 14:32 - crash (-159+-6:-53
root     pts/0        v81boris.mozilla Mon Jun  7 16:09 - 16:11  (00:02)
reboot   system boot  2.6.18-164.11.1. Sat Dec 26 06:03         (177+03:46)
root     pts/0        v81boris.mozilla Sat Jun  5 03:33 - 06:44  (03:11)


collector02:
root     pts/0        v81boris.mozilla Mon Jun  7 16:06 - 16:09  (00:02)
root     tty1                          Sun May 30 06:29 - 06:29  (00:00)
...
root     pts/0        v81boris.mozilla Fri Apr 30 02:28 - down   (00:27)
root     pts/0        v81boris.mozilla Sun Apr 25 06:37 - 06:37  (00:00)
reboot   system boot  2.6.18-164.11.1. Fri Apr 23 02:37         (7+00:18)
reboot   system boot  2.6.18-164.11.1. Tue Apr 20 08:34         (9+18:21)
root     pts/0        v81boris.mozilla Tue Apr 20 06:40 - down   (01:50)
root     pts/0        v81boris.mozilla Tue Apr 20 04:31 - 06:11  (01:40)


So if we started seeing the spike around 4:00 PM on the 7th, then this theory (a cleared git lock) holds water.
Aravind, can you take a look at deferred storage?  Are there a pile of 3.6.4 reports from April 10 - June 7 in there?
From Hbase via https://bugzilla.mozilla.org/show_bug.cgi?id=573556#c1:

For 6/6:
3.6.4: Total crashes: 40263
3.6.4: Total hang crashes: 19635
3.6.4: Total plugin crashes: 4076
3.6.3: Total crashes: 1641921
(In reply to comment #18)
> Aravind, can you take a look at deferred storage?  Are there a pile of 3.6.4
> reports from April 10 - June 7 in there?

The answer to this question was yes.

The spike was in fact due to collector misconfiguration caused by stuck git locks.  Now resolved; closing.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard