scaling strategies and actions for increased volume of plugin crash and hang pairs

Status: RESOLVED FIXED
Reported: 8 years ago
Last Resolved: a year ago
Reporter: chofmann
Assignee: Unassigned
Version: Trunk
Platform: x86 Mac OS X
Firefox Tracking Flags: (Not tracked)

(Reporter)

Description

8 years ago
We processed 315,440 plugin crash/hang pair reports yesterday. We expected volume increases, but the actual volume is above projections as we get deeper into the update of the user base to 3.6.4.

This is the tracking bug for the things we need to do to monitor and keep pace with the volume of plugin hang/crash pairs.

The first actions were taken last night, with bugs filed to increase the throttling rate [Bug 574562] and add more processors [Bug 574553].

But we are probably going to need more.


We need a full day of today's data running at 10% throttling to see what the current run rate is at that level. We also had 30 million ADUs yesterday, so we are only at one half to one third of the load we should expect to reach soon as more users move to 3.6.4.
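
To make that projection concrete, here is a rough back-of-the-envelope sketch (Python). The 2x-3x rollout multipliers come from the "half way, or 1/3" estimate above, the observed figures are the 2010-06-24 numbers, and the calculation is purely illustrative, not part of Socorro.

# Rough projection of daily plugin crash/hang pair volume as 3.6.4 rolls out.
observed_adus = 30_708_389   # ADUs on 2010-06-24
observed_pairs = 315_440     # pair reports processed that day
throttle = 0.10              # collector throttle after Bug 574562 (10%)

# Implied per-ADU submission rate at the current throttle level.
submission_rate = observed_pairs / (observed_adus * throttle)

# Full 3.6.4 deployment is estimated above at roughly 2-3x current ADUs.
for rollout_factor in (2, 3):
    projected_adus = observed_adus * rollout_factor
    projected_pairs = projected_adus * submission_rate * throttle
    print(f"{rollout_factor}x rollout: ~{projected_pairs:,.0f} pair reports/day")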

Some more initial data is at:

http://people.mozilla.com/~chofmann/crash-stats/20100624/crash-counts.txt
(Reporter)

Comment 1

8 years ago
The range of actions could include more throttling, more processors, and even slowing down the push of 3.6.4 until we get Socorro scaled right. Let's use this bug to track other ideas and spin off bugs as needed.
Depends on: 574562, 574553
(Reporter)

Updated

8 years ago
Depends on: 574250

Comment 2

8 years ago
I think there might be a ceiling to the number of processors we can add while:
a) we're still pushing stuff to PostgreSQL (ending in 1.9)
b) we're having trouble with HBase connections timing out (ending ???)

Once we have today's numbers, let's see where we stand and review.
(Reporter)

Comment 3

8 years ago
I'm still trying to figure out the best way to express the change in dynamics, and we need a few more days of stable data to really understand what is going on, but here is a first crack.

We might have gone from about 1.5%-1.8% of our ADUs submitting hang/crash pair reports to somewhere between 6% and 10% of our ADUs making these submissions, as we have ramped from less than a million beta testers to 30 million users yesterday.

date        adus        #pair_rprts   pair_rprts / (adus * throttle)
<previous dates look mostly like 06/19-21>
20100619       751961      19313      0.018 *
20100620       747986      18783      0.018 *
20100621       915614      18635      0.014 *
20100622       956628      50323      0.041 *
20100623      5316613     222176      0.024 **
20100624     30708389     315440      0.068 **   0.102 ***

*   no throttle
**  15% throttle
*** 10% throttle

I think June 22 might be the day of release, so it's difficult to align ADU numbers to the start of the update, which accounts for that erratic count.

June 23 also has the effect of no throttling on the crash/hang pairs until Bug 574250 was fixed and we started ignoring the throttleable=0 flag, so that data is erratic.
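
A minimal sketch of the distinction being described, under one reading of the comment above: the field name "Throttleable" and the helper below are hypothetical illustrations, not Socorro's actual collector code.

import random

def should_accept(crash_meta, throttle_rate=0.10, honor_throttleable=False):
    """Hypothetical collector-side sampling decision.

    When honor_throttleable is True, reports marked throttleable=0
    (e.g. plugin hang/crash pairs) bypass sampling entirely, which is
    the unthrottled-pair behavior described before the Bug 574250 fix;
    ignoring the flag samples everything at the configured rate.
    """
    if honor_throttleable and crash_meta.get("Throttleable") == "0":
        return True   # pair reports always stored, regardless of throttle
    return random.random() < throttle_rate

# Honoring the flag: a hang pair is always kept.
print(should_accept({"Throttleable": "0"}, honor_throttleable=True))   # True
# Ignoring the flag: it is sampled like everything else (kept ~10% of the time).
print(should_accept({"Throttleable": "0"}))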

June 24 was a pretty good day of data collection, except for the throttle adjustment in Bug 574562.

That table above (if it holds up) helps make the case for a slower ramp-up in release deployments, so we can understand inflection points better and figure out the size of the beta test pool we really need to understand the impact of changes and different browsing behavior patterns. It also makes the case for our need to build a bigger beta audience, like we are doing for Firefox 4.
(Reporter)

Comment 5

8 years ago
The change in 3.6.6 to increase the timeout looks like it eliminates the need for any more actions here.

http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/3.6.6/1

looks a lot like

http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/3.6.3/1

now.
(Assignee)

Updated

7 years ago
Component: Socorro → General
Product: Webtools → Socorro

Comment 6

a year ago
We have achieved scale.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED