We processed 315,440 plugin crash and hang pairs yesterday. We expected volume to increase, but the actual volume is above projections as we get deeper into updating the user base to 3.6.4.

This is the tracking bug for things we need to do to monitor and keep pace with the volume of plugin hang/crash pairs.

First actions were taken last night, with bugs to increase the throttling rate [Bug 574562] and add more processors [Bug 574553], but we are probably going to need more.

We need a full day of today's data running at 10% throttling to see what the current run rate is at that level. We also had 30 million ADUs, so we are only at about one half, or maybe one third, of the load we should expect soon as more users get on 3.6.4.

Some more initial data is at http://people.mozilla.com/~chofmann/crash-stats/20100624/crash-counts.txt
The range of actions could include more throttling, more processors, and even slowing down the push of 3.6.4 until we get Socorro scaled right. Let's use this bug to track other ideas and spin off bugs as needed.
I think there might be a ceiling to the number of processors we can add while:
a) we're still pushing stuff to PostgreSQL (ending in 1.9)
b) we're having trouble with HBase connections timing out (ending ???)

Once we have today's numbers, let's see where we stand and review.
I'm still trying to figure out the best way to express the change in dynamics, and we need a few more days of stable data to really understand what is going on, but here is a first crack.

We might have gone from about 1.5%-1.8% of our ADUs submitting hang/crash pair reports to somewhere between 6% and 10% of our ADUs making these submissions, as we have ramped from fewer than a million beta testers to 30 million users yesterday.

date      adus      #pair_rprts  pair_rprts/(adus * throttle)
<previous dates look mostly like 06/19-21>
20100619    751961   19313       0.018 *
20100620    747986   18783       0.018 *
20100621    915614   18635       0.014 *
20100622    956628   50323       0.041 *
20100623   5316613  222176       0.024 **
20100624  30708389  315440       0.068 **  0.102 ***

*   no throttle
**  15% throttle
*** 10% throttle

I think June 22 might be the day of release, so it's difficult to align ADU numbers with the start of the update, and that accounts for the erratic count. June 23 also has the effect of no throttling on the crash/hang pairs until Bug 574250 was fixed and we ignored the throttable=0 flag, so that data is erratic. June 24 was a pretty good day of data collection except for the throttle adjustment in Bug 574562.

The table above (if it holds up) helps make the case for a slower ramp-up in release deployments, so we can understand inflection points better and figure out the size of the beta test pool we really need to understand the impact of some changes and different browsing behavior patterns. It also makes the case for building a bigger beta audience, like we are doing in Firefox 4.
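For anyone who wants to redo the last column, here is a minimal Python sketch of the pair_rprts/(adus * throttle) calculation for the June 24 row. The function and variable names are made up for illustration and are not part of Socorro; it assumes the throttle is simply the fraction of submissions that got processed.

```python
def normalized_pair_rate(pair_reports, adus, throttle):
    """Estimate the share of ADUs submitting hang/crash pairs.

    pair_reports: pairs actually processed that day
    adus:         active daily users that day
    throttle:     assumed fraction of submissions accepted (1.0 = no throttle)
    """
    return pair_reports / (adus * throttle)


# June 24 figures taken from the table in this comment.
JUNE_24_ADUS = 30708389
JUNE_24_PAIRS = 315440

# The throttle changed during the day, so compute both bounds.
rate_at_15 = normalized_pair_rate(JUNE_24_PAIRS, JUNE_24_ADUS, 0.15)
rate_at_10 = normalized_pair_rate(JUNE_24_PAIRS, JUNE_24_ADUS, 0.10)

print(f"15% throttle: {rate_at_15:.4f}")
print(f"10% throttle: {rate_at_10:.4f}")
```

The two results bracket the 6% to 10% range quoted above. Because the throttle was adjusted partway through June 24, the real figure presumably sits somewhere between them.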
The change in 3.6.6 to increase the timeout looks like it eliminates the need for any more action here. http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/3.6.6/1 looks a lot like http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/3.6.3/1 now.
Component: Socorro → General
Product: Webtools → Socorro
we have achieved scale
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED