Reported by morgamic, Nov 29, 2007: Need to add a report for mean time between failures (MTBF).
morgamic, ok to move this to P1 now? Let's hash out how we can compute the number and report it. In the past I think we have just added up all the time-of-crash minus browser-start-time numbers for each blackbox for a specific release to come up with the total number of hours run; then we divided that number by the number of crashes. The sample size has been the last 10 days, but we could switch to two weeks if we think that has some value. A report like this would allow us to directly compare to 2.x releases. For example:

Total blackboxes in this sample: 288999
Total unique users: 147090
MTBF for these builds is estimated at 25.625648 hours, based on 273144 reports and 6999491.939167 hours of user testing
We should also do any needed sanity checking and cleanup of the DB and the sample before we do the calculations, as in bug 422549.
ken, let's work out how we want to calculate this. I think the old crash reporting system was basically doing something like this:
- pull a sample of all blackboxes for a particular release (e.g. grab all the reports for the Windows, Mac, and Linux build numbers of a release like Firefox 3.0 beta 4, or final)
- throw out any outliers, like zero or negative time since startup, or anything that looks like a duplicate submission
- add up all the time-since-startup values
- divide by the number of blackboxes in the sample
jay can confirm what we used in the custom report we built for past tracking.
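The steps above can be sketched roughly as follows. This is an illustrative sketch only, not the actual Socorro code; the report tuples and field names are made up for the example.

```python
# Hypothetical sketch of the MTBF calculation described above.
# Input: (blackbox_id, uptime_seconds) pairs, one per crash report.

def calc_mtbf(reports):
    """Return estimated MTBF in hours, or None if no usable reports."""
    seen = set()
    total_uptime = 0.0
    crashes = 0
    for blackbox_id, uptime in reports:
        # throw out outliers: zero/negative uptime, duplicate submissions
        if uptime <= 0:
            continue
        key = (blackbox_id, uptime)
        if key in seen:
            continue
        seen.add(key)
        total_uptime += uptime
        crashes += 1
    if crashes == 0:
        return None
    hours = total_uptime / 3600.0
    return hours / crashes  # mean hours between failures

sample = [("bb1", 7200), ("bb2", 3600), ("bb2", 3600), ("bb3", -5)]
print(calc_mtbf(sample))  # 1.5 hours (10800 s over 2 usable crashes)
```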
The sample size has been "10 days of data" previously, but there might be good reasons to move to a two-week window, since we know active users drop over weekends, and a different kind of user base coming in and out on weekends might create wiggles in the reporting.
And of course the holy grail on this is to tie it into AUS active user data. We will be eye-balling a correlation between active users and total crashes received until we can get the two tied together with automation, but that shouldn't hold us up for now. Right now the fact that we are only receiving crashes from the users that "opt in" can give us a distorted view of what is going on, but having that same distortion applied across multiple releases has yielded valuable feedback. E.g.:

14 days into beta 4, MTBF was 30.3 hours
14 days into beta 5, MTBF was 35.5 hours

so we must have fixed the right set of topcrashers to improve stability and not introduced any crash regressions. Those are really the kind of numbers we are after here. We also used to have graphs that aligned the releases to show the changes in MTBF over time since release. It would be cool if we could also get those going again at some point.
Chofmann and I recently spoke regarding some new stats that would be useful:
(1) daily number of crashes. This is sort of a raw/ignorant look at the data, but it could be helpful.
(2) median TBF, plus anything else that describes the distribution of failure time. E.g., is one user responsible for all crashes? Is the TBF normally distributed?
(3) ratio of comments to crashes
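For illustration, the three stats above could be computed along these lines. The crash dicts here are hypothetical stand-ins, not the real Socorro report format.

```python
# Sketch of the extra stats discussed above: daily counts, median
# time-between-failure, per-user concentration, and comment ratio.
from collections import Counter
from statistics import median

crashes = [
    {"day": "2008-12-01", "user": "u1", "uptime": 600, "comment": "it died"},
    {"day": "2008-12-01", "user": "u1", "uptime": 1200, "comment": ""},
    {"day": "2008-12-02", "user": "u2", "uptime": 3600, "comment": "crash on load"},
]

# (1) raw daily crash counts
daily = Counter(c["day"] for c in crashes)
# (2) median TBF is more robust to outliers than the mean
median_tbf = median(c["uptime"] for c in crashes)
# is one user responsible for most crashes?
per_user = Counter(c["user"] for c in crashes)
# (3) fraction of reports that carry a user comment
comment_ratio = sum(1 for c in crashes if c["comment"]) / len(crashes)

print(daily, median_tbf, per_user, comment_ratio)
```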
We need a mockup that shows what this type of report would look like.
Chofmann wants a graph with these properties:
* x-axis: days since release
* y-axis: hours
* series: release versions only
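Shaping the facts for that graph mostly means re-indexing each release's data points by days since that release's start date, one series per version. A minimal sketch, with hypothetical field names and dates:

```python
# Shape MTBF facts for the requested graph: one series per release,
# x = days since that release's start date, y = MTBF in hours.
from datetime import date

# (version, release_start, fact_day, mtbf_hours) -- example values only
facts = [
    ("Firefox 3.0", date(2008, 6, 17), date(2008, 6, 18), 25.6),
    ("Firefox 3.0", date(2008, 6, 17), date(2008, 6, 19), 26.1),
    ("Firefox 3.1b2", date(2008, 12, 1), date(2008, 12, 2), 30.3),
]

series = {}
for version, release_day, fact_day, mtbf_hours in facts:
    x = (fact_day - release_day).days  # x-axis: days since release
    series.setdefault(version, []).append((x, mtbf_hours))

print(series)
```

Aligning releases on days-since-release is what makes two releases directly comparable on one plot, as in the beta 4 vs. beta 5 comparison above.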
Created attachment 351635 [details] [diff] [review] A first cut at MTBF

Development URL: http://aking.khan.mozilla.org/reporter/mtbf/of/Firefox/major
Screenshots: http://people.mozilla.org/~aking/Socorro/mtbf.html

See the following attachments with the DB schema for more context.
Created attachment 351637 [details] A first cut at the DDL for productdims, mtbffacts, and mtbfconfig tables
Cron Script: When run, startMtbf.py will populate the MTBF facts table for the previous day. The date can be overridden, e.g. startMtbf.py -d 2008-12-01

Database Changes: To see more realistic data, look at the breakpad_aking DB on Postgres on khan.mozilla.org. That DB shows realistic values in all three tables. I don't have much data to work with, so it is 5 days of data instead of several release builds on day 1 through day 30 or 60.

TODO: I know of a couple of bugs (need indexes on tables, have a flot redisplay bug, etc.) but wanted to get a review. Thanks.
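The shape of what a daily cron like startMtbf.py does can be sketched as below. This uses SQLite and heavily simplified stand-in tables for illustration; the real Postgres schema (productdims, mtbffacts, mtbfconfig) is in the DDL attachment.

```python
# Minimal sketch of a daily MTBF fact-population job, assuming
# simplified stand-in tables rather than the real Socorro schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE reports (product TEXT, version TEXT, day TEXT, uptime INTEGER);
CREATE TABLE mtbffacts (product TEXT, version TEXT, day TEXT,
                        avg_seconds REAL, report_count INTEGER);
""")
db.executemany("INSERT INTO reports VALUES (?,?,?,?)", [
    ("Firefox", "3.0", "2008-12-01", 3600),
    ("Firefox", "3.0", "2008-12-01", 7200),
    ("Firefox", "3.0", "2008-12-01", -1),   # outlier, excluded below
])

def populate_day(day):
    # aggregate one day's reports into one fact row per product/version,
    # skipping zero/negative uptimes
    db.execute("""
        INSERT INTO mtbffacts
        SELECT product, version, day, AVG(uptime), COUNT(*)
        FROM reports WHERE day = ? AND uptime > 0
        GROUP BY product, version
    """, (day,))

populate_day("2008-12-01")
print(db.execute("SELECT * FROM mtbffacts").fetchall())
```

Precomputing one fact row per product/version/day keeps the report page fast: the graph reads a few hundred small rows instead of scanning millions of raw crash reports.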
Created attachment 353470 [details] [diff] [review] A cleaner patch. Minor updates.
Is the plan to have Thunderbird be one of the products reported?
I don't have a firm plan around products and versions. If you give me versions and start dates then I will set this up. Optionally you can give me end dates, or 60 days will be the default. Example (made-up data):

Thunderbird
22.214.171.124 - 12/10 - major release
126.96.36.199 - 1/10/2009 - major release
3.0a3 - 9/12 - milestone release
3.0b2pre - 11/15 - developer release
etc.

I will be getting this info for Firefox from S.S., but I don't have any other data or a contact person for any other products yet.
Setting this up for Thunderbird would be fantastic. I think all the data for released versions is likely to be available on the release pages linked to from <https://wiki.mozilla.org/Releases/>. It would be great to track all the Thunderbird 3 releases there (3.0a1, 3.0a2, 3.0a3, 3.0b1). At least the last several Thunderbird 2 releases would be very helpful as well. I believe our branch nightlies are 3.0b2pre and our trunk nightlies are 3.1a1pre; gozer probably has exact start dates for those. 60 days sounds like a perfectly reasonable default to start with. Thanks!
Created attachment 353593 [details] [diff] [review] Updated with some bug fixes as well as Lars feedback
Here are my comments for the reporter changes.
- the data should be listed in a table under the graph in case scaling makes it hard to interpret
- the major/milestone/development links shouldn't rotate; all three should be visible at all times
- text for top nav should be "Release type: Major Milestone Development"

More on table layout. The current output:

# Firefox 3.0 - MTBF 13010 seconds based on 50103 crash reports of 32726 users (blackboxen) from the period between 2008-08-01 and 2008-11-20
# Firefox 3.0.1 - MTBF 250139 seconds based on 765446 crash reports of 496840 users (blackboxen) from the period between 2008-08-01 and 2008-11-20
# Firefox 3.0 Win - MTBF 10119 seconds based on 39161 crash reports of 24196 users (blackboxen) from the period between 2008-08-01 and 2008-11-20

should be changed to a table with columns:

Product | Version | OS | MTBF | # Reports | # Users | Start | End

That was UX stuff; looking at the PHP code next.
Indentation is messed up in load_product_info(). Looks like there are tabs mixed in with spaces, so the code is littered with some indentation issues.

Question: for the zero case (no data), some of the behavior seems to be to show an empty white box -- is that expected?

Functionally, it works for me, so let's move forward and iterate on it.
This code is checked in and scheduled to be released tonight. r751 with some initial configuration checked in under r753.
I'm not sure how much history you have, but I'd like to do MTBF for the following builds:
* Firefox 3.0.3 (starting Sept 24)
* Firefox 3.0.4 (starting Nov 5)
* Firefox 3.0.5 (starting Dec 10)
* All Firefox 3.0.x pre builds starting with 3.0.4pre (start these when 3.0.[n-1] started; i.e., start 3.0.4pre on Sept 24)
* Firefox 3.1b1 (starting Oct 7)
* Firefox 3.1b2 (starting Dec 1)
* All Firefox 3.1pre builds starting with 3.1b2pre (starting Oct 7)

For Thunderbird, do the following builds:
* Thunderbird 3.0a3 (starting Oct 7)
* Thunderbird 3.0b1 (starting Dec 2)
* Thunderbird 3.0b1pre (starting Oct 7)
* Thunderbird 3.0b2pre (starting Nov 28)

If you have data prior to Sept 24 (when the first one of these starts), let me know and we can add more, but this is a great start.
(In reply to comment #15)
> At least the last several Thunderbird 2 releases would be very helpful as well.

Thunderbird 2 can't be done in this style since this report is Socorro-dependent, but you can look at MTBF for Thunderbird 2 builds at:
http://talkback-public.mozilla.org/reports/thunderbird/

Simply select a release (e.g., Thunderbird 188.8.131.52) and under "Smart Analysis" on the left side, select "All Platforms". MTBF appears at the top of the smart analysis report.

Note: This isn't comparing apples to apples since the crash reporting is very different between 1.8 and 1.9.
Oh, and 60-day default is a good start. We can start specifying end-dates as needed later (I'll let you know what those are when we get there). Let's get this going! :)
What about SeaMonkey? 2.0 alpha 1 and 2 have been released by now, so I suppose the following SeaMonkey builds (or build families) could be added to the list (subject, I suppose, to some agreed-upon time limit such as that in comment #22):
2.0a1pre
2.0a1
2.0a2pre
2.0a2
2.0a3pre

Also, what about Firefox 3.2a1pre, which is already coming out in the form of nightlies? AFAIK, they're the only builds already being done based on Gecko 1.9.2. Not sure how much statistical data would be available as yet, but wouldn't it be worthwhile to have the MTBF reports up and rolling by the time SeaMonkey 2.0 and/or Firefox 3.2 are ready for a release, or maybe even for a beta?
Austin, I filed a couple of follow ups to look at since some of this is live already. See the "Depends On" field.
(In reply to comment #23)
> What about SeaMonkey? 2.0 alpha 1 and 2 have been released by now, so I
> suppose the following SeaMonkey builds (or build families) could be added
> to the list (subject, I suppose, to some agreed-upon time-limit such as
> that in comment #22).
> 2.0a1pre
> 2.0a1
> 2.0a2pre
> 2.0a2
> 2.0a3pre
>
> Also, what about Firefox 3.2a1pre, which is already coming out in the form
> of nightlies?

I am happy to add these to the MTBF reports. I need start dates, which serve as "day 0" for calculating uptime. I will add SeaMonkey 2.0a2 and 2.0a3pre to the top crash by URL reports also.

As for 3.2a1pre, is the product Minefield or Firefox?
I still need two more pieces of information for all the SeaMonkey builds:
1) major|milestone|dev
2) start and end dates (60 days)

I've taken a guess at these. Please fill in and confirm.
SeaMonkey 2.0a1pre, developer, ?? - ??
SeaMonkey 2.0a1, milestone, 2008-10-05 - 2008-12-03
SeaMonkey 2.0a2pre, developer, ?? - ??
SeaMonkey 2.0a2, milestone, 2008-12-10 - 2009-02-07
SeaMonkey 2.0a3pre, developer, ?? - ??
SeaMonkey 2.0a1pre, developer, 2007-07-09 - (60 days)
SeaMonkey 2.0a1, milestone, 2008-10-05 - 2008-12-03
SeaMonkey 2.0a2pre, developer, 2008-09-25 - (60 days)
SeaMonkey 2.0a2, milestone, 2008-12-10 - 2009-02-07
SeaMonkey 2.0a3pre, developer, 2008-12-02 - (60 days)
Adding a dependency on bug 477914, which has the SeaMonkey update. Will schedule a push with IT after SQL and shell script review.
SeaMonkey 2.0a1pre, developer, 2007-07-09 - (60 days)
is either wrong, or we don't have data for it back in 2007:
http://crash-stats.mozilla.com/?do_query=1&product=SeaMonkey&version=SeaMonkey%3A2.0a1pre&query_search=signature&query_type=contains&query=&date=2007-07-19&range_value=1&range_unit=weeks
(In reply to comment #29)
> SeaMonkey 2.0a1pre, developer, 2007-07-09 - (60 days)
> is either wrong, or we don't have data for it back in 2007.

I got to this date by trying to find out since when SeaMonkey has had crashreporter support, but it may not have worked correctly from the start. Can we find out when we got the first SeaMonkey 2.0a1pre crash reports and start the window with that?

Also, we started the 2.0b1pre dev cycle on 2009-02-19 and released the 2.0a3 milestone yesterday; what's the process for getting those added?
Please open a new bug for MTBF entries.
Not sure we are still planning to do this, but it appears that we also have values like "Install Age": 7057413 seconds (11.7 weeks) since the version was first installed. We should also integrate that into the calculation, or add a parallel metric that reports the lower value of TimeSinceLastCrash or InstallAge to produce MTBF_For_Current_Build. This would be a bit different number than total MTBF, but also useful for understanding the time between failures on individual builds.
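The per-build variant suggested above could be sketched as follows. This is an assumption-laden illustration, not the shipped code; the field names and the helper are hypothetical.

```python
# Sketch of MTBF_For_Current_Build: cap each report's uptime contribution
# at the install age, so time run on older builds doesn't inflate the
# current build's MTBF. Input pairs are hypothetical report fields.

def mtbf_for_current_build(reports):
    """reports: iterable of (time_since_last_crash_s, install_age_s) pairs.
    Returns estimated per-build MTBF in hours, or None if no usable data."""
    usable = [min(tslc, age) for tslc, age in reports if tslc > 0 and age > 0]
    if not usable:
        return None
    return (sum(usable) / 3600.0) / len(usable)  # hours per crash

# e.g. one user installed this build an hour ago, another 11.7 weeks ago
print(mtbf_for_current_build([(7057413, 3600), (7200, 7057413)]))  # 1.5
```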