Bug 411424 - Need to have a report for MTBF per build
Status: RESOLVED FIXED
Product: Socorro
Classification: Server Software
Component: General
Version: Trunk
Platform: All All
Importance: P2 normal
Target Milestone: 0.7
Assigned To: Austin King [:ozten]
QA Contact: socorro
Mentors:
http://code.google.com/p/socorro/issu...
Depends on: 470621 470622 477914
Blocks:
Reported: 2008-01-08 18:31 PST by Samuel Sidler (old account; do not CC)
Modified: 2011-12-28 10:40 PST
CC List: 11 users
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
A first cut at MTBF (24.06 KB, patch)
2008-12-05 17:01 PST, Austin King [:ozten]
no flags
A first cut at the DDL for productdims, mtbffacts, and mtbfconfig tables (2.72 KB, text/plain)
2008-12-05 17:08 PST, Austin King [:ozten]
no flags
A cleaner patch. Minor updates. (22.20 KB, patch)
2008-12-17 10:05 PST, Austin King [:ozten]
no flags
Updated with some bug fixes as well as Lars feedback (23.55 KB, patch)
2008-12-17 17:47 PST, Austin King [:ozten]
morgamic: review+

Description Samuel Sidler (old account; do not CC) 2008-01-08 18:31:11 PST
Reported by morgamic, Nov 29, 2007

Need to add a report for average mean time between failure.
Comment 1 chris hofmann 2008-03-12 15:47:07 PDT
morgamic, ok to move this to P1 now?   

let's hash out how we can compute the number and report it.

in the past I think we have just added up all the time-of-crash minus browser-start-time numbers for each black box for a specific release to come up with the total number of hours run; then we divided that number by the number of crashes.

sample size has been last 10 days, but we could switch to two weeks if we think that has some value.

for example a report like this would allow us to directly compare to 2.x releases.

 Total blackboxes in this sample:  288999
 Total unique users:  147090
 MTBF For these builds is estimated at 25.625648 hours,
 based on 273144 reports and 6999491.939167 hours of user testing
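
As a quick check of the arithmetic in the sample above, the estimate is the total hours of run time divided by the number of reports. A minimal sketch in Python; the function name is made up for illustration and is not part of Socorro:

```python
def mtbf_hours(total_hours, report_count):
    """Mean time between failures: total run time divided by crash count."""
    if report_count <= 0:
        raise ValueError("need at least one crash report")
    return total_hours / report_count


# Figures taken from the sample report above.
estimate = mtbf_hours(6999491.939167, 273144)
print(f"MTBF: {estimate:.6f} hours")
```

The result agrees with the 25.625648 hours quoted in the sample to six decimal places.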

Comment 2 chris hofmann 2008-03-12 17:27:23 PDT
and we should do any needed sanity checking and clean up of the db and the sample before we do the calculations as in bug 422549
Comment 3 chris hofmann 2008-03-19 19:19:13 PDT
ken, let's work out how we want to calculate this.

I think the old crash reporting system was basically using something like this.

-pull a sample for all blackboxes or a particular release (e.g. grab all the reports for windows, mac, and linux build numbers for the release like firefox 3.0 beta 4, or final)
-throw out any outliers like 0 or negative time since start up, or anything that looked like a duplicate submission
-add up all the time since start up times.
-divide by the number of blackboxes in the sample.
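
The steps above can be sketched roughly as follows. This is only an illustration: the `Report` fields, the build ids, and the rule that a repeated `blackbox_id` counts as a duplicate submission are assumptions, not how the old system actually worked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    blackbox_id: str     # hypothetical unique id for the submission
    build: str           # build number the report came from
    uptime_seconds: int  # time since start-up when the crash happened


def sample_mtbf_seconds(reports, builds):
    """Average uptime per blackbox for the given builds, after clean-up."""
    seen = set()
    total = 0
    count = 0
    for r in reports:
        if r.build not in builds:
            continue  # not part of this release sample
        if r.uptime_seconds <= 0:
            continue  # outlier: zero or negative time since start-up
        if r.blackbox_id in seen:
            continue  # looks like a duplicate submission
        seen.add(r.blackbox_id)
        total += r.uptime_seconds
        count += 1
    return total / count if count else None


# Made-up sample data.
reports = [
    Report("a", "2008031012", 3600),
    Report("a", "2008031012", 3600),  # duplicate
    Report("b", "2008031012", -5),    # outlier
    Report("c", "2008031012", 7200),
    Report("d", "2007010100", 100),   # different build
]
result = sample_mtbf_seconds(reports, {"2008031012"})
print(result)
```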


jay can confirm what we used in the custom report we built for past tracking.

Comment 4 chris hofmann 2008-03-19 19:23:39 PDT
sample size has been "10 days of data" previously, but there might be good reasons to move to a two-week window: we know active users drop over weekends, which might create wiggles in the reporting as a different kind of user base comes in and out.
Comment 5 chris hofmann 2008-03-19 19:33:27 PDT
and of course the holy grail on this is to tie it into AUS active user data.  We will be eye-balling a correlation between active users and total crashes received until we can get the two tied together with automation, but that shouldn't hold us up for now.

right now the fact that we are only receiving crashes from the users that "opt in" can give us a distorted view of what is going on, but having that same distortion applied across multiple releases has yielded valuable feedback...

e.g.   
14 days into beta 4 MTBF was 30.3 hours
14 days into beta 5 MTBF was 35.5 hours

so we must have fixed the right set of top crashers to improve stability and not introduced any crash regressions.   those are really the kind of numbers we are after here.

we also used to have graphs that aligned the releases to show the changes for MTBF over time since release.  it would be cool if we could also get those going again at some point

Comment 6 ken kovash 2008-04-04 11:56:08 PDT
Chofmann and I recently spoke regarding some new stats that would be useful:
(1) daily number of crashes.  this is sort of a raw/ignorant look at the data, but it could be helpful.
(2) median tbf + anything else that describes the distribution of failure time.  e.g., is one user responsible for all crashes?, is the tbf normally distributed?, etc.
(3) ratio of comments to crashes
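
A rough sketch of the distribution stats in (2), using only the Python standard library. The data is made up, and treating each crash report as one (user, uptime) pair is an assumption:

```python
from collections import Counter
from statistics import mean, median

# Made-up (user_id, uptime_hours) pairs, one per crash report.
crashes = [("u1", 2.0), ("u1", 1.0), ("u1", 3.0), ("u2", 40.0), ("u3", 4.0)]

uptimes = [hours for _, hours in crashes]
mean_tbf = mean(uptimes)      # pulled up by the single 40-hour session
median_tbf = median(uptimes)  # less sensitive to that outlier

# Is one user responsible for most of the crashes?
per_user = Counter(user for user, _ in crashes)
top_user, top_count = per_user.most_common(1)[0]
top_share = top_count / len(crashes)

print(mean_tbf, median_tbf, top_user, top_share)
```

A large gap between the mean and the median, or a high share for a single user, would suggest the plain MTBF number is being driven by a few reports.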
Comment 7 Michael Morgan [:morgamic] 2008-11-14 10:40:47 PST
We need a mockup that shows what this type of report would look like.
Comment 8 Michael Morgan [:morgamic] 2008-11-14 10:44:40 PST
Chofmann wants a graph with these properties:
* x-axis: days since releases
* y-axis: hours
* series: release versions only
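
One possible way to shape the data for such a graph, assuming a daily table of total uptime and crash counts per version. All names, dates per row, and values are illustrative only:

```python
from collections import defaultdict
from datetime import date

# Hypothetical release start dates ("day 0" for each series).
release_start = {"3.0.4": date(2008, 11, 5), "3.0.5": date(2008, 12, 10)}

# Made-up daily facts: (version, day, total uptime seconds, crash count).
daily_facts = [
    ("3.0.4", date(2008, 11, 5), 36000, 10),
    ("3.0.4", date(2008, 11, 6), 72000, 10),
    ("3.0.5", date(2008, 12, 10), 54000, 10),
]


def build_series(facts, starts):
    """Map version -> sorted list of (days since release, MTBF in hours)."""
    series = defaultdict(list)
    for version, day, uptime_seconds, crash_count in facts:
        if crash_count == 0:
            continue  # no failures that day, so no MTBF point
        x = (day - starts[version]).days
        y = uptime_seconds / crash_count / 3600.0
        series[version].append((x, y))
    return {v: sorted(points) for v, points in series.items()}


series = build_series(daily_facts, release_start)
print(series)
```

Using days since release on the x-axis lines the releases up at day 0, so their curves can be compared directly.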
Comment 9 Austin King [:ozten] 2008-12-05 17:01:05 PST
Created attachment 351635
A first cut at MTBF

Development URL:
http://aking.khan.mozilla.org/reporter/mtbf/of/Firefox/major

Screenshots:
http://people.mozilla.org/~aking/Socorro/mtbf.html

See the following attachments with the DB schema for more context.
Comment 10 Austin King [:ozten] 2008-12-05 17:08:36 PST
Created attachment 351637
A first cut at the DDL for productdims, mtbffacts, and mtbfconfig tables
Comment 11 Austin King [:ozten] 2008-12-05 17:14:30 PST
Cron Script:
When run, startMtbf.py will populate the MTBF facts table for the previous day. The date can be overridden, e.g.:
startMtbf.py -d 2008-12-01 
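
For readers unfamiliar with this pattern, here is a minimal sketch of the "previous day unless -d is given" behaviour. It is not the code from the patch: `parse_day`, the `--date` long option, and the use of argparse are assumptions for illustration.

```python
import argparse
from datetime import date, datetime, timedelta


def parse_day(argv, today=None):
    """Return the day to process: -d YYYY-MM-DD if given, else yesterday."""
    parser = argparse.ArgumentParser(description="Populate MTBF facts for one day")
    parser.add_argument("-d", "--date", help="day to process, as YYYY-MM-DD")
    args = parser.parse_args(argv)
    if args.date:
        return datetime.strptime(args.date, "%Y-%m-%d").date()
    today = today or date.today()
    return today - timedelta(days=1)


print(parse_day(["-d", "2008-12-01"]))         # explicit override
print(parse_day([], today=date(2008, 12, 6)))  # defaults to the day before
```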

Database Changes:
To see more realistic data, look at the breakpad_aking DB on Postgres on khan.mozilla.org. That DB shows realistic values in all three tables. I don't have much data to work with, so it is 5 days of data instead of several release builds on day 1 through day 30 or 60.

TODO:
I know of a couple of bugs (need indexes on tables, have a flot redisplay bug, etc.) but wanted to get a review.

Thanks.
Comment 12 Austin King [:ozten] 2008-12-17 10:05:56 PST
Created attachment 353470
A cleaner patch. Minor updates.
Comment 13 Wayne Mery (:wsmwk, use Needinfo for questions) 2008-12-17 10:12:39 PST
is the plan to have thunderbird be one of the products reported?
Comment 14 Austin King [:ozten] 2008-12-17 10:25:12 PST
I don't have a firm plan around products and versions.

If you give me versions and start dates then I will set this up.
Optionally you can give me end dates or 60 days will be default.

Example(made up data):
Thunderbird
2.0.0.19 - 12/10 - major release
2.0.0.20 - 1/10/2009 - major release
3.0a3 - 9/12 - milestone release
3.0b2pre - 11/15 - developer release

etc

I will be getting this info for Firefox from S.S, but I don't have any other data or person for any other products yet.
Comment 15 Dan Mosedale (:dmose) 2008-12-17 12:00:18 PST
Setting this up for Thunderbird would be fantastic.  

I think all the data for released version is likely to be available on the Release pages linked to from <https://wiki.mozilla.org/Releases/>.  It would be great to track all the Thunderbird 3 releases there (3.0a1, 3.0a2, 3.0a3, 3.0b1).
At least the last several Thunderbird 2 releases would be very helpful as well.

I believe our branch nightlies are 3.0b2pre and our trunk nightlies are 3.1a1pre.  gozer probably has exact start dates for those.

60 days sounds like a perfectly reasonable default to start with.

Thanks!
Comment 16 Austin King [:ozten] 2008-12-17 17:47:27 PST
Created attachment 353593
Updated with some bug fixes as well as Lars feedback
Comment 17 Michael Morgan [:morgamic] 2008-12-18 10:43:14 PST
Here are my comments for the reporter changes.

- the data should be listed in a table under the graph in case scaling makes it hard to interpret
- the major/milestone/development links shouldn't rotate, all three should be visible at all times
- text for top nav should be "Release type: Major Milestone Development"

More on table layout:
# Firefox 3.0- MTBF 13010 seconds based on 50103 crash reports of 32726 users (blackboxen) from period between 2008-08-01 and 2008-11-20
# Firefox 3.0.1- MTBF 250139 seconds based on 765446 crash reports of 496840 users (blackboxen) from period between 2008-08-01 and 2008-11-20
# Firefox 3.0 Win- MTBF 10119 seconds based on 39161 crash reports of 24196 users (blackboxen) from period between 2008-08-01 and 2008-11-20

Should be changed to:
Product | Version | OS | MTBF | # Reports | # Users | Start | End
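
To illustrate the proposed layout, a small sketch that renders the three sample lines above as rows under that header. The values are copied from the sample lines (MTBF in seconds); using "All" for rows without an OS is an assumption.

```python
HEADER = ["Product", "Version", "OS", "MTBF", "# Reports", "# Users", "Start", "End"]

# Values copied from the three sample lines; "All" for OS is an assumption.
rows = [
    ("Firefox", "3.0", "All", 13010, 50103, 32726, "2008-08-01", "2008-11-20"),
    ("Firefox", "3.0.1", "All", 250139, 765446, 496840, "2008-08-01", "2008-11-20"),
    ("Firefox", "3.0", "Win", 10119, 39161, 24196, "2008-08-01", "2008-11-20"),
]


def render(header, rows):
    """Join the header and each row with ' | ' separators, one line per row."""
    lines = [" | ".join(header)]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)


table = render(HEADER, rows)
print(table)
```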

That was UX stuff; now looking at the PHP code.
Comment 18 Michael Morgan [:morgamic] 2008-12-18 14:52:09 PST
Indentation is messed up in load_product_info().  Looks like there are tabs mixed in with spaces, so the code is littered with some indentation issues.

Question - for the zero-case (no data) seems like some of the behavior is to show an empty white box -- is that expected?

Functionally, it works for me, so let's move forward and iterate on it.
Comment 19 Austin King [:ozten] 2008-12-18 16:14:24 PST
This code is checked in and scheduled to be released tonight.
r751 with some initial configuration checked in under r753.
Comment 20 Samuel Sidler (old account; do not CC) 2008-12-18 19:04:06 PST
I'm not sure how much history you have, but I'd like to do MTBF for the following builds:
  * Firefox 3.0.3 (starting Sept 24)
  * Firefox 3.0.4 (starting Nov 5)
  * Firefox 3.0.5 (starting Dec 10)
  * All Firefox 3.0.x pre builds starting with 3.0.4pre (start these when
    3.0.[n-1] started; i.e., start 3.0.4pre on Sept 24)
  * Firefox 3.1b1 (starting Oct 7)
  * Firefox 3.1b2 (starting Dec 1)
  * All Firefox 3.1pre builds starting with 3.1b2pre (starting Oct 7)

For Thunderbird, do the following builds:
  * Thunderbird 3.0a3 (starting Oct 7)
  * Thunderbird 3.0b1 (starting Dec 2)
  * Thunderbird 3.0b1pre (starting Oct 7)
  * Thunderbird 3.0b2pre (starting Nov 28)

If you have data prior to Sept 24 (when the first one of these starts), let me know and we can add more, but this is a great start.
Comment 21 Samuel Sidler (old account; do not CC) 2008-12-18 19:06:56 PST
(In reply to comment #15)
> At least the last several Thunderbird 2 releases would be very helpful as well.

Thunderbird 2 can't be done in this style since it's Socorro dependent, but you can look at MTBF for Thunderbird 2 builds at:

  http://talkback-public.mozilla.org/reports/thunderbird/

Simply select a release (e.g., Thunderbird 2.0.0.18) and under "Smart Analysis" on the left side, select "All Platforms". MTBF appears at the top of the smart analysis report. Note: This isn't comparing apples to apples since the crash reporting is very different between 1.8 and 1.9.
Comment 22 Samuel Sidler (old account; do not CC) 2008-12-18 19:09:03 PST
Oh, and 60-day default is a good start. We can start specifying end-dates as needed later (I'll let you know what those are when we get there). Let's get this going! :)
Comment 23 Tony Mechelynck [:tonymec] 2008-12-20 00:41:44 PST
What about SeaMonkey? 2.0 alpha 1 and 2 have been released by now, so I suppose the following SeaMonkey builds (or build families) could be added to the list (subject, I suppose, to some agreed-upon time-limit such as that in comment #22).
2.0a1pre
2.0a1
2.0a2pre
2.0a2
2.0a3pre

Also, what about Firefox 3.2a1pre, which is already coming out in the form of nightlies? AFAIK, they're the only builds already being done based on Gecko 1.9.2.

Not sure how much statistical data would be available as yet, but wouldn't it be worth while to have the MTBF reports up and rolling by the time Sm 2.0 and/or Fx 3.2 are ready for a release, or maybe even for a beta?
Comment 24 Samuel Sidler (old account; do not CC) 2008-12-20 21:42:27 PST
Austin, I filed a couple of follow ups to look at since some of this is live already. See the "Depends On" field.
Comment 25 Austin King [:ozten] 2008-12-30 09:31:15 PST
(In reply to comment #23)
> What about SeaMonkey? 2.0 alpha 1 and 2 have been released by now, so I suppose
> the following SeaMonkey builds (or build families) could be added to the list
> (subject, I suppose, to some agreed-upon time-limit such as that in comment
> #22).
> 2.0a1pre
> 2.0a1
> 2.0a2pre
> 2.0a2
> 2.0a3pre
> 
> Also, what about Firefox 3.2a1pre, which is already coming out in the form of
> nightlies? AFAIK, they're the only builds already being done based on Gecko
> 1.9.2.
> 
> Not sure how much statistical data would be available as yet, but wouldn't it
> be worth while to have the MTBF reports up and rolling by the time Sm 2.0
> and/or Fx 3.2 are ready for a release, or maybe even for a beta?

I am happy to add these to the MTBF reports. I need start dates, which serve as "day 0" for calculating uptime.

I will add SeaMonkey 2.0a2 and 2.0a3pre to the top crash by url reports also.

As for 3.2a1pre, is the product Minefield or Firefox?
Comment 26 Austin King [:ozten] 2009-02-11 14:30:53 PST
I still need two more pieces of information for all the SeaMonkey builds.

1) major|milestone|dev
2) start and end dates (60 days)

I've taken a guess at these. Please fill in and confirm.

SeaMonkey 2.0a1pre, developer, ?? - ??
SeaMonkey 2.0a1, milestone, 2008-10-05, 2008-12-03
SeaMonkey 2.0a2pre, developer, ?? - ??
SeaMonkey 2.0a2, milestone, 2008-12-10, 2009-02-07
SeaMonkey 2.0a3pre, developer, ?? - ??
Comment 27 Robert Kaiser (not working on stability any more) 2009-02-12 04:56:00 PST
SeaMonkey 2.0a1pre, developer, 2007-07-09 - (60 days)
SeaMonkey 2.0a1, milestone, 2008-10-05, 2008-12-03
SeaMonkey 2.0a2pre, developer, 2008-09-25 - (60 days)
SeaMonkey 2.0a2, milestone, 2008-12-10, 2009-02-07
SeaMonkey 2.0a3pre, developer, 2008-12-02 - (60 days)
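
The two milestone windows above are consistent with a 60-day window counted inclusively, i.e. the end date is the start date plus 59 days. A small sketch of that default; the function and constant names are made up:

```python
from datetime import date, timedelta

DEFAULT_WINDOW_DAYS = 60  # default from this bug; counted inclusively here


def window_end(start, end=None, days=DEFAULT_WINDOW_DAYS):
    """Return the explicit end date, or the last day of the default window."""
    if end is not None:
        return end
    return start + timedelta(days=days - 1)


print(window_end(date(2008, 10, 5)))   # SeaMonkey 2.0a1
print(window_end(date(2008, 12, 10)))  # SeaMonkey 2.0a2
```

With these inputs the computed end dates match the 2008-12-03 and 2009-02-07 values listed for the two milestones.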
Comment 28 Austin King [:ozten] 2009-02-12 09:05:40 PST
Adding dependency on 477914 which has the SeaMonkey update. Will schedule a push with IT after SQL, shell script review.
Comment 29 Austin King [:ozten] 2009-02-23 15:17:35 PST
SeaMonkey 2.0a1pre, developer, 2007-07-09 - (60 days)
is either wrong, or we don't have data for it back in 2007.
http://crash-stats.mozilla.com/?do_query=1&product=SeaMonkey&version=SeaMonkey%3A2.0a1pre&query_search=signature&query_type=contains&query=&date=2007-07-19&range_value=1&range_unit=weeks
Comment 30 Robert Kaiser (not working on stability any more) 2009-03-04 05:44:23 PST
(In reply to comment #29)
> SeaMonkey 2.0a1pre, developer, 2007-07-09 - (60 days)
> is either wrong, or we don't have data for it back in 2007.

I got to this date by trying to find out since when SeaMonkey had crashreporter support, but it may not have worked correctly from the start. Can we find out when we got the first SeaMonkey 2.0a1pre crash reports and start the window with that?

Also, we started the 2.0b1pre dev cycle on 2009-02-19 and released the 2.0a3 milestone yesterday, what's the process for getting those added?
Comment 31 Austin King [:ozten] 2009-06-23 08:40:11 PDT
Please open a new bug for MTBF entries.
Comment 32 chris hofmann 2010-06-25 10:16:03 PDT
not sure we are still planning to do this, but it appears that we also have values like

  "Install Age"	7057413 seconds (11.7 weeks) since version was first installed.

We should also integrate that into the calculation, or into a parallel metric that reports the lower value of TimeSinceLastCrash or InstallAge to produce MTBF_For_Current_Build

This would be a bit different number than total MTBF, but also useful for understanding the time between failures on individual builds.
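
A minimal sketch of that idea, assuming both values are in seconds and that the per-build metric is a plain average of the capped values. The function names are made up, and the averaging step is one interpretation of the proposal rather than a settled definition.

```python
def uptime_for_current_build(time_since_last_crash, install_age):
    """Per-report time credited to the current build, in seconds."""
    return min(time_since_last_crash, install_age)


def mtbf_for_current_build(reports):
    """Average capped value over (time_since_last_crash, install_age) pairs."""
    capped = [uptime_for_current_build(t, a) for t, a in reports]
    return sum(capped) / len(capped) if capped else None


# Made-up pairs, except 7057413, which is the Install Age quoted above.
reports = [(9000000, 7057413), (3600, 7057413)]
value = mtbf_for_current_build(reports)
print(value)

# Sanity check on the quoted figure: 7057413 seconds expressed in weeks.
install_age_weeks = round(7057413 / (7 * 24 * 3600), 1)
print(install_age_weeks)
```

The cap only changes reports where the time since the last crash is longer than the build has been installed, which is the case the comment is concerned with.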
