Closed Bug 424960 Opened 16 years ago Closed 16 years ago

only enable crash reporting for X% of release builds

Categories

(Toolkit :: Crash Reporting, defect, P2)

RESOLVED FIXED

People

(Reporter: ted, Assigned: ted)

Details

Attachments

(2 files, 1 obsolete file)

Currently Breakpad crash reporting is enabled for 100% of builds. With Talkback, we disabled crash reporting on all but 10% of installs by default, allowing the user to opt-in from the installer. Given the number of users we expect to be using the release, it doesn't make sense to collect 100% of reports.

One proposal from an email thread was to simply disable the "[ ] Tell Mozilla about this crash" checkbox by default (100 - X) % of the time. This would allow us to still provide a restart path for users (and avoid inferior OS crash reporter dialogs), as well as allow a simple path for users to opt-in to sending reports. The only downside to this is that it would allow for extremely easy opt-in, and persist the user's decision, so the percentage of users sending crash reports could grow very quickly.
Flags: blocking1.9?
Attached patch: wip patch for windows/linux (obsolete)
WIP, works fine on Windows, didn't test Linux yet. I'm not quite sure how to do Mac, since we persist that value via interface builder.
We need to resolve throttling on server or client one way or another.  I'm a fan of the server-side throttling - but then again I'm not doing any of the work :-).
Flags: blocking1.9? → blocking1.9+
Priority: -- → P2
(In reply to comment #2)
> I'm a fan of the server-side throttling

By that you surely don't mean "don't accept (100 - X)% of reports sent in," right?  

If a user submits a crash report and then goes to file a bug (or comment on an existing bug) and then finds the crash he thought he reported was not reported, it's going to be 1) confusing and 2) dataloss of that crash data.  1) is bad because we want the reporter to be considered reliable; 2) is bad because we never know what user is going to have that key piece of data, and might cause Bugzilla churn closing bugs filed in anticipation of crash reports that don't exist.

If that's not what you meant, then just ignore that paragraph ;)

(In reply to comment #1)
> I'm not quite sure how to do Mac, since we persist that value via interface builder.

I'm not sure I follow; does that checkbox state not persist on other platforms?
(In reply to comment #3)
> (In reply to comment #2)
> > I'm a fan of the server-side throttling
> 
> By that you surely don't mean "don't accept (100 - X)% of reports sent in,"
> right?  

No - I mean store 100% of reports and process a subset. Storage and acceptance is cheap. Processing is expensive.
 
> If a user submits a crash report and then goes to file a bug (or comment on an
> existing bug) and then finds the crash he thought he reported was not reported,
> it's going to be 1) confusing and 2) dataloss of that crash data.  1) is bad
> because we want the reporter to be considered reliable; 2) is bad because we
> never know what user is going to have that key piece of data, and might cause
> Bugzilla churn closing bugs filed in anticipation of crash reports that don't
> exist.

I suggested that if anyone ever requested a particular crash ID that hadn't been processed, it would get added to the queue and processed. This was deemed "hard". Again, just my recommendations :-)
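For illustration, the "store everything, process a subset, process on request" idea from this comment could be sketched roughly as follows. This is a minimal in-memory sketch, not the actual server code: the class name CrashStore, the process_percent parameter, and the placeholder processing step are all assumptions made for the example.

```python
import random


class CrashStore:
    """Hypothetical store: accepts every report, processes only a sample."""

    def __init__(self, process_percent, rng=None):
        self.process_percent = process_percent  # 0..100, assumed knob
        self.rng = rng or random.Random()
        self.raw = {}        # crash id -> raw report (always kept)
        self.processed = {}  # crash id -> processed result (sampled)

    def submit(self, crash_id, report):
        # Storage and acceptance are cheap, so every report is stored.
        self.raw[crash_id] = report
        # Processing is expensive, so only a random subset is processed now.
        if self.rng.random() * 100 < self.process_percent:
            self._process(crash_id)

    def _process(self, crash_id):
        # Placeholder for the expensive processing step.
        self.processed[crash_id] = "processed:" + self.raw[crash_id]

    def lookup(self, crash_id):
        # Unknown id: the report was never submitted.
        if crash_id not in self.raw:
            return None
        # A stored but unprocessed report is processed on demand.
        if crash_id not in self.processed:
            self._process(crash_id)
        return self.processed[crash_id]
```

With this shape, a user who looks up a crash ID always finds the report, even if it was skipped by the sampler at submission time.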
(In reply to comment #3)
> I'm not sure I follow; does that checkbox state not persist on other platforms?

Yes, but we handle it manually in the code. On Mac, it's just bound to "Shared Defaults.values.submitReport" from IB. Is there a way to detect when that value is unset and set it appropriately?

Schrep: I think that server-side throttling would be nicer, but I do think we'll run into storage and management issues. I think this patch will be pretty low-impact, and allow people to easily opt-in, so that testers/developers shouldn't have to worry about having their reports sent. My only concern is that it might make it *too* easy to opt-in, so we'll still have to be prepared for a pretty high volume of reports.

(In reply to comment #5)
> (In reply to comment #3)
> > I'm not sure I follow; does that checkbox state not persist on other platforms?
> 
> Yes, but we handle it manually in the code. On Mac, it's just bound to "Shared
> Defaults.values.submitReport" from IB. Is there a way to detect when that value
> is unset and set it appropriately?

Presumably there's a nice way to do that from code, but you'd have to consult a real Cocoa-head. All IB should really be doing there is writing the state to the user defaults (and reading the defaults when launching). So, in the worst case, you could run "defaults read org.mozilla.crashreporter submitReport" and then "defaults write org.mozilla.crashreporter submitReport [-bool|-int]" with the other value to manipulate your X% default. (I'm not really sure how that's being done, so my comment could be completely off in left field.)
Yeah, I realized that I could just do it manually the other day. I'll whip up a comprehensive patch in a bit.
Ok, this works on Mac as well.
Attachment #311590 - Attachment is obsolete: true
Attachment #314124 - Flags: review?(benjamin)
Attachment #314124 - Flags: review?(benjamin) → review+
Comment on attachment 314124
complete patch [checked in]

Checked this in. I guess we should use this option on Windows trunk, just to make sure it doesn't have any ill effects. I'll attach a patch for that in a sec.
Attachment #314124 - Attachment description: complete patch → complete patch [checked in]
This will only enable the "submit" checkbox by default 25% of the time on Windows builds. This only takes effect if there's no existing value, so existing nightly testers will not be affected, only new users. Note that you can still check the box at any time if it's unchecked to submit a report. (Or uncheck it if it's checked, of course.)
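The behaviour described here (random default only when nothing is stored yet) can be sketched as below. This is an illustrative sketch, not the checked-in patch: the function name initial_submit_default and its parameters are hypothetical, and the real crash reporter client is not written in Python.

```python
import random


def initial_submit_default(stored_value, enable_percent, rng=random):
    """Return the initial state of the "submit" checkbox.

    stored_value: True/False if a value was persisted earlier, else None.
    enable_percent: 0..100, chance that the box starts checked for a new user.
    """
    # An existing value always wins, so existing testers are unaffected.
    if stored_value is not None:
        return stored_value
    # No stored value: check the box enable_percent of the time.
    return rng.random() * 100 < enable_percent
```

For example, initial_submit_default(None, 25) starts checked for roughly a quarter of new users, while initial_submit_default(True, 25) always returns True.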
Attachment #314177 - Flags: review?(robert)
Comment on attachment 314177
set enable percent to 25% on fx-win32-tbox

Should this be on for release builds too or just nightlies?
Attachment #314177 - Flags: review?(robert) → review+
For releases. Actually more important for releases than for nightlies.
> This will only enable the "submit" checkbox by default 25% of the time on
> Windows builds. This only takes effect if there's no existing value, so
> existing nightly testers will not be affected, only new users.

Does this design mean it will be hard to change the percentage between, say, Firefox 3 and Firefox 3.0.1?
Just to be clear, we want to try to get 100% of the crashes for all nightlies, betas, and even release candidates that will be downloaded fewer than a million or so times. When we get bits that we know will be pushed as "final" and could eventually be downloaded by tens of millions of users, we want to flip some bit that allows us to throttle back to 10 or 25%. Using a build config flag and requiring a rebuild sounds like more work/risk than we might want to apply to bits that are in transition from an RC to Final.
RCs are RCs: we don't change the bits from RC to final.

We discussed being able to set the percentage from a text file over IRC. That solution is not optimal because it re-uses a l10n file for other content. It is more practical to compile in the random percentage at this point.

I don't think we need 100% of crashes for RCs in any case... the statistical sample sizes are already huge, so we don't gain much statistical knowledge with more users; and since any users can check the box to send a particular report, to help QA particular crashes, I don't think we're going to miss a lot.
Jesse, changing between 3 and 3.0.1 would involve a mozconfig change, but isn't "hard" or risky.
I think what Jesse means is that because this is "sticky", once a user has crashed the value is saved, so changing the percentage won't change any existing users. We could tweak this slightly so that we only persist actual user-selected values, and let the random value always be random, or persist the random value separately from the user-selected value, but I don't know how useful that would be.
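One of the tweaks mentioned here, persisting only values the user actually selected and re-rolling the random default otherwise, might look roughly like this. It is a sketch under assumptions: the class SubmitPreference and its attribute names are hypothetical and the persistence layer is reduced to a single field.

```python
import random


class SubmitPreference:
    """Hypothetical preference that never persists the random default."""

    def __init__(self, enable_percent, rng=None):
        self.enable_percent = enable_percent  # may change between releases
        self.rng = rng or random.Random()
        self.user_choice = None  # persisted only after an explicit toggle

    def checkbox_state(self):
        # An explicit user choice is sticky.
        if self.user_choice is not None:
            return self.user_choice
        # Otherwise roll again at each crash, using the current percentage.
        return self.rng.random() * 100 < self.enable_percent

    def user_toggled(self, checked):
        # Record only what the user actually selected.
        self.user_choice = checked
```

Under this design, changing the percentage in a later release would affect every user who has never touched the checkbox, at the cost of the default possibly differing from one crash to the next.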
Also, did we actually have a way to change the Talkback percentage via updates, or was it solely through the installer?
> I don't think we need 100% of crashes for RCs in any case... the statistical
> sample sizes are already huge, so we don't gain much statistical knowledge with
> more users;

We are respinning 2.0.0.13 now because a top crash regression went undetected in the release candidate and turned out to be serious enough to require a reaction.   You're probably right that 3.0RC1,2,3... will have several million users, pretty good sample size, and long enough bake time to detect a possible topcrash ship blocker.   The question is if 3.0.0.1-13 will have all those things working for them.  Historically it has been tough to build in enough bake time and big enough sample size to spot top crash and other problems that create the need for respins of maintenance releases.

> Also, did we actually have a way to change the Talkback percentage via updates, or was it solely through the installer?

Talkback provided a protocol where the client would ask for instructions from the server when it started the submission process.   One of the instructions the server could respond with is "shut yourself down and don't send reports for this release in the future", so in effect we could throttle post install and turn off clients where the data was no longer valuable.

> topcrash ship blocker.   The question is if 3.0.0.1-13 will have all those
> things working for them.  Historically it has been tough to build in enough
> bake time and big enough sample size to spot top crash and other problems that
> create the need for respins of maintenance releases.

How do you propose to fix that? RCs are bit-identical to the actual release (if we don't find blockers, we release those precise bits).

> the server could respond with is "shut yourself down and don't send reports for
> this release in the future", so in effect we could throttle post install and

We have the same functionality in breakpad, see bug 412788
> How do you propose to fix that? RCs are bit-identical to the actual release (if we don't find blockers, we release those precise bits).


Here is the idea we have kicked around before: use the installer package naming convention that we have been using for the last couple of years.

Have the installer run some code like:

   if the name of the installer package matches "beta" "alpha" "pre" 
     then no throttling
   else throttle

RC builds do not contain beta/alpha/pre in the filename:
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2.0.0.14-candidates/rc1/
but they could (and should?) to avoid confusion on what they are, and make this work to our advantage for gathering the maximum amount of crash data.
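The name-matching rule above could be sketched as a small function. This is only an illustration of the pseudocode: the function name should_throttle and the marker tuple are hypothetical, and plain substring matching is an assumption (it would also match unrelated names that happen to contain "pre").

```python
# Markers taken from the pseudocode above; matching is case-insensitive.
PRERELEASE_MARKERS = ("alpha", "beta", "pre")


def should_throttle(installer_name):
    """Return True when the package name carries no pre-release marker."""
    name = installer_name.lower()
    # No marker found means this looks like a final build, so throttle.
    return not any(marker in name for marker in PRERELEASE_MARKERS)
```

As noted in the comment, RC packages would be throttled under this rule unless their filenames were changed to include one of the markers.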
Well, it's somewhat irrelevant to breakpad since we don't choose the random percentage at install-time, but rather at crash-time.
I checked the second patch in on trunk and the 'release' branch, so it should be set for RC1. We can back it out on trunk after it bakes without problems.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
How did we arrive at 25%?
I don't recall. Anyway, I wonder if this is somewhat ineffective, given how easy it is to check the box. If we're still having scaling problems, we could do something more effective.
Well, we are getting 200K+ reports a day when we expected 60K a day, so something is messed up.  Seems like 25% is arbitrarily picked anyway, so I was wondering if there's a statistical reason for that number.  We should pick whatever is the minimum to have statistically significant samples.  Have we discussed what that % would be?
I don't know that anyone knows what that number is. If we're getting too many reports, clearly we ought to ratchet it down. Given that number though, I question the effectiveness of this approach, and think we should consider something more drastic, like completely disabling crash reporting for x% of installs.
(In reply to comment #29)
> reports, clearly we ought to ratchet it down. Given that number though, I
> question the effectiveness of this approach, and think we should consider

I kind of question it except that we have no stats otherwise. Now might be a good time to experiment. Turn the percentage down to 10% and see if the number of reports drops to 40% of the current volume (10/25).
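The arithmetic behind that experiment can be written out as a rough model. This is a hypothetical sketch: it assumes report volume is proportional to the fraction of crashes where the box ends up checked, and the manual_opt_in parameter (the share of users who check an unchecked box themselves) is an invented knob, not a measured number. It also ignores users who uncheck a checked box.

```python
def expected_volume_ratio(old_percent, new_percent, manual_opt_in=0.0):
    """Ratio of new to old report volume under a simple two-factor model."""

    def submit_rate(percent):
        checked_by_default = percent / 100.0
        # Users whose box starts unchecked may still opt in by hand.
        return checked_by_default + (1.0 - checked_by_default) * manual_opt_in

    return submit_rate(new_percent) / submit_rate(old_percent)
```

With no manual opt-in, expected_volume_ratio(25, 10) comes out at about 0.4, the 40% mentioned above. If the measured drop were much smaller than that, it would suggest manual opt-in accounts for a large share of reports.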

But maybe we shouldn't be discussing this in a closed bug?
Yeah, file another one; we can cover it there.  Ted, I'm not saying you/we did anything wrong; I just want to understand how we're throttling a little better.
(In reply to comment #31)
> Yeah, file another one; we can cover it there.

Filed bug 444033.