Closed Bug 444961 Opened 13 years ago Closed 11 years ago

more effective client-side breakpad throttling for release versions

Product/Component: Toolkit :: Crash Reporting
Type: defect
Priority: Not set
Severity: normal
Status: RESOLVED WONTFIX
Reporter: chofmann
Assignee: Unassigned
Attachments: 1 file

broken off from bug 424960 and bug 444033 as another attempt at throttling and reducing load on the socorro server for released versions of firefox.

ted wrote:

> I was thinking maybe just a file in the install dir, 
> like "crashreporter.disabled", whose presence would cause us to not
> install the exception handler. 

here is another idea that would allow us to 
1) not change RC bits after they are created
2) collect 100% of crashes during the release candidate testing period.
3) throttle back to showing the crash dialog to only 10% of users after the
official release is on the wire.

It would involve something along the lines of what ted suggested.  We could use
a "born on date" on the client installation to figure out whether the client
should send all reports or become part of the 90% of installations that don't
send reports for final releases.  To do something like this we would need a file
or pref to test against, recording the result of the check for whether the
client is in the random 10% pool, i.e. whether the crash reporter is enabled or
disabled.   Maybe change the name of the crashreporter.disabled
file to crashreporter.participation and then insert "enabled" or "disabled" as
the contents of the file.   Here is a rough outline of what the logic might look
like.


if crashreporter.participation file doesn't exist, set one up:

   if the build is less than 20 days old
      echo "enabled" > crashreporter.participation

   else

      if in the random 10% pool
           echo "enabled" > crashreporter.participation
      else 
           echo "disabled" > crashreporter.participation


if crashreporter.participation file contains "enabled"
   show the crash dialog and let the user send in the crash data
else 
   don't install the exception handler and/or show the crash dialog to the
   user.


That would give us a 20 day window after the builds are created as 100%
reporting, then we would start to set up a random 90% of clients as "disabled"
after that.
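
A minimal Python sketch of the outline above, for illustration only. The real crash reporter is not written in Python, and the function names, the `rng` parameter, and the way the build date is passed in are all assumptions made for this example; only the file name, the 20 day window, and the 10% pool come from the proposal.

```python
import random
from datetime import date, timedelta
from pathlib import Path

FULL_REPORTING_WINDOW = timedelta(days=20)  # window proposed in the comment
SAMPLE_RATE = 0.10                          # the "random 10% pool"


def decide_participation(build_date, today, rng=random.random):
    """Return "enabled" or "disabled" for an installation with no file yet."""
    if today - build_date < FULL_REPORTING_WINDOW:
        return "enabled"  # release candidate period: everyone reports
    return "enabled" if rng() < SAMPLE_RATE else "disabled"


def load_participation(path, build_date, today, rng=random.random):
    """Read the participation file, creating it on first use."""
    path = Path(path)
    if not path.exists():
        path.write_text(decide_participation(build_date, today, rng))
    return path.read_text().strip()


def should_install_exception_handler(path, build_date, today):
    """True when the file says this installation takes part in reporting."""
    return load_participation(path, build_date, today) == "enabled"
```

As in the outline, the decision is made once and then persisted, so an installation that first runs inside the 20 day window stays "enabled" until someone edits or deletes the file.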

It would also make it easy for advanced users to "turn on crash reporting" if
they really wanted to, or to shut it down if they knew how and where to go find
this file and edit the contents.
Please do not make triage harder than it already is. There are already a few things to ask a reporter when you want a crash ID (Flash breaks it; is he using a Mozilla.org build or a build from a distribution; did the user really crash, or did he get a hang/freeze?).
about:crashes was implemented to make it easier to get crash IDs without searching for the crash reporter's file of submitted reports.
Going back to a file that you have to check somewhere in your profile is a bad idea IMO.
I suggest something like a switch in about:crashes, an extension you can install that places an enable file somewhere, or an about:config switch.

I have just written http://www.mversen.de/crash/ because I don't want to explain the same thing to every reporter every day, and the explanation will get much longer if I also have to explain how to enable the reporter again.

I agree that the current solution is bad: a typical user thinks that Firefox should not crash and that if he sends a report his crash will be fixed, so he enables the checkbox manually and we get an overloaded Breakpad server system.
My alternative solutions might be either impossible or hard to implement, but they would make daily bug triage much easier.

Thanks for listening
I believe that the design goals are:

* always show the "you crashed" UI, for easy recovery for the user
* allow people who are crashing an easy way to say "please submit my crash reports"
* throttle people who don't care to some low but statistically-significant percentage

I would personally be surprised if a significant number of people actually read or clicked any button in the crash report dialog other than "Restart Firefox"... why are we theorizing that's an issue?
those design goals aren't working.  more people are opting in and checking the box than expected.

it's an issue.  with 10 million fx3 daily users the servers are overloaded.  with another 50 or 60 million users to upgrade we put the whole crash reporting system in a state where it is useless for either reporting crashes or doing the analysis on crash reports.

Whiteboard: [MU?]
The bug is not with the goals; it may be with the implementation. We could, for example, suppress the "send my report to mozilla" checkbox in the crash reporter for the 90% case. That makes it harder for people to submit crash reports when asked, but that can be fixed with a setting in about:crashes or some other less obvious mechanism.
That's going to make the UI very confusing. I honestly think we should just abandon goal #1 on the 1.9.0 branch and implement something like chofmann's idea here, disabling crash reporting entirely in most cases. We can revamp the crashreporter UI a bit on 1.9.1 to make it better, but that's a lot harder to do on a string-frozen branch, and we need to do *something* here, because we can't handle the current volume of crash reports.
Ted, why is removing the checkbox hard/harder than disabling the exception handler? It involves no string changes since it's a simple removal, right?
It's not hard, but it's going to make the UI confusing. Consider this (poorly) modified screenshot. You're faced with a bunch of disabled controls, and no way to change them. We could go all out and remove all of those controls, but then you're left with the text "To help us diagnose and fix the problem, you can send us a crash report.", but no UI to make that happen. Ideally we'd present a UI with just the first paragraph, and the quit/restart buttons, but we can't do that on 1.9.0 without string changes.
I'm not sure if this is practical, but here's an idea: throttle on the server side.  Have all clients send their reports, but don't process them all.  Thus all crashes have a report sent and can be queued for processing by requesting them via about:crashes (which most people don't do), however only a very small percentage are automatically processed and tabulated.  (a percentage based on platform)  This way we'd get the best of both worlds: full reporting but throttled server load, assuming the actual sending of the un-processed reports can be made efficient enough.
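
A rough Python sketch of that idea, to make it concrete: every report is stored, a sampled fraction is queued for processing, and looking a crash up promotes it. This is not Socorro code. The class, method names, and per-platform rates are invented for the example, since the thread gives no figures.

```python
import random
from collections import deque

# Hypothetical per-platform processing rates; not taken from the thread.
PROCESS_RATE = {"Windows": 0.10, "Mac OS X": 0.50, "Linux": 1.00}
DEFAULT_RATE = 0.10


class ThrottledIntake:
    def __init__(self, rng=random.random):
        self.rng = rng
        self.stored = {}      # crash_id -> raw report, kept whether or not processed
        self.queue = deque()  # crash ids waiting for the processor

    def accept(self, crash_id, platform, report):
        """Store every report, but queue only a sampled fraction."""
        self.stored[crash_id] = report
        if self.rng() < PROCESS_RATE.get(platform, DEFAULT_RATE):
            self.queue.append(crash_id)
            return "queued"
        return "deferred"

    def request(self, crash_id):
        """Someone looked the crash up: move it to the front of the queue."""
        if crash_id not in self.stored:
            return "unknown"
        if crash_id in self.queue:
            self.queue.remove(crash_id)
        self.queue.appendleft(crash_id)
        return "pending"
```

The sketch leaves out the processor itself and any expiry of stored reports; it only shows where the throttling decision and the on-request promotion would sit.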
OS: Mac OS X → All
Hardware: PC → All
> The bug is not with the goals; it may be with the implementation.

You're right.  There is nothing wrong with the goal and implementation for

  * always show the "you crashed" UI, for easy recovery for the user

but I think there are problems with the other two:

  * allow people who are crashing an easy way to say "please submit my crash
reports"
 
  -- *easy* means people will do it (and are doing it), and the result is overloaded servers.  Firefox users in general are passionate and are interested in helping out when we ask.   I think it's the case that we just need to ask fewer people to help out on this.

* throttle people who don't care to some low but statistically-significant
percentage

--  right.  this is about effective throttling.

re: comment 7

Rather than disabling all the crash submission controls, is it possible to remove them from the dialog for the 90%?   If the dialog just said:

     Firefox has encountered a problem and has crashed.  
     We will try and restore your tabs and windows when you restart.

      [quit]    [restart]


that might work.


re: comment 8

This idea still has bandwidth and processing cost that tie up the server.  It also has support cost when we throw crashes overboard, and then the developers/advanced users go looking for their reports to investigate.   We will get a flood of reports about "I submitted a crash report but it's not showing up in the database..."

(In reply to comment #9)
> This idea still has bandwidth and processing cost that tie up the server.  It
> also has support cost when we throw crashes overboard, and then the
> developers/advanced users go looking for their reports to investigate.   We
> will get a flood of reports about "I submitted a crash report but it's not
> showing up in the database..."

The point is that the report would be sent and processed on request only.  It would be the same effect as checking the report quickly after sending; the user would get a "this request is now pending" message on checking and then eventually it'll show up processed.  This way we don't process the report unless it is needed by someone who chooses to investigate.

Yes, there'd still be a bandwidth cost on the server to accept these reports.  I'm simply saying that I think it's a necessary hit if we always want to be able to get at all crash reports, even if we don't automatically process them all.
When I say "easy", I mean "not worse than FF2": intelligent users can choose the "custom" option of the installer and manually choose to install talkback.

We need something equivalent for FF3. It can be English-only UI, hidden behind about:crashes, to avoid l10n problems, if necessary.
Mike - I think this should block MU and 3.0.2.
Flags: blocking1.9.0.2?
I mentioned this in today's meeting (http://wiki.mozilla.org/Breakpad/Status_Meetings/2008-Jul-16). I would rather we do all throttling server-side, and not change the experience for users if possible, even if that means users send reports and we decide not to process them (unless, of course, they load that crash ID, thus turning the report into a priority request).
MU I can see, but is there any reason to block 3.0.2 if that turns out to not be the MU version?

Also, to be clear, the client side piece of this is:
  
  #  Change dialog box % to 10% 
Whiteboard: [MU?] → [MU+]
(In reply to comment #14)
> MU I can see, but is there any reason to block 3.0.2 if that turns out to not
> be the MU version?
> 
> Also, to be clear, the client side piece of this is:
> 
>   #  Change dialog box % to 10% 


That was bug 444033. This bug was filed to look at a more effective throttling, by not offering users the option to submit. If we're not going to do that here, we should WONTFIX this.
From some discussion today it sounds like strings in the crash dialog for: 

1) the explanation that a user has crashed 
and 2) the opt-in/send the crash info
and 3) the quit/restart buttons 

are all part of the same string blob, so we can't do the suggestion in comment 9
without breaking the string freeze and requiring participation from all the locale teams.

We should break these up in firefox 3.1 so we have the option for showing 1 and
3, or 1, 2, and 3 depending on if the user is in the 10% selected random pool
of crash reporting users.  Long term this is going to provide the best user
experience and most reliable system for submitting crash reports, making the
most efficient use of the processing resources, and having the crash reports
show up in the database as the user expects.
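
As an illustration only, the choice between showing 1 and 3 or 1, 2, and 3 could look like the Python sketch below once the strings are split. The section names are placeholders invented for this example, not actual string IDs from the crash reporter.

```python
def dialog_sections(in_reporting_pool):
    """Return the dialog sections to show, in display order."""
    sections = ["crash_explanation"]           # 1) you have crashed
    if in_reporting_pool:
        sections.append("send_report_opt_in")  # 2) opt-in / send crash info
    sections.append("quit_restart_buttons")    # 3) quit / restart
    return sections
```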

filed bug 445600

(In reply to comment #14)
> MU I can see, but is there any reason to block 3.0.2 if that turns out to not
> be the MU version?

I mostly agree. But I'd say this should block the next release because with each release we get more users ("I'll just wait for the second maintenance version so they can fix all the bugs.") and more users == bad (right now).

> Also, to be clear, the client side piece of this is:
> 
>   #  Change dialog box % to 10%

For now it is. If we need to have something more effective, we can file new bugs for that.

(In reply to comment #15)
> If we're not going to do that here, we should WONTFIX this.

I saw this bug more as a meta bug for the various issues that need to get taken care of for more effective throttling. bug 444033 was part of that, but server-side fixing is another part. Yes, no?
> I mentioned this in today's meeting
> (http://wiki.mozilla.org/Breakpad/Status_Meetings/2008-Jul-16) 
> I would rather we do all throttling server-side, and not change the experience
> for users if possible.

This is for the short term, right?   I think we all agreed that, because of the current big string blob, trying to fix the user experience was going to be more difficult and raises the possibility of doing more harm than good.

Longer term when we have the opportunity to change strings and design a better system for throttling client side we ought to do that to make the user experience better for all of our users.

for the 90-95% that really don't know what crash details are or care what we do with them, it's still pretty disingenuous to have a dialog that says 

  "To help us diagnose and fix the problem, you can send us a crash report."
    ( https://bugzilla.mozilla.org/attachment.cgi?id=329431 )

and then proceed to programmatically send their report to the bit bucket 90% of the time from now to eternity.  We shouldn't ask the users for data that we know we won't or can't use.   I'd like to see that incorporated into a data policy somewhere, but I guess that's another discussion.

For the remaining small pct. of people that are in our development and extended QA community (100k-3M+ ?), who are passionate about helping us improve and might actually look at data, or at least want to verify that they *are* helping out, we don't have a good user experience either.   Making them send in the report, then jump through additional hoops to get their report processed by some sign up system to "really (really, really) be a participant in the crash program" doesn't serve them well.

There is still UX work to do here, right?


Whiteboard: [MU+] → [MU+][server side]
This isn't blocking 1.9.0.2, even though it's the MU version. MU will hopefully be a week or two after 1.9.0.2, which means we need a solution in place in the next couple of weeks to test...

morgamic, bump this up on your radar for me? ;)
Flags: blocking1.9.0.2? → blocking1.9.0.2-
It's currently being worked on -- Lars' current focus.  Was bumped already.
Should we assign this to Lars and switch components or file another server-side bug?
I changed the title for this bug to be client side throttling.  I think we will eventually want to do something on the client side.  Let's open up one or more server side bugs for shorter term work.
Summary: more effective breakpad throttling for release versions → more effective client-side breakpad throttling for release versions
I just realized that even though we un-check the default for sending in crash reports in the client, we have a very high visibility (1 click away) appeal for every user to submit crash data.

bottom of the start page has a link 

 Do you love Firefox? So do millions of other people. Help us spread the word!  

that link points to http://en-us.www.mozilla.com/en-US/firefox/community/

The #4 bullet point on that page is

  * Ensure that you're submitting crash data to the development team

with more details about the crash reporting system under that link.


We should reconsider if we want to do this in such a high visibility way.  We really don't need or want a high volume of crash reports from all the users on release builds, or even all the users that "want to help out".

Removing these high visibility links and the "user training" that we are doing with them might have some impact on reducing the volume of crash reports over time.
(In reply to comment #23)
> that link points to http://en-us.www.mozilla.com/en-US/firefox/community/
> 
> The #4 bullet point on that page is
> 
>   * Ensure that you're submitting crash data to the development team
> with more details about the crash reporting system under that link.
> 
> We should reconsider if we want to do this in such a high visibility way.  We
> really don't need, or want high volume of crash reports from all the users on
> release builds, or even all the users that "want to help out" 

Just as a bit of context about this page, although it was redesigned recently, the content on it wasn't updated for Firefox 3 and is actually pretty old. So, a lot of those links are probably due for a refresh (or removal, as the case may be).
Given the awesome work Lars has done on server-side throttling, I don't think this is necessary anymore.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Yeah, it kind of is. I'll post more later...
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Whiteboard: [MU+][server side]
I think this bug should block Firefox 3.1, even if it has a localization hit. We need to not have this many crash reports submitted. It's far too much storage.
Flags: blocking1.9.1?
Sam, could you be more specific? What specific reason prevents us from throwing away reports on the server pretty quickly to avoid the disk space issues? I haven't seen a good proposal for a client-side throttling solution that has a good crash/restart UI.  I really think this bug should be WONTFIX.

I certainly don't think we should do anything for 1.9.1, absent some overwhelming data that hasn't yet been articulated.
Current server side throttling places submitted crashes into two categories: deferred (held for two weeks) and processed (held for 120 days).  Currently, deferred storage is the disk hog because 90% of submitted jobs are in this category.  The two week hold date is simply to allow a developer that amount of time to request a report about a crash.  To save storage, we can just shorten the hold period.  However, that would only be a temporary solution.  If we could get some stats to chart the trends, we could determine how temporary.
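
To make the two hold periods concrete, here is a small Python sketch of an expiry check. Only the two categories and their hold periods (two weeks and 120 days) come from the description above; the function names and the tuple layout are assumptions for the example and do not reflect Socorro's actual storage code.

```python
from datetime import datetime, timedelta

RETENTION = {
    "deferred": timedelta(days=14),    # "held for two weeks"
    "processed": timedelta(days=120),  # "held for 120 days"
}


def is_expired(category, received_at, now):
    """True once a report has outlived the hold period for its category."""
    return now - received_at > RETENTION[category]


def purge(reports, now):
    """Return only the reports still inside their hold period.

    `reports` is an iterable of (crash_id, category, received_at) tuples.
    """
    return [r for r in reports if not is_expired(r[1], r[2], now)]
```

Shortening the deferred hold period would then be a one-line change to `RETENTION["deferred"]`.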

I've asked aravind to report on exactly how much disk space is being used by the two categories of storage.
The server-side solution with about:crashes reprocessing is mostly in place. I don't think we need or want this for 1.9.1
Flags: blocking1.9.1? → blocking1.9.1-
we could also do some overall system monitoring and alerting when overall traffic gets outside of the norms

for instance we could do an updating graph like this that shows incoming crashes per minute.
https://bug519423.bugzilla.mozilla.org/attachment.cgi?id=403767

then given current trends we might set alerts at:

   more than 225 crashes per minute 
      -- we might be getting DoSed? or some website or plugin, 
      or product release might be driving crashes nuts

   or less than 50 crashes per minute. 
      -- we might not be processing incoming reports due to network 
      or other errors
oops, wrong bug.  ignore comment 31.
AIUI the plan is to improve Socorro to process 100% of reports, so we don't need this.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → WONTFIX