Open Bug 528657 Opened 15 years ago Updated 2 years ago

add a bit more state info to help diagnose crashes [meta]

Categories

(Toolkit :: Crash Reporting, defect)

x86
All
defect

Tracking

()

People

(Reporter: chofmann, Unassigned)

References

(Depends on 3 open bugs)

Details

(Keywords: helpwanted, meta, student-project)

the recent discussion in
https://bugzilla.mozilla.org/show_bug.cgi?id=526545#c16 about providing more state info about the urls in transition and loading

and in
https://bugzilla.mozilla.org/show_bug.cgi?id=528645 about trying to understand if gc is involved

made me think about whether there is more state we can gather to make diagnosis go faster.
the more things that get added to the list the more this would be sensitive to performance concerns.  

maybe we could put some of this under pref control then flip the pref on for trunk/beta builds or when we see users running into problems and need more data.
chofmann, I made some comments in the CrashKill meeting today that I think you asked me to add to this bug. I think we were talking primarily about how to determine how much client-side throttling is happening, and more generally how to find out the "real" number of crashes for comparison over time.

To me, it seemed hard to really measure the rate of client-side throttling (although if someone can figure out how to do it, that would certainly be great) and track it over time. So I thought it would be more likely to work if we can measure quantities of "stuff" that is expected to generate crashes, and then track crashes per "stuff" over time. "Stuff" can include:

- run time (i.e., how many seconds has firefox.exe been running since the last crash)
- "active" run time (to try to separate out active use vs. the browser just sitting there--maybe go to "idle" state X seconds after user input)
- number of GCs run
- number of memory allocations made, or bytes of memory allocated
- number of pages loaded, or bytes of HTML displayed
- number of JS scripts run

Which of these measures (if any) gives a stable, meaningful crashes-per-X metric can only be determined empirically. But I think it would be really interesting to test some and compare them against crashes per daily user.
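
To illustrate the comparison suggested here, the following is a minimal sketch that computes crashes per unit of a candidate measure for several reporting periods and then checks how stable that ratio is. The record layout and the field names (`crashes`, `run_seconds`, `pages_loaded`) are hypothetical and the numbers are made up; nothing here reflects data the crash reporter actually collects.

```python
from statistics import mean, pstdev

def crash_rates(periods, measure):
    """Crashes per unit of `measure` for each reporting period."""
    return [p["crashes"] / p[measure] for p in periods if p[measure] > 0]

def relative_spread(rates):
    """Coefficient of variation; a lower value means a more stable metric."""
    m = mean(rates)
    return pstdev(rates) / m if m else float("inf")

# Made-up numbers, purely for illustration.
periods = [
    {"crashes": 10, "run_seconds": 1000, "pages_loaded": 50},
    {"crashes": 20, "run_seconds": 2000, "pages_loaded": 400},
]

per_second = crash_rates(periods, "run_seconds")
per_page = crash_rates(periods, "pages_loaded")
```

In this toy data, crashes per second of run time is identical across both periods while crashes per page load varies, so run time would be the more stable denominator of the two.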

Another interesting thing to measure is the number of crashes per crash report. That is, set a counter to zero when the user submits a crash report, and increment for each crash. That would tell us the % of crashes that are sent as crash reports *for users that ever send a crash report at all*. (It seems really hard to measure users that never send a crash report--they are like the dark matter of crashes. But maybe we don't need to measure them--if we can get the stats on our other users, that should be enough.)
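
A minimal sketch of that counter, assuming the count is persisted between runs and attached to whichever report the user eventually submits. The class and method names are hypothetical.

```python
class CrashCounter:
    """Counts crashes since the last submitted crash report (hypothetical)."""

    def __init__(self, stored_count=0):
        # stored_count: value persisted from earlier runs
        self.crashes_since_last_report = stored_count

    def on_crash(self, submitted):
        """Record a crash; return the count to attach if a report is sent."""
        self.crashes_since_last_report += 1
        if not submitted:
            return None
        count = self.crashes_since_last_report
        self.crashes_since_last_report = 0  # reset on submission
        return count


counter = CrashCounter()
results = [counter.on_crash(False), counter.on_crash(False), counter.on_crash(True)]
```

On the server side, the number of reports received divided by the sum of the attached counts would estimate the fraction of crashes that are submitted, for users who submit at least sometimes.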

A final extra metric is to have the browser send a heartbeat to us every so often (say 1 hour of run time). That would give another measure of how much usage is going on relative to crashes.
testpilot might be another way we could enable some of this general heartbeat and crash related data collection.
(In reply to comment #2)
> - run time (i.e., how many seconds has firefox.exe been running since the last
> crash)

This sounds useful. Can you file that as a separate bug? We currently send "time of last crash", "app start time" and "time of current crash", which you can subtract to get "absolute time since last crash", and "time since starting the app", but not "amount of run time since last crash" if the user runs the app multiple times between crashes.
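
To make the distinction concrete, here is a small sketch of accumulating run time across sessions, assuming the running total is persisted at clean shutdown and reset when a crash is reported. The class and method names are hypothetical, and timestamps are plain seconds.

```python
class RunTimeTracker:
    """Accumulates run time across sessions until the next crash."""

    def __init__(self, stored_seconds=0.0):
        # stored_seconds: run time persisted from earlier sessions
        self.accumulated = stored_seconds
        self.session_start = None

    def start_session(self, now):
        self.session_start = now

    def end_session(self, now):
        """Clean shutdown: fold this session into the persisted total."""
        self.accumulated += now - self.session_start
        self.session_start = None

    def on_crash(self, now):
        """Return run time since the last crash, then reset the total."""
        total = self.accumulated + (now - self.session_start)
        self.accumulated = 0.0
        self.session_start = None
        return total


tracker = RunTimeTracker()
tracker.start_session(0)
tracker.end_session(100)      # first session ran for 100 s
tracker.start_session(1000)   # app was closed for 900 s
run_time = tracker.on_crash(1050)
```

Here the absolute time since the last crash would be 1050 seconds, but the accumulated run time is only 150 seconds.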

> - "active" run time (to try to separate out active use vs. the browser just
> sitting there--maybe go to "idle" state X seconds after user input)

This sounds trickier (although we do have the idle service, so we could have an observer that flips between idle/non-idle and accrues time).
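
A rough sketch of such an observer, assuming it is notified with an "idle" topic when the user goes idle and a "back" topic when input resumes. The topic names, the class, and the explicit `now` argument are assumptions made for illustration, not the idle service's actual interface.

```python
class ActiveTimeAccumulator:
    """Accrues time spent in the non-idle state (hypothetical observer)."""

    def __init__(self, now):
        self.active_seconds = 0.0
        self.active_since = now  # assume the user is active at startup

    def observe(self, topic, now):
        if topic == "idle" and self.active_since is not None:
            # Going idle: bank the active interval that just ended.
            self.active_seconds += now - self.active_since
            self.active_since = None
        elif topic == "back" and self.active_since is None:
            # User input resumed: start a new active interval.
            self.active_since = now

    def total(self, now):
        """Active time so far, including any interval still in progress."""
        running = now - self.active_since if self.active_since is not None else 0.0
        return self.active_seconds + running


acc = ActiveTimeAccumulator(now=0)
acc.observe("idle", 60)    # active for the first 60 s
acc.observe("back", 300)   # idle from 60 s to 300 s
active = acc.total(330)
```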

> - number of GCs run
> - number of memory allocations made, or bytes of memory allocated
> - number of pages loaded, or bytes of HTML displayed
> - number of JS scripts run

I think all of these are feasible and don't need any new APIs.

> Another interesting thing to measure is the number of crashes per crash report.
> That is, set a counter to zero when the user submits a crash report, and
> increment for each crash. That would tell us the % of crashes that are sent as
> crash reports *for users that ever send a crash report at all*. (It seems
> really hard to measure users that never send a crash report--they are like the
> dark matter of crashes. But maybe we don't need to measure them--if we can get
> the stats on our other users, that should be enough.)

This is also an interesting idea, and we talked about something similar when we stopped sending a UUID per user. This would be pretty trivial to add, can you file a separate bug on it?

> A final extra metric is to have the browser send a heartbeat to us every so
> often (say 1 hour of run time). That would give another measure of how much
> usage is going on relative to crashes.

We already have update and blocklist pings that we use to calculate average daily users, I don't think we need any new data here.
See also previously filed bug 510408.
Depends on: 510408
let's make this the tracking bug for all the spin-offs suggested in comment 4
Depends on: 529405
Depends on: 529409
I filed bugs on the 2 specific items you asked for in comment 4. If there are others you want me to file as separate bugs, just let me know. Also:

>> A final extra metric is to have the browser send a heartbeat to us every so
>> often (say 1 hour of run time). That would give another measure of how much
>> usage is going on relative to crashes.
>
>We already have update and blocklist pings that we use to calculate average
>daily users, I don't think we need any new data here.

OK. I didn't know that. Out of curiosity, do we have docs on what those pings are and where they are sent?

It does sound like it measures more the number of people that use the browser at least once a day, and doesn't track how much they are using it during the day, but I guess those measures are probably fairly proportional anyway.
the blocklist ping is a once-a-day event.

what we talked about at the meeting would be something more detailed that would give us some idea of the combined number of sessions and amount of time spent on-line each day.
time of url load is an interesting addition to help understand how close to page load the crash happens. that would enable us to do a better job of picking urls that are click-to-crash for retesting in the crash automation. that's bug https://bugzilla.mozilla.org/show_bug.cgi?id=510408

We could also add things like key prefs that have been fiddled with and that might cause instability, or that would help us get to a reproducible setup quicker. It turns out several addons are flipping enablePrivilege and codebase_principal_support; the state of those settings might be as interesting as the extension version check override.

the big push for q409 crash kill seems to be over for now and stability seems to be back on track, but I think these improvements to breakpad are still needed. maybe a q2 goal?
any chance we could make this a q2 goal for getting some of this stuff into breakpad, and then support into socorro?
It doesn't really have much to do with Breakpad, it just requires someone to use the existing API for annotating a crash report to put the data they want into the crash report.
Depends on: 574174
I combined the measures of user activity (page/plugin loads, gc, and memory allocation) into a single bug.

    *   Bug 574174  - track garbage collection, memory use, as measures of user activity and send as part of crash reports

    *   bug 510408:  add time of url load to crash reporting.  [NEW ; assigned to nobody@mozilla.org; Target: ---]   
    *   bug 529409: Measure crashes per crash report [NEW ; assigned to nobody@mozilla.org; Target: ---] 
    *   bug 529405: Track and report run time since last crash [NEW ; assigned to nobody@mozilla.org; Target: ---] 


beltzner/damon, this sounds like it's not really breakpad work, but just instrumenting the various places that take these actions, then using the existing breakpad APIs to send the data along with the crash reports.

maybe this is a good place for one or more interns to get working, in time for an early firefox 4 beta. everything in the list above could go in parallel, but if we had to prioritize I'd suggest the order above. The first one is a bit harder and could have a perf impact, since we do those operations more often. If these turn out to be too complicated, then move to the next set of easier items.
ted, it might be late, but how about a short session at the summit on using the breakpad APIs?
Keywords: meta
OS: Mac OS X → All
Summary: add a bit more state info to help diagnose crashes → add a bit more state info to help diagnose crashes [meta]
Depends on: 493779
Severity: normal → S3