Open Bug 1317972 Opened 8 years ago Updated 2 years ago

Work out what to do for crash reports with the GPU process


(Core :: Graphics, defect, P3)





(Reporter: gw280, Unassigned)


(Blocks 1 open bug)


(Whiteboard: [gfx-noted])

We don't currently have a strategy for dealing with GPU process crashes. When the process dies, the crash report should end up in the pending crash reports directory, but there's no UI to prompt the user to submit it.

A few things to bear in mind here:

- There's ongoing work in bug 1280469 to do client-side stackwalking. This will allow us to get crash stacks via Telemetry without requiring user input. This is very valuable and, whilst not finished yet, is supposed to ride the trains at some point.
- There's value in us getting more information than just the stack. dvander can comment more, but I suspect getting crash dumps will be the biggest thing we can get from a normal crash report that isn't in the Telemetry data.
- The GPU process is intended to be as transparent to the user as possible. It's being designed such that if the process crashes, it restarts instantly without any user interaction or notification. As such, there's no obvious opportunity to prompt the user to submit a crash report.

In my mind right now, there are two possibilities for dealing with this:

- See if we can get the pending crash report submission UI to be bumped up so that it rides the trains.
- Provide a small, unobtrusive UI at the time of a GPU process crash prompting the user to submit a report.

There may be a third option where we either decide that the data we will get from the Telemetry pings will be sufficient, or are able to add the data we need to the pings if it's not sensitive enough to require user authorisation.
Philipp, any thoughts here?
Flags: needinfo?(philipp)
Running with George's idea on the "small, unobtrusive UI":

* We could wait for a few crashes to accumulate before worrying about informing the user. If it is something that happens once a month, it may not be worth informing them, or getting the reports.

* We wouldn't say "you crashed" - we would say "due to instability in your Windows graphics driver, we have temporarily disabled hardware acceleration. Firefox will continue operating normally, but you may notice a slowdown in some operations. Restarting Firefox will restore hardware acceleration..." or something like that.

* Part of acknowledging this message would be a "submit information about this problem" checkbox (opt-out).

* We should probably have "don't show this again, keep submitting the information/never submit information", although the data/privacy folks should be consulted on whether we can have a blanket "yes" for crash report submission. On the other hand, things like the URL shouldn't be a part of the GPU process crash report, right?
Priority: -- → P3
Whiteboard: [gfx-noted]
Also worth noting is that in my mind, the "unobtrusive UI" would be styled somewhat like the slow script notification when using e10s.

Waiting for a few crashes may be difficult as the current plan is to immediately fall back to in-process compositing when the GPU process fails. I'm not sure what the timeline is for full GPU process restarts.
(In reply to George Wright (:gw280) (:gwright) from comment #3)
> Waiting for a few crashes ...

Cumulatively, for the user - not within a single session. If the user only crashes twice a year, we are probably OK without their crash report. With startup crashes weighted heavily, of course :)
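
A rough sketch of what counting "cumulatively, for the user" could look like, purely for illustration. The 180-day window, the threshold of 3, the startup weight, and every name below are hypothetical; none of this reflects actual Firefox code or values agreed in this bug.

```python
from datetime import datetime, timedelta

# Hypothetical values for illustration only; the thread does not settle on any.
WINDOW = timedelta(days=180)
THRESHOLD = 3.0
STARTUP_WEIGHT = 3.0  # assumed weight so that startup crashes count heavily


def crash_score(crashes, now):
    """Sum the weights of all crashes that fall inside the window.

    `crashes` is a list of (timestamp, is_startup) tuples that is assumed
    to be persisted across sessions, e.g. somewhere in the profile.
    """
    score = 0.0
    for when, is_startup in crashes:
        if now - when <= WINDOW:
            score += STARTUP_WEIGHT if is_startup else 1.0
    return score


def should_prompt(crashes, now):
    """True once the accumulated score reaches the (hypothetical) threshold."""
    return crash_score(crashes, now) >= THRESHOLD
```

With these assumed numbers, a user who crashes twice a year never reaches the threshold, while a single startup crash does.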
ni'ing myself to make sure this stays on UX's radar.
Flags: needinfo?(mconley)
I'm a little conflicted on these options.

I don't think it makes sense to display any kind of UI when the GPU process crashes. By design it's something the user shouldn't know about, and it's not supposed to affect their session. Instead we'd be intruding, and it's very hard to explain why. "My compositor process crashed? What's a compositor process?"

If it's a super rare event - maybe that's actually okay. Hopefully we'll know how rare that is soon.

The unsubmitted notification route seems more appealing. I hope we can accelerate that and figure out how to let it ride the trains. It makes even more sense once we do restart the GPU process. I'd expect that to happen fairly soon - there's really no reason not to get it working now; it should be the next milestone.

The best thing of all would be crash stacks in telemetry, but I don't know when we'll have that :)
gw280, phlsa and I are going to sort this out in Hawaii.
Flags: needinfo?(mconley)
Here are the notes from our meeting in Hawaii:


- Do not want to interrupt the user if at all possible.
- Telemetry crash stack pings probably won't ride an early enough train for us.
- Unsubmitted crash reports UI won't ride trains ever, will terminate at beta.

Data we'd like from GPU process crashes:

- Symbolicated stack is probably the most vital piece of data. We can grab this without needing user opt-in.
- Driver/GPU environment info (versions etc), as I suspect a lot of what we do with the data is correlate crash frequency against driver versions and blacklist accordingly.
- Frequency of crashes on particular driver versions (may be able to calculate this by looking at existing telemetry data?)
- Raw dumps for debugging particularly complex crashes (definitely need opt-in for this)

I don't think there's anything other than the raw dumps which would need user opt-in. We can then most likely get the first three without displaying anything to the user. Is there anything I've missed here?
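
To make the opt-in split above concrete, here is a minimal sketch that filters a crash payload by whether the user has opted in. The field names and the `build_payload` helper are hypothetical and only mirror the four items in the list; this is not an existing crash reporter or Telemetry API.

```python
# Items from the list above, mapped to whether user opt-in is assumed to be
# required. Field names are hypothetical, chosen for illustration.
NEEDS_OPT_IN = {
    "symbolicated_stack": False,
    "driver_info": False,
    "crash_frequency": False,
    "raw_dump": True,
}


def build_payload(crash_data, user_opted_in):
    """Return only the fields that may be sent given the user's choice.

    Fields not listed in NEEDS_OPT_IN are treated as requiring opt-in,
    to err on the safe side.
    """
    return {
        key: value
        for key, value in crash_data.items()
        if user_opted_in or not NEEDS_OPT_IN.get(key, True)
    }
```

Without opt-in, only the stack, driver info and frequency fields survive; the raw dump and any unrecognised field are dropped.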

In terms of what we're going to do when handling GPU process crashes, it seems unlikely to me that a GPU process crash will occur by virtue of simply being out-of-process. That is, if we crash frequently and want to mitigate that, we're unlikely to fall back to in-process accelerated compositing. As such, any fallback that occurs is going to have an explicit impact on the user because h/w acceleration will be disabled. Based on this, it seems reasonable at some point to inform the user that we're disabling h/w acceleration because their experience will be impacted in a perceivable way.

We were all in agreement with this, but we need to define the threshold at which we inform the user that something's wrong. If we fall back to software but there's no perceptible difference in performance/responsiveness, then we shouldn't inform the user, as it's confusing and annoying. However, if we fall back and Firefox is now janky, we should definitely inform them and ask them to help us improve the experience.

Action items for now:

- I'm going to ask around and discuss the possibility of scrubbing the unsubmitted crash reports of sensitive data and automatically submitting those. This would get us the majority of what we'd like, I think.

- We need to define the threshold at which we inform the user that things are terrible. The wording is also going to be critical here; we want to avoid saying things like "hardware issues" or anything too technical. Something along the lines of "We have noticed that your Firefox experience hasn't been optimal, would you like to send us information about your computer so that we can work on making this better?" would be more like what we're after.
Severity: normal → S3