Crash parent process instead of disabling GPU process at runtime
Categories
(Core :: Graphics, task)
Tracking
()
Tracking | Status | |
---|---|---|
firefox123 | --- | fixed |
People
(Reporter: jnicol, Assigned: jnicol)
References
Details
Attachments
(3 files)
Currently if the GPU process crashes enough times (without being declared stable) then we eventually disable it.
We should consider crashing the parent process instead:
- Disabling the GPU process loses the security benefits
- We fallback to a lesser-tested code path
- We have poor visibility of this problem occurring without collecting crash reports
We believe the issues we're seeing in bug 1866555 occur following the GPU process disabled (at least, that had been the case for at least one user who is affected), so this could help find the underlying cause.
Of course this could lead to a gigantic spike in crashes, so we will need to monitor it closely.
Comment 1•2 years ago
•
|
||
Additional things to consider:
- Adding CI test variants that disable the GPU process for ASAN/etc testing on Windows; ASAN/TSAN on Linux as well if Linux ever ships the GPU process. Similar tests for Android.
- Would falling back to software WebRender only in the parent process if we ever tried the GPU process reasonable? It would lower the attack surface.
- Don't we collect telemetry on the GPU_PROCESS feature status?
- We could tweak the fallback algorithm further without always crashing. If it ever was declared stable, then we could never take the GPU process away, and crash the parent if it has N GPU process crashes in a row that weren't stable.
Assignee | ||
Comment 2•2 years ago
|
||
If we reach the limit of unstable GPU process launch attempts, then
crash the parent process rather than disabling the GPU
process. Disabling the GPU process loses the security benefits of the
process, and switches users to a lesser-tested code path.
Additionally, crashing the parent process will give us more visibility
into any underlying issues via crash reports.
If a stable process has never been launched successfully, however,
then do allow the process to be disabled. Additionally, reset the
unstable launch counter after a stable launch, so that we are less
eager to either disable it or crash.
This is controllable via a pref layers.gpu-process.allow-disabling.
Assignee | ||
Comment 3•2 years ago
|
||
Add probes for total number of launch attempts and unstable number of
launch attempts. Port expired CRASH_FALLBACKS probe to glean and renew
it, telling us the number of times we handled each crash fallback with
each method. And finally, add a probe to tell us the status of the GPU
process gfxFeature - this is already available on desktop via the
telemetry environment, but adding a glean probe allows us to gather
the data on Android too.
Depends on D196467
Assignee | ||
Comment 4•2 years ago
|
||
Comment 5•2 years ago
|
||
Comment on attachment 9369619 [details]
data_review_request.md
PRELIMINARY NOTES:
Did you know that, since these metrics are Glean, you can use ./mach data-review
to generate a partially-filled Data Review Request template?
DATA COLLECTION REVIEW RESPONSE:
Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
Yes.
Is there a control mechanism that allows the user to turn the data collection on and off?
Yes. This collection can be controlled through the product's preferences.
If the request is for permanent data collection, is there someone who will monitor the data over time?
Yes, :jnicol is responsible.
Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Category 1, Technical.
Is the data collection request for default-on or default-off?
Default on for all channels.
Does the instrumentation include the addition of any new identifiers?
No.
Is the data collection covered by the existing Firefox privacy notice?
Yes.
Does the data collection use a third-party collection tool?
No.
Result: datareview+
Assignee | ||
Comment 6•2 years ago
|
||
I'm going to land the patch to add telemetry now, but leave the force-crash one until after the holidays so I don't ruin somebody's christmas
Comment 8•2 years ago
|
||
bugherder |
Description
•