Crash after "Killing GPU process due to IPC reply timeout" with: Linux, nvidia, suspend/resume, WebRender
Categories
(Core :: Graphics: WebRender, defect, P3)
Tracking
Release | Tracking | Status
---|---|---
firefox70 | --- | disabled
People
(Reporter: jld, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: crash, hang, regression)
STR (unreliable): on Linux, using the nvidia proprietary driver (version 418.74) with WebRender, suspend/resume as for bug 1492580.
Recently the browser started hanging for an unusually long time after resume (no updates at all to the window), and sometimes it crashes completely with "[GFX1-]: Killing GPU process due to IPC reply timeout", followed by a lot of IPC channel errors (probably caused by killing the process). In that case the main process exits on SIGTERM; there are no error messages suggesting where that might come from.
I don't know what the regression window might be, or if the regressor might be a driver update rather than a browser change.
Comment hidden (offtopic)
Comment 2•5 years ago
(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #0)
and sometimes it crashes completely with
[GFX1-]: Killing GPU process due to IPC reply timeout
followed by a lot of IPC channel errors (probably caused by killing the process). In that case the main process exits on SIGTERM;
In the spirit of saying something rather than nothing:
In bug 1561976 I reported a full browser crash despite having a GPU process.
They think it might be related to bug 1562616. Crashes spiked after bug 1561178 landed.
Comment 3•5 years ago
(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #0)
[GFX1-]: Killing GPU process due to IPC reply timeout
there are no error messages suggesting where that might come from.
I don't know what the regression window might be
Bugzilla search has three results: this bug, bug 1560598 (already in "See Also"), and bug 1564127:
There is a problem regarding GPU / RDD / VR process: bug 1564127 comment 6
Did you enable dom.vr.process.enabled? Or it might be just an RDD/GPU process problem for you?
Comment 4•5 years ago
(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #3)
(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #0)
[GFX1-]: Killing GPU process due to IPC reply timeout
there are no error messages suggesting where that might come from.
I don't know what the regression window might be
Bugzilla search has three results: this bug, bug 1560598 (already in "See Also"), and bug 1564127:
There is a problem regarding GPU / RDD / VR process: bug 1564127 comment 6
Did you enable dom.vr.process.enabled? Or it might be just an RDD/GPU process problem for you?
VR process is only enabled in Windows, and IIRC, FF Linux also has no GPU process.
Comment 5•5 years ago
(In reply to Daosheng Mu[:daoshengmu] from comment #4)
VR process is only enabled in Windows, and IIRC, FF Linux also has no GPU process.
I had the vr process enabled in the past, so you have hidden it: bug 1513022
On X11 there is a GPU process by default: bug 1549965 (Wayland doesn't support it at the moment.)
Comment 6•5 years ago (Reporter)
I happened to be running top -H when this happened, and I noticed a thread named Renderer using CPU for a while after resume and before Firefox either became responsive again or crashed. Here's a profile.
As for VR, I have dom.vr.enabled set to false.
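For anyone trying to reproduce the observation without watching top -H interactively, here is a minimal sketch (in Python, not taken from Firefox) that reads per-thread CPU time from Linux procfs. The function names `parse_thread_stat` and `busiest_threads` are hypothetical. It reports cumulative clock ticks rather than the live percentage top shows, so sampling twice around the resume and comparing the numbers is the assumed workflow.

```python
import os


def parse_thread_stat(stat_line: str) -> tuple[str, int]:
    """Return (thread name, utime + stime in clock ticks) from a procfs stat line.

    The name is wrapped in parentheses and may itself contain spaces or ')',
    so everything up to the LAST ')' is treated as "pid (name".
    """
    head, _, tail = stat_line.rpartition(")")
    name = head.split("(", 1)[1]
    fields = tail.split()
    # tail starts at field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(fields[11]), int(fields[12])
    return name, utime + stime


def busiest_threads(pid: int, proc_root: str = "/proc") -> list[tuple[int, str, int]]:
    """List (tid, name, cpu ticks) for every thread of `pid`, busiest first."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    rows = []
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                name, ticks = parse_thread_stat(f.read())
        except OSError:
            continue  # the thread exited between listdir() and open()
        rows.append((int(tid), name, ticks))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Calling `busiest_threads(pid)` once before suspend and once after resume, then diffing the tick counts, should show whether a thread such as Renderer accounts for the time.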
Comment 7•5 years ago (Reporter)
I've figured out another part of the problem: GPUProcessHost::KillHard is killing pid 0, i.e., the entire process group of the caller. I'll file a bug to stop accepting dangerous magic numbers in base::KillProcess. (Note that if the in-band null had been mozilla::ipc::kInvalidProcessHandle rather than 0, Firefox would instead have killed all my other processes and logged me out.)
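To make the hazard concrete: POSIX kill() treats pid 0 as "every process in the caller's process group", pid -1 as "every process the caller is allowed to signal", and any other negative pid as a process group ID. A minimal sketch of the kind of guard being proposed follows, written in Python rather than Gecko's C++. The names `checked_kill` and `UnsafeKillTarget` and the injectable `_kill` parameter are hypothetical and are not the actual base::KillProcess change.

```python
import os
import signal


class UnsafeKillTarget(ValueError):
    """Raised when a pid would address more than one process."""


def checked_kill(pid: int, sig: int = signal.SIGTERM, _kill=os.kill) -> None:
    """Send `sig` to exactly one process, refusing the special pid values.

    pid == 0  -> the caller's whole process group
    pid == -1 -> every process the caller may signal
    pid < -1  -> the process group with ID -pid
    None of these is ever what "kill this one child" means, so reject them
    before they reach the system call.
    """
    if pid <= 0:
        raise UnsafeKillTarget(f"refusing to signal special pid {pid}")
    _kill(pid, sig)
```

The `_kill` parameter exists only so the guard can be exercised without signalling a real process.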
Comment 8•5 years ago (Reporter)
ni? me to file that bug, do something about the misdirected kill, and see if the GPU process is successfully restarted when I do the STR with that patch
Comment 9•5 years ago
(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #7)
is killing pid 0, i.e., the entire process group of the caller
Is that bug 1314711?
Comment 10•5 years ago (Reporter)
(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #9)
(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #7)
is killing pid 0, i.e., the entire process group of the caller
Is that bug 1314711?
Yes, it is. I'll comment over there with more details.
Comment 11•5 years ago (Reporter)
The GPU process does restart with the incorrect kill prevented (either with a patch or by having a debugger attached with a breakpoint on kill and changing the arguments). With the default prefs it will still hang for up to 10 seconds and then turn off the GPU process after the 3rd time it goes over the timeout, but that's an improvement over crashing.
So it's still kind of a problem that we're spending many seconds in a method unassumingly named bind_framebuffer, but I don't know if there's anything we can do about it and/or if it's worth trying to ask our contacts at NV.
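The behaviour described above (restart on each timed-out reply, then give up on the GPU process after the third) can be modelled as a small state machine. This is only an illustrative sketch in Python: the class `GpuProcessPolicy`, its fields, and the returned strings are hypothetical. The 10-second and 3-strike values are taken from this comment, not from the actual pref names or Gecko code.

```python
from dataclasses import dataclass


@dataclass
class GpuProcessPolicy:
    reply_timeout_s: float = 10.0  # assumed from "hang for up to 10 seconds"
    max_timeouts: int = 3          # assumed from "after the 3rd time"
    timeouts_seen: int = 0
    enabled: bool = True

    def on_reply(self, elapsed_s: float) -> str:
        """Decide what to do after waiting `elapsed_s` seconds for an IPC reply."""
        if not self.enabled:
            return "in-process"    # GPU process already turned off
        if elapsed_s <= self.reply_timeout_s:
            return "ok"
        self.timeouts_seen += 1
        if self.timeouts_seen >= self.max_timeouts:
            self.enabled = False   # stop using a separate GPU process
            return "disable"
        return "restart"           # kill and relaunch the GPU process
```

Under these assumptions, three slow resumes in one session would cost up to roughly 30 seconds of hanging in total before the fallback takes effect.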
Comment 12•3 years ago
Likely bug 1656361.