Closed Bug 1564976 Opened 5 years ago Closed 3 years ago

Crash after "Killing GPU process due to IPC reply timeout" with: Linux, nvidia, suspend/resume, WebRender

Categories

(Core :: Graphics: WebRender, defect, P3)

Desktop
Linux
defect

Tracking

()

RESOLVED DUPLICATE of bug 1656361
Tracking Status
firefox70 --- disabled

People

(Reporter: jld, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash, hang, regression)

STR (unreliable): on Linux, using the nvidia proprietary driver (version 418.74) with WebRender, suspend/resume as for bug 1492580.

Recently the browser started hanging for an unusually long time after resume (no updates at all to the window), and sometimes it crashes completely with [GFX1-]: Killing GPU process due to IPC reply timeout followed by a lot of IPC channel errors (probably caused by killing the process). In that case the main process exits on SIGTERM; there are no error messages suggesting where that might come from.

I don't know what the regression window might be, or if the regressor might be a driver update rather than a browser change.

Priority: -- → P3
Severity: normal → critical
Component: Graphics → Graphics: WebRender
Keywords: crash, hang, regression
OS: Unspecified → Linux
Hardware: Unspecified → Desktop

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #0)

and sometimes it crashes completely with [GFX1-]: Killing GPU process due to IPC reply timeout followed by a lot of IPC channel errors (probably caused by killing the process). In that case the main process exits on SIGTERM;

In the spirit of better saying something than nothing:
In bug 1561976 I reported a full browser crash despite having a GPU process.
They think it might be related to bug 1562616. Crashes spiked after landing bug 1561178.

Priority: -- → P3

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #0)

[GFX1-]: Killing GPU process due to IPC reply timeout

there are no error messages suggesting where that might come from.

I don't know what the regression window might be

Bugzilla search has three results: this bug, bug 1560598 (already in "See Also"), and bug 1564127:

There is a problem regarding GPU / RDD / VR process: bug 1564127 comment 6
Did you enable dom.vr.process.enabled? Or it might be just an RDD/GPU process problem for you?

See Also: → 1564127

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #3)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #0)

[GFX1-]: Killing GPU process due to IPC reply timeout

there are no error messages suggesting where that might come from.

I don't know what the regression window might be

Bugzilla search has three results: this bug, bug 1560598 (already in "See Also"), and bug 1564127:

There is a problem regarding GPU / RDD / VR process: bug 1564127 comment 6
Did you enable dom.vr.process.enabled? Or it might be just an RDD/GPU process problem for you?

VR process is only enabled in Windows, and IIRC, FF Linux also has no GPU process.

(In reply to Daosheng Mu[:daoshengmu] from comment #4)

VR process is only enabled in Windows, and IIRC, FF Linux also has no GPU process.

I had the vr process enabled in the past, so you have hidden it: bug 1513022
On X11 there is a GPU process by default: bug 1549965 (Wayland doesn't support it at the moment.)

I happened to be running top -H when this happened, and I noticed a thread named Renderer using CPU for a while after resume and before Firefox either became responsive again or crashed. Here's a profile.

As for VR, I have dom.vr.enabled set to false.

I've figured out another part of the problem: GPUProcessHost::KillHard is killing pid 0, i.e., the entire process group of the caller. I'll file a bug to stop accepting dangerous magic numbers in base::KillProcess. (Note that if the in-band null had been mozilla::ipc::kInvalidProcessHandle rather than 0, Firefox would instead have killed all my other processes and logged me out.)
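To illustrate why 0 is a dangerous value here: POSIX kill() treats a pid of 0 as "every process in the caller's process group" and a pid of -1 as "every process the caller is allowed to signal". The following is only a minimal sketch of the kind of guard described above. SafeKill is a hypothetical name, not the real base::KillProcess, and the actual fix may look quite different.

```cpp
#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper (not Firefox's base::KillProcess).
// Sends `sig` to exactly one process, refusing the "magic" pid values
// that POSIX kill() interprets as groups of processes:
//   pid == 0  -> every process in the caller's process group
//   pid == -1 -> every process the caller may signal
//   pid < -1  -> every process in process group -pid
bool SafeKill(pid_t pid, int sig) {
  if (pid <= 0) {
    errno = EINVAL;  // reject instead of signalling a whole group
    return false;
  }
  return kill(pid, sig) == 0;
}
```

With a guard like this, a caller holding an unset handle gets a failure it can log instead of terminating its own process group.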

ni? me to file that bug, do something about the misdirected kill, and see if the GPU process is successfully restarted when I do the STR with that patch

Flags: needinfo?(jld)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #7)

is killing pid 0, i.e., the entire process group of the caller

Is that bug 1314711?

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #9)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #7)

is killing pid 0, i.e., the entire process group of the caller

Is that bug 1314711?

Yes, it is. I'll comment over there with more details.

See Also: 1564127, 1314711
See Also: → 1568291

The GPU process does restart with the incorrect kill prevented (either with a patch or by having a debugger attached with a breakpoint on kill and changing the arguments). With the default prefs it will still hang for up to 10 seconds and then turn off the GPU process after the 3rd time it goes over, but that's an improvement over crashing.
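The behaviour described above (a reply timeout of about 10 seconds, with the GPU process turned off after the third timeout) can be pictured roughly as follows. This is a sketch under stated assumptions: GpuTimeoutPolicy and its members are hypothetical names, the two numbers are taken from the comment rather than from the Firefox source, and the real logic is driven by prefs.

```cpp
#include <chrono>

// Hypothetical model of the policy described in the comment; not Firefox code.
struct GpuTimeoutPolicy {
  std::chrono::seconds replyTimeout{10};  // assumed default, per the comment
  int maxTimeouts = 3;                    // assumed default, per the comment
  int timeouts = 0;
  bool gpuProcessEnabled = true;

  // Reports how long an IPC reply took. Returns true if this reply counts
  // as a timeout, i.e. the GPU process would be killed. On the
  // maxTimeouts-th timeout the GPU process is also disabled.
  bool OnReply(std::chrono::milliseconds elapsed) {
    if (!gpuProcessEnabled || elapsed < replyTimeout) {
      return false;
    }
    ++timeouts;
    if (timeouts >= maxTimeouts) {
      gpuProcessEnabled = false;  // stop using a GPU process from here on
    }
    return true;
  }
};
```

In this model the first two timeouts lead to a kill and restart, and the third one also turns the GPU process off, which matches the "hang, then fall back" sequence reported here.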

So it's still kind of a problem that we're spending many seconds in a method unassumingly named bind_framebuffer, but I don't know if there's anything we can do about it and/or if it's worth trying to ask our contacts at NV.

Flags: needinfo?(jld)

Likely bug 1656361.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → DUPLICATE