Closed Bug 1564976 Opened 5 years ago Closed 3 years ago

Crash after "Killing GPU process due to IPC reply timeout" with: Linux, nvidia, suspend/resume, WebRender


Status:

RESOLVED DUPLICATE of bug 1656361

Tracking Flags:

firefox70: Tracking ---, Status disabled

People

(Reporter: jld, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash, hang, regression)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Description

•

5 years ago

STR (unreliable): on Linux, using the nvidia proprietary driver (version 418.74) with WebRender, suspend/resume as for bug 1492580.

Recently the browser started hanging for unusually long after resume (no updates at all to the window), and sometimes it crashes completely with [GFX1-]: Killing GPU process due to IPC reply timeout followed by a lot of IPC channel errors (probably caused by killing the process). In that case the main process exits on SIGTERM; there are no error messages suggesting where that might come from.

I don't know what the regression window might be, or if the regressor might be a driver update rather than a browser change.

Timothy Nikkel (:tnikkel)

Updated

•

5 years ago

Priority: -- → P3

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Updated

•

5 years ago

Blocks: wr-nv-linux

Darkspirit

Updated

•

5 years ago

Severity: normal → critical

status-firefox70: --- → disabled

Component: Graphics → Graphics: WebRender

Keywords: crash, hang, regression

OS: Unspecified → Linux

Hardware: Unspecified → Desktop

Comment hidden (offtopic)

Darkspirit

Comment 2

•

5 years ago

•

Edited

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #0)

> and sometimes it crashes completely with [GFX1-]: Killing GPU process due to IPC reply timeout followed by a lot of IPC channel errors (probably caused by killing the process). In that case the main process exits on SIGTERM;

In the spirit of saying something rather than nothing:
In bug 1561976 I reported a full browser crash despite having a GPU process.
They think it might be related to bug 1562616. Crashes spiked after landing bug 1561178.

Priority: -- → P3

Darkspirit

Comment 3

•

5 years ago

•

Edited

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #0)

> [GFX1-]: Killing GPU process due to IPC reply timeout
>
> there are no error messages suggesting where that might come from.
>
> I don't know what the regression window might be

Bugzilla search has three results: this bug, bug 1560598 (already in "See Also"), and bug 1564127:

There is a problem regarding GPU / RDD / VR process: bug 1564127 comment 6
Did you enable dom.vr.process.enabled? Or it might be just an RDD/GPU process problem for you?

Darkspirit

Updated

•

5 years ago

Daosheng Mu[:daoshengmu]

Comment 4

•

5 years ago

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #3)

> (In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #0)
>
> > [GFX1-]: Killing GPU process due to IPC reply timeout
> >
> > there are no error messages suggesting where that might come from.
> >
> > I don't know what the regression window might be
>
> Bugzilla search has three results: this bug, bug 1560598 (already in "See Also"), and bug 1564127:
>
> There is a problem regarding GPU / RDD / VR process: bug 1564127 comment 6
> Did you enable dom.vr.process.enabled? Or it might be just an RDD/GPU process problem for you?

VR process is only enabled in Windows, and IIRC, FF Linux also has no GPU process.

Darkspirit

Comment 5

•

5 years ago

•

Edited

(In reply to Daosheng Mu[:daoshengmu] from comment #4)

> VR process is only enabled in Windows, and IIRC, FF Linux also has no GPU process.

I had the vr process enabled in the past, so you have hidden it: bug 1513022
On X11 there is a GPU process by default: bug 1549965 (Wayland doesn't support it at the moment.)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 6

•

5 years ago

I happened to be running top -H when this happened, and I noticed a thread named Renderer using CPU for a while after resume and before Firefox either became responsive again or crashed. Here's a profile.
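
For anyone trying to reproduce the observation above without an interactive top -H session, here is a minimal sketch that reads the same per-thread data from Linux procfs. The function name thread_cpu_seconds is made up for illustration; the only assumptions are the standard /proc/<pid>/task/<tid>/stat layout and that the process of interest is readable by the current user.

```python
import os


def thread_cpu_seconds(pid: int) -> dict[int, tuple[str, float]]:
    """Map each thread id of pid to (thread name, CPU seconds used so far)."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    task_dir = f"/proc/{pid}/task"
    result: dict[int, tuple[str, float]] = {}
    try:
        tids = os.listdir(task_dir)
    except OSError:
        return result  # no procfs here, or the process is gone
    for tid in tids:
        try:
            with open(f"{task_dir}/{tid}/stat", errors="replace") as f:
                stat = f.read()
        except OSError:
            continue  # thread exited while we were iterating
        # The name sits between the first "(" and the last ")"; it may itself
        # contain spaces or parentheses, so split around those delimiters.
        name = stat[stat.index("(") + 1 : stat.rindex(")")]
        fields = stat[stat.rindex(")") + 2 :].split()
        # fields[0] is stat field 3 (state); utime and stime are fields 14 and 15.
        utime, stime = int(fields[11]), int(fields[12])
        result[int(tid)] = (name, (utime + stime) / ticks_per_second)
    return result
```

Calling it twice a few seconds apart and comparing the totals shows which thread is consuming CPU in between, for example one named Renderer as described above.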

As for VR, I have dom.vr.enabled set to false.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 7

•

5 years ago

I've figured out another part of the problem: GPUProcessHost::KillHard is killing pid 0, i.e., the entire process group of the caller. I'll file a bug to stop accepting dangerous magic numbers in base::KillProcess. (Note that if the in-band null had been mozilla::ipc::kInvalidProcessHandle rather than 0, Firefox would instead have killed all my other processes and logged me out.)
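
To make the hazard concrete: kill(2) treats non-positive pids as group or broadcast targets, so a "null" handle of 0 reaches the caller's whole process group, and a value of -1 (which the comment above implies is what kInvalidProcessHandle would amount to) reaches every process the caller may signal. The following is a minimal sketch of the kind of guard being proposed, written in Python for brevity; safe_kill and demo are hypothetical names and this is not the actual base::KillProcess patch.

```python
import os
import signal
import subprocess
import sys


def safe_kill(pid: int, sig: int = signal.SIGKILL) -> bool:
    """Signal exactly one process; return False instead of signalling a group."""
    # kill(2) gives non-positive pids special meanings:
    #   pid == 0  -> every process in the caller's process group
    #   pid == -1 -> every process the caller is allowed to signal
    #   pid < -1  -> every process in process group abs(pid)
    if pid <= 0:
        return False
    try:
        os.kill(pid, sig)
    except (ProcessLookupError, PermissionError):
        return False
    return True


def demo() -> int:
    """Start a throwaway child, terminate it by pid, and return its exit status."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    assert safe_kill(child.pid, signal.SIGTERM)
    return child.wait()
```

Rejecting the magic values at the lowest-level wrapper means a caller that passes an uninitialized or already-cleared handle gets a failed kill rather than taking down its own process group.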

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 8

•

5 years ago

ni? me to file that bug, do something about the misdirected kill, and see if the GPU process is successfully restarted when I do the STR with that patch

Flags: needinfo?(jld)

Darkspirit

Comment 9

•

5 years ago

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #7)

> is killing pid 0, i.e., the entire process group of the caller

Is that bug 1314711?

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 10

•

5 years ago

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #9)

> (In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #7)
>
> > is killing pid 0, i.e., the entire process group of the caller
>
> Is that bug 1314711?

Yes, it is. I'll comment over there with more details.

Updated

•

5 years ago

Comment 11

•

5 years ago

The GPU process does restart with the incorrect kill prevented (either with a patch or by having a debugger attached with a breakpoint on kill and changing the arguments). With the default prefs it will still hang for up to 10 seconds and then turn off the GPU process after the 3rd time it goes over, but that's an improvement over crashing.
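
The behaviour described above (wait up to 10 seconds for a reply, restart the GPU process on a timeout, give up on it after the third timeout) can be modelled roughly as follows. This is an illustrative sketch only: the class, its field names, and the returned strings are hypothetical and do not correspond to Firefox code or pref names; only the 10 second and 3 strikes figures come from the comment.

```python
from dataclasses import dataclass


@dataclass
class GpuProcessSupervisor:
    reply_timeout_s: float = 10.0  # "hang for up to 10 seconds"
    max_timeouts: int = 3          # "after the 3rd time it goes over"
    timeouts_seen: int = 0
    enabled: bool = True

    def on_reply(self, elapsed_s: float) -> str:
        """Classify one IPC round trip and update the strike count."""
        if not self.enabled:
            return "gpu-process-off"
        if elapsed_s < self.reply_timeout_s:
            return "ok"
        self.timeouts_seen += 1
        if self.timeouts_seen >= self.max_timeouts:
            self.enabled = False  # stop using the GPU process from now on
            return "disabled"
        return "restart"  # kill the hung GPU process and start a fresh one
```

Under this model, each resume that stalls past the timeout costs one strike, which matches the report that the GPU process is turned off after the third overrun.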

So it's still kind of a problem that we're spending many seconds in a method unassumingly named bind_framebuffer, but I don't know if there's anything we can do about it and/or if it's worth trying to ask our contacts at NV.

Flags: needinfo?(jld)

Darkspirit

Comment 12

•

3 years ago

Likely bug 1656361.

Status: NEW → RESOLVED

Closed: 3 years ago

Resolution: --- → DUPLICATE


Categories

(Core :: Graphics: WebRender, defect, P3)
