Closed Bug 1639544 Opened 5 years ago Closed 4 years ago

Firefox sometimes freezes / hangs when loading pages with video on YouTube

Categories

(Core :: Graphics, defect)

defect

Tracking


RESOLVED INCOMPLETE
Tracking Status
firefox76 --- wontfix
firefox77 --- wontfix
firefox78 --- affected
firefox79 --- affected

People

(Reporter: whimboo, Unassigned)

References

Details

(Keywords: hang)

Attachments

(1 file)

User-Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0

Navigating on YouTube can sometimes cause a freeze / hang of Firefox when loading pages with a video element embedded.

Steps:

  1. Open https://www.youtube.com/
  2. Search for anything
  3. Click a search result
  4. Wait a second to see if loading the video freezes Firefox
  5. If not, go back to the search results and try the next result
  6. Repeat until Firefox freezes

Here is a profile from the freeze: https://perfht.ml/2AMFD5x

Florian, do you have an idea?

Flags: needinfo?(florian)

Here are the details of the graphics card:

Active 	Yes
Description 	AMD Radeon(TM) R5 Graphics
Vendor ID 	0x1002
Device ID 	0x9874
Driver Version 	16.300.2701.0
Driver Date 	8-9-2016
Drivers 	aticfx64 aticfx64 aticfx64 amdxc64 aticfx32 aticfx32 aticfx32 amdxc32 atiumd64 atidxx64 atidxx64 atiumdag atidxx32 atidxx32 atiumdva atiumd6a atitmm64
Subsys ID 	00000000
RAM 	512

Note that this is a very old driver version.

It looks a lot like a GPU process died, and the hang is the time we waited until a new one got created. I see a "PCompositorBridge::Msg_NotifyChildRecreated — sent to Compositor (Thread ID: 7412)" IPC after the hang.

Component: General → Graphics
Flags: needinfo?(florian)
Product: Firefox → Core

In that profile (https://perfht.ml/2zS2RGI) you can see that it took more than 400s to recreate the GPU process. During that time the whole Firefox UI is frozen.

Is there anything else which I can provide to further investigate the problem?

Flags: needinfo?(matt.woodrow)

Henrik, it would be great to test on Nightly.
Ideally, you could help us find the mozregression range where this broke (and/or got fixed).
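For reference, a minimal mozregression run for such a bisection could look like the following; the --good/--bad values are only placeholders and would need to be replaced with a known-good and known-bad release (or date):

  # install mozregression and bisect between two releases (adjust the bounds as needed)
  pip install mozregression
  mozregression --good 75 --bad 77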
Also, please attach the full "about:support".

Flags: needinfo?(hskupin)
Attached file about:support

Here is the full output from about:support. Note that the included failure log is interesting:

(#0) Error  Killing GPU process due to IPC reply timeout
(#1) Error  Failed buffer for 0, 0, 1920, 1048
(#2) Error  Failed buffer for 1903, 0, 17, 938
(#3) Error  Failed buffer for 0, 0, 17, 314
(#4) Error  Failed buffer for 0, 0, 1903, 2608
(#5) Error  Receive IPC close with reason=AbnormalShutdown
(#6) Error  Receive IPC close with reason=AbnormalShutdown
(#7)    CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown
(#8)    CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown

So the GPU process got killed due to an IPC timeout. Why does it take such a long time to get it restarted? Is there maybe a preference to shorten the time?

I'm not saying this is a regression, and I cannot easily run various tests with different versions of Firefox because that machine is in someone else's office. So I hope the above information helps to diagnose it better. And maybe Matt can have a look at it.

Flags: needinfo?(hskupin)

It would be good to know if this is reproducible on Nightly, and whether enabling WebRender via the gfx.webrender.all pref changes this in any way.
Otherwise I'm not sure how to prioritize this yet.
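As a sketch for the gfx.webrender.all experiment mentioned above (assuming a Unix-like shell and a standard profile directory; the path is only a placeholder), the pref can be forced from the profile's user.js so it survives restarts:

  # append the override to user.js, then restart Firefox
  echo 'user_pref("gfx.webrender.all", true);' >> /path/to/profile/user.js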

Is it possible that the GPU process hung, and we tried to send it a sync message from the parent process (SendReceiveMouseInputEvent), which blocked for 400s waiting on the hung process?

Then we finally kill the hung process, and quickly restart it, leading to things working again.

So maybe it's not that it took a long time to restart the process, but instead it took us a long time to realize that we needed to do so (and that it broke in the first place)?

Flags: needinfo?(matt.woodrow)

Thanks Matt, that makes sense.

Henrik,
The following experiments would help:

  1. test on Nightly
  2. test with WebRender enabled (via gfx.webrender.all, needs browser restart)
  3. record a Gecko profile of the hang, but stop recording before the hang is resolved. We want to see what the GPU process is doing during the hang; if the recording only stops after the GPU process is restarted, we appear to only get activity from the new GPU process.

Marking as S3 given the lack of understanding of how widespread this problem is. Ready to bump it if needed!

Severity: -- → S3
Flags: needinfo?(hskupin)

(In reply to Dzmitry Malyshau [:kvark] from comment #8)

  1. record a Gecko profile of the hang, but stop recording before the hang is resolved. We want to see what the GPU process is doing during the hang; if the recording only stops after the GPU process is restarted, we appear to only get activity from the new GPU process.

Note that I do NOT have a way to just stop the profiler. As mentioned above, the whole Firefox UI is frozen, so no event processing takes place. Or does WebRender get rid of this problem and only hang the affected background processes, not the main thread? I don't have any other idea how to stop the profiler before the new GPU process has been created.

I actually wonder why we are losing the recorded data from the former (hung) GPU process. It would be great to keep it as a separate row. Julien, any idea if that is possible?

Flags: needinfo?(hskupin)
Flags: needinfo?(felash)
Flags: needinfo?(dmalyshau)

Yes, unfortunately we can't get a Gecko profile of a thread that truly hangs, since it will not send the data over (as noted by :mstange in the #gfx room).

Flags: needinfo?(dmalyshau)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #9)

I actually wonder why we are losing the recorded data from the former (hung) GPU process.

Child processes send their profiling data to the parent when they shut down, or when the parent process sends an IPC requesting the profile. When a child process gets killed, we are in neither of these cases.

It would be great to keep it as a separate row. Julien, any idea if that is possible?

It's not possible right now. In some cases where a process appears unresponsive (e.g. the main thread is busy with JS code running in an infinite loop) but is still processing IPC, if we changed the code that kills the unresponsive process to first request the profile, we might get the profile. In cases where the child process isn't processing IPC anymore (like we saw in bug 1629824), we still wouldn't get anything.
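As a possible workaround for not being able to stop the recording through the frozen UI, profiling can also be driven purely by environment variables, so the profile gets written automatically when Firefox exits; note this is only a sketch (Unix-like shell syntax, example output path) and it still only contains data that child processes managed to hand over, so a killed GPU process would be missing either way:

  # start profiling at launch and dump the profile on shutdown
  MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_SHUTDOWN=/tmp/hang-profile.json ./firefox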

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #9)

I actually wonder why we are losing the recorded data from the former (hung) GPU process. It would be great to keep it as a separate row. Julien, any idea if that is possible?

Gerald will know better, redirecting the request to him :-)

Flags: needinfo?(felash) → needinfo?(gsquelart)

(In reply to Dzmitry Malyshau [:kvark] from comment #10)

Yes, unfortunately we can't get a Gecko profile of a thread that truly hangs, since it will not send the data over (as noted by :mstange in the #gfx room).

Dzmitry, would you mind telling me again which MOZ_LOG entries I should use when testing again with a debug build of Firefox? I recall something about IPC, but I'm not sure if there is a prefix, or whether there are others. Thanks!

Flags: needinfo?(dmalyshau)
See Also: → 1643016

I have an update here. When I tried to reproduce the problem today with a Nightly debug build, I was not able to crash the GPU process. Maybe there is some race condition involved here and debug builds are simply too slow to hit it?

Anyway, a 79a1 Nightly opt build actually let me catch a content crash which is related to the creation of shared memory. I filed that as bug 1643016. The crash happened when Firefox froze, so I hope this is actually helpful. Matt, can you have a look at this?

Also I recorded a new profile, although hitting the freeze wasn't that easy; while navigating forward and backward on some pages I finally reproduced it ~5 minutes later. Note that there is still no data from the former (crashed) GPU process. Even with WebRender enabled the whole UI still hangs and doesn't let me stop the profiler. So I'm not sure if you can find new information, but here it is:

https://share.firefox.dev/3cv1JH1

Due to time limitations I was not able to create log files via MOZ_LOG. I could do that next time, and maybe by then we'll know more about this crash (if it's related) and might have it fixed?

Flags: needinfo?(gsquelart) → needinfo?(matt.woodrow)

The MOZ_LOG question got answered by Matt on #gfx:mozilla.org:

Maybe "ipc", and some of the "apz.*" ones?
(unsure if wildcard actually works, or if you need to specify the apz modules individually)
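For reference, MOZ_LOG takes comma-separated module:level pairs and the output can be redirected to a file via MOZ_LOG_FILE. The module names below are only the ones suggested above and may need adjusting (e.g. listing the apz modules individually if the wildcard doesn't work); syntax shown is for a Unix-like shell, on Windows the variables would be set before launching firefox.exe:

  # log IPC and APZ activity to a file while reproducing the hang
  MOZ_LOG="ipc:5,apz.inputqueue:5" MOZ_LOG_FILE=/tmp/gfx-hang.log ./firefox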

Flags: needinfo?(dmalyshau)

It looks like the crash happens because we've allocated 500 million shmems, and have run out of identifiers...

I can't see how we'd consume anywhere near that many in 5 minutes, especially since most usages are using a pool.
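(For scale: 500 million shmems in roughly five minutes would be well over a million allocations per second.)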

Flags: needinfo?(matt.woodrow)

Note that the uptime in that case was only 222 seconds (3 minutes and 42 seconds) - see bp-a7df3eed-20bd-4b27-8c19-a73620200603.

Could that be related to the profiler? Actually I'm not sure if it was running in this particular Firefox session.

Matt, do you think that the crash I was seeing could be the reason why the GPU process stopped working? Or was it maybe only a side-effect? If it's the reason, I might be blocked on further investigation. Otherwise I might want to try again to create a log file with the previously mentioned MOZ_LOG options set.

Flags: needinfo?(matt.woodrow)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #18)

Matt, do you think that the crash I was seeing could be the reason why the GPU process stopped working? Or was it maybe only a side-effect? If it's the reason, I might be blocked on further investigation. Otherwise I might want to try again to create a log file with the previously mentioned MOZ_LOG options set.

Yeah, seems very likely to be related. Seems like it'd be worth trying to log some of the media IPC code to see what it's doing (but I don't think there is existing logging for that, sorry).

Flags: needinfo?(matt.woodrow)

(In reply to Matt Woodrow (:mattwoodrow) from comment #19)

Yeah, seems very likely to be related. Seems like it'd be worth trying to log some of the media IPC code to see what it's doing (but I don't think there is existing logging for that, sorry).

Ok, so how valuable would further investigation be here? The user of that machine actually wants YouTube to work and I don't want to hold off any longer. Also it looks like we won't get much traction on bug 1643967 in the next few weeks. As such I would upgrade the graphics drivers (right now a version from 2016 is installed!), but that could also mean the problem will no longer be reproducible.

Flags: needinfo?(matt.woodrow)

I think we need someone from the graphics team to be able to reproduce this, to try to narrow it down further, sorry.

Flags: needinfo?(matt.woodrow)

With the graphics driver update the problem is gone and no longer reproducible. Not sure how easy it would be to reproduce it with a similar setup. Maybe we should close this bug as incomplete for now.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE
See Also: → 1663227