Closed Bug 1639544 Opened 5 years ago Closed 4 years ago

Firefox sometimes freezes / hangs when loading pages with video on YouTube

Categories

(Core :: Graphics, defect)

defect

Tracking


RESOLVED INCOMPLETE
Tracking Status
firefox76 --- wontfix
firefox77 --- wontfix
firefox78 --- affected
firefox79 --- affected

People

(Reporter: whimboo, Unassigned)

References

Details

(Keywords: hang)

Attachments

(1 file)

User-Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0

Navigating on YouTube can sometimes cause a freeze / hang of Firefox when loading pages with a video element embedded.

Steps:

  1. Open https://www.youtube.com/
  2. Search for anything
  3. Click a search result
  4. Wait a second to see if loading the video freezes Firefox
  5. If not, go back to the search results and try the next result
  6. Repeat until Firefox freezes

Here is a profile from the freeze: https://perfht.ml/2AMFD5x

Florian, do you have an idea?

Flags: needinfo?(florian)

Here are the details of the graphics card:

Active 	Yes
Description 	AMD Radeon(TM) R5 Graphics
Vendor ID 	0x1002
Device ID 	0x9874
Driver Version 	16.300.2701.0
Driver Date 	8-9-2016
Drivers 	aticfx64 aticfx64 aticfx64 amdxc64 aticfx32 aticfx32 aticfx32 amdxc32 atiumd64 atidxx64 atidxx64 atiumdag atidxx32 atidxx32 atiumdva atiumd6a atitmm64
Subsys ID 	00000000
RAM 	512

Note that this is a very old driver version.

It looks a lot like a GPU process died, and the hang is the time we waited until a new one got created. I see a "PCompositorBridge::Msg_NotifyChildRecreated — sent to Compositor (Thread ID: 7412)" IPC after the hang.

Component: General → Graphics
Flags: needinfo?(florian)
Product: Firefox → Core

In that profile (https://perfht.ml/2zS2RGI) you can see that it took more than 400s to recreate the GPU process. During that time the whole Firefox UI is frozen.

Is there anything else which I can provide to further investigate the problem?

Flags: needinfo?(matt.woodrow)

Henrik, it would be great to test on Nightly.
Ideally, you could help us find the mozregression range where this broke (and/or got fixed).
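For reference, a minimal mozregression run for such a bisection could look like the following; the --good/--bad values are only placeholders and would need to be replaced with a known-good and known-bad release (or date):

  # install mozregression and bisect between two releases (adjust the bounds as needed)
  pip install mozregression
  mozregression --good 75 --bad 77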
Also, please attach the full "about:support".

Flags: needinfo?(hskupin)
Attached file about:support

Here is the full output from about:support. Note that the included failure log is interesting:

(#0) Error  Killing GPU process due to IPC reply timeout
(#1) Error  Failed buffer for 0, 0, 1920, 1048
(#2) Error  Failed buffer for 1903, 0, 17, 938
(#3) Error  Failed buffer for 0, 0, 17, 314
(#4) Error  Failed buffer for 0, 0, 1903, 2608
(#5) Error  Receive IPC close with reason=AbnormalShutdown
(#6) Error  Receive IPC close with reason=AbnormalShutdown
(#7)    CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown
(#8)    CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown

So the GPU process got killed due to an IPC timeout. Why does it take such a long time to get it restarted? Is there maybe a preference to shorten the time?

I'm not saying this is a regression, and I cannot easily run various tests with different versions of Firefox because that machine is in someone else's office. So I hope the above information helps to diagnose it better. And maybe Matt can have a look at it.

Flags: needinfo?(hskupin)

It would be good to know if this is reproducible on Nightly, and whether enabling WebRender via the gfx.webrender.all pref changes this in any way.
Otherwise I'm not sure how to prioritize this yet.
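As a sketch for the gfx.webrender.all experiment mentioned above (assuming a Unix-like shell and a standard profile directory; the path is only a placeholder), the pref can be forced from the profile's user.js so it survives restarts:

  # append the override to user.js, then restart Firefox
  echo 'user_pref("gfx.webrender.all", true);' >> /path/to/profile/user.js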

Is it possible that the GPU process hung, and we tried to send it a sync message from the parent process (SendReceiveMouseInputEvent), which blocked for 400s waiting on the hung process?

Then we finally kill the hung process, and quickly restart it, leading to things working again.

So maybe it's not that it took a long time to restart the process, but instead it took us a long time to realize that we needed to do so (and that it broke in the first place)?

Flags: needinfo?(matt.woodrow)

Thanks Matt, that makes sense.

Henrik,
The following experiments would help:

  1. test on Nightly
  2. test with WebRender enabled (via gfx.webrender.all, needs browser restart)
  3. record a Gecko profile of the hang, but stop recording before the hang is resolved. We want to see what the GPU process is doing during the hang; if the recording only stops after the GPU process is restarted, we appear to only get activity from the new GPU process.

Marking as S3 given the lack of understanding of how widespread this problem is. Ready to bump it if needed!

Severity: -- → S3
Flags: needinfo?(hskupin)

(In reply to Dzmitry Malyshau [:kvark] from comment #8)

  1. record a Gecko profile of the hang, but stop recording before the hang is resolved. We want to see what the GPU process is doing during the hang; if the recording only stops after the GPU process is restarted, we appear to only get activity from the new GPU process.

Note that I do NOT have a way to just stop the profiler. As mentioned above, the whole Firefox UI is frozen, so no event processing takes place. Or does WebRender get rid of this problem and only hang the affected background processes, not the main thread? I don't have any other idea how to stop the profiler before the new GPU process has been created.

I actually wonder why we are losing the recorded data from the former (hung) GPU process. It would be great to keep it as a separate row. Julien, any idea if that is possible?

Flags: needinfo?(hskupin)
Flags: needinfo?(felash)
Flags: needinfo?(dmalyshau)

Yes, unfortunately we can't get a Gecko profile of a thread that truly hangs, since it will not send the data over (as noted by :mstange in the #gfx room).

Flags: needinfo?(dmalyshau)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #9)

I actually wonder why we are losing the recorded data from the former (hung) GPU process.

Child processes send their profiling data to the parent when they shut down, or when the parent process sends an IPC requesting the profile. When a child process gets killed, we are in neither of these cases.

It would be great to keep it as a separate row. Julien, any idea if that is possible?

It's not possible right now. In some cases where a process appears unresponsive (e.g. the main thread is busy with JS code running in an infinite loop) but is still processing IPC, if we changed the code that kills the unresponsive process to first request the profile, we might get the profile. In cases where the child process isn't processing IPC anymore (like we saw in bug 1629824), we still wouldn't get anything.
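As a possible workaround for not being able to stop the recording through the frozen UI, profiling can also be driven purely by environment variables, so the profile gets written automatically when Firefox exits; note this is only a sketch (Unix-like shell syntax, example output path) and it still only contains data that child processes managed to hand over, so a killed GPU process would be missing either way:

  # start profiling at launch and dump the profile on shutdown
  MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_SHUTDOWN=/tmp/hang-profile.json ./firefox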

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #9)

I actually wonder why we are losing the recorded data from the former (hung) GPU process. It would be great to keep it as a separate row. Julien, any idea if that is possible?

Gerald will know better, redirecting the request to him :-)

Flags: needinfo?(felash) → needinfo?(gsquelart)

(In reply to Dzmitry Malyshau [:kvark] from comment #10)

Yes, unfortunately we can't get a Gecko profile of a thread that truly hangs, since it will not send the data over (as noted by :mstange in the #gfx room).

Dzmitry, would you mind telling me again which MOZ_LOG entries I should use when testing again with a debug build of Firefox? I recall something about IPC, but I'm not sure if there is a prefix, or whether there are others. Thanks!

Flags: needinfo?(dmalyshau)
See Also: → 1643016

I have an update here. When I tried to reproduce the problem today with a Nightly debug build, I was not able to crash the GPU process. Maybe there is some race condition involved here and debug builds are simply too slow to hit it?

Anyway, a 79a1 Nightly opt build actually let me catch a content crash which is related to the creation of shared memory. I filed that as bug 1643016. The crash happened when Firefox froze, so I hope this is actually helpful. Matt, can you have a look at this?

Also I recorded a new profile, although hitting the freeze wasn't that easy; while navigating forward and backward on some pages I finally reproduced it ~5 minutes later. Note that there is still no data from the former (crashed) GPU process. Even with WebRender enabled the whole UI still hangs and doesn't let me stop the profiler. So I'm not sure if you can find new information, but here it is:

https://share.firefox.dev/3cv1JH1

Due to time limitations I was not able to create log files via MOZ_LOG. I could do that next time, and maybe by then we'll know more about this crash (if it's related) and might have it fixed?

Flags: needinfo?(gsquelart) → needinfo?(matt.woodrow)

The MOZ_LOG question got answered by Matt on #gfx:mozilla.org:

Maybe "ipc", and some of the "apz.*" ones?
(unsure if wildcard actually works, or if you need to specify the apz modules individually)
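For reference, MOZ_LOG takes comma-separated module:level pairs and the output can be redirected to a file via MOZ_LOG_FILE. The module names below are only the ones suggested above and may need adjusting (e.g. listing the apz modules individually if the wildcard doesn't work); syntax shown is for a Unix-like shell, on Windows the variables would be set before launching firefox.exe:

  # log IPC and APZ activity to a file while reproducing the hang
  MOZ_LOG="ipc:5,apz.inputqueue:5" MOZ_LOG_FILE=/tmp/gfx-hang.log ./firefox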

Flags: needinfo?(dmalyshau)

It looks like the crash happens because we've allocated 500 million shmems, and have run out of identifiers...

I can't see how we'd consume anywhere near that many in 5 minutes, especially since most usages are using a pool.
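(For scale: 500 million shmems in roughly five minutes would be well over a million allocations per second.)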

Flags: needinfo?(matt.woodrow)

Note that the uptime in that case was only 222 seconds (3 minutes and 42 seconds) - see bp-a7df3eed-20bd-4b27-8c19-a73620200603.

Could that be related to the profiler? Actually I'm not sure if it was running in this particular Firefox session.

Matt, do you think that the crash I was seeing could be the reason why the GPU process stopped working? Or was it maybe only a side-effect? If it's the reason, I might be blocked on further investigation. Otherwise I might want to try again to create a log file with the previously mentioned MOZ_LOG options set.

Flags: needinfo?(matt.woodrow)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #18)

Matt, do you think that the crash I was seeing could be the reason why the GPU process stopped working? Or was it maybe only a side-effect? If it's the reason, I might be blocked on further investigation. Otherwise I might want to try again to create a log file with the previously mentioned MOZ_LOG options set.

Yeah, seems very likely to be related. Seems like it'd be worth trying to log some of the media IPC code to see what it's doing (but I don't think there is existing logging for that, sorry).

Flags: needinfo?(matt.woodrow)

(In reply to Matt Woodrow (:mattwoodrow) from comment #19)

Yeah, seems very likely to be related. Seems like it'd be worth trying to log some of the media IPC code to see what it's doing (but I don't think there is existing logging for that, sorry).

Ok, so how valuable would further investigation be here? The user of that machine actually wants YouTube to work and I don't want to hold off any longer. Also it looks like we won't get much traction on bug 1643967 in the next few weeks. As such I would upgrade the graphics drivers (right now a version from 2016 is installed!), but that could also mean the problem will no longer be reproducible.

Flags: needinfo?(matt.woodrow)

I think we need someone from the graphics team to be able to reproduce this, to try to narrow it down further, sorry.

Flags: needinfo?(matt.woodrow)

With the graphics driver update the problem is gone and no longer reproducible. Not sure how easy it would be to reproduce it with a similar setup. Maybe we should close this bug as incomplete for now.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE
See Also: → 1663227