Open Bug 1592378 Opened 20 days ago Updated 17 days ago

Firefox is causing random hangs with AMD Navi GPU on Linux

Categories

(Core :: Graphics: WebRender, defect, P3)

71 Branch
x86_64
Linux
defect

Tracking

()

UNCONFIRMED

People

(Reporter: shtetldik, Unassigned)

References

(Blocks 1 open bug)

Details

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0

Steps to reproduce:

Regular usage of Firefox causes random GPU hangs on a Sapphire RX 5700 XT on Debian testing. Many other AMD Navi users have confirmed the same behavior.

It's hard to narrow down the bug. It could be Firefox doing something with OpenGL in WebRender, it could be radeonsi, or it could be amdgpu itself. So I'm opening one for Firefox; maybe WebRender developers can help narrow it down.

Corresponding upstream bugs:

amdgpu: https://bugs.freedesktop.org/show_bug.cgi?id=111481
radeonsi: https://gitlab.freedesktop.org/mesa/mesa/issues/1910

OS: Debian testing Linux, KDE Plasma 5.14.5.
Kernel: 5.4.0-rc5 with 2 extra patches suggested by amdgpu developers.
Mesa: 19.2.1 (but hangs also with Mesa master).
WebRender: enabled.

OS: Unspecified → Linux
Hardware: Unspecified → x86_64

Bugbug thinks this bug should belong to this component, but please revert this change in case of error.

Component: Untriaged → Graphics: WebRender
Product: Firefox → Core

To clarify, the hangs are not very frequent and are rather random, but I noticed that they tend to happen more often when opening or closing a tab.

Switching between tabs tends to spike memory transfers from CPU to GPU and potentially texture allocations because there are usually a bunch of new images to show at once. Other than that I can't think of something specific happening during tab switching as far as rendering is concerned.

I am not very familiar with much of the terminology in the upstream bugs, but I see "DMA" popping up a lot in what affects the issue, which I suppose also points towards memory-related operations.

As it stands, this looks hard to act on from our side, although I'd love to hear recommendations for workarounds that don't change the expected behavior (for example, there are different ways to upload data, and we could use one that plays better with AMD hardware or drivers specifically).

Dzmitry, do we have AMD+Linux configs in Toronto on which we'd have a chance to reproduce the issue?

Flags: needinfo?(dmalyshau)
Priority: -- → P3

I can boot up a live Linux USB on our Ryzen machine with AMD GPU.

There is a lot to catch up on in those links... the first one (amdgpu) doesn't seem to be Firefox related?
Anyhow, looking at "dma_fence_wait_timeout" in the stack, that might be related to our texture transfers.

First thing I'd recommend testing is forcing the 256 alignment like we started doing for AMD on macOS: https://phabricator.services.mozilla.com/D48862
We can just expand this check to Linux and provide the try build to those who can reproduce.
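
To illustrate what forcing such an alignment means, here is a minimal sketch. It assumes the "256 alignment" refers to padding the byte stride of each uploaded texture row up to a multiple of 256; that reading, and the helper names `align_up` and `padded_stride`, are illustrative assumptions and not code taken from D48862 or WebRender.

```python
def align_up(value: int, alignment: int = 256) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Adding alignment - 1 and masking off the low bits rounds up.
    return (value + alignment - 1) & ~(alignment - 1)


def padded_stride(width_px: int, bytes_per_pixel: int, alignment: int = 256) -> int:
    """Hypothetical helper: byte stride of one row after padding."""
    return align_up(width_px * bytes_per_pixel, alignment)


if __name__ == "__main__":
    # A 100 pixel wide RGBA8 row is 400 bytes of data, padded to 512.
    print(padded_stride(100, 4))
```

The cost of this kind of workaround is some wasted space per row in the staging buffer, which is why it would normally be enabled only for the configurations that need it.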

Flags: needinfo?(dmalyshau)

Here is a quick try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0277240cca85f47ab91d7493b8b4ffcc8d873ce4
Just need to wait for the artifacts to be built.

Flags: needinfo?(shtetldik)

I'll give it a try later today and will use it for a while. The hangs are not too common though, so it might take a while to see whether it improved. Also, AMD posted a few additional patches for amdgpu, so I'll be running with those as well.

Note that this issue is quite Navi specific and, as far as I know, it doesn't happen on older AMD cards like Vega. So if you are trying to reproduce it, you'd need a Navi card like the RX 5700 or RX 5700 XT.

I've been using it for a few hours, and so far no hangs. But that could also be due to recent amdgpu updates that I pulled to rebuild the kernel. I'll continue using this nightly for a while and see how it goes.

I spoke too soon. Just got a GPU hang when using that nightly. Here is what I can see in dmesg (accessing the computer over ssh):

[ 1152.450269] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] ERROR Waiting for fences timed out!
[ 1158.338571] [drm:amdgpu_job_timedout [amdgpu]] ERROR ring sdma1 timeout, signaled seq=8349, emitted seq=8351
[ 1158.338637] [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process GPU Process pid 5474 thread firefox-bi:cs0 pid 5556
[ 1158.338639] [drm] GPU recovery disabled.

Again an sdma timeout hang, like before.
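
For anyone comparing hangs across kernels or builds, the ring timeout line can be pulled apart with a small script. This is only a sketch: the pattern is written against the exact line format quoted above and may not match other kernel versions, and the function name `parse_ring_timeout` and the `outstanding` field are made up for illustration.

```python
import re

# Matches lines shaped like the amdgpu_job_timedout message quoted above.
RING_TIMEOUT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[drm:amdgpu_job_timedout \[amdgpu\]\] "
    r"ERROR ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line: str):
    """Return the fields of a ring timeout line, or None if it does not match."""
    m = RING_TIMEOUT.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    return {
        "timestamp": float(m.group("ts")),
        "ring": m.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        # Difference between the last emitted and last signaled sequence numbers.
        "outstanding": emitted - signaled,
    }


if __name__ == "__main__":
    sample = ("[ 1158.338571] [drm:amdgpu_job_timedout [amdgpu]] ERROR ring sdma1 "
              "timeout, signaled seq=8349, emitted seq=8351")
    print(parse_ring_timeout(sample))
```

Feeding it the output of dmesg line by line and collecting the non-None results gives a quick list of which rings timed out and when.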

Flags: needinfo?(shtetldik)