Open Bug 1878186 Opened 8 months ago Updated 3 hours ago

Radeon VAAPI: Crash in [@ mozalloc_abort | abort | amdgpu_ctx_set_sw_reset_status]

Categories: Core :: Graphics, defect
Platform: x86_64, Linux
Tracking Status: firefox124 --- disabled
People: Reporter: mccr8, Unassigned, NeedInfo
References: Blocks 1 open bug
Keywords: crash
Attachments: 1 file

Crash report: https://crash-stats.mozilla.org/report/index/8ed3292d-7783-4fae-a95c-57e310240128

MOZ_CRASH Reason: Redirecting call to abort() to mozalloc_abort

Top 10 frames of crashing thread:

0  firefox-bin  MOZ_Crash  mfbt/Assertions.h:301
0  firefox-bin  mozalloc_abort  memory/mozalloc/mozalloc_abort.cpp:35
1  firefox-bin  abort  memory/mozalloc/mozalloc_abort.cpp:88
2  libgallium_drv_video.so  amdgpu_ctx_set_sw_reset_status  /usr/src/debug/mesa/mesa-23.3.3/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c:462
3  libgallium_drv_video.so  amdgpu_cs_submit_ib  /usr/src/debug/mesa/mesa-23.3.3/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c:1785
4  libgallium_drv_video.so  util_queue_thread_func  /usr/src/debug/mesa/mesa-23.3.3/src/util/u_queue.c:309
5  libgallium_drv_video.so  impl_thrd_routine  /usr/src/debug/mesa/mesa-23.3.3/src/c11/impl/threads_posix.c:67
6  firefox-bin  set_alt_signal_stack_and_start  mozglue/interposers/pthread_create_interposer.cpp:81
7  libc.so.6  start_thread  /usr/src/debug/glibc/glibc/nptl/pthread_create.c:444
8  libc.so.6  __GI___clone  /usr/src/debug/glibc/glibc/sysdeps/unix/sysv/linux/x86_64/clone.S:100

The volume is low here, but it looks like we're hitting an abort inside some kind of video driver, in the RDD process, so I figured I'd file it in case it was interesting.

I am seeing crash reports from builds going all the way back to Firefox 120a1, but the crashes themselves are recent: they start in December and seem to correspond to Mesa 23.3.1 or later, with an uptick around Mesa 23.3.1 in mid-December and reports up to 23.3.3. Mesa 23.3.4 was released only a week or so ago, so it may take time before we see crash reports from it. It may also have been fixed in that version, but I see nothing in the release notes indicating that.
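
The Mesa version in these reports can be read off the debug-source paths in the driver frames. As a rough sketch of how that correlation could be checked (the helper names are hypothetical, and the path layout is an assumption based on the single stack above; other distributions may package Mesa sources differently):

```python
import re

# Assumed path layout, taken from the frames in this report, e.g.
# /usr/src/debug/mesa/mesa-23.3.3/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c:462
MESA_PATH_RE = re.compile(r"/mesa-(\d+)\.(\d+)\.(\d+)/")


def mesa_version(frame_path):
    """Return (major, minor, patch) if the path embeds a Mesa version, else None."""
    match = MESA_PATH_RE.search(frame_path)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def in_suspect_range(version, first_bad=(23, 3, 1)):
    """True when the version is at or after the first release seen crashing."""
    return version is not None and version >= first_bad


frame = "/usr/src/debug/mesa/mesa-23.3.3/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c:462"
version = mesa_version(frame)
print(version, in_suspect_range(version))
```

The 23.3.1 threshold is only the earliest version noted in this comment, not a confirmed regression point.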

All the crash reports seem to have in common a gfx critical error in the log: "GFX: RenderThread detected a device reset in PostUpdate".

I can't really see that it was something we changed per se.

Glenn or Andrew, does this seem like anything we've seen before on either the WR or media side with weird context loss failures in the Mesa amd driver?

Flags: needinfo?(gwatson)
Flags: needinfo?(aosmond)
Severity: -- → S3
OS: Unspecified → Linux

I don't think I've seen anything like this on the WR side before.

Flags: needinfo?(gwatson)
Hardware: Unspecified → x86_64
Summary: Crash in [@ mozalloc_abort | abort | amdgpu_ctx_set_sw_reset_status] → Radeon VAAPI: Crash in [@ mozalloc_abort | abort | amdgpu_ctx_set_sw_reset_status]

This happens pretty regularly for me now, at least once a day when watching YouTube videos. I have submitted many crash reports so far. Is there anything I can do to help?

My setup is Firefox within Flatpak (so it uses its own Mesa libs, not the system Mesa, which is 22.3.6), running on Debian 12. Hardware is Ryzen 7840/Radeon 780.

It happened again after updating to Mesa 23.3.4 (git-27405fd573) and the newest stable Firefox (122.0.1) within Flatpak. I also uploaded the crash report.

For Debian: can you please make sure that you've updated linux-firmware to upstream? Several GFX issues on Debian are actually root-caused to an older GPU firmware snapshot.

Hi Mario, thanks for the idea. I have updated my firmware with the upstream version:

[ 3.081154] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/psp_13_0_4_toc.bin
[ 3.081754] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/psp_13_0_4_ta.bin
[ 3.083180] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/dcn_3_1_4_dmcub.bin
[ 3.084664] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_pfp.bin
[ 3.086035] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_me.bin
[ 3.087368] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_rlc.bin
[ 3.088397] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_mec.bin
[ 3.090267] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/vcn_4_0_2.bin
[ 3.092629] amdgpu 0000:c3:00.0: firmware: failed to load amdgpu/gc_11_0_1_mes_2.bin (-2)
[ 3.093087] firmware_class: See https://wiki.debian.org/Firmware for information about missing firmware
[ 3.093565] amdgpu 0000:c3:00.0: firmware: failed to load amdgpu/gc_11_0_1_mes_2.bin (-2)
[ 3.094016] amdgpu 0000:c3:00.0: Direct firmware load for amdgpu/gc_11_0_1_mes_2.bin failed with error -2
[ 3.095019] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_mes.bin
[ 3.096463] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_mes1.bin
[ 3.099131] [drm] Loading DMUB firmware via PSP: version=0x08000500
[ 3.099205] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_imu.bin
[ 3.100072] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/sdma_6_0_1.bin
[ 3.100227] [drm] Found VCN firmware Version ENC: 1.17 DEC: 6 VEP: 0 Revision: 10
[ 3.100239] amdgpu 0000:c3:00.0: amdgpu: Will use PSP to load VCN firmware

I will report if that changes anything.

If it's showing messages about missing firmware, you haven't updated to the upstream version properly.
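
For anyone checking their own system, here is a minimal sketch that pulls the missing-firmware entries out of a saved kernel log. The function name is hypothetical, and the regular expression is an assumption derived only from the "failed to load" lines quoted in this bug; other kernels may word the message differently.

```python
import re

# Matches lines like:
#   amdgpu 0000:c3:00.0: firmware: failed to load amdgpu/gc_11_0_1_mes_2.bin (-2)
# The pattern is an assumption based on the log excerpt in this thread.
FAILED_RE = re.compile(r"firmware: failed to load (\S+) \((-?\d+)\)")


def missing_firmware(log_text):
    """Return the sorted, de-duplicated firmware files the kernel failed to load."""
    return sorted({m.group(1) for m in FAILED_RE.finditer(log_text)})


sample = """\
[ 3.090267] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/vcn_4_0_2.bin
[ 3.092629] amdgpu 0000:c3:00.0: firmware: failed to load amdgpu/gc_11_0_1_mes_2.bin (-2)
[ 3.093565] amdgpu 0000:c3:00.0: firmware: failed to load amdgpu/gc_11_0_1_mes_2.bin (-2)
"""
print(missing_firmware(sample))
```

An empty list after a reboot would suggest that every requested firmware file was found; the text passed in would be the saved output of `dmesg`.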

Ah, you're right, I forgot to run update-initramfs - now there's no more missing firmware.

[ 3.086428] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/psp_13_0_4_toc.bin
[ 3.087017] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/psp_13_0_4_ta.bin
[ 3.088316] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/dcn_3_1_4_dmcub.bin
[ 3.090020] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_pfp.bin
[ 3.091383] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_me.bin
[ 3.092735] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_rlc.bin
[ 3.093779] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_mec.bin
[ 3.095646] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/vcn_4_0_2.bin
[ 3.098084] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_mes_2.bin
[ 3.099375] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_mes1.bin
[ 3.102039] [drm] Loading DMUB firmware via PSP: version=0x08003300
[ 3.102112] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_1_imu.bin
[ 3.102798] amdgpu 0000:c3:00.0: firmware: direct-loading firmware amdgpu/sdma_6_0_1.bin
[ 3.102952] [drm] Found VCN firmware Version ENC: 1.19 DEC: 7 VEP: 0 Revision: 0
[ 3.102964] amdgpu 0000:c3:00.0: amdgpu: Will use PSP to load VCN firmware

Thanks for the hint! I will update here if another crash happens. Btw, I run the 6.5 kernel from debian-backports, if that's important to know.

Linux 6.5.0-0.deb12.4-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.5.10-1~bpo12+1 (2023-11-23) x86_64 GNU/Linux

So far, no more crashes. From my gut feeling, it would have crashed at least once already with the old firmware.

I will report if that changes.

Can you please file a bug with Debian to get this fixed? It's going to make people point fingers at Firefox otherwise.

Crash Signature: [@ mozalloc_abort | abort | amdgpu_ctx_set_sw_reset_status] → [@ mozalloc_abort | abort | amdgpu_ctx_set_sw_reset_status] [@ abort | amdgpu_ctx_set_sw_reset_status ]

This issue should be closed in Firefox; it's caused by Debian not providing updated GPU firmware.

Is this the same issue as (or related to) https://gitlab.freedesktop.org/mesa/mesa/-/issues/10851 ?

It shouldn't be. The old firmware issue is specifically a Debian problem and that's an Arch issue you linked.

Attached file mesacrash.txt

gdb captured firefox crash in mesa

Not sure if this is the same issue, but I had to disable WebGL completely because of frequent crashes.
Using this gdb config:

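# Keep both parent and child processes under gdb control after a fork
set detach-on-fork off
# Run asynchronously and stop only the thread that hits an event
set mi-async on
set non-stop on
set pagination off
# Pass these signals to the program without stopping or printing
handle SIGPIPE nostop noprint pass
handle SIGBUS nostop noprint pass
handle SIGSYS nostop noprint pass
# Keep an unlimited, de-duplicated command history across sessions
set history save
set history size unlimited
set history remove-duplicates unlimited
show history expansion
show commands +
# Print absolute source paths in frames
set filename-display absolute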

I was able to capture the exact call stack (attached).

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on beta
  • Top 5 RDD process crashes on beta
  • Top 5 desktop browser crashes on Linux on beta

:bhood, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(bhood)
Keywords: topcrash

Firefox 130.0b7 seems to solve the crashes, or at least reduce them to a warning: "g_object_get_is_valid_property: object class 'GdkX11DeviceCore' has no property named 'device-id'". It doesn't happen with the same frequency, but it still occurs in some cases, mostly when playing multiple videos at the same time.

(In reply to noreply from comment #17)

Firefox 130.0b7 seems to solve the crashes, or at least reduce them to a warning: "g_object_get_is_valid_property: object class 'GdkX11DeviceCore' has no property named 'device-id'". It doesn't happen with the same frequency, but it still occurs in some cases, mostly when playing multiple videos at the same time.

Just to clarify, the crashes are completely gone; only the warning I mentioned remains, maybe a couple of times a day.

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

(In reply to BugBot [:suhaib / :marco/ :calixte] from comment #19)

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.

It could be, if I start reporting everything again; I'm only testing when something has changed. I have a vague recollection that it has been like this for a while now. I will stay on stable for now.
