Closed Bug 1701863 Opened 4 years ago Closed 4 years ago

Assertion failure in compositor thread when using 16-bit deep X11 VNC display

Categories

(Core :: Graphics: WebRender, defect, P2)

Unspecified
Linux
defect

Tracking

()

RESOLVED FIXED
96 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- wontfix
firefox87 --- unaffected
firefox88 --- wontfix
firefox89 --- wontfix
firefox90 --- wontfix
firefox91 --- wontfix
firefox94 --- wontfix
firefox95 --- wontfix
firefox96 --- fixed

People

(Reporter: jseward, Assigned: rmader)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: crash, regression)

Attachments

(2 files, 1 obsolete file)

This is a regression in mozilla-unified relative to approximately a week ago.

STR: m-u 641416:ea0d3d4933a4, x86_64-linux. Run it on a 16-bit deep VNC X11
server. Load

https://www.solitaireparadise.com/games_list/pyramid_solitaire_ancient_egypt.html

This takes a few seconds to load. Before the game starts, Fx fails on a
MOZ_DIAGNOSTIC_ASSERT with the stack below. The underlying reason is the
attempt to unwrap a Nothing here:

dist/include/mozilla/webrender/WebRenderTypes.h:

 126      ImageDescriptor(const gfx::IntSize& aSize, uint32_t aByteStride,
 127                      gfx::SurfaceFormat aFormat,
 128                      bool aPreferCompositorSurface = false) {
>129        format = wr::SurfaceFormatToImageFormat(aFormat).value(); <---HERE
 130        width = aSize.width;
 131        height = aSize.height;
 132        stride = aByteStride;
 133        opacity = gfx::IsOpaque(aFormat) ? OpacityType::Opaque
 134                                         : OpacityType::HasAlphaChannel;
 135        prefer_compositor_surface = aPreferCompositorSurface;
 136      }

As an Fx user I don't care much about 16 bit VNC display support. But as a
developer, I use them a lot in order to test Fx's WebAssembly support on
remote targets that I don't have access to (AArch64-Linux boxes).

This doesn't happen on a 32-bit deep VNC display. I could switch to using
that, but I'd prefer not to, on the assumption (possibly wrong) that a 32-bit
deep display requires moving more data across the network (to the VNC viewer)
and more CPU load on the VNC server.

In any case, even if it has been decided that 16-bit deep X displays are no
longer supported, Fx shouldn't crash like this - it should fail gracefully,
somehow.

The failing stack is:

Thread 41 "Compositor" received signal SIGSEGV, Segmentation fault.
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:49
#1  0x00007fffed4395d3 in FatalSignalHandler() () at toolkit/profile/nsProfileLock.cpp:174
#2  0x00007fffed8606b2 in UnixExceptionHandler() ()
    at js/src/ds/MemoryProtectionExceptionHandler.cpp:272
#3  0x00007fffede5f9d7 in WasmTrapHandler() () at js/src/wasm/WasmSignalHandlers.cpp:981
#4  0x00007ffff7f851e0 in <signal handler called> () at /lib64/libpthread.so.0
#5  value () at clang-Og-nondebug-systemalloc/dist/include/mozilla/Maybe.h:752
#6  ImageDescriptor ()
    at clang-Og-nondebug-systemalloc/dist/include/mozilla/webrender/WebRenderTypes.h:129
#7  PushResourceUpdates() () at gfx/layers/composite/TextureHost.cpp:719
#8  0x00007fffea10fdc1 in PushResourceUpdates() () at gfx/layers/wr/WebRenderTextureHost.cpp:205
#9  0x00007fffea170e28 in UpdateImageKeys() () at gfx/layers/wr/AsyncImagePipelineManager.cpp:284
#10 0x00007fffea1711d4 in ApplyAsyncImageForPipeline() ()
    at gfx/layers/wr/AsyncImagePipelineManager.cpp:385
#11 0x00007fffea171b39 in ApplyAsyncImageForPipeline() ()
    at gfx/layers/wr/AsyncImagePipelineManager.cpp:500
#12 0x00007fffea190e1b in ProcessWebRenderParentCommands() ()
    at gfx/layers/wr/WebRenderBridgeParent.cpp:1471
#13 0x00007fffea190b97 in ProcessDisplayListData() () at gfx/layers/wr/WebRenderBridgeParent.cpp:1170
#14 0x00007fffea191470 in RecvSetDisplayList() () at gfx/layers/wr/WebRenderBridgeParent.cpp:1224
#15 0x00007fffe9d2040d in OnMessageReceived() () at PWebRenderBridgeParent.cpp:403
#16 0x00007fffe9a81497 in OnMessageReceived() () at PCompositorManagerParent.cpp:200
#17 0x00007fffe99b4934 in DispatchAsyncMessage() () at ipc/glue/MessageChannel.cpp:2154
#18 0x00007fffe99b3663 in DispatchMessage() () at ipc/glue/MessageChannel.cpp:2078
#19 0x00007fffe99b3eef in RunMessage() () at ipc/glue/MessageChannel.cpp:1926
#20 0x00007fffe99b4324 in Run() () at ipc/glue/MessageChannel.cpp:1957
#21 0x00007fffe92f3103 in ProcessNextEvent() () at xpcom/threads/nsThread.cpp:1149
#22 0x00007fffe92f6828 in NS_ProcessNextEvent() () at xpcom/threads/nsThreadUtils.cpp:548
#23 0x00007fffe99b7595 in Run() () at ipc/glue/MessagePump.cpp:332
#24 0x00007fffe993e278 in RunInternal () at ipc/chromium/src/base/message_loop.cc:335
#25 RunHandler () at ipc/chromium/src/base/message_loop.cc:328
#26 Run() () at ipc/chromium/src/base/message_loop.cc:310
#27 0x00007fffe92f09da in ThreadFunc() () at xpcom/threads/nsThread.cpp:391
#28 0x00007ffff72486ea in _pt_root () at nsprpub/pr/src/pthreads/ptthread.c:201
#29 0x00007ffff7f7a3f9 in start_thread (arg=0x7fffb8d73640) at pthread_create.c:463
#30 0x00007ffff7b57b53 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

bug 1699864 comment 19 defined 32 bit color depth for Linux/EGL a week ago.

Are you using EGL (MOZ_X11_EGL=1 env var or gfx.x11-egl.force-enabled=true) or does this occur with GLX?

Can you check if the assumed regression range is correct? https://hg.mozilla.org/mozilla-central/shortlog/ad4e73130915
$ pip3 install -U mozregression

Is this first bad revision?
$ mozregression --repo autoland --launch ad4e73130915d4b9d354c7b3d45654afc7973ead --pref gfx.webrender.all:true -a https://www.solitaireparadise.com/games_list/pyramid_solitaire_ancient_egypt.html

Is this last good revision?
$ mozregression --repo autoland --launch 47e8f1174eba00f4ba21f10b83d1552d33eef4aa --pref gfx.webrender.all:true -a https://www.solitaireparadise.com/games_list/pyramid_solitaire_ancient_egypt.html

Has Regression Range: --- → yes

Setting NeedInfo flag to Julian for the question in comment #2.

Flags: needinfo?(jseward)

(In reply to Darkspirit from comment #2)

Are you using EGL (MOZ_X11_EGL=1 env var

no

or gfx.x11-egl.force-enabled=true)

no

or does this occur with GLX?

I don't know how to find that out. What does it mean?

$ mozregression --repo autoland --launch ad4e73130915d4b9d354c7b3d45654afc7973ead --pref gfx.webrender.all:true -a https://www.solitaireparadise.com/games_list/pyramid_solitaire_ancient_egypt.html

This fails (segfaults) on a 16-bit deep VNC (X11) display. It works ok on a 32-bit deep display.

$ mozregression --repo autoland --launch 47e8f1174eba00f4ba21f10b83d1552d33eef4aa --pref gfx.webrender.all:true -a https://www.solitaireparadise.com/games_list/pyramid_solitaire_ancient_egypt.html

This behaves the same as the ad4e snapshot.

I notice, when running my own from-source m-c build on both the 16- and 32-bit deep
displays, that I get the following warnings at startup. Maybe they are relevant?

Crash Annotation GraphicsCriticalError: |[0][GFX1-]: More than 1 GPU detected via PCI, cannot deduce vendor (t=0.117266) [GFX1-]: More than 1 GPU detected via PCI, cannot deduce vendor
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: More than 1 GPU detected via PCI, cannot deduce vendor (t=0.117266) |[1][GFX1-]: PCI candidate 0x8086/0x191d (t=0.117329) [GFX1-]: PCI candidate 0x8086/0x191d
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: More than 1 GPU detected via PCI, cannot deduce vendor (t=0.117266) |[1][GFX1-]: PCI candidate 0x8086/0x191d (t=0.117329) |[2][GFX1-]: PCI candidate 0x10de/0x13b0 (t=0.117339) [GFX1-]: PCI candidate 0x10de/0x13b0
Flags: needinfo?(jseward)
Flags: needinfo?(robert.mader)

Err, not 100% sure why I'm needinfod, but Julian, can you provide us with your about:support so we know what exactly we're dealing with?

Flags: needinfo?(robert.mader) → needinfo?(jseward)

Robert, the needinfo was because you're the assignee on bug 1699864 which apparently caused this. :)

I can include my about:support, sure, but since the browser MOZ_DIAGNOSTIC_ASSERTs
on the 16-bit-deep display at startup, I can't get you the about:support info on that
display. I can get it when running on my native Linux desktop, but then the "Graphics"
info all pertains to the Intel and NVidia GPUs in the machine. Is that of any use to you?

Flags: needinfo?(jseward)

Hm, would it be possible to change the VNC to 32 to get a about:support? IIUC we're failing in Webrender, which apparently doesn't even bother to support 16bbp yet - could you also try to run with MOZ_WEBRENDER=0, so we use the "Basic" compositor?

The function may return Nothing() - handle that case more gracefully.

This hardcoded value was introduced in D109737 to work around an
issue for X11/EGL + WR. In order to not regress other cases where
we traditionally use the EGL provider (X11 + Basic + WebGL?), only
set it when requesting HW-WR.

Julian: can check the following try builds with the patches above?

https://treeherder.mozilla.org/jobs?repo=try&revision=0a8d7d7a4075ed7548d5fb0bb223250b8a9cd5d9 only has the first patch and hopefully makes us fail more gracefully.

https://treeherder.mozilla.org/jobs?repo=try&revision=734e224678d2543e52a4529a0fc74a2e924c0913 also has the second patch, which hopefully makes things work again as expected.

P.S. once the builds have finished you find the direct download links by clicking on the B for the Linux x64 opt build, selecting Artifacts and then select target.tar.bz2.

Flags: needinfo?(jseward)

Robert, thanks for looking at this. I tried m-c 579533:e2e39e32a588 with
D114958 only and with both D114958 and D114959 applied. In both cases,
unfortunately it is not fixed. Instead, the failure moves to the
assertion/stack shown in the attached log.

Flags: needinfo?(jseward)

I see. Can you try with MOZ_WEBRENDER=0 to force the basic backend?

Flags: needinfo?(jseward)

Ah, with MOZ_WEBRENDER=0 (and the two patches), it no longer crashes.

Is there a corresponding pref? I tried earlier with gfx.webrender.software = true,
but that made no difference.

Flags: needinfo?(jseward)
Severity: -- → S2
Priority: -- → P2
Attachment #9221555 - Attachment is obsolete: true
Assignee: nobody → robert.mader
Status: NEW → ASSIGNED

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:rmader, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(robert.mader)
Flags: needinfo?(bwerth)
Flags: needinfo?(bwerth)

Landing the r+ patch - in may not be enough to make WR work on 16bit platforms, but at least fixes the assertion here, pushing possible errors further to some place where they may be more meaningful.

Flags: needinfo?(robert.mader)
Pushed by robert.mader@posteo.de: https://hg.mozilla.org/integration/autoland/rev/78ded94b19fb Provide default value when using wr::SurfaceFormatToImageFormat r=gfx-reviewers,bradwerth
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 96 Branch

Is this something we should consider backporting to ESR? The patch grafts cleanly, but I'm not sure how common this bug is likely to be in the wild.

Flags: needinfo?(robert.mader)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #19)

Is this something we should consider backporting to ESR? The patch grafts cleanly, but I'm not sure how common this bug is likely to be in the wild.

I'm actually not sure if this really fixes any bug - it likely only passes the error down further the line. So while it shouldn't hurt to take it, I don't think it's really worth it.

Flags: needinfo?(robert.mader)

Did this end up helping with anything? Silently returning 0 seems a lot scarier than aborting right away.

Flags: needinfo?(robert.mader)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #21)

Did this end up helping with anything? Silently returning 0 seems a lot scarier than aborting right away.

Hm, most likely not - it should have turned one assertion into a different one. The rather unreadable trace from comment 0 into e.g. Assertion failure: false (Unhandled external image format), at gfx/webrender_bindings/RenderTextureHostSWGL.cpp:66.

Flags: needinfo?(robert.mader)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: