Closed Bug 1741956 Opened 2 years ago Closed 2 years ago

Parent-process crash in [@ XGetImage]

Categories

(Core :: Graphics: WebRender, defect)

Firefox 94
All
Linux
defect

Tracking

()

RESOLVED FIXED
96 Branch
Tracking Status
thunderbird_esr91 --- unaffected
firefox-esr91 --- unaffected
firefox94 --- wontfix
firefox95 + fixed
firefox96 --- fixed

People

(Reporter: dholbert, Assigned: lsalzman)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file, 2 obsolete files)

Crash report: https://crash-stats.mozilla.org/report/index/4d76a195-4fe5-489f-b28c-7bc8f0211015

Reason: SIGTRAP

Top 10 frames of crashing thread:

0 libglib-2.0.so.0 g_log_writer_default 
1 libglib-2.0.so.0 g_log_structured_array 
2 libglib-2.0.so.0 g_log_structured_standard glib/gmessages.c:2021
3 libgdk-3.so.0 _gdk_x11_display_error_event gdk/x11/gdkdisplay-x11.c:2766
4 libgdk-3.so.0 gdk_x_error gdk/x11/gdkmain-x11.c:260
5 libX11.so.6 _XError src/XlibInt.c:1503
6 libX11.so.6 handle_error src/xcb_io.c:207
7 libX11.so.6 _XReply src/xcb_io.c:793
8 libX11.so.6 XGetImage 
9 libxul.so _get_image_surface gfx/cairo/cairo/src/cairo-xlib-surface.c:813

This crash signature seems to have increased a bit over the last month. In the few reports I've looked at, we're coming from WebRender painting code, so I'm tentatively filing this as a WebRender bug.

Seems to not be fission-dependent, since some of these crash reports don't show fission being enabled in their "Crash Annotations" tab (e.g. bp-763856cf-dedd-46c3-b21f-47a780211115)

I think (?) this is an intentional crash that's being handled safely -- at least, the innermost functions here seem like an "abort due to an error condition being detected" sort of thing, via handle_error, XError, etc. down into g_log_writer_default which is presumably printing some sort of error message and then aborting.

Also notable: this is a parent-process crash, so the whole browser is being taken down, unfortunately.

Summary: Crash in [@ XGetImage] → Parent-process crash in [@ XGetImage]

Glenn, want to be sure this is on your radar since it seems to be a WebRender thing with increasing crash frequency (per graph in the Crash Signature section), which seems to take down the whole browser.

Flags: needinfo?(gwatson)

It looks like this might be coming from SWGL, but I don't really know how SWGL interacts with cairo. Lee, any ideas what could be going on here?

Flags: needinfo?(gwatson) → needinfo?(lsalzman)

I was noticing that I was getting this crash locally starting with 94 release (even in HW-WR), so I wonder if this has something to do with popups getting SW-WR. The X error that is ultimately causing Firefox to crash is a BadMatch error that is spouted by a GDK error handler.

But if you look at the crash stack from one of my crashes: https://crash-stats.mozilla.org/report/index/0c3ded18-ed34-40ee-ac56-0b0fe0211128
You see that on the Renderer thread we're calling into XGetImage from inside Cairo, which surrounds the call with XSetError to a no-op error handler, that should be ignoring the error. So then why is the GDK X error handler spouting the error? If you then click "Show other threads" and look at thread 0, you can see the problem: it is doing its own Xlib call which uses the gdk_x11_display_error_trap mechanism, which is its own wrapper around XSetError.

Cairo and GDK are thus racing around using their own internal usages of XSetError, and we end up with GDK overwriting Cairo's handler while we're inside XGetImage, and then we go boom...

Ideas about how to deal with this problem are welcome. At a minimum we might have to do something sad like sending RenderCompositor::EndFrame commits to the main thread, but I would really prefer something less horrible if at all possible.

Flags: needinfo?(lsalzman)
OS: Unspecified → Linux
Hardware: Unspecified → All
Blocks: wr-linux

A devious idea - I wonder if just pushing a global GDK error trap at the beginning when we bring up the process would ultimately cause these races to cancel out? Most of these explicit XSetError error traps (like in Cairo), we don't actually care if they run, so long as X11 errors are supressed, because they never actually analyze the error generated, or if there even was an error other than the function returning null. The global GDK error trap would cause the previous handler at the time of any XSetError calls outside GDK to always be GDK's. So even if there would end up being a race in XSetError between Cairo and GDK, the race would always end up installing GDK's handler (because Cairo would try to reinstate GDK's handler), which would suppress the error since GDK wants to store it for the global error trap. Net result - no crashiness, a couple lines of code...

In truth, we have a couple other places in the code that call XSetError (like gfx/gl), not just from Cairo, so this would be a catch-all solution to help all these random cases. I don't think we want X11 errors going unhandled and taking down the browser anyway, so a global GDK error trap would make sense...

Assignee: nobody → lsalzman
Status: NEW → ASSIGNED

In light of the fact that GDK already installs a global error handler on library initialization, all we really need to do is ensure there is a GDK error trap in place in case a race causes the GDK error handler to be taken instead of Cairo's. The error trap then prevents GDK from aborting, and all is well again.

https://crash-stats.mozilla.org/signature/?product=Firefox&version=94.0a1&version=95.0a1&version=96.0a1&version=93.0a1&signature=XGetImage&date=%3E%3D2021-05-28T17%3A55%3A00.000Z&date=%3C2021-11-28T17%3A55%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=build_id&_sort=-date&page=1#reports

Only looking at Nightly crashes, Nightly 20210930215026 seems to be the first affected one.

Commits before that:
mozregression --good 2021-09-28 --bad 20210930215026

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=db6864017ca9142bcca71a4e34f288bc6e00baf8&tochange=45cb924ad91fe330dadd11046cac31a0f7ace4d0

Ctrl+F stransky

c40dba1932cb24ddf59d6287f4c829d8b4b1f79d Robert Mader — Bug 1732443 - Remove remaining XRender leftovers,r=emilio,stransky

  • This touched cairo (e.g. removed GetXRenderFormat, cairo_xlib_surface_get_xrender_format).

0a41f3d6a0da3b837cc8520fd02b06d24144ab86 stransky — Bug 1733057 [Linux] Return early when XWindow handle is missing at GLContextProviderGLX::CreateForWidget(), r=rmader
40fb20cfc4368eb089b01c4eec3d0446d7b42d02 stransky — Bug 1729656 [Wayland] Don't use gfxPlatform to configure DMABuf, r=rmader
ec6135a35df95ec92cd106dd74593f4ef2de9808 stransky — Bug 1731125 [Linux/EGL] Render popups by SW-WR on X11/EGL due to missing transparency, r=rmader

77ee40b13a1199d55fd5ab11db6d22d4177f5564 Robert Mader — Bug 1732436 - Use Xrandr to detect display modes on X11, r=stransky
474308da68345ef6733b3b098dccb0f8b67e2994 Robert Mader — Bug 1732436 - Link Xrandr directly instead of using dlopen, r=stransky,glandium

ef22d8cbf4efa1165559b79d294c9948eed87cf1 stransky — Bug 1724242 [Linux] Use DBus remote when Firefox is built with --enable-dbus, r=glandium

I would expect bug 1731125 is what exposed this. But the bug is more the fact that we have the race in the first place, rather than a regression.

Okay, the plot thickens a bit. This occurs when we draw shape-masked popups like the hamburger menu, which are BGRA surfaces. WindowSurfaceX11Image::Commit uses DrawSurface(OP_OVER) to do the final blit from a gfxImageSurface to the window wrapped by a gfxXlibSurface. Cairo notices the OP_OVER and does an undesirable sequence of fallbacks - it does an XGetImage on the window to get the image data, blends to it, and then XPutImage's the result back to the window. So the XGetImage becomes bad for two reasons - 1) it can cause an X error, 2) it is causing a readback from the window server that we really don't want for performance reasons.

So a potentially less horrible solution to this whole mess is to just avoid that fallback to XGetImage in the first place by switching the OP_OVER to OP_SOURCE...

When we use OP_OVER for the final blit in WindowSurfaceX11Image::Commit,
this causes Cairo to hit a compositing fallback that uses XGetImage on
the window to get the pixel data, blends to it, then XPutImage's the
result back. We want to avoid this both because XGetImage may cause
errors and it is slow to read back from the window server.

If we use CopySurface/OP_SOURCE instead we avoid the readback via
XGetImage and so bypass both problems.

Version: unspecified → Firefox 94
Regressed by: 1731125
Has Regression Range: --- → yes
Attachment #9252747 - Attachment is obsolete: true
Crash Signature: [@ XGetImage] → [@ XGetImage] [@ X11Error]
Crash Signature: [@ XGetImage] [@ X11Error] → [@ XGetImage] [@ X11Error] [@ handle_response | XSync]

Depends on D132319

See Also: → 1743386
Pushed by lsalzman@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/456ac0463f5c
Avoid OP_OVER on Cairo Xlib window surface. r=jrmuizel

IIUC this may fix bug 1678804. The glitch there may well be the result of us reading back garbage.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 96 Branch
Blocks: 1735959

It looks like the race with libxcb and libX11 is unfortunately so intractable that my attempt to inject XLockDisplay into it doesn't really help much. Though the XGetImage signature remains fixable via the patch I've already committed. Going to abandon the second XLockDisplay patch and have a bug on file for the race condition even though there isn't much we can do about it as of yet.

Attachment #9252847 - Attachment is obsolete: true
See Also: → 1743551

Comment on attachment 9252783 [details]
Bug 1741956 - Avoid OP_OVER on Cairo Xlib window surface. r?jrmuizel

Beta/Release Uplift Approval Request

  • User impact if declined: On Linux/X11 systems, any attempt to open a menu (context, hamburger, etc.) or other popup can cause crashes. Locally, since I open context menus a lot sometimes in the course of daily work, I notice that I can crash up to a few times daily.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Avoids a fallback to the XGetImage rendering path that it should have never been using in the first place, as verified in local testing. Otherwise doesn't significantly change rendering semantics.
  • String changes made/needed:
Attachment #9252783 - Flags: approval-mozilla-beta?

Comment on attachment 9252783 [details]
Bug 1741956 - Avoid OP_OVER on Cairo Xlib window surface. r?jrmuizel

We have already merged mozilla-beta into mozilla-release and built our Release Candidate for shipping next week. Moving the uplift request for the Release channel. Potentially a ride along for a build 2 or a dot release if we have one.

Attachment #9252783 - Flags: approval-mozilla-beta? → approval-mozilla-release?

The patch landed in nightly and beta is affected.
:lsalzman, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(lsalzman)
Flags: needinfo?(lsalzman)
Blocks: 1744060

Comment on attachment 9252783 [details]
Bug 1741956 - Avoid OP_OVER on Cairo Xlib window surface. r?jrmuizel

Approved for a 95 dot release, thanks.

Attachment #9252783 - Flags: approval-mozilla-release? → approval-mozilla-release+
You need to log in before you can comment on or make changes to this bug.