Open Bug 1890074 Opened 1 year ago Updated 1 month ago

Crash in [@ HandleGLibMessage] with broken pipe

Categories

(Core :: Widget: Gtk, defect, P1)

defect

Tracking

()

ASSIGNED
Tracking Status
firefox-esr128 --- affected
firefox125 --- wontfix
firefox126 --- wontfix
firefox127 --- wontfix
firefox128 --- wontfix
firefox129 --- wontfix
firefox130 --- wontfix
firefox131 --- wontfix
firefox132 --- affected
firefox133 --- affected

People

(Reporter: mccr8, Assigned: stransky)

References

(Blocks 1 open bug)

Details

(Keywords: crash, leave-open, topcrash)

Crash Data

Attachments

(6 files, 1 obsolete file)

Crash report: https://crash-stats.mozilla.org/report/index/526646de-7201-4b24-b0b5-e15dc0240405

MOZ_CRASH Reason: Error reading events from display: Broken pipe

Top 10 frames of crashing thread:

0  libxul.so  MOZ_Crash  mfbt/Assertions.h:317
0  libxul.so  HandleGLibMessage  toolkit/xre/nsSigHandlers.cpp:178
1  libxul.so  glib_log_writer_func  toolkit/xre/nsSigHandlers.cpp:205
2  libglib-2.0.so.0  g_log_structured_array  glib/gmessages.c:1994
2  libglib-2.0.so.0  g_log_structured_array  glib/gmessages.c:1967
3  libglib-2.0.so.0  g_log_structured_standard  glib/gmessages.c:2051
4  libgdk-3.so.0  gdk_event_source_check  gdk/wayland/gdkeventsource.c:96
5  libglib-2.0.so.0  g_main_context_check  glib/gmain.c:4072
6  libglib-2.0.so.0  g_main_context_iterate  glib/gmain.c:4245
7  libglib-2.0.so.0  g_main_context_iteration  glib/gmain.c:4313

Pretty high volume of these on Nightly.

Summary: Crash in [@ HandleGLibMessage] → Crash in [@ HandleGLibMessage] with broken pipe

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly (startup)

For more information, please visit BugBot documentation.

Blocks: wayland
Priority: -- → P3
See Also: → 1876882

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

This is still a top crash in Nightly 128.

The volume of this crash signature spiked on Nightly over the last 2 months.

Indeed, the nightly crash rate has increased more than 4x, starting around late March.

Odd, in the 2 months prior to Feb 20 there were 15 user comments, but since then not a single nightly crash report has a user comment, even though the crash rate is much higher

The following crashes mention scrolling or clicking on mouse

AFAICT none of the crash comments for channels other than nightly (which are numerous) mention scrolling.

Flags: needinfo?(stransky)

(In reply to Wayne Mery (:wsmwk) from comment #5)

Odd, in the 2 months prior to Feb 20 there were 15 user comments, but since then not a single nightly crash report has a user comment, even though the crash rate is much higher

I had a crash that took down sway which also resulted in a firefox crash report tagged as @HandleGLibMessage. Since there's no chance to fill out the crash report dialog in such cases an increase in compositor crashes might explain that.

Looks like there is no crash report generated

MOZ_LOG=cubeb:5 MOZ_LOG_FILE=firefox.log flatpak run org.mozilla.firefox
libva info: VA-API version 1.19.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/radeonsi_drv_video.so
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/intel-vaapi-driver/radeonsi_drv_video.so
libva info: Trying to open /usr/lib/x86_64-linux-gnu/GL/lib/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_1_19
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
libva info: va_openDriver() returns 0
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
[Child 523, MediaDecoderStateMachine #1] WARNING: 7f1cecde8820 OpenCubeb() failed to init cubeb: file /builds/worker/checkouts/gecko/dom/media/AudioStream.cpp:285

sorry wrong ticket

Will look at it.

Flags: needinfo?(stransky)
Flags: needinfo?(stransky)
See Also: → 1917270
Duplicate of this bug: 1917270
Flags: needinfo?(stransky)
Priority: P3 → P2

We're getting crash from main process only (although we use wayland-proxy for main process only). I checked on try that wayland proxy is used by default.

There are various crashes produced by Gtk3, like:

Error reading events from display: Broken pipe
Error flushing display: Broken pipe
Error 32 (Broken pipe) dispatching to Wayland display.
Lost connection to Wayland compositor.

but these errors are issues behind the proxy. It means the proxy itself disconnected Gtk3 from Wayland compositor. We need to add more logs/diagnostics to find out what happens here.

We also should publish more data about actual Wayland compositor. It's possible that most of the crashes comes from unstable ones like Sway/Hyprland.

Assignee: nobody → stransky
Flags: needinfo?(stransky)

You also might want to gather pressure stall information (/proc/pressure/*), which can indicate resource starvation/load spikes which could lead to a compositor disconnect if the wayland client can't keep up with the server.

Investigation more the reports. Looks like I don't have access to private report data but I can access it via Graphs. Some of them have 'Shutdown reason' set to AppClose which looks like we crash on Firefox quit (if read it correctly).

Depends on: 1922116

https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188/diffs?commit_id=f48e02c3cab9581ea76398eb9d37256216f39b6d may be relates here. It claims the buffer is unbounded now but from my testing it's enough to stop even processing for while (block wayland proxy for instance), focus frozen Firefox window and we're disconnected.

Flags: needinfo?(stransky)

Did more testing here. Firefox with mutter-46 can be easily crashes with 'broken pipe' crash.

Mutter-47 adds wl_display_set_default_max_buffer_size(..., 1024 * 1024) call which extends default compositor buffer size from 4K to 1M and that makes Firefox more stable - I can't reproduce the 'broken pipe' crash any more.

But that's quite unfortunate as application itself can't control server buffer size and has to rely on default. Let's hope that other compositors will be updated with larger buffers soon.

Wayland client side provides only wl_display_set_max_buffer_size() which adjusts wayland client buffer size and that's not very useful for us as we usually suffer from server buffer overflow.

As we still may hit such issues we may consider an update to proxy cache to read events from compositor more aggressively to make sure we get all events from compositor ASAP.

Let's also reconsider the real-time priority for compositor reading event. Should not be an issue if we spend most of the time in poll():
https://bugzilla.mozilla.org/show_bug.cgi?id=1743144#c96

  sched_param param;
  if (pthread_attr_getschedparam(&attr, &param) == 0) {
    param.sched_priority = sched_get_priority_max(SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &param);
  }

btw. Looks like KDE/Kwin already adds wl_display_set_default_max_buffer_size() too (https://bugs.kde.org/show_bug.cgi?id=392376)

With wayland protocol 1.23 we can incerease buffer for wayland events on client side as a counterpart of wl_display_set_default_max_buffer_size() on server side.
Let's use the same values as mutter uses, i.e. 1M buffer size for events.

Use SCHED_FIFO for wayland proxy thread instead of SCHED_RR. It means the proxy will not be interupted until
all events are processed and we'll wait in poll(). It helps to get all events from compositor in time
and don't be disconnected as unresponsible application.

Depends on D224739

Let's see if the two patches here helps to lower the crash ratio on Nightly at least.

Flags: needinfo?(stransky)

Note that some of these crashes are also compositor crashes (e.g. I've seen this when KWin crashed here for example). But sure the changes looks good.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #23)

Note that some of these crashes are also compositor crashes (e.g. I've seen this when KWin crashed here for example). But sure the changes looks good.

That's interesting. Let's see how the patches behave in nightly and if there's any change in crash ratio.

Pushed by stransky@redhat.com: https://hg.mozilla.org/integration/autoland/rev/1c8b487d8fc1 [Wayland] Set client wayland buffer size r=emilio https://hg.mozilla.org/integration/autoland/rev/6a5f881dcf7f [Wayland] Use SCHED_FIFO for wayland proxy thread r=emilio
Depends on: 1923086

(In reply to Emilio Cobos Álvarez (:emilio) from comment #23)

Note that some of these crashes are also compositor crashes (e.g. I've seen this when KWin crashed here for example). But sure the changes looks good.

I think we can detect wayland compositor crash by wayland-proxy, will look at it (Bug 1923086). We can at least filter out such event and provide better crash message.

Status: NEW → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
Target Milestone: --- → 133 Branch

Since nightly and release are affected, beta will likely be affected too.
For more information, please visit BugBot documentation.

Let's keep this open to track the crashes. The patches here may not fix the issue but I hope to get less crash ratio. Bug 1923086 may filter our compositor crashes as we use different crash handler there.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 133 Branch → ---
Status: REOPENED → ASSIGNED
Flags: needinfo?(stransky)

The crash ratio doesn't look affected the patches here. Let's see if Bug 1923086 makes any difference and filters out compositor crashes.

Flags: needinfo?(stransky)
Keywords: leave-open
Flags: needinfo?(stransky)
Pushed by stransky@redhat.com: https://hg.mozilla.org/integration/autoland/rev/f17ad05e0870 [Wayland] Print compositor name in protocol handler error r=emilio

Looks like we're still getting crashes without compositor signature here. That means the crash comes directly from gdk_event_source_check() as
https://github.com/GNOME/gtk/blob/1d2fe52e96685464b4bd11b7ba597b434ce60ca7/gdk/wayland/gdkeventsource.c#L116

which means proxy cache is missing. According to the backtraces the proxy cache is off or already terminated by previous error / closed connection or so.

We may add more info about proxy cache to crash report (off / terminated) and extend error messages generated by HandleGLibMessage().

I'll try to add compositor name to HandleGLibMessage() and Wayland proxy state to App Notes.

Attachment #9430717 - Attachment is obsolete: true
Pushed by stransky@redhat.com: https://hg.mozilla.org/integration/autoland/rev/3173bfa616ca [Wayland] Implement wayland proxy state string r=emilio https://hg.mozilla.org/integration/autoland/rev/4b9daccbd5c9 [Linux] Print more info by Wayland error handlers r=emilio
Pushed by stransky@redhat.com: https://hg.mozilla.org/integration/autoland/rev/6d68f30cda4f [Linux] Add missing <fstream> to fix GCC build r=emilio

From the early logs it looks like we're getting compositor crashes mostly from KDE (5x KDE, 1x Gnome) but compositor disconnection from both. I think Fedora 41 / Gnome 47 will improve the situation a bit as it comes with an extended buffer on server side.

OTOH we're getting 'disconnection' logs from Gnome only so far (4x).

(gnome) Error 32 (Broken pipe) dispatching to Wayland display. Proxy: WP:E WP:RT WP:CA WP:CR WP:CPCA WP:CPCF 

That means Firefox operates as expected and cache is running but we were disconnected by compositor for an unknown reason (CPCF - compositor closed the connection to proxy). Looks like Gnome is more picky about application response time while KDE compositor crashes likely.

See Also: → 1925684

We may be covered here at least from long term perspective. Now we know that HandleGLibMessage crash is caused by heavy system load under Gnome due to small Wayland message buffer.

The heavy load issue should be addressed by Mutter 47 where wayland message buffer is greatly extended (from 4k to 1M) by explicit wl_display_set_default_max_buffer_size() call.

Flags: needinfo?(stransky)

We see lot of crashes from Arch. However Arch already packaged mutter-47.0 (https://archlinux.org/packages/extra/x86_64/mutter/) so it may hit users soon as it's rolling release distro.

There are Firefox Wayland crashes on Fedora 41 / Gnome which contains extended Wayland buffer and it's supposed to be fixed. So the crash reason must be something different or there's another reason for it.

A possible issue may be https://gitlab.gnome.org/GNOME/gtk/-/merge_requests/7859 - blocked Wayland event read if multiple threads are reading (poll) from Wayland fd connection. That happens if HW rendering is used with Mesa/Wayland backend - Mesa has its own event loop and waits for GL front buffer release. We call that code from Rendering thread as eglSwapBuffers().

AFAIK we don't use HW rendering on our testsuite so we may not see such crashes there.

I just reproduced the crash locally, Fedora 41 / Gnome:

  1. Load system with something (I used Firefox compilation on background)
  2. Run nested Wayland compositor - mutter --nested
  3. Run Firefox testsuite (I used WAYLAND_DISPLAY=wayland-1 ./mach mochitest --setpref gfx.webrender.software=true dom) where is dedicated nested compositor from 2) where the test is actually running so it doesn't interfere with recent desktop session.
Flags: needinfo?(stransky)

If it's related to https://gitlab.gnome.org/GNOME/gtk/-/merge_requests/7859 we'd need a custom gtk3 build for testing.

See Also: → 1863047

Hit repeatedly on Firefox start while debugging something else. I expected to get Wayland protocol error but I got this one (compositor disconnected bug). Only info I've got from journal is:

Nov 28 09:14:59 fedora-laptop gnome-shell[2305]: WL: error in client communication (pid 148075)
Nov 28 09:14:59 fedora-laptop gnome-shell[2305]: (../src/wayland/meta-wayland-buffer.c:1013):meta_wayland_buffer_finalize: runtime check failed: (buffer->use_count == 0)
Nov 28 09:15:19 fedora-laptop gnome-shell[2305]: meta_wayland_buffer_process_damage: assertion 'buffer->resource' failed

I hit another incarnation of the bug. Crashes on GlibHandle and journal contains:
Couldn't map window 0x7ff97f923160 as subsurface because its parent is not mapped.

testcase:

  1. Create desktop with two monitors (one with scale 300%, one 200%), place side by side
  2. Open Firefox on 200% screen, go to https://developer.mozilla.org/en-US/docs/Web/API/Geolocation_API/Using_the_Geolocation_API#examples
  3. Open geolocation popup, keep it on top
  4. Flip Firefox state between Tiled/Normal state, tiled mode needs to be on side where 300% scaled monitor is located.
  5. Repeat until crash

This testcase doesn't need system load.

I wonder if this sequence is the problem:

[2428839.234] {mesa egl surface queue}  -> wl_surface#66.attach(wl_buffer#75, 0, 0)
[2428839.240] {mesa egl surface queue}  -> wl_surface#66.damage_buffer(0, 0, 1024, 512)
[2428839.245] {mesa egl surface queue}  -> wl_surface#66.commit()
[2428839.249] {mesa egl surface queue}  -> wl_display#1.sync(new id wl_callback#70)
[2428839.451] {mesa egl surface queue}  -> wl_buffer#75.destroy()

MESA destroys wl_buffer#75 before compositor releases it. That may match mutter message from journalctl:

Nov 29 08:25:06 fedora-laptop gnome-shell[2307]: meta_wayland_buffer_process_damage: assertion 'buffer->resource' failed

Seems to be related to EGL as I can't reproduce it with gfx.webrender.software = true.

Jonas Adahl pointed out that the bug here is related to actual Firefox bug which causes wayland protocol error but it's not routed to client from server. Local debugging shows that mutter sends:

wl_display#1.error(wl_surface#66, 2, "Buffer size (2337x1767) must be an integer multiple of the buffer_scale (2).")

but the message is not received on Firefox side (with both proxy enabled/disabled) and we're just disconnected by Mutter. Looks like the messages are not flushed to client so we're getting a generic 'client was disconnected' error instead of the real one.

Jonas also suggested to use viewports as workaround for the fixed scale settings (which causes the issues) here so I'll investigate this direction.

Depends on: 1934217

Bug 1934217 adds more time to process error messages on Firefox side. Let's see what bugs it reveals.

Duplicate of this bug: 1934247
Flags: needinfo?(stransky)

When Bug 1934217 lands I expect HandleGLibMessage crash transfer to mozilla::widget::WlLogHandler crash (Bug 1932639) where we get the actual Wayland error.

Duplicate of this bug: 1932415
Flags: needinfo?(stransky)
See Also: → 1932639

Another crash hidden in HandleGLibMessage / broken pipe is D&D one (Bug 1941119) which produces the

[GFX1-]: (gnome) Wayland protocol error: unknown object (4278190080), message error(ous)

bug which is recently a topcrash on Wayland AFAIK.

Priority: P2 → P1
See Also: → 1941119

(In reply to Martin Stránský [:stransky] (ni? me) from comment #63)

Another crash hidden in HandleGLibMessage / broken pipe is D&D one (Bug 1941119) which produces the

[GFX1-]: (gnome) Wayland protocol error: unknown object (4278190080), message error(ous)

bug which is recently a topcrash on Wayland AFAIK.

Wayland protocol error: unknown object should be fixed by Bug 1949726.

See Also: 19411191949726
Duplicate of this bug: 1954822
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: