Crash in [@ HandleGLibMessage] with broken pipe
Categories
(Core :: Widget: Gtk, defect, P1)
Tracking
()
People
(Reporter: mccr8, Assigned: stransky)
References
(Blocks 1 open bug)
Details
(Keywords: crash, leave-open, topcrash)
Crash Data
Attachments
(6 files, 1 obsolete file)
Crash report: https://crash-stats.mozilla.org/report/index/526646de-7201-4b24-b0b5-e15dc0240405
MOZ_CRASH Reason: Error reading events from display: Broken pipe
Top 10 frames of crashing thread:
0 libxul.so MOZ_Crash mfbt/Assertions.h:317
0 libxul.so HandleGLibMessage toolkit/xre/nsSigHandlers.cpp:178
1 libxul.so glib_log_writer_func toolkit/xre/nsSigHandlers.cpp:205
2 libglib-2.0.so.0 g_log_structured_array glib/gmessages.c:1994
2 libglib-2.0.so.0 g_log_structured_array glib/gmessages.c:1967
3 libglib-2.0.so.0 g_log_structured_standard glib/gmessages.c:2051
4 libgdk-3.so.0 gdk_event_source_check gdk/wayland/gdkeventsource.c:96
5 libglib-2.0.so.0 g_main_context_check glib/gmain.c:4072
6 libglib-2.0.so.0 g_main_context_iterate glib/gmain.c:4245
7 libglib-2.0.so.0 g_main_context_iteration glib/gmain.c:4313
Pretty high volume of these on Nightly.
Reporter | ||
Updated•1 year ago
|
Comment 1•1 year ago
|
||
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 10 desktop browser crashes on nightly (startup)
For more information, please visit BugBot documentation.
Updated•1 year ago
|
Comment 2•1 year ago
|
||
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Updated•1 year ago
|
Comment 3•1 year ago
|
||
This is still a top crash in Nightly 128.
Updated•11 months ago
|
Comment 4•11 months ago
|
||
The volume of this crash signature spiked on Nightly over the last 2 months.
Comment 5•11 months ago
|
||
Indeed, the nightly crash rate has increased more than 4x, starting around late March.
Odd, in the 2 months prior to Feb 20 there were 15 user comments, but since then not a single nightly crash report has a user comment, even though the crash rate is much higher
The following crashes mention scrolling or clicking on mouse
- bp-ade617e6-cf49-4348-9909-ae2490231219 (Thunderbird)
- bp-38acfaff-48ea-4d0f-b55d-6069f0231219
- bp-25fb92e1-cfb5-4708-837d-492ed0231226
- bp-3290cc66-8b16-4579-b6e2-f0cbe0231230
- bp-d8843a75-2a06-4c34-b855-428280240108
AFAICT none of the crash comments for channels other than nightly (which are numerous) mention scrolling.
Assignee | ||
Updated•10 months ago
|
Updated•10 months ago
|
(In reply to Wayne Mery (:wsmwk) from comment #5)
Odd, in the 2 months prior to Feb 20 there were 15 user comments, but since then not a single nightly crash report has a user comment, even though the crash rate is much higher
I had a crash that took down sway, which also resulted in a Firefox crash report tagged as @HandleGLibMessage. Since there's no chance to fill out the crash report dialog in such cases, an increase in compositor crashes might explain that.
Updated•9 months ago
|
Updated•8 months ago
|
Looks like there is no crash report generated
MOZ_LOG=cubeb:5 MOZ_LOG_FILE=firefox.log flatpak run org.mozilla.firefox
libva info: VA-API version 1.19.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/radeonsi_drv_video.so
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/intel-vaapi-driver/radeonsi_drv_video.so
libva info: Trying to open /usr/lib/x86_64-linux-gnu/GL/lib/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_1_19
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
libva info: va_openDriver() returns 0
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
Failed to create /var/home/kay/.var/app/org.mozilla.firefox/cache for shader cache (Permission denied)---disabling.
[Child 523, MediaDecoderStateMachine #1] WARNING: 7f1cecde8820 OpenCubeb() failed to init cubeb: file /builds/worker/checkouts/gecko/dom/media/AudioStream.cpp:285
Assignee | ||
Updated•7 months ago
|
Assignee | ||
Updated•7 months ago
|
Assignee | ||
Comment 12•7 months ago
|
||
We're getting crashes from the main process only (although we use wayland-proxy for the main process only). I checked on try that the Wayland proxy is used by default.
There are various crashes produced by Gtk3, like:
Error reading events from display: Broken pipe
Error flushing display: Broken pipe
Error 32 (Broken pipe) dispatching to Wayland display.
Lost connection to Wayland compositor.
but these errors are issues behind the proxy. It means the proxy itself disconnected Gtk3 from the Wayland compositor. We need to add more logs/diagnostics to find out what happens here.
We should also publish more data about the actual Wayland compositor. It's possible that most of the crashes come from unstable ones like Sway/Hyprland.
Assignee | ||
Updated•7 months ago
|
Assignee | ||
Updated•7 months ago
|
Comment 13•7 months ago
|
||
You also might want to gather pressure stall information (/proc/pressure/*), which can indicate resource starvation/load spikes which could lead to a compositor disconnect if the wayland client can't keep up with the server.
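A minimal sketch of how such pressure data could be collected, assuming the standard Linux PSI line format (`some avg10=... avg60=... avg300=... total=...`). The names `PressureLine`, `ParsePressureLine` and `ReadPressureSummary` are hypothetical and are not part of Firefox.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// One line of /proc/pressure/{cpu,memory,io}, for example:
//   some avg10=0.12 avg60=0.05 avg300=0.01 total=123456
struct PressureLine {
  std::string kind;  // "some" or "full"
  double avg10 = 0.0;
  double avg60 = 0.0;
  double avg300 = 0.0;
  unsigned long long totalUs = 0;  // total stall time in microseconds
};

// Parses a single PSI line. Returns false if the line has another shape.
bool ParsePressureLine(const std::string& line, PressureLine* out) {
  char kind[8] = {0};
  PressureLine parsed;
  int fields = std::sscanf(line.c_str(),
                           "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                           kind, &parsed.avg10, &parsed.avg60, &parsed.avg300,
                           &parsed.totalUs);
  if (fields != 5) {
    return false;
  }
  parsed.kind = kind;
  *out = parsed;
  return true;
}

// Hypothetical helper: builds a short one-line summary for `resource`
// ("cpu", "memory" or "io"). Returns an empty string if PSI is unavailable.
std::string ReadPressureSummary(const std::string& resource) {
  std::ifstream file("/proc/pressure/" + resource);
  std::string line;
  std::string summary;
  PressureLine p;
  while (std::getline(file, line)) {
    if (ParsePressureLine(line, &p)) {
      char buf[64];
      std::snprintf(buf, sizeof(buf), "%s:%s.avg10=%.2f ", resource.c_str(),
                    p.kind.c_str(), p.avg10);
      summary += buf;
    }
  }
  return summary;
}
```

A summary like this could be attached to a crash report next to the compositor name, so reports taken under heavy load can be told apart from the rest.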
Assignee | ||
Comment 14•7 months ago
•
|
||
Investigating the reports further. Looks like I don't have access to private report data, but I can access it via Graphs. Some of the reports have 'Shutdown reason' set to AppClose, which looks like we crash on Firefox quit (if I read it correctly).
Assignee | ||
Updated•7 months ago
|
Assignee | ||
Comment 15•7 months ago
|
||
https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188/diffs?commit_id=f48e02c3cab9581ea76398eb9d37256216f39b6d may be related here. It claims the buffer is unbounded now, but from my testing it's enough to stop event processing for a while (block the Wayland proxy, for instance) and focus the frozen Firefox window, and we're disconnected.
Assignee | ||
Updated•7 months ago
|
Assignee | ||
Comment 16•7 months ago
|
||
Did more testing here. Firefox with mutter-46 can easily be crashed with the 'broken pipe' crash.
Mutter-47 adds a wl_display_set_default_max_buffer_size(..., 1024 * 1024) call which extends the default compositor buffer size from 4K to 1M, and that makes Firefox more stable - I can't reproduce the 'broken pipe' crash any more.
But that's quite unfortunate, as the application itself can't control the server buffer size and has to rely on the default. Let's hope that other compositors will be updated with larger buffers soon.
The Wayland client side provides only wl_display_set_max_buffer_size(), which adjusts the Wayland client buffer size, and that's not very useful for us as we usually suffer from server buffer overflow.
Assignee | ||
Comment 17•7 months ago
|
||
As we still may hit such issues we may consider an update to proxy cache to read events from compositor more aggressively to make sure we get all events from compositor ASAP.
Assignee | ||
Comment 18•7 months ago
|
||
Let's also reconsider the real-time priority for the thread reading compositor events. It should not be an issue if we spend most of the time in poll():
https://bugzilla.mozilla.org/show_bug.cgi?id=1743144#c96
sched_param param;
if (pthread_attr_getschedparam(&attr, &param) == 0) {
  param.sched_priority = sched_get_priority_max(SCHED_FIFO);
  pthread_attr_setschedparam(&attr, &param);
}
Assignee | ||
Comment 19•7 months ago
|
||
btw. Looks like KDE/Kwin already adds wl_display_set_default_max_buffer_size() too (https://bugs.kde.org/show_bug.cgi?id=392376)
Assignee | ||
Comment 20•7 months ago
|
||
With wayland protocol 1.23 we can increase the buffer for wayland events on the client side, as a counterpart of wl_display_set_default_max_buffer_size() on the server side.
Let's use the same values as mutter uses, i.e. 1M buffer size for events.
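Because wl_display_set_max_buffer_size() only exists in newer libwayland-client builds, one way to use it without a hard link-time dependency is a runtime lookup. The sketch below is an assumption about how that could be done, not the landed patch. The signature is written as I understand it from the libwayland 1.23 headers, and `LookupSetMaxBufferSize` and `SetClientMaxBufferSize` are hypothetical names.

```cpp
#include <dlfcn.h>

#include <cstddef>

struct wl_display;  // opaque libwayland type, only used through pointers

// Assumed signature of wl_display_set_max_buffer_size() (libwayland >= 1.23).
using SetMaxBufferSizeFn = void (*)(wl_display*, std::size_t);

// Looks the symbol up at runtime. Returns nullptr when libwayland-client is
// not installed or is too old to export the function.
SetMaxBufferSizeFn LookupSetMaxBufferSize() {
  static void* lib = dlopen("libwayland-client.so.0", RTLD_LAZY | RTLD_LOCAL);
  if (!lib) {
    return nullptr;
  }
  return reinterpret_cast<SetMaxBufferSizeFn>(
      dlsym(lib, "wl_display_set_max_buffer_size"));
}

// Hypothetical wrapper: raises the client-side buffer limit if possible.
// Returns true only when the libwayland call was actually made.
bool SetClientMaxBufferSize(wl_display* display, std::size_t bytes) {
  if (!display) {
    return false;
  }
  SetMaxBufferSizeFn fn = LookupSetMaxBufferSize();
  if (!fn) {
    return false;
  }
  fn(display, bytes);
  return true;
}
```

With a connected display this would be called as `SetClientMaxBufferSize(display, 1024 * 1024)` to match the 1M value mentioned above. On older libwayland versions the call does nothing and returns false.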
Assignee | ||
Comment 21•7 months ago
|
||
Use SCHED_FIFO for the wayland proxy thread instead of SCHED_RR. It means the proxy will not be interrupted until
all events are processed and we wait in poll(). It helps to get all events from the compositor in time
and avoid being disconnected as an unresponsive application.
Depends on D224739
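A minimal sketch of starting a thread with SCHED_FIFO and falling back to default scheduling when the real-time policy is refused (which is typical without CAP_SYS_NICE or a suitable RLIMIT_RTPRIO). `StartThreadPreferFifo` is a hypothetical helper for illustration and is not the code from the patch.

```cpp
#include <pthread.h>
#include <sched.h>

// Starts `fn(arg)` on a new thread, asking for SCHED_FIFO at the highest
// priority. If that fails, the thread is started with default attributes.
// Returns the pthread_create() result (0 on success). `*usedFifo` tells the
// caller whether the real-time policy was actually applied.
int StartThreadPreferFifo(pthread_t* thread, void* (*fn)(void*), void* arg,
                          bool* usedFifo) {
  *usedFifo = false;

  pthread_attr_t attr;
  if (pthread_attr_init(&attr) == 0) {
    sched_param param{};
    param.sched_priority = sched_get_priority_max(SCHED_FIFO);

    // PTHREAD_EXPLICIT_SCHED is needed, otherwise the new thread inherits
    // the creator's policy and the attributes below are ignored.
    // The policy is set before the priority so the priority is validated
    // against SCHED_FIFO.
    bool attrOk =
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED) == 0 &&
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO) == 0 &&
        pthread_attr_setschedparam(&attr, &param) == 0;

    int rv = attrOk ? pthread_create(thread, &attr, fn, arg) : 1;
    pthread_attr_destroy(&attr);
    if (rv == 0) {
      *usedFifo = true;
      return 0;
    }
  }

  // Fallback: default scheduling, so the thread still runs.
  return pthread_create(thread, nullptr, fn, arg);
}
```

The fallback matters because a thread that cannot get real-time priority should still run. Recording `usedFifo` in diagnostics would show whether the priority boost was effective on the systems that crash.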
Assignee | ||
Comment 22•7 months ago
|
||
Let's see if the two patches here help to lower the crash ratio on Nightly at least.
Comment 23•7 months ago
|
||
Note that some of these crashes are also compositor crashes (e.g. I've seen this when KWin crashed here for example). But sure the changes looks good.
Assignee | ||
Comment 24•7 months ago
|
||
(In reply to Emilio Cobos Álvarez (:emilio) from comment #23)
Note that some of these crashes are also compositor crashes (e.g. I've seen this when KWin crashed here for example). But sure the changes looks good.
That's interesting. Let's see how the patches behave in nightly and if there's any change in crash ratio.
Comment 25•7 months ago
|
||
Assignee | ||
Comment 26•7 months ago
|
||
(In reply to Emilio Cobos Álvarez (:emilio) from comment #23)
Note that some of these crashes are also compositor crashes (e.g. I've seen this when KWin crashed here for example). But sure the changes looks good.
I think we can detect wayland compositor crash by wayland-proxy, will look at it (Bug 1923086). We can at least filter out such event and provide better crash message.
Comment 27•7 months ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/1c8b487d8fc1
https://hg.mozilla.org/mozilla-central/rev/6a5f881dcf7f
Comment 28•7 months ago
|
||
Since nightly and release are affected, beta will likely be affected too.
For more information, please visit BugBot documentation.
Updated•7 months ago
|
Assignee | ||
Comment 29•7 months ago
|
||
Let's keep this open to track the crashes. The patches here may not fix the issue, but I hope to get a lower crash ratio. Bug 1923086 may filter out compositor crashes, as we use a different crash handler there.
Assignee | ||
Updated•7 months ago
|
Assignee | ||
Updated•7 months ago
|
Assignee | ||
Comment 30•7 months ago
|
||
The crash ratio doesn't look affected by the patches here. Let's see if Bug 1923086 makes any difference and filters out compositor crashes.
Assignee | ||
Comment 31•7 months ago
|
||
Assignee | ||
Updated•7 months ago
|
Assignee | ||
Updated•7 months ago
|
Comment 32•7 months ago
|
||
Comment 33•7 months ago
|
||
bugherder |
Assignee | ||
Comment 34•7 months ago
|
||
Looks like we're still getting crashes without compositor signature here. That means the crash comes directly from gdk_event_source_check() as
https://github.com/GNOME/gtk/blob/1d2fe52e96685464b4bd11b7ba597b434ce60ca7/gdk/wayland/gdkeventsource.c#L116
which means proxy cache is missing. According to the backtraces the proxy cache is off or already terminated by previous error / closed connection or so.
We may add more info about proxy cache to crash report (off / terminated) and extend error messages generated by HandleGLibMessage().
Assignee | ||
Comment 35•7 months ago
|
||
I'll try to add compositor name to HandleGLibMessage() and Wayland proxy state to App Notes.
Assignee | ||
Comment 36•7 months ago
|
||
Updated•7 months ago
|
Assignee | ||
Comment 37•7 months ago
|
||
Assignee | ||
Comment 38•7 months ago
|
||
Depends on D225484
Comment 39•7 months ago
|
||
Comment 40•7 months ago
|
||
bugherder |
Assignee | ||
Comment 41•7 months ago
|
||
Comment 42•7 months ago
|
||
Comment 43•7 months ago
|
||
bugherder |
Assignee | ||
Comment 44•6 months ago
|
||
From the early logs it looks like we're getting compositor crashes mostly from KDE (5x KDE, 1x Gnome) but compositor disconnection from both. I think Fedora 41 / Gnome 47 will improve the situation a bit as it comes with an extended buffer on server side.
Assignee | ||
Comment 45•6 months ago
•
|
||
OTOH we're getting 'disconnection' logs from Gnome only so far (4x).
(gnome) Error 32 (Broken pipe) dispatching to Wayland display. Proxy: WP:E WP:RT WP:CA WP:CR WP:CPCA WP:CPCF
That means Firefox operates as expected and the cache is running, but we were disconnected by the compositor for an unknown reason (CPCF - compositor closed the connection to proxy). Looks like Gnome is more picky about application response time, while the KDE compositor is more likely to crash.
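For triaging many reports, the proxy state flags can be pulled out of the message mechanically. The sketch below only assumes that flags appear as whitespace-separated `WP:<name>` tokens, as in the log line above. `ExtractProxyFlags` and `CompositorClosedConnection` are hypothetical names, and only the meaning of CPCF is taken from this bug.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Returns the flag names from all "WP:<name>" tokens in a crash message,
// e.g. "... Proxy: WP:E WP:RT" gives {"E", "RT"}.
std::vector<std::string> ExtractProxyFlags(const std::string& message) {
  std::vector<std::string> flags;
  std::istringstream in(message);
  std::string token;
  while (in >> token) {
    if (token.rfind("WP:", 0) == 0) {  // token starts with "WP:"
      flags.push_back(token.substr(3));
    }
  }
  return flags;
}

// True when the message carries the CPCF flag, which this bug describes as
// "compositor closed the connection to proxy".
bool CompositorClosedConnection(const std::string& message) {
  for (const std::string& flag : ExtractProxyFlags(message)) {
    if (flag == "CPCF") {
      return true;
    }
  }
  return false;
}
```

Messages without any `WP:` token yield an empty list, which would correspond to reports where no proxy state was recorded.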
Assignee | ||
Comment 46•6 months ago
|
||
We may be covered here at least from long term perspective. Now we know that HandleGLibMessage crash is caused by heavy system load under Gnome due to small Wayland message buffer.
The heavy load issue should be addressed by Mutter 47 where wayland message buffer is greatly extended (from 4k to 1M) by explicit wl_display_set_default_max_buffer_size() call.
Assignee | ||
Comment 47•6 months ago
|
||
We see a lot of crashes from Arch. However, Arch has already packaged mutter-47.0 (https://archlinux.org/packages/extra/x86_64/mutter/), so it may reach users soon as it's a rolling release distro.
Assignee | ||
Comment 48•6 months ago
|
||
There are Firefox Wayland crashes on Fedora 41 / Gnome which contains extended Wayland buffer and it's supposed to be fixed. So the crash reason must be something different or there's another reason for it.
A possible issue may be https://gitlab.gnome.org/GNOME/gtk/-/merge_requests/7859 - blocked Wayland event read if multiple threads are reading (poll) from Wayland fd connection. That happens if HW rendering is used with Mesa/Wayland backend - Mesa has its own event loop and waits for GL front buffer release. We call that code from Rendering thread as eglSwapBuffers().
Assignee | ||
Updated•6 months ago
|
Assignee | ||
Comment 49•6 months ago
|
||
AFAIK we don't use HW rendering on our testsuite so we may not see such crashes there.
Assignee | ||
Comment 50•6 months ago
|
||
I just reproduced the crash locally on Fedora 41 / Gnome:
- Load the system with something (I used a Firefox compilation in the background)
- Run a nested Wayland compositor - mutter --nested
- Run the Firefox testsuite inside the nested compositor started in the previous step (I used WAYLAND_DISPLAY=wayland-1 ./mach mochitest --setpref gfx.webrender.software=true dom), so the test doesn't interfere with the current desktop session.
Assignee | ||
Updated•6 months ago
|
Assignee | ||
Comment 51•6 months ago
|
||
If it's related to https://gitlab.gnome.org/GNOME/gtk/-/merge_requests/7859 we'd need a custom gtk3 build for testing.
Comment 52•5 months ago
|
||
Assignee | ||
Comment 53•5 months ago
|
||
Hit repeatedly on Firefox start while debugging something else. I expected to get Wayland protocol error but I got this one (compositor disconnected bug). Only info I've got from journal is:
Nov 28 09:14:59 fedora-laptop gnome-shell[2305]: WL: error in client communication (pid 148075)
Nov 28 09:14:59 fedora-laptop gnome-shell[2305]: (../src/wayland/meta-wayland-buffer.c:1013):meta_wayland_buffer_finalize: runtime check failed: (buffer->use_count == 0)
Nov 28 09:15:19 fedora-laptop gnome-shell[2305]: meta_wayland_buffer_process_damage: assertion 'buffer->resource' failed
Assignee | ||
Comment 54•5 months ago
•
|
||
I hit another incarnation of the bug. It crashes in HandleGLibMessage and the journal contains:
Couldn't map window 0x7ff97f923160 as subsurface because its parent is not mapped.
testcase:
- Create desktop with two monitors (one with scale 300%, one 200%), place side by side
- Open Firefox on 200% screen, go to https://developer.mozilla.org/en-US/docs/Web/API/Geolocation_API/Using_the_Geolocation_API#examples
- Open geolocation popup, keep it on top
- Flip Firefox state between Tiled/Normal state, tiled mode needs to be on side where 300% scaled monitor is located.
- Repeat until crash
This testcase doesn't need system load.
Assignee | ||
Comment 55•5 months ago
|
||
Assignee | ||
Comment 56•5 months ago
|
||
I wonder if this sequence is the problem:
[2428839.234] {mesa egl surface queue} -> wl_surface#66.attach(wl_buffer#75, 0, 0)
[2428839.240] {mesa egl surface queue} -> wl_surface#66.damage_buffer(0, 0, 1024, 512)
[2428839.245] {mesa egl surface queue} -> wl_surface#66.commit()
[2428839.249] {mesa egl surface queue} -> wl_display#1.sync(new id wl_callback#70)
[2428839.451] {mesa egl surface queue} -> wl_buffer#75.destroy()
MESA destroys wl_buffer#75 before compositor releases it. That may match mutter message from journalctl:
Nov 29 08:25:06 fedora-laptop gnome-shell[2307]: meta_wayland_buffer_process_damage: assertion 'buffer->resource' failed
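To check whether this pattern shows up in other WAYLAND_DEBUG logs, a small scanner could flag buffers that are destroyed after an attach without a release in between. This is a simplified sketch: it assumes request lines look like the ones above and that the compositor's release event is logged as `wl_buffer#N.release()`. It also treats attach alone as "in use" and ignores commit. `FindBuffersDestroyedBeforeRelease` is a hypothetical name.

```cpp
#include <regex>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Scans a WAYLAND_DEBUG style log and returns the ids of wl_buffer objects
// that were destroyed while still attached and not yet released.
std::vector<int> FindBuffersDestroyedBeforeRelease(const std::string& log) {
  static const std::regex attachRe(R"(\.attach\(wl_buffer#(\d+))");
  static const std::regex releaseRe(R"(wl_buffer#(\d+)\.release\()");
  static const std::regex destroyRe(R"(wl_buffer#(\d+)\.destroy\()");

  std::set<int> inUse;           // attached, no release seen yet
  std::vector<int> suspicious;   // destroyed while in `inUse`
  std::istringstream in(log);
  std::string line;
  std::smatch m;

  while (std::getline(in, line)) {
    if (std::regex_search(line, m, attachRe)) {
      inUse.insert(std::stoi(m[1].str()));
    } else if (std::regex_search(line, m, releaseRe)) {
      inUse.erase(std::stoi(m[1].str()));
    } else if (std::regex_search(line, m, destroyRe)) {
      int id = std::stoi(m[1].str());
      if (inUse.erase(id) > 0) {
        suspicious.push_back(id);
      }
    }
  }
  return suspicious;
}
```

Run on the five log lines quoted above, the scanner reports buffer 75. A hit only marks a candidate for inspection, since the simplifications mean it is not proof of a protocol violation.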
Assignee | ||
Comment 57•5 months ago
|
||
Seems to be related to EGL as I can't reproduce it with gfx.webrender.software = true.
Assignee | ||
Comment 58•5 months ago
•
|
||
Jonas Adahl pointed out that the bug here is related to an actual Firefox bug which causes a Wayland protocol error, but the error is not routed to the client from the server. Local debugging shows that mutter sends:
wl_display#1.error(wl_surface#66, 2, "Buffer size (2337x1767) must be an integer multiple of the buffer_scale (2).")
but the message is not received on Firefox side (with both proxy enabled/disabled) and we're just disconnected by Mutter. Looks like the messages are not flushed to client so we're getting a generic 'client was disconnected' error instead of the real one.
Jonas also suggested to use viewports as workaround for the fixed scale settings (which causes the issues) here so I'll investigate this direction.
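The constraint behind that protocol error can be checked on the client before a buffer is attached. The sketch below illustrates the rule quoted in the error message (both buffer dimensions must be an integer multiple of buffer_scale). `IsBufferSizeValidForScale` and `RoundDownToScale` are hypothetical helpers, not Firefox functions.

```cpp
// Returns true when a buffer of width x height may be used with the given
// integer buffer_scale, i.e. both dimensions are multiples of the scale.
bool IsBufferSizeValidForScale(int width, int height, int scale) {
  if (scale < 1 || width <= 0 || height <= 0) {
    return false;
  }
  return width % scale == 0 && height % scale == 0;
}

// Rounds a dimension down to the nearest multiple of the scale. This is one
// possible way to pick a size that satisfies the rule above.
int RoundDownToScale(int size, int scale) {
  return size - size % scale;
}
```

For the values from the error message, `IsBufferSizeValidForScale(2337, 1767, 2)` is false because both dimensions are odd, while 2336x1766 would pass. Using viewports, as suggested above, avoids the constraint rather than rounding sizes.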
Assignee | ||
Comment 59•5 months ago
|
||
Bug 1934217 adds more time to process error messages on Firefox side. Let's see what bugs it reveals.
Assignee | ||
Updated•5 months ago
|
Assignee | ||
Comment 61•5 months ago
|
||
When Bug 1934217 lands, I expect the HandleGLibMessage crashes to move to the mozilla::widget::WlLogHandler signature (Bug 1932639), where we get the actual Wayland error.
Assignee | ||
Comment 63•3 months ago
|
||
Another crash hidden in HandleGLibMessage / broken pipe is D&D one (Bug 1941119) which produces the
[GFX1-]: (gnome) Wayland protocol error: unknown object (4278190080), message error(ous)
bug which is recently a topcrash on Wayland AFAIK.
Assignee | ||
Comment 65•2 months ago
|
||
(In reply to Martin Stránský [:stransky] (ni? me) from comment #63)
Another crash hidden in HandleGLibMessage / broken pipe is D&D one (Bug 1941119) which produces the
[GFX1-]: (gnome) Wayland protocol error: unknown object (4278190080), message error(ous)
bug which is recently a topcrash on Wayland AFAIK.
Wayland protocol error: unknown object should be fixed by Bug 1949726.
Updated•2 months ago
|