Closed Bug 1860153 Opened 7 months ago Closed 6 months ago

Crash in [@ HandleGLibMessage]

Categories

(Core :: Widget: Gtk, defect)

Firefox 120
Unspecified
Linux
defect

RESOLVED DUPLICATE of bug 1743144

People

(Reporter: matt.fagnani, Unassigned)

Details

(Keywords: topcrash, topcrash-startup, wayland, Whiteboard: [tbird crash])

Crash Data

I ran Firefox 120.0a1 20231019042043 on Wayland with WebRender compositing in Plasma 5.27.8 on a Fedora 39 KDE Plasma installation. I logged into my Instagram account and played various videos in stories and posts. Firefox froze while a video was playing in a story, then crashed about 10 seconds later. The Crash Reporter appeared, and I submitted the crash report. The crash reason was "Lost connection to Wayland compositor." I've occasionally seen Firefox crash with "Lost connection to Wayland compositor." in the journal, but the Crash Reporter didn't appear at those times. This is the first crash I've seen with the signature [@ HandleGLibMessage].

Crash report: https://crash-stats.mozilla.org/report/index/7adb8996-ddb8-44c4-a3a2-78ba80231019

MOZ_CRASH Reason: Lost connection to Wayland compositor.

Top 10 frames of crashing thread:

0  libxul.so  MOZ_Crash  mfbt/Assertions.h:281
0  libxul.so  HandleGLibMessage  toolkit/xre/nsSigHandlers.cpp:178
1  libxul.so  glib_log_writer_func  toolkit/xre/nsSigHandlers.cpp:205
2  libglib-2.0.so.0  <name omitted>  /usr/src/debug/glib2-2.78.0-3.fc39.x86_64/glib/gmessages.c:1984
2  libglib-2.0.so.0  g_log_structured_array  /usr/src/debug/glib2-2.78.0-3.fc39.x86_64/glib/gmessages.c:1957
3  libglib-2.0.so.0  g_log_structured_standard  /usr/src/debug/glib2-2.78.0-3.fc39.x86_64/glib/gmessages.c:2041
4  libgdk-3.so.0  _gdk_wayland_display_queue_events  /usr/src/debug/gtk3-3.24.38-3.fc39.x86_64/gdk/wayland/gdkeventsource.c:210
5  libgdk-3.so.0  gdk_display_get_event  /usr/src/debug/gtk3-3.24.38-3.fc39.x86_64/gdk/gdkdisplay.c:442
6  libgdk-3.so.0  gdk_event_source_dispatch  /usr/src/debug/gtk3-3.24.38-3.fc39.x86_64/gdk/wayland/gdkeventsource.c:120
7  libglib-2.0.so.0  g_main_dispatch  /usr/src/debug/glib2-2.78.0-3.fc39.x86_64/glib/gmain.c:3476

The journal at the time of the crash had the following messages. The first two involving kde.dataengine.mpris not getting a dbus reply and failing to find a working MPRIS2 interface for "org.mpris.MediaPlayer2.firefox.instance26890" might be involved with Firefox freezing when the video was playing.

Oct 19 14:52:02 plasmashell[25092]: kde.dataengine.mpris: "org.mpris.MediaPlayer2.firefox.instance26890" does not implement org.freedesktop.DBus.Properties correctly Error message was "org.freedesktop.DBus.Error.NoReply" "Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken."
Oct 19 14:52:02 plasmashell[25092]: kde.dataengine.mpris: Failed to find working MPRIS2 interface for "org.mpris.MediaPlayer2.firefox.instance26890"
Oct 19 14:52:05 kwin_wayland_wrapper[24915]: error in client communication (pid 26890)
Oct 19 14:52:05 plasmashell[26890]: ExceptionHandler::GenerateDump cloned child
Oct 19 14:52:05 plasmashell[28478]: ExceptionHandler::WaitForContinueSignal waiting for continue signal...
Oct 19 14:52:05 plasmashell[26890]: 28478
Oct 19 14:52:05 plasmashell[26890]: ExceptionHandler::SendContinueSignalToChild sent continue signal to child
Oct 19 14:52:05 plasmashell[28376]: Exiting due to channel error.
Oct 19 14:52:05 plasmashell[27837]: Exiting due to channel error.
Oct 19 14:52:05 plasmashell[27795]: Exiting due to channel error.
Oct 19 14:52:05 plasmashell[27065]: Exiting due to channel error.
Oct 19 14:52:05 plasmashell[28168]: Exiting due to channel error.
Oct 19 14:52:05 plasmashell[27004]: Exiting due to channel error.
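
The journal excerpt above names pid 26890 in three places: the MPRIS instance name, the kwin_wayland_wrapper error, and the ExceptionHandler lines. A minimal Python sketch of how such lines could be correlated follows. The regular expressions and helper names are illustrative assumptions, not part of any Firefox or KDE tool, and the sample is a shortened copy of four lines from the excerpt.

```python
import re

# Four lines copied from the journal excerpt in this report.
JOURNAL = [
    'Oct 19 14:52:02 plasmashell[25092]: kde.dataengine.mpris: Failed to find '
    'working MPRIS2 interface for "org.mpris.MediaPlayer2.firefox.instance26890"',
    'Oct 19 14:52:05 kwin_wayland_wrapper[24915]: error in client communication (pid 26890)',
    'Oct 19 14:52:05 plasmashell[26890]: ExceptionHandler::GenerateDump cloned child',
    'Oct 19 14:52:05 plasmashell[28376]: Exiting due to channel error.',
]

# Assumed layout: "<Mon> <day> <HH:MM:SS> <tag>[<pid>]: <message>".
LINE_RE = re.compile(r'^(\w{3} +\d+ [\d:]{8}) ([\w.-]+)\[(\d+)\]: (.*)$')
CLIENT_PID_RE = re.compile(r'error in client communication \(pid (\d+)\)')


def parse(line):
    """Split one journal line into fields, or return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp, tag, pid, message = match.groups()
    return {"time": stamp, "tag": tag, "pid": int(pid), "message": message}


def disconnected_client_pids(lines):
    """Collect the pids that the compositor reported a communication error for."""
    pids = set()
    for line in lines:
        entry = parse(line)
        if entry is None:
            continue
        match = CLIENT_PID_RE.search(entry["message"])
        if match:
            pids.add(int(match.group(1)))
    return pids


def entries_for_pid(lines, pid):
    """Return entries logged by `pid` or mentioning it in the message text."""
    related = []
    for line in lines:
        entry = parse(line)
        if entry is None:
            continue
        mentioned = (f"instance{pid}" in entry["message"]
                     or f"(pid {pid})" in entry["message"])
        if entry["pid"] == pid or mentioned:
            related.append(entry)
    return related


for pid in disconnected_client_pids(JOURNAL):
    for entry in entries_for_pid(JOURNAL, pid):
        print(entry["time"], entry["tag"], entry["message"][:60])
```

On this sample the compositor error points at pid 26890, and the MPRIS failure three seconds earlier refers to the same number, which is consistent with the freeze preceding the disconnect.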

The HandleGLibMessage function was added in https://hg.mozilla.org/mozilla-central/rev/9b7c81791d7f for bug 1859267 and first appeared in 120.0a1 20231018094117.

See Also: → 1859267

The bug has a crash signature, thus the bug will be considered confirmed.

Status: UNCONFIRMED → NEW
Ever confirmed: true

Yeah so this is the intended effect of that change. Before that we would have no visibility on those crashes.
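
As a rough illustration of why the change makes these exits visible: the stack above shows a GLib log writer calling HandleGLibMessage, which then crashes deliberately with a recorded reason. The Python sketch below models only that idea. The match list and all names are hypothetical; the actual matching logic in toolkit/xre/nsSigHandlers.cpp is not shown in this bug.

```python
class FatalLogMessage(RuntimeError):
    """Stands in for a deliberate crash that carries a reason string."""


# Hypothetical match list, taken from the crash reason quoted in this report.
FATAL_MESSAGES = ("Lost connection to Wayland compositor.",)


def log_writer(message, seen):
    """Record every message; turn a known-fatal one into a reportable crash."""
    seen.append(message)
    for fatal in FATAL_MESSAGES:
        if fatal in message:
            # Without this branch the process would simply exit later and
            # no crash report would carry the reason.
            raise FatalLogMessage(fatal)


seen = []
log_writer("Unrelated warning", seen)
try:
    log_writer("Lost connection to Wayland compositor.", seen)
except FatalLogMessage as exc:
    reason = str(exc)
    print("crash reason:", reason)
```

The point of the sketch is that the underlying failure is the same as before; only the reporting path changes.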

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash
Blocks: 1743144
Keywords: stalled, wayland
OS: Unspecified → Linux
Whiteboard: [tbird crash]

I opened a few crashes to see what's going on, given this is spiking on Firefox. The crashes all originate from _gdk_wayland_display_queue_events() or gdk_event_source_check(). In both cases the root cause is the same: something happened to the pipe that we use to connect to Wayland and we're crashing because of that.

I don't really see a pattern among the crashes; they seem to happen on different distros and different compositors. However, the crashiness seems to have started in the nightly channel with buildid 20231018062241, so that's where I'd start looking.

(Personal note: I can't help but be somewhat critical of a design that relies on a connection but apparently has no way to recover from errors with the connection itself)

Ignore the part about buildid 20231018062241; it was pointed out to me that this didn't crash before bug 1859267 landed, so there's no relationship with the problem itself.

I've gone through the comments on the crashes and they can be summarized under three groups:

  • People doing some slow I/O (e.g. from an NFS mount or hard-drive, dragging a file into Firefox, opening a large file with the DevTools)
  • People doing something heavyweight within the browser (opening lots of bookmarks together, calling about:memory)
  • Startup crashes

In all three cases we're clearly yanking the main thread and this leads to crashes. Some comments mention closing tabs or windows. On nightly that can lead to a child process shutdown, which could lead to a minidump being taken if it takes too long, which in turn could yank the main process' main thread causing it to stall and stop processing messages.

Given that there's no real pattern except for us being slow, I wonder if something changed on Wayland's side, possibly related to how much buffering or delay it tolerates? I don't see a way to solve this easily, except by moving the communication with Wayland outside of the main process' main thread, maybe via a proxy, but that would involve significant changes and I'm not even sure it's possible.
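
To make the "how much buffering or delay it tolerates" question concrete, here is a toy Python model of a client that stops reading for a while. It assumes the compositor queues events for the client up to a fixed limit and drops the connection once that limit is exceeded. Every number and name is made up for illustration; the thread does not describe real compositor limits.

```python
def first_disconnect_tick(stall_start, stall_len, total_ticks,
                          events_per_tick=10, drain_per_tick=50,
                          buffer_limit=100):
    """Return the tick at which the queue overflows, or None if it never does.

    Each tick the compositor queues `events_per_tick` events. Outside the
    stall window the client reads up to `drain_per_tick` of them. During the
    stall (a blocked main thread) nothing is read.
    """
    pending = 0
    stall_end = stall_start + stall_len
    for tick in range(total_ticks):
        pending += events_per_tick
        stalled = stall_start <= tick < stall_end
        if not stalled:
            pending = max(0, pending - drain_per_tick)
        if pending > buffer_limit:
            return tick
    return None


# A short stall is absorbed by the buffer; a long one is not.
short_stall = first_disconnect_tick(stall_start=5, stall_len=5, total_ticks=40)
long_stall = first_disconnect_tick(stall_start=5, stall_len=20, total_ticks=40)

# A much larger buffer stands in for a proxy that keeps reading on the
# main thread's behalf (again, purely a modelling assumption).
with_proxy = first_disconnect_tick(stall_start=5, stall_len=20, total_ticks=40,
                                   buffer_limit=1000)
print(short_stall, long_stall, with_proxy)
```

In this model the outcome depends only on how long the reader stalls relative to the buffer, which matches the observation that the reports share slowness rather than a distro or compositor.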

> Given that there's no real pattern except for us being slow I wonder if something changed on Wayland's side, possibly related to how much buffering or delay it tolerates?

Nothing changed other than us reporting these, see bug 1743144 which has a lot more context. Some DEs have mitigations for this, but some don't.

I just hit this locally, FWIW: bp-b8bda7eb-934b-4315-8c97-a30f80231127 (on relatively "stock" Ubuntu 22.04, with Gnome).

In my case, I had just clicked measure on about:memory. Firefox stopped responding to user input while we were gathering the memory report, and the Gnome "application not responding, kill or wait?" popup-dialog appeared. I clicked "Wait" on a few of them (it comes back after a few seconds), and then I just left it alone with a dialog up to let Firefox finish its memory-report business unperturbed. After a minute or so, Firefox suddenly crashed with this signature.

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 5 desktop browser crashes on Linux on beta
  • Top 5 desktop browser crashes on Linux on release (startup)

For more information, please visit BugBot documentation.

It happened three times in a row on stock openSUSE Tumbleweed 20231127, with Intel 5500 (i915) graphics and kde-plasma-wayland. Coincidentally, I have i915 debug arguments enabled on the kernel because of another issue (drm.debug=0xe log_buf_len), in case logs may help.
There was heavy CPU utilisation (make -j4) and a lot happening on the desktop (compile progress) at the time. Now that the load is gone, it is stable.
My crash: https://crash-stats.mozilla.org/report/index/25072f4a-39ef-4a14-9dee-7b0fb0231129#tab-bugzilla

Bug 1743144 is the main tracker.

Status: NEW → RESOLVED
Closed: 6 months ago
Duplicate of bug: 1743144
Resolution: --- → DUPLICATE

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled