Closed Bug 1834589 Opened 1 year ago Closed 1 year ago

Fatal Wayland protocol errors crash without crash report

Categories

(Core :: Widget: Gtk, task)

x86_64
Linux
task

Tracking

()

RESOLVED FIXED
117 Branch
Tracking Status
firefox-esr102 --- wontfix
firefox113 --- wontfix
firefox114 --- wontfix
firefox115 --- wontfix
firefox116 --- wontfix
firefox117 --- fixed

People

(Reporter: gcp, Assigned: stransky)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

If I understand https://bugzilla.mozilla.org/show_bug.cgi?id=1831557#c18 and the bug it points to as a regressor correctly, Firefox currently just exits when a fatal Wayland error occurs.

The GTK developers want this to be treated like a crash so we can track when this happens and get actionable data.

This needs some trickery to distinguish between normal logging and fatal errors.

See Also: → 1831557
Keywords: regression
Regressed by: 1826583
Severity: -- → S2
Blocks: wayland
See Also: → 1832158, 1832499

Set release status flags based on info from the regressing bug 1826583

:stransky, since you are the author of the regressor, bug 1826583, could you take a look?

For more information, please visit BugBot documentation.

Flags: needinfo?(stransky)

I believe that https://gitlab.gnome.org/GNOME/gtk/-/issues/4514 (an issue in upstream GTK where they do not allow clients to handle protocol errors) is the reason why we don't get crash reports here. I don't think there's much we can do here without gtk changing their error handling behaviour.

Hm odd, this was previously fixed in bug 1726923

Would it be possible to move the wayland connection to a separate process so the parent process can handle a connection loss more gracefully like a gpu crash?

(Darkspirit from bug 1826583 comment #21)

Can you printf if wl_log contains "destroyed while proxies still attached" and otherwise keep the MOZ_CRASH, so that we don't miss other warnings?

Hm odd, this was previously fixed in bug 1726923

It had to be backed out, which regressed this again:
https://bugzilla.mozilla.org/show_bug.cgi?id=1826583

Would it be possible to move the wayland connection to a separate process so the parent process can handle a connection loss more gracefully like a gpu crash?

Only in theory I think. The faulty error handling is in GTK, so you'd have to remote all of GTK, else any error would just end up with GTK _exit()-ing and bypassing the crash handler again.

This bug is primarily about remaining fractional scaling (IIUC) Wayland protocol errors which seem to crash without crash report. Most cases have been fixed, the remaining ones should be noticed, filed and fixed.

bug 1743144 might happen under highest system load and only when having very bad luck.
(Actually noticeable: non-Wayland Nightly-only bug 1831548 crashes without crash report as well. Please help there if you can.)

bug 1743144 might happen under highest system load and only when having very bad luck.

It happens more often under sway (see bug 1792754). iiuc that's because it doesn't implement workarounds for stalled clients that some other compositors have.

This bug is primarily about remaining fractional scaling (IIUC) Wayland protocol errors which seem to crash without crash report. Most cases have been fixed, the remaining ones should be noticed, filed and fixed.

After
https://hg.mozilla.org/releases/mozilla-release/rev/94ef9c555302
how sure are we that only fractional scaling protocol errors will cause issues like this?

From what I understand, we're now (back) in a state where we can be crashing under Wayland and never know about it, which is not a good place to be in. Are there strong reasons to believe this is limited to a small set of fractional scaling mistakes?

(In reply to Gian-Carlo Pascutto [:gcp] from comment #10)

how sure are we that only fractional scaling protocol errors will cause issues like this?

No, not only, just primarily at the moment: bug 1803016 (fixed in 113), bug 1832499 (seen with 113 stable), bug 1832158 (seen with 114)
History: https://crash-stats.mozilla.org/signature/?signature=wl_log#bugzilla

From what I understand, we're now (back) in a state where we can be crashing under Wayland and never know about it

Couldn't you combine the old & new code like this (I'm not a C++ programmer):

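// Needs mozilla/Sprintf.h (VsprintfLiteral), mozilla/gfx/Logging.h
// (gfxCriticalNote), mozilla/Assertions.h (MOZ_CRASH_UNSAFE) and <cstring>.
static void WlLogHandler(const char* format, va_list args) {
  char error[1000];
  VsprintfLiteral(error, format, args);
  // Known harmless warning: log it and keep running.
  if (strstr(error, "destroyed while proxies still attached")) {
    gfxCriticalNote << "Wayland protocol warning: " << error;
    return;
  }
  // Anything else is treated as fatal so that a crash report is generated.
  MOZ_CRASH_UNSAFE(error);
}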

Hm, or we can fix that warning. Interestingly a bunch of apps started printing it, including weston-simple-egl. Will have a look, maybe we just need to have a cleaner cleanup.

That warning has been triggered by EGL drivers and been fixed:
https://github.com/NVIDIA/egl-wayland/pull/79
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21646
It has also been relaxed: https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/297/diffs
But as the warning is still there with older Nvidia/Mesa driver versions on wayland-1.22, it needs to be filtered out when reintroducing MOZ_CRASH_UNSAFE.

I see - so really just a temporary thing. Yeah, the proposal from comment 11 sounds exactly right then.

I think the situation is more complex here. There are two sorts of bugs:

  1. General Wayland display connection bugs, related to the connection between the client (Firefox/GTK) and the Wayland compositor. They're handled by gtk3/gdkeventsource.c when wl_display_* calls fail. The messages look like 'Error dispatching to Wayland display', 'Lost connection to Wayland compositor', 'Error flushing display' and so on. The code looks like:
g_message ("Error flushing display: %s", g_strerror (errno));
_exit (1);

and we haven't handled such messages. By the way, you also get this when you press CTRL+C in the terminal where Firefox/Wayland is running. I don't think we need to catch these, as they may be caused by the environment.

  2. Wayland protocol errors related to the client (Firefox). They're handled by the wayland-client library and issued when the client (us) uses a wrong object or similar. There are two calls: wl_log() and wl_abort(). wl_log() just prints something like "destroyed while proxies still attached" and the application can continue. wl_abort() is fatal: it prints an error message and then calls abort(). So all we need is to catch the abort() from the wl_abort() call, if that's possible.
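To illustrate why the two termination paths above behave differently for a crash handler, here is a minimal standalone sketch (not Firefox code). It assumes a POSIX system; the names OnAbort, ChildExitCode and kHandlerRan are made up for the illustration. A child process installs a SIGABRT handler and then terminates either through abort(), as wl_abort() does, or through _exit(1), as the GDK code does. Only the abort() path reaches the handler.

```cpp
#include <cassert>
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical exit code meaning "the in-process handler ran".
constexpr int kHandlerRan = 42;

extern "C" void OnAbort(int) {
  // Stand-in for a crash reporter. Only async-signal-safe calls are
  // allowed in a signal handler; _exit() is one of them.
  _exit(kHandlerRan);
}

// Forks a child that terminates via abort() or via _exit(1) and
// returns the exit code the parent observes (-1 on fork/wait failure).
int ChildExitCode(bool viaAbort) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    std::signal(SIGABRT, OnAbort);
    if (viaAbort) std::abort();  // raises SIGABRT, so OnAbort runs
    _exit(1);                    // bypasses every handler
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  // Both paths end in _exit(), so the child always exits normally.
  assert(WIFEXITED(status));
  return WEXITSTATUS(status);
}
```

Calling ChildExitCode(true) yields the handler's code, while ChildExitCode(false) yields the plain exit code 1, which is the situation where no crash report can be produced from inside the process.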
Flags: needinfo?(stransky)

I see the scenario from Bug 1831557

[GFX1-]: Wayland protocol error: wl_surface@132: error 2: Buffer size (1173x128) must be an integer multiple of the buffer_scale (2).
Gdk-Message: 18:33:10.147: Error 71 (Protokolfejl) dispatching to Wayland display.

so wayland-client issued a non-fatal log message here ("wl_surface@132: error ...") but then the compositor (Wayland server) decided to terminate the client because of it (note that this bug is fatal in Sway but it's not fatal in Gnome/Mutter).
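For reference, the rule behind that protocol error can be written as a one-line check. This is only an illustrative sketch; the function name BufferSizeMatchesScale is hypothetical and not part of any Wayland or Firefox API. It encodes what the error text states: both buffer dimensions must be integer multiples of the buffer_scale.

```cpp
#include <cassert>

// Returns true when both dimensions are integer multiples of the
// (positive, integer) buffer scale, as the error message requires.
bool BufferSizeMatchesScale(int width, int height, int scale) {
  assert(scale > 0);
  return width % scale == 0 && height % scale == 0;
}
```

With the values from the log, 1173x128 at scale 2 fails the check because 1173 is odd, whereas 1174x128 would pass.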

note that this bug is fatal in Sway but it's not fatal in Gnome/Mutter

It should be fatal (again) in Mutter as of 44.

I don't see any good simple solution, as all of them have a drawback:

  • If we crash on wl_log() as we did before we will crash even on harmless messages where other Wayland applications work.
  • If we crash on g_message() hook we will crash on cases when Firefox is just terminated.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #18)

  • If we crash on wl_log() as we did before we will crash even on harmless messages where other Wayland applications work.

I'm not aware of any harmless message apart from "destroyed while proxies still attached":
bug 1831557 and bug 1832499 have been crashes on Gnome, not Sway.
If there are other messages that are known to not crash, they could be filtered out as well, but IIUC we don't know such messages yet.
There is a new ESR dot release each month, uplifting a relaxation is safer than not knowing that there are crashes.

(For more flexibility, a Wayland error relaxation pref containing a regex could be added. If a user complains that Firefox crashes with a specific message, we could ask for a pref change, and the user could report back whether that fixes it or whether Firefox then crashes without a crash report. But we are not aware of such a case yet, and Robert reported that such messages would be fatal again in Mutter 44.)
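A rough sketch of how such a regex-based relaxation could be evaluated, using only the C++ standard library. No such pref exists as far as this thread shows; the function name IsRelaxedWaylandMessage and the idea of passing the pref value in as a string are assumptions made for the illustration.

```cpp
#include <regex>
#include <string>

// Returns true if `message` matches the user-supplied relaxation
// pattern, i.e. the message should be logged instead of crashing.
// An empty or invalid pattern relaxes nothing, so every message
// stays fatal by default.
bool IsRelaxedWaylandMessage(const std::string& message,
                             const std::string& prefPattern) {
  if (prefPattern.empty()) {
    return false;
  }
  try {
    return std::regex_search(message, std::regex(prefPattern));
  } catch (const std::regex_error&) {
    // A broken pattern must not silence real errors.
    return false;
  }
}
```

Failing closed on an empty or malformed pattern keeps the default behaviour (crash and report) unless a user has deliberately opted out for a specific message.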

I'm not aware of any harmless message apart from "destroyed while proxies still attached":

Same here - and AFAIK there shouldn't be any. Wayland normally doesn't have "warning" messages (which is sometimes quite annoying from a compositor perspective).

Summary: Fatal Wayland protocol errors don't trigger a crash → Fatal Wayland protocol errors crash without crash report

Set release status flags based on info from the regressing bug 1826583

Gcp, should we do something for 116/117? Thanks

Flags: needinfo?(gpascutto)

We have no clear solution here I think. Comment 19 seems to suggest fixing the issues if they occur, and I think fixing GTK so it doesn't break crash reporting (https://gitlab.gnome.org/GNOME/gtk/-/issues/4514) would be preferred too?

Flags: needinfo?(gpascutto)
Type: task → defect

Removing regression tags as this isn't due to our own code and the "regressor" was more like a "failed workaround".

Type: defect → task
Keywords: regression
No longer regressed by: 1826583

Would it be possible to have a watchdog process instead of triggering the crash report from within the crashing process itself? At a minimum it could submit a dummy crash report indicating a missing sample. Or it could use ptrace to stop the process at exit and still collect stack traces.
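A minimal sketch of the watchdog idea, assuming a POSIX system. The names ChildFate, Classify and WatchChild are hypothetical, and the child here simply calls _exit() with a chosen code instead of running a browser. The parent only inspects the wait status, which is where a "dummy" report for a silent non-zero exit could be generated.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class ChildFate { CleanExit, SilentFailure, KilledBySignal, Unknown };

// Maps a waitpid() status to what a watchdog would conclude.
ChildFate Classify(int status) {
  if (WIFSIGNALED(status)) return ChildFate::KilledBySignal;
  if (WIFEXITED(status)) {
    return WEXITSTATUS(status) == 0 ? ChildFate::CleanExit
                                    : ChildFate::SilentFailure;
  }
  return ChildFate::Unknown;
}

// Forks a child that immediately calls _exit(exitCode), as the GDK
// error path does with 1, and classifies how it ended.
ChildFate WatchChild(int exitCode) {
  assert(exitCode >= 0 && exitCode <= 255);
  pid_t pid = fork();
  if (pid < 0) return ChildFate::Unknown;
  if (pid == 0) _exit(exitCode);
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return ChildFate::Unknown;
  return Classify(status);
}
```

Note that the exit status alone cannot tell a compositor disconnect apart from other _exit(1) paths, such as the CTRL+C case mentioned earlier in this bug, so a real watchdog would need additional signals before filing anything.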

IMO we should just follow Comment 19 because

  1. Wayland does not have a concept of non-fatal warnings
  2. the case here was an unfortunate exception, triggered by a certain combination of libwayland and mesa and has been fixed since
  3. such non-fatal warnings can only be produced in libwayland, not by different compositor implementations (Comment 16 is slightly wrong here - errors are always fatal, but compositors can choose not to send these errors, which Mutter did for a while). So stuff like this can generally only happen on major libwayland updates.
Assignee: nobody → stransky
Status: NEW → ASSIGNED
Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/570c5180c654
[Wayland] Crash on wl_log() unless we get 'destroyed while proxies still attached' message r=emilio
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 117 Branch

I'm still getting crashes under wayland that don't result in crash reports

Firefox 117.0a1 20230720093622
Window Protocol wayland
sway 1.8.1

gdb backtrace when setting a breakpoint on exit_group shows:

Thread 1 (Thread 0x7fde5ec89780 (LWP 1135714) "firefox-bin"):
#0  0x00007fde5e7d8bad in _exit () at /usr/lib/libc.so.6
#1  0x00007fde5ad6b868 in  () at /usr/lib/libgdk-3.so.0
#2  0x00007fde5ad37fb9 in gdk_display_get_event () at /usr/lib/libgdk-3.so.0
#3  0x00007fde5ad72838 in  () at /usr/lib/libgdk-3.so.0
#4  0x00007fde5ac16a31 in g_main_context_dispatch () at /usr/lib/libglib-2.0.so.0
#5  0x00007fde5ac73cc9 in  () at /usr/lib/libglib-2.0.so.0
#6  0x00007fde5ac140e2 in g_main_context_iteration () at /usr/lib/libglib-2.0.so.0
#7  0x00007fde52325752 in nsThread::ProcessNextEvent(bool, bool*) () at /home/the8472/opt/firefox/libxul.so
#8  0x00007fde52373bf0 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) () at /home/the8472/opt/firefox/libxul.so
#9  0x00007fde538f50fe in nsBaseAppShell::Run() () at /home/the8472/opt/firefox/libxul.so
#10 0x00007fde51417025 in nsAppStartup::Run() () at /home/the8472/opt/firefox/libxul.so
#11 0x00007fde53c47495 in XREMain::XRE_mainRun() () at /home/the8472/opt/firefox/libxul.so
#12 0x00007fde514a70cb in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/the8472/opt/firefox/libxul.so
#13 0x00007fde514a7658 in XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/the8472/opt/firefox/libxul.so
#14 0x000056380ada7f58 in main ()

On the tty I'm getting one of the two messages during each crash:

  • Gdk-Message: 22:41:28.825: Error 32 (Broken pipe) dispatching to Wayland display.
  • Gdk-Message: 22:43:42.937: Lost connection to Wayland compositor.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

(In reply to The 8472 from comment #30)

Please file a new bug for it. Run Firefox with WAYLAND_DEBUG=1 env variable and attach the log there.
Thanks.

Flags: needinfo?(bugzilla.mozilla.org)

Ok, will do.

Flags: needinfo?(bugzilla.mozilla.org)
Status: REOPENED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED