Fatal Wayland protocol errors crash without crash report
Categories
(Core :: Widget: Gtk, task)
Tracking
()
People
(Reporter: gcp, Assigned: stransky)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
If I understand https://bugzilla.mozilla.org/show_bug.cgi?id=1831557#c18 correctly, along with the bug it points to as a regressor, Firefox currently just exits when a fatal Wayland error occurs.
The GTK developers want this to be treated like a crash so we can track when this happens and get actionable data.
This needs some trickery to distinguish between normal logging and fatal errors.
Comment 1•1 year ago
Set release status flags based on info from the regressing bug 1826583
:stransky, since you are the author of the regressor, bug 1826583, could you take a look?
For more information, please visit BugBot documentation.
Comment 2•1 year ago
I believe that https://gitlab.gnome.org/GNOME/gtk/-/issues/4514 (an issue in upstream GTK where they do not allow clients to handle protocol errors) is the reason why we don't get crash reports here. I don't think there's much we can do here without gtk changing their error handling behaviour.
Comment 3•1 year ago
Hm odd, this was previously fixed in bug 1726923
Would it be possible to move the wayland connection to a separate process so the parent process can handle a connection loss more gracefully like a gpu crash?
Comment 5•1 year ago
(Darkspirit from bug 1826583 comment #21)
Can you printf if wl_log contains "destroyed while proxies still attached" and otherwise keep the MOZ_CRASH, so that we don't miss other warnings?
Reporter
Comment 6•1 year ago
> Hm odd, this was previously fixed in bug 1726923
It had to be backed out, which regressed this again:
https://bugzilla.mozilla.org/show_bug.cgi?id=1826583
Reporter
Comment 7•1 year ago
> Would it be possible to move the wayland connection to a separate process so the parent process can handle a connection loss more gracefully like a gpu crash?
Only in theory, I think. The faulty error handling is in GTK, so you'd have to remote all of GTK; otherwise any error would still end with GTK calling _exit() and bypassing the crash handler again.
Comment 8•1 year ago
This bug is primarily about remaining fractional scaling (IIUC) Wayland protocol errors which seem to crash without crash report. Most cases have been fixed, the remaining ones should be noticed, filed and fixed.
bug 1743144 might happen under highest system load and only when having very bad luck.
(Actually noticeable: non-Wayland Nightly-only bug 1831548 crashes without crash report as well. Please help there if you can.)
> bug 1743144 might happen under highest system load and only when having very bad luck.
It happens more often under sway (see bug 1792754). iiuc that's because it doesn't implement workarounds for stalled clients that some other compositors have.
Reporter
Comment 10•1 year ago
> This bug is primarily about remaining fractional scaling (IIUC) Wayland protocol errors which seem to crash without crash report. Most cases have been fixed, the remaining ones should be noticed, filed and fixed.
After
https://hg.mozilla.org/releases/mozilla-release/rev/94ef9c555302
how sure are we that only fractional scaling protocol errors will cause issues like this?
From what I understand, we're now (back) in a state where we can be crashing under Wayland and never know about it, which is not a good place to be in. Are there strong reasons to believe this is limited to a small set of fractional scaling mistakes?
Comment 11•1 year ago
(In reply to Gian-Carlo Pascutto [:gcp] from comment #10)
> how sure are we that only fractional scaling protocol errors will cause issues like this?
No, not only, just primarily at the moment: bug 1803016 (fixed in 113), bug 1832499 (seen with 113 stable), bug 1832158 (seen with 114)
History: https://crash-stats.mozilla.org/signature/?signature=wl_log#bugzilla
> From what I understand, we're now (back) in a state where we can be crashing under Wayland and never know about it
Couldn't you combine the old & new code like this (I'm not a C++ programmer):
static void WlLogHandler(const char* format, va_list args) {
  char error[1000];
  VsprintfLiteral(error, format, args);
  // Known-harmless message: log it and keep running.
  bool warning = std::string(error).find("destroyed while proxies still attached") != std::string::npos;
  if (warning) {
    gfxCriticalNote << "Wayland protocol warning: " << error;
  } else {
    // Anything else is treated as fatal so that a crash report is generated.
    MOZ_CRASH_UNSAFE(error);
  }
}
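The filtering idea can be tried outside the Firefox tree. What follows is a minimal standalone sketch, not the patch that eventually landed: `WaylandLogAction`, `ClassifyWaylandLog` and its two wrappers are hypothetical names, and a real handler would call MOZ_CRASH_UNSAFE or gfxCriticalNote where this sketch only returns an enum value.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <cstring>

// Hypothetical names for illustration; not part of Firefox or libwayland.
enum class WaylandLogAction { Warn, Crash };

// The one message this bug identifies as harmless.
static const char kBenignMessage[] = "destroyed while proxies still attached";

// Decide what to do with an already formatted log message.
WaylandLogAction ClassifyWaylandLog(const char* message) {
  assert(message != nullptr);
  return std::strstr(message, kBenignMessage) ? WaylandLogAction::Warn
                                              : WaylandLogAction::Crash;
}

// Same shape as a log handler taking (format, va_list): format, then classify.
WaylandLogAction ClassifyWaylandLogV(const char* format, va_list args) {
  char buffer[1000];
  // vsnprintf truncates long messages instead of overflowing the buffer.
  std::vsnprintf(buffer, sizeof(buffer), format, args);
  return ClassifyWaylandLog(buffer);
}

// Variadic wrapper so the va_list version can be exercised directly.
WaylandLogAction ClassifyWaylandLogF(const char* format, ...) {
  va_list args;
  va_start(args, format);
  const WaylandLogAction action = ClassifyWaylandLogV(format, args);
  va_end(args);
  return action;
}
```

Keeping the decision separate from the side effect makes the allow-list easy to test and easy to extend if another harmless message is ever found.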
Comment 12•1 year ago
Hm, or we can fix that warning. Interestingly, a bunch of apps started printing it, including weston-simple-egl. Will have a look, maybe we just need a cleaner cleanup.
Comment 13•1 year ago
That warning has been triggered by EGL drivers and been fixed:
https://github.com/NVIDIA/egl-wayland/pull/79
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21646
It has also been relaxed: https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/297/diffs
But as the warning is still there with older Nvidia/Mesa driver versions on wayland-1.22, it needs to be filtered out when reintroducing MOZ_CRASH_UNSAFE.
Comment 14•1 year ago
I see - so really just a temporary thing. Yeah, the proposal from comment 11 sounds exactly right then.
Assignee
Comment 15•1 year ago
I think the situation is more complex here. There are two sorts of bugs:
- General Wayland display connection bugs, related to the connection between the client (Firefox/Gtk) and the Wayland compositor. They're handled by gtk3/gdkeventsource.c when wl_display_* calls fail, with messages like 'Error dispatching to Wayland display', 'Lost connection to Wayland compositor', 'Error flushing display' and so on. The code looks like:
    g_message ("Error flushing display: %s", g_strerror (errno));
    _exit (1);
  and we haven't handled such messages. Btw., you'll get one when you press Ctrl+C in the terminal where Firefox/Wayland is running. I don't think we need to catch these as they may be related to the environment.
- Wayland protocol errors related to the client (Firefox). They're handled by the wayland-client library and issued when the client (us) uses a wrong object or similar. There are two calls: wl_log() and wl_abort(). wl_log() just prints something like "destroyed while proxies still attached" and the application can continue. wl_abort() is fatal: it prints an error message and then calls abort(). So all we need is to catch the abort() from the wl_abort() call, if that's possible.
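To illustrate the last point: abort() raises SIGABRT, so a signal handler can observe it before the process dies. The following is a standalone POSIX sketch under stated assumptions, not Firefox code; Firefox's crash reporter has its own signal handling, which this does not reproduce. `OnAbort`, `RunChildThatAborts` and the exit code 42 are hypothetical and exist only to make the effect visible.

```cpp
#include <cassert>
#include <cstdlib>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical exit code used to show that the handler ran.
static const int kAbortObserved = 42;

// Runs on SIGABRT. Only async-signal-safe calls (write, _exit) are used.
static void OnAbort(int) {
  static const char kMessage[] = "abort() observed; a crash report would be written here\n";
  const ssize_t ignored = write(STDERR_FILENO, kMessage, sizeof(kMessage) - 1);
  (void)ignored;
  _exit(kAbortObserved);
}

// Forks a child that installs the handler and then calls abort(), standing in
// for what wl_abort() does. Returns the child's exit code, or -1 on failure.
int RunChildThatAborts() {
  const pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    struct sigaction action = {};
    action.sa_handler = OnAbort;
    sigemptyset(&action.sa_mask);
    const int rc = sigaction(SIGABRT, &action, nullptr);
    assert(rc == 0);  // installing a SIGABRT handler is not expected to fail
    (void)rc;
    std::abort();
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that this only helps for the abort() path. The g_message() plus _exit(1) path in GTK raises no signal, so a handler like this never sees it.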
Assignee
Comment 16•1 year ago
I see the scenario from Bug 1831557
[GFX1-]: Wayland protocol error: wl_surface@132: error 2: Buffer size (1173x128) must be an integer multiple of the buffer_scale (2).
Gdk-Message: 18:33:10.147: Error 71 (Protokolfejl) dispatching to Wayland display.
So wayland-client issued a non-fatal log message here ("wl_surface@132: error ...") but then the compositor (Wayland server) decided to terminate the client because of it (note that this bug is fatal in Sway but it's not fatal in Gnome/Mutter).
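For triage it can help to pull the interface, object id and error code out of such a log line. The sketch below assumes the `<interface>@<id>: error <code>: <text>` shape seen in the line quoted above; other libwayland versions may format the message differently. `WaylandProtocolError` and `ParseProtocolError` are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper type; the field names are chosen for this sketch.
struct WaylandProtocolError {
  std::string interfaceName;  // e.g. "wl_surface"
  unsigned long objectId;     // e.g. 132
  unsigned long code;         // e.g. 2
  std::string text;           // human-readable remainder of the message
};

// Parses "<interface>@<id>: error <code>: <text>".
// Returns false and leaves *out untouched if the line has another shape.
bool ParseProtocolError(const std::string& line, WaylandProtocolError* out) {
  assert(out != nullptr);
  // Digit runs are capped at 9 so std::stoul cannot overflow.
  static const std::regex kPattern(R"(^(\w+)@(\d{1,9}): error (\d{1,9}): (.*)$)");
  std::smatch match;
  if (!std::regex_match(line, match, kPattern)) {
    return false;
  }
  out->interfaceName = match[1].str();
  out->objectId = std::stoul(match[2].str());
  out->code = std::stoul(match[3].str());
  out->text = match[4].str();
  return true;
}
```

Grouping reports by interface and error code, rather than by the full text, would keep variable parts such as buffer sizes from splitting one bug into many signatures.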
Comment 17•1 year ago
> note that this bug is fatal in Sway but it's not fatal in Gnome/Mutter
It should be fatal (again) in Mutter as of 44.
Assignee
Comment 18•1 year ago
I don't see any good simple solution, as each option has a drawback:
- If we crash on wl_log() as we did before, we will crash even on harmless messages where other Wayland applications keep working.
- If we crash in a g_message() hook, we will also crash in cases where Firefox is just terminated.
Comment 19•1 year ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #18)
> - If we crash on wl_log() as we did before we will crash even on harmless messages where other Wayland applications work.
I'm not aware of any harmless message apart from "destroyed while proxies still attached":
bug 1831557 and bug 1832499 have been crashes on Gnome, not Sway.
If there are other messages that are known to not crash, they could be filtered out as well, but IIUC we don't know such messages yet.
There is a new ESR dot release each month, uplifting a relaxation is safer than not knowing that there are crashes.
(For more flexibility, a Wayland error relaxation pref containing a regex could be added. If a user complains that Firefox crashes with a specific message, we could ask for a pref change and the user could report back whether it's fine or whether Firefox then crashes without a crash report. But we are not aware of such a case yet, and Robert reported that such messages would be fatal again in Mutter 44.)
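A rough sketch of how such a relaxation pref could be evaluated, assuming the pref value arrives as a plain string holding an ECMAScript-style regex. No such pref exists in this bug; `IsRelaxedByPref` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if `message` matches the user-supplied relaxation pattern.
// An empty or invalid pattern relaxes nothing, so the caller would still
// treat the message as fatal and produce a crash report.
bool IsRelaxedByPref(const char* message, const std::string& prefPattern) {
  assert(message != nullptr);
  if (prefPattern.empty()) {
    return false;
  }
  try {
    const std::regex pattern(prefPattern);
    return std::regex_search(message, pattern);
  } catch (const std::regex_error&) {
    // A malformed pref must not silently disable crash reporting.
    return false;
  }
}
```

Failing closed on a bad pattern is a deliberate choice here: a typo in the pref then costs an unnecessary crash report rather than a missed one.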
Comment 20•1 year ago
> I'm not aware of any harmless message apart from "destroyed while proxies still attached":
Same here - and AFAIK there shouldn't be any. Wayland normally doesn't have "warning" messages (which is sometimes quite annoying from a compositor perspective).
Comment 21•1 year ago
Set release status flags based on info from the regressing bug 1826583
Reporter
Comment 23•1 year ago
We have no clear solution here I think. Comment 19 seems to suggest fixing the issues if they occur, and I think fixing GTK so it doesn't break crash reporting (https://gitlab.gnome.org/GNOME/gtk/-/issues/4514) would be preferred too?
Reporter
Comment 24•1 year ago
Removing regression tags as this isn't due to our own code and the "regressor" was more like a "failed workaround".
Comment 25•1 year ago
Would it be possible to have a watchdog process instead of triggering the crash report from within the crashing process itself? At a minimum it could submit a dummy crash report indicating a missing sample. Or it could use ptrace to stop the process at exit and still collect stack traces.
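A minimal POSIX sketch of the watchdog idea, assuming the watchdog is simply the parent of the watched process. It only classifies how the child ended and does not collect stack traces or submit anything. `ChildOutcome`, `WatchChild` and the example child functions are hypothetical names.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// How the watched process ended, as seen from the watchdog.
enum class ChildOutcome { CleanExit, ExitedWithError, KilledBySignal, WatchFailed };

// Translate a waitpid() status into an outcome the watchdog can report.
ChildOutcome ClassifyStatus(int status) {
  if (WIFEXITED(status)) {
    return WEXITSTATUS(status) == 0 ? ChildOutcome::CleanExit
                                    : ChildOutcome::ExitedWithError;
  }
  if (WIFSIGNALED(status)) {
    return ChildOutcome::KilledBySignal;
  }
  return ChildOutcome::WatchFailed;
}

// Fork a child that runs `childMain`, wait for it, and report how it ended.
ChildOutcome WatchChild(int (*childMain)()) {
  assert(childMain != nullptr);
  const pid_t pid = fork();
  if (pid < 0) return ChildOutcome::WatchFailed;
  if (pid == 0) _exit(childMain());
  int status = 0;
  while (waitpid(pid, &status, 0) < 0) {
    if (errno != EINTR) return ChildOutcome::WatchFailed;  // retry only on EINTR
  }
  return ClassifyStatus(status);
}

// Example children: a normal shutdown, a GTK-style _exit(1), and an abort().
int ExitCleanly() { return 0; }
int ExitLikeGtk() { _exit(1); }
int CrashWithAbort() { std::abort(); }
```

In this model `ExitedWithError` is the case where a placeholder report could be filed, because an `_exit(1)` from inside GTK leaves no signal for an in-process crash handler to catch.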
Comment 26•1 year ago
IMO we should just follow Comment 19 because
- Wayland does not have a concept of non-fatal warnings
- the case here was an unfortunate exception, triggered by a certain combination of libwayland and mesa and has been fixed since
- such a non-fatal warning can only be produced in libwayland, not by different compositor implementations (Comment 16 is slightly wrong here: errors are always fatal, but compositors can choose not to send these errors, which Mutter did for a while). So issues like this can generally only happen on major libwayland updates.
Assignee
Comment 27•1 year ago
Comment 28•1 year ago
Pushed by stransky@redhat.com: https://hg.mozilla.org/integration/autoland/rev/570c5180c654 [Wayland] Crash on wl_log() unless we get 'destroyed while proxies still attached' message r=emilio
Comment 29•1 year ago
bugherder
Comment 30•1 year ago
I'm still getting crashes under wayland that don't result in crash reports
Firefox 117.0a1 20230720093622
Window Protocol wayland
sway 1.8.1
gdb backtrace when setting a breakpoint on exit_group shows:
Thread 1 (Thread 0x7fde5ec89780 (LWP 1135714) "firefox-bin"):
#0 0x00007fde5e7d8bad in _exit () at /usr/lib/libc.so.6
#1 0x00007fde5ad6b868 in () at /usr/lib/libgdk-3.so.0
#2 0x00007fde5ad37fb9 in gdk_display_get_event () at /usr/lib/libgdk-3.so.0
#3 0x00007fde5ad72838 in () at /usr/lib/libgdk-3.so.0
#4 0x00007fde5ac16a31 in g_main_context_dispatch () at /usr/lib/libglib-2.0.so.0
#5 0x00007fde5ac73cc9 in () at /usr/lib/libglib-2.0.so.0
#6 0x00007fde5ac140e2 in g_main_context_iteration () at /usr/lib/libglib-2.0.so.0
#7 0x00007fde52325752 in nsThread::ProcessNextEvent(bool, bool*) () at /home/the8472/opt/firefox/libxul.so
#8 0x00007fde52373bf0 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) () at /home/the8472/opt/firefox/libxul.so
#9 0x00007fde538f50fe in nsBaseAppShell::Run() () at /home/the8472/opt/firefox/libxul.so
#10 0x00007fde51417025 in nsAppStartup::Run() () at /home/the8472/opt/firefox/libxul.so
#11 0x00007fde53c47495 in XREMain::XRE_mainRun() () at /home/the8472/opt/firefox/libxul.so
#12 0x00007fde514a70cb in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/the8472/opt/firefox/libxul.so
#13 0x00007fde514a7658 in XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/the8472/opt/firefox/libxul.so
#14 0x000056380ada7f58 in main ()
On the tty I'm getting one of the two messages during each crash:
- Gdk-Message: 22:41:28.825: Error 32 (Broken pipe) dispatching to Wayland display.
- Gdk-Message: 22:43:42.937: Lost connection to Wayland compositor.
Assignee
Comment 31•1 year ago
(In reply to The 8472 from comment #30)
> I'm still getting crashes under wayland that don't result in crash reports
> Firefox 117.0a1 20230720093622
> Window Protocol wayland
> sway 1.8.1
> gdb backtrace when setting a breakpoint on exit_group shows:
> Thread 1 (Thread 0x7fde5ec89780 (LWP 1135714) "firefox-bin"):
> #0 0x00007fde5e7d8bad in _exit () at /usr/lib/libc.so.6
> #1 0x00007fde5ad6b868 in () at /usr/lib/libgdk-3.so.0
> #2 0x00007fde5ad37fb9 in gdk_display_get_event () at /usr/lib/libgdk-3.so.0
> #3 0x00007fde5ad72838 in () at /usr/lib/libgdk-3.so.0
> #4 0x00007fde5ac16a31 in g_main_context_dispatch () at /usr/lib/libglib-2.0.so.0
> #5 0x00007fde5ac73cc9 in () at /usr/lib/libglib-2.0.so.0
> #6 0x00007fde5ac140e2 in g_main_context_iteration () at /usr/lib/libglib-2.0.so.0
> #7 0x00007fde52325752 in nsThread::ProcessNextEvent(bool, bool*) () at /home/the8472/opt/firefox/libxul.so
> #8 0x00007fde52373bf0 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) () at /home/the8472/opt/firefox/libxul.so
> #9 0x00007fde538f50fe in nsBaseAppShell::Run() () at /home/the8472/opt/firefox/libxul.so
> #10 0x00007fde51417025 in nsAppStartup::Run() () at /home/the8472/opt/firefox/libxul.so
> #11 0x00007fde53c47495 in XREMain::XRE_mainRun() () at /home/the8472/opt/firefox/libxul.so
> #12 0x00007fde514a70cb in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/the8472/opt/firefox/libxul.so
> #13 0x00007fde514a7658 in XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/the8472/opt/firefox/libxul.so
> #14 0x000056380ada7f58 in main ()
> On the tty I'm getting one of the two messages during each crash:
> - Gdk-Message: 22:41:28.825: Error 32 (Broken pipe) dispatching to Wayland display.
> - Gdk-Message: 22:43:42.937: Lost connection to Wayland compositor.
Please file a new bug for it. Run Firefox with WAYLAND_DEBUG=1 env variable and attach the log there.
Thanks.
Comment 33•1 year ago
Filed bug 1844690