Closed Bug 1819969 Opened 2 years ago Closed 2 years ago

Crash in [@ PollWrapper] caused by corrupt/wrong signal information

Categories

(Toolkit :: Crash Reporting, defect)

Unspecified
Linux
defect

Tracking


RESOLVED FIXED
113 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox110 --- wontfix
firefox111 + wontfix
firefox112 + fixed
firefox113 --- fixed

People

(Reporter: gsvelto, Assigned: gsvelto)

References

(Regression)

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/28bac7d7-8cd7-4b3a-a699-42e1b0230302

Reason: unknown 0x00000440 / 0x00000000

No crashing thread identified; using thread 0.

Top 10 frames of crashing thread:

0  libc.so.6  __GI___poll  /build/glibc-SzIz7B/glibc-2.31/sysdeps/unix/sysv/linux/poll.c:29
1  libxul.so  PollWrapper  /build/firefox/parts/firefox/build/widget/gtk/nsAppShell.cpp:78
2  libglib-2.0.so.0  g_main_context_iterate.isra.0  /build/gnome-3-38-2004-sdk/parts/glib/src/glib/gmain.c:4042
3  libglib-2.0.so.0  g_main_context_iteration  /build/gnome-3-38-2004-sdk/parts/glib/src/glib/gmain.c:4108
4  libxul.so  nsAppShell::ProcessNextNativeEvent  /build/firefox/parts/firefox/build/widget/gtk/nsAppShell.cpp:426
4  libxul.so  nsBaseAppShell::DoProcessNextNativeEvent  /build/firefox/parts/firefox/build/widget/nsBaseAppShell.cpp:131
4  libxul.so  nsBaseAppShell::OnProcessNextEvent  /build/firefox/parts/firefox/build/widget/nsBaseAppShell.cpp:267
4  libxul.so  {virtual override thunk}  /build/firefox/parts/firefox/build/widget/nsBaseAppShell.cpp
4  libxul.so  nsThread::ProcessNextEvent  /build/firefox/parts/firefox/build/xpcom/threads/nsThread.cpp:1111
4  libxul.so  NS_ProcessNextEvent  /build/firefox/parts/firefox/build/xpcom/threads/nsThreadUtils.cpp:473

The crashes under this signature are all borked: in particular, the crash address and the crash reason make no sense. However, they all have something in common that points to bad crash generation. Take this crash for example:

  • The raw crash reason is MOZ_DIAGNOSTIC_ASSERT(!IsUsed()) which points to this line
  • Indeed, thread 25 has a stack that contains that line... with the frames of the exception handler on top of it. The frames from the exception handler shouldn't be visible in a crash, as the stack walker would start walking the stack from the point of crash (which we find via siginfo.si_addr); this points to siginfo.si_addr being wrong.
  • siginfo.si_signo is clearly wrong... but consistently so; there are only a handful of variations

The above makes me think that there's an issue in how we handle the crashed thread context:

  • The thread that crashed is stuck here
  • Which means it sent the context with the signal info to the main process here. The context is this blob.
  • The context is passed from here
  • The context is filled here so this is one of the places where things could go wrong
  • Things could also go wrong on the writing side here where we have a very unpleasant hack because of the discrepancy between Breakpad's context and minidump-writer's context

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

:gsvelto, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(gsvelto)
Keywords: topcrash
Severity: -- → S2
Flags: needinfo?(gsvelto)

I opened a minidump and did a few tests to narrow down what's happening here. These are the raw contents of the MINIDUMP_EXCEPTION structure in a regular crash:

900a 0004 0000 0000 000b 0000 0001 0000
0000 0000 0000 0000 0000 0000 0000 0000
*
04d0 0000 332c 0002

We've got the thread ID (0x4900a), then the signal number (0xb, that's SIGSEGV), the code (0x1, SEGV_MAPERR), the exception record and address, which are both NULL (expected here), then all emptiness until the thread context size (1232) and RVA (0x2332c). So far so good. Let's do the same with a bogus crash:

0000 0000 0000 0000 0440 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
*
0000 0000 0000 0000

Oh boy, this is all wrong. We don't have a thread ID, the signal number is 0x440 but there's no code, and the rest is NULL, so there's no reference to a context. Assuming this is what we received from the exception handler, there's something that really doesn't click: the thread ID is obtained inside the handler, so it should be present even if signal delivery was bad.
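For reference, the bytes decoded above follow the minidump format's MINIDUMP_EXCEPTION_STREAM layout. Here is a minimal sketch of it in Rust repr(C) notation, annotated with the values from the two dumps; the field names follow the public minidump format, and this is illustrative only, not the definitions Breakpad or minidump-writer actually use.

#[repr(C)]
struct MinidumpLocationDescriptor {
    data_size: u32, // good dump: 0x04d0 (1232); bogus dump: 0
    rva: u32,       // good dump: 0x2332c;       bogus dump: 0
}

#[repr(C)]
struct MinidumpException {
    exception_code: u32,              // signal number: 0xb (SIGSEGV) vs. 0x440
    exception_flags: u32,             // signal code: 0x1 (SEGV_MAPERR) vs. 0
    exception_record: u64,            // NULL in both dumps
    exception_address: u64,           // NULL in both dumps
    number_parameters: u32,
    __unused_alignment: u32,
    exception_information: [u64; 15], // the long run of zeroes in the middle
}

#[repr(C)]
struct MinidumpExceptionStream {
    thread_id: u32,  // good dump: 0x4900a; bogus dump: 0
    __alignment: u32,
    exception_record: MinidumpException,
    thread_context: MinidumpLocationDescriptor, // last eight bytes of each dump
}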

Crash Signature: [@ PollWrapper] → [@ PollWrapper] [@ sys_read]

Given the timeframe this is very, very likely to be a regression from bug 1793784.

Severity: S2 → S1
Keywords: regression
Priority: -- → P1
Regressed by: 1793784

Note: while the crash volume here might look alarming, it is not. This is a bug in how a minidump is written, so the crashes ending up here are crashes that happened somewhere else and that we're not writing out properly. Once I fix this, all those crashes will go back to their own signatures and the overall crash volume should be unchanged.

https://wiki.mozilla.org/BMO/UserGuide/BugFields#bug_severity

S1: (Catastrophic) Blocks development/testing, may impact more than 25% of users, causes data loss, likely dot release driver, and no workaround available

I'm wondering if this should maybe be an S2? :)

Flags: needinfo?(gsvelto)

Losing our ability to tell apart crashes on a major platform is pretty significant; I've dropped everything else to work on this, and I feel we'll have to do a dot release unless we want to live with a blind spot until the next major release.

Flags: needinfo?(gsvelto)

111 goes to RC next week, so it's very unlikely we'll dot release 110 for it at this point.

Alright, I'd better fix this fast.

Severity: S1 → S2
Priority: P1 → --

Alright, I can repro. Let's see what's going on.

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED

This can be reproduced with the Snap version of Firefox on Ubuntu; the .deb-packaged version is unaffected.

I can reproduce this on Debian unstable by running the Snap package's binary outside of Snap (i.e., /snap/firefox/current/usr/lib/firefox/firefox --no-remote --profile /path/to/profile/dir); STR is about:crashcontent. I tried debugging it a little; the signal handler doesn't appear to have been modified, and the siginfo and ucontext arguments look reasonable when we enter WasmTrapHandler and later when it calls the crash reporter's handler. But, as comment #2 mentions, the signs point to signal delivery not being the problem.

So I was wondering if something might be wrong with these memcpy calls to translate the context for Rust, and I have found a problem that has been affecting crash reporting for a while, but I can't explain all of the breakage yet.

The part I understand is: Rust is using struct signalfd_siginfo, but what we have is a siginfo_t, which is not the same. They're both padded out to 128 bytes, but for example si_addr and ssi_addr are at different offsets, so the Rust tools are always reporting 0 for the address. (Empirically: the kernel does zero out the padding even if there's stack garbage; it does not write the address anywhere other than si_addr.) In particular, for SIGSYS crashes we set si_addr (otherwise unused) to the syscall number for easier analysis on crash-stats; this appears to have stopped working sometime between builds 20221231080701 and 20230115094128, which matches bug 1793784. For segfaults, it's possible that this is being partly masked by the crash processor changes to disassemble the crashing instruction and calculate the EA if the reported address is null (bug 1493342).

However, ssi_{signo,errno,code} do match their equivalents in siginfo_t, so that doesn't explain all of it. The other thing I noticed is, if C++ and Rust disagree on the size of struct _libc_fpstate (C) / crash_context::fpregset_t (Rust), that would mess up the offset for the siginfo and pid/tid fields. I don't know why that would happen for some builds and not others, but given full debug info for a broken build it should be easy to check for that mismatch.
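A quick way to check the field mismatch described above (a sketch, assuming x86-64 Linux, a recent Rust toolchain with std::mem::offset_of!, and the libc crate; not code from the tree):

use std::mem::{offset_of, size_of};

fn main() {
    // Both structures are padded out to 128 bytes, as noted above.
    assert_eq!(size_of::<libc::siginfo_t>(), 128);
    assert_eq!(size_of::<libc::signalfd_siginfo>(), 128);

    // signo/errno/code happen to line up at offsets 0/4/8, which is why
    // those fields still look plausible in the broken minidumps...
    assert_eq!(offset_of!(libc::siginfo_t, si_signo), offset_of!(libc::signalfd_siginfo, ssi_signo));
    assert_eq!(offset_of!(libc::siginfo_t, si_errno), offset_of!(libc::signalfd_siginfo, ssi_errno));
    assert_eq!(offset_of!(libc::siginfo_t, si_code), offset_of!(libc::signalfd_siginfo, ssi_code));

    // ...but the fault address does not: siginfo_t keeps si_addr in its
    // union at offset 16, while signalfd_siginfo stores ssi_addr much
    // later, so a reader expecting signalfd_siginfo finds zeroed padding
    // there instead of the address.
    println!("ssi_addr offset: {}", offset_of!(libc::signalfd_siginfo, ssi_addr));
}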

Thanks for the detailed investigation, Jed. I suspected myself that the issue stemmed from the code that bridges the two different contexts, since it's the only significant change between the old minidump_writer_linux crate and the new minidump-writer one. I'll try to replace my hack with something more robust that is less likely to cause incompatibilities between the Rust and the C sides. One explanation for what we're seeing is that the Snap version of Firefox is built with a different toolchain than the non-Snap version, and that's causing the consistent mishandling of the crash context.

Depends on: 1820768

This adds an explicit conversion between the siginfo_t structure provided by
Breakpad and the signalfd_siginfo one required by minidump-writer.
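Very roughly, the idea is to stop blitting one structure over the other and instead translate the fields by name. A minimal sketch of such a conversion (a hypothetical helper using the libc crate's types; the actual patch targets the crash-context / minidump-writer types and covers more fields):

// Sketch of an explicit siginfo_t -> signalfd_siginfo translation.
unsafe fn siginfo_to_signalfd_siginfo(si: &libc::siginfo_t) -> libc::signalfd_siginfo {
    // SAFETY: signalfd_siginfo is a plain struct of integers, so a zeroed
    // value is a valid starting point; si_addr() only reads the union field.
    let mut out: libc::signalfd_siginfo = unsafe { std::mem::zeroed() };
    out.ssi_signo = si.si_signo as u32;
    out.ssi_errno = si.si_errno;
    out.ssi_code = si.si_code;
    // si_addr lives in the union, so it's only meaningful for faults
    // (SIGSEGV, SIGBUS, ...) and for the SIGSYS syscall-number trick
    // mentioned above; copying it by name puts it where the Rust side looks.
    out.ssi_addr = unsafe { si.si_addr() } as u64;
    out
}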

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a6cfb7f7b3c9 Correctly convert signal information received by a child process when generating a crash report; r=gerard-majax

Backed out 2 changesets (Bug 1819969, Bug 1820768) for bc failures on browser_content_crashes.js.
Backout link
Push with failures <--> bc2
Failure Log

Flags: needinfo?(gsvelto)

I've updated the patches and will wait until the end of the soft freeze, then request uplift.

Flags: needinfo?(gsvelto)

I did another round of tests; everything seems to be fine, so I'm landing again.

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bd1057c15030 Correctly convert signal information received by a child process when generating a crash report; r=gerard-majax

Set release status flags based on info from the regressing bug 1793784

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 113 Branch

The patch landed in nightly and beta is affected.
:gsvelto, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox112 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(gsvelto)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #11)

The other thing I noticed is, if C++ and Rust disagree on the size of struct _libc_fpstate (C) / crash_context::fpregset_t (Rust), that would mess up the offset for the siginfo and pid/tid fields. I don't know why that would happen for some builds and not others, but given full debug info for a broken build it should be easy to check for that mismatch.

Almost but not quite: they disagree on the size of ucontext_t (libc headers vs. the definition in the crash-context crate), because glibc added a field to support Intel CET shadow stacks in 2.28. That's not relevant to the ucontext that's written by the kernel for a signal handler — it ends after uc_sigmask, and FP state is stored elsewhere on the stack — but the size difference would throw off the offsets for all of the other fields. (This also means that the __private: [u8; 512] field in crash_context::ucontext_t is not useful to us and we should not try to interpret those bits.)

So, gsvelto's patches should fix this (but we should be careful about ucontext_t in the future).
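For reference, the glibc x86-64 layout being described, sketched in Rust-style notation (the real definition is in glibc's <sys/ucontext.h>; the libc-crate types are only used here for readability):

// The kernel-built signal frame only covers the fields up to and including
// uc_sigmask; __fpregs_mem is scratch space that uc_mcontext.fpregs points
// into, and __ssp is the extension glibc 2.28 added for Intel CET shadow
// stacks.
#[repr(C)]
struct GlibcUcontextSketch {
    uc_flags: libc::c_ulong,
    uc_link: *mut GlibcUcontextSketch,
    uc_stack: libc::stack_t,
    uc_mcontext: libc::mcontext_t,
    uc_sigmask: libc::sigset_t,
    __fpregs_mem: [u8; 512], // struct _libc_fpstate
    __ssp: [u64; 4],         // appended in glibc 2.28
}
// A definition that predates __ssp is 32 bytes smaller, which is enough to
// shift the offsets of everything laid out after the ucontext in a shared
// crash-context structure.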

Comment on attachment 9322378 [details]
Bug 1819969 - Correctly convert signal information received by a child process when generating a crash report; r=gerard-majax

Beta/Release Uplift Approval Request

  • User impact if declined: Not user visible, but crashes affecting child processes cannot be analyzed; they all lump together under this signature
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The code we changed is covered by tests and it's only triggered when a crash happens, thus affecting a fairly small number of users
  • String changes made/needed:
  • Is Android affected?: No
Flags: needinfo?(gsvelto)
Attachment #9322378 - Flags: approval-mozilla-beta?

:gsvelto, does the test in bug 1820768 (along with its regressions) need to be uplifted as well?

Flags: needinfo?(gsvelto)

The test covers part of this functionality but only runs in nightly builds, so there's no point in uplifting it.

Flags: needinfo?(gsvelto)

Comment on attachment 9322378 [details]
Bug 1819969 - Correctly convert signal information received by a child process when generating a crash report; r=gerard-majax

Approved for 112.0b3

Attachment #9322378 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Marking as wontfix for 111, but reach out if you have any concerns or would like this for 111.

