Closed Bug 1819969 Opened 2 years ago Closed 2 years ago

Crash in [@ PollWrapper] caused by corrupt/wrong signal information

Categories

(Toolkit :: Crash Reporting, defect)

Unspecified
Linux
defect

Tracking


RESOLVED FIXED
113 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox110 --- wontfix
firefox111 + wontfix
firefox112 + fixed
firefox113 --- fixed

People

(Reporter: gsvelto, Assigned: gsvelto)

References

(Regression)

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/28bac7d7-8cd7-4b3a-a699-42e1b0230302

Reason: unknown 0x00000440 / 0x00000000

No crashing thread identified; using thread 0.

Top 10 frames of crashing thread:

0  libc.so.6  __GI___poll  /build/glibc-SzIz7B/glibc-2.31/sysdeps/unix/sysv/linux/poll.c:29
1  libxul.so  PollWrapper  /build/firefox/parts/firefox/build/widget/gtk/nsAppShell.cpp:78
2  libglib-2.0.so.0  g_main_context_iterate.isra.0  /build/gnome-3-38-2004-sdk/parts/glib/src/glib/gmain.c:4042
3  libglib-2.0.so.0  g_main_context_iteration  /build/gnome-3-38-2004-sdk/parts/glib/src/glib/gmain.c:4108
4  libxul.so  nsAppShell::ProcessNextNativeEvent  /build/firefox/parts/firefox/build/widget/gtk/nsAppShell.cpp:426
4  libxul.so  nsBaseAppShell::DoProcessNextNativeEvent  /build/firefox/parts/firefox/build/widget/nsBaseAppShell.cpp:131
4  libxul.so  nsBaseAppShell::OnProcessNextEvent  /build/firefox/parts/firefox/build/widget/nsBaseAppShell.cpp:267
4  libxul.so  {virtual override thunk}  /build/firefox/parts/firefox/build/widget/nsBaseAppShell.cpp
4  libxul.so  nsThread::ProcessNextEvent  /build/firefox/parts/firefox/build/xpcom/threads/nsThread.cpp:1111
4  libxul.so  NS_ProcessNextEvent  /build/firefox/parts/firefox/build/xpcom/threads/nsThreadUtils.cpp:473

The crashes under this signature are all borked: in particular, the crash address and the crash reason make no sense. However, they all have something in common that points to bad crash generation. Take this crash for example:

  • The raw crash reason is MOZ_DIAGNOSTIC_ASSERT(!IsUsed()) which points to this line
  • Indeed, thread 25 has a stack that contains that line... with the frames of the exception handler on top of it. The frames from the exception handler shouldn't be visible in a crash, as the stack walker would start walking the stack from the point of crash (which we find via siginfo.si_addr); this points to siginfo.si_addr being wrong.
  • siginfo.si_signo is clearly wrong... but consistently so; there are only a handful of variations

The above makes me think that there's an issue in how we handle the crashed thread context:

  • The thread that crashed is stuck here
  • Which means it sent the context with the signal info to the main process here. The context is this blob.
  • The context is passed from here
  • The context is filled here so this is one of the places where things could go wrong
  • Things could also go wrong on the writing side here where we have a very unpleasant hack because of the discrepancy between Breakpad's context and minidump-writer's context

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

:gsvelto, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(gsvelto)
Keywords: topcrash
Severity: -- → S2
Flags: needinfo?(gsvelto)

I opened a minidump and did a few tests to narrow down what's happening here. These are the raw contents of the MINIDUMP_EXCEPTION structure in a regular crash:

900a 0004 0000 0000 000b 0000 0001 0000
0000 0000 0000 0000 0000 0000 0000 0000
*
04d0 0000 332c 0002

We've got the thread ID (0x4900a), then the signal number (0xb, that's SIGSEGV), the code (0x1, SEGV_MAPERR), the exception record and address, which are both NULL (expected here), then all emptiness until the thread context size (1232) and RVA (0x2332c). So far so good. Let's do the same with a bogus crash:

0000 0000 0000 0000 0440 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
*
0000 0000 0000 0000

Oh boy, this is all wrong. We don't have a thread ID, the signal number is 0x440 but there's no code, and the rest is NULL, so there's no reference to a context. Assuming this is what we received from the exception handler, there's something that really doesn't click: the thread ID is obtained inside the handler, so it should be present even if signal delivery was bad.
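For reference, the bytes decoded above follow the minidump format's MINIDUMP_EXCEPTION_STREAM layout. Here is a minimal sketch of it in Rust repr(C) notation, annotated with the values from the two dumps; the field names follow the public minidump format, and this is illustrative only, not the definitions Breakpad or minidump-writer actually use.

#[repr(C)]
struct MinidumpLocationDescriptor {
    data_size: u32, // good dump: 0x04d0 (1232); bogus dump: 0
    rva: u32,       // good dump: 0x2332c;       bogus dump: 0
}

#[repr(C)]
struct MinidumpException {
    exception_code: u32,              // signal number: 0xb (SIGSEGV) vs. 0x440
    exception_flags: u32,             // signal code: 0x1 (SEGV_MAPERR) vs. 0
    exception_record: u64,            // NULL in both dumps
    exception_address: u64,           // NULL in both dumps
    number_parameters: u32,
    __unused_alignment: u32,
    exception_information: [u64; 15], // the long run of zeroes in the middle
}

#[repr(C)]
struct MinidumpExceptionStream {
    thread_id: u32,  // good dump: 0x4900a; bogus dump: 0
    __alignment: u32,
    exception_record: MinidumpException,
    thread_context: MinidumpLocationDescriptor, // last eight bytes of each dump
}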

Crash Signature: [@ PollWrapper] → [@ PollWrapper] [@ sys_read]

Given the timeframe this is very, very likely to be a regression from bug 1793784.

Severity: S2 → S1
Keywords: regression
Priority: -- → P1
Regressed by: 1793784

Note: while the crash volume here might look alarming, it is not. This is a bug in how a minidump is written, so the crashes ending up here are crashes that happened somewhere else and that we're not writing out properly. Once I fix this, all those crashes will go back to their own signatures and the overall crash volume should be unchanged.

https://wiki.mozilla.org/BMO/UserGuide/BugFields#bug_severity

S1: (Catastrophic) Blocks development/testing, may impact more than 25% of users, causes data loss, likely dot release driver, and no workaround available

I'm wondering if this should maybe be an S2? :)

Flags: needinfo?(gsvelto)

Losing our ability to tell apart crashes on a major platform is pretty significant; I've dropped everything else to work on this, and I feel we'll have to do a dot release unless we want to live with a blind spot until the next major release.

Flags: needinfo?(gsvelto)

111 goes to RC next week, so it's very unlikely we'll dot release 110 for it at this point.

Alright, I'd better fix this fast.

Severity: S1 → S2
Priority: P1 → --

Alright, I can repro. Let's see what's going on.

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED

This can be reproduced with the Snap version of Firefox on Ubuntu; the .deb-packaged version is unaffected.

I can reproduce this on Debian unstable by running the Snap package's binary outside of Snap (i.e., /snap/firefox/current/usr/lib/firefox/firefox --no-remote --profile /path/to/profile/dir); STR is about:crashcontent. I tried debugging it a little; the signal handler doesn't appear to have been modified, and the siginfo and ucontext arguments look reasonable when we enter WasmTrapHandler and later when it calls the crash reporter's handler. But, as comment #2 mentions, the signs point to signal delivery not being the problem.

So I was wondering if something might be wrong with these memcpy calls to translate the context for Rust, and I have found a problem that has been affecting crash reporting for a while, but I can't explain all of the breakage yet.

The part I understand is: Rust is using struct signalfd_siginfo, but what we have is a siginfo_t, which is not the same. They're both padded out to 128 bytes, but for example si_addr and ssi_addr are at different offsets, so the Rust tools are always reporting 0 for the address. (Empirically: the kernel does zero out the padding even if there's stack garbage; it does not write the address anywhere other than si_addr.) In particular, for SIGSYS crashes we set si_addr (otherwise unused) to the syscall number for easier analysis on crash-stats; this appears to have stopped working sometime between builds 20221231080701 and 20230115094128, which matches bug 1793784. For segfaults, it's possible that this is being partly masked by the crash processor changes to disassemble the crashing instruction and calculate the EA if the reported address is null (bug 1493342).

However, ssi_{signo,errno,code} do match their equivalents in siginfo_t, so that doesn't explain all of it. The other thing I noticed is, if C++ and Rust disagree on the size of struct _libc_fpstate (C) / crash_context::fpregset_t (Rust), that would mess up the offset for the siginfo and pid/tid fields. I don't know why that would happen for some builds and not others, but given full debug info for a broken build it should be easy to check for that mismatch.
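A quick way to check the field mismatch described above (a sketch, assuming x86-64 Linux, a recent Rust toolchain with std::mem::offset_of!, and the libc crate; not code from the tree):

use std::mem::{offset_of, size_of};

fn main() {
    // Both structures are padded out to 128 bytes, as noted above.
    assert_eq!(size_of::<libc::siginfo_t>(), 128);
    assert_eq!(size_of::<libc::signalfd_siginfo>(), 128);

    // signo/errno/code happen to line up at offsets 0/4/8, which is why
    // those fields still look plausible in the broken minidumps...
    assert_eq!(offset_of!(libc::siginfo_t, si_signo), offset_of!(libc::signalfd_siginfo, ssi_signo));
    assert_eq!(offset_of!(libc::siginfo_t, si_errno), offset_of!(libc::signalfd_siginfo, ssi_errno));
    assert_eq!(offset_of!(libc::siginfo_t, si_code), offset_of!(libc::signalfd_siginfo, ssi_code));

    // ...but the fault address does not: siginfo_t keeps si_addr in its
    // union at offset 16, while signalfd_siginfo stores ssi_addr much
    // later, so a reader expecting signalfd_siginfo finds zeroed padding
    // there instead of the address.
    println!("ssi_addr offset: {}", offset_of!(libc::signalfd_siginfo, ssi_addr));
}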

Thanks for the detailed investigation, Jed. I suspected myself that the issue stemmed from the code that bridges the two different contexts, since it's the only significant change between the old minidump_writer_linux crate and the new minidump-writer one. I'll try to replace my hack with something more robust that is less likely to cause incompatibilities between the Rust and the C sides. One explanation for what we're seeing is that the Snap version of Firefox is built with a different toolchain than the non-Snap version, and that's causing the consistent mishandling of the crash context.

Depends on: 1820768

This adds an explicit conversion between the siginfo_t structure provided by
Breakpad and the signalfd_siginfo one required by minidump-writer.
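Very roughly, the idea is to stop blitting one structure over the other and instead translate the fields by name. A minimal sketch of such a conversion (a hypothetical helper using the libc crate's types; the actual patch targets the crash-context / minidump-writer types and covers more fields):

// Sketch of an explicit siginfo_t -> signalfd_siginfo translation.
unsafe fn siginfo_to_signalfd_siginfo(si: &libc::siginfo_t) -> libc::signalfd_siginfo {
    // SAFETY: signalfd_siginfo is a plain struct of integers, so a zeroed
    // value is a valid starting point; si_addr() only reads the union field.
    let mut out: libc::signalfd_siginfo = unsafe { std::mem::zeroed() };
    out.ssi_signo = si.si_signo as u32;
    out.ssi_errno = si.si_errno;
    out.ssi_code = si.si_code;
    // si_addr lives in the union, so it's only meaningful for faults
    // (SIGSEGV, SIGBUS, ...) and for the SIGSYS syscall-number trick
    // mentioned above; copying it by name puts it where the Rust side looks.
    out.ssi_addr = unsafe { si.si_addr() } as u64;
    out
}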

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a6cfb7f7b3c9 Correctly convert signal information received by a child process when generating a crash report; r=gerard-majax

Backed out 2 changesets (Bug 1819969, Bug 1820768) for bc failures on browser_content_crashes.js.
Backout link
Push with failures <--> bc2
Failure Log

Flags: needinfo?(gsvelto)

I've updated the patches and will wait until the end of the soft freeze, then request uplift.

Flags: needinfo?(gsvelto)

I did another round of tests; everything seems to be fine, so I'm landing again.

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bd1057c15030 Correctly convert signal information received by a child process when generating a crash report; r=gerard-majax

Set release status flags based on info from the regressing bug 1793784

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 113 Branch

The patch landed in nightly and beta is affected.
:gsvelto, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox112 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(gsvelto)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #11)

The other thing I noticed is, if C++ and Rust disagree on the size of struct _libc_fpstate (C) / crash_context::fpregset_t (Rust), that would mess up the offset for the siginfo and pid/tid fields. I don't know why that would happen for some builds and not others, but given full debug info for a broken build it should be easy to check for that mismatch.

Almost but not quite: they disagree on the size of ucontext_t (libc headers vs. the definition in the crash-context crate), because glibc added a field to support Intel CET shadow stacks in 2.28. That's not relevant to the ucontext that's written by the kernel for a signal handler — it ends after uc_sigmask, and FP state is stored elsewhere on the stack — but the size difference would throw off the offsets for all of the other fields. (This also means that the __private: [u8; 512] field in crash_context::ucontext_t is not useful to us and we should not try to interpret those bits.)

So, gsvelto's patches should fix this (but we should be careful about ucontext_t in the future).
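For reference, the glibc x86-64 layout being described, sketched in Rust-style notation (the real definition is in glibc's <sys/ucontext.h>; the libc-crate types are only used here for readability):

// The kernel-built signal frame only covers the fields up to and including
// uc_sigmask; __fpregs_mem is scratch space that uc_mcontext.fpregs points
// into, and __ssp is the extension glibc 2.28 added for Intel CET shadow
// stacks.
#[repr(C)]
struct GlibcUcontextSketch {
    uc_flags: libc::c_ulong,
    uc_link: *mut GlibcUcontextSketch,
    uc_stack: libc::stack_t,
    uc_mcontext: libc::mcontext_t,
    uc_sigmask: libc::sigset_t,
    __fpregs_mem: [u8; 512], // struct _libc_fpstate
    __ssp: [u64; 4],         // appended in glibc 2.28
}
// A definition that predates __ssp is 32 bytes smaller, which is enough to
// shift the offsets of everything laid out after the ucontext in a shared
// crash-context structure.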

Comment on attachment 9322378 [details]
Bug 1819969 - Correctly convert signal information received by a child process when generating a crash report; r=gerard-majax

Beta/Release Uplift Approval Request

  • User impact if declined: Not user visible, but crashes affecting child processes cannot be analyzed; they all lump together under this signature
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The code we changed is covered by tests and it's only triggered when a crash happens, thus affecting a fairly small number of users
  • String changes made/needed:
  • Is Android affected?: No
Flags: needinfo?(gsvelto)
Attachment #9322378 - Flags: approval-mozilla-beta?

:gsvelto, does the test in bug 1820768 (along with its regressions) need to be uplifted as well?

Flags: needinfo?(gsvelto)

The test covers part of this functionality but only runs in nightly builds, so there's no point in uplifting it.

Flags: needinfo?(gsvelto)

Comment on attachment 9322378 [details]
Bug 1819969 - Correctly convert signal information received by a child process when generating a crash report; r=gerard-majax

Approved for 112.0b3

Attachment #9322378 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Marking as wontfix for 111, but reach out if you have any concerns or would like this for 111.

