Bug 1800919 (Open)

Sub-optimal signatures for crashes in memfd:pipewire-memfd

Categories: Socorro :: Signature, task

Tracking: Not tracked

People: Reporter: bwc, Unassigned

Example signatures:

  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x7
  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x1007
  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x2007
  • libxul.so@0x3665e29 | memfd:pipewire-memfd (deleted)@0x7
  • libxul.so@0x366da09 | memfd:pipewire-memfd (deleted)@0x7
  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x107
  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x1c7
  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x2047
  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x347

The first frame I can understand, but I've never seen anything like that second frame before.

Gabriele: What does (deleted)@0xNNN mean in a symbol? Should we add a signature generation normalization pass to coalesce those into something like memfd:pipewire-memfd (deleted)@0xNNN, or drop the hex altogether? I don't know which information is important here for distinguishing between crash signatures.
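To make the idea concrete, here's a rough sketch of the kind of normalization I have in mind (purely illustrative, not Socorro code; the real rule would live in Socorro's signature generation):

  #include <iostream>
  #include <regex>
  #include <string>

  int main() {
    const std::string frame =
        "libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x1007";

    // Collapse the per-crash hex offset after the deleted memfd mapping so
    // that all of the variants listed above fold into a single signature.
    const std::regex offset(R"((memfd:pipewire-memfd \(deleted\))@0x[0-9a-fA-F]+)");
    std::cout << std::regex_replace(frame, offset, "$1@0xNNN") << "\n";
    // Prints: libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0xNNN
  }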

Flags: needinfo?(gsvelto)

There are a few things to unpack here:

  • We don't have symbols for libxul.so in the first frame, which is odd, but since the crash comes from Ubuntu this is potentially a snap build and we've had trouble scraping symbols from those, so I'm needinfo-ing Alexandre about that
  • Because we don't have symbols for libxul.so, the stackwalker doesn't know how to find the next frame and scans the stack instead (a rough sketch of the scanning heuristic is below). While doing so it lands on a mapping that has been deleted: that is, it finds what looks like an instruction pointer pointing into an area of the memory map where a file was mapped, but the file underlying the mapping has been deleted from the filesystem. In this case it was an anonymous file created by pipewire via memfd_create()
  • This frame is clearly wrong; I don't think we were executing code in that area. The stack walker should skip it and continue scanning, though the real solution is having symbols in the first place so we don't have to resort to stack scanning at all

To sum it up: we need symbols, but I'll also file a bug against the stackwalker to improve its behavior under these conditions.
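Roughly speaking, the scanning heuristic works like this (a simplified sketch of the general idea, not the actual rust-minidump code):

  #include <cstdint>
  #include <string>
  #include <vector>

  // A simplified view of an entry in the crashed process' memory map.
  struct Mapping {
    std::uint64_t start, end;
    bool executable;
    bool deleted;      // the backing file was unlinked, e.g. pipewire's memfd
    std::string name;  // "libxul.so", "memfd:pipewire-memfd (deleted)", ...
  };

  // Walk the raw words on the stack and return the first one that looks like
  // a return address, i.e. one that points into an executable mapping. A
  // deleted anonymous mapping can pass this test even though no code ever ran
  // there, which is how the bogus second frame in the signatures above shows up.
  const Mapping* ScanForReturnAddress(const std::vector<std::uint64_t>& stack_words,
                                      const std::vector<Mapping>& mappings) {
    for (std::uint64_t word : stack_words) {
      for (const Mapping& m : mappings) {
        if (m.executable && word >= m.start && word < m.end) {
          return &m;  // candidate frame, even when m.deleted is true
        }
      }
    }
    return nullptr;
  }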

Flags: needinfo?(gsvelto) → needinfo?(lissyx+mozillians)
Flags: needinfo?(lissyx+mozillians)

I can confirm that the build ID matching the crash above is from the 106.0.5-1 build, which one can download from https://launchpadlibrarian.net/632047151/firefox_106.0.5-1_amd64.debug and https://launchpadlibrarian.net/632047166/firefox_106.0.5-1_amd64.snap

Gabriele, do we have a way to check from the debug symbols file that the signature matches?

$ ll libxul.so/0C36CC917C9C83B272B4110218A758D50/libxul.so.sym 
-rw-r--r-- 1 alex alex 660M janv.  1  2010 libxul.so/0C36CC917C9C83B272B4110218A758D50/libxul.so.sym

Also, since the wget of the packages mentions those URLs in https://firefoxci.taskcluster-artifacts.net/LTLxq9yUTTyPw4XcGQe4GQ/0/public/build/wget_packages.log, we know that symbols should have been produced by https://firefoxci.taskcluster-artifacts.net/LTLxq9yUTTyPw4XcGQe4GQ/0/public/logs/live_backing.log, but:

Flags: needinfo?(gsvelto)

Yes, the debug ID for libxul.so in those crashes is 0C36CC917C9C83B272B4110218A758D50.

Flags: needinfo?(gsvelto)

OK, I think I know:

$ grep -G "firefox_106.0.5-1_amd64.snap,[0-9]" SHA256SUMS
firefox_106.0b5-1_amd64.snap,188637184

This should not match. I'll debug that deeper.
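For reference, the unescaped dots in the pattern alone would explain the match: '.' matches any character, so "106.0.5" also matches "106.0b5". A minimal check of the same pattern (illustrative only, not the scraper's code):

  #include <cassert>
  #include <regex>
  #include <string>

  int main() {
    // The SHA256SUMS entry that unexpectedly matched above.
    const std::string line = "firefox_106.0b5-1_amd64.snap,188637184";

    // With unescaped dots, the pattern "106.0.5" also matches "106.0b5".
    assert(std::regex_search(line,
                             std::regex("firefox_106.0.5-1_amd64.snap,[0-9]")));

    // Escaping the dots (or using a fixed-string match, grep -F) avoids it.
    assert(!std::regex_search(line,
                              std::regex(R"(firefox_106\.0\.5-1_amd64\.snap,[0-9])")));
  }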

Depends on: 1801609

Alexandre has fixed the issue in the Ubuntu Snap symbol scrapers and uploaded the missing symbols, we'll reprocess all the affected crashes shortly and with some luck the bad signatures will all go away.

(In reply to Gabriele Svelto [:gsvelto] from comment #7)

> Alexandre has fixed the issue in the Ubuntu Snap symbol scrapers and uploaded the missing symbols, we'll reprocess all the affected crashes shortly and with some luck the bad signatures will all go away.

A first batch of 106.0.[1-5] symbols was done, and I've shared with Gabriele the remainder for 107.0.[1-2], because we were also missing those.

I've also pulled the GNOME SDK ones, just in case.

See Also: → 1790496

Where are we with this bug? I'm still seeing crash reports with the problematic signature. Are there other sets of symbols we're missing?

For example, I just reprocessed this crash report and it has the same signature:

bp-7bcca737-d4aa-4288-be39-ff6360221113

It's got libxul.so/0C36CC917C9C83B272B4110218A758D50.

Flags: needinfo?(gsvelto)

It seems that the issue morphed from a "missing symbols" issue into a stack walker one. Take this crash for example. The crashing address is 0x36e8da9 within libxul.so. If I look at the symbols, we have this line:

FUNC 36e8900 4a9 0 webrtc::BaseCapturerPipeWire::HandleBuffer(pw_buffer*)

0x36e8900 + 0x4a9 is exactly 0x36e8da9, so that's where we should be. We do have unwinding information for that function too; however, the stackwalker falls back to stack scanning and ultimately considers the symbol missing. Given that the address is exactly one byte past the end of the function, I wonder if we've got an off-by-one issue in the stackwalker; we have a bunch of heuristics around that and I remember we had to make some compromises there.
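Spelling the arithmetic out (assuming the usual half-open [start, start + size) interpretation of FUNC records):

  #include <cassert>
  #include <cstdint>

  int main() {
    // From the FUNC record above: start address and size of HandleBuffer().
    const std::uint64_t func_start = 0x36e8900;
    const std::uint64_t func_size  = 0x4a9;
    const std::uint64_t crash_addr = 0x36e8da9;

    // The crashing address is exactly the first byte past the half-open range
    // covered by the symbol, so a strict range check treats it as not covered.
    assert(crash_addr == func_start + func_size);
    assert(!(crash_addr >= func_start && crash_addr < func_start + func_size));
  }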

Flags: needinfo?(gsvelto)

I filed issue #780 in the rust-minidump repo to address this (though it means deploying a hacky heuristic to work around it).

Stack walking issue aside, I've noticed that there are still several dependencies of the snap packages whose symbols are missing. For example:

These aren't a big deal, as they rarely appear in stack traces, but I still wanted to make a note of it so we can double-check later.

A quick update from the discussion in the rust-minidump issue: this crash is an odd beast. The crashing address is off by one with respect to the function where it's supposed to be happening, but the signal that caused it originated in the Linux kernel, not in userspace. This might explain the odd address, so I'm trying to narrow down my hack to target only this very specific case.

(In reply to Gabriele Svelto [:gsvelto] from comment #10)

> It seems that the issue morphed from a "missing symbols" issue into a stack walker one. Take this crash for example. The crashing address is 0x36e8da9 within libxul.so. If I look at the symbols, we have this line:
>
> FUNC 36e8900 4a9 0 webrtc::BaseCapturerPipeWire::HandleBuffer(pw_buffer*)
>
> 0x36e8900 + 0x4a9 is exactly 0x36e8da9, so that's where we should be. We do have unwinding information for that function too; however, the stackwalker falls back to stack scanning and ultimately considers the symbol missing. Given that the address is exactly one byte past the end of the function

The last 5 bytes of the function, which I suspect are the last instruction: e8 b7 cf fa ff

That's a call to 0x3695d60. My first guess is that the callee was declared noreturn but returned anyway; I've heard of that happening and causing problems like this. The symbol file has:

FUNC 3695d60 5d4 0 rtc::webrtc_logging_impl::Log(rtc::webrtc_logging_impl::LogArgType const*, ...)

But that doesn't appear to be declared noreturn. However, the call might be inlined from a noreturn function; poking around in searchfox finds a few functions named FatalLog or WriteFatalLog with an RTC_NORETURN before the decl. (This looks like it's derived from Chromium logging, where an assertion failure / deliberate crash involves a log message with “fatal” level.)
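To illustrate the hypothesis with a toy example (this is not libwebrtc code, just a sketch of how a noreturn callee changes code generation): when the compiler believes a call cannot return, it doesn't need to emit an epilogue or anything else after it, so the bytes that follow the call can be pure alignment padding.

  #include <cstdio>
  #include <cstdlib>

  // Hypothetical stand-in for something like WriteFatalLog().
  [[noreturn]] void FatalLog(const char* msg) {
    std::fprintf(stderr, "%s\n", msg);
    std::abort();
  }

  void Check(int value) {
    if (value >= 0) {
      return;
    }
    FatalLog("negative value");
    // The compiler assumes FatalLog() never returns, so no epilogue is
    // generated after this call; the bytes that follow can be pure alignment
    // padding. If that assumption is ever violated at runtime, execution
    // falls off the end of the function into whatever those bytes decode to.
  }

  int main() {
    Check(1);
  }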

Edited to add: the GitHub issue linked in comment #11 also found more or less the same thing.

After going through the discussion on GitHub and looking at the crashes once more I think we're dealing with a compiler bug. The preprocessed code that's crashing is the following:

  if (spaBuffer->datas[0].chunk->size == 0) {
    !rtc::LogMessage::IsNoop<::rtc::LS_ERROR>() && ::rtc::webrtc_logging_impl::LogCall() & ::rtc::webrtc_logging_impl::LogStreamer<>() << ::rtc::webrtc_logging_impl::LogMetadata("/home/gsvelto/projects/mozilla-central/third_party/libwebrtc/modules/desktop_capture/linux/wayland/moz_base_capturer_pipewire.cc", 474, ::rtc::LS_ERROR) << "Failed to get video stream: Zero size.";
    return;
  }

The crash happens when we return from a function call in the second line. The compiled code looks like this:

 36e8da4:       e8 b7 cf fa ff          call   3695d60 <_ZNKSt4lessISt17basic_string_viewIcSt11char_traitsIcEEEclERKS3_S6_@@xul108+0x2f90>
 36e8da9:       00 00                   add    %al,(%rax)

We're deep within inlined code that's manipulating the strings for the log message. Here's what's going on: the two zero bytes at offset 36e8da9 are padding which the compiler inserted for some reason, possibly because it thought that the call would never return, and this genuinely looks like a bug. However, since x86 is x86, 00 00 is a valid instruction (add %al,(%rax)), so it gets executed and accesses the memory pointed to by rax. Since rax contains a non-canonical address, this triggers a general protection fault, which is reported as SIGSEGV / SI_KERNEL by our code (since the signal was raised by the kernel).
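For reference, the SI_KERNEL distinction can be observed from a plain SIGSEGV handler; a minimal sketch (not our crash reporter's actual code, and fprintf in a signal handler is for illustration only):

  #include <signal.h>
  #include <cstdio>
  #include <cstdlib>

  // A kernel-raised fault such as a general protection fault arrives as
  // SIGSEGV with si_code == SI_KERNEL and no usable fault address, unlike an
  // ordinary bad dereference which reports SEGV_MAPERR or SEGV_ACCERR.
  static void SegvHandler(int, siginfo_t* info, void*) {
    if (info->si_code == SI_KERNEL) {
      std::fprintf(stderr, "SIGSEGV / SI_KERNEL (raised by the kernel)\n");
    } else {
      std::fprintf(stderr, "SIGSEGV, si_code=%d, fault address %p\n",
                   info->si_code, info->si_addr);
    }
    std::_Exit(1);
  }

  int main() {
    struct sigaction sa = {};
    sa.sa_sigaction = SegvHandler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);

    // An ordinary invalid dereference: reported as SEGV_MAPERR, not SI_KERNEL.
    *reinterpret_cast<volatile int*>(0x8) = 0;
  }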

BTW I don't think we're hitting a DCHECK(), because there don't appear to be any calls to it in the affected code (though I may have missed some). Even if we did hit one, we'd enter WriteFatalLog(), which calls abort(), so it shouldn't return.

Also, this is a very long way of saying that the signature here cannot be improved: it accurately reflects what's going on, which is something that should never happen.

Just one last note: we are compiling with #define RTC_DCHECK_IS_ON 1 and we do hit WriteFatalLog() from time to time, but it works as expected. See this crash for example, which is from one of the affected builds. It's crashing where it's supposed to, so if a miscompilation is happening, either it's preventing WriteFatalLog() from being called or it's happening somewhere else entirely (and I think it's the latter, given that the last call before the crash is to some std::less<> template and not something within WebRTC's code).

(In reply to Gabriele Svelto [:gsvelto] from comment #17)

> (and I think it's the latter, given that the last call before the crash is to some std::less<> template and not something within WebRTC's code).

I don't think that's correct; that's probably just the name of the last public symbol (.dynsym rather than .symtab) before the address, due to disassembling a stripped binary. 0x2f90 would be a fairly large offset for a string comparison function, and see also the line of the breakpad symbols I quoted in comment #14.

But I do tend to think this is a compiler bug, because I don't see how that call could legitimately be noreturn. From the list of inlines (from the GitHub thread), nothing in there is declared noreturn, and nothing “interesting” is going on: they're taking a sequence of operator<< calls, collecting the argument types into a template parameter pack, and eventually calling webrtc_logging_impl::Log (with an array of type tags and a varargs list of the values, which it then unpacks into a sequence of operator<< calls on another type, which seems weird but not wrong).

As for the RTC_DCHECK_NOTREACHED calls, those could happen if the LogArgType sequence were ill-formed, but we can see what those values are from the C++ types in the symbol name, and they're fine (kLogMetadata, kCharP, kEnd); we'd need something that was statically known to be bad for that to make any sense, and it would somehow have to take a different path at runtime.
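To make that description concrete, here's a much-simplified sketch of the pattern (hypothetical names, not the actual rtc_base logging code): the streamer collects the argument types into type tags and hands Log() a tag array plus the values as varargs.

  #include <cstdarg>
  #include <cstdio>

  enum class LogArgType { kEnd, kInt, kCharP };

  // Varargs sink: walks the tag array and pulls each value off the va_list,
  // then streams it out (here just printf for brevity).
  void Log(const LogArgType* tags, ...) {
    va_list args;
    va_start(args, tags);
    for (; *tags != LogArgType::kEnd; ++tags) {
      switch (*tags) {
        case LogArgType::kInt:
          std::printf("%d", va_arg(args, int));
          break;
        case LogArgType::kCharP:
          std::printf("%s", va_arg(args, const char*));
          break;
        default:
          break;  // an ill-formed tag sequence is where the NOTREACHED path fires
      }
    }
    va_end(args);
    std::printf("\n");
  }

  // Maps a C++ argument type to its runtime tag.
  constexpr LogArgType TagFor(int) { return LogArgType::kInt; }
  constexpr LogArgType TagFor(const char*) { return LogArgType::kCharP; }

  // Collects the argument types into a parameter pack and forwards everything
  // to the varargs Log() with a matching array of type tags.
  template <typename... Ts>
  void LogStreamer(Ts... values) {
    const LogArgType tags[] = {TagFor(values)..., LogArgType::kEnd};
    Log(tags, values...);
  }

  int main() {
    LogStreamer("Failed to get video stream: Zero size, code ", 42);
  }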

(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #18)

> I don't think that's correct; that's probably just the name of the last public symbol (.dynsym rather than .symtab) before the address, due to disassembling a stripped binary. 0x2f90 would be a fairly large offset for a string comparison function, and see also the line of the breakpad symbols I quoted in comment #14.

You're right, I forgot that the binary was stripped -_-

It's definitely rtc::webrtc_logging_impl::Log() then, but it also gets called in a bunch of other places that keep executing after it. 0x36e8da4 is the only location where the call to that function is followed by padding. If I look for other call-sites that look like that, I can find some, but they're all for noreturn functions, e.g.:

 1091aa8:       e8 83 2d ff ff          call   1084830 <__stack_chk_fail@plt>
 1091aad:       00 00                   add    BYTE PTR [rax],al