1594065 - Crash stack entries of the form .str.NNN.llvm.NNNNNNNNNNNNNNN, mostly on macOS

Assignee

Description

•

5 years ago

These occur as signatures and also as crash stack entries below the signature. But only those that occur as signatures can be quantified:

https://crash-stats.mozilla.com/search/?signature=~llvm&date=%3E%3D2019-10-29T16%3A31%3A00.000Z&date=%3C2019-11-05T16%3A31%3A00.000Z&_facets=signature&_facets=platform_version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-platform_version

They're clearly incorrect, but it's a bit hard to tell where they come from. I suspect Rust, ultimately. But there must also be a reason they occur disproportionately often on macOS.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 1

•

5 years ago

At least on the Mac, these occur often enough to be seriously annoying. They compromise the usefulness of the stacks they occur in, often very badly.

Severity: normal → critical

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 2

•

5 years ago

The only possibly relevant Rust-specific bug I can find in BMO is bug 1398171. I'm trying to find out whether or not that bug has been fixed. If it has been, then it's not implicated here.

Comment 3

•

5 years ago

This smells like the Mac version of bug 1489094 which we already fixed on Windows.

Gabriele Svelto [:gsvelto]

Comment 4

•

5 years ago

After having looked at some crashes I realized it's not. This is probably a case where either the stackwalker is doing something funny or the CFI information in the symbol file is broken. The frames with the .str.NNN.llvm.NNNNNNNNNNNNNNN names are always found via stack-scanning which means that we "lost" the stack frame information along the way and picked up the first address that looked like a valid candidate for stack walking.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 5

•

5 years ago

What's "the CFI information"? And by "the symbol file" do you mean the output of dump_syms?

Gabriele Svelto [:gsvelto]

Comment 6

•

5 years ago

Yes, the symbol file is the output of dump_syms. The CFI information is a set of instructions contained in the symbol file that teach the stack walker how to find the pointer to the previous stack frame starting from the first stack frame and working up through the stack. If those instructions are not produced correctly - which might be caused by a bug in dump_syms - then the stack trace will be wrong.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 7

•

5 years ago

So if you're right, gsvelto, this bug is likely to be in the Breakpad client code that generates minidump files (and which contains code functionally equivalent to dump_syms). I'm going to be going through that code, looking for a fix for bug 1371390. Along the way I'll also look for something that might fix this bug.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 8

•

5 years ago

For what it's worth, here's the earliest crash stack I can find with this bug. Its "date processed" is "2019-05-06 09:18:57 UTC":

bp-f09d801f-3f3b-4852-a7af-458ee0190506

I'll try to find likely patches that landed just before this.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 9

•

5 years ago

(Following up comment 8)

Never mind. It seems no crashes of any kind earlier than 2019-05-06 00:00 UTC are currently available on crash-stats.mozilla.com.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 10

•

5 years ago

What does CFI (call frame information) look like in a symbol file (the output of dump_syms)?

As best I can tell, and even with the latest dump_syms from Google (https://github.com/google/breakpad), no symbol file for any macOS module ever has any. Aside from the header (the first line), all the other lines look like this:

PUBLIC 1d40 0 +[NSObject(NSObject) load]

This is even though a macOS Mach-O module generally does have an __eh_frame section in its __TEXT segment.
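A quick way to check whether a symbol file contains any CFI is to tally its records by leading keyword. The following is only a minimal sketch: `count_records` is a hypothetical helper, not part of Breakpad, and it assumes each record starts with a keyword such as PUBLIC, FUNC or STACK CFI.

```python
from collections import Counter

# Record keywords assumed for this sketch; anything else is counted as "other".
KNOWN = ("MODULE", "INFO", "FILE", "FUNC", "PUBLIC")


def count_records(lines):
    """Tally symbol-file records by their leading keyword(s)."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "STACK" and len(parts) > 1:
            counts["STACK " + parts[1]] += 1  # e.g. "STACK CFI"
        elif parts[0] in KNOWN:
            counts[parts[0]] += 1
        else:
            counts["other"] += 1  # e.g. source-line records
    return counts


# The only kind of line seen in the macOS output described above.
sample = ["PUBLIC 1d40 0 +[NSObject(NSObject) load]"]
counts = count_records(sample)
print("PUBLIC:", counts["PUBLIC"], "STACK CFI:", counts["STACK CFI"])
```

A file whose "STACK CFI" count is zero gives the stack walker no unwinding instructions for that module.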

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 11

•

5 years ago

I figured it out on my own (by installing breakpad on Fedora and running dump_syms there). The following is the beginning of a CFI section in libpam.0.so:

    STACK CFI INIT 3020 4c0 .cfa: $rsp 16 + .ra: .cfa -8 + ^
    STACK CFI 3026 .cfa: $rsp 24 +
    STACK CFI INIT 34e0 4b0 .cfa: $rsp 8 + .ra: .cfa -8 + ^
    STACK CFI INIT 3a50 56 .cfa: $rsp 8 + .ra: .cfa -8 + ^
    STACK CFI 3a84 .cfa: $rsp 16 +
    STACK CFI 3aa5 .cfa: $rsp 8 +
    ...

Gabriele Svelto [:gsvelto]

Comment 12

•

5 years ago

(In reply to Steven Michaud [:smichaud] (Retired) from comment #11)

I figured it out on my own (by installing breakpad on Fedora and running dump_syms there). The following is the beginning of a CFI section in libpam.0.so:
    STACK CFI INIT 3020 4c0 .cfa: $rsp 16 + .ra: .cfa -8 + ^
    STACK CFI 3026 .cfa: $rsp 24 +
    STACK CFI INIT 34e0 4b0 .cfa: $rsp 8 + .ra: .cfa -8 + ^
    STACK CFI INIT 3a50 56 .cfa: $rsp 8 + .ra: .cfa -8 + ^
    STACK CFI 3a84 .cfa: $rsp 16 +
    STACK CFI 3aa5 .cfa: $rsp 8 +
    ...

Yeah, the stuff to the right of the STACK CFI directives is a set of instructions to reconstruct the stack pointer contents from a certain point in the code by manipulating the values that are already known. Basically the stack walker starts with the known value of the stack pointer, looks for the CFI information of the function in the current frame and uses it to calculate the stack pointer for the previous frame and so on.
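To make that concrete, here is a rough Python sketch of how such rules could be applied to the libpam lines quoted above. It is not Breakpad's code: the function names are made up, and it assumes the expressions are postfix, that `^` dereferences the address on top of the stack, that literals are decimal, and that only `+` and `-` are needed.

```python
def parse_rules(tokens):
    """Turn ['.cfa:', '$rsp', '16', '+', ...] into {'.cfa': ['$rsp', '16', '+'], ...}."""
    rules, current = {}, None
    for tok in tokens:
        if tok.endswith(":"):
            current = tok[:-1]
            rules[current] = []
        else:
            rules[current].append(tok)
    return rules


def rules_at(sym_lines, pc):
    """Rules in effect at pc: the covering INIT record plus any later deltas up to pc."""
    active, in_range = None, False
    for line in sym_lines:
        parts = line.split()
        if parts[:3] == ["STACK", "CFI", "INIT"]:
            start, size = int(parts[3], 16), int(parts[4], 16)
            in_range = start <= pc < start + size
            if in_range:
                active = parse_rules(parts[5:])
        elif parts[:2] == ["STACK", "CFI"] and in_range:
            if int(parts[2], 16) <= pc:
                active.update(parse_rules(parts[3:]))
    return active


def evaluate(tokens, values, memory):
    """Evaluate a postfix expression against known register values and stack memory."""
    stack = []
    for tok in tokens:
        if tok == "^":
            stack.append(memory[stack.pop()])  # dereference
        elif tok in ("+", "-"):
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if tok == "+" else a - b)
        elif tok in values:
            stack.append(values[tok])
        else:
            stack.append(int(tok))  # decimal literal, possibly negative
    return stack.pop()


def caller_frame(sym_lines, pc, regs, memory):
    """Return the caller's CFA and return address, or None if no CFI covers pc."""
    rules = rules_at(sym_lines, pc)
    if rules is None:
        return None
    cfa = evaluate(rules[".cfa"], regs, memory)
    ra = evaluate(rules[".ra"], {**regs, ".cfa": cfa}, memory)
    return {"cfa": cfa, "ra": ra}


SYM = [
    "STACK CFI INIT 3020 4c0 .cfa: $rsp 16 + .ra: .cfa -8 + ^",
    "STACK CFI 3026 .cfa: $rsp 24 +",
]
# Made-up register and memory contents, purely for illustration.
frame = caller_frame(SYM, 0x3030, {"$rsp": 0x7000}, {0x7010: 0x4242})
print(frame)
```

When `caller_frame` returns None, no record covers the address, which is the situation in which a walker has to fall back to some other method.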

Note that the version of dump_syms we have in mozilla-central has significant changes compared to the vanilla version. You could try comparing the output of both on the macOS files. It's possible that ours does better than the vanilla one that comes with breakpad.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 13

•

5 years ago

I just tried the dump_syms that was built during a recent local macOS build of mozilla-central. I got the same results as with Google's dump_syms -- no CFI information (on a module, the CoreFoundation framework, that has an __eh_frame section in its __TEXT segment). I'll try to get DumpSymbols::ReadCFI() working on macOS, and see if that makes any difference for this bug. I suspect it won't, but your information on CFI is currently my only lead.

Gabriele Svelto [:gsvelto]

Comment 14

•

5 years ago

I'm wondering if this could be a clang/LLVM issue. Was clang updated recently? Here's why I ask; look at the stack for this crash:

https://crash-stats.mozilla.com/report/index/140863c2-ab57-45eb-9581-6f66f0191101

It's all in libxul and we should have CFI information for that. And yet starting with the second frame the stack walker immediately switches to stack-scanning. That happens if it can't find the frame pointer and it doesn't have CFI information so the only way to crawl the stack is to tentatively scan for pointers.
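The fallback described here can be sketched roughly as follows. This is a simplified illustration, not the real stack walker: the function names, the module ranges and the word limit are all assumptions, and a real scanner applies more checks before accepting a candidate.

```python
def pick_strategy(has_cfi, has_frame_pointer):
    """Order of preference assumed for this sketch: CFI, then frame pointer, then scanning."""
    if has_cfi:
        return "cfi"
    if has_frame_pointer:
        return "frame_pointer"
    return "scan"


def in_any_module(addr, modules):
    """True if addr falls inside one of the (start, end) module ranges."""
    return any(start <= addr < end for start, end in modules)


def scan_for_return_address(stack_words, modules, max_words=40):
    """Return (index, value) of the first stack word that points into a loaded module."""
    for i, word in enumerate(stack_words[:max_words]):
        if in_any_module(word, modules):
            return i, word
    return None


# Made-up stack contents and a single made-up module range.
modules = [(0x1000, 0x2000)]
stack_words = [0x0, 0x5, 0x1ABC, 0x1DEF]
print(pick_strategy(False, False), scan_for_return_address(stack_words, modules))
```

In this simplified test any word that happens to point into a module is accepted, whether or not it is a genuine return address, which is why frames found by scanning are less trustworthy than frames found via CFI.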

Gabriele Svelto [:gsvelto]

Comment 15

•

5 years ago

OK, so I opened up the symbol file generated for the bug in comment 14 and it confirmed my guess: there are no STACK CFI directives for the first call on the stack (mozilla::ipc::MessageChannel::Clear()) so the unwinder can't do anything but fall back to stack-scanning. Nathan, are you aware of changes that might have impacted CFI generation when dumping Mac debuginfo?

Flags: needinfo?(nfroyd)

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 16

•

5 years ago

It's all in libxul and we should have CFI information for that.

Yes, I notice that both Mozilla's and Google's dump_syms show CFI information for XUL in a recent Firefox.app and Firefox Nightly.app. Do you have any idea why this works for XUL and not anything else?

OK, so I opened up the symbol file generated for the bug in comment 14

Where do you find this? Do you need access to the minidump?

Nathan Froyd [:froydnj]

Comment 17

•

5 years ago

(In reply to Gabriele Svelto [:gsvelto] from comment #15)

OK, so I opened up the symbol file generated for the bug in comment 14 and it confirmed my guess: there are no STACK CFI directives for the first call on the stack (mozilla::ipc::MessageChannel::Clear()) so the unwinder can't do anything but fall back to stack-scanning. Nathan, are you aware of changes that might have impacted CFI generation when dumping Mac debuginfo?

clang was updated to 9.0 about six weeks ago. So that could be a problem?

I'm surprised at comment 13: I would think we'd decode information from __eh_frame even if there isn't any actual CFI information in the DWARF or if the information is unreadable somehow. Maybe the Mac stuff doesn't look at __eh_frame, though?

Flags: needinfo?(nfroyd)

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 18

•

5 years ago

For the record, I ran dump_syms (both Mozilla's and Google's) on the XUL from Firefox 67.0.2 (the version from comment 14's crash stack) and had no problems -- lots of STACK CFI lines.

XUL on the Mac doesn't have a __DWARF segment (only an __eh_frame section in the __TEXT segment). So my guess is that dump_syms is getting its CFI information from the latter. I'll learn more when I start putting hooks in the crashreporter.app process.

Nathan Froyd [:froydnj]

Comment 19

•

5 years ago

Binaries on the Mac don't have the debug information embedded in the binary; we have to run dsymutil on the binary to set up the debug information for proper processing by dump_syms. See https://searchfox.org/mozilla-central/source/toolkit/crashreporter/tools/symbolstore.py#830-894 for the preprocessing we have to do.

Gabriele Svelto [:gsvelto]

Comment 20

•

5 years ago

(In reply to Steven Michaud [:smichaud] (Retired) from comment #16)

Where do you find this? Do you need access to the minidump?

No, you can get the symbol file from https://symbols.mozilla.org/

(In reply to Nathan Froyd [:froydnj] from comment #17)

I'm surprised at comment 13: I would think we'd decode information from __eh_frame even if there isn't any actual CFI information in the DWARF or if the information is unreadable somehow. Maybe the Mac stuff doesn't look at __eh_frame, though?

Mac's dump_syms is different from Linux's dump_syms so anything could be happening :-) It's late today, I'll have a look tomorrow.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 21

•

5 years ago

When I run either dsymutil or llvm_dsymutil on the CoreFoundation framework, I see warnings that there are "no debug symbols in executable" for any of its architectures. Then when I run dump_syms on the resulting dSYM bundle, I get exactly the same results as I get running dump_syms directly on the CoreFoundation framework.

When I run dsymutil or llvm_dsymutil on XUL from a recent Firefox Nightly, I get the same warning ("no debug symbols in executable"). Then when I run dump_syms on the resulting dSYM bundle I actually get a worse result than when I use dump_syms directly on XUL: Only PUBLIC symbols, no FUNC symbols, no CFI information.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 22

•

5 years ago

(In reply to Gabriele Svelto [:gsvelto] from comment #20)

(In reply to Steven Michaud [:smichaud] (Retired) from comment #16)

Where do you find this? Do you need access to the minidump?

No, you can get the symbol file from https://symbols.mozilla.org/

OK, I've been playing around with this at https://symbols.mozilla.org/symbolication, but I still haven't a clue what I need to do. Please be more explicit.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 23

•

5 years ago

(Following up comment #22)

I finally figured out this one, too, I think, thanks to help from https://bluesock.org/~willkg/blog/

Get the "debug_id" and "debug_file" for XUL from https://crash-stats.mozilla.com/report/index/140863c2-ab57-45eb-9581-6f66f0191101#tab-rawdump.
Use this info to compose the following URL:

https://symbols.mozilla.org/{debug_file}/{debug_id}/{debug_file}.sym

I was able to download it using wget --no-check-certificate [url]

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 24

•

5 years ago

One final twist: The file I downloaded in comment 23 is gzipped.
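Putting comments 23 and 24 together, the download could be scripted along these lines. This is a sketch under assumptions: the helper names are hypothetical, the URL pattern is the one from comment 23, and the gzip check simply looks for the gzip magic bytes rather than trusting any HTTP header.

```python
import gzip
import urllib.request


def symbol_url(debug_file, debug_id):
    """Compose the symbol-file URL from a module's debug_file and debug_id."""
    return f"https://symbols.mozilla.org/{debug_file}/{debug_id}/{debug_file}.sym"


def decode_symbol_payload(data):
    """Gunzip the payload if it starts with the gzip magic bytes, then decode as text."""
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    return data.decode("utf-8", errors="replace")


def fetch_symbol_file(debug_file, debug_id):
    """Download and decode one symbol file (performs a network request)."""
    with urllib.request.urlopen(symbol_url(debug_file, debug_id)) as resp:
        return decode_symbol_payload(resp.read())


# "<debug_id>" is a placeholder; take the real value from the crash report's raw dump tab.
print(symbol_url("XUL", "<debug_id>"))
```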

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 25

•

5 years ago

•

Edited

(Following up comment #13)

dump_syms actually does add CFI information, if there is an __eh_frame section in the __TEXT segment. When running dump_syms on the CoreFoundation framework, I was running without an -a argument (to specify the architecture), on the assumption that it would choose 'x86_64', which is the default. Instead it chose 'i386' (I don't know why), which (unlike the x86_64 and x86_64h architectures) doesn't have an __eh_frame section. If you specify -a x86_64 or -a x86_64h, dump_syms does add CFI information.
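To avoid that pitfall, the architecture can be passed explicitly every time. The sketch below wraps the call in Python; it assumes a `dump_syms` binary is on the PATH and uses only the `-a` option mentioned above. The helper names are hypothetical.

```python
import subprocess


def dump_syms_command(binary_path, arch, tool="dump_syms"):
    """Build the command line, always naming the architecture explicitly."""
    return [tool, "-a", arch, binary_path]


def dump_symbols(binary_path, arch, tool="dump_syms"):
    """Run dump_syms and return the symbol file text (requires the tool to be installed)."""
    result = subprocess.run(
        dump_syms_command(binary_path, arch, tool),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_cfi(sym_text):
    """True if the symbol file text contains at least one STACK CFI record."""
    return any(line.startswith("STACK CFI") for line in sym_text.splitlines())


print(dump_syms_command("CoreFoundation", "x86_64"))
```

Running `has_cfi(dump_symbols(path, arch))` once per architecture would show at a glance which slices of a universal binary yield CFI.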

Sorry for the confusion :-(

(I've been testing on macOS 10.14.6.)

Gabriele Svelto [:gsvelto]

Comment 26

•

5 years ago

Yeah, note that when I did that short analysis for comment 15 I didn't find CFI information for mozilla::ipc::MessageChannel::Clear() but there was CFI information for other methods. It's as if we're failing to emit it in some cases but not in others.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 27

•

5 years ago

Interestingly, crash stacks with the signature mozilla::ipc::MessageChannel::Clear() dating from 2019-09-10 to the present all have this bug on macOS:

https://crash-stats.mozilla.com/search/?signature=~mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AClear&platform=Mac%20OS%20X&date=%3E%3D2019-09-10T09%3A18%3A00.000Z&date=%3C2019-11-08T09%3A18%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

But none of them do on Windows or Linux:

https://crash-stats.mozilla.com/search/?signature=~mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AClear&platform=Windows&date=%3E%3D2019-09-10T09%3A18%3A00.000Z&date=%3C2019-11-08T09%3A18%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

https://crash-stats.mozilla.com/search/?signature=~mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AClear&platform=Linux&date=%3E%3D2019-09-10T09%3A18%3A00.000Z&date=%3C2019-11-08T09%3A18%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

This argues for this bug having been caused by some kind of change in compilation on 2019-09-10 or just before -- one that messed up macOS far worse than Windows or Linux. (I assume that only a change in compilation could explain the CFI for mozilla::ipc::MessageChannel::Clear() consistently being missing from the __eh_frame section of the __TEXT segment.)
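The three searches above differ only in their platform parameter, so they can be generated rather than edited by hand. This is a minimal sketch: `search_url` is a hypothetical helper, and it reproduces only a subset of the query parameters visible in the URLs above.

```python
from urllib.parse import urlencode


def search_url(signature, platform, start, end):
    """Build a crash-stats search URL for signatures containing the given text."""
    params = [
        ("signature", "~" + signature),  # "~" prefix, as in the URLs above
        ("platform", platform),
        ("date", ">=" + start),
        ("date", "<" + end),
        ("_facets", "signature"),
        ("_sort", "-date"),
    ]
    return "https://crash-stats.mozilla.com/search/?" + urlencode(params)


for platform in ("Mac OS X", "Windows", "Linux"):
    print(search_url(
        "mozilla::ipc::MessageChannel::Clear",
        platform,
        "2019-09-10T09:18:00.000Z",
        "2019-11-08T09:18:00.000Z",
    ))
```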

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 28

•

5 years ago

I forgot to add that none of the macOS crash stacks before 2019-09-10 have this bug, save for an odd sequence of 14 on 2019-05-18, all at the same time and on a single machine:

https://crash-stats.mozilla.com/search/?signature=~mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AClear&platform=Mac%20OS%20X&date=%3E%3D2019-05-09T09%3A30%3A00.000Z&date=%3C2019-09-09T09%3A30%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 29

•

5 years ago

This argues for this bug having been caused by some kind of change in compilation on 2019-09-10 or just before

Not having been caused, since the bug does go back at least six months (comment 8). But for something having happened around then that triggered a significant change.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 30

•

5 years ago

For those interested in understanding the STACK CFI syntax, I found a good explanation at bug 547075 comment 5.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 31

•

5 years ago

•

Edited

I continue to bang my head against this, but I now suspect that this bug has nothing to do with CFI information. The sea change I saw in mozilla::ipc::MessageChannel::Clear() crash stacks (in comment 27) doesn't pan out for any of the other types of crash stack that I examined.

Most of the crashes from comment 0 seem to be IPC related or OOM crashes (for example bug 1595420). I suspect that most (maybe all) of the IPC crashes happen in the child process as it's exiting -- for example those at mozilla::ipc::MessageChannel::Clear(). So we've got two variations on a possible fundamental explanation for the stack corruptions in this bug (and also bug 1594078) -- the crashing process is in an unstable state.

Using a HookCase hook library, I can now reliably reproduce stack corruption by triggering a call to abort() in a method called from mozilla::ipc::MessageChannel::Clear() (in the child process). The corruption isn't the same as was reported here (in fact it's even more spectacular). But I suspect it's the same general phenomenon. Over the next few days I'll be looking into exactly what happens leading up to this stack corruption -- on the crashing machine and on Socorro. Hopefully I'll be able to find some way to ameliorate the problem. But I'm almost certainly not going to be able to make it go away completely.

To test what happens on Socorro, I'll be playing with minidump_stackwalk (Mozilla's and Google's), and also Socorro's own stackwalk (which can be compiled and run separately).

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 32

•

5 years ago

Since I'm doing so much work on this bug, I might as well assign it to myself, at least temporarily.

Assignee: nobody → smichaud

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 33

•

5 years ago

The fix for bug 1371390, which just landed, should make a big difference here on the Mac. I'm hoping that it will even out the numbers, so that these crash stacks no longer appear disproportionately on the Mac. I'll check in a week or two.

Steven Michaud [:smichaud] (Retired)

Assignee

Comment 34

•

5 years ago

These have almost vanished since my patch for bug 1371390 landed:

https://crash-stats.mozilla.com/search/?signature=~llvm&build_id=%3E%3D20191120094758&date=%3E%3D2019-11-03T21%3A33%3A00.000Z&date=%3C2019-12-03T21%3A33%3A00.000Z&_facets=signature&_facets=platform_version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

The few that remain are all on Linux.

Let's resolve this FIXED. It can be reopened if the numbers start creeping up again.

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → FIXED