Closed Bug 1544416 Opened 5 years ago Closed 5 years ago

panic symbol is missing module name

Categories

(Toolkit :: Crash Reporting, defect, P2)

defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- fixed
firefox68 --- fixed
firefox69 --- fixed
firefox70 --- fixed

People

(Reporter: willkg, Unassigned)

References

(Regression)

Details

(Keywords: regression)

I was looking into bug #1544246 and it's got these frames:

0 	XUL 	GeckoCrash 	toolkit/xre/nsAppRunner.cpp:5093 	context
1 	XUL 	gkrust_shared::panic_hook 	toolkit/library/rust/shared/lib.rs:240 	frame_pointer
2 	XUL 	core::ops::function::Fn::call 	src/libcore/ops/function.rs:69 	cfi
3 	XUL 	rust_panic_with_hook 	src/libstd/panicking.rs:482 	cfi
4 	XUL 	continue_panic_fmt 	src/libstd/panicking.rs:385 	cfi
5 	XUL 	rust_begin_panic 	src/libstd/panicking.rs:312 	cfi
6 	XUL 	panic_fmt 	src/libcore/panicking.rs:85 	cfi
7 	XUL 	panic 	src/libcore/panicking.rs:49 	cfi
...

The symbols from panicking.rs lack the crate/module/type bits. In other crashes, they were "core::panicking::rust_panic_with_hook", etc.

What changed here and how can we get back to symbols with crate/module/type bits?

The crash in bug #1544246 is from OSX. Looking at the signature report, there are a handful from OSX and a handful from Linux.

Examples:

OSX: https://crash-stats.mozilla.org/report/index/c0351dcd-a07b-4e84-a964-7bfe00190413#tab-details

Linux: https://crash-stats.mozilla.org/report/index/2b9f80f0-e8d6-43ee-a22c-3320b0190412

Here's the same problem on Windows from a crash from yesterday:

https://crash-stats.mozilla.org/report/index/b93c3882-eead-4ac8-b923-539410190414

That has the right signature because signature generation is using "core::panicking::panic" as a signature sentinel. Here's the top 7 frames:

0 	xul.dll 	GeckoCrash 	toolkit/xre/nsAppRunner.cpp:5092 	context
1 	xul.dll 	static void gkrust_shared::panic_hook(struct core::panic::PanicInfo*) 	toolkit/library/rust/shared/lib.rs:240 	cfi
2 	xul.dll 	static void core::ops::function::Fn::call<fn(core::panic::PanicInfo*), (core::panic::PanicInfo*)>(**, struct core::panic::PanicInfo*) 	src/libcore/ops/function.rs:69 	cfi
3 	xul.dll 	static void std::panicking::rust_panic_with_hook() 	src/libstd/panicking.rs:482 	cfi
4 	xul.dll 	static void std::panicking::continue_panic_fmt() 	src/libstd/panicking.rs:385 	cfi
5 	xul.dll 	static void std::panicking::rust_begin_panic() 	src/libstd/panicking.rs:312 	cfi
6 	xul.dll 	static void core::panicking::panic_fmt() 	src/libcore/panicking.rs:85 	cfi
7 	xul.dll 	void core::panicking::panic() 	src/libcore/panicking.rs:49 	cfi

OK, for reasons I do not understand yet, the Rust 1.34 update changed how the panic symbols are represented in symbol files. Before, we had:

4661480:FUNC 57380f0 35 0 std::panicking::begin_panic
4661513:FUNC 5738450 4e 0 std::panicking::begin_panic_fmt
4661518:FUNC 57384a0 6a 0 std::panicking::continue_panic_fmt
4661525:FUNC 5738510 6b2 0 std::panicking::rust_panic_with_hook
...
4663002:FUNC 5744970 17 0 std::panic::resume_unwind
4663005:FUNC 5744990 59 0 std::panicking::update_count_then_panic
4663317:FUNC 5747610 17 0 __rust_start_panic
4663320:FUNC 5747630 17 0 panic_abort::__rust_start_panic::abort
4663882:FUNC 574c4e0 75 0 core::panicking::panic_bounds_check
4663946:FUNC 574ca30 4e 0 core::panicking::panic_fmt
4664144:FUNC 574ded0 75 0 core::panicking::panic

and after, we have:

4734967:FUNC 573bed0 4e 0 begin_panic_fmt
4734972:FUNC 573bf20 6a 0 continue_panic_fmt
4734979:FUNC 573bf90 6ae 0 rust_panic_with_hook
...
4736318:FUNC 5748340 59 0 update_count_then_panic
4737158:FUNC 574fd40 6b 0 panic_bounds_check
4737188:FUNC 5750280 4e 0 panic_fmt
4737357:FUNC 5751510 75 0 panic
4746497:PUBLIC 573bb90 0 std::panicking::begin_panic::h4112fdbb402e7b9e
4746498:PUBLIC 573c700 0 rust_panic
4746535:PUBLIC 5748320 0 std::panic::resume_unwind::hae3c8b50f45750f6
4746541:PUBLIC 574ae70 0 __rust_start_panic
4746542:PUBLIC 574ae90 0 panic_abort::__rust_start_panic::abort::h02f918ce7ddf92c3

which is not quite a 1-to-1 match in terms of ordering, but you can see that, e.g. panic_bounds_check results in a different name.

I'm going to see about filing a Rust bug, and maybe somebody will have bright ideas about what could have regressed this. I looked at the obvious src/libcore sources, and they haven't changed in any way that would trigger this.

Regressed by: 1543469

Rust 1.33 gives this debug information:

 <3><21579134>: Abbrev Number: 300 (DW_TAG_subprogram)
    <21579136>   DW_AT_low_pc      : 0x574c4e0
    <2157913a>   DW_AT_high_pc     : 0x75
    <2157913e>   DW_AT_frame_base  : 1 byte block: 54 	(DW_OP_reg4 (esp))
    <21579140>   DW_AT_linkage_name: (indirect string, offset: 0x9d83e5d): _ZN4core9panicking18panic_bounds_check17hea2fcbc4d6413f35E
    <21579144>   DW_AT_name        : (indirect string, offset: 0x9d83e98): panic_bounds_check
    <21579148>   DW_AT_decl_file   : 37
    <21579149>   DW_AT_decl_line   : 55
    <2157914a>   DW_AT_external    : 1
    <2157914a>   DW_AT_noreturn    : 1
 <4><2157914a>: Abbrev Number: 109 (DW_TAG_inlined_subroutine)
    <2157914b>   DW_AT_abstract_origin: <0x215757ca>
    <2157914f>   DW_AT_ranges      : 0x37aa380
    <21579153>   DW_AT_call_file   : 37
    <21579154>   DW_AT_call_line   : 61
 <4><21579155>: Abbrev Number: 0

Note the DW_AT_linkage_name. Rust 1.34, in contrast:

 <1><216f5dd5>: Abbrev Number: 293 (DW_TAG_subprogram)
    <216f5dd7>   DW_AT_low_pc      : 0x5748740
    <216f5ddb>   DW_AT_high_pc     : 0x6b
    <216f5ddf>   DW_AT_name        : (indirect string, offset: 0x9db194a): panic_bounds_check
 <2><216f5de3>: Abbrev Number: 107 (DW_TAG_inlined_subroutine)
    <216f5de4>   DW_AT_abstract_origin: <0x216f5db4>
    <216f5de8>   DW_AT_ranges      : 0x38f1140
    <216f5dec>   DW_AT_call_file   : 10
    <216f5ded>   DW_AT_call_line   : 61
 <2><216f5dee>: Abbrev Number: 0

Note the lack of DW_AT_linkage_name. We use that hereabouts:

https://searchfox.org/mozilla-central/rev/d33d470140ce3f9426af523eaa8ecfa83476c806/toolkit/crashreporter/google-breakpad/src/common/dwarf_cu_to_module.cc#386

and then the result of that:

https://searchfox.org/mozilla-central/rev/d33d470140ce3f9426af523eaa8ecfa83476c806/toolkit/crashreporter/google-breakpad/src/common/dwarf_cu_to_module.cc#410

dmajor noticed that 1.34 transitioned libstd to Rust 2018, which changed how a bunch of things were imported into libstd:

https://github.com/rust-lang/rust/commit/206988751a040522cd7d5da50ca4082a2e65c265

I filed https://github.com/rust-lang/rust/issues/60020 for the Rust regression. I don't know what policy on handling something like this looks like.

I scry with my little eye that this issue is definitely rooted in the Rust update and isn't going to be fixed any time soon and even if it did, we're probably stuck with it for a while and will have to live with it.

I wonder if this is a problem with all rust symbols or just the panicky ones. I skimmed the proto signatures and stacks for crash reports in nightly channel with buildid >= 20190413214623 and only see the problem manifest with things from panicking.rs:

https://crash-stats.mozilla.com/search/?build_id=%3E%3D20190413214623&version=68.0a1&date=%3E%3D2019-04-15T14%3A12%3A00.000Z&date=%3C2019-04-22T14%3A12%3A00.000Z&_facets=signature&_facets=proto_signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-proto_signature

Is it possible we could band-aid-fix creation of the symbols files?

If not, should we add some kind of "add the module back" rule such that signature generation works?

If not, should we add rust symbols without the module parts to the various signature-generation lists? This could have serious consequences because we won't be able to tell xyz in one module from xyz in another.

Other ideas?

Is the problem that we're on longer matching this rule? https://github.com/mozilla-services/socorro/blob/master/socorro/signature/siglists/irrelevant_signature_re.txt#L91

If so then a band-aid could be to just add some known panic functions by name.

In comment #5, the last idea I had was to add the functions in question to the sentinels list without their module names which is what you're wondering about. However, that could have big consequences since it assumes that we'll never see an "xyz" without a module name that we wouldn't want to be a sentinel. I don't know what the likelihood of that would be or how to estimate it. I suspect it's unlikely for "panic", "panic_bounds_check", "panic_fmt", and "begin_panic", but I have no idea. I'd love to know about the viability of other options first.

Do you have the ability to scan the current db for collisions on these names? Perhaps that could give us some level of reassurance.

It's tricky enough that I'm not sure I would feel fully confident.

The data is in Elasticsearch. I think we can query the proto_signature to see if it has "panic" in it and then look at the resulting crash reports.

However, the situation only started happening 10 days ago and even if we could guarantee we haven't seen a collision of those names, I don't know what we can say about the future.

Socorro has no ability to do something like "use this symbol as a sentinel, but only if the build id is between numbers X and Y".

In Rust culture, is there a rule or requirement or cultural ick that crate authors can't reuse the names of panic functions?

I looked at crash reports with panic related functions in them but missing the module and there were 22 in the last month:

https://crash-stats.mozilla.com/search/?proto_signature=%40.%2A%3F%5C%7C%20panic%20%5C%7C.%2A%3F&proto_signature=%40.%2A%3F%5C%7C%20panic_bounds_check%20%5C%7C.%2A%3F&proto_signature=%40.%2A%3F%5C%7C%20panic_fmt%20%5C%7C.%2A%3F&proto_signature=%40.%2A%3F%5C%7C%20begin_panic%20%5C%7C.%2A%3F&proto_signature=%40.%2A%3F%5C%7C%20begin_panic%5C%3CT%5C%3E%20%5C%7C.%2A%3F&proto_signature=%40.%2A%3F%5C%7C%20begin_panic_fmt%20%5C%7C.%2A%3F&date=%3E%3D2019-03-23T00%3A08%3A00.000Z&date=%3C2019-04-23T00%3A08%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

(Pretty sure that search gets me what I want. SuperSearch coupled with Elasticsearch regexes is tough.)

So maybe collisions with panicky functions aren't going to be a big deal.

Then I looked at 1000 crash reports from today that have a build id >= 20190413214623 which have a frame that's obviously from a Rust file that has a symbol that doesn't have a :: (if you have a better heuristic for honing in on the problem. I'm all ears).

That 1000 is just a sampling and probably not a representative sample. We get a lot of unhelpful crash reports. I didn't check to see if signatures were the same or not. I can do another pass if we think it's worth the time/effort.

Anyhow, 83 of the 1000 crash reports I looked at have symbols that are missing modules:

git:github.com/rust-lang/rust:src/libstd/panicking.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf   rust_begin_panic
git:github.com/rust-lang/rust:src/libstd/sys/unix/thread.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf   thread_start
git:github.com/rust-lang/rust:src/libcore/ptr.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf   <name omitted>
hg:hg.mozilla.org/mozilla-central:xpcom/rust/gkrust_utils/src/lib.rs:b783cd5203ea589bb7505852e5108ed142d2d37a   GkRustUtils_GenerateUUID
hg:hg.mozilla.org/mozilla-central:gfx/webrender_bindings/src/bindings.rs:b783cd5203ea589bb7505852e5108ed142d2d37a   wr_renderer_render
git:github.com/rust-lang/rust:src/libcore/fmt/mod.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf   write
hg:hg.mozilla.org/mozilla-central:servo/components/style/gecko/arc_types.rs:02b89c29412b6c1444fe32a4847e5261e2bb3d00   Servo_ComputedStyle_Release
git:github.com/rust-lang/rust:src/libstd/panicking.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf   continue_panic_fmt
hg:hg.mozilla.org/mozilla-central:media/libcubeb/cubeb-pulse-rs/src/capi.rs:b783cd5203ea589bb7505852e5108ed142d2d37a   pulse_rust_init
hg:hg.mozilla.org/mozilla-central:gfx/webrender_bindings/src/bindings.rs:0cf7ad6c1a23ea598cf72c00d06daeb2607ae039   wr_api_hit_test
git:github.com/rust-lang/rust:src/libcore/panicking.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf   panic_fmt
git:github.com/rust-lang/rust:src/libstd/panicking.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf   rust_panic_with_hook
git:github.com/rust-lang/rust:src/libstd/thread/mod.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf   park
git:github.com/rust-lang/rust:src/libcore/panicking.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf   panic
hg:hg.mozilla.org/mozilla-central:gfx/webrender_bindings/src/bindings.rs:2049f94666461e2eff7e67f026730a4932c98788   wr_renderer_flush_pipeline_info
git:github.com/rust-lang/rust:src/libstd/sys/unix/thread.rs:91856ed52c58aa5ba66a015354d1cc69e9779bdf   new
hg:hg.mozilla.org/mozilla-central:servo/components/style/stylist.rs:72be82c6809e2cc187e5cffdff0c3e686c564a57   Servo_StyleSet_FlushStyleSheets

Pretty sure those symbols have all changed since we upgraded to Rust 1.34. Many of them are from rust-lang, but seems like some of them are from Firefox code.

All those from Firefox code are actually named that way, they are boundary functions between rust and C, and are extern "C", so they're not mangled or anything, and aren't meant to be.

I've just run dump_syms on my local build (using rust 1.35) and the symbols look sane again, e.g.:

FUNC a08cbd0 2e 0 _ZN3std9panicking11begin_panic17h9faa7e103fb56e56E

PUBLIC a08cbd0 0 std::panicking::begin_panic::h9faa7e103fb56e56

Will are you still seeing this?

Flags: needinfo?(willkg)
Priority: -- → P2

I looked at recent nightlies and I think it looks right now. For example bp-84fea11f-6b30-4f0b-b3e7-74a340190719 which is a crash for Firefox nightly with buildid 20190717215119. It has this in the stack:

{
    "file": "git:github.com/rust-lang/rust:src/libstd/panicking.rs:a53f9df32fbb0b5f4382caaad8f1a46f36ea887c",
    "frame": 4,
    "function": "std::panicking::continue_panic_fmt",
    "function_offset": "0x4d",
    "line": 381,
    "module": "libxul.so",
    "module_offset": "0x61a187d",
    "offset": "0x7f761225487d",
    "trust": "cfi"
},

So, assuming I'm looking at an appropriate crash, I think this looks right now.

Flags: needinfo?(willkg)

Thanks Will, let's mark this fixed.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.