Closed Bug 1397926 Opened 7 years ago Closed 7 years ago

Add signature rewrite rule to remote "[clone .cold.N]" from function names

Categories

(Socorro :: Backend, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mccr8, Assigned: willkg)

Details

Attachments

(1 file)

I noticed in some recent Linux crashes that the signature for the top frame contains "[clone .cold.N]", where N is some integer. This is obviously not great because it makes it harder to find existing bugs for crashes.

For instance: [@ @0x0 | mozilla::dom::Element::AddSizeOfExcludingThis const [clone .cold.398] ]
  bp-ea260c48-9d43-4596-a90f-a3c6f0170903

I'm not sure at what stage of the crash report pipeline this is showing up. It shows up in the "raw dump".

I found a crash report with a build from 8-16, but the first crash time I see (also the same crash) is from 8-31, so maybe it is a server side issue?
  bp-8fd0dd5e-59b2-424d-b041-f93e80170902

This shows up with a number of different signatures, all Linux, but not all Linux signatures have this.
Ted, do you have any guesses what might be going wrong, and who could look into this? Thanks.
Flags: needinfo?(ted)
I was going to suggest that maybe this was something new in GCC 6 (bug 1389435), but it looks like that merged to central on August 22nd, so those crashes predate it by a few days.

In any event, I suspect this is just some weirdo naming coming from the compiler, maybe due to the PGO optimizer separating hot and cold blocks of a function? I downloaded the symbol file for libxul from this build (you can find the link on the modules tab) and that symbol is present there:
PUBLIC 11aef74 0 mozilla::dom::Element::AddSizeOfExcludingThis(nsWindowSizes&, unsigned long*) const [clone .cold.398]
PUBLIC 34ce150 0 mozilla::dom::Element::AddSizeOfExcludingThis(nsWindowSizes&, unsigned long*) const

https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/libxul.so/B615F8577A4F30B08E51CC1A5E8CF7D60/libxul.so.sym

The DWARF symbols are available at the same place, slightly different URL:
https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/libxul.so/B615F8577A4F30B08E51CC1A5E8CF7D60/libxul.so.dbg.gz

Just looking at the symbol table shows that yeah, this symbol name exists there:
$ nm libxul.so.dbg | rg 7Element22AddSizeOfExcludingThis
312215:00000000034ce150 t _ZNK7mozilla3dom7Element22AddSizeOfExcludingThisER13nsWindowSizesPm
312216:00000000011aef74 t _ZNK7mozilla3dom7Element22AddSizeOfExcludingThisER13nsWindowSizesPm.cold.398

$ nm -C libxul.so.dbg | rg mozilla::dom::Element::AddSizeOfExcludingThis
312215:00000000034ce150 t mozilla::dom::Element::AddSizeOfExcludingThis(nsWindowSizes&, unsigned long*) const
312216:00000000011aef74 t mozilla::dom::Element::AddSizeOfExcludingThis(nsWindowSizes&, unsigned long*) const [clone .cold.398]

That's about as far as I got at the moment because dwarfdump throws an error on some Rust debug info (filed https://github.com/rust-lang/rust/issues/44412).
Flags: needinfo?(ted)
Oh! But:
 <0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
    <c>   DW_AT_producer    : (indirect string, offset: 0x1049): GNU C++11 6.4.0 -mtune=generic -march=x86-64 -g -O3 -std=gnu++11 -fPIC -fno-strict-aliasing -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -fno-omit-frame-pointer

This *is* code compiled by GCC 6!
OK, so my theory about this being due to PGO is correct. This was originally introduced in 2013:
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=ec6146042a621b7c0868f2bff62eee7cd23af411;hp=198d7b78ede581f481041a341a553187bcaf0dfc

but the bit that adds the numeric suffix was added in 2015, so that's why we're only noticing it with the switch to GCC 6:
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=08cb962fa19224ab6e256a5c163c0584e21cf014;hp=36f7991ca0401bf6defcee8c215d2037199a5edf

I'm going to move this to the Socorro component, we'll just probably want to write a signature rule that rewrites these to strip off the `[clone .cold.\d+]` bit.
Component: Crash Reporting → Backend
Product: Toolkit → Socorro
Summary: Weird "[clone .cold.N]" in Linux crash signatures → Add signature rewrite rule to remote "[clone .cold.N]" from function names
We currently do rewriting to strip certain template parameters and function arguments from signatures, that code is here:
https://github.com/mozilla-services/socorro/blob/ccca0440df91dd67fa284daa423f67d21fffdd21/socorro/signature/signature_utilities.py#L191
Lonnen, any idea who might be working on Socorro who could fix this? The new version of GCC has gone out to beta and it would be nice to get this fixed. Thanks.
Flags: needinfo?(chris.lonnen)
I can grab this.

The plan is to update that normalize_signature method Ted pointed out in comment #5 to also nix the "[clone .cold.xxx]" bit completely.

After doing that, the signature for bp-ea260c48-9d43-4596-a90f-a3c6f0170903 would go from:

    [@ @0x0 | mozilla::dom::Element::AddSizeOfExcludingThis const [clone .cold.398] ]

to:

    [@ @0x0 | mozilla::dom::Element::AddSizeOfExcludingThis const]

Does that sound like what we want?
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Flags: needinfo?(chris.lonnen) → needinfo?(continuation)
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #7)
>     [@ @0x0 | mozilla::dom::Element::AddSizeOfExcludingThis const]
> Does that sound like what we want?

Yes. Thanks!
Flags: needinfo?(continuation)
Commit pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/bcd2b8c8aebc4187dec48d629417bbc6ff2a9b96
fixes bug 1397926 - remove [clone .cold.xxx] from signatures (#4012)

GCC 6 includes an optimizer that identifies hot and cold code sections and marks
them. That messes up signatures because it includes a number at the end that
changes.

This strips the frame function of that tawdry bit so as to make signatures
bucket better.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
This went to prod a little bit ago.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: