Find a way to deal with build-specific LLVM signatures

RESOLVED FIXED in Firefox 65

Status

()

defect
RESOLVED FIXED
9 months ago
5 months ago

People

(Reporter: philipp, Assigned: gsvelto)

Tracking

unspecified
mozilla66
Points:
---
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox62 unaffected, firefox63 wontfix, firefox64- wontfix, firefox65+ fixed, firefox66+ fixed)

Details

Attachments

(1 attachment)

Reporter

Description

9 months ago
probably due to the clang-cl compiler change in 63 there are now some crash reports in the area of memory allocation that have build-specific signatures:
https://crash-stats.mozilla.com/search/?signature=~%40%40%40Z.llvm&date=>%3D2018-06-01

maybe we could find a skiplist rule to end up with uniform signatures across builds for those cases.
I don't understand what this bug is about. Are we talking about adding demangling code to Socorro signature generation or skipping mangled frames altogether?
Reporter

Comment 2

9 months ago
i am not sure what the solution might be here myself and how the compiler changes impacted signatures in 63 exactly.

for example i'm assuming that this signature in 62
> [@ arena_t::DallocSmall | je_free | static void DefaultFreeEntry ] (bp336af821-c0e4-4f38-99b0-b65400180628)
turned into the following one in 63:
> [@ arena_t::DallocSmall | ?arena_dalloc@@YAXPEAX_KPEAUarena_t@@@Z.llvm.1274928215564211852 ] (bp-0d69846f-63fd-4cb1-9596-9f5680180910)

i'm not sure if adding a demangling rule + including the next frame [@ arena_t::DallocSmall | ?arena_dalloc | SECOID_Shutdown] or just totally skipping the mangled frame [@ arena_t::DallocSmall | SECOID_Shutdown] would be a better solution.

david, as you sent an announcement mail about this to the stability list, would you have any input on how to best deal with these (currently build-specific) signatures?
Flags: needinfo?(dmajor)
I think this is http://llvm-cs.pcc.me.uk/include/llvm/IR/ModuleSummaryIndex.h#1044 (and there's a more extensive comment at http://llvm-cs.pcc.me.uk/lib/Transforms/Utils/FunctionImportUtils.cpp#97)

Presumably Socorro is already running a demangler at some layer. Maybe we should strip \.llvm\.[0-9]+$ before running mangled names through it. 

Another option (more invasive) would be to switch the demangler to llvm-undname, which understands this syntax.
Flags: needinfo?(dmajor)
For example, given `?arena_dalloc@@YAXPEAX_KPEAUarena_t@@@Z.llvm.1274928215564211852`, if you drop the end bit and run only `?arena_dalloc@@YAXPEAX_KPEAUarena_t@@@Z` through a demangler, you get `void __cdecl arena_dalloc(void *, unsigned __int64, struct arena_t *)`
Socorro uses minidump-stackwalk to parse the miinidump file for stacks. For each stack, it goes through the frames and pulls down symbols files from S3 to symbolicate them. As I understand it, minidump-stackwalk doesn't do any demangling of function names--the symbols files should contain demangled function names.

I looked through some of the crashes in that supersearch query, but don't have any insights.

I'm not sure what the right thing to do here is. If I had my druthers, it'd be whatever is causing the mangled names in the build process gets fixed and we don't do anything in Socorro. I'd rather not write and maintain demangling code in Socorro (though maybe we could just turn around and use breakpad's existing demangling bits?). I don't think adding "skip mangled names" to signature generation is the right thing to do, but maybe?

I don't know offhand what the situation with the llvm compiler is (is it a temporary problem? is it fixed already? is there a bug for it?), but I'm pretty sure Ted knows or Ted knows who knows, so I'm going to tag him in.
Flags: needinfo?(ted)
This is a deliberate mangling performed by LLVM. It's not a bug from their side.

Ok, so I was mistaken about where the demangling occurs -- if it's not in Socorro, then [whatever does the demangling of our symbols currently] should be updated to ignore the end of the string. I was told in #llvm that newer demanglers will drop anything after a dot.
I switched Philipps supersearch query to look at proto-signature. There are 448 instances all of which on Windows NT. The first occurs on July 18th, 2018.

https://crash-stats.mozilla.com/search/?proto_signature=~%40%40%40Z.llvm&date=%3E%3D2018-05-29T20%3A00%3A00.000Z&date=%3C2018-09-12T08%3A42%3A00.000Z&_sort=-date&_facets=platform&_facets=build_id&_facets=proto_signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

Some more examples of mangled names:

* ?arena_dalloc@@YAXPEAX_KPEAUarena_t@@@Z.llvm.1274928215564211852
* ?huge_dalloc@@YAXPAXPAUarena_t@@@Z.llvm.6124856101597571778 
* ?GetLayerActivityForUpdate@mozilla@@YAPEAVLayerActivity@1@PEAVnsIFrame@@@Z.llvm.8837179988416576506 
* ?ErrorLoadingSheet@@YAXPEAVnsIURI@@PEBDW4FailureAction@css@mozilla@@@Z.llvm.14596013026831653111

I checked signature generation to see if we have a precedent for ignoring mangled names and I don't see one. I don't think mangled names in signatures help, but I also don't think ignoring mangled names is better.

Since July, there were 448 of these out of millions of crash reports. Looking at a facet of these signatures, it doesn't seem like it affects one specific cause of crash report.

Ultimately, I think this should get fixed in the build process--it should be demangling names for symbols files. Someone who understands this more should write up that bug.

Beyond that, I'm on the fence about whether to ignore mangled names in signature generation.
Component: Signature → General
Product: Socorro → Firefox Build System
Component: General → Crash Reporting
Product: Firefox Build System → Toolkit
> Beyond that, I'm on the fence about whether to ignore mangled names in
> signature generation.

IMO we shouldn't. If we fix this issue and other ones sneak through in the future, I'd like to know about it (so we can fix it closer to the root cause) rather than silently ignore it.
Duplicate of this bug: 1491156
In bug 1491156, I came across a particularly egregious instance of this that messes up the crash-stats UI because it is so large.
Duplicate of this bug: 1507758
Assignee

Comment 13

6 months ago
This is getting pretty bad. From what I can tell there's ~500 instances of this per week and 10K+ in the last three months.

If we don't want to use llvm-undname we can certainly modify dump_syms to filter out the ".llvm.$(hash)" bits from the symbols. The hash seems to be always 20 characters long and made of digits only.
Assignee

Comment 14

6 months ago
More of this on nightly:

https://crash-stats.mozilla.com/signature/?product=Firefox&signature=%3FMOZ_CrashOOL%40%40YAXPEBDH0%40Z.llvm.5035872941563483689

https://crash-stats.mozilla.com/signature/?product=Firefox&signature=%3FMOZ_CrashOOL%40%40YAXPBDH0%40Z.llvm.6972455652712233956

Ted, would the quick & dirty solution in comment 13 be enough to fix this (at least temporarily)? This is making triage quite challenging especially WRT WebRender panics.
Flags: needinfo?(ted)
(In reply to Gabriele Svelto [:gsvelto] from comment #13)
> The hash seems to be always 20 characters long and made of digits only.

I wouldn't rely on the number 20 -- the digits are a stringification of a uint64 so occasionally it has a shorter length. But \.llvm\.[0-9]+$ should be fine.
Assignee

Comment 16

6 months ago
(In reply to David Major [:dmajor] from comment #15)
> (In reply to Gabriele Svelto [:gsvelto] from comment #13)
> > The hash seems to be always 20 characters long and made of digits only.
> 
> I wouldn't rely on the number 20 -- the digits are a stringification of a
> uint64 so occasionally it has a shorter length. But \.llvm\.[0-9]+$ should
> be fine.

OK, let's do that, we'll think of a proper fix later, we need useful signatures ASAP.
Assignee

Comment 17

6 months ago
BTW I just checked what dump_syms spits out on Linux and I get lines like this one:

PUBLIC 8c2d90 0 GCAtomTable() [clone .llvm.9401545488759447361]
Adding 65 as affected as quite a few signatures in nightly suffer from this problem: In the last 7 days, here is a list of signatures I have come across, many of which do not have bugs associated with them: https://bit.ly/2Qmr0ft. As Gabriele notes, it is difficult to file bugs because the different signatures often have mixed moz crash reasons. Anything to help clarify these crash signatures would be greatly appreciated.
?MOZ_CrashOOL@@YAXPEBDH0@Z.llvm.6210015070520384445 is the top non-shutdown hang crash on the 11-28 Windows Nightlies.
I'm requesting tracking because this issue is making crash triage difficult.
Assignee

Comment 21

6 months ago
Quick update: I've got a quick hack to strip the .llvm.<digits> part in the Windows-specific dump_syms tool and it seems to work well. It works by manually calling UnDecorateSymbolNameW() on the stripped names. I'll put it up for review ASAP.
Assignee

Comment 23

6 months ago
Another quick update: the fix I mentioned in comment 21 is not fully reliable and there doesn't seem to be a way to make it reliable using Microsoft's public APIs. So there's no easy fix for this short of using undocumented functionality or rewriting the Windows dump_syms from scratch. Ted pointed out that there's a readily available PDB parser in rust [1] so I'm experimenting with it to see if it's possible to put together a working alternative to dump_syms with the necessary changes to deal LLVM-specific signatures correctly.

[1] https://github.com/willglynn/pdb
(In reply to Gabriele Svelto [:gsvelto] from comment #23)
> Another quick update: the fix I mentioned in comment 21 is not fully
> reliable and there doesn't seem to be a way to make it reliable using
> Microsoft's public APIs. So there's no easy fix for this short of using
> undocumented functionality or rewriting the Windows dump_syms from scratch.

Can you elaborate on what you tried and what makes it unreliable?
Assignee

Comment 25

6 months ago
(In reply to David Major [:dmajor] from comment #24)
> Can you elaborate on what you tried and what makes it unreliable?

dump_syms uses DIA to access symbols in PDB files so I looked for a way to extract the mangled symbols from IDiaSymbol. Unfortunately there doesn't seem to be a documented way to get them. However calling IDiaSymbol::get_undecoratedNameEx() with the Flags.UNDNAME_TYPE_ONLY flag often returns the mangled symbol, but not always. I hoped I would be able to use that to get the mangled symbol, strip the .llvm.xyz suffix and then feed the whole symbol into UnDecorateSymbolName() but without reliable access to the mangled symbol it doesn't work.

I could not find another way to access mangled symbols though the fact that I can access them sometimes means that they're stored somewhere within DIA. I might poke the IDiaSymbol instances with a debugger to see if the mangled symbol is accessible from there in some undocumented way but I'd rather find a cleaner solution if possible.
Assignee

Comment 26

6 months ago
I may have found another way around the issue: in public symbol names it's possible to directly access the decorated (i.e. mangled) name via the get_name() method. Looking at how DIA reads a PDB it seems that for every function symbol there's a corresponding public symbol name so I might use _that_ to access the mangled name (and demangle it manually without the .llvm.xyz part).
Assignee

Comment 27

6 months ago
Another miss: even when accessed via the public entry symbols with .llvm.xyz suffixes are printed out in undecorated/demangled form.
Assignee

Comment 28

6 months ago
OK I have figured this out, I could find the proper mangled symbols and demangled them by hand using UnDecorateSymbolName(). Patch pending. One of the things that confused me about this work is that I encountered Itanium-mangled signatures in PDB files where I only expected to find VS-mangled signatures. It's unclear if this is a bug or a feature of clang but it made me waste a few days chasing them.
Flags: needinfo?(ted)
(In reply to Gabriele Svelto [:gsvelto] from comment #28)
> OK I have figured this out, I could find the proper mangled symbols and
> demangled them by hand using UnDecorateSymbolName(). Patch pending. One of
> the things that confused me about this work is that I encountered
> Itanium-mangled signatures in PDB files where I only expected to find
> VS-mangled signatures. It's unclear if this is a bug or a feature of clang
> but it made me waste a few days chasing them.

Presumably these Itanium-mangled signatures are from Rust, rather than other parts of the code?
Assignee

Comment 30

5 months ago
(In reply to Nathan Froyd [:froydnj] from comment #29)
> Presumably these Itanium-mangled signatures are from Rust, rather than other
> parts of the code?

Possibly. They also don't show up in opt builds, I only found them in my local debug build.

Comment 32

5 months ago
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a01ad4f60152
Properly demangle build-specific LLVM signatures on Windows r=ted
Assignee

Comment 33

5 months ago
OK, let's land this, onwards to better WebRender signatures.

Comment 34

5 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/a01ad4f60152
Status: NEW → RESOLVED
Last Resolved: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66
Assignee: nobody → gsvelto
Please nominate this for Beta approval when you get a chance. Not setting 64 to wontfix at this point, though I'm not sure we really need it there either since WR experiments are only happening on Beta.
Flags: needinfo?(gsvelto)
Assignee

Comment 36

5 months ago
Comment on attachment 9031059 [details]
Bug 1489094 - Properly demangle build-specific LLVM signatures on Windows

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: Recent versions of rustc/clang/llvm

User impact if declined: No impact on  users but problematic for developers. Crash signatures for certain crashes are build-specific, making them hard to track and causing a single signature to span multiple crash reasons.

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): The patch does not affect Firefox' sources, the changes are only used in automation to dump a build's symbols

String changes made/needed: None
Flags: needinfo?(gsvelto)
Attachment #9031059 - Flags: approval-mozilla-beta?
Comment on attachment 9031059 [details]
Bug 1489094 - Properly demangle build-specific LLVM signatures on Windows

[Triage Comment]
Fixes broken crash signatures so that issues are easier to track across releases. Approved for 65.0b5.
Attachment #9031059 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: qe-verify-
You need to log in before you can comment on or make changes to this bug.