Windows symbol dumper crashes on symbols that refer to rust arrays

RESOLVED FIXED in Firefox 55

Status

()

Core
Build Config
RESOLVED FIXED
a year ago
11 months ago

People

(Reporter: kats, Assigned: kats)

Tracking

Other Branch
mozilla55
Points:
---

Firefox Tracking Flags

(firefox55 fixed)

Details

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(3 attachments)

Created attachment 8842555 [details]
xul.sym

In bug 1342450 I'm trying to have webrender building by default in desktop builds. A recent try push [1] with some patches resulted in mysterious windows test failures. What appears to be happening is that debug builds naturally sometimes hit NS_ASSERTIONs, and a backtrace is dumped. But if the .sym file is bad, the backtrace dumper dies a horrible death. That seems to the case here - in particular it's finding a FUNC line with too few parameters and dies at [2].

I downloaded the symbols zip [3], and looked for FUNC lines with too few tokens. Turns out the xul.sym file looks truncated, because it ends in the middle of an address printout. Full file is attached.

I then looked at the log of the job that generated the symbol file [4], and as far as I can see there's no problems or timeouts reported. The stuff relevant to xul.pdb is this:

...
08:35:47     INFO -  4500: Submitting jobs for files: ('.\\toolkit\\library\\xul.pdb',)
08:35:47     INFO -  4812: Worker processing files: ('.\\toolkit\\library\\xul.pdb',)
08:35:47     INFO -  4812: c:\builds\moz2_slave\try-w32-d-00000000000000000000\build\src\obj-firefox\dist\host\bin\dump_syms.exe .\toolkit\library\xul.pdb
... [snip] ...
08:35:47     INFO -  4340: Worker finished processing ('.\\js\\src\\jsapi-tests\\jsapi-tests.pdb',) in 22.24s
08:35:47     INFO -  4812: Worker finished processing ('.\\toolkit\\library\\xul.pdb',) in 152.75s
08:35:47     INFO -  echo packing symbols
08:35:47     INFO -  packing symbols
...

My best guess is that the file wasn't fully flushed to disk or something before it got zipped up, but that seems unlikely. Or maybe there's a timeout of 150 seconds or so on the worker? I'm not sure, but I'll dig around.

Ted, any thoughts?

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=bcc2605203a77e52b7c6216e0ea7099eb9d77307
[2] http://searchfox.org/mozilla-central/rev/9c1c7106eef137e3413fd867fc1ddfb1d3f6728c/tools/rb/fix_stack_using_bpsyms.py#49
[3] https://queue.taskcluster.net/v1/task/TH_Po8RgR72VzFONT4skhg/artifacts/public/build/firefox-54.0a1.en-US.win32.crashreporter-symbols.zip
[4] https://archive.mozilla.org/pub/firefox/try-builds/kgupta@mozilla.com-bcc2605203a77e52b7c6216e0ea7099eb9d77307/try-win32-debug/try-win32-debug-bm78-try1-build1550.txt.gz
Flags: needinfo?(ted)
FWIW the webrender-enabled windows build on the graphics branch (e.g. [1]) processes xul.pdb in 70.03 seconds but it still appears to be truncated. (You can download the symbols zip file from the job details pane in TreeHerder).

[1] https://treeherder.mozilla.org/#/jobs?repo=graphics&revision=9059ef02d91ba49bb3cf3b9cdd4c92bc99a4d4b0&selectedJob=80866645
I'm starting to suspect the dump_syms.exe program is dying, because when I compare the webrender-enabled and webrender-disabled .sym files, the -disabled one is much larger and has a longer runtime (to completion). So it's not like it's timing out or running out of space. I presume I can reproduce this locally using the firefox-54.0a1.en-US.win32.crashreporter-symbols-full.zip - I'll try doing that tomorrow once I'm home and have access to my windows machine again.

Still, in the meantime, any suggestions or things to look for would be useful.
I can reproduce this locally, it looks like dump_syms.exe is crashing with a divide-by-zero error. I'll need to rebuild dump_syms.exe with debugging stuff and symbols, will chase it down.
So, it looks like at [1] when we try to get the name of the function for a IDiaSymbol, it crashes somewhere inside msdia140.dll with a divide by zero. Seems like a bug in msdia140.dll. I'll see if I can work around it somehow.

[1] http://searchfox.org/mozilla-central/rev/546a05fec017cb785fca62a5199d42812a6a1fd3/toolkit/crashreporter/google-breakpad/src/common/windows/pdb_source_line_writer.cc#973
Flags: needinfo?(ted)
I added a try/catch block to catch the div-by-zero and that seems to work locally. There's something like 33 symbols in the webrender-enabled xul.pdb that have this problem.

Try push to verify: https://treeherder.mozilla.org/#/jobs?repo=try&revision=106917c598749edd1949c214df95dfdf081b80da
Assignee: nobody → bugmail
Ok, that seems to work. The try push has what appears to be a complete xul.sym file, with 53 inaccessible names in the debug build's symbol file.
Comment hidden (mozreview-request)
This would need to go upstream to Breakpad. Do you know what symbols specifically the code is crashing on?
Not specifically - I know it's something in webrender or one of the rust dependencies of webrender, because enabling that is what causes this to trigger. I'm not sure how to find out exactly which symbol - I tried printing the get_name() value but it seems to be non-ascii as far as I can tell. I'm not familiar with windows symbol mangling, is there a tool I can use to produce this information?
I would think the `get_name` function would produce useful output here, but maybe the root cause is that the symbol name has bad data or something? I think you can use WinDBG for this--take the offset address from the FUNC line in the .sym file for one of these functions, then load up the build in WinDBG and do `ln xul+0xaddr`, which should show you the matching function name. I haven't tested that--it's possible that WinDBG command isn't quite right. If it doesn't work dmajor might be able to advise, he knows more WinDBG tricks than I do.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #10)
> I would think the `get_name` function would produce useful output here, but
> maybe the root cause is that the symbol name has bad data or something?

This was my mistake - I was doing a bad job of trying to print the BSTR.

> I
> think you can use WinDBG for this--take the offset address from the FUNC
> line in the .sym file for one of these functions, then load up the build in
> WinDBG and do `ln xul+0xaddr`, which should show you the matching function
> name.

I did this on some of the addresses, and got results like:
    xul!core::array::{{impl}}::into_iter<webrender::internal_types::BlurAttribute> (void)
    xul!core::array::{{impl}}::into_iter<webrender::internal_types::BlurAttribute> (void)
    xul!core::array::{{impl}}::into_iter<webrender::texture_cache::FreeListBin> (void)
    xul!core::array::{{impl}}::into_iter<webrender::internal_types::ClipAttribute> (void)
    xul!gamma_lut::GammaLut::replace_pixels_rgb (void)

Now that I'm using get_name properly I get similar results there. I'll update the patch to fall through to the get_name codepath instead of returning "name inaccessible", since that will produce more useful results.
Comment hidden (mozreview-request)
This sounds fishy. I'd like to better understand why this crash happens before we sweep it under the rug. Mind if I poke at this a bit before you land?
I have no objections to you poking at it, but I also don't want to delay landing this patch, as it is blocking Quantum Render work. Specifically, without this we cannot properly run debug tests on windows QR-enabled builds. If it turns out there's a better way to fix this I'm happy to back out this patch later.

Also for the record, I have a friend who works on the VS studio team at Microsoft, so I sent him STR, he said he would file a bug for it internally. If you want a standalone way to reproduce this, you can download the bundle I made at https://people-mozilla.org/~kgupta/tmp/dia_bug.tar.gz (warning: 100+ MB file, includes a 600+ MB xul.pdb file), build the dump_syms VS project inside the folder, and run the binary on the included xul.pdb file.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #10)
> I would think the `get_name` function would produce useful output here, but
> maybe the root cause is that the symbol name has bad data or something? I
> think you can use WinDBG for this--take the offset address from the FUNC
> line in the .sym file for one of these functions, then load up the build in
> WinDBG and do `ln xul+0xaddr`, which should show you the matching function
> name. I haven't tested that--it's possible that WinDBG command isn't quite
> right. If it doesn't work dmajor might be able to advise, he knows more
> WinDBG tricks than I do.

At the stack frame of:
dump_syms!google_breakpad::PDBSourceLineWriter::PrintFunction

Some of the locals are:
          length = 0x10
             rva = 0x1a64c0

Disassembling that rva in a second debugger gets:
0:000> u xul+0x1a64c0 xul+0x1a64c0+0x10
xul+0x1a64c0:
101a64c0 55              push    ebp
101a64c1 89e5            mov     ebp,esp
101a64c3 56              push    esi
101a64c4 89ce            mov     esi,ecx
101a64c6 e8551e0200      call    xul+0x1c8320 (101c8320)
101a64cb 89f0            mov     eax,esi
101a64cd 5e              pop     esi
101a64ce 5d              pop     ebp
101a64cf c3              ret
101a64d0 55              push    ebp

So the |length| is at least sane-looking. But notice that the debugger didn't print a name for the function. Confirmed with "ln", the result is empty.

Seems like we're doing a bad job of producing symbols for rust functions. Even if we get dump_syms crawling, this may pose a problem for live debugging.

I have the div-by-zero up in my debugger but haven't dug into it yet.
Ok, so around the crash stack of:

00 0073ebb0 6b3f2601 msdia140!_aulldiv+0x14
01 0073ec8c 6b3f2a1b msdia140!SymbolNameBuilder::AppendArrayTypeName+0x3b1
02 0073edc8 6b3f2740 msdia140!SymbolNameBuilder::AppendTypeName+0x12b
03 0073edec 6b3f29fc msdia140!SymbolNameBuilder::AppendPointerTypeName+0x30
04 0073ef28 6b3f2ced msdia140!SymbolNameBuilder::AppendTypeName+0x10c
05 0073ef5c 6b3f1d88 msdia140!SymbolNameBuilder::AppendFunctionName+0x18d
06 0073ef7c 6b3f3209 msdia140!SymbolNameBuilder::GetName+0x58
07 0073eff0 6b3986b8 msdia140!GetUndecoratedName+0x389
08 0073f42c 0030cc4a msdia140!CDiaSymbol::get_undecoratedNameEx+0x2b8
09 0073f560 0030f1c6 dump_syms!google_breakpad::PDBSourceLineWriter::GetSymbolFunctionName+0x4a

MSDIA is doing something like (pseudocode):

uint64_t numerator, denominator;
pDiaSymbol1->get_length(&numerator);   // == 3
pDiaSymbol2->get_length(&denominator); // == 0
sprintf(tmpBuf, "[%llu]", numerator / denominator);
Append(symbolNameBuf, tmpBuf);

Given the square brackets in the printf and the word "Array" in the function name, I bet it's trying to calculate an array size with the old sizeof(thing)/sizeof(thing[0]) formula.

In my debugger I called get_name on these DiaSymbols, and the thing-of-length-3 has name "", and the thing-of-length-0 has name "webrender::internal_types::BlurAttribute".

So I guess this is an array of BlurAttribute? But the PDB thinks that a single BlueAttribute has size zero?
Created attachment 8844291 [details] [diff] [review]
Rust runction that takes an array and crashes dump_syms

It is indeed arrays that are triggering this. And it doesn't have to be custom types: even a [u8; 256] is enough to crash dump_syms. Here's a file that crashes when applied to m-c.

Anybody know where in the code rustc generates debug info for arrays?
Summary: Windows symbol dumper might have issues when dealing with lots of rust code → Windows symbol dumper crashes on symbols that refer to rust arrays
That'd be a good question to ask in #rust-internals on IRC.
No luck in #rust-internals. Anthony, can you see if mw is able to help with this? (I'm not sure he has a bugzilla account I can needinfo yet...)

The problem is that the debug information for rust array types says that the size is 0 (even for non-zero-size types). This code may be a starting point, but I'm just guessing: https://dxr.mozilla.org/rust/search?q=fixed_vec_metadata
Flags: needinfo?(ajones)

Comment 20

a year ago
I'll look into it. Did you open a Github issue in the Rust repo?
(In reply to michaelwoerister from comment #20)
> I'll look into it. Did you open a Github issue in the Rust repo?

Thanks.
Flags: needinfo?(ajones)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #14)
 Also for the record, I have a friend who works on the VS studio team at
> Microsoft, so I sent him STR, he said he would file a bug for it internally.

This happened, and the issue was fixed. It will eventually make its way out into a release.
(In reply to michaelwoerister from comment #20)
> I'll look into it. Did you open a Github issue in the Rust repo?

I didn't see an issue filed by :dmajor, so I filed https://github.com/rust-lang/rust/issues/40477

Comment 24

a year ago
I've taken a first look at this and it seems to arise when there's the Rust equivalent of a C "flexible array member" [1], that is, an array the size of which is not known at compile-time. What the Rust compiler does at the moment results in correct (or at least sensible) DWARF: A DW_TAG_subrange_type with a DW_AT_lower_bound but no DW_AT_count. I haven't looked at the MSVC version yet. It would be interesting what kind of debuginfo MSVC generates for a C struct like:

struct line {
  int length;
  char contents[];
};

[1] https://en.wikipedia.org/wiki/Flexible_array_member
(In reply to michaelwoerister from comment #24)
> I've taken a first look at this and it seems to arise when there's the Rust
> equivalent of a C "flexible array member" [1], that is, an array the size of
> which is not known at compile-time.

Comment 17 shows this happening even for `[u8; 256]`. Is that really considered to be unknown-size?

Comment 26

a year ago
(In reply to David Major [:dmajor] from comment #25)
> Comment 17 shows this happening even for `[u8; 256]`. Is that really
> considered to be unknown-size?

No, that by itself should not be considered to be of unknown size. However, I suspect that there's an implicit coercion to [u8] in that code somewhere.

To verify that, you could try to compile the following library (which should not contain any coercions) and see if that crashes too:

#![crate_type="rlib"]

pub fn foo() -> [u8; 256] {
   [0; 256]
}

Comment 27

a year ago
mozreview-review
Comment on attachment 8843408 [details]
Bug 1343625 - Catch and gracefully handle div-by-zero exceptions in get_undecoratedNameEx.

https://reviewboard.mozilla.org/r/117182/#review122634

OK. Let's just land this as a local patch instead of upstreaming it to Breakpad. Hopefully upstream Rust will fix this and we'll be able to remove it when we pick up that fix.
Attachment #8843408 - Flags: review?(ted) → review+

Comment 28

a year ago
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0286599493ff
Catch and gracefully handle div-by-zero exceptions in get_undecoratedNameEx. r=ted

Comment 29

a year ago
Is there an easy way for me to reproduce this locally?
Breakpad has a prebuilt dump_syms binary that should be sufficient to reproduce the crash given a PDB file containing the rust debug info:
https://chromium.googlesource.com/breakpad/breakpad/+/master/src/tools/windows/binaries/

Comment 31

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/0286599493ff
Status: NEW → RESOLVED
Last Resolved: a year ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55

Comment 32

a year ago
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #30)
> Breakpad has a prebuilt dump_syms binary that should be sufficient to
> reproduce the crash given a PDB file containing the rust debug info:
> https://chromium.googlesource.com/breakpad/breakpad/+/master/src/tools/
> windows/binaries/

Oh, nice! That *is* easy. Thanks Ted!

Comment 33

11 months ago
I was able to reproduce this with a small test program and looked into it some more. dmajor was right: A simple fixed-size array is sufficient to trigger the bug. I'm not sure how to best fix this. The Rust compiler is producing the correct LLVM metadata, and (as far as I can tell, see [1] and [2]) LLVM produces the correct CodeView debuginfo from that.

[1] https://github.com/rust-lang/llvm/blob/859fb269364623b17e092efaba3f94e70ce97c5e/lib/CodeGen/AsmPrinter/CodeViewDebug.cpp#L982
[2] http://pierrelib.pagesperso-orange.fr/exec_formats/MS_Symbol_Type_v1.0.pdf (p. 33)
Is there equivalent C code (like your comment 24) that you could run through clang-cl to see what the generated debug info looks like?
If possible, running the same code through MSVC to compare seems useful.

Comment 36

11 months ago
So I ran this through Clang and it seems that dump_syms has a problem with fixed-size arrays as function parameters. I'll have to investigate further. It might be related to the array being implicitly passed as a pointer.
You need to log in before you can comment on or make changes to this bug.