Closed
Bug 1343625
Opened 8 years ago
Closed 8 years ago
Windows symbol dumper crashes on symbols that refer to rust arrays
Categories
(Firefox Build System :: General, defect)
Tracking
(firefox55 fixed)
RESOLVED
FIXED
mozilla55
Tracking | Status | |
---|---|---|
firefox55 | --- | fixed |
People
(Reporter: kats, Assigned: kats)
References
(Depends on 1 open bug)
Details
Attachments
(3 files)
In bug 1342450 I'm trying to have webrender building by default in desktop builds. A recent try push [1] with some patches resulted in mysterious windows test failures. What appears to be happening is that debug builds naturally sometimes hit NS_ASSERTIONs, and a backtrace is dumped. But if the .sym file is bad, the backtrace dumper dies a horrible death. That seems to the case here - in particular it's finding a FUNC line with too few parameters and dies at [2].
I downloaded the symbols zip [3], and looked for FUNC lines with too few tokens. Turns out the xul.sym file looks truncated, because it ends in the middle of an address printout. Full file is attached.
I then looked at the log of the job that generated the symbol file [4], and as far as I can see there's no problems or timeouts reported. The stuff relevant to xul.pdb is this:
...
08:35:47 INFO - 4500: Submitting jobs for files: ('.\\toolkit\\library\\xul.pdb',)
08:35:47 INFO - 4812: Worker processing files: ('.\\toolkit\\library\\xul.pdb',)
08:35:47 INFO - 4812: c:\builds\moz2_slave\try-w32-d-00000000000000000000\build\src\obj-firefox\dist\host\bin\dump_syms.exe .\toolkit\library\xul.pdb
... [snip] ...
08:35:47 INFO - 4340: Worker finished processing ('.\\js\\src\\jsapi-tests\\jsapi-tests.pdb',) in 22.24s
08:35:47 INFO - 4812: Worker finished processing ('.\\toolkit\\library\\xul.pdb',) in 152.75s
08:35:47 INFO - echo packing symbols
08:35:47 INFO - packing symbols
...
My best guess is that the file wasn't fully flushed to disk or something before it got zipped up, but that seems unlikely. Or maybe there's a timeout of 150 seconds or so on the worker? I'm not sure, but I'll dig around.
Ted, any thoughts?
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=bcc2605203a77e52b7c6216e0ea7099eb9d77307
[2] http://searchfox.org/mozilla-central/rev/9c1c7106eef137e3413fd867fc1ddfb1d3f6728c/tools/rb/fix_stack_using_bpsyms.py#49
[3] https://queue.taskcluster.net/v1/task/TH_Po8RgR72VzFONT4skhg/artifacts/public/build/firefox-54.0a1.en-US.win32.crashreporter-symbols.zip
[4] https://archive.mozilla.org/pub/firefox/try-builds/kgupta@mozilla.com-bcc2605203a77e52b7c6216e0ea7099eb9d77307/try-win32-debug/try-win32-debug-bm78-try1-build1550.txt.gz
Flags: needinfo?(ted)
Assignee | ||
Comment 1•8 years ago
|
||
FWIW the webrender-enabled windows build on the graphics branch (e.g. [1]) processes xul.pdb in 70.03 seconds but it still appears to be truncated. (You can download the symbols zip file from the job details pane in TreeHerder).
[1] https://treeherder.mozilla.org/#/jobs?repo=graphics&revision=9059ef02d91ba49bb3cf3b9cdd4c92bc99a4d4b0&selectedJob=80866645
Assignee | ||
Comment 2•8 years ago
|
||
I'm starting to suspect the dump_syms.exe program is dying, because when I compare the webrender-enabled and webrender-disabled .sym files, the -disabled one is much larger and has a longer runtime (to completion). So it's not like it's timing out or running out of space. I presume I can reproduce this locally using the firefox-54.0a1.en-US.win32.crashreporter-symbols-full.zip - I'll try doing that tomorrow once I'm home and have access to my windows machine again.
Still, in the meantime, any suggestions or things to look for would be useful.
Assignee | ||
Comment 3•8 years ago
|
||
I can reproduce this locally, it looks like dump_syms.exe is crashing with a divide-by-zero error. I'll need to rebuild dump_syms.exe with debugging stuff and symbols, will chase it down.
Assignee | ||
Comment 4•8 years ago
|
||
So, it looks like at [1] when we try to get the name of the function for a IDiaSymbol, it crashes somewhere inside msdia140.dll with a divide by zero. Seems like a bug in msdia140.dll. I'll see if I can work around it somehow.
[1] http://searchfox.org/mozilla-central/rev/546a05fec017cb785fca62a5199d42812a6a1fd3/toolkit/crashreporter/google-breakpad/src/common/windows/pdb_source_line_writer.cc#973
Flags: needinfo?(ted)
Assignee | ||
Comment 5•8 years ago
|
||
I added a try/catch block to catch the div-by-zero and that seems to work locally. There's something like 33 symbols in the webrender-enabled xul.pdb that have this problem.
Try push to verify: https://treeherder.mozilla.org/#/jobs?repo=try&revision=106917c598749edd1949c214df95dfdf081b80da
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → bugmail
Assignee | ||
Comment 6•8 years ago
|
||
Ok, that seems to work. The try push has what appears to be a complete xul.sym file, with 53 inaccessible names in the debug build's symbol file.
Comment hidden (mozreview-request) |
Comment 8•8 years ago
|
||
This would need to go upstream to Breakpad. Do you know what symbols specifically the code is crashing on?
Assignee | ||
Comment 9•8 years ago
|
||
Not specifically - I know it's something in webrender or one of the rust dependencies of webrender, because enabling that is what causes this to trigger. I'm not sure how to find out exactly which symbol - I tried printing the get_name() value but it seems to be non-ascii as far as I can tell. I'm not familiar with windows symbol mangling, is there a tool I can use to produce this information?
Comment 10•8 years ago
|
||
I would think the `get_name` function would produce useful output here, but maybe the root cause is that the symbol name has bad data or something? I think you can use WinDBG for this--take the offset address from the FUNC line in the .sym file for one of these functions, then load up the build in WinDBG and do `ln xul+0xaddr`, which should show you the matching function name. I haven't tested that--it's possible that WinDBG command isn't quite right. If it doesn't work dmajor might be able to advise, he knows more WinDBG tricks than I do.
Assignee | ||
Comment 11•8 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #10)
> I would think the `get_name` function would produce useful output here, but
> maybe the root cause is that the symbol name has bad data or something?
This was my mistake - I was doing a bad job of trying to print the BSTR.
> I
> think you can use WinDBG for this--take the offset address from the FUNC
> line in the .sym file for one of these functions, then load up the build in
> WinDBG and do `ln xul+0xaddr`, which should show you the matching function
> name.
I did this on some of the addresses, and got results like:
xul!core::array::{{impl}}::into_iter<webrender::internal_types::BlurAttribute> (void)
xul!core::array::{{impl}}::into_iter<webrender::internal_types::BlurAttribute> (void)
xul!core::array::{{impl}}::into_iter<webrender::texture_cache::FreeListBin> (void)
xul!core::array::{{impl}}::into_iter<webrender::internal_types::ClipAttribute> (void)
xul!gamma_lut::GammaLut::replace_pixels_rgb (void)
Now that I'm using get_name properly I get similar results there. I'll update the patch to fall through to the get_name codepath instead of returning "name inaccessible", since that will produce more useful results.
Comment hidden (mozreview-request) |
![]() |
||
Comment 13•8 years ago
|
||
This sounds fishy. I'd like to better understand why this crash happens before we sweep it under the rug. Mind if I poke at this a bit before you land?
Assignee | ||
Comment 14•8 years ago
|
||
I have no objections to you poking at it, but I also don't want to delay landing this patch, as it is blocking Quantum Render work. Specifically, without this we cannot properly run debug tests on windows QR-enabled builds. If it turns out there's a better way to fix this I'm happy to back out this patch later.
Also for the record, I have a friend who works on the VS studio team at Microsoft, so I sent him STR, he said he would file a bug for it internally. If you want a standalone way to reproduce this, you can download the bundle I made at https://people-mozilla.org/~kgupta/tmp/dia_bug.tar.gz (warning: 100+ MB file, includes a 600+ MB xul.pdb file), build the dump_syms VS project inside the folder, and run the binary on the included xul.pdb file.
![]() |
||
Comment 15•8 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #10)
> I would think the `get_name` function would produce useful output here, but
> maybe the root cause is that the symbol name has bad data or something? I
> think you can use WinDBG for this--take the offset address from the FUNC
> line in the .sym file for one of these functions, then load up the build in
> WinDBG and do `ln xul+0xaddr`, which should show you the matching function
> name. I haven't tested that--it's possible that WinDBG command isn't quite
> right. If it doesn't work dmajor might be able to advise, he knows more
> WinDBG tricks than I do.
At the stack frame of:
dump_syms!google_breakpad::PDBSourceLineWriter::PrintFunction
Some of the locals are:
length = 0x10
rva = 0x1a64c0
Disassembling that rva in a second debugger gets:
0:000> u xul+0x1a64c0 xul+0x1a64c0+0x10
xul+0x1a64c0:
101a64c0 55 push ebp
101a64c1 89e5 mov ebp,esp
101a64c3 56 push esi
101a64c4 89ce mov esi,ecx
101a64c6 e8551e0200 call xul+0x1c8320 (101c8320)
101a64cb 89f0 mov eax,esi
101a64cd 5e pop esi
101a64ce 5d pop ebp
101a64cf c3 ret
101a64d0 55 push ebp
So the |length| is at least sane-looking. But notice that the debugger didn't print a name for the function. Confirmed with "ln", the result is empty.
Seems like we're doing a bad job of producing symbols for rust functions. Even if we get dump_syms crawling, this may pose a problem for live debugging.
I have the div-by-zero up in my debugger but haven't dug into it yet.
![]() |
||
Comment 16•8 years ago
|
||
Ok, so around the crash stack of:
00 0073ebb0 6b3f2601 msdia140!_aulldiv+0x14
01 0073ec8c 6b3f2a1b msdia140!SymbolNameBuilder::AppendArrayTypeName+0x3b1
02 0073edc8 6b3f2740 msdia140!SymbolNameBuilder::AppendTypeName+0x12b
03 0073edec 6b3f29fc msdia140!SymbolNameBuilder::AppendPointerTypeName+0x30
04 0073ef28 6b3f2ced msdia140!SymbolNameBuilder::AppendTypeName+0x10c
05 0073ef5c 6b3f1d88 msdia140!SymbolNameBuilder::AppendFunctionName+0x18d
06 0073ef7c 6b3f3209 msdia140!SymbolNameBuilder::GetName+0x58
07 0073eff0 6b3986b8 msdia140!GetUndecoratedName+0x389
08 0073f42c 0030cc4a msdia140!CDiaSymbol::get_undecoratedNameEx+0x2b8
09 0073f560 0030f1c6 dump_syms!google_breakpad::PDBSourceLineWriter::GetSymbolFunctionName+0x4a
MSDIA is doing something like (pseudocode):
uint64_t numerator, denominator;
pDiaSymbol1->get_length(&numerator); // == 3
pDiaSymbol2->get_length(&denominator); // == 0
sprintf(tmpBuf, "[%llu]", numerator / denominator);
Append(symbolNameBuf, tmpBuf);
Given the square brackets in the printf and the word "Array" in the function name, I bet it's trying to calculate an array size with the old sizeof(thing)/sizeof(thing[0]) formula.
In my debugger I called get_name on these DiaSymbols, and the thing-of-length-3 has name "", and the thing-of-length-0 has name "webrender::internal_types::BlurAttribute".
So I guess this is an array of BlurAttribute? But the PDB thinks that a single BlueAttribute has size zero?
![]() |
||
Comment 17•8 years ago
|
||
It is indeed arrays that are triggering this. And it doesn't have to be custom types: even a [u8; 256] is enough to crash dump_syms. Here's a file that crashes when applied to m-c.
Anybody know where in the code rustc generates debug info for arrays?
Summary: Windows symbol dumper might have issues when dealing with lots of rust code → Windows symbol dumper crashes on symbols that refer to rust arrays
Comment 18•8 years ago
|
||
That'd be a good question to ask in #rust-internals on IRC.
![]() |
||
Comment 19•8 years ago
|
||
No luck in #rust-internals. Anthony, can you see if mw is able to help with this? (I'm not sure he has a bugzilla account I can needinfo yet...)
The problem is that the debug information for rust array types says that the size is 0 (even for non-zero-size types). This code may be a starting point, but I'm just guessing: https://dxr.mozilla.org/rust/search?q=fixed_vec_metadata
Flags: needinfo?(ajones)
Comment 20•8 years ago
|
||
I'll look into it. Did you open a Github issue in the Rust repo?
(In reply to michaelwoerister from comment #20)
> I'll look into it. Did you open a Github issue in the Rust repo?
Thanks.
Flags: needinfo?(ajones)
Assignee | ||
Comment 22•8 years ago
|
||
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #14)
Also for the record, I have a friend who works on the VS studio team at
> Microsoft, so I sent him STR, he said he would file a bug for it internally.
This happened, and the issue was fixed. It will eventually make its way out into a release.
Assignee | ||
Comment 23•8 years ago
|
||
(In reply to michaelwoerister from comment #20)
> I'll look into it. Did you open a Github issue in the Rust repo?
I didn't see an issue filed by :dmajor, so I filed https://github.com/rust-lang/rust/issues/40477
See Also: → https://github.com/rust-lang/rust/issues/40477
Comment 24•8 years ago
|
||
I've taken a first look at this and it seems to arise when there's the Rust equivalent of a C "flexible array member" [1], that is, an array the size of which is not known at compile-time. What the Rust compiler does at the moment results in correct (or at least sensible) DWARF: A DW_TAG_subrange_type with a DW_AT_lower_bound but no DW_AT_count. I haven't looked at the MSVC version yet. It would be interesting what kind of debuginfo MSVC generates for a C struct like:
struct line {
int length;
char contents[];
};
[1] https://en.wikipedia.org/wiki/Flexible_array_member
![]() |
||
Comment 25•8 years ago
|
||
(In reply to michaelwoerister from comment #24)
> I've taken a first look at this and it seems to arise when there's the Rust
> equivalent of a C "flexible array member" [1], that is, an array the size of
> which is not known at compile-time.
Comment 17 shows this happening even for `[u8; 256]`. Is that really considered to be unknown-size?
Comment 26•8 years ago
|
||
(In reply to David Major [:dmajor] from comment #25)
> Comment 17 shows this happening even for `[u8; 256]`. Is that really
> considered to be unknown-size?
No, that by itself should not be considered to be of unknown size. However, I suspect that there's an implicit coercion to [u8] in that code somewhere.
To verify that, you could try to compile the following library (which should not contain any coercions) and see if that crashes too:
#![crate_type="rlib"]
pub fn foo() -> [u8; 256] {
[0; 256]
}
Comment 27•8 years ago
|
||
mozreview-review |
Comment on attachment 8843408 [details]
Bug 1343625 - Catch and gracefully handle div-by-zero exceptions in get_undecoratedNameEx.
https://reviewboard.mozilla.org/r/117182/#review122634
OK. Let's just land this as a local patch instead of upstreaming it to Breakpad. Hopefully upstream Rust will fix this and we'll be able to remove it when we pick up that fix.
Attachment #8843408 -
Flags: review?(ted) → review+
Comment 28•8 years ago
|
||
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0286599493ff
Catch and gracefully handle div-by-zero exceptions in get_undecoratedNameEx. r=ted
Comment 29•8 years ago
|
||
Is there an easy way for me to reproduce this locally?
Comment 30•8 years ago
|
||
Breakpad has a prebuilt dump_syms binary that should be sufficient to reproduce the crash given a PDB file containing the rust debug info:
https://chromium.googlesource.com/breakpad/breakpad/+/master/src/tools/windows/binaries/
Comment 31•8 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox55:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Comment 32•8 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #30)
> Breakpad has a prebuilt dump_syms binary that should be sufficient to
> reproduce the crash given a PDB file containing the rust debug info:
> https://chromium.googlesource.com/breakpad/breakpad/+/master/src/tools/
> windows/binaries/
Oh, nice! That *is* easy. Thanks Ted!
Comment 33•8 years ago
|
||
I was able to reproduce this with a small test program and looked into it some more. dmajor was right: A simple fixed-size array is sufficient to trigger the bug. I'm not sure how to best fix this. The Rust compiler is producing the correct LLVM metadata, and (as far as I can tell, see [1] and [2]) LLVM produces the correct CodeView debuginfo from that.
[1] https://github.com/rust-lang/llvm/blob/859fb269364623b17e092efaba3f94e70ce97c5e/lib/CodeGen/AsmPrinter/CodeViewDebug.cpp#L982
[2] http://pierrelib.pagesperso-orange.fr/exec_formats/MS_Symbol_Type_v1.0.pdf (p. 33)
Comment 34•8 years ago
|
||
Is there equivalent C code (like your comment 24) that you could run through clang-cl to see what the generated debug info looks like?
Comment 35•8 years ago
|
||
If possible, running the same code through MSVC to compare seems useful.
Comment 36•8 years ago
|
||
So I ran this through Clang and it seems that dump_syms has a problem with fixed-size arrays as function parameters. I'll have to investigate further. It might be related to the array being implicitly passed as a pointer.
Updated•7 years ago
|
Product: Core → Firefox Build System
You need to log in
before you can comment on or make changes to this bug.
Description
•