Closed Bug 526484 Opened 15 years ago Closed 15 years ago

symbol dumper behavior fails to include certain code blocks produced by PGO (some sections of functions missing in output)

Categories

(Toolkit :: Crash Reporting, defect)

x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
status1.9.2 --- final-fixed
status1.9.1 --- .8-fixed

People

(Reporter: ted, Assigned: ted)

Details

Attachments

(1 file)

In bug 525326 comment 18, Jeff notes that the top of the stack here: http://crash-stats.mozilla.com/report/index/7c622a06-1d57-4d34-afd8-540af2091016 is in known code, despite showing up as "xul.dll@0x34311e". If you look at the source of the frame above it, it's clearly calling "GifWrite()", and I suspect GifWrite has been inlined, since that's the only call site for it. I think maybe the Windows symbol dumper isn't handling inlined functions correctly in this case. We've also seen other instances that could be the same thing, for example: http://crash-stats.mozilla.com/report/index/1c3fde12-d276-4c7b-9d2c-917a82091104?p=1 We clearly have mozjs.dll symbols there, just not for that top frame. jimb noted elsewhere while working on DWARF CFI work that the Breakpad symbol format doesn't actually handle inlined methods cleanly. IIRC, on OSX/Linux we currently just stick the file/line info of the inlined method inside the function data of the containing method, so in crash stacks you get the name of the containing method, and the file/line info of the inlined method.
The same crash on 3.5.4, since the crash reports linked in comment 0 are nightlies, and as such their symbols have already been deleted from the symbol server (oops): http://crash-stats.mozilla.com/report/index/5878c51c-c28f-4bf4-afff-030c52091112 The libxul symbols for 3.5.4 on Windows: http://symbols.mozilla.org/firefox/xul.pdb/438A50233B37496A8B73EA977D68E98F2/xul.sym (there's also the compressed xul.pd_ in that same directory) There appears to be an entry for GifWrite in the symbol file: FUNC 157f10 4c4 8 nsGIFDecoder2::GifWrite(unsigned char const *,unsigned int) but that's clearly not the address in the stack trace (xul.dll@0x348864). Also, the entry for nsGIFDecoder2::ProcessData looks sane: FUNC 188635 2e c nsGIFDecoder2::ProcessData(unsigned char *,unsigned int,unsigne d int *) 188635 0 247 193060 188635 e 250 193060 188643 9 251 193060 18864c d 254 193060 188659 7 260 193060 188660 3 263 193060 (that's the entirety of it). So it doesn't appear to actually have been inlined. I'm going to poke at the PDB file with the DIA APIs to see what it thinks is at that address.
Summary: investigate symbol dumper behavior with inlined functions on windows → investigate symbol dumper behavior on windows
using the DIA API findSymbolByRVA, I get the following info for that address: Found something at 0x348864 ================== Symbol type: SymTagBlock No name present No source file present Is code: no Is function: no Has debug info: no Is compiler generated: no Found lexical parent ================== Symbol type: SymTagFunction Name: nsGIFDecoder2::GifWrite Source File: nsGIFDecoder2::GifWrite Is code: no Is function: no Has debug info: no Is compiler generated: no So it's a code block that's apparently a child of GifWrite, but it doesn't have its own name.
I guess part of the problem here is that Breakpad's symbol format doesn't know how to represent noncontiguous functions, which is what this apparently is. The function printer in the symbol dumping code just asks for a list of functions by RVA, and then prints them: http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/common/windows/pdb_source_line_writer.cc#226 Since this isn't actually a function, it probably doesn't get enumerated there. If I had to guess, I would think that we could call IDiaSymbol::findChildren on each function we get, then for any blocks we find that aren't contained in the function's address space, we could print out an extra FUNC line with the block's rva and size, and its line info.
That guess doesn't seem to pan out. Using DIA, I can get this block if I call findSymbolByRVA with its RVA, and then get the parent function using get_lexicalParent, but if I have the parent function, this block isn't anywhere I can get it using findChildren (it's not a descendant node at all in that tree). There must be some way to collect these orphaned blocks and associate them with the function they came from, I'll keep looking. Currently dump_syms uses getSymbolsByAddr to enumerate over all functions in the PDB, but this doesn't find these code blocks.
Summary: investigate symbol dumper behavior on windows → symbol dumper behavior fails to include certain code blocks produced by PGO (some sections of functions missing in output)
Ok, after much fiddling, I was able to hack dump_syms such that it can locate these missing blocks of code. I can make it output extra FUNC lines, so that we should get function names for these blocks of code, but there doesn't appear to be source line data available at all in the PDB files. This means that we won't get source file + line numbers for them. I'll have to test with VC9 and see if it's any better in this regard. If not, I'll file a bug on Microsoft.
Since I said I thought there was a compiler bug (or at least deficiency) here, Jeff asked what would happen if a Microsoft debugger were to look at these blocks of code. I used the testcase from bug 525326 to crash a Firefox 3.5.4 with WinDBG attached. The top of the stack looks like: ChildEBP RetAddr 0012f96c 10188643 xul!nsGIFDecoder2::GifWrite+0x1f0a35 0012f978 1018867b xul!nsGIFDecoder2::ProcessData(unsigned char * data = 0x025ed608 "D???", unsigned int count = 0x4e0b020, unsigned int * _retval = 0x04cb20a0)+0xe [e:\builds\moz2_slave\win32_build\build\modules\libpr0n\decoders\gif\nsgifdecoder2.cpp @ 251] I don't get source info for the top frame. I have source server debugging enabled, and I get source if I click on the second frame. It seems pretty clear that the compiler just isn't emitting full debug info for these blocks of code that were rearranged by the optimizer. I'm going to try producing a reduced testcase so I can test on VC9 (and the VC10 beta). I'll also get my patch cleaned up and submitted upstream, since having function names is better than what we have now (and appears to be the best we can do with the compiler limitations).
Assignee: nobody → ted.mielczarek
Status: NEW → ASSIGNED
Patch up for upstream review: http://breakpad.appspot.com/40007
I landed this upstream. We should just take it locally (and on branches) because merging with upstream is a pain due to the Linux client rewrite right now. (I will merge that in bug 514188 eventually.)
Actually, I forgot, we don't build dump_syms on Windows (because it doesn't build on MSVC Express, I think), so I just need to update the checked-in binary.
I have another patch awaiting review upstream that I'd like to land first: http://breakpad.appspot.com/43001/show
Not much to review here, really. The upstream patches that this picks up are listed in this bug. I built this using VC8, and the dump_syms.vcproj in the Breakpad tree. The only change I made was to switch it to static CRT linkage. We should take this on all of our branches, ideally, since it's nothing but good. I can even regenerate our existing symbols for the latest release on each branch using this, I'll file a separate bug for that.
Attachment #414265 - Flags: review?(benjamin)
Attachment #414265 - Flags: review?(benjamin) → review+
I filed bug 530783 on that.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment on attachment 414265 [details] [diff] [review] update dump_syms.exe to one built from SVN r430 We should take this on branches to get better symbols on Windows builds.
Attachment #414265 - Flags: approval1.9.2?
Attachment #414265 - Flags: approval1.9.1.7?
Whiteboard: [landed 12-14, want one day bake before 1.9.2]
Maybe this is 3.6.1 instead of 3.6, but nominating anyway. Would like this landed on 1.9.2 before 1.9.1
Flags: blocking1.9.2?
blocking1.9.1: --- → needed
Comment on attachment 414265 [details] [diff] [review] update dump_syms.exe to one built from SVN r430 a192=beltzner, let's get this in, but if it bounces, we'll pull it, so not blocking.
Attachment #414265 - Flags: approval1.9.2? → approval1.9.2+
Flags: wanted1.9.2+
Flags: blocking1.9.2?
Flags: blocking1.9.2-
Pushed to 1.9.2: http://hg.mozilla.org/releases/mozilla-1.9.2/rev/c576c15e01a9 Kind of tough to verify, but I ssh'ed to a place with access to the symbol store, and checked the count of functions reported in each symbol file: $ grep xul.sym firefox-3.7a1pre-WINNT-20091214045914-symbols.txt xul.pdb/F6409D49C333416297B8388C2C8F5F592/xul.sym $ grep ^FUNC xul.pdb/F6409D49C333416297B8388C2C8F5F592/xul.sym | wc -l 63971 $ grep xul.sym firefox-3.7a1pre-WINNT-20091215045531-symbols.txt xul.pdb/52C531559AD943F8A6FB24BB407EE5292/xul.sym $ grep ^FUNC xul.pdb/52C531559AD943F8A6FB24BB407EE5292/xul.sym | wc -l 74972 You can see that we're reporting about 11k more function names, which are the bits that were missing.
Whiteboard: [landed 12-14, want one day bake before 1.9.2]
Whiteboard: [waiting on 1.9.2rc1 bake]
Comment on attachment 414265 [details] [diff] [review] update dump_syms.exe to one built from SVN r430 a=beltzner for 191 branch
Attachment #414265 - Flags: approval1.9.1.8? → approval1.9.1.8+
Verified on 1.9.2, while I was at it: $ grep xul.sym firefox-3.6b6pre-WINNT-20091221053555-symbols.txt xul.pdb/E3681D36E1A74C349CF24A8FDB19B2232/xul.sym $ grep ^FUNC xul.pdb/E3681D36E1A74C349CF24A8FDB19B2232/xul.sym | wc -l 63017 $ grep xul.sym firefox-3.6b6pre-WINNT-20091222062748-symbols.txt xul.pdb/263DF4266ABA49A4A9BF4929F17BF4D12/xul.sym $ grep ^FUNC xul.pdb/263DF4266ABA49A4A9BF4929F17BF4D12/xul.sym | wc -l 71349
Whiteboard: [waiting on 1.9.2rc1 bake]
blocking1.9.1: needed → ---
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: