See e.g. https://crash-stats.mozilla.com/report/index/0013ff42-587e-4d68-98a2-29c792130629 The top frame is crashing at 0x68d4af06 which is msvcr100.dll + 0x8af06 Visual Studio correctly reports this crash as happening in _purecall. The .sym file located on the server has matching information: FUNC 8aef4 2a 0 _purecall 8aef4 0 43 40794 8aef4 c 44 40794 8af00 4 45 40794 8af04 2 47 40794 8af06 7 54 40794 8af0d c 56 40794 8af19 5 57 40794 But for some reason breakpad isn't picking this up.
So here's what's going on: FUNC 8aeec b 14 _invalid_parameter(unsigned short const *,unsigned short const *,unsigned short const *,unsigned int,unsigned int) 8aeec 5 321 42951 8aef1 1 329 42951 8aef2 5 328 42951 FUNC 8aef4 2a 0 _purecall 8aef4 0 43 40794 8aef4 c 44 40794 8af00 4 45 40794 8af04 2 47 40794 8af06 7 54 40794 8af0d c 56 40794 8af19 5 57 40794 The _invalid_parameter function is in fact 8 bytes long. But the FUNC header in the .sym file says that it's 11 bytes (0xb). This overlaps the next function which starts at the 9th byte. This causes breakpad to throw away the range for the next function with the warning: 2013-07-03 07:39:43: range_map-inl.h:77: INFO: StoreRange failed, an existing range is contained by or extends lower than the new range: new 0x8aef4+0x2a, existing 0x8aeec+0xb The reason that it's listing the function as 11 bytes long is because the final instruction of the function is a 2-byte jump instruction up to _invalid_parameter (the C function). I believe that when the compiler generated this instruction, it was originally a 5-byte long-jump instruction. When the linker finished placing the functions, it discovered that it could rewrite the jump to a 2-byte relative nearjump, but it didn't update the debuginfo. I ran a quick analysis on just this .sym file, and of the 2886 functions, 359 of them are affected by this bug, including some of the most important. I will attach a list, but in includes * operator delete * operator new * malloc/free * memcpy/memmov * fclose/fread/_close/_read I'm still looking at whether all of the overlaps are the same cause. If it's specifically that, we may be able to just hack in a small overlap check. Not sure yet.
Assignee: nobody → benjamin
This is not limited to jumps at the end of functions: FUNC 461e4 27 0 Concurrency::Context::IsCurrentTaskCollectionCanceling() 461e4 0 92 404 461e4 5 93 404 461e9 4 94 404 461ed 3 100 404 461f0 4 101 404 461f4 a 103 404 0 5 104 404 0 5 106 404 46207 2 109 404 46209 1 110 404 In this case the jump(s) that are being rewritten to nearjumps are in the middle of the function. In this case the total function size is off by only 1 byte, but I think that's because of some kind of padding (it should be 6 bytes, if everything were tightly packed). I ran my analysis script over xul.sym for Firefox 22, and it does not appear that xul.sym or nss3.sym are affected by this. Given this, I think that we should consider automatically rewriting ranges "smaller" when another range is found that sits on the end of them.
Fixing this at symbol-dumping time would be doable, but require a bit of rework of the symbol dumping code, because it currently doesn't store functions in memory, it just dumps them all out as it iterates them: http://code.google.com/p/google-breakpad/source/browse/trunk/src/common/windows/pdb_source_line_writer.cc#280 Fixing this at symbol load time would mean fixing this: http://code.google.com/p/google-breakpad/source/browse/trunk/src/processor/basic_source_line_resolver.cc#146 Where the implementation for StoreRange is here: http://code.google.com/p/google-breakpad/source/browse/trunk/src/processor/range_map-inl.h#50
The former *feels* better, since we shouldn't be writing out malformed symbol files in the first place, even if it's more of a PITA.
We could also rewrite the .sym files in python, which would be a total hack but would be relatively simple. Do we get any guarantees from the iterator that they are handed back in order? It would be nice to rewrite only the last symbol, instead of doing arbitrary searches for overlapping symbols.
It's not 100% clear to me, but the DIA APIs seem to claim that we're enumerating in address order: http://code.google.com/p/google-breakpad/source/browse/trunk/src/common/windows/pdb_source_line_writer.cc#280 http://msdn.microsoft.com/en-us/library/971a5e3e.aspx "IDiaEnumSymbolsByAddr::Next Retrieves the next symbols in order by address."
Assignee: benjamin → nobody
There's also a third-party implementation of dump_syms for Windows now: https://github.com/luser/dump_syms It would be interesting to see if it suffers from the same bug.
See also https://breakpad.appspot.com/1574002/ which may be changing the general pattern of retrieving symbols.
Whiteboard: [email@example.com][lang=c++][crashkill:P2] → [lang=c++][crashkill:P2]
Just some poking here since I ran into this in another bug. I grabbed the dump for https://crash-stats.mozilla.com/report/index/a9e740e2-ad90-4f34-b151-8e8292140918 and was poking at the symbols. I dumped symbols for msvcr100.dll with the current in-tree dump_syms and they look right: FUNC 8aeec 8 14 _invalid_parameter(unsigned short const *,unsigned short const * ,unsigned short const *,unsigned int,unsigned int) FUNC 8aef4 2a 0 _purecall compared with what we have on the symbol server: FUNC 8aeec b 14 _invalid_parameter(unsigned short const *,unsigned short const *,unsigned short const *,unsigned int,unsigned int) FUNC 8aef4 2a 0 _purecall I guess that upstream Breakpad change fixes this bug. I'll have to update the binaries in use for the fetch-win32-symbols script to make sure we're getting good symbols. bsmedberg: do you have a list anywhere of affected symbols? We should be able to replace them with files dumped using a newer dump_syms.
No I don't.
You need to log in before you can comment on or make changes to this bug.