Closed Bug 874647 Opened 7 years ago Closed 6 years ago

Find and fix the underlying cause of the Windows XP xpcshell hangs

Categories

(Testing :: XPCShell Harness, defect)

x86
Windows XP
defect
Not set

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: RyanVM, Unassigned)

References

Details

Attachments

(1 file)

Rather than confuse matters by trying to find the root cause in one of the various "1200 second" bugs or in the TBPL bug for un-hiding the tests once they are reliably green again, let's use this bug as a central place for that work.
Clint - From our email conversation, it sounds like your team is going to take the first crack at this. Can you please find a suitable owner?
Excluding resolved intermittent-failures, bug 824022 is now our #1 toporange at the moment (even excluding the overflow bug 854330).

Clint, did you have any luck with the logging? :-)
Flags: needinfo?(ctalbert)
No activity on this for weeks now tells me this isn't a priority. Over the last 3 weeks, these failures are 2 of our top 3 unfixed oranges (and there are a few more variants on this further down the list). If we can't get the resources to investigate and fix this, can we just accept reality and turn these off?
I'll also note that in the interest of slave management, I'm loathe to just re-hide these by default like we were doing with the rev3s.
I think somebody should try one of the options in bug 824022 comment 43.
Or does bug 824022 comment 945 say that we already got the information we would have gained from doing that?
That sounds true. The harness now dumps the log on timeouts, which was effectively all I was proposing.
Hi guys, my bad here. I spoke to Ed about this in person but forgot to update the bug. These aren't actually timing out. The xpcshell process is failing to exit. I can't repro that locally and I have no idea what's keeping it alive in automation.
Flags: needinfo?(ctalbert)
Here is a patch that logs all the relevant timings of the xpcshell tests. And I started another try run with it in case someone sees something I'm missing. Please retrigger to your heart's content.
https://tbpl.mozilla.org/?tree=Try&rev=1cac583bfd1c
We don't have any way to remotely trigger the profiler on Windows, do we? Getting a profile out of xpcshell when it's hung might be useful. It's entirely possible that that wouldn't show us anything (just "spinning the event loop"), but it's worth a shot.
No we don't. On POSIX we can use signals. What's analogous to Windows? Perhaps we could patch the breakpad handler to check if 'MOZ_DUMP_PROFILE_ON_CRASH' is set and when the timeout crash happens breakpad could dump the profile?
There's nothing as straightforward as POSIX signals, but we could listen for a specially-named event. We could have an environment variable, like MOZ_SPS_EVENT, and then call CreateEvent with that name, then use RegisterWaitForSingleObject to get a callback when it's signalled. Then users could call OpenEvent with the same event name and SetEvent to signal it. This would be pretty easy to wire up in automation.
After discussion I decided to go a bit simpler and have patches on bug 885472 to give us a way to easily trigger Breakpad from outside of the process on Windows, and hook that up in automation.py so we should be able to get stacks out of these xpcshell hangs. I pushed the whole bunch to try, I'll see if I can retrigger the XP runs until I catch a failure:
https://tbpl.mozilla.org/?tree=Try&rev=ec177ea5bdcf
Depends on: 885472
No longer depends on: 884877
There's definitely something weird going on here. I was able to reproduce the hangs on Try with my patches from bug 885472, but the harness was not actually able to kill xpcshell in that case. I tested on a loccal WinXP VM, and the code works fine, including with an artificially-induced hang in an xpcshell test, so there's clearly something fishy going on here.
Testing a theory, I pushed that patch queue again but with a twist:
https://tbpl.mozilla.org/?tree=Try&rev=945b35951835

I have an additional patch in there that makes xpcshell exit early instead of doing a normal shutdown. I'm on vacation for the rest of the week, but if someone wanted to retrigger a bunch of WinXP xpcshell jobs on there and see what happens, that'd be interesting.
Can't retrigger tests, since it doesn't survive packaging.
Ahh, too simple apparently. :-(
Testing another theory:
https://tbpl.mozilla.org/?tree=Try&rev=952274c52b4c

The same set of patches, but instead of making xpcshell exit early, just make it not disable crash reporting on shutdown. (This causes leaks, but those don't make xpcshell tests fail, so that doesn't matter for the purpose of this investigation.)
That didn't seem to work either, which sucks, so I took a totally different tack here:
https://tbpl.mozilla.org/?tree=Try&rev=2dce007c810a
Looks like I finally caught it:
https://tbpl.mozilla.org/php/getParsedLog.php?id=24917908&tree=Try

The minidump stack does not fill me with joy though:
18:34:18     INFO -  Operating system: Windows NT
18:34:18     INFO -                    5.1.2600 Service Pack 3
18:34:18     INFO -  CPU: x86
18:34:18     INFO -       GenuineIntel family 6 model 30 stepping 5
18:34:18     INFO -       8 CPUs
18:34:18     INFO -  No crash
18:34:18     INFO -  Thread 0
18:34:18     INFO -   0  ntdll.dll + 0xe4f4
18:34:18     INFO -      eip = 0x7c90e4f4   esp = 0x0012f5e0   ebp = 0x0012f864   ebx = 0x00404a00
18:34:18     INFO -      esi = 0x7c885780   edi = 0x00000000   eax = 0x001448d8   ecx = 0x01000000
18:34:18     INFO -      edx = 0x7c97b178   efl = 0x00000246
18:34:18     INFO -      Found by: given as instruction pointer in context
18:34:18     INFO -   1  kernel32.dll + 0x43891
18:34:18     INFO -      eip = 0x7c843892   esp = 0x0012f86c   ebp = 0x0012fff0
18:34:18     INFO -      Found by: previous frame's frame pointer

(yes, that's all of it.)
Thanks! I dumped that locally with symbols and got:
Operating system: Windows NT
                  5.1.2600 Service Pack 3
CPU: x86
     GenuineIntel family 6 model 30 stepping 5
     8 CPUs

No crash

Thread 0
 0  ntdll.dll!KiFastSystemCallRet + 0x0
    eip = 0x7c90e4f4   esp = 0x0012f5e0   ebp = 0x0012f864   ebx = 0x00404a00
    esi = 0x7c885780   edi = 0x00000000   eax = 0x001448d8   ecx = 0x01000000
    edx = 0x7c97b178   efl = 0x00000246
    Found by: given as instruction pointer in context
 1  ntdll.dll!ZwRaiseHardError + 0xc
    eip = 0x7c90d9ac   esp = 0x0012f5e4   ebp = 0x0012f864
    Found by: call frame info
 2  kernel32.dll!UnhandledExceptionFilter + 0x633
    eip = 0x7c86449d   esp = 0x0012f5e8   ebp = 0x0012f864
    Found by: call frame info
 3  kernel32.dll!_seh_longjmp_unwind + 0x9cec
    eip = 0x7c843892   esp = 0x0012f86c   ebp = 0x0012fff0
    Found by: call frame info
 4  ntdll.dll!ExecuteHandler2@20 + 0x26
    eip = 0x7c9032a8   esp = 0x0012f89c   ebp = 0x0012fff0
    Found by: call frame info


Which is not terribly informative. WinDBG seems to give a slightly longer stack, except I don't have symbols for the non-Windows bits:
ChildEBP RetAddr  
0012f5dc 7c90d9ac ntdll!KiFastSystemCallRet
0012f5e0 7c86449d ntdll!ZwRaiseHardError+0xc
0012f864 7c843892 kernel32!UnhandledExceptionFilter+0x628
0012f86c 7c839b21 kernel32!BaseProcessStart+0x39
0012f894 7c9032a8 kernel32!_except_handler3+0x61
0012f8b8 7c90327a ntdll!ExecuteHandler2+0x26
0012f968 7c90e46a ntdll!ExecuteHandler+0x24
0012f968 003310d1 ntdll!KiUserExceptionDispatcher+0xe
*** WARNING: Unable to verify timestamp for mozglue.dll
*** ERROR: Module load completed but symbols could not be loaded for mozglue.dll
WARNING: Stack unwind information not available. Following frames may be wrong.
0012fc68 003331ec mozglue+0x10d1
0012fc80 00333ecc mozglue+0x31ec
0012fca4 00333fb9 mozglue+0x3ecc
0012fcc4 003340f5 mozglue+0x3fb9
0012fce8 003345a8 mozglue+0x40f5
*** WARNING: Unable to verify timestamp for xul.dll
*** ERROR: Module load completed but symbols could not be loaded for xul.dll
0012fcf4 10b743aa mozglue+0x45a8
0012fd08 10d0e96b xul+0xb743aa
0012fd24 10d0ea87 xul+0xd0e96b
0012fd68 10d0eb04 xul+0xd0ea87
0012fd74 7c90118a xul+0xd0eb04
0012fd94 7c923aba ntdll!LdrpCallInitRoutine+0x14
0012fe18 7c81ca96 ntdll!LdrShutdownProcess+0x14f
0012ff0c 7c81cb0e kernel32!_ExitProcess+0x42
0012ff20 78ac7ec3 kernel32!ExitProcess+0x14
0012ff2c 78ac7fdf msvcr100!__crtExitProcess(int status = 0n0)+0x17 [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0dat.c @ 708]
0012ff70 78ac8051 msvcr100!doexit(int code = 0n0, int quick = 0n0, int retcaller = 0n0)+0xfb [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0dat.c @ 621]
*** WARNING: Unable to verify timestamp for xpcshell.exe
*** ERROR: Module load completed but symbols could not be loaded for xpcshell.exe
0012ff84 004048d2 msvcr100!exit(int code = 0n0)+0x11 [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0dat.c @ 393]
0012ffc0 7c817067 xpcshell+0x48d2
0012fff0 00000000 kernel32!BaseProcessStart+0x23


Clearly something is going very wrong at shutdown here. I'll symbolicate that stack in a bit and see what it looks like.
Assignee: nobody → ted
Symbolicating shows that looks something like:
ntdll!KiFastSystemCallRet
ntdll!ZwRaiseHardError+0xc
kernel32!UnhandledExceptionFilter+0x628
kernel32!BaseProcessStart+0x39
kernel32!_except_handler3+0x61
ntdll!ExecuteHandler2+0x26
ntdll!ExecuteHandler+0x24
ntdll!KiUserExceptionDispatcher+0xe
moz_abort [memory/mozjemalloc/jemalloc.c:1590]
arena_purge [memory/mozjemalloc/jemalloc.c:3658]
arena_run_dalloc [memory/mozjemalloc/jemalloc.c:3833]
arena_dalloc_small [memory/mozjemalloc/jemalloc.c:4575]
arena_dalloc [memory/mozjemalloc/jemalloc.c:4676]
[mozglue+0x45A8]
std::_Tree<std::_Tmap_traits<unsigned __int64,mozilla::layers::CompositorParent::LayerTreeState,std::less<unsigned __int64>,std::allocator<std::pair<unsigned __int64 const ,mozilla::layers::CompositorParent::LayerTreeState> >,0> >::~_Tree<std::_Tmap_traits<unsigned __int64,mozilla::layers::CompositorParent::LayerTreeState,std::less<unsigned __int64>,std::allocator<std::pair<unsigned __int64 const ,mozilla::layers::CompositorParent::LayerTreeState> >,0> >() [c:/tools/msvs10/vc/include/xtree:792]
_CRT_INIT [f:/dd/vctools/crt_bld/self_x86/crt/src/crtdll.c:415]
__DllMainCRTStartup [f:/dd/vctools/crt_bld/self_x86/crt/src/crtdll.c:526]
_DllMainCRTStartup [f:/dd/vctools/crt_bld/self_x86/crt/src/crtdll.c:476]
0012fd94 7c923aba ntdll!LdrpCallInitRoutine+0x14
0012fe18 7c81ca96 ntdll!LdrShutdownProcess+0x14f
0012ff0c 7c81cb0e kernel32!_ExitProcess+0x42
0012ff20 78ac7ec3 kernel32!ExitProcess+0x14
0012ff2c 78ac7fdf msvcr100!__crtExitProcess(int status = 0n0)+0x17 [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0dat.c @ 708]
0012ff70 78ac8051 msvcr100!doexit(int code = 0n0, int quick = 0n0, int retcaller = 0n0)+0xfb [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0dat.c @ 621]
*** WARNING: Unable to verify timestamp for xpcshell.exe
*** ERROR: Module load completed but symbols could not be loaded for xpcshell.exe
0012ff84 004048d2 msvcr100!exit(int code = 0n0)+0x11 [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0dat.c @ 393]
__tmainCRTStartup [f:/dd/vctools/crt_bld/self_x86/crt/src/crtexe.c:572]
0012fff0 00000000 kernel32!BaseProcessStart+0x23
So, some sort of double-free on shutdown (after Breakpad's exception handler has been unregistered) that's winding up as an unhandled exception that hangs?
bsmedberg, jlebar: any thoughts?
> arena_purge [memory/mozjemalloc/jemalloc.c:3658]

This doesn't line up with a RELEASE_ASSERT in my jemalloc, which is odd, since we haven't modified that file in a while.

> chunk = arena_chunk_tree_dirty_last(&arena->chunks_dirty);
> RELEASE_ASSERT(chunk != NULL);
>
> for (i = chunk_npages - 1; chunk->ndirty > 0; i--) {
>   RELEASE_ASSERT(i >= arena_chunk_header_npages);

The line number in the stack is the |for| statement, so who knows which assertion we're hitting (assuming it's even one of these two).

If we're hitting the second assertion, that's because we're not purging as many pages in the chunk as we expected.  That could be because chunk->ndirty is corrupted, or because chunk->map is corrupted.

If we're hitting the first assertion, that's because our arena->chunks_dirty rbtree has been corrupted, or because arena->ndirty has been corrupted.  (The latter is probably more likely; I'd guess that if the rbtree is corrupted, we'd segfault either while iterating over it, or when trying to dereference the returned node.)

So this sounds like memory corruption of some sort.  That could be due to an invalid free, but I'd guess in that case it's probable we would have hit another assertion in jemalloc before getting here.

We could figure out whether it's the first assert by commenting that one out; we'll crash with a null-pointer exception right afterwards if it's the one.  But I don't know how much more that would tell us.

I'm tempted to say we should just make free() a nop we're so far into shutdown that breakpad has unregistered its exception handler.  That would be easy to do and would likely fix this crash.  But if the corruption happens before we're very far into shutdown, we want to figure it out...
Thanks for the analysis. Note that the stack may be a little dodgy due to not having symbols for everything during the unwind + post-facto symbolication. I re-pushed to try to get a build with full symbols that I could reproduce with.
Blocks: 855681
This is certainly part of the CRT shutdown sequence. Unfortunately the critical frame (the one under _CRT_INIT) is almost certainly bogus. Reading the CRT source code I can't tell for certain whether we're running a C++ destructor or an _onexit/atexit handler.

But I will note that there are at least a few places in the tree that use atexit handlers and might be compiled in Windows code:

* /gfx/harfbuzz/src/hb-common.cc
* /gfx/harfbuzz/src/hb-shape.cc
* /gfx/harfbuzz/src/hb-shaper.cc
* /ipc/chromium/src/base/at_exit.{h,cc}
* /js/src/TraceLogging.cpp

Does jemalloc have any static destructors? And Ted can you figure out from the dump which DLL/EXE we're calling _DllMainCRTStartup on at the time of the crash? If this is firefox.exe or mozglue.dll, we've probably already unloaded xul.dll (or at least run all of its static destructors) and things will be in a funny state.
jemalloc is C, so it doesn't have any destructors of any sort.  It calls atexit, but only if an env var is set, and never on Windows.
If you look at the non-symbolicated stack from comment 22, that appears to be in libxul. I'm not sure I trust that whole stack though, given my remarks above. I'm trying to repro this on a build where I have full symbols so I can get a more reliable stack (Breakpad is failing to walk this one for some reason).
I got this to happen again on try on a build where I've got PDB symbols:
https://tbpl.mozilla.org/php/getParsedLog.php?id=25021366&full=1&branch=try#error2

The stack windbg gives me is:
mozglue!arena_avail_tree_remove
mozglue!arena_run_dalloc
mozglue!arena_dalloc_small
mozglue!arena_dalloc
mozglue!je_free
xul!base::Histogram::`scalar deleting destructor'
xul!base::StatisticsRecorder::~StatisticsRecorder
xul!_CRT_INIT
xul!__DllMainCRTStartup
xul!_DllMainCRTStartup
ntdll!LdrpCallInitRoutine
ntdll!LdrShutdownProcess
kernel32!_ExitProcess
kernel32!ExitProcess
msvcr100!__crtExitProcess
msvcr100!doexit
msvcr100!exit
xpcshell!__tmainCRTStartup
kernel32!BaseProcessStart

This is weird, since it just looks like we're running static destructors. I'm not sure why that'd take 5+ minutes.
I don't think there are any RELEASE_ASSERT's in avail_tree_remove.

Are you segfaulting instead?
This isn't a crash, I just changed the harness code to call MiniDumpWriteDump on the still-living process when it times out. This is just what the process is doing when we dump it. I'm not sure why we'd still be running static destructors 5 minutes after starting the test (which should have finished sooner than that).
This reproduced again with a very similar stack:
https://tbpl.mozilla.org/php/getParsedLog.php?id=25025147&full=1&branch=try#error2

WinDBG says:
mozglue!arena_avail_tree_remove
mozglue!arena_run_dalloc
mozglue!arena_dalloc_small
mozglue!arena_dalloc
mozglue!je_free
xul!std::vector<char,std::allocator<char> >::_Tidy
xul!base::Histogram::~Histogram
xul!base::Histogram::`scalar deleting destructor'
xul!base::StatisticsRecorder::~StatisticsRecorder
xul!_CRT_INIT
xul!__DllMainCRTStartup
xul!_DllMainCRTStartup
ntdll!LdrpCallInitRoutine
ntdll!LdrShutdownProcess
kernel32!_ExitProcess
kernel32!ExitProcess
msvcr100!__crtExitProcess
msvcr100!doexit
msvcr100!exit
xpcshell!__tmainCRTStartup
kernel32!BaseProcessStart

There's clearly something bad happening running static destructors, but I don't know what. jlebar suggests commenting out the body of base::StatisticsRecorder::~StatisticsRecorder since it seems to be reliably happening under there, but this is still worrying.
A third hit on this on my try push, same stack as the last one:
https://tbpl.mozilla.org/php/getParsedLog.php?id=25026158&tree=Try#error2
This is the static destructor from http://hg.mozilla.org/mozilla-central/annotate/c0830a5933e8/toolkit/components/telemetry/Telemetry.cpp#l395

Can we try just making this a pointer and just never freeing it?
I don't think I have the time to devote to trying to solve this, so unassigning. Feel free to ping me if you need more info, I've put basically everything I've uncovered into the bug already.
Assignee: ted → nobody
No longer depends on: 885472
I tried completely removing gStatisticsRecorder because it appears to be unused, and it broke tests. So apparently it is used, through global side-effects. I'll try leaking it instead.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #38)
> I tried completely removing gStatisticsRecorder because it appears to be
> unused, and it broke tests. So apparently it is used, through global
> side-effects. I'll try leaking it instead.

Instead of leaking gStatisticsRecorder, we could go back to leaking the individual histograms by reverting bug 824647.
Here's a push to Try with bug 824647 backed out.
https://tbpl.mozilla.org/?tree=Try&rev=e0dab08b5266

And here's one off the same parent without the backout.
https://tbpl.mozilla.org/?tree=Try&rev=ba4cf5687f76

Is it just me, or did these timeouts just kind of go away on their own? The various oranges showing are known intermittents or at least don't match the failure pattern that's been investigated here. Bug 824022 hasn't shown an m-c/inbound/etc failure since late June (still happening on Aurora/Beta, though). There are a couple recent instances of bug 854330 on m-c trees, though.
And now with the patch backed out that was causing all the oranges we were seeing yesterday.
https://tbpl.mozilla.org/?tree=Try&rev=7c956b4cc672

So what's the problem here again? :-)
I don't know, this is terrifying to me, since it was incredibly reproducible not that long ago.
Bug 883657 made this problem disappear.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Blocks: 879686
You need to log in before you can comment on or make changes to this bug.