Closed Bug 970063 Opened 6 years ago Closed 6 years ago
Jit-tests on Windows frequently time out in automation
https://tbpl.mozilla.org/php/getParsedLog.php?id=34360769&tree=Mozilla-Inbound WINNT 5.2 mozilla-inbound build on 2014-02-08 13:03:06 PST for push 79f70ef516ab slave: w64-ix-slave138 FAIL - parallel\timeout-gc.js TEST-UNEXPECTED-FAIL | js\src\jit-test\tests\parallel\timeout-gc.js | --no-baseline --no-ion --no-ti INFO exit-status : 0 INFO timed-out : False INFO stdout > Timed out, invoking the GC INFO stdout > INFO stderr 2> Script runs for too long, terminating. INFO stderr 2>
Severity: normal → major
Summary: Intermittent TEST-UNEXPECTED-FAIL | js\src\jit-test\tests\parallel\timeout-gc.js | --no-baseline --no-ion --no-ti → Jittests on Windows frequently time out
Terrence, please may you find an owner for this?
Summary: Jittests on Windows frequently time out → Jit-tests on Windows frequently time out in automation
Actually, "on Windows Server 2008 and Windows 7" would probably be more accurate. And that's the upside of getting them off the build slaves and onto test slaves, we can just hide the Win7 jittest runs, and then shut them off when nobody ever fixes them.
A few weeks ago I was able to reproduce this in a Windows 7 VM; on average you have to run thousands of jit-tests to hit it. Unfortunately I don't have the call stack anymore (I can try to repro again later), but we're hanging after main returns, I think under ExitProcess. From what I read online, this kind of thing can happen when DllMain is called with DLL_PROCESS_DETACH and you're doing something non-trivial there: http://blogs.msdn.com/b/oldnewthing/archive/2007/05/02/2365433.aspx http://blogs.msdn.com/b/oldnewthing/archive/2012/01/05/10253268.aspx AFAICT, the only place in our codebase that does this is ICU. I could still repro it with a fix for that though. I can try to repro this again next week, but it takes a lot of time and I'm not sure how to proceed: it's hanging after our code runs and there's not a lot of documentation available :(
Bah, I wanted to link to this one instead: http://blogs.msdn.com/b/oldnewthing/archive/2007/05/03/2383346.aspx
(In reply to Phil Ringnalda (:philor) from comment #2) > Actually, "on Windows Server 2008 and Windows 7" would probably be more > accurate. So this doesn't happen on Windows 8? I guess I should take this bug because I looked into it. Please give me a week to investigate this, and if we don't find anything we can maybe modify jit-tests to allow timeouts on Windows 7/2008...
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
It's possible it does happen on Win8, just with vastly lower frequency - looking at https://tbpl.mozilla.org/?tree=Mozilla-Inbound&showall=1&jobname=winnt%206.2.*jittest back to Wednesday, there was the one timeout that resulted in bug 992202 being filed; as https://tbpl.mozilla.org/?tree=Mozilla-Inbound&showall=1&jobname=windows%207.*jittest and a little exercise of the down-arrow will show, you can't swing a dead flea without hitting a Win7 timeout, probably averaging one every 10 pushes, so every 20 runs.
I'm pretty sure I know what's going on. Here's the stack trace when it hangs: ntdll.dll!_ZwReleaseKeyedEvent@16() + 0x15 bytes ntdll.dll!_ZwReleaseKeyedEvent@16() + 0x15 bytes kernel32.dll!76c779c5() msvcr100.dll!___crtExitProcess() + 0x17 bytes msvcr100.dll!__cinit() + 0xb689 bytes msvcr100.dll!_exit() + 0x11 bytes js.exe!__tmainCRTStartup() Line 572 C kernel32.dll!@BaseThreadInitThunk@12() + 0x12 bytes ntdll.dll!___RtlUserThreadStart@8() + 0x27 bytes ntdll.dll!__RtlUserThreadStart@8() + 0x1b bytes After main() returns, the process tries to exit and it hangs. I found MS KB 2582203 , it basically describes this problem (also note that it mentions only Windows 7 and Server 2008 as affected): <quote> On a computer that is running Windows 7 or Windows Server 2008 R2, a process that is being terminated stops responding. When this issue occurs, the main thread of the process becomes blocked, and the process does not terminate. If you analyze the call stack of the process when this issue occurs, you notice that a thread is blocked in the ntdll!NtReleaseKeyedEvent() function. This issue occurs because the main thread that terminates the process tries to reactivate another thread that was terminated when the thread released an SRW lock. This race condition causes the process to stop responding. </quote> So we're terminating a thread that's releasing a lock. We know that when the main thread exits, it terminates all other threads first . However, the JS engine cleanly stops all threads when it shuts down, so in theory only the main thread should be active when the process exits and we shouldn't have this problem. This is not what happens. Although we call NSPR's PR_JoinThread, on Windows that function does not wait for the thread to end. Rather, the child thread explicitly calls _PR_NotifyJoinWaiters after the thread returns from its "main" function. So when PR_JoinThread returns, the thread is not dead: it's terminating but it still has to return from _PR_NativeRunThread, do the Windows DLL_THREAD_DETACH stuff for each DLL, etc. So to summarize: (1) Shell calls JS_DestroyRuntime, JS_ShutDown. (2) These functions call PR_JoinThread to stop all helper threads. As described above, the threads may still be running for a little longer. (3) main() returns. All threads are terminated . If one of them was just releasing a lock (not unlikely, all helper threads that are still active are busy shutting down), the main thread can deadlock per . Some ways to fix this: (1) Fix bug 956899 (get rid of the NSPR dependency). Indeed, the patches in that bug use WaitForSingleObject in Thread::join on Windows, that's the "official" Win32 way to wait for threads to terminate. A lot saner and simpler than what NSPR is doing. (2) Fix NSPR to use WaitForSingleObject. The current NSPR behavior is probably there because NSPR supports fibers ("local threads") and can recycle threads. Maybe we could use WaitForSingleObject if we're not using those features. (3) Deploy hotfix KB 2582203  on our test slaves, I've no idea how feasible this is. (4) Hack the jit-test harness to ignore timeouts on Windows 7 and Windows Server 2008. I think (4) is the best short-term fix, (1) the best longer-term fix. Unless (3) is easy, we could give that a try too...  http://support.microsoft.com/kb/2582203  http://blogs.msdn.com/b/oldnewthing/archive/2007/05/03/2383346.aspx
Thank you so much for looking into this! :-D
Note that bug 941805 landed when this started . It probably made this worse (or exposed it) by moving JS helper thread shutdown to JS_ShutDown, e.g. close to process exit. Also, I can't reproduce the problem with JS helper/GC threads disabled.  https://lists.mozilla.org/pipermail/dev-tech-js-engine-internals/2014-February/001641.html
A bit unfortunate, but this will hopefully be temporary and the Windows 8 machines should still catch Windows-only timeouts.
Attachment #8402692 - Flags: review?(terrence)
Comment on attachment 8402692 [details] [diff] [review] Ignore timeouts on Windows 7 and 2008 Review of attachment 8402692 [details] [diff] [review]: ----------------------------------------------------------------- Wow, NSPR deserves a *slow clap* for that bug. r=me
Attachment #8402692 - Flags: review?(terrence) → review+
Now with the correct username and title. This can land when the trees are open.
Attachment #8402796 - Flags: review+
Attachment #8402692 - Attachment is obsolete: true
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Should this still be reverted now that bug 956899 has completed?
(In reply to Nick Fitzgerald [:fitzgen] [⏰PDT; UTC-7] from comment #17) > Should this still be reverted now that bug 956899 has completed? Yes, thanks! Filed bug 1297652.
You need to log in before you can comment on or make changes to this bug.