Closed Bug 970063 Opened 6 years ago Closed 6 years ago

Jit-tests on Windows frequently time out in automation

Categories

(Core :: JavaScript Engine: JIT, defect, major)

x86
Windows Server 2008
defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla31
Tracking Status
firefox29 --- unaffected
firefox30 --- fixed
firefox31 --- fixed
firefox-esr24 --- unaffected

People

(Reporter: philor, Assigned: jandem)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 1 obsolete file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=34360769&tree=Mozilla-Inbound
WINNT 5.2 mozilla-inbound build on 2014-02-08 13:03:06 PST for push 79f70ef516ab
slave: w64-ix-slave138

FAIL - parallel\timeout-gc.js
TEST-UNEXPECTED-FAIL | js\src\jit-test\tests\parallel\timeout-gc.js | --no-baseline --no-ion --no-ti
INFO exit-status     : 0
INFO timed-out       : False
INFO stdout          > Timed out, invoking the GC
INFO stdout          > 
INFO stderr         2> Script runs for too long, terminating.
INFO stderr         2>
Blocks: 970065
Blocks: 970066
Blocks: 970068
Blocks: 970078
Blocks: 970113
Blocks: 970143
Blocks: 970179
Blocks: 970443
Blocks: 971502
Blocks: 971508
Blocks: 971511
Blocks: 974278
Blocks: 977970
Blocks: 979071
Blocks: 979150
Blocks: 981243
Blocks: 981952
Blocks: 984024
Blocks: 984025
Blocks: 984174
Blocks: 986765
Blocks: 986938
Blocks: 989638
Blocks: 989639
Blocks: 989837
Depends on: 991779
No longer depends on: 991779
Severity: normal → major
Summary: Intermittent TEST-UNEXPECTED-FAIL | js\src\jit-test\tests\parallel\timeout-gc.js | --no-baseline --no-ion --no-ti → Jittests on Windows frequently time out
Terrence, please may you find an owner for this?
Flags: needinfo?(terrence)
Summary: Jittests on Windows frequently time out → Jit-tests on Windows frequently time out in automation
Blocks: 992287
Blocks: 992290
Actually, "on Windows Server 2008 and Windows 7" would probably be more accurate. And that's the upside of getting them off the build slaves and onto test slaves, we can just hide the Win7 jittest runs, and then shut them off when nobody ever fixes them.
Blocks: 992293
Blocks: 992298
Blocks: 992300
A few weeks ago I was able to reproduce this in a Windows 7 VM; on average you have to run thousands of jit-tests to hit it.

Unfortunately I don't have the call stack anymore (I can try to repro again later), but we're hanging after main returns, I think under ExitProcess. From what I read online, this kind of thing can happen when DllMain is called with DLL_PROCESS_DETACH and you're doing something non-trivial there:

http://blogs.msdn.com/b/oldnewthing/archive/2007/05/02/2365433.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2012/01/05/10253268.aspx

AFAICT, the only place in our codebase that does this is ICU. I could still repro it with a fix for that though.

I can try to repro this again next week, but it takes a lot of time and I'm not sure how to proceed: it's hanging after our code runs and there's not a lot of documentation available :(
Bah, I wanted to link to this one instead:

http://blogs.msdn.com/b/oldnewthing/archive/2007/05/03/2383346.aspx
(In reply to Phil Ringnalda (:philor) from comment #2)
> Actually, "on Windows Server 2008 and Windows 7" would probably be more
> accurate.

So this doesn't happen on Windows 8?

I guess I should take this bug because I looked into it. Please give me a week to investigate this, and if we don't find anything we can maybe modify jit-tests to allow timeouts on Windows 7/2008...
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
It's possible it does happen on Win8, just with vastly lower frequency - looking at https://tbpl.mozilla.org/?tree=Mozilla-Inbound&showall=1&jobname=winnt%206.2.*jittest back to Wednesday, there was the one timeout that resulted in bug 992202 being filed; as https://tbpl.mozilla.org/?tree=Mozilla-Inbound&showall=1&jobname=windows%207.*jittest and a little exercise of the down-arrow will show, you can't swing a dead flea without hitting a Win7 timeout, probably averaging one every 10 pushes, so every 20 runs.
Flags: needinfo?(terrence)
I'm pretty sure I know what's going on.

Here's the stack trace when it hangs:

  ntdll.dll!_ZwReleaseKeyedEvent@16()  + 0x15 bytes	
  ntdll.dll!_ZwReleaseKeyedEvent@16()  + 0x15 bytes	
  kernel32.dll!76c779c5() 	
  msvcr100.dll!___crtExitProcess()  + 0x17 bytes	
  msvcr100.dll!__cinit()  + 0xb689 bytes	
  msvcr100.dll!_exit()  + 0x11 bytes	
  js.exe!__tmainCRTStartup()  Line 572	C
  kernel32.dll!@BaseThreadInitThunk@12()  + 0x12 bytes	
  ntdll.dll!___RtlUserThreadStart@8()  + 0x27 bytes	
  ntdll.dll!__RtlUserThreadStart@8()  + 0x1b bytes	

After main() returns, the process tries to exit and it hangs.

I found MS KB 2582203 [0], it basically describes this problem (also note that it mentions only Windows 7 and Server 2008 as affected):

<quote>
On a computer that is running Windows 7 or Windows Server 2008 R2, a process that is being terminated stops responding. When this issue occurs, the main thread of the process becomes blocked, and the process does not terminate.

If you analyze the call stack of the process when this issue occurs, you notice that a thread is blocked in the ntdll!NtReleaseKeyedEvent() function.

This issue occurs because the main thread that terminates the process tries to reactivate another thread that was terminated when the thread released an SRW lock. This race condition causes the process to stop responding.
</quote>

So we're terminating a thread that's releasing a lock. We know that when the main thread exits, it terminates all other threads first [1]. However, the JS engine cleanly stops all threads when it shuts down, so in theory only the main thread should be active when the process exits and we shouldn't have this problem.

This is not what happens. Although we call NSPR's PR_JoinThread, on Windows that function does not wait for the thread to end. Rather, the child thread explicitly calls _PR_NotifyJoinWaiters after the thread returns from its "main" function. So when PR_JoinThread returns, the thread is not dead: it's terminating but it still has to return from _PR_NativeRunThread, do the Windows DLL_THREAD_DETACH stuff for each DLL, etc.

So to summarize:

(1) Shell calls JS_DestroyRuntime, JS_ShutDown.
(2) These functions call PR_JoinThread to stop all helper threads. As described above, the threads may still be running for a little longer.
(3) main() returns. All threads are terminated [1]. If one of them was just releasing a lock (not unlikely, all helper threads that are still active are busy shutting down), the main thread can deadlock per [0].

Some ways to fix this:

(1) Fix bug 956899 (get rid of the NSPR dependency). Indeed, the patches in that bug use WaitForSingleObject in Thread::join on Windows, that's the "official" Win32 way to wait for threads to terminate. A lot saner and simpler than what NSPR is doing.

(2) Fix NSPR to use WaitForSingleObject. The current NSPR behavior is probably there because NSPR supports fibers ("local threads") and can recycle threads. Maybe we could use WaitForSingleObject if we're not using those features.

(3) Deploy hotfix KB 2582203 [0] on our test slaves, I've no idea how feasible this is.

(4) Hack the jit-test harness to ignore timeouts on Windows 7 and Windows Server 2008.

I think (4) is the best short-term fix, (1) the best longer-term fix. Unless (3) is easy, we could give that a try too...

[0] http://support.microsoft.com/kb/2582203
[1] http://blogs.msdn.com/b/oldnewthing/archive/2007/05/03/2383346.aspx
Thank you so much for looking into this! :-D
Note that bug 941805 landed when this started [0]. It probably made this worse (or exposed it) by moving JS helper thread shutdown to JS_ShutDown, e.g. close to process exit.

Also, I can't reproduce the problem with JS helper/GC threads disabled.

[0] https://lists.mozilla.org/pipermail/dev-tech-js-engine-internals/2014-February/001641.html
Blocks: 941805
Depends on: 956899
Blocks: 992869, 992870
A bit unfortunate, but this will hopefully be temporary and the Windows 8 machines should still catch Windows-only timeouts.
Attachment #8402692 - Flags: review?(terrence)
Blocks: 992464
Blocks: 992459
Blocks: 992463
Blocks: 992457
Comment on attachment 8402692 [details] [diff] [review]
Ignore timeouts on Windows 7 and 2008

Review of attachment 8402692 [details] [diff] [review]:
-----------------------------------------------------------------

Wow, NSPR deserves a *slow clap* for that bug. r=me
Attachment #8402692 - Flags: review?(terrence) → review+
Now with the correct username and title. This can land when the trees are open.
Attachment #8402796 - Flags: review+
Attachment #8402692 - Attachment is obsolete: true
Keywords: checkin-needed
Blocks: 993355
Blocks: 993356
https://hg.mozilla.org/mozilla-central/rev/08812e846652
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Should this still be reverted now that bug 956899 has completed?
Flags: needinfo?(jdemooij)
Blocks: 1297652
(In reply to Nick Fitzgerald [:fitzgen] [⏰PDT; UTC-7] from comment #17)
> Should this still be reverted now that bug 956899 has completed?

Yes, thanks! Filed bug 1297652.
Flags: needinfo?(jdemooij)
You need to log in before you can comment on or make changes to this bug.