Closed Bug 1131757 Opened 9 years ago Closed 9 years ago

A worker's interrupt event loop misinterprets a futex wakeup runnable, and kills the worker

Categories

(Core :: JavaScript Engine, defect)

Version: Other Branch
Hardware: x86
OS: macOS
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED

People

(Reporter: jujjyl, Assigned: lth)

Attachments

(5 files)

STR:

1. Make a Firefox Nightly build with SAB+Atomics+futex support following Lars Hansen's patch queue from https://github.com/lars-t-hansen/atomics-queue
2. Download the attached file and unzip it.
3. Run a.html and keep refreshing the page until the browser hangs (showing the slow script dialog).

Observed:

It looks like calling 'performance.now()' from a web worker occasionally causes a mutex lock attempt in the worker thread, which the worker thread can acquire only after the main browser thread returns from executing user JS code. What happens during the hang is:

  - the worker thread was stalled trying to execute performance.now() in function __Z5sleepi in a.js:10030.
  - the main thread was stalled waiting for the worker thread to finish execution in _pthread_join in a.js:9483, which it would have happily done if it had gotten past the performance.now() call.

This is similar to two threads each holding a lock the other needs: both are stalled and neither can proceed.

Expected:

Calling performance.now() should be a simple read of the hardware clock/timestamp counter (QueryPerformanceCounter/clock_gettime/etc.), and it should never stall waiting for the main browser thread to yield execution back from user JS code to the browser.

Note: Even outside the experimental SAB+Atomics+futex work, this looks like a performance bug: calling performance.now() should never involve waiting on what any other thread might be doing; it should be a very quick, constant-time operation that reads a hardware clock counter independently of other executing threads.
>  - the worker thread was stalled trying to execute performance.now() in function __Z5sleepi in a.js:10030.

What was the C++ callstack at this point?

Also, what OS are you on?  On Windows, QueryPerformanceCounter is broken enough that you can't use it on its own for this sort of API.  So we end up also using GetTickCount64, but if that's not available (because the Windows version is too old; GetTickCount64 is only available in Vista or later), we'll fall back to GetTickCount and a lock around some variables dealing with tick count rollover...
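
To illustrate why that fallback needs a lock (a hypothetical JS sketch, not Gecko's actual code): GetTickCount is a 32-bit millisecond counter that wraps roughly every 49.7 days, so successive reads must detect the wrap and accumulate an epoch, and that shared state must be guarded when accessed from multiple threads.

// Shared rollover state; in the C++ this is what the lock protects.
var lastTicks = 0;
var rolloverEpochs = 0;

// ticks32: the latest 32-bit GetTickCount-style reading.
function ticks64(ticks32) {
  if (ticks32 < lastTicks)   // counter wrapped past 2^32 ms
    rolloverEpochs++;
  lastTicks = ticks32;
  return rolloverEpochs * Math.pow(2, 32) + ticks32;
}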
Flags: needinfo?(jujjyl)
How do I identify which thread is the web worker thread? I am able to break execution in the OS X Xcode debugger and examine the callstacks of all the threads, but I'm unable to locate which one is the web worker that is running my code. I see the main thread is waiting on the following, which looks correct and expected:

#0	0x00007fff8dde7132 in __psynch_cvwait ()
#1	0x00007fff8fdf0ea0 in _pthread_cond_wait ()
#2	0x00000001006226a3 in pt_TimedWait [inlined] at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptsynch.c:264
#3	0x0000000100622620 in PR_WaitCondVar at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptsynch.c:387
#4	0x0000000101ad7efc in JSMainFutexAPIImpl::wait(double) at /Users/jjylanki/mozilla-inbound-2/dom/base/nsJSEnvironment.cpp:2665
#5	0x00000001037f76d8 in js::atomics_futexWait(JSContext*, unsigned int, JS::Value*) at /Users/jjylanki/mozilla-inbound-2/js/src/builtin/AtomicsObject.cpp:824
#6	0x000000010cbeb5de in 0x10cbeb5de ()

but as for the other threads, I see

 - 12 threads all named "Analysis Helper", which are each in

Analysis Helper (12)
#0	0x00007fff8dde7132 in __psynch_cvwait ()
#1	0x00007fff8fdf0ea0 in _pthread_cond_wait ()
#2	0x00000001006226bd in PR_WaitCondVar at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptsynch.c:385
#3	0x00000001038b8afe in js::GlobalHelperThreadState::wait(js::GlobalHelperThreadState::CondVar, unsigned int) [inlined] at /Users/jjylanki/mozilla-inbound-2/js/src/vm/HelperThreads.cpp:548
#4	0x00000001038b8aed in js::HelperThread::threadLoop() at /Users/jjylanki/mozilla-inbound-2/js/src/vm/HelperThreads.cpp:1380
#5	0x0000000100624a51 in _pt_root at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptthread.c:212
#6	0x00007fff8fdf02fc in _pthread_body ()
#7	0x00007fff8fdf0279 in _pthread_start ()
#8	0x00007fff8fdee4b1 in thread_start ()

 - 23 threads that are all unnamed, and each in

Thread 121
#0	0x00007fff8dde252e in mach_msg_trap ()
#1	0x00007fff8dde169f in mach_msg ()
#2	0x00000001037ec84d in AsmJSMachExceptionHandlerThread(void*) at /Users/jjylanki/mozilla-inbound-2/js/src/asmjs/AsmJSSignalHandlers.cpp:720
#3	0x0000000100624a51 in _pt_root at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptthread.c:212
#4	0x00007fff8fdf02fc in _pthread_body ()
#5	0x00007fff8fdf0279 in _pthread_start ()
#6	0x00007fff8fdee4b1 in thread_start ()

 - 19 threads that are named "DOM Worker", which are each in

DOM Worker (27)
#0	0x00007fff8dde7132 in __psynch_cvwait ()
#1	0x00007fff8fdf0ea0 in _pthread_cond_wait ()
#2	0x00000001006226bd in PR_WaitCondVar at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptsynch.c:385
#3	0x00000001028f9b93 in mozilla::CondVar::Wait(unsigned int) [inlined] at /Users/jjylanki/mozilla-inbound-2/obj-x86_64-apple-darwin14.0.0/dom/workers/../../dist/include/mozilla/CondVar.h:79
#4	0x00000001028f9b75 in mozilla::dom::workers::WorkerPrivate::WaitForWorkerEvents(unsigned int) at /Users/jjylanki/mozilla-inbound-2/dom/workers/WorkerPrivate.cpp:5133
#5	0x00000001028f94bd in mozilla::dom::workers::WorkerPrivate::DoRunLoop(JSContext*) at /Users/jjylanki/mozilla-inbound-2/dom/workers/WorkerPrivate.cpp:4563
#6	0x00000001028e7c48 in (anonymous namespace)::WorkerThreadPrimaryRunnable::Run() at /Users/jjylanki/mozilla-inbound-2/dom/workers/RuntimeService.cpp:2671
#7	0x0000000101094d5f in nsThread::ProcessNextEvent(bool, bool*) at /Users/jjylanki/mozilla-inbound-2/xpcom/threads/nsThread.cpp:855
#8	0x00000001010b41e7 in NS_ProcessNextEvent(nsIThread*, bool) at /Users/jjylanki/mozilla-inbound-2/xpcom/glue/nsThreadUtils.cpp:265
#9	0x000000010131d3b0 in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) at /Users/jjylanki/mozilla-inbound-2/ipc/glue/MessagePump.cpp:368
#10	0x0000000101301cbc in MessageLoop::RunInternal() [inlined] at /Users/jjylanki/mozilla-inbound-2/ipc/chromium/src/base/message_loop.cc:233
#11	0x0000000101301cad in MessageLoop::RunHandler() [inlined] at /Users/jjylanki/mozilla-inbound-2/ipc/chromium/src/base/message_loop.cc:226
#12	0x0000000101301cad in MessageLoop::Run() at /Users/jjylanki/mozilla-inbound-2/ipc/chromium/src/base/message_loop.cc:200
#13	0x00000001010939bb in nsThread::ThreadFunc(void*) at /Users/jjylanki/mozilla-inbound-2/xpcom/threads/nsThread.cpp:356
#14	0x0000000100624a51 in _pt_root at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptthread.c:212
#15	0x00007fff8fdf02fc in _pthread_body ()
#16	0x00007fff8fdf0279 in _pthread_start ()
#17	0x00007fff8fdee4b1 in thread_start ()

and a bunch of others with names that look unrelated. (btw, there are 97 threads total in Firefox when there's just that one tab open.) Is there a way to find which of them is running the Web Worker?
Flags: needinfo?(jujjyl)
Oh, and this was on OS X, not on Windows.
> How do I identify which thread is the web worker thread?

That would be the "DOM Worker" thread.  Though you could also just attach all the thread stacks to this bug (not as a comment, since that would be a pretty huge comment).

That DOM worker thread is just waiting for something to do; it's run its code to completion and is waiting for an event to tell it to run more code or whatever.

> Oh, and this was on OS X, not on Windows.

On OSX, the implementation of performance.now() on workers is like so (mozilla::dom::workers::Performance::Now):

  TimeDuration duration =
    TimeStamp::Now() - mWorkerPrivate->NowBaseTimeStamp();
  return duration.ToMilliseconds();

NowBaseTimeStamp() is just an accessor, no locks.  

ToMilliseconds() just returns ToSeconds() * 1000.0 and ToSeconds() on Mac returns (mValue * sNsPerTick) / kNsPerSecd, no locks.

Now() on Mac just does a ClockTime() call, which just calls mach_absolute_time().  There are no locks anywhere in there, unless mach_absolute_time has some under the hood.

Do you actually see Performance::Now in any of the stacks? If not, why do you think that it's involved in the deadlock in any way?
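
In JS terms the code above boils down to the following (a rough, hypothetical rendering, not the real implementation; monotonicNowMs() stands in for TimeStamp::Now()):

// Worker performance.now() is a plain subtraction of two monotonic
// timestamps; there is no locking anywhere on this path.
function workerPerformanceNow(nowBaseTimeStampMs) {
  var durationMs = monotonicNowMs() - nowBaseTimeStampMs;
  return durationMs;  // already in milliseconds
}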
Attached file pthread-main.js
You also need this file.
19 DOM Workers is a scary number, because there's a per-domain limit of 20 - after that you should expect deadlocks because some threads won't be mapped onto actual running threads.  This doesn't mean that's the problem but I find it a little worrisome.

When I start your program in a browser and reload it a few times I see 11 workers total (Linux 64-bit, debug build). The program itself creates 8 workers that I can see, so unless a 'new Worker' call is hidden somewhere clever I would not expect many more than 8 workers to be present; it looks like we have a general overhead of 3 threads. 19 seems really high. Possibly you had several tabs open?

Indeed, when I create three tabs with this program, the last tab I create never finishes loading, and the DOM worker threads just keep stacking up.  Breaking in the debugger I find there are 23 threads, which is exactly right: 20 for the file: domain and the 3 workers that were hanging around from the start.

cc'ing various interested parties, the 20-per-domain hard limit is going to cause real pain once workers can be used as threads.
Possibly delayed deletion / unloading of workers could cause the number to peak momentarily and hit the cutoff?  No evidence of this yet, of course.
Re per-domain limits, see also bug 1053275.
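
To illustrate the hazard (a minimal hypothetical sketch, not from the test case itself): with a hard cap such as the 20-per-domain limit, workers created beyond the cap are queued rather than started, so any code that blocks waiting on one of them can deadlock.

// Hypothetical: with a cap of 20 workers per domain, workers 21..25
// below are queued and never start while the first 20 stay alive.
var workers = [];
for (var i = 0; i < 25; i++)
  workers.push(new Worker('pthread-main.js'));
// If one of the first 20 workers now futex-waits for a store that only
// worker #21 performs, neither makes progress: deadlock by starvation.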
Thanks Lars for posting the missing pthread-main.js file to the bug, and sorry for forgetting that!

Boris: I tried again with a new run, and what I am seeing is that again all DOM Workers are sleeping with the same call stack that "DOM Worker (27)" in my previous comment lists. I do not see Performance::Now in any of the call stacks.

I am quite confident that it is performance.now() that is stalling, for the following reasons.

1. The code in the .js file that the thread runs looks as follows:

function _emscripten_get_now() {
  return performance.now();
}

function __Z5sleepi($msecs) {
 $msecs = $msecs|0;
 var $0 = 0, $1 = 0.0, $2 = 0.0, $3 = 0, $4 = 0.0, $5 = 0.0, $6 = 0.0, $7 = 0.0, $8 = 0, $t0 = 0.0, $t1 = 0.0, label = 0, sp = 0;
 sp = STACKTOP;
 STACKTOP = STACKTOP + 32|0; if ((STACKTOP|0) >= (STACK_MAX|0)) abort();
 $0 = $msecs;
 $1 = (+_emscripten_get_now());
 $t0 = $1;
 $2 = $t0;
 $3 = $0;
 $4 = (+($3|0));
 $5 = $2 + $4;
 $t1 = $5;
 while(1) {
  $6 = (+_emscripten_get_now());
  $7 = $t1;
  $8 = $6 < $7;
  if (!($8)) {
   break;
  }
 }
 STACKTOP = sp;return;
}

function _pthread_self() {
  if (ENVIRONMENT_IS_PTHREAD) return threadBlock;
  return PThread.mainThreadBlock;
}

function __Z10ThreadMainPv($arg) {
 // stuff ...
 dump('Thread ' + _pthread_self() + ' going to sleep\n');
 __Z5sleepi(100);
 dump('Thread ' + _pthread_self() + ' back from sleep\n');
 // stuff ...
}

and enabling dump in about:config, I am seeing the message "Thread 5260064 going to sleep", but no message "Thread 5260064 back from sleep" from the worker thread.

2. If I remove the dump() calls, the hang still occurs, so I know that it's not the dump() calls themselves.

3. If I replace the function _emscripten_get_now() above with the following function:

var simulatedTime = 0;
function _emscripten_get_now() {
  simulatedTime += 0.01;
  return simulatedTime;
}

the execution will always succeed and never hang. That is, removing the call to performance.now() and replacing it with a fake, strictly monotonic value avoids the hang.

Lars: the fixed pool size is definitely a problem, but this issue is something else, since I always see the culprit worker thread dump "Thread 5260064 going to sleep" as the last message it prints.

I don't know why I can't find the DOM Worker thread in Xcode's thread list. Could the thread in question already have been killed by something by the time I break?
Oh, and I was running only one tab in the browser, but I was reloading that tab a few times to get the hang (it usually occurs in 1/3 to 1/4 of the runs).

Also, I'm running with the pref dom.workers.maxPerDomain=60 to give some breathing room for the pool size, but that does not affect the issue here, I tested that the bug reproduces with the default value for that pref as well.
Thanks for all the feedback.  I'll move over to my Mac and see if I can reproduce it there.  (I still don't understand why there should be so many Worker instances alive at the time of the hang.)
Perhaps reloading the page doesn't immediately kill the workers from the old page, but they have some kind of timeout or asynchronicity to them? Since the hang occurs immediately when the page is (re)loaded, the old threads would still be alive when breaking into the debugger.
> I do not see Performance::Now in any of the call stacks.

Then you're not hanging under performance.now, yes?  That is, the point when the worker enters the sleep is not when now() has been called.

> I am seeing the message "Thread 5260064 going to sleep", but no message "Thread 5260064
> back from sleep" from the worker thread.

Curious, indeed, since according to the C++ stack the now() call has certainly returned.

So I tried to reproduce this.  I have yet to see a case where I get a "going to sleep" message without a corresponding "back from sleep".  However I also get a bunch of:

  Error: futex support is not installed a.js:5691:16
  "pthread sent an error! Error: futex support is not installed" a.html:1245:12
  "[post-exception status] Exception thrown, see JavaScript console" a.html:1245:12

in the console.  Is that expected?
bz: you need to apply a patch queue to use the futexes atm, see comment #0 I think.

I'm pretty sure this is not the fault of performance.now().

Curiously there appears to be a watchdog timer on worker threads on Mac.  Surely I must be imagining this?  It looks like the "sleeping" thread is killed off after a while, though; the regular slow-script dialog then kills off the waiting main thread.  And I get this in the console (the first one is the thread and appears immediately as the slow-script dialog comes up; the second only when I click to kill the script):

Error: Script terminated by timeout at:
__Z5sleepi@file:///Users/lhansen/tmp/performance_now_hang/a.js:10026:1
__Z10ThreadMainPv@file:///Users/lhansen/tmp/performance_now_hang/a.js:10051:2
dynCall_ii@file:///Users/lhansen/tmp/performance_now_hang/a.js:16182:10
this.onmessage@file:///Users/lhansen/tmp/performance_now_hang/pthread-main.js:72:18

Error: Script terminated by timeout at:
_emscripten_futex_wait@file:///Users/lhansen/tmp/performance_now_hang/a.js:5691:17
_pthread_join@file:///Users/lhansen/tmp/performance_now_hang/a.js:9492:9
__Z10WaitToJoinv@file:///Users/lhansen/tmp/performance_now_hang/a.js:10118:10
dynCall_v@file:///Users/lhansen/tmp/performance_now_hang/a.js:16196:3
Runtime.dynCall@file:///Users/lhansen/tmp/performance_now_hang/a.js:265:14
Browser_mainLoop_runner/<@file:///Users/lhansen/tmp/performance_now_hang/a.js:6827:13
Browser.mainLoop.runIter@file:///Users/lhansen/tmp/performance_now_hang/a.js:6886:13
Browser_mainLoop_runner@file:///Users/lhansen/tmp/performance_now_hang/a.js:6823:9

This is not consistent with what I'm seeing in other test programs I have, so it's probably something else, but what?
Ah, that timeout may be a bug in my code.  Investigating now.
(In reply to Boris Zbarsky [:bz] from comment #13)
> > I do not see Performance::Now in any of the call stacks.
> 
> Then you're not hanging under performance.now, yes?  That is, the point when
> the worker enters the sleep is not when now() has been called.
> 
> > I am seeing the message "Thread 5260064 going to sleep", but no message "Thread 5260064
> > back from sleep" from the worker thread.
> 
> Curious, indeed, since according to the C++ stack the now() call has
> certainly returned.

Like I said in my earlier comments, I have not been able to produce the correct C++ stack. The stack I posted refers to the DOM Workers that I am seeing in Xcode, and I do not believe any of them corresponds to the Worker that is hanging in Performance::Now.
(In reply to Lars T Hansen [:lth] from comment #15)
> Ah, that timeout may be a bug in my code.  Investigating now.

Oh, I also noticed that I started getting two timeout messages exactly like what you show here. Also, this bug did not occur in the previous build of your patch queue from ~1 month ago, but only started to appear when I updated to the latest yesterday.
Definitely my bug.  I'm pretty sure that what happens is along these lines: an interrupt is triggered in the worker while the worker is futexWaiting, causing the InterruptCallback event loop to be entered; subsequently a wakeup arrives, which is processed by that nested loop; the wakeup runnable returns false from its Run() method, meaning to signal to the futex event loop that it should wake up, but the interrupt event loop takes that to mean that Run() had an error, and passes the value up the chain, where it is taken to mean an abort.  Hence we see the first error message.

Once the worker is aborted, the main thread will never be woken and will hang until the slow-script dialog pops up and we can kill it.
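
In pseudocode, the confusion is roughly the following (a hypothetical JS sketch of the control flow, not the actual C++; all names are made up):

// The futex wait loop treats a runnable returning false as "wake up
// from the wait"; the interrupt event loop treats the same false as
// "Run() failed" and aborts the worker.
function futexWaitLoop(queue) {
  for (;;) {
    var runnable = queue.next();
    if (runnable.run() === false)
      return 'woken';                      // intended meaning of false
  }
}

function interruptEventLoop(queue) {
  for (;;) {
    var runnable = queue.next();
    if (runnable.run() === false)
      throw new Error('runnable failed');  // false misread as an error
  }
}

// When the wakeup runnable is delivered while the *interrupt* loop is
// the innermost one, its false return aborts the worker instead of
// ending the futex wait.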
Assignee: nobody → lhansen
Summary: Calling performance.now() in a worker thread can hang if main JS thread does not yield to browser event loop. → A worker's interrupt event loop misinterprets a futex wakeup runnable, and kills the worker
Apply this on top of the patch queue.  It passes all my tests and also solves the hang in the present bug.  I'm not thrilled about the structure of the fix so I'll ponder it some more, but it should at least unblock you.
Thank you so much for the very quick identification Lars!

I tested your patch, and it succeeds on the example that I posted in this bug report. However, when I run my original pthread mutex test, it deadlocks again (although much more rarely this time, perhaps 1/10 or even 1/20). What happens: I implement the C usleep() function for Emscripten as an Atomics.futexWait with a timeout, on a memory address that will never be woken, in order to simulate a thread doing some work for x microseconds. In that scenario, I see that occasionally the timeout does not trigger and the worker doesn't recover from the futexWait.

See the following attachment for a test case. That test is functionally the same as the original test (main thread spawns a pthread, which locks a mutex, usleep()s for 100ms, then unlocks the mutex and quits, while the main thread waits to join with the pthread). Curiously, for the hang to occur I need at least two pthreads, of which the first one finishes fine, but the second one doesn't.
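
For reference, the usleep() scheme described above is roughly the following (a hypothetical sketch against the experimental Atomics.futexWait from the patch queue; HEAP32 and DUMMY_ADDR are assumed names, and the timeout is assumed to be in milliseconds as in the later Atomics.wait):

// usleep(): futex-wait on a word that stays 0 and is never woken, so
// only the timeout can end the wait.
function usleep(usecs) {
  // HEAP32: Int32Array view on the shared heap; DUMMY_ADDR: a byte
  // address reserved for this purpose, whose value remains 0.
  Atomics.futexWait(HEAP32, DUMMY_ADDR >> 2, 0, usecs / 1000);
}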
I'll look at that.  It sounds like it could be a bug with the timeout logic (which is different on workers and the main thread, sigh), which has been tested but not extensively.
Yeah, it sounds like this could be a different issue. Should I close this as resolved and open a new bug, or is it ok to continue here in this thread?
I'm fine keeping it here.
Hypotheses:

- the goto loop in atomics_futexWait, triggered by an interrupt, does not exit because the loop exit condition is miscomputed, i.e., a time computation that goes astray

- the FutexWait loop in WorkerPrivate.cpp does not exit because the mFutexIsWakingUp flag is not detected properly

- we go into an interrupt handler and /it/ does not exit for the same reason

- the FutexWakeRunnable is not actually run, so the mFutexIsWakingUp flag is not actually set

What worries me is your observation that at least two workers must be present for this to hang.  None of the explanations above speaks to that, as everything is running on independent threads.  The only global state is the futex waiters list, which hangs off the SharedArrayRawBuffer, but the futex waiters list is not involved in deciding to wake up a worker for a timeout.
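
For the first hypothesis, the shape of the computation that has to be right is roughly this (hypothetical sketch; monotonicNowMs() and waitOnce() are made-up stand-ins for the real clock and the underlying wait primitive):

// The wait must resume after each interrupt with the *remaining* time,
// recomputed from an absolute deadline. Recomputing from a relative
// timeout on each iteration lets the exit condition drift, and the
// loop may never time out.
function waitUntilDeadline(i32a, idx, val, timeoutMs) {
  var deadline = monotonicNowMs() + timeoutMs;
  for (;;) {
    var remaining = deadline - monotonicNowMs();
    if (remaining <= 0)
      return 'timed-out';
    var r = waitOnce(i32a, idx, val, remaining);  // 'woken' | 'timed-out' | 'interrupted'
    if (r !== 'interrupted')
      return r;
  }
}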
This is a logic bug in the wakeup code, I think.  The thread wakes up just fine, but throws an uncatchable exception that kills the thread (and we don't know that because the error message is lost somewhere - not a great situation; we need to improve that, and I'll file a separate bug).  And then we deadlock.

I won't try to fix this tonight, but I'll look at it first thing in the morning - I was in the process of cleaning up that code for review in any case, and if I'm lucky the cleanup I already have will fix the problem.
I have what I think is a solid fix for this issue but I'll have to test a little more first and I'll probably have to integrate it into the existing patches.
(Or for bleeding edge code, pick up the worker-wake branch of https://github.com/lars-t-hansen/atomics-queue and apply the first six [sic!] patches in the queue.)
Fixes for these problems have been pushed to the patch queue.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED