Closed Bug 1131757 Opened 9 years ago Closed 9 years ago

A worker's interrupt event loop misinterprets a futex wakeup runnable, and kills the worker

Categories

(Core :: JavaScript Engine, defect)

Version: Other Branch
Hardware: x86
OS: macOS
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED

People

(Reporter: jujjyl, Assigned: lth)

Attachments

(5 files)

STR:

1. Make a Firefox Nightly build with SAB+Atomics+futex support following Lars Hansen's patch queue from https://github.com/lars-t-hansen/atomics-queue
2. Download the attached file and unzip it.
3. Run a.html and keep refreshing the page until the browser hangs (showing the slow script dialog).

Observed:

It looks like calling 'performance.now()' from a web worker occasionally causes a mutex lock attempt in the worker thread, which the worker thread can acquire only after the main browser thread returns from executing user JS code. What happens during the hang is:

  - the worker thread was stalled trying to execute performance.now() in function __Z5sleepi in a.js:10030.
  - the main thread was stalled waiting for the worker thread to finish execution in _pthread_join in a.js:9483, which it would have happily done if it had gotten past the performance.now() call.

This is similar to two threads each holding a lock the other needs: both are stalled and neither can proceed.

Expected:

Calling performance.now() should be a simple read of the hardware clock/timestamp counter (QueryPerformanceCounter/clock_gettime/etc.), and it should never stall waiting for the main browser thread to yield execution back from user JS code to the browser.

Note: Even outside the experimental SAB+Atomics+futex work, this looks like a performance bug: calling performance.now() should never involve waiting on what any other thread might be doing; it should be a very quick, constant-time operation that reads a hardware clock counter independently of other executing threads.
>  - the worker thread was stalled trying to execute performance.now() in function __Z5sleepi in a.js:10030.

What was the C++ callstack at this point?

Also, what OS are you on?  On Windows, QueryPerformanceCounter is broken enough that you can't use it on its own for this sort of API.  So we end up also using GetTickCount64, but if that's not available (because the Windows version is too old; GetTickCount64 is only available in Vista or later), we'll fall back to GetTickCount and a lock around some variables dealing with tick count rollover...
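
To illustrate why that fallback needs a lock (a hypothetical JS sketch, not Gecko's actual code): GetTickCount is a 32-bit millisecond counter that wraps roughly every 49.7 days, so successive reads must detect the wrap and accumulate an epoch, and that shared state must be guarded when accessed from multiple threads.

// Shared rollover state; in the C++ this is what the lock protects.
var lastTicks = 0;
var rolloverEpochs = 0;

// ticks32: the latest 32-bit GetTickCount-style reading.
function ticks64(ticks32) {
  if (ticks32 < lastTicks)   // counter wrapped past 2^32 ms
    rolloverEpochs++;
  lastTicks = ticks32;
  return rolloverEpochs * Math.pow(2, 32) + ticks32;
}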
Flags: needinfo?(jujjyl)
How do I identify which thread is the web worker thread? I am able to break execution in the OS X Xcode debugger and examine the callstacks of all the threads, but I'm unable to locate which one is the web worker that is running my code. I see the main thread is waiting on the following, which looks correct and expected:

#0	0x00007fff8dde7132 in __psynch_cvwait ()
#1	0x00007fff8fdf0ea0 in _pthread_cond_wait ()
#2	0x00000001006226a3 in pt_TimedWait [inlined] at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptsynch.c:264
#3	0x0000000100622620 in PR_WaitCondVar at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptsynch.c:387
#4	0x0000000101ad7efc in JSMainFutexAPIImpl::wait(double) at /Users/jjylanki/mozilla-inbound-2/dom/base/nsJSEnvironment.cpp:2665
#5	0x00000001037f76d8 in js::atomics_futexWait(JSContext*, unsigned int, JS::Value*) at /Users/jjylanki/mozilla-inbound-2/js/src/builtin/AtomicsObject.cpp:824
#6	0x000000010cbeb5de in 0x10cbeb5de ()

but as for the other threads, I see

 - 12 threads all named "Analysis Helper", which are each in

Analysis Helper (12)
#0	0x00007fff8dde7132 in __psynch_cvwait ()
#1	0x00007fff8fdf0ea0 in _pthread_cond_wait ()
#2	0x00000001006226bd in PR_WaitCondVar at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptsynch.c:385
#3	0x00000001038b8afe in js::GlobalHelperThreadState::wait(js::GlobalHelperThreadState::CondVar, unsigned int) [inlined] at /Users/jjylanki/mozilla-inbound-2/js/src/vm/HelperThreads.cpp:548
#4	0x00000001038b8aed in js::HelperThread::threadLoop() at /Users/jjylanki/mozilla-inbound-2/js/src/vm/HelperThreads.cpp:1380
#5	0x0000000100624a51 in _pt_root at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptthread.c:212
#6	0x00007fff8fdf02fc in _pthread_body ()
#7	0x00007fff8fdf0279 in _pthread_start ()
#8	0x00007fff8fdee4b1 in thread_start ()

 - 23 threads that are all unnamed, and each in

Thread 121
#0	0x00007fff8dde252e in mach_msg_trap ()
#1	0x00007fff8dde169f in mach_msg ()
#2	0x00000001037ec84d in AsmJSMachExceptionHandlerThread(void*) at /Users/jjylanki/mozilla-inbound-2/js/src/asmjs/AsmJSSignalHandlers.cpp:720
#3	0x0000000100624a51 in _pt_root at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptthread.c:212
#4	0x00007fff8fdf02fc in _pthread_body ()
#5	0x00007fff8fdf0279 in _pthread_start ()
#6	0x00007fff8fdee4b1 in thread_start ()

 - 19 threads that are named "DOM Worker", which are each in

DOM Worker (27)
#0	0x00007fff8dde7132 in __psynch_cvwait ()
#1	0x00007fff8fdf0ea0 in _pthread_cond_wait ()
#2	0x00000001006226bd in PR_WaitCondVar at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptsynch.c:385
#3	0x00000001028f9b93 in mozilla::CondVar::Wait(unsigned int) [inlined] at /Users/jjylanki/mozilla-inbound-2/obj-x86_64-apple-darwin14.0.0/dom/workers/../../dist/include/mozilla/CondVar.h:79
#4	0x00000001028f9b75 in mozilla::dom::workers::WorkerPrivate::WaitForWorkerEvents(unsigned int) at /Users/jjylanki/mozilla-inbound-2/dom/workers/WorkerPrivate.cpp:5133
#5	0x00000001028f94bd in mozilla::dom::workers::WorkerPrivate::DoRunLoop(JSContext*) at /Users/jjylanki/mozilla-inbound-2/dom/workers/WorkerPrivate.cpp:4563
#6	0x00000001028e7c48 in (anonymous namespace)::WorkerThreadPrimaryRunnable::Run() at /Users/jjylanki/mozilla-inbound-2/dom/workers/RuntimeService.cpp:2671
#7	0x0000000101094d5f in nsThread::ProcessNextEvent(bool, bool*) at /Users/jjylanki/mozilla-inbound-2/xpcom/threads/nsThread.cpp:855
#8	0x00000001010b41e7 in NS_ProcessNextEvent(nsIThread*, bool) at /Users/jjylanki/mozilla-inbound-2/xpcom/glue/nsThreadUtils.cpp:265
#9	0x000000010131d3b0 in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) at /Users/jjylanki/mozilla-inbound-2/ipc/glue/MessagePump.cpp:368
#10	0x0000000101301cbc in MessageLoop::RunInternal() [inlined] at /Users/jjylanki/mozilla-inbound-2/ipc/chromium/src/base/message_loop.cc:233
#11	0x0000000101301cad in MessageLoop::RunHandler() [inlined] at /Users/jjylanki/mozilla-inbound-2/ipc/chromium/src/base/message_loop.cc:226
#12	0x0000000101301cad in MessageLoop::Run() at /Users/jjylanki/mozilla-inbound-2/ipc/chromium/src/base/message_loop.cc:200
#13	0x00000001010939bb in nsThread::ThreadFunc(void*) at /Users/jjylanki/mozilla-inbound-2/xpcom/threads/nsThread.cpp:356
#14	0x0000000100624a51 in _pt_root at /Users/jjylanki/mozilla-inbound-2/nsprpub/pr/src/pthreads/ptthread.c:212
#15	0x00007fff8fdf02fc in _pthread_body ()
#16	0x00007fff8fdf0279 in _pthread_start ()
#17	0x00007fff8fdee4b1 in thread_start ()

and a bunch of others with names that look unrelated. (btw, there are 97 threads total in Firefox when there's just that one tab open.) Is there a way to find which of them is running the Web Worker?
Flags: needinfo?(jujjyl)
Oh, and this was on OS X, not on Windows.
> How do I identify which thread is the web worker thread?

That would be the "DOM Worker" thread.  Though you could also just attach all the thread stacks to this bug (not as a comment, since that would be a pretty huge comment).

That DOM worker thread is just waiting for something to do; it's run its code to completion and is waiting for an event to tell it to run more code or whatever.

> Oh, and this was on OS X, not on Windows.

On OSX, the implementation of performance.now() on workers is like so (mozilla::dom::workers::Performance::Now):

  TimeDuration duration =
    TimeStamp::Now() - mWorkerPrivate->NowBaseTimeStamp();
  return duration.ToMilliseconds();

NowBaseTimeStamp() is just an accessor, no locks.  

ToMilliseconds() just returns ToSeconds() * 1000.0 and ToSeconds() on Mac returns (mValue * sNsPerTick) / kNsPerSecd, no locks.

Now() on Mac just does a ClockTime() call, which just calls mach_absolute_time().  There are no locks anywhere in there, unless mach_absolute_time has some under the hood.

Do you actually see Performance::Now in any of the stacks? If not, why do you think that it's involved in the deadlock in any way?
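
In JS terms the code above boils down to the following (a rough, hypothetical rendering, not the real implementation; monotonicNowMs() stands in for TimeStamp::Now()):

// Worker performance.now() is a plain subtraction of two monotonic
// timestamps; there is no locking anywhere on this path.
function workerPerformanceNow(nowBaseTimeStampMs) {
  var durationMs = monotonicNowMs() - nowBaseTimeStampMs;
  return durationMs;  // already in milliseconds
}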
Attached file pthread-main.js
You also need this file.
19 DOM Workers is a scary number, because there's a per-domain limit of 20 - after that you should expect deadlocks because some threads won't be mapped onto actual running threads.  This doesn't mean that's the problem but I find it a little worrisome.

When I start your program in a browser and reload it a few times I see 11 workers total (Linux 64-bit, debug build). The program itself creates 8 workers that I can see, so unless a 'new Worker' call is hidden somewhere clever I would not expect many more than 8 workers to be present; it looks like we have a general overhead of 3 threads. 19 seems really high. Possibly you had several tabs open?

Indeed, when I create three tabs with this program, the last tab I create never finishes loading, and the DOM worker threads just keep stacking up.  Breaking in the debugger I find there are 23 threads, which is exactly right: 20 for the file: domain and the 3 workers that were hanging around from the start.

cc'ing various interested parties, the 20-per-domain hard limit is going to cause real pain once workers can be used as threads.
Possibly delayed deletion / unloading of workers could cause the number to peak momentarily and hit the cutoff?  No evidence of this yet, of course.
Re per-domain limits, see also bug 1053275.
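
To illustrate the hazard (a minimal hypothetical sketch, not from the test case itself): with a hard cap such as the 20-per-domain limit, workers created beyond the cap are queued rather than started, so any code that blocks waiting on one of them can deadlock.

// Hypothetical: with a cap of 20 workers per domain, workers 21..25
// below are queued and never start while the first 20 stay alive.
var workers = [];
for (var i = 0; i < 25; i++)
  workers.push(new Worker('pthread-main.js'));
// If one of the first 20 workers now futex-waits for a store that only
// worker #21 performs, neither makes progress: deadlock by starvation.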
Thanks Lars for posting the missing pthread-main.js file to the bug, and sorry for forgetting that!

Boris: I tried again with a new run, and what I am seeing is that again all DOM Workers are sleeping with the same call stack that "DOM Worker (27)" in my previous comment lists. I do not see Performance::Now in any of the call stacks.

I am quite confident that it is performance.now() that is stalling, for the following reasons.

1. The code in the .js file that the thread runs looks as follows:

function _emscripten_get_now() {
  return performance.now();
}

function __Z5sleepi($msecs) {
 $msecs = $msecs|0;
 var $0 = 0, $1 = 0.0, $2 = 0.0, $3 = 0, $4 = 0.0, $5 = 0.0, $6 = 0.0, $7 = 0.0, $8 = 0, $t0 = 0.0, $t1 = 0.0, label = 0, sp = 0;
 sp = STACKTOP;
 STACKTOP = STACKTOP + 32|0; if ((STACKTOP|0) >= (STACK_MAX|0)) abort();
 $0 = $msecs;
 $1 = (+_emscripten_get_now());
 $t0 = $1;
 $2 = $t0;
 $3 = $0;
 $4 = (+($3|0));
 $5 = $2 + $4;
 $t1 = $5;
 while(1) {
  $6 = (+_emscripten_get_now());
  $7 = $t1;
  $8 = $6 < $7;
  if (!($8)) {
   break;
  }
 }
 STACKTOP = sp;return;
}

function _pthread_self() {
  if (ENVIRONMENT_IS_PTHREAD) return threadBlock;
  return PThread.mainThreadBlock;
}

function __Z10ThreadMainPv($arg) {
 // stuff ...
 dump('Thread ' + _pthread_self() + ' going to sleep\n');
 __Z5sleepi(100);
 dump('Thread ' + _pthread_self() + ' back from sleep\n');
 // stuff ...
}

and enabling dump in about:config, I am seeing the message "Thread 5260064 going to sleep", but no message "Thread 5260064 back from sleep" from the worker thread.

2. If I remove the dump() calls, the hang still occurs, so I know that it's not the dump() calls themselves.

3. If I replace the function _emscripten_get_now() above with the following function:

var simulatedTime = 0;
function _emscripten_get_now() {
  simulatedTime += 0.01;
  return simulatedTime;
}

the execution will always succeed and never hang. That is, removing the call to performance.now() and replacing it with a fake, strictly monotonic value avoids the hang.

Lars: the fixed pool size is definitely a problem, but this issue is something else, since I always see the culprit worker thread dump "Thread 5260064 going to sleep" as the last message it prints.

I don't know why I can't find the DOM Worker thread in Xcode's thread list. Could the thread in question already have been killed by something by the time I break?
Oh, and I was running only one tab in the browser, but I was reloading that tab a few times to get the hang (it usually occurs in 1/3 to 1/4 of the runs).

Also, I'm running with the pref dom.workers.maxPerDomain=60 to give some breathing room for the pool size, but that does not affect the issue here, I tested that the bug reproduces with the default value for that pref as well.
Thanks for all the feedback.  I'll move over to my Mac and see if I can reproduce it there.  (I still don't understand why there should be so many Worker instances alive at the time of the hang.)
Perhaps reloading the page doesn't immediately kill the workers from the old page, but they have some kind of timeout or asynchronicity to them? Since the hang occurs immediately when the page is (re)loaded, the old threads would still be alive when breaking into the debugger.
> I do not see Performance::Now in any of the call stacks.

Then you're not hanging under performance.now, yes?  That is, the point when the worker enters the sleep is not when now() has been called.

> I am seeing the message "Thread 5260064 going to sleep", but no message "Thread 5260064
> back from sleep" from the worker thread.

Curious, indeed, since according to the C++ stack the now() call has certainly returned.

So I tried to reproduce this.  I have yet to see a case where I get a "going to sleep" message without a corresponding "back from sleep".  However I also get a bunch of:

  Error: futex support is not installed a.js:5691:16
  "pthread sent an error! Error: futex support is not installed" a.html:1245:12
  "[post-exception status] Exception thrown, see JavaScript console" a.html:1245:12

in the console.  Is that expected?
bz: you need to apply a patch queue to use the futexes atm, see comment #0 I think.

I'm pretty sure this is not the fault of performance.now().

Curiously there appears to be a watchdog timer on worker threads on Mac.  Surely I must be imagining this?  It looks like the "sleeping" thread is killed off after a while, though; the regular slow-script dialog then kills off the waiting main thread.  And I get this in the console (the first one is the thread and appears immediately as the slow-script dialog comes up; the second only when I click to kill the script):

Error: Script terminated by timeout at:
__Z5sleepi@file:///Users/lhansen/tmp/performance_now_hang/a.js:10026:1
__Z10ThreadMainPv@file:///Users/lhansen/tmp/performance_now_hang/a.js:10051:2
dynCall_ii@file:///Users/lhansen/tmp/performance_now_hang/a.js:16182:10
this.onmessage@file:///Users/lhansen/tmp/performance_now_hang/pthread-main.js:72:18

Error: Script terminated by timeout at:
_emscripten_futex_wait@file:///Users/lhansen/tmp/performance_now_hang/a.js:5691:17
_pthread_join@file:///Users/lhansen/tmp/performance_now_hang/a.js:9492:9
__Z10WaitToJoinv@file:///Users/lhansen/tmp/performance_now_hang/a.js:10118:10
dynCall_v@file:///Users/lhansen/tmp/performance_now_hang/a.js:16196:3
Runtime.dynCall@file:///Users/lhansen/tmp/performance_now_hang/a.js:265:14
Browser_mainLoop_runner/<@file:///Users/lhansen/tmp/performance_now_hang/a.js:6827:13
Browser.mainLoop.runIter@file:///Users/lhansen/tmp/performance_now_hang/a.js:6886:13
Browser_mainLoop_runner@file:///Users/lhansen/tmp/performance_now_hang/a.js:6823:9

This is not consistent with what I'm seeing in other test programs I have, so it's probably something else, but what?
Ah, that timeout may be a bug in my code.  Investigating now.
(In reply to Boris Zbarsky [:bz] from comment #13)
> > I do not see Performance::Now in any of the call stacks.
> 
> Then you're not hanging under performance.now, yes?  That is, the point when
> the worker enters the sleep is not when now() has been called.
> 
> > I am seeing the message "Thread 5260064 going to sleep", but no message "Thread 5260064
> > back from sleep" from the worker thread.
> 
> Curious, indeed, since according to the C++ stack the now() call has
> certainly returned.

Like I said in my earlier comments, I have not been able to produce the correct C++ stack. The stack I posted refers to the DOM Workers that I am seeing in Xcode, and I do not believe any of them corresponds to the Worker that is hanging in Performance::Now.
(In reply to Lars T Hansen [:lth] from comment #15)
> Ah, that timeout may be a bug in my code.  Investigating now.

Oh, I also noticed that I started getting two timeout messages exactly like what you show here. Also, this bug did not occur in the previous build of your patch queue from ~1 month ago, but only started to appear when I updated to the latest yesterday.
Definitely my bug.  I'm pretty sure that what happens is along these lines: an interrupt is triggered in the worker while the worker is futexWaiting, causing the InterruptCallback event loop to be entered; subsequently a wakeup arrives, which is processed by that nested loop; the wakeup runnable returns false from its Run() method, meaning to signal to the futex event loop that it should wake up, but the interrupt event loop takes that to mean that Run() had an error, and passes the value up the chain, where it is taken to mean an abort.  Hence we see the first error message.

Once the worker is aborted, the main thread will never be woken and will hang until the slow-script dialog pops up and we can kill it.
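
In pseudocode, the confusion is roughly the following (a hypothetical JS sketch of the control flow, not the actual C++; all names are made up):

// The futex wait loop treats a runnable returning false as "wake up
// from the wait"; the interrupt event loop treats the same false as
// "Run() failed" and aborts the worker.
function futexWaitLoop(queue) {
  for (;;) {
    var runnable = queue.next();
    if (runnable.run() === false)
      return 'woken';                      // intended meaning of false
  }
}

function interruptEventLoop(queue) {
  for (;;) {
    var runnable = queue.next();
    if (runnable.run() === false)
      throw new Error('runnable failed');  // false misread as an error
  }
}

// When the wakeup runnable is delivered while the *interrupt* loop is
// the innermost one, its false return aborts the worker instead of
// ending the futex wait.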
Assignee: nobody → lhansen
Summary: Calling performance.now() in a worker thread can hang if main JS thread does not yield to browser event loop. → A worker's interrupt event loop misinterprets a futex wakeup runnable, and kills the worker
Apply this on top of the patch queue.  It passes all my tests and also solves the hang in the present bug.  I'm not thrilled about the structure of the fix so I'll ponder it some more, but it should at least unblock you.
Thank you so much for the very quick identification Lars!

I tested your patch, and it succeeds on the example that I posted in this bug report. However, when I run my original pthread mutex test, it deadlocks again (although much more rarely this time, perhaps 1/10 or even 1/20). What happens: I implement the C usleep() function for Emscripten as an Atomics.futexWait with a timeout, on a memory address that will never be woken, in order to simulate a thread doing some work for x microseconds. In that scenario, I see that occasionally the timeout does not trigger and the worker doesn't recover from the futexWait.

See the following attachment for a test case. That test is functionally the same as the original test (main thread spawns a pthread, which locks a mutex, usleep()s for 100ms, then unlocks the mutex and quits, while the main thread waits to join with the pthread). Curiously, for the hang to occur I need at least two pthreads, of which the first one finishes fine, but the second one doesn't.
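
For reference, the usleep() scheme described above is roughly the following (a hypothetical sketch against the experimental Atomics.futexWait from the patch queue; HEAP32 and DUMMY_ADDR are assumed names, and the timeout is assumed to be in milliseconds as in the later Atomics.wait):

// usleep(): futex-wait on a word that stays 0 and is never woken, so
// only the timeout can end the wait.
function usleep(usecs) {
  // HEAP32: Int32Array view on the shared heap; DUMMY_ADDR: a byte
  // address reserved for this purpose, whose value remains 0.
  Atomics.futexWait(HEAP32, DUMMY_ADDR >> 2, 0, usecs / 1000);
}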
I'll look at that.  It sounds like it could be a bug with the timeout logic (which is different on workers and the main thread, sigh), which has been tested but not extensively.
Yeah, it sounds like this could be a different issue. Should I close this as resolved and open a new bug, or is it ok to continue here in this thread?
I'm fine keeping it here.
Hypotheses:

- the goto loop in atomics_futexWait, triggered by an interrupt, does not exit because the loop exit condition is miscomputed, i.e., a time computation that goes astray

- the FutexWait loop in WorkerPrivate.cpp does not exit because the mFutexIsWakingUp flag is not detected properly

- we go into an interrupt handler and /it/ does not exit for the same reason

- the FutexWakeRunnable is not actually run, so the mFutexIsWakingUp flag is not actually set

What worries me is your observation that at least two workers must be present for this to hang.  None of the explanations above speaks to that, as everything is running on independent threads.  The only global state is the futex waiters list, which hangs off the SharedArrayRawBuffer, but the futex waiters list is not involved in deciding to wake up a worker for a timeout.
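
For the first hypothesis, the shape of the computation that has to be right is roughly this (hypothetical sketch; monotonicNowMs() and waitOnce() are made-up stand-ins for the real clock and the underlying wait primitive):

// The wait must resume after each interrupt with the *remaining* time,
// recomputed from an absolute deadline. Recomputing from a relative
// timeout on each iteration lets the exit condition drift, and the
// loop may never time out.
function waitUntilDeadline(i32a, idx, val, timeoutMs) {
  var deadline = monotonicNowMs() + timeoutMs;
  for (;;) {
    var remaining = deadline - monotonicNowMs();
    if (remaining <= 0)
      return 'timed-out';
    var r = waitOnce(i32a, idx, val, remaining);  // 'woken' | 'timed-out' | 'interrupted'
    if (r !== 'interrupted')
      return r;
  }
}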
This is a logic bug in the wakeup code, I think.  The thread wakes up just fine, but throws an uncatchable exception that kills the thread (and we don't know that because the error message is lost somewhere - not a great situation; we need to improve that, and I'll file a separate bug).  And then we deadlock.

I won't try to fix this tonight, but I'll look at it first thing in the morning - I was in the process of cleaning up that code for review in any case, and if I'm lucky the cleanup I already have will fix the problem.
I have what I think is a solid fix for this issue but I'll have to test a little more first and I'll probably have to integrate it into the existing patches.
(Or for bleeding edge code, pick up the worker-wake branch of https://github.com/lars-t-hansen/atomics-queue and apply the first six [sic!] patches in the queue.)
Fixes for these problems have been pushed to the patch queue.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED