[meta] Reduce the amount of overhead (improve performance for small messages) in postMessage to/from workers

Status: NEW
Assignee: Unassigned
Type: defect
Priority: P3
Severity: normal
Opened: 3 years ago
Last modified: 4 months ago
Reporter: lth
Tracking: Depends on 8 bugs, {meta}
Firefox Tracking Flags: Not tracked

Description (Reporter, 3 years ago):
The benchmark (details below) uses postMessage to send a simple JS object between the main thread and a dedicated worker.  The message rate is significantly lower in Firefox (20K msgs/s) than in Chrome (35K) or Safari (75K).  This has been the case since last summer at least.  It seems to be the case for release builds, Dev Ed, and locally built Nightlies.

I make no claims as to the representativeness of the benchmark but the numbers might be interesting even so.

Clone https://github.com/lars-t-hansen/parlib-simple.

The test program is in test/test-postmsg.html.  It is self-documenting.
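Schematically, the ping-pong pattern being measured looks roughly like this (a sketch, not the actual test-postmsg.html code; the worker file name and message shape are made up):

```javascript
// Pure helper: convert a round-trip count and elapsed time into a message rate.
function messagesPerSecond(roundTrips, elapsedMs) {
  return Math.round(roundTrips / (elapsedMs / 1000));
}

// Browser-only wiring; the real harness lives in test/test-postmsg.html.
if (typeof Worker !== "undefined") {
  const worker = new Worker("echo-worker.js"); // hypothetical worker that echoes each message back
  const N = 100000;
  let received = 0;
  const start = performance.now();
  worker.onmessage = () => {
    if (++received === N) {
      console.log(messagesPerSecond(N, performance.now() - start) + " msgs/s");
    } else {
      worker.postMessage({ seq: received }); // keep the ping-pong going
    }
  };
  worker.postMessage({ seq: 0 });
}
```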
Component: DOM: Events → DOM: Workers
This has been discussed on Twitter recently as well.  Here is a blog post with performance numbers across a range of message sizes:

  http://nolanlawson.com/2016/02/29/high-performance-web-worker-messages/
Long story short, structured cloning is slow, we've known about it for a while, and nothing has happened.

There are some existing bugs on that: bug 879083 (complete with reviewed but unlanded patch), bug 850185.

So the right thing to do here is to do a profile and if, as I expect, structured cloning is a bottleneck we should fix that...
Depends on: 879083, 850185
Note also that the advice in that blog post makes me very sad, because it means people will use JSON.stringify even once (well, if) we switch structured cloning to be faster...
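For reference, the workaround the blog post recommends looks roughly like this (a sketch; whether it actually wins depends on the engine and the message shape):

```javascript
// Structured clone serializes the whole object graph on postMessage:
//   worker.postMessage(msg);
// The JSON workaround serializes by hand and posts a plain string,
// on the theory that cloning a single string is cheap:
function encodeForPost(msg) {
  return JSON.stringify(msg);
}
function decodeOnReceive(wire) {
  return JSON.parse(wire);
}

const msg = { type: "tick", seq: 42, payload: [1, 2, 3] };
const wire = encodeForPost(msg);        // what would go to postMessage
const received = decodeOnReceive(wire); // what the other side reconstructs
```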
Some preliminary profiling:

CC runs too often here on the main thread. We should be able to deal with this stuff without CC, since wrappers for events can be allocated in the nursery these days. I assume this is about not clearing snow white objects fast enough. I'll investigate this. I think we can optimize this case significantly.


CallSetup::CallSetup's mAsyncCause.emplace(cx, JS_NewStringCopyZ(cx, aExecutionReason)); does a slow string allocation.


Some structured clone handling here too, but not too bad.
> CallSetup::CallSetup's mAsyncCause.emplace(cx, JS_NewStringCopyZ(cx, aExecutionReason));

Gah.  I should have just done r- for that and not believed the promises of fixing it later.  :(  Bug 1177488 tracks this and has been open for a long time now with no motion on it...

That said, this is a problem only on nightly/aurora, since async stacks are disabled on release, right?

Also, where is the CallSetup here?  Is it in the call to the "message" event listener?

Finally, note that this is measuring the performance of doing a ping-pong some number (N) of times.  It would be interesting to measure the performance of three other things:

1)  Sending N messages and not waiting for any responses (nor, indeed, sending any responses).  This would tell us how much time just the "send the message" bits take.

2)  Sending N messages and waiting for a single response that gets sent once the last message is received.  This would give us some idea of how long receiving messages on the worker takes.

3)  Sending one message to trigger sending N messages back, and measure how long that takes.  That should give us some idea of how long processing the incoming messages on the main thread takes.

Basically, trying to separate out the cost of the message sending and reception from the cost of the context switching, which may also be nontrivial....
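Sketches of those three variants (hypothetical helpers, not code from the benchmark; each reports elapsed milliseconds):

```javascript
// 1) Send N messages, no replies: isolates the "send the message" path.
function sendOnly(worker, N) {
  const start = performance.now();
  for (let i = 0; i < N; i++) worker.postMessage({ seq: i });
  return performance.now() - start;
}

// 2) Send N messages, wait for a single ack after the last one:
//    adds the worker-side receive cost to variant 1.
function sendThenAck(worker, N) {
  return new Promise((resolve) => {
    const start = performance.now();
    worker.onmessage = () => resolve(performance.now() - start);
    for (let i = 0; i < N; i++) worker.postMessage({ seq: i, last: i === N - 1 });
  });
}

// 3) One trigger message, worker sends N messages back:
//    isolates the main-thread receive cost.
function receiveOnly(worker, N) {
  return new Promise((resolve) => {
    let received = 0;
    const start = performance.now();
    worker.onmessage = () => {
      if (++received === N) resolve(performance.now() - start);
    };
    worker.postMessage({ burst: N });
  });
}
```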
Depends on: 1177488
So OK, I did a profile, after turning off async stacks.
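
For reference, async stacks can be toggled with a pref before profiling (pref name from memory of builds of that era; treat it as an assumption):

```js
// In about:config, or in a profile's user.js (assumed pref name):
user_pref("javascript.options.asyncstack", false);
```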

On the main thread, we only spend about 2.5s of the 10s total testcase time processing message events; the rest of the mainthread time is whatever it is the insane Mac event loop is up to.  This time breaks down as follows:

1)  About 200ms is setting up the AutoEntryScript in WorkerRunnable::Run
    (largely the cxpusher bits).  Which is too bad, because PostMessageRunnable
    can pretty much ignore most of that jazz... I wonder whether we can avoid
    WorkerRunnable here....
2)  300ms reading the structured clone bits.
3)  100ms MessageEvent::InitMessageEvent; about 40ms of this is HoldJSObjects, 20ms
    is JS::HeapValuePostBarrier, 40ms is Event::InitEvent, largely GetEventMessageAndAtom.
4)  50ms is MessageEvent::MessageEvent, with about half that time being Event::SetOwner 
    doing various QI and refcounting.
5)  100ms actually doing the event dispatch mechanics.
6)  150ms setting up and tearing down the CallSetup for the message listener;
    mostly cxpusher again.
7)  50ms creating a JS reflector for the MessageEvent.
8)  400ms js::AutoStopwatch::AutoStopwatch at the entrypoint to js::RunScript
    (things like thread_info in there...).
9)  30ms ~AutoStopwatch.
10) 80ms doing GC from interrupt checks.
11) 20ms doing pointless JS_WrapValue stuff for MessageEvent::GetData
    (the bindings will handle it for you, and way faster!).
12) 380ms writing structured clone bits.
13) 200ms dispatching a runnable to the worker thread.  Mostly under WorkerPrivateParent::DispatchPrivate;
    most of this is PR_Unlock doing things like pthread_cond_signal and nsThread::PutEvent doing PR_Unlock
    which does pthread_cond_signal.

There are some other minor bits here and there.

On the worker, we have 3450ms of time spent about like this:

1) 400ms canceling timers from WorkerPrivate::SetGCTimerMode
2) 70ms initializing timers and adding them to TimerThread, from SetGCTimerMode.
3) 270ms reading structured clone bits.
4) 110ms in MessageEvent ctor and InitMessageEvent.
5) 250ms in CallSetup stuff.  A bunch of this (175ms) is JS_MaybeGC triggering GC.
6) 30ms creating JS reflectors for message events.
7) 30ms GetData bits doing JS_WrapValue.
8) 340ms writing structured clone data.
9) 600ms doing nsThread::PutEvent when dispatching back to the main thread. 
   This is almost all under nsBaseAppShell::OnDispatchedEvent calling
   nsAppShell::ScheduleNativeEventCallback which calls CFRunLoopWakeUp, which
   goes off into pthread_mutex_unlock, mach_msg, pthread_mutex_lock, etc.
(In reply to Boris Zbarsky [:bz] from comment #6)
> 1)  About 200ms is setting up the AutoEntryScript in WorkerRunnable::Run
>     (largely the cxpusher bits).  Which is too bad, because
> PostMessageRunnable
>     can pretty much ignore most of that jazz... I wonder whether we can avoid
>     WorkerRunnable here....

These are messages coming *back* from the worker to the main thread here?  I wonder how often we actually do need the aes on the main thread.
> These are messages coming *back* from the worker to the main thread here? 

Yes.
(In reply to Boris Zbarsky [:bz] from comment #6)
> On the main thread, we only spend about 2.5s of the 10s total testcase time
> processing message events; the rest of the mainthread time is whatever it is
> the insane Mac event loop is up to. 
Yeah, on Linux I saw some oddity too, which I couldn't quite interpret yet: _raw_write_unlock_irqrestore taking tons of time.

Interestingly, you're apparently not seeing the CC or forgetSkippable stuff. Perhaps I was (un)lucky enough to profile when the forgetSkippable timer was active.


> 1) 400ms canceling timers from WorkerPrivate::SetGCTimerMode
Need to get back to this. GC timer handling in workers is totally bogus atm :/
Bug 1216175 should help a bit with that.
Depends on: 1216175
> 8)  400ms js::AutoStopwatch::AutoStopwatch at the entrypoint to js::RunScript
>    (things like thread_info in there...).

You see a meaningful cost in thread_info? This should be called at most twice per tick that runs JS code: once when we start actually executing JS code after having gone through XPCJSRuntime::BeforeProcessTask, and once in XPCJSRuntime::AfterProcessTask.

If there is a meaningful cost, either I'm hooking in the wrong place or there is a bug in my code.
Depends on: 1021224
Summary: postMessage performance → Reduce the amount of overhead (improve performance for small messages) in postMessage to/from workers
> This should be called at most twice per tick that runs js code. Once when we start actually executing JS code

Right, recall that in this testcase we're ping-ponging messages back and forth to the worker and main thread.  So all the JS code on the main thread does is postMessage right back to the worker.  Doing the one thread_info call on entry into that JS seems to be enough to slow it down a good bit (20-25% more time spent under JS::Call).  I filed bug 1254621 on this, but I'm not sure there's much to investigate here; this is the "lots of calls from C++ to JS" scenario, with each call coming off a new task.

Oh, and ~AutoStopwatch spends some noticeable-ish (1% of mainthread time, 2-3% of JS::Call) amount of time inserting things into a vector...
Depends on: 1254621
Currently, calling console.log() dispatches a runnable from the worker thread back to the main thread. This must be considered a delay too.
The testcase this bug is filed on does not have any console.log() calls.  It's just pure postMessage all through.
Whiteboard: btpp-meta
Depends on: 1410077
Depends on: 1410081
Depends on: 1410087
Depends on: 1410091
Depends on: 1410096
Lars, could you test this on Mac?  And perhaps try a beta build, since on Nightlies we have some slow stuff in the event loop to gather certain telemetry data etc.

On Linux we're on par with or faster than Chrome with test-postmsg.html.

The bugs I filed should give us a few % speedup.
Flags: needinfo?(lhansen)
Comment 15 (Reporter, 2 years ago):

For the benchmark in comment 0, I see 18700 msgs/s, which is about the same as 2 years ago.  This is in FF57 beta 10 with a clean profile.  A fresh Nightly is somewhat slower at 18100 msgs/s.
Flags: needinfo?(lhansen)
Bevis and Andreas, seems like we could use some performance tuning of the labelled event queue.
Flags: needinfo?(btseng)
Flags: needinfo?(afarre)
Priority: -- → P3
Summary: Reduce the amount of overhead (improve performance for small messages) in postMessage to/from workers → [meta] Reduce the amount of overhead (improve performance for small messages) in postMessage to/from workers
Whiteboard: btpp-meta
Flags: needinfo?(bevistseng)
Updated (Reporter, 10 months ago):
See Also: → 1483998
LabeledEventQueue isn't currently used (see Bug 1485216) and would do well to be removed. This would of course fix the related issues where the labelled event queue is slow.
Flags: needinfo?(afarre)

Updated (6 months ago):
Depends on: 1485216