[meta] Reduce the amount of overhead (improve performance for small messages) in postMessage to/from workers

Status: NEW
Assignee: Unassigned
Type: defect
Priority: P3
Severity: normal
Opened: 3 years ago
Last modified: 4 months ago
Reporter: lth
Tracking: Depends on 8 bugs, {meta}
Firefox Tracking Flags: Not tracked

Description (Reporter, 3 years ago):
The benchmark (details below) uses postMessage to send a simple JS object between the main thread and a dedicated worker.  The message rate is significantly lower in Firefox (20K msgs/s) than in Chrome (35K) or Safari (75K).  This has been the case since last summer at least.  It seems to be the case for release builds, Dev Ed, and locally built Nightlies.

I make no claims as to the representativeness of the benchmark but the numbers might be interesting even so.

Clone https://github.com/lars-t-hansen/parlib-simple.

The test program is in test/test-postmsg.html.  It is self-documenting.
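Schematically, the ping-pong pattern being measured looks roughly like this (a sketch, not the actual test-postmsg.html code; the worker file name and message shape are made up):

```javascript
// Pure helper: convert a round-trip count and elapsed time into a message rate.
function messagesPerSecond(roundTrips, elapsedMs) {
  return Math.round(roundTrips / (elapsedMs / 1000));
}

// Browser-only wiring; the real harness lives in test/test-postmsg.html.
if (typeof Worker !== "undefined") {
  const worker = new Worker("echo-worker.js"); // hypothetical worker that echoes each message back
  const N = 100000;
  let received = 0;
  const start = performance.now();
  worker.onmessage = () => {
    if (++received === N) {
      console.log(messagesPerSecond(N, performance.now() - start) + " msgs/s");
    } else {
      worker.postMessage({ seq: received }); // keep the ping-pong going
    }
  };
  worker.postMessage({ seq: 0 });
}
```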
Component: DOM: Events → DOM: Workers
This has been discussed on Twitter recently as well.  Here is a blog post with performance numbers across a range of message sizes:

  http://nolanlawson.com/2016/02/29/high-performance-web-worker-messages/
Long story short, structured cloning is slow, we've known about it for a while, and nothing has happened.

There are some existing bugs on that: bug 879083 (complete with reviewed but unlanded patch), bug 850185.

So the right thing to do here is to do a profile and if, as I expect, structured cloning is a bottleneck we should fix that...
Depends on: 879083, 850185
Note also that the advice in that blog post makes me very sad, because it means people will use JSON.stringify even once (well, if) we switch structured cloning to be faster...
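For reference, the workaround the blog post recommends looks roughly like this (a sketch; whether it actually wins depends on the engine and the message shape):

```javascript
// Structured clone serializes the whole object graph on postMessage:
//   worker.postMessage(msg);
// The JSON workaround serializes by hand and posts a plain string,
// on the theory that cloning a single string is cheap:
function encodeForPost(msg) {
  return JSON.stringify(msg);
}
function decodeOnReceive(wire) {
  return JSON.parse(wire);
}

const msg = { type: "tick", seq: 42, payload: [1, 2, 3] };
const wire = encodeForPost(msg);        // what would go to postMessage
const received = decodeOnReceive(wire); // what the other side reconstructs
```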
Some preliminary profiling:

CC runs too often here on the main thread. We should be able to deal with this stuff without CC, since wrappers for events can be allocated in the nursery these days. I assume this is about not clearing snow white objects fast enough. I'll investigate this. I think we can optimize this case significantly.


CallSetup::CallSetup's mAsyncCause.emplace(cx, JS_NewStringCopyZ(cx, aExecutionReason)); does a slow string allocation.


Some structured clone handling here too, but not too bad.
> CallSetup::CallSetup's mAsyncCause.emplace(cx, JS_NewStringCopyZ(cx, aExecutionReason));

Gah.  I should have just done r- for that and not believed the promises of fixing it later.  :(  Bug 1177488 tracks this and has been open for a long time now with no motion on it...

That said, this is a problem only on nightly/aurora, since async stacks are disabled on release, right?

Also, where is the CallSetup here?  Is it in the call to the "message" event listener?

Finally, note that this is measuring the performance of doing a ping-pong some number (N) of times.  It would be interesting to measure the performance of three other things:

1)  Sending N messages and not waiting for any responses (nor, indeed, sending any responses).  This would tell us how much time just the "send the message" bits take.

2)  Sending N messages and waiting for a single response that gets sent once the last message is received.  This would give us some idea of how long receiving messages on the worker takes.

3)  Sending one message to trigger sending N messages back, and measure how long that takes.  That should give us some idea of how long processing the incoming messages on the main thread takes.

Basically, trying to separate out the cost of the message sending and reception from the cost of the context switching, which may also be nontrivial....
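Sketches of those three variants (hypothetical helpers, not code from the benchmark; each reports elapsed milliseconds):

```javascript
// 1) Send N messages, no replies: isolates the "send the message" path.
function sendOnly(worker, N) {
  const start = performance.now();
  for (let i = 0; i < N; i++) worker.postMessage({ seq: i });
  return performance.now() - start;
}

// 2) Send N messages, wait for a single ack after the last one:
//    adds the worker-side receive cost to variant 1.
function sendThenAck(worker, N) {
  return new Promise((resolve) => {
    const start = performance.now();
    worker.onmessage = () => resolve(performance.now() - start);
    for (let i = 0; i < N; i++) worker.postMessage({ seq: i, last: i === N - 1 });
  });
}

// 3) One trigger message, worker sends N messages back:
//    isolates the main-thread receive cost.
function receiveOnly(worker, N) {
  return new Promise((resolve) => {
    let received = 0;
    const start = performance.now();
    worker.onmessage = () => {
      if (++received === N) resolve(performance.now() - start);
    };
    worker.postMessage({ burst: N });
  });
}
```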
Depends on: 1177488
So OK, I did a profile, after turning off async stacks.
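
For reference, async stacks can be toggled with a pref before profiling (pref name from memory of builds of that era; treat it as an assumption):

```js
// In about:config, or in a profile's user.js (assumed pref name):
user_pref("javascript.options.asyncstack", false);
```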

On the main thread, we only spend about 2.5s of the 10s total testcase time processing message events; the rest of the mainthread time is whatever it is the insane Mac event loop is up to.  This time breaks down as follows:

1)  About 200ms is setting up the AutoEntryScript in WorkerRunnable::Run
    (largely the cxpusher bits).  Which is too bad, because PostMessageRunnable
    can pretty much ignore most of that jazz... I wonder whether we can avoid
    WorkerRunnable here....
2)  300ms reading the structured clone bits.
3)  100ms MessageEvent::InitMessageEvent; about 40ms of this is HoldJSObjects, 20ms
    is JS::HeapValuePostBarrier, 40ms is Event::InitEvent, largely GetEventMessageAndAtom.
4)  50ms is MessageEvent::MessageEvent, with about half that time being Event::SetOwner 
    doing various QI and refcounting.
5)  100ms actually doing the event dispatch mechanics.
6)  150ms setting up and tearing down the CallSetup for the message listener;
    mostly cxpusher again.
7)  50ms creating a JS reflector for the MessageEvent.
8)  400ms js::AutoStopwatch::AutoStopwatch at the entrypoint to js::RunScript
    (things like thread_info in there...).
9)  30ms ~AutoStopwatch.
10) 80ms doing GC from interrupt checks.
11) 20ms doing pointless JS_WrapValue stuff for MessageEvent::GetData
    (the bindings will handle it for you, and way faster!).
12) 380ms writing structured clone bits.
13) 200ms dispatching a runnable to the worker thread.  Mostly under WorkerPrivateParent::DispatchPrivate;
    most of this is PR_Unlock doing things like pthread_cond_signal and nsThread::PutEvent doing PR_Unlock
    which does pthread_cond_signal.

There are some other minor bits here and there.

On the worker, we have 3450ms of time spent about like this:

1) 400ms canceling timers from WorkerPrivate::SetGCTimerMode
2) 70ms initializing timers and adding them to TimerThread, from SetGCTimerMode.
3) 270ms reading structured clone bits.
4) 110ms in MessageEvent ctor and InitMessageEvent.
5) 250ms in CallSetup stuff.  A bunch of this (175ms) is JS_MaybeGC triggering GC.
6) 30ms creating JS reflectors for message events.
7) 30ms GetData bits doing JS_WrapValue.
8) 340ms writing structured clone data.
9) 600ms doing nsThread::PutEvent when dispatching back to the main thread. 
   This is almost all under nsBaseAppShell::OnDispatchedEvent calling
   nsAppShell::ScheduleNativeEventCallback which calls CFRunLoopWakeUp, which
   goes off into pthread_mutex_unlock, mach_msg, pthread_mutex_lock, etc.
(In reply to Boris Zbarsky [:bz] from comment #6)
> 1)  About 200ms is setting up the AutoEntryScript in WorkerRunnable::Run
>     (largely the cxpusher bits).  Which is too bad, because
> PostMessageRunnable
>     can pretty much ignore most of that jazz... I wonder whether we can avoid
>     WorkerRunnable here....

These are messages coming *back* from the worker to the main thread here?  I wonder how often we actually do need the aes on the main thread.
> These are messages coming *back* from the worker to the main thread here? 

Yes.
(In reply to Boris Zbarsky [:bz] from comment #6)
> On the main thread, we only spend about 2.5s of the 10s total testcase time
> processing message events; the rest of the mainthread time is whatever it is
> the insane Mac event loop is up to. 
Yeah, on Linux I saw some oddity too, which I couldn't quite interpret yet: _raw_write_unlock_irqrestore taking tons of time.

Interestingly, you're apparently not seeing the CC or forgetSkippable stuff. Perhaps I was (un)lucky enough to profile when the forgetSkippable timer was active.


> 1) 400ms canceling timers from WorkerPrivate::SetGCTimerMode
Need to get back to this. GC timer handling in workers is totally bogus atm :/
Bug 1216175 should help a bit with that.
Depends on: 1216175
> 8)  400ms js::AutoStopwatch::AutoStopwatch at the entrypoint to js::RunScript
>    (things like thread_info in there...).

You see a meaningful cost in thread_info? This should be called at most twice per tick that runs JS code: once when we start actually executing JS code after having gone through XPCJSRuntime::BeforeProcessTask, and once in XPCJSRuntime::AfterProcessTask.

If there is a meaningful cost, either I'm hooking in the wrong place or there is a bug in my code.
Depends on: 1021224
Summary: postMessage performance → Reduce the amount of overhead (improve performance for small messages) in postMessage to/from workers
> This should be called at most twice per tick that runs js code. Once when we start actually executing JS code

Right, recall that in this testcase we're ping-ponging messages back and forth to the worker and main thread.  So all the JS code on the main thread does is postMessage right back to the worker.  Doing the one thread_info call on entry into that JS seems to be enough to slow it down a good bit (20-25% more time spent under JS::Call).  I filed bug 1254621 on this, but I'm not sure there's much to investigate here; this is the "lots of calls from C++ to JS" scenario, with each call coming off a new task.

Oh, and ~AutoStopwatch spends some noticeable-ish (1% of mainthread time, 2-3% of JS::Call) amount of time inserting things into a vector...
Depends on: 1254621
Currently, calling console.log() dispatches a runnable from the worker thread back to the main thread. This must be considered a delay too.
The testcase this bug is filed on does not have any console.log() calls.  It's just pure postMessage all through.
Whiteboard: btpp-meta
Depends on: 1410077
Depends on: 1410081
Depends on: 1410087
Depends on: 1410091
Depends on: 1410096
Lars, could you test this on Mac?  And perhaps try a beta build, since on Nightlies we have some slow stuff in the event loop to gather certain telemetry data etc.

On Linux we're on par with or faster than Chrome with test-postmsg.html.

The bugs I filed should give us a few % speedup.
Flags: needinfo?(lhansen)
Comment 15 (Reporter, 2 years ago):

For the benchmark in comment 0, I see 18700 msgs/s, which is about the same as 2 years ago.  This is in FF57 beta 10 with a clean profile.  A fresh Nightly is somewhat slower at 18100 msgs/s.
Flags: needinfo?(lhansen)
Bevis and Andreas, seems like we could use some performance tuning of the labelled event queue.
Flags: needinfo?(btseng)
Flags: needinfo?(afarre)
Priority: -- → P3
Summary: Reduce the amount of overhead (improve performance for small messages) in postMessage to/from workers → [meta] Reduce the amount of overhead (improve performance for small messages) in postMessage to/from workers
Whiteboard: btpp-meta
Flags: needinfo?(bevistseng)
Updated (Reporter, 10 months ago):
See Also: → 1483998
LabeledEventQueue isn't currently used (see Bug 1485216) and would do well to be removed. This would of course fix the related issues where the labelled event queue is slow.
Flags: needinfo?(afarre)

Updated (6 months ago):
Depends on: 1485216