Open
Bug 1380061
Opened 7 years ago
Updated 2 years ago
Could NodeListBinding::DOMProxyHandler::get() be optimized?
Categories
(Core :: DOM: Core & HTML, enhancement, P2)
Core
DOM: Core & HTML
Tracking
()
NEW
Performance Impact | low |
People
(Reporter: smaug, Unassigned)
References
(Depends on 1 open bug, Blocks 2 open bugs)
Details
(Keywords: perf)
Attachments
(1 file)
475 bytes,
text/html
|
Details |
I was profiling http://speedometer2.benj.me/InteractiveRunner.html Elm-TodoMVC and NodeList::DOMProxyHandler::get() showed up. https://perfht.ml/2uavsT8 Anything we could do there? (I didn't yet look at the code too much)
Reporter | ||
Updated•7 years ago
|
Summary: Could NodeList::DOMProxyHandler::get() be optimized? → Could NodeListBinding::DOMProxyHandler::get() be optimized?
Comment 1•7 years ago
|
||
Fundamentally, the proxy handler does the following things: * Checks whether the incoming jsid is a valid array index, and if so which one. * If so, unwraps "proxy " to get the nsINodeList*, calls self->IndexedGetter(), wraps the result. * If not, various other stuff. Looking at the profile, we spend 7.5ms under get(), of which 3.5ms is calling IndexedGetter() (1.5ms nsChildContentList::Item and 2ms nsAttrAndChildArray::GetSafeChildAt). Another 0.5ms is GetWrapper() on the resulting node (why is this not inlined?). 3.5ms is self time. That self time presumably breaks down as partially the GetArrayIndexFromId/IsArrayIndex bits and partially the GetOrCreateDOMReflector bits. As an experiment, I tried the following: 1) Remove the [Pure] annotation on Node.firstChild. 2) Run the following testcase: <!DOCTYPE html> <pre><script> var e = document.documentElement; e.firstChild; // create wrapper var kids = e.childNodes; var count = 1000000; var start = performance.now(); for (var i = 0; i < count; ++i) e.firstChild; document.writeln((performance.now() - start)/count * 1e6); start = performance.now(); for (var i = 0; i < count; ++i) kids[0]; document.writeln((performance.now() - start)/count * 1e6); </script> This shows that the kids[0] get is about 2x slower than the e.firstChild get here... So then I tried profiling the two loops separately, with "count" increased to 100000000. perf.html is being nicely broken and not letting me upload profiles (and also rendering broken in various ways), but it shows about 1200ms spent under get_firstChild for the first loop and about 2400ms under NodeListBinding::DOMProxyHandler::get in the second loop. The latter is spending 1300ms under nsWrapperCache::GetWrapper, while the former is not spending any time under there at all. I wonder whether we manage to inline GetWrapper in the one case but not the other or something... Really not sure what's going on there. Also, in the profile you posted GetWrapper is very much not half of the total time under get(). In any case, apart from that GetWrapper bit the amounts of time spent in the profile are pretty comparable, afaict. In general, there are two things that we could in theory make faster here with better jit support: 1) We could have the jit pass a uint32 to us if it knows the id is in fact an integer, so we don't have to do it ourselves and check whether it is one. But I suspect that check is pretty cheap. 2) The jit may be able to use TI to rule out the possibility of Xrays and hence the need to do the proxy handler check UnwrapProxy does. Maybe. Not sure how much that would help in practice.
Comment 2•7 years ago
|
||
Also, when I profile with Instruments I see GetWrapper under the get_firstChild bit too, so I'm not sure how much I trust the Gecko profiler here. :(
Updated•7 years ago
|
Priority: -- → P2
Comment 3•7 years ago
|
||
This shows up quite heavy in profiles: https://perfht.ml/2fM2Aex Ting-Yu, any chance you would be interested in working on this please?
Flags: needinfo?(janus926)
Whiteboard: [qf]
Comment 4•7 years ago
|
||
Sure, what's the STR of the profile you provided? Is it different the ones from :bz and :smaug?
Assignee: nobody → janus926
Status: NEW → ASSIGNED
Flags: needinfo?(janus926) → needinfo?(ehsan)
Comment 5•7 years ago
|
||
Here is how I captured the profile: * Set the sampling rate to 0.1ms and buffer size to 450MB (you want to do this either on Linux or OSX, on Windows high sampling frequencies require a dedicated core doing busy wait which impacts the performance of what we're measuring.) * Go to http://speedometer2.benj.me and start running Speedometer. * Capture a profile after about 150 or so of the subtests have finished. Please note that I was just trying to get a rough sense of whether this function still shows up in the profiles, and barely spent any time to look at this in detail. smaug and bz's analysis above is a lot more detailed. Unfortunately I'm not immediately sure whether it's more useful to profile the full benchmark or just the Elm subtest using the interactive runner, but you may want to start with the latter since it's a much smaller workload to profile so it may be easier. :-) Thanks!
Flags: needinfo?(ehsan)
Comment 6•7 years ago
|
||
FWIW, with the STR in comment 5, VTune shows only 0.017s CPU Time in NodeListBinding::DOMProxyHandler::get() (1ms sampling interval) where the total CPU Time is ~45s.
Updated•7 years ago
|
Whiteboard: [qf] → [qf:p3]
Comment 7•7 years ago
|
||
WPA shows following from running the latter testcase in comment 1 with Nightly 20170823100553: Line #, Process, Stack, Count, Weight (in view) (ms), TimeStamp (s), % Weight 2 , , [Root], 21174, 21,068.432792, , 7.78 3 , , |- ?!?, 20913, 20,807.151106, , 7.69 4 , , | |- xul.dll!js::ProxyGetPropertyByValue, 17007, 16,919.336013, , 6.25 5 , , | | |- xul.dll!mozilla::dom::NodeListBinding::DOMProxyHandler::get, 10009, 9,959.971161, , 3.68 6 , , | | | |- xul.dll!nsINodeList::IndexedGetter, 3593, 3,576.623473, , 1.32 7 , , | | | | |- xul.dll!mozilla::dom::FragmentOrElement::GetChildAt, 1585, 1,576.560672, , 0.58 8 , , | | | | |- xul.dll!nsINodeList::IndexedGetter<itself>, 1175, 1,169.779085, , 0.43 9 , , | | | | |- xul.dll!nsChildContentList::Item, 833, 830.283716, , 0.31 10, , | | | |- xul.dll!nsWrapperCache::GetWrapper, 3376, 3,360.225033, , 1.24 11, , | | | | |- xul.dll!nsWrapperCache::GetWrapper<itself>, 1503, 1,496.437161, , 0.55 12, , | | | | |- xul.dll!nsWrapperCache::GetWrapperPreserveColor, 1498, 1,490.285602, , 0.55 13, , | | | | |- xul.dll!JS::GCCellPtr::is<JS::Symbol>, 248, 247.136744, , 0.09 14, , | | | | |- xul.dll!JS::GCCellPtr::is<JSString>, 127, 126.365526, , 0.05 15, , | | | |- xul.dll!mozilla::dom::NodeListBinding::DOMProxyHandler::get<itself>, 2670, 2,655.443573, , 0.98 16, , | | | |- xul.dll!mozilla::dom::GetArrayIndexFromId, 370, 367.679082, , 0.14 17, , | | |- xul.dll!js::ProxyGetPropertyByValue<itself>, 4931, 4,902.931112, , 1.81 18, , | | |- xul.dll!js::ValueToId<1>, 1137, 1,132.395919, , 0.42 19, , | | |- xul.dll!js::AutoEnterPolicy::AutoEnterPolicy, 929, 923.034437, , 0.34 20, , | | |- ntoskrnl.exe!KiDpcInterrupt, 1, 1.003384, , 0.00 21, , | |- ?!?<itself>, 3861, 3,842.779579, , 1.42
Comment 8•7 years ago
|
||
The output from VTune on Linux with Nightly 20170822142709 for the same testcase as comment 7: Function / Call Stack Effective Time Spin Time Overhead Time Instructions Retired CPI Rate CPU Frequency Ratio [Outside any known module] 4.528s 0s 0s 36,101,200,000 0.474 1.113 js::ToWindowProxyIfWindow 1.956s 0s 0s 20,712,800,000 0.372 1.161 js::ProxyGetPropertyByValue 1.212s 0s 0s 9,088,200,000 0.492 1.089 js::Proxy::get 1.144s 0s 0s 10,489,000,000 0.410 1.110 EdgeNeedsSweepUnbarriered 1.132s 0s 0s 11,740,200,000 0.352 1.075 \- nsWrapperCache::GetWrapperPreserveColor <- nsWrapperCache::GetWrapper 1.132s 0s 0s 11,740,200,000 0.352 1.075 mozilla::dom::NodeListBinding::DOMProxyHandler::get 1.087s 0s 0s 6,055,400,000 0.707 1.161 js::AutoEnterPolicy::AutoEnterPolicy 0.795s 0s 0s 12,165,200,000 0.250 1.126 nsWrapperCache::GetWrapper 0.733s 0s 0s 3,818,200,000 0.671 1.030 js::ValueToId<(js::AllowGC)1> 0.731s 0s 0s 9,322,800,000 0.266 1.001 JS::Rooted<JS::Value>::~Rooted 0.723s 0s 0s 13,151,200,000 0.212 1.136 ValueToWindowProxyIfWindow 0.620s 0s 0s 5,062,600,000 0.453 1.090
Comment 9•7 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #8) > js::ToWindowProxyIfWindow 1.956s 0s 0s 20,712,800,000 0.372 1.161 This one is mainly the pointer chasing obj->group->clasp->flags. > EdgeNeedsSweepUnbarriered 1.132s 0s 0s 11,740,200,000 0.352 1.075 Dereferencing to get ChunkLocation in IsInsideNursery() [1] takes 942.135ms. [1] http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/js/public/HeapAPI.h#385
Comment 10•7 years ago
|
||
Ugh, are ValueToWindowProxyIfWindow and ToWindowProxyIfWindow adding overhead? That's dumb, script should never pass a Window as receiver and we should probably make that an invariant for the get/set traps. In any case we can optimize this away, I can take a look.
Flags: needinfo?(jdemooij)
Comment 11•7 years ago
|
||
Maybe you could try the following: (1) Add a |JSObject* proxy| argument to ValueToWindowProxyIfWindow and have the callers pass the proxy object. (2) Then add this to the start of ValueToWindowProxyIfWindow: if (v == ObjectValue(*proxy)) return v; Can you check if this improves perf? It's a very simple optimization for the most common case.
Flags: needinfo?(jdemooij) → needinfo?(janus926)
Comment 12•7 years ago
|
||
The change brings ~30% improvement on Linux from running the latter testcase in comment 1: m-c: 73.30605 71.72109999999999 70.6508 73.53995000000002 71.6754 68.6558 72.37195 76.47655 69.96825 72.2134 m-c + comment 11: 52.269949999999994 52.59675 47.019400000000005 47.446000000000005 45.184749999999994 46.14019999999999 45.25279999999999 47.921150000000004 45.343050000000005 47.290350000000004 But I don't understand the change, could you teach me what |v == ObjectValue(*proxy)| implies?
Flags: needinfo?(janus926) → needinfo?(jdemooij)
Comment 13•7 years ago
|
||
The callers of Proxy::get() [1][2] have code like following and same to Proxy::set(): RootedValue receiver(cx, ObjectValue(*proxy)); return Proxy::get(cx, proxy, receiver, id, vp); So the |v == ObjectValue(*proxy)| check in comment 11 should be always true. If my understanding is correct, do we still need to call ValueToWindowProxyIfWindow in Proxy::get/set? [1] http://searchfox.org/mozilla-central/rev/5696c3e525fc8222674eed6a562f5fcbe804c4c7/js/src/proxy/Proxy.cpp#346-347 [2] http://searchfox.org/mozilla-central/rev/5696c3e525fc8222674eed6a562f5fcbe804c4c7/js/src/proxy/Proxy.cpp#358-359
Comment 14•7 years ago
|
||
I guess I understand the change now, because I missed the get/set in js::ProxyObjectOps. The change is a shortcut for the get call from js::ProxyGetProperty and js::ProxyGetPropertyByValue.
Flags: needinfo?(jdemooij)
Comment 15•7 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #12) > The change brings ~30% improvement on Linux from running the latter testcase > in comment 1: Nice! > But I don't understand the change, could you teach me what |v == > ObjectValue(*proxy)| implies? Usually, but not always, proxy == receiver (if proxy != receiver, the proxy was probably on the prototype chain of the receiver). Windows are never proxies (we should assert this) therefore we don't need to call ToWindowProxyIfWindow if the receiver == our proxy. What we could also do is make Proxy::get/set templates, with a ReceiverIsProxy or something template argument. Maybe you can measure what the performance is when ValueToWindowProxyIfWindow unconditionally returns |v|? That could tell us what kind of speedup to expect from templatizing. I'm not sure it's worth it, though.
Comment 16•7 years ago
|
||
Another thing that might be interesting to measure is whether perf improves if you remove the CheckRecursionLimit call at the start of Proxy::get. We don't really need that check if the caller is a JIT IC - ProxyGetProperty(ByValue) - so we could optimize that if it's worth it.
Comment 17•7 years ago
|
||
Another observation: note how ProxyGetProperty and ProxyGetPropertyByValue have a |RootedValue receiver| and we also have a |RootedValue receiver| in Proxy::get? Roots are not free (they maintain a linked list). I wonder if we should optimize all this as follows: (1) Rename Proxy::get to private Proxy::getInternal. Proxy::getInternal can MOZ_ASSERT_IF(receiver.isObject(), !IsWindow(&receiver.toObject()). (2) Have ProxyGetProperty* call getInternal directly. (3) Move the ValueToWindowProxyIfWindow (and maybe CheckRecursionLimit) calls to Proxy::get/set and then call getInternal. (We could still keep the receiver == proxy optimization in ValueToWindowProxyIfWindow.) What do you think?
Comment 18•7 years ago
|
||
I'll check them and update later.
Comment 19•7 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #12) > The change brings ~30% improvement on Linux from running the latter testcase > in comment 1: Note the numbers aren't improved on Windows. m-c: 36.3361 36.68915 36.8702 36.462399999999995 36.76795 36.845 36.514450000000004 36.743100000000005 36.99785000000001 36.4437 m-c + comment 11: 36.488 35.91275 38.34565 36.2722 36.16825 38.28915 37.50735 36.90305000000001 37.746700000000004 36.801899999999996
Comment 20•7 years ago
|
||
VTune on Windows shows that nsINodeList::IndexedGetter spends ~%60 of its CPU time in dereferencing |this| for accessing the pure virtual function nsINodeList::Item on vtable.
Updated•7 years ago
|
Assignee: janus926 → nobody
Status: ASSIGNED → NEW
Reporter | ||
Comment 21•6 years ago
|
||
Reporter | ||
Comment 22•6 years ago
|
||
FWIW, also Chrome seems to have quite a bit slower access when using nodelist, but that doesn't mean we shouldn't optimize.
Assignee | ||
Updated•5 years ago
|
Component: DOM → DOM: Core & HTML
Updated•2 years ago
|
Performance Impact: --- → P3
Whiteboard: [qf:p3]
Updated•2 years ago
|
Severity: normal → S3
You need to log in
before you can comment on or make changes to this bug.
Description
•