Open Bug 1380061 Opened 7 years ago Updated 2 years ago

Could NodeListBinding::DOMProxyHandler::get() be optimized?

Categories

(Core :: DOM: Core & HTML, enhancement, P2)

enhancement

Tracking

()

Performance Impact low

People

(Reporter: smaug, Unassigned)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Keywords: perf)

Attachments

(1 file)

I was profiling http://speedometer2.benj.me/InteractiveRunner.html Elm-TodoMVC and NodeList::DOMProxyHandler::get() showed up.
https://perfht.ml/2uavsT8
Anything we could do there?
(I didn't yet look at the code too much)
Summary: Could NodeList::DOMProxyHandler::get() be optimized? → Could NodeListBinding::DOMProxyHandler::get() be optimized?
Fundamentally, the proxy handler does the following things:

*  Checks whether the incoming jsid is a valid array index, and if so which one.
*  If so, unwraps "proxy " to get the nsINodeList*, calls self->IndexedGetter(), wraps the result.
*  If not, various other stuff.

Looking at the profile, we spend 7.5ms under get(), of which 3.5ms is calling IndexedGetter() (1.5ms nsChildContentList::Item and 2ms nsAttrAndChildArray::GetSafeChildAt).  Another 0.5ms is GetWrapper() on the resulting node (why is this not inlined?).  3.5ms is self time.  That self time presumably breaks down as partially the GetArrayIndexFromId/IsArrayIndex bits and partially the GetOrCreateDOMReflector bits.

As an experiment, I tried the following:

1) Remove the [Pure] annotation on Node.firstChild.
2) Run the following testcase:

  <!DOCTYPE html>
  <pre><script>
    var e = document.documentElement;
    e.firstChild; // create wrapper
    var kids = e.childNodes;
    var count = 1000000;
    var start = performance.now();
    for (var i = 0; i < count; ++i)
      e.firstChild;
    document.writeln((performance.now() - start)/count * 1e6);

    start = performance.now();
    for (var i = 0; i < count; ++i)
      kids[0];
    document.writeln((performance.now() - start)/count * 1e6);
  </script>

This shows that the kids[0] get is about 2x slower than the e.firstChild get here...

So then I tried profiling the two loops separately, with "count" increased to 100000000.  perf.html is being nicely broken and not letting me upload profiles (and also rendering broken in various ways), but it shows about 1200ms spent under get_firstChild for the first loop and about 2400ms under NodeListBinding::DOMProxyHandler::get in the second loop.  The latter is spending 1300ms under nsWrapperCache::GetWrapper, while the former is not spending any time under there at all.  I wonder whether we manage to inline GetWrapper in the one case but not the other or something...  Really not sure what's going on there.  Also, in the profile you posted GetWrapper is very much not half of the total time under get().

In any case, apart from that GetWrapper bit the amounts of time spent in the profile are pretty comparable, afaict.

In general, there are two things that we could in theory make faster here with better jit support:

1)  We could have the jit pass a uint32 to us if it knows the id is in fact an integer, so we don't have to do it ourselves and check whether it is one.  But I suspect that check is pretty cheap.

2)  The jit may be able to use TI to rule out the possibility of Xrays and hence the need to do the proxy handler check UnwrapProxy does.  Maybe.  Not sure how much that would help in practice.
Also, when I profile with Instruments I see GetWrapper under the get_firstChild bit too, so I'm not sure how much I trust the Gecko profiler here.  :(
Priority: -- → P2
This shows up quite heavy in profiles: https://perfht.ml/2fM2Aex

Ting-Yu, any chance you would be interested in working on this please?
Flags: needinfo?(janus926)
Whiteboard: [qf]
Sure, what's the STR of the profile you provided? Is it different the ones from :bz and :smaug?
Assignee: nobody → janus926
Status: NEW → ASSIGNED
Flags: needinfo?(janus926) → needinfo?(ehsan)
Here is how I captured the profile:

 * Set the sampling rate to 0.1ms and buffer size to 450MB (you want to do this either on Linux or OSX, on Windows high sampling frequencies require a dedicated core doing busy wait which impacts the performance of what we're measuring.)
 * Go to http://speedometer2.benj.me and start running Speedometer.
 * Capture a profile after about 150 or so of the subtests have finished.

Please note that I was just trying to get a rough sense of whether this function still shows up in the profiles, and barely spent any time to look at this in detail.  smaug and bz's analysis above is a lot more detailed.  Unfortunately I'm not immediately sure whether it's more useful to profile the full benchmark or just the Elm subtest using the interactive runner, but you may want to start with the latter since it's a much smaller workload to profile so it may be easier.  :-)

Thanks!
Flags: needinfo?(ehsan)
FWIW, with the STR in comment 5, VTune shows only 0.017s CPU Time in NodeListBinding::DOMProxyHandler::get() (1ms sampling interval) where the total CPU Time is ~45s.
Whiteboard: [qf] → [qf:p3]
WPA shows following from running the latter testcase in comment 1 with Nightly 20170823100553:

Line #, Process, Stack, Count, Weight (in view) (ms), TimeStamp (s), % Weight
2 , , [Root], 21174, 21,068.432792, , 7.78
3 , ,   |- ?!?, 20913, 20,807.151106, , 7.69
4 , ,   |    |- xul.dll!js::ProxyGetPropertyByValue, 17007, 16,919.336013, , 6.25
5 , ,   |    |    |- xul.dll!mozilla::dom::NodeListBinding::DOMProxyHandler::get, 10009, 9,959.971161, , 3.68
6 , ,   |    |    |    |- xul.dll!nsINodeList::IndexedGetter, 3593, 3,576.623473, , 1.32
7 , ,   |    |    |    |    |- xul.dll!mozilla::dom::FragmentOrElement::GetChildAt, 1585, 1,576.560672, , 0.58
8 , ,   |    |    |    |    |- xul.dll!nsINodeList::IndexedGetter<itself>, 1175, 1,169.779085, , 0.43
9 , ,   |    |    |    |    |- xul.dll!nsChildContentList::Item, 833, 830.283716, , 0.31
10, ,   |    |    |    |- xul.dll!nsWrapperCache::GetWrapper, 3376, 3,360.225033, , 1.24
11, ,   |    |    |    |    |- xul.dll!nsWrapperCache::GetWrapper<itself>, 1503, 1,496.437161, , 0.55
12, ,   |    |    |    |    |- xul.dll!nsWrapperCache::GetWrapperPreserveColor, 1498, 1,490.285602, , 0.55
13, ,   |    |    |    |    |- xul.dll!JS::GCCellPtr::is<JS::Symbol>, 248, 247.136744, , 0.09
14, ,   |    |    |    |    |- xul.dll!JS::GCCellPtr::is<JSString>, 127, 126.365526, , 0.05
15, ,   |    |    |    |- xul.dll!mozilla::dom::NodeListBinding::DOMProxyHandler::get<itself>, 2670, 2,655.443573, , 0.98
16, ,   |    |    |    |- xul.dll!mozilla::dom::GetArrayIndexFromId, 370, 367.679082, , 0.14
17, ,   |    |    |- xul.dll!js::ProxyGetPropertyByValue<itself>, 4931, 4,902.931112, , 1.81
18, ,   |    |    |- xul.dll!js::ValueToId<1>, 1137, 1,132.395919, , 0.42
19, ,   |    |    |- xul.dll!js::AutoEnterPolicy::AutoEnterPolicy, 929, 923.034437, , 0.34
20, ,   |    |    |- ntoskrnl.exe!KiDpcInterrupt, 1, 1.003384, , 0.00
21, ,   |    |- ?!?<itself>, 3861, 3,842.779579, , 1.42
The output from VTune on Linux with Nightly 20170822142709 for the same testcase as comment 7:

Function / Call Stack	Effective Time	Spin Time	Overhead Time	Instructions Retired	CPI Rate	CPU Frequency Ratio
[Outside any known module]	4.528s	0s	0s	36,101,200,000	0.474	1.113
js::ToWindowProxyIfWindow	1.956s	0s	0s	20,712,800,000	0.372	1.161
js::ProxyGetPropertyByValue	1.212s	0s	0s	9,088,200,000	0.492	1.089
js::Proxy::get	            1.144s	0s	0s	10,489,000,000	0.410	1.110
EdgeNeedsSweepUnbarriered	1.132s	0s	0s	11,740,200,000	0.352	1.075
  \- nsWrapperCache::GetWrapperPreserveColor <- nsWrapperCache::GetWrapper	1.132s	0s	0s	11,740,200,000	0.352	1.075
mozilla::dom::NodeListBinding::DOMProxyHandler::get	1.087s	0s	0s	6,055,400,000	0.707	1.161
js::AutoEnterPolicy::AutoEnterPolicy	0.795s	0s	0s	12,165,200,000	0.250	1.126
nsWrapperCache::GetWrapper	0.733s	0s	0s	3,818,200,000	0.671	1.030
js::ValueToId<(js::AllowGC)1>	0.731s	0s	0s	9,322,800,000	0.266	1.001
JS::Rooted<JS::Value>::~Rooted	0.723s	0s	0s	13,151,200,000	0.212	1.136
ValueToWindowProxyIfWindow	0.620s	0s	0s	5,062,600,000	0.453	1.090
(In reply to Ting-Yu Chou [:ting] from comment #8)
> js::ToWindowProxyIfWindow	1.956s	0s	0s	20,712,800,000	0.372	1.161

This one is mainly the pointer chasing obj->group->clasp->flags.

> EdgeNeedsSweepUnbarriered	1.132s	0s	0s	11,740,200,000	0.352	1.075

Dereferencing to get ChunkLocation in IsInsideNursery() [1] takes 942.135ms.

[1] http://searchfox.org/mozilla-central/rev/48ea452803907f2575d81021e8678634e8067fc2/js/public/HeapAPI.h#385
Ugh, are ValueToWindowProxyIfWindow and ToWindowProxyIfWindow adding overhead?

That's dumb, script should never pass a Window as receiver and we should probably make that an invariant for the get/set traps. In any case we can optimize this away, I can take a look.
Flags: needinfo?(jdemooij)
Maybe you could try the following:

(1) Add a |JSObject* proxy| argument to ValueToWindowProxyIfWindow and have the callers pass the proxy object.

(2) Then add this to the start of ValueToWindowProxyIfWindow:

if (v == ObjectValue(*proxy))
    return v;

Can you check if this improves perf? It's a very simple optimization for the most common case.
Flags: needinfo?(jdemooij) → needinfo?(janus926)
The change brings ~30% improvement on Linux from running the latter testcase in comment 1:

m-c:
  73.30605
  71.72109999999999
  70.6508
  73.53995000000002
  71.6754
  68.6558
  72.37195
  76.47655
  69.96825
  72.2134

m-c + comment 11:
  52.269949999999994
  52.59675
  47.019400000000005
  47.446000000000005
  45.184749999999994
  46.14019999999999
  45.25279999999999
  47.921150000000004
  45.343050000000005
  47.290350000000004

But I don't understand the change, could you teach me what |v == ObjectValue(*proxy)| implies?
Flags: needinfo?(janus926) → needinfo?(jdemooij)
The callers of Proxy::get() [1][2] have code like following and same to Proxy::set():

  RootedValue receiver(cx, ObjectValue(*proxy));
  return Proxy::get(cx, proxy, receiver, id, vp);

So the |v == ObjectValue(*proxy)| check in comment 11 should be always true. If my understanding is correct, do we still need to call ValueToWindowProxyIfWindow in Proxy::get/set?

[1] http://searchfox.org/mozilla-central/rev/5696c3e525fc8222674eed6a562f5fcbe804c4c7/js/src/proxy/Proxy.cpp#346-347
[2] http://searchfox.org/mozilla-central/rev/5696c3e525fc8222674eed6a562f5fcbe804c4c7/js/src/proxy/Proxy.cpp#358-359
I guess I understand the change now, because I missed the get/set in js::ProxyObjectOps. The change is a shortcut for the get call from js::ProxyGetProperty and js::ProxyGetPropertyByValue.
Flags: needinfo?(jdemooij)
(In reply to Ting-Yu Chou [:ting] from comment #12)
> The change brings ~30% improvement on Linux from running the latter testcase
> in comment 1:

Nice!

> But I don't understand the change, could you teach me what |v ==
> ObjectValue(*proxy)| implies?

Usually, but not always, proxy == receiver (if proxy != receiver, the proxy was probably on the prototype chain of the receiver). Windows are never proxies (we should assert this) therefore we don't need to call ToWindowProxyIfWindow if the receiver == our proxy.

What we could also do is make Proxy::get/set templates, with a ReceiverIsProxy or something template argument. Maybe you can measure what the performance is when ValueToWindowProxyIfWindow unconditionally returns |v|? That could tell us what kind of speedup to expect from templatizing. I'm not sure it's worth it, though.
Another thing that might be interesting to measure is whether perf improves if you remove the CheckRecursionLimit call at the start of Proxy::get. We don't really need that check if the caller is a JIT IC - ProxyGetProperty(ByValue) - so we could optimize that if it's worth it.
Another observation: note how ProxyGetProperty and ProxyGetPropertyByValue have a |RootedValue receiver| and we also have a |RootedValue receiver| in Proxy::get? Roots are not free (they maintain a linked list). I wonder if we should optimize all this as follows:

(1) Rename Proxy::get to private Proxy::getInternal. Proxy::getInternal can

  MOZ_ASSERT_IF(receiver.isObject(), !IsWindow(&receiver.toObject()).

(2) Have ProxyGetProperty* call getInternal directly.

(3) Move the ValueToWindowProxyIfWindow (and maybe CheckRecursionLimit) calls to Proxy::get/set and then call getInternal.

(We could still keep the receiver == proxy optimization in ValueToWindowProxyIfWindow.)

What do you think?
I'll check them and update later.
(In reply to Ting-Yu Chou [:ting] from comment #12)
> The change brings ~30% improvement on Linux from running the latter testcase
> in comment 1:

Note the numbers aren't improved on Windows.

m-c:
  36.3361
  36.68915
  36.8702
  36.462399999999995
  36.76795
  36.845
  36.514450000000004
  36.743100000000005
  36.99785000000001
  36.4437

m-c + comment 11:
  36.488
  35.91275
  38.34565
  36.2722
  36.16825
  38.28915
  37.50735
  36.90305000000001
  37.746700000000004
  36.801899999999996
VTune on Windows shows that nsINodeList::IndexedGetter spends ~%60 of its CPU time in dereferencing |this| for accessing the pure virtual function nsINodeList::Item on vtable.
Depends on: 1394290
Depends on: 1394312
Assignee: janus926 → nobody
Status: ASSIGNED → NEW
Keywords: perf
FWIW, also Chrome seems to have quite a bit slower access when using nodelist, but that doesn't mean we shouldn't optimize.
Component: DOM → DOM: Core & HTML
Depends on: 1708431
Performance Impact: --- → P3
Whiteboard: [qf:p3]
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: