Creating objects without properties is slower than creating objects with properties

Status: NEW (Unassigned)
Product: Core
Component: JavaScript Engine: JIT
Reported: 2 years ago; last modified: 2 years ago
Reporter: bz
Version: Trunk
Firefox tracking flags: firefox48 affected

Attachments

(12 attachments, 1 obsolete attachment)

Created attachment 8738211 [details]
test.js

See the attached shell testcase; compare the "Object with members class" numbers to the ones for "Object without members class" and similarly for the function version.  The "without members" versions are about 25% slower for me.

I'm _guessing_ this is one of those object-allocation optimization things that we end up missing the fast path for if the object has no props...
Oh, and the object-literal version runs at the "fast" speed here, not the "slow" speed, fwiw.
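The attachment itself is not reproduced here, but a minimal sketch of what the shell testcase presumably looks like follows. The class/function names, member values, and iteration count are my assumptions, not the actual attachment contents, and |print|/|Date| from the shell version are swapped for |console.log|/|Date.now| so it also runs outside the JS shell:

```javascript
// Hypothetical reconstruction of test.js; names and values are guesses.
class WithMembersClass {
  constructor() { this.a = 1; this.b = 2; }
}
class NoMembersClass {
  constructor() {}
}
function WithMembersFunc() { this.a = 1; this.b = 2; }
function NoMembersFunc() {}

function bench(label, makeObj, n) {
  const start = Date.now();
  let o;
  for (let i = 0; i < n; i++) o = makeObj();
  console.log(label + ": " + (Date.now() - start) + " ms");
  return o;
}

const N = 1e6;  // the real testcase reportedly iterates 1e8 times
bench("Object with members class", () => new WithMembersClass(), N);
bench("Object with members function", () => new WithMembersFunc(), N);
bench("Object without members class", () => new NoMembersClass(), N);
bench("Object without members function", () => new NoMembersFunc(), N);
bench("object literal", () => ({ a: 1, b: 2 }), N);
```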
I made a simple HTML page to run the test in Nightly 49.0a1 (2016-05-02), and the result is:

----

Object with members class:  35.63000000000011  
Object with members function:  37.98499999999967 
Object without members class:  34.909999999999854  
Object without members function:  37.04500000000007 

new Object():  19518.510000000002                   
object literal:  35.22999999999956

----

I changed |Date| to |performance.now|, and used |console.log| instead of |print|, because there is no |print| in the browser (it invokes the "Print" dialog unless you're in the debugger).

Maybe I missed something, but the result looks weird because the first four cases are very close. The only significant difference is that |new Object()| is much slower than the object literal case.

Then, I tested the case in Chrome:

----

Object with members class:  3029.8499999999995
Object with members function:  2969.0949999999993
Object without members class:  2924.470000000001
Object without members function:  2750.779999999999

new Object():  6881.595000000001
object literal:  3034.3100000000013

----

Although I don't know why Chrome takes so much longer to complete the test than Firefox (good news?), it shows the result one might expect: the objects with members are slightly slower than the ones without, though the difference is insignificant. And |new Object()| is still far slower than the object literal, although the gap is not as large as in Firefox.

Based on these results, maybe we should focus on the |new Object()| vs. object literal difference first?
Greg, that's a shell testcase (hence the use of print).  The JS folks generally prefer those if they still show the bug.

That said, I just tried this in browser, 2016-05-08 nightly on Mac, and here are the numbers I see with the replacements you described and with the new Object bit taken out but the testcase otherwise unmodified:

Object with members class: 947.6850000000001
Object with members function: 930.92
Object without members class: 1143.085
Object without members function: 1156.8399999999997
object literal: 892.6199999999999

which clearly shows the bug.  I don't know what you were doing exactly to get numbers in the 30-some range, but presumably you made some sort of other changes to the testcase?

The new Object slowness is already covered by bug 1262212, which I filed at the same time as this bug.  The summary of this bug is very carefully written to describe the actual problem this bug is about.
Created attachment 8750175 [details]
HTML testcase for running in browser
Interesting. I changed nothing except the part I mentioned, but when I ran your browser test the results were very different:

----
Object with members class: 1578.9600000000003
Object with members function: 1233.8250000000003
Object without members class: 1888.5500000000002
Object without members function: 1908.2749999999996
object literal: 1242.585000000001
----

I'll take a look at this test and the shell version to see whether there are issues I can make progress on.
> I changed nothing except the part I mentioned

That's hard to reconcile with the numbers you were seeing, unless your computer is 50x faster than mine in single-core performance.  Which is not very likely.
Created attachment 8755325 [details]
testcase-classonly.js
Created attachment 8755327 [details]
func00-pass00-BuildSSA-mir.gv.pdf
Created attachment 8755328 [details]
func00-pass26-Generate LIR-lir.gv.pdf
Created attachment 8755329 [details]
func01-pass00-BuildSSA-mir.gv.pdf
Created attachment 8755330 [details]
func01-pass26-Generate LIR-lir.gv.pdf
Created attachment 8755331 [details]
func02-pass00-BuildSSA-mir.gv.pdf
Created attachment 8755332 [details]
func02-pass26-Generate LIR-lir.gv.pdf
Created attachment 8755333 [details]
func03-pass00-BuildSSA-mir.gv.pdf
Created attachment 8755334 [details]
func03-pass26-Generate LIR-lir.gv
Created attachment 8755335 [details]
class-both-results.dms
After clumsily chasing possible issues for a while, I've noticed that:

1. According to the Gecko profiler, the loop that calls the |NoMembers| constructor invokes |NewObjectOperationWithTemplate| roughly twice as often as the |WithMembers| loop (~1200 vs. ~500). That function call triggers minor GC, so the profile shows a lot of minor GC time.

2. Because of 1., I guess the JIT optimizes the two paths differently, so that the |NoMembers| path ends up with more |NewObjectOperationWithTemplate| calls (or different inlined assembly). I therefore generated the Iongraphs attached above. However, I don't know why it generates two different graphs (fun1 and fun3) for the same line (line 1, which is actually the whole file).

3. In the LIR graph of fun3, the two loops are almost identical, except that the second loop (|WithMembers|) correctly carries two |storeunboxedscalar| operations from its constructor. So if this graph represents the final assembly the engine executes, the loops look clean and show no issue in the code flow.

4. In the LIR graph of fun1, the second loop differs: it has no |createthiswithtemplate| like the first loop does. Instead, it looks as though |createthiswithtemplate|, the stores of the two members, and the increment of the loop index have all been expanded into |Block 9|. At least, that block looks like the only place in the second loop where that code could be.

As a result, I'll follow the fun1 graph to make sure this difference is the root cause of the performance gap.
Created attachment 8755361 [details]
func03-pass26-Generate LIR-lir.gv.pdf
Attachment #8755334 - Attachment is obsolete: true
With `--ion-eager` it generates more graphs for the script, and the |NoMember| loop finally gets faster than |WithMember|. However, overall speed drops greatly compared to running without the flag.

From those graphs I noticed that some optimized code flow combines the two loops into one; in that case explicit operations like |storeunboxedscalar| have been eliminated, and the only remaining traces of the original operations are cache operations like |setpropertycache|. But it's hard to tell whether the operations are still in the single remaining loop, or whether the engine simply skipped them because it found the two loops were doing nothing different in each iteration.
Therefore I slightly changed the case so that the constructor with members accepts the iteration index and adds it to the members. After that, I got an optimized graph with distinct loops that the engine won't combine into one, and the result matches the bug description: the |NoMember| case is still slower than |WithMember|. So I'll trace the modified case to seek clues.
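The modified attachment isn't shown here, but a hypothetical sketch of that tweak (names and values assumed) would be: the with-members constructor now takes the loop index, so each iteration stores different values and Ion can't treat the two loops as identical work.

```javascript
// Hypothetical sketch of the tweaked testcase described above.
class WithMember {
  constructor(i) {
    this.a = i;        // member values now depend on the iteration
    this.b = i + 1;
  }
}
class NoMember {
  constructor(i) {}    // accepts the index but still stores nothing
}

let sink;
for (let i = 0; i < 1e6; i++) sink = new WithMember(i);
for (let i = 0; i < 1e6; i++) sink = new NoMember(i);
```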
Hi Boris, today I've noticed that if I set a counter at the beginning of 

    Interpreter.cpp::NewObjectOperationWithTemplate

like via GDB/LLDB breakpoint with hit counter, the result would be:

----
#1: without --ion-eager:

Loop of |NoMember|   constructor: hit count = 6103
Loop of |WithMember| constructor: hit count = 3051

----
#2: with --ion-eager:

Loop of |NoMember|   constructor: hit count = 6103
Loop of |WithMember| constructor: hit count = 4577
----

And the total time consumed depends heavily on the number of calls to that function.

However, I don't know why both counts are far less than |1e8|, the number of times the |for| loop should iterate. I'm not sure whether this means the engine greatly reduces the number of actual construction calls when it optimizes the loop, or whether there is simply another, much faster path (inlined?) that creates the object without going through the slow method (~2.5ms per call on my machine). Could you give me some suggestions for tracking down where the relevant optimization rules are? Thanks.
Flags: needinfo?(bzbarsky)
Greg, I'm generally not the right person to ask JIT questions (others know it way better than I do), but I believe we do have inline object construction paths in ion.  See CodeGenerator::visitNewObject and the bits of it that don't involve the OutOfLineNewObject thing.  I don't know whether those paths are being used here or not.  There are other calls to masm.createGCObject too; the one in CodeGenerator::visitCreateThisWithTemplate might be relevant here, I guess.  Its out of line path calls NewObjectOperationWithTemplate which might be what you're seeing.
Flags: needinfo?(bzbarsky)
Thank you, Boris. I will take a look at these two possible methods for clues.
I might finally be getting close to the root cause; here are some new discoveries from today's work:

1. After inserting a |breakpoint()| in |masm.createGCObject| to emit INT 3 and trigger LLDB, it shows that during the loop, iterations before the 1097th don't invoke the method. So they must use some other, as-yet-unknown way to create the object from the constructor.

2. I made a list of when the slower |NewObjectOperationWithTemplate| method is invoked, and it seems to reveal clearly what causes the performance gap:

----
(for the #1 loop: create object with |NoMember| constructor)

#1 loop iteration 1097

--> this time, start to trigger the JITed asm

#1 loop iteration 16389
#1 loop iteration 32772
#1 loop iteration 49155
#1 loop iteration 65538
#1 loop iteration 81921
#1 loop iteration 98304
...

--> at these iterations, it triggers the slow NewObjectOperationWithTemplate
--> the difference is always 16383 (2^14 - 1)
--> every (2^14 - 1) times, it follows the slower path to complete the task

----

(for the #2 loop: create object with |WithMember| constructor)

#2 loop iteration 1097

--> this time, start to trigger the JITed asm

#2 loop iteration 32758
#2 loop iteration 65525
#2 loop iteration 98292
#2 loop iteration 131059
#2 loop iteration 163826
#2 loop iteration 196593
...
--> at these iterations, it triggers the slow NewObjectOperationWithTemplate
--> the difference is always 32767 (2^15 - 1)
--> every (2^15 - 1) times, it follows the slower path to complete the task

----

It shows that the loop with the |WithMember| constructor is faster because it calls |NewObjectOperationWithTemplate| only every (2^15 - 1) iterations, while the |NoMember| loop calls it every (2^14 - 1) iterations. I believe that the rest of the time both follow the path through |masm.createGCObject| or some other method, so there is likely no significant difference between them there.

I also unwound the JIT frame to look at the stack trace just before |NewObjectOperationWithTemplate| is invoked. Here is what it looks like:

----
js`js::NewObjectOperationWithTemplate(...) + 86 at Interpreter.cpp:4765
<<JitFrame_Exit>>()
<<JitFrame_IonJS>>()
<<JitFrame_Entry>>()
js`EnterBaseline(...) + 820 at BaselineJIT.cpp:151
js`js::jit::EnterBaselineAtBranch(...) + 1044 at BaselineJIT.cpp:261 
js`Interpret(...) + 4544 at Interpreter.cpp:1836
js`js::RunScript(...) + 519 at Interpreter.cpp:426
js`js::ExecuteKernel
(....)
----

I must admit this hasn't inspired me much so far. My guess is that although inside the frame it can follow the faster inlined method to create the object, somehow the Ion frame gets exited and it falls back to the slower one. And the magic number of iterations between the two method calls might suggest that the frame gets "exhausted".

Although I haven't dug into JitFrame for clues yet, is it possible that the JitFrame is actually full of expanded loop-iteration assembly for (2^14) or (2^15) iterations, and once it's executed it must fall back to the ordinary function call until the next frame is ready? Though I'm not sure this explains what puts the magic difference between the two loops.
NI'ing Nicolas because I don't know whom to ask about the magic difference... Sorry if I'm asking the wrong person again.
Flags: needinfo?(nicolas.b.pierron)
The JIT does not have any loop unrolling enabled by default. As far as I know, the only reason we fall back is when we do not have enough space to allocate the object. Thus the slowdown attributed to NewObjectOperationWithTemplate might not come from the function itself, but from the nursery GC that runs when we fall back.

The fact that the fallback does not happen after the same number of iterations in both loops, but at numbers a factor of 2 apart, suggests that one case fills up the nursery space quicker than the other.

My blind guess would be that this is a problem with the size associated with the class of the template object, where in one case we allocate an object with fewer reserved slots than in the other case.

By the way, if you are not aware, there is a page with a collection of tips&tricks to investigate issues, here is one that might have been useful in your case:
https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Hacking_Tips#Setting_a_breakpoint_in_the_generated_code_%28from_gdb_x86_x86-64_arm%29
Flags: needinfo?(nicolas.b.pierron)
There was some discussion about this on IRC, yes; the number of slots we guess for things is kinda heuristic and it's not obvious what number we're picking here in the various cases.
Per comment 27 I set a breakpoint in the generated code, and it confirms that the engine triggers GC following the iteration pattern above. So maybe the next step is to see what we can do about the symptom.
I just noticed a weird thing that may be worth a look: when I set a breakpoint in |GCRuntime::tryNewNurseryObject| to pause the process when it hits the |minorGC| call, the stack frame shows that the |thingSize| variable differs suspiciously:

Object generated from |NoMember|   constructor: thingSize = 64
Object generated from |WithMember| constructor: thingSize = 32

So it looks like an empty object generated from a constructor that does nothing takes twice as much space as the one with data members. And I believe this is why the nursery fills up faster in the first case.

I will try to trace why the |thingSize| from the empty constructor is larger than the other, and whether this is reasonable or patchable.
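A back-of-the-envelope check ties these |thingSize| values to the fallback intervals observed earlier. The 1 MiB nursery size below is an assumption on my part, not a measured value, but dividing a nursery of that size by the two |thingSize| values reproduces the observed intervals (2^14 - 1 and 2^15 - 1) almost exactly:

```javascript
// Assumed nursery size: 1 MiB (an assumption, not a measured value).
const nurseryBytes = 1 << 20;

const noMemberCapacity   = nurseryBytes / 64;  // thingSize = 64
const withMemberCapacity = nurseryBytes / 32;  // thingSize = 32

console.log(noMemberCapacity);    // 16384, vs. the observed 2^14 - 1
console.log(withMemberCapacity);  // 32768, vs. the observed 2^15 - 1
```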
After looking at the size issue, it turns out there are two different paths by which the |minorGC| line is reached:

1. |NoMember| constructor: it gets the AllocKind via

    gc::GetGCObjectFixedSlotsKind(baseobj->numFixedSlots())

in |CopyInitializerObject|. Because |numFixedSlots()| returns |4|, this line
returns OBJECT4 (w/ BACKGROUND) as the AllocKind.

2. |WithMember| constructor: it gets the AllocKind via

    group->unboxedLayout().getAllocKind()

in |UnboxedPlainObject::create|. And the |getAllocKind| actually returns:

    gc::GetGCObjectKindForBytes(UnboxedPlainObject::offsetOfData() + size())

as the result. In the test case, an object created by the |WithMember| constructor passes |24| as the argument,
so when it reaches the check inside |GetGCObjectKindForBytes|:

    if (nbytes <= sizeof(NativeObject))
        return AllocKind::OBJECT0;

The condition holds true, because |sizeof(NativeObject)| gives |32|, and |24| <= |32|. Therefore, the AllocKind of an object created by the |WithMember| constructor is OBJECT0 (w/ BACKGROUND).

----

As a result, maybe the issue is that an empty object created by an empty constructor gets OBJECT4 with thingSize=64, which is larger than an object with members. However, I don't know whether simply changing the |numFixedSlots| of the "default" object would solve the issue. I will investigate further how that number gets assigned in the |NoMember| case.
Oh, and I will check whether the asm generated by the JIT takes other paths to assign the object size.
(In reply to Greg Weng [:snowmantw][:gweng][:λ] from comment #32)
> Oh, and I will check whether the asm generated by JIT take other paths to
> assign the object size or not.

No need to.

The JIT creates a template object, which contains the early-initialized object; the JIT then just uses the nursery as a bump allocator and copies the content of the template object into the reserved space in the nursery. Thus the object size should be identical to that of the template object.

This is the reason why you see no function calls, except when the nursery is full.
Hi Nicolas, I'm now tracing the differing |allocKind| values to seek a solution for this bug, but I'm not sure whether changing them is the proper solution.

The major issue is that I've found the value is in fact affected by lots of things, like the JSClass "flags" and other factors that decide the type's fixed slots, so I cannot find a way to write a straightforward patch that gets a smaller |allocKind|, like the patch in bug 1066787 (via GuessObjectGCKind).

Another issue is that these two kinds of objects seem to get initialized even before the JIT executes |createThisWithTemplate|, so the |allocKind| is already fixed before the JIT retrieves the object as |templateObject|.

I tried to trace the object-initialization path before |createThisWithTemplate|, but all I can tell is that in Interpreter.cpp, |NewObjectOperation| is invoked and eventually allocates an object via |NewObjectWithClassProto|, which calculates the |allocKind| based on |clasp_|. And since |clasp_| is |PlainObject| in both cases, they should have the same default |allocKind|.

I'm not sure if I've got this wrong, but if it holds true, it means that the two kinds of objects have different |allocKind| values in comment 31 simply because they follow different paths (CopyInitializerObject vs. UnboxedPlainObject::create) in |NewObjectOperationWithTemplate|. If so, then I think the next question is why one can have an UnboxedLayout while the other doesn't, because the prediction for unboxed objects can apparently use |GetGCObjectKindForBytes| to retrieve a smaller |allocKind| than |CopyInitializerObject| does.

I also wonder whether |CopyInitializerObject| could adopt the same method of getting the |allocKind| as the unboxed path. However, I don't yet understand UnboxedLayout and why only some objects can have it, so I don't know whether this is a solution or just a mistake.
Flags: needinfo?(nicolas.b.pierron)
(In reply to Greg Weng [:snowmantw][:gweng][:λ] from comment #34)
> I'm not sure if I got wrong here, but if this holds true, it means that the
> two kinds of object have different |allocKind| in Comment 31 is just because
> they follow the different paths (CopyInitializerObject vs.
> UnboxedPlainObject::create) in |NewObjectOperationWithTemplate|. If so, then
> I think the next issue is I need to know why one could have the
> UnboxedLayout while another doesn't. Because the prediction for Unboxed
> object can apparently use |GetGCObjectKindForBytes| retrieve a smaller
> |allocKind| than |CopyInitializerObject|.

I will need-info Brian Hackett; he might know this part of the code better than I do.

> I also wonder if it's possible for |CopyInitializerObject| to adopt the same
> method to get |allocKind| like the Unboxed one. However, I haven't had the
> clue about UnboxedLayout and why some objects could have it, so I don't know
> if this is a solution or just a mistake.

Just to clarify: an ordinary object stores JS::Values. A JS::Value is an 8-byte element containing a tag and a payload, and some payloads can fit in 1 or 4 bytes. UnboxedPlainObject makes use of the fact that such objects are always used with the same types of properties. The UnboxedLayout stores the type and the offset of each element in the object. Thus, this avoids storing 8-byte entries for each property, storing only the minimal number of bytes instead.

If I recall correctly, object allocations are optimized after seeing ~20 objects coming from the same allocation site. This optimization rewrites the objects seen so far to make them use unboxed layouts.

My blind guess would be that we do not have an allocation site for “new Object()”, as opposed to object literals.  Thus, we never attempt to use an unboxed representation for these objects.
Flags: needinfo?(nicolas.b.pierron) → needinfo?(bhackett1024)
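Rough arithmetic for the space saving Nicolas describes, using the sizes from his comment (a JS::Value is 8 bytes; an int32 payload fits in 4). The two-property object here is just an illustrative example:

```javascript
// Space for an object with two int32 properties, per the sizes above.
const VALUE_BYTES = 8;  // a tagged JS::Value slot
const INT32_BYTES = 4;  // unboxed int32 payload
const numProps = 2;

const boxedBytes   = numProps * VALUE_BYTES;  // 16 bytes of slots
const unboxedBytes = numProps * INT32_BYTES;  // 8 bytes of payload

console.log(boxedBytes, unboxedBytes);
```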
(In reply to Nicolas B. Pierron [:nbp] from comment #35)
> My blind guess would be that we do not have an allocation site for “new
> Object()”, as opposed to object literals.  Thus, we never attempt to use an
> unboxed representation for these objects.

This is correct.  Unboxed plain objects are only used for object literals (including JSON) and 'new Foo()' constructors where Foo initializes all the object's properties.  The main difficulty in using unboxed objects for 'new Object()' sites is that we need to prove that each of the unboxed properties will be written to before being read from, since the unboxed object is given all its properties at creation.
Flags: needinfo?(bhackett1024)
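The distinction Brian draws can be illustrated in JS terms (whether an engine actually unboxes either of these is an internal heuristic; this only sketches the two shapes of allocation site involved, with hypothetical names):

```javascript
// Candidate for an unboxed layout: every property is written in the
// constructor, so all properties exist (with known types) at creation.
function Foo(x, y) {
  this.x = x;
  this.y = y;
}
const eligible = new Foo(1, 2);

// Not a candidate: the object is created with no properties, and the
// engine cannot prove each eventual property is written before read.
const notEligible = new Object();
notEligible.x = 1;  // property attached after creation
```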
(In reply to Brian Hackett (:bhackett) from comment #36)
> (In reply to Nicolas B. Pierron [:nbp] from comment #35)
> > My blind guess would be that we do not have an allocation site for “new
> > Object()”, as opposed to object literals.  Thus, we never attempt to use an
> > unboxed representation for these objects.
> 
> This is correct.  Unboxed plain objects are only used for object literals
> (including JSON) and 'new Foo()' constructors where Foo initializes all the
> object's properties.  The main difficulty in using unboxed objects for 'new
> Object()' sites is that we need to prove that each of the unboxed properties
> will be written to before being read from, since the unboxed object is given
> all its properties at creation.

So I think the situation is: with the |WithMember| constructor, we can assume the properties defined during the |new| operation are most likely the only properties the object will own before the first read. But with the |NoMember| constructor, the likelihood of people attaching new properties before reading it is very high (since reading a blank object is... boring and useless), so we cannot make that assumption and assign it the smaller layout.
It may sound ridiculous, but could we make some assumptions to provide a heuristic optimization, like assuming that most classes with an empty constructor will add maybe two properties of reasonable size before being fully initialized and then read? Or could we find other possible ways to do the optimization?
Flags: needinfo?(bhackett1024)
(In reply to Greg Weng [:snowmantw][:gweng][:λ] from comment #38)
> It may sound ridiculous, but can we make some assumptions to provide a
> heuristic optimization, like to assume that most class with empty
> constructor will add maybe 2 elements with reasonable length during it get
> fully initialized and then be read? Or, could we find other possible methods
> to do the optimization?

Well, we already make heuristic assumptions like this, and give objects with no known properties space for four inline properties.  See GuessObjectGCKind.  This heuristic is pretty outdated and may need to be updated now that we have generational GC and a lower cost for dynamically allocating property arrays for objects in the nursery.  What would be even better though would be to tune the size of new objects at the site based on the properties which have historically been given to objects allocated there.  We already do this for 'new Foo' objects (see TypeNewScript::maybeAnalyze) but not others.
Flags: needinfo?(bhackett1024)