Remove redundant ObjectGroups for functions with empty shapes
Categories
(Core :: JavaScript Engine: JIT, enhancement, P2)
People
(Reporter: cfallin, Assigned: cfallin)
Attachments
(1 obsolete file)
As discussed in bug 1589142, we've discovered that (i) a majority of ObjectGroups on the heap often exist for function objects, and (ii) most of these objects have an empty shape. This is a ripe opportunity for memory reduction: we could lazily create ObjectGroups for functions only when needed (or avoid creating them altogether?). Reducing the number of groups would also change the balance of, and improve the impact of, the shape-group-tuple (shuple) work in bug 1589142.
Comment 1•5 years ago
Exciting project! My main concern is what Brian already mentioned in the other bug - objects depend on their group for:
- The Class.
- The prototype.
- The Realm (and global, Compartment, Zone).
All of these fields are very hot (and used from JIT code).
Using lazy groups for this makes a lot of sense to me. We could potentially use the fun_resolve hook. Giving up TI for these functions and relying on ICs/guards is probably acceptable (this is the direction we seem to be heading in anyway...), but we need to be careful because groups with "unknown properties" currently generalize TypeSets containing them (they become "unknown object") and that would break Ion inlining.
So another thought is to rethink Ion inlining - if we can cut that dependency maybe we can get away with a boring vanilla ObjectGroup based on Class/proto.
Comment 2•5 years ago
(In reply to Jan de Mooij [:jandem] from comment #1)
> So another thought is to rethink Ion inlining
I'd suggest starting with this. The alternative requires a lot of changes elsewhere in the engine just to accommodate Ion inlining, which seems wrong to me. Especially with our focus shifting away from Octane/Kraken-style workloads. Maybe we can see what JSC and V8 are doing in this space (scripted function inlining)?
Comment 3•5 years ago
The clear advantage of TI over relying on ICs is that we can elide the guards.
- Do we have any idea what the overhead of those guards would be? (I am inclined to say we are willing to take the hit, but it would be nice to have numbers.)
- Are there any cases where we get information from TI that we can't get from looking at ICs? Right now, one example is that the baseline call IC stub generated by TryAttachFunApply doesn't store a specific target, but we could fix that, and then fall back on the current "apply any target" behaviour when we go megamorphic. I can't think of any reason that the same workaround couldn't be used everywhere, but JS is full of surprises.
Comment 4•5 years ago
Do we need to block the shuples work on this specifically? I was thinking that the initial step of the shuples work doesn't really depend on this - namely: avoiding the allocation of a distinct shuple for all function objects that differ by group.
The changes required to Ion in that scenario are minimal - most of the mechanics stay the same; we pull the object group directly from the tagged "pseudo-shuple" pointer on the function object itself and use it as normal.
I have my doubts about mucking around with Ion internals in the short term. A shuples implementation which uses a tagged group pointer and default shape leaves the door open to later making the default function group for a canonical function implicit. That lets us leave any future Ion inlining work unblocked, as well as somewhat prepare for a better compiler infrastructure built around the shuples construct in 2020.
Comment 5•5 years ago
If we can fix this bug without (a) a ton of changes across the code base and (b) slowing down object Class/realm/prototype accesses, that sounds great. Maybe I'm just overestimating the amount of work required to add a new kind of thing to TypeSets.
Assignee
Comment 6•5 years ago
(In reply to Iain Ireland [:iain] from comment #3)
> The clear advantage of TI over relying on ICs is that we can elide the guards.
(EDIT: I had a question here but I had missed one of Jan's two comments above. Sorry!)
Comment 7•5 years ago
(In reply to Iain Ireland [:iain] from comment #3)
> The clear advantage of TI over relying on ICs is that we can elide the guards.
> - Do we have any idea what the overhead of those guards would be?
Note that extra guards only apply to non-singleton callees. Native functions and (AFAIK) most scripted functions in Octane/Kraken are singletons.
Comment 8•5 years ago
> :iain, could you give a bit more context on the tradeoff you're discussing here (I don't see ICs/guards discussed previously) -- do you mean that you're considering the case where we no longer have distinct ObjectGroups for functions, and no equivalent replacement for call-target resolution (e.g., JSFunctions directly in TypeSets)?
When we inline a function, we need to make sure that the callee can't change without us noticing. For example, if we inline "obj.foo(arg)", we need to make sure that the function we inline and the function we get from obj.foo continue to be the same. Right now, we use TI for that. We have a typeset representing the callee. When we inline, we attach a constraint to that typeset, so that if a new target is added to that typeset, we will invalidate. This means we have to be able to represent individual functions in typesets, which is how we ended up talking about bit-tagging group pointers and so on.
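As a toy illustration of the constraint mechanism described above (all class and function names here are invented for the sketch, not SpiderMonkey's real ones): Ion inlines the single observed callee and, instead of emitting a per-call guard, attaches a constraint to the callee typeset so that a previously unseen target invalidates the compiled code.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <vector>

// Toy model of a callee typeset with invalidation constraints.
struct CalleeTypeSet {
  std::set<int> observed;                         // stand-in for JSFunction*
  std::vector<std::function<void()>> constraints; // fired on new targets

  void addCallee(int callee) {
    if (observed.insert(callee).second)
      for (auto& c : constraints) c();            // new target: invalidate
  }
};

// Demonstrates the lifecycle: same target keeps the code valid,
// a new target fires the constraint and invalidates it.
inline bool constraintDemo() {
  CalleeTypeSet set;
  set.addCallee(1);                               // warm-up: one target seen
  bool valid = true;
  set.constraints.push_back([&] { valid = false; });  // Ion's "constraint"
  set.addCallee(1);                               // same target: still valid
  bool sameOk = valid;
  set.addCallee(2);                               // new target: invalidated
  return sameOk && !valid;
}
```

The point of the sketch is the asymmetry: the hot path pays nothing, and all the cost lands on the (rare) invalidation event.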
The alternative is that we break the dependency between inlining and TI. Our alternative technique for this kind of thing is inspecting the baseline IC chain. If there is a single stub in the IC chain for a call site, and that stub guards on the target function, then we can inline that target function. The upside of this is that we don't have to represent call targets in TI. The downside is that we generate guards instead of constraints, which adds overhead to the hot path (albeit well-predicted overhead).
A further downside is that some of our more specialized call ICs (for example, fun.call and fun.apply) currently don't guard on the target function, so we don't necessarily have the information that we need to inline. But that's an implementation detail, and it shouldn't be too hard to fix if we decide to go down that route.
As Jan points out, though, we only need to do any of this for non-singleton groups. So I guess the idea is something like:
- Singleton functions are represented in TI as a pointer to the object itself. We can therefore inline it just fine.
- Non-singleton functions are given a shared group based on class/proto/?
- To inline non-singleton functions, we have to get the callee by inspecting the baseline IC chain and adding guards. This might require a few modifications to call ICs, but nothing too complicated.
If we do it this way, then I think (hope?) we get to avoid fancy tagging schemes in shuples altogether. I believe we also get to save a chunk of memory on redundant groups, which might be the most immediate win of any of the shuple work.
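The IC-chain inspection described above can be sketched like this (a minimal model with invented names; real baseline call stubs carry much more state). The rule is: inline only if the chain has exactly one stub and that stub guards on a specific target.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Each call stub may or may not guard on a specific target function
// (e.g. fun.apply stubs currently don't pin one down).
struct CallStub {
  std::optional<int> guardedTarget;  // stand-in for a JSFunction*/JSScript*
};

// Inspect a call site's IC chain and pick an inlining target, if any.
std::optional<int> pickInlineTarget(const std::vector<CallStub>& icChain) {
  if (icChain.size() != 1)
    return std::nullopt;             // polymorphic/megamorphic: don't inline
  return icChain[0].guardedTarget;   // may still be empty: stub without guard
}
```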
Comment 9•5 years ago
I am hoping we can pare down the initial implementation even further. Removing function ObjectGroups in the short term involves a lot of changes to Ion, which to me is a major timesink without much future reward; unless absolutely necessary, it's best avoided.
However, we can slowly abstract away from ObjectGroups being needed in various circumstances for any new compiler, while leaving enough infrastructure in place to support Ion's current behaviour.
Currently, Ion expects some function GetGroup(JSObject*) -> ObjectGroup*, where the returned ObjectGroup* is also guaranteed to be the same token that is stored on type-sets wherever the underlying object traveled.
In the short term, we should be able to fulfill that requirement easily with a JSObject representation along the lines of:
struct ShapedObject {
  ShuplePtr shuple_;
  // ...
};
struct ShuplePtr {
  union {
    uintptr_t rawBits;
    PlainShuplePtr shuple_;
    ImmFuncShuplePtr immFunc_;
    SingletonShuplePtr single_;
  };
};
struct PlainShuplePtr { /* Shuple* with low bits tagged 0. */ };
struct ImmFuncShuplePtr { /* Canonical function ObjectGroup* with low bits tagged 1. */ };
struct SingletonShuplePtr { /* Pointer to object itself, tagged 1. */ };
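The low-bit tagging idea behind ShuplePtr can be made concrete with a small runnable sketch. Note this is illustrative only: the tag assignments below are made up (in particular, a real scheme would need distinct tag values for the immediate-function and singleton cases), and it assumes pointers are at least 4-byte aligned so the two low bits are free.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tag assignments for the two free low bits.
enum class ShupleTag : uintptr_t { Plain = 0, ImmFunc = 1, Singleton = 2 };

constexpr uintptr_t TagMask = 0x3;

// Pack a pointer and a tag into one word.
inline uintptr_t tagPtr(void* p, ShupleTag tag) {
  uintptr_t bits = reinterpret_cast<uintptr_t>(p);
  assert((bits & TagMask) == 0 && "pointer must be at least 4-byte aligned");
  return bits | static_cast<uintptr_t>(tag);
}

// Recover the tag from a packed word.
inline ShupleTag tagOf(uintptr_t bits) {
  return static_cast<ShupleTag>(bits & TagMask);
}

// Recover the original pointer from a packed word.
inline void* untagPtr(uintptr_t bits) {
  return reinterpret_cast<void*>(bits & ~TagMask);
}
```

A GetGroup-style accessor would branch on tagOf() and either chase the plain shuple pointer or read the group straight out of the tagged word, which is exactly the branch discussed in the following comments.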
This lets us keep an implementation of GetGroup(JSObject*) -> ObjectGroup* that is still pretty straightforward, and doesn't need modification of heap or stack typeset logic, or many changes to Ion. It'll let us move on with the shuples implementation, keep current Ion performance, and hopefully remove all of the Ion logic from the codebase when the replacement infrastructure has matured.
I'd like to push hard in this direction because the actual compiler work on top of shuples will take a lot of time, and I'd like to spend as little time as possible on things that would require changes to Ion.
Comment 10•5 years ago
(In reply to Kannan Vijayan [:djvj] from comment #9)
> This lets us keep an implementation for GetGroup(JSObject*) -> ObjectGroup* that is pretty straightforward still
Can you be more specific? Adding a branch to GetGroup is not something we want to do or can afford (perf-wise, given how hot this code is, and it also complicates generated JIT code for these fields). I'd take a guard on all inlined scripted calls over adding branches to get{Class,Realm,Prototype} because that at least limits the overhead to Ion...
Comment 11•5 years ago
Ok, yeah - it would involve adding a branch to GetGroup in cases where we don't statically know whether the object is a function object. Where we do statically know, we can access the group directly.
This brings up a concern, however. If GetGroup really needs to be hot and we can't afford a single well-predicted test-bit style branch on it, wouldn't it be an expensive thing to access via a shuple indirection anyway?
This may imply that we need to look at moving all uses of ObjectGroup internals, at least in situations where the underlying object is dynamic and not statically known to be a function/non-function, off of hot paths before approaching the shuples implementation.
Jan, can you describe the hot-paths that access these members of ObjectGroup (class, proto, realm)? What proportion of them are in places where we might statically expect to already know that we have either a Function or a non-Function object? Are we pulling them out to do more than simply guard them against a specific value?
Comment 12•5 years ago
(In reply to Kannan Vijayan [:djvj] from comment #11)
> wouldn't it be an expensive thing to access via a shuple indirection anyway?
It's definitely a concern, but an extra dereference is hopefully easier to justify than a branch. Also, with shuples we could potentially move some of {Realm, Class, prototype} to the shuple itself so we don't have the extra indirection?
> Jan, can you describe the hot-paths that access these members of ObjectGroup (class, proto, realm)? What proportion of them are in places where we might statically expect to already know that we have either a Function or a non-Function object? Are we pulling them out to do more than simply guard them against a specific value?
For the JIT/IC uses, in many cases (especially guards) we know more statically. I'm worried about:
- A lot of code checks the Class; there are hundreds of obj->is<T>() calls in SpiderMonkey and Gecko, plus things like JSObject::isNative().
- Class hook checks like here or in JSObject::finalize. GC code operates on "any JSObject"; tracing/finalization code needs to know the object layout, and these branches are not well-predicted.
- Loops like this one for instanceof (and the C++ equivalent), or the one in GetNativeDataPropertyPure (this one often shows up in profiles for megamorphic sites).
- AutoRealm, obj->compartment(), obj->zone() are used often with arbitrary objects.
(edit: never mind, this doesn't really help you)
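To make the megamorphic-lookup case above concrete, here is a toy model, loosely in the spirit of the GetNativeDataPropertyPure loop Jan mentions (the types and layout here are invented): every iteration touches the object's class and prototype, which is why those accessors need to stay cheap for arbitrary objects.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct ToyClass { bool isNative; };

struct ToyObject {
  const ToyClass* clasp;
  const ToyObject* proto;
  std::unordered_map<std::string, int> slots;
};

// Walk the proto chain looking for a data property; bail out (return
// false with *out untouched) as soon as a non-native object appears,
// since that would require calling a class hook.
bool lookupPure(const ToyObject* obj, const std::string& name, int* out) {
  for (; obj; obj = obj->proto) {
    if (!obj->clasp->isNative)       // Class check on every hop
      return false;
    auto it = obj->slots.find(name);
    if (it != obj->slots.end()) {
      *out = it->second;
      return true;
    }
  }
  return false;                      // not found anywhere on the chain
}
```

Each hop reads both the class and the prototype, so adding even one extra branch to those accessors multiplies across the whole chain walk.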
Comment 13•5 years ago
Maybe it's possible to represent each (non-singleton?) canonical function as a FunctionObjectGroup (later FunctionShuple) that has the Class/realm/prototype fields from ObjectGroup and a pointer to the JSScript/LazyScript? Then lambda cloning would allocate a new JSFunction pointing to this group/shuple (and its script/lazyscript), and we wouldn't need the canonical JSFunction anymore.
Variable-size tuples (layout based on Class, for example) were part of the original tuples idea and maybe they would work well for functions?
Assignee
Comment 14•5 years ago
Assignee
Comment 15•5 years ago
After a few weeks on this, I've got a working patch (on Phab) that (i) saves ~3% of JS heap on AWSY tests, but (ii) slows down Speedometer by ~2%. The latter is pretty persistent -- I spent the past few weeks reworking things to have a fallback path rather than bailout, and it's still a perf loss -- so it's looking like this may be the best we get with this approach.
Thoughts on this tradeoff? Take it for Fission maybe?
The patch does the following:
- Removes separate groups for all non-singleton functions.
- In baseline IC attach logic for call ops, emits a meta-op that records the JSScript of the called function.
- At Ion compilation time, when deciding whether/what to inline, looks at the TypeSet for the callee as normal. If nothing, or if the general Function group corresponding to non-singleton functions is present in the set, then inspects the baseline IC chain to find this meta-op. If present, the set of possible targets contains this script. The InliningTarget type is now a three-way variant: singleton function, callable non-function object, or non-singleton function identified by script / canonical function.
- If inlining a target identified by script only, uses a new MIR op MTestScript that tests the script pointer of a given function object against a baked-in expected value. If it matches, branches to the inlined path; if not, branches to a fallback path that does a normal call.
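The control flow that an MTestScript-style guard compiles down to can be sketched as follows (a toy model with invented names, not the actual MIR lowering): compare the callee's script pointer against the baked-in expected script, take the inlined body on a match, and fall through to a normal call otherwise.

```cpp
#include <cassert>
#include <functional>

struct Script { int id; };
struct Function { const Script* script; };

// Guarded dispatch: the inlined body runs only when the callee's script
// matches the one the compiler baked in; otherwise do a generic call.
int callWithScriptGuard(const Function& callee,
                        const Script* expected,
                        const std::function<int()>& inlinedBody,
                        const std::function<int()>& fallbackCall) {
  if (callee.script == expected)   // the MTestScript-style guard
    return inlinedBody();          // fast path: inlined code
  return fallbackCall();           // slow path: normal call
}
```

This also shows why ops like the environment load can't be hoisted above the guard: on the fallback branch the callee may be a completely different function (even a native one) with a different layout.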
There were a lot of subtle bugs to resolve here, once I got past the basic "how does IonBuilder work" (understanding the stack invariants, how to patch up resume points to avoid mysterious appearances of Magic(optimized-out) values, etc). In particular, the Definite Properties Analysis can no longer analyze a function that calls non-singleton functions, because it relies on constraints that freeze the observed inlining (and we can't use constraints if we don't have groups!). There's also some interesting interaction with object-format assumptions -- if we defer the function target test to runtime (rather than relying on constraints), then e.g. the load-function-environment op cannot be lifted above the MTestScript, because the function may even be a native function, so I had to add a new category to the alias analysis to fix that. In summary, all tests are green now, but this needs some careful eyeballs.
Comment 16•5 years ago
Do you know what exactly causes the 2% regression? Do you also see it if you just add a guard to existing inlined non-singleton calls in Ion? Or is it from the DP Analysis changes?
Assignee
Comment 17•5 years ago
(In reply to Jan de Mooij [:jandem] from comment #16)
> Do you know what exactly causes the 2% regression? Do you also see it if you just add a guard to existing inlined non-singleton calls in Ion? Or is it from the DP Analysis changes?
I don't have direct data to back this up, but my current hypothesis (from three indirect arguments) is that it is simply because we no longer inline as many functions. Specifically:
- The DPA fix came in late, while I was still debugging the long tail of test failures but after I had already run several performance runs. The numbers prior to the fix were also around ~2%. Furthermore, the failure would happen whenever DPA attempted to run at all on any function that had inlined a non-singleton, and this happened only on a few test cases, so the effect on DPA should be rare and have a small contribution.
- The guard is a single load/compare/branch, and the load is from a field in JSFunction (the script) which is right next to the environment pointer. The cases we're concerned with here are lambdas that are likely to load their environment anyway, so the guard probably isn't contributing a cache miss. (We'd miss on the guard, but then the environment pointer load, which had been a miss, would become a hit.)
- The effect of removing the TI-based inlining information (in the separate experiment based on this patch) was pretty dramatic -- 12% on Speedometer, 43% on Octane. So the influence of inlining on performance is obviously quite high. Not inlining some functions therefore seems much more likely to be the culprit than an extra guard in cases that we still inline.
I can certainly run an experiment for #2 (adding extra guards to the unmodified version) but it's not quite a trivial amount of work, as I have to port over the new type of guard to the baseline*, as well as all of the meta-op and IC inspection changes that let me know where I would have added guards.
An easier experiment might be to modify the baseline* to simply not inline any non-singleton functions. This would tell us how much their inlining contributes to perf. My suspicion is quite a lot...
*I keep wanting to say "baseline" as in the baseline of the experiment / the current status quo, not as in the baseline {interpreter,compiler}. Please forgive the potentially confusing overloading of words :-)
Comment 18•5 years ago
(In reply to Chris Fallin [:cfallin] from comment #17)
> because we no longer inline as many functions.
Why is that? Do you have a simple example where that happens? It seems to me we could potentially inline all callees seen in Baseline?
Assignee
Comment 19•5 years ago
(In reply to Jan de Mooij [:jandem] from comment #18)
> Why is that? Do you have a simple example where that happens?
Yes, the simplest possible example is just where a callsite calls two different lambda functions with different scripts. (Say, a polymorphic higher-order function like map.) The reason is that this patch, as written now, only picks the first callee on the IC chain and inlines that. This was done in order to manage complexity; otherwise we would need not just the "test if script pointer is equal" MIR branch instruction, but a "dispatch on script pointer", or else a chain of branches. Basically, I got the simplest thing working first.
> It seems to me we could potentially inline all callees seen in Baseline?
Yes, that's the next thing I'll try, probably with a chain of branches, and using the same inlining-choice logic that the ObjectGroup-based path uses.
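That "chain of branches" poly-inlining dispatch can be sketched like this (a toy model with invented names): try each script that baseline observed, run the matching inlined body, and fall back to a generic call for any unseen target.

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct PolyScript { int id; };

// One entry per target the baseline ICs observed and Ion chose to inline.
struct InlinedCase {
  const PolyScript* script;
  std::function<int()> body;
};

// Chain-of-branches dispatch: one script-pointer guard per inlined target,
// with a generic call as the final fallback.
int dispatch(const PolyScript* calleeScript,
             const std::vector<InlinedCase>& cases,
             const std::function<int()>& fallback) {
  for (const auto& c : cases)
    if (c.script == calleeScript)
      return c.body();
  return fallback();               // unseen target: normal call path
}
```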
Assignee
Comment 20•5 years ago
The latest patch now does poly-inlining, and is all-green with this try-run.
Unfortunately, it's still losing ~2% on Speedometer: comparison. On Octane, similar story at ~2.5%.
The AWSY results show 2.1% heap savings for "JS opt", so the upside continues to exist, at least.
I had recovered some perf (earlier Speedometer runs with poly-inlining were a ~3-4% loss) by realizing I was inadvertently breaking the InlinePropertyTable optimization (when a CALLPROP fetches a property and a CALL then invokes that property's value, we can inline the called method and remove the property fetch altogether, because ObjectGroups and constraints tell us exactly what the property would contain). I was curious if that was the issue, so I ran another experiment where I intentionally disabled that optimization whenever the inlining-targets set contained a non-singleton function. Unfortunately, that didn't seem to be it: the comparison shows only a 0.13% degradation on Speedometer.
At this point, the most likely explanation seems to be the cost of the guards, and the loads they imply. There is no easy, dozen-line-change experiment to test this -- we'd have to add the new guard type to masm/LIR/MIR and artificially inject it in the right places -- but even if the experiment confirmed it, I'm not sure what more we can do: the guards are fundamentally necessary if we can't rely on constraints (with ObjectGroups) to ensure the proper target.
So, if others have more ideas, I'd love to hear them, but I think this patch is close to a stationary point: 2% memory savings for 2% perf, and we just need to decide whether we want to do that or not. (I suspect not, unless it goes in simultaneously with a default-flip for Fission, where memory pressure suddenly becomes greater...) Thoughts?
Comment 21•4 years ago
Fixed by Warp.