JM: Analyze WebKit PIC implementation

RESOLVED FIXED

Status

()

Core
JavaScript Engine
RESOLVED FIXED
8 years ago
8 years ago

People

(Reporter: dmandelin, Assigned: dmandelin)

Tracking

Trunk
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Assignee)

Description

8 years ago
We need a summary of what WebKit does for a PIC, and an analysis of its performance implications.
(Assignee)

Updated

8 years ago
Assignee: general → dmandelin
(Assignee)

Comment 1

8 years ago
First, a description of how they do property caching in the interpreter.

A property access bytecode op starts as |op_get_by_id|, which simply runs the full lookup (slow path). But it also calls |tryCachePutByID| to cache the result. They actually patch the bytecode, storing in a new, more specific op. For example, if the property is found in the normal way on the object itself (i.e., not up the proto chain), then they patch to bytecode |op_get_by_id_self|. This tests whether the assumptions of the cached access path are valid (in this case, that the base object is a normal JS object and that it has the same structure (what they call a shape)), and if so, does the fast path. If not, they call |uncacheGetByID|, which patches the op back to |op_get_by_id|.

This caching is monomorphic. It will switch back and forth between different access paths. But if it is caching structures, it will go megamorphic if it observes a different structure. This patches the bytecode to |op_get_by_id_generic|, which simply does the slow path and does not ever try to cache again.

The specialized access paths are:

 - op_get_by_id_self           get a property from the object itself
 - op_get_by_id_proto          get a property from the object's prototype
 - op_get_by_id_getter_proto   get a property with a getter function from
                               the object's prototype
 - op_get_by_id_custom_proto   get a property with a 'custom' getter function
                               from the object's prototype
 - op_get_by_id_chain          get a property from further up the prototype
                               chain
 - op_get_by_id_getter_self    get a property with a getter function from self
 - op_get_by_id_custom_self    get a property with a custom function from self
 - op_get_by_id_getter_chain   get a property with a getter function from chain
 - op_get_by_id_custom_chain   get a property with a custom function from chain
 - op_get_array_length         get array length
 - op_get_string_length        get string length

It's interesting that their patching mechanism makes it cheap to have many different specialized access paths.
(Assignee)

Comment 2

8 years ago
And now for the JIT version. IIUC, what they are implementing is very close to what is described in the original Self paper. So, read this first if you haven't already:

    http://research.sun.com/self/papers/pics.html

Recall that op_get_by_id is the basic property read opcode (c.f. JSOP_GETPROP). At first, it is compiled to this (I have changed some terms to match SpiderMonkey usage):

  // INITIAL, UNOPTIMIZED COMPILED VERSION
  // compiled op_get_by_id, inputs are base [object] and id [name]
  entry:
    if |base| is not a vanilla JS value:
        goto slow
    if |base.shape != NO_SHAPE|
        goto slow
    // fast path
    load reg0 := base.dslots
    load reg0 := base.dslots[NO_SLOT]
  done:
    ... [continue]

  slow:
    call stub cti_op_get_by_id
    goto done

  // stub called by slow path
  cti_op_get_by_id:
    get the property value
    if calling stub for this static op for the 2nd time:
        call JITThunks::tryCacheGetByID

So far, all this really accomplishes is to call the general-case stub. But there are two important features. First, there is an inlined fast path that reads a slot from the object itself. The fast path is guarded by shape, with an invalid shape so the guard will fail the first time the path is run. Second, when the stub is called for the second time, it will call JITThunks::tryCacheGetByID, which attempts to do inline caching.

JITThunks::tryCacheGetByID can optimize all or almost all of the specialized access paths described in comment 1. I will focus on the self case and the proto case; the others are handled in a corresponding way.

If the access path is op_get_by_id_self, then tryCacheGetByID optimizes by calling patchGetByIdSelf. This patches the original code as follows:

  1. Patch the shape guard to use the currently observed shape.
  2. Patch the slot number to use the slot for that shape.
  3. Patch the slow case to jump to op_get_by_id_self_fail.

At this point, the inline code is now:

  // MONOMORPHIC INLINE CACHE OPTIMIZED COMPILED VERSION
  // compiled op_get_by_id, inputs are base [object] and id [name]
  entry:
    if |base| is not a vanilla JS value:
        goto slow
    if |base.shape != shape1|
        goto slow
    // fast path
    load reg0 := base.dslots
    load reg0 := base.dslots[slot1]
  done:
    ... [continue]

  slow:
    call stub cti_op_get_by_id_self_fail
    goto done

Note 1: The fast path can also be patched to be a single load if the desired slot is part of fslots.
Note 2: JIT-generated code is kept in non-writable pages, and the write bits are turned on temporarily during patching.

At this point, we have a monomorphic inline cache, with a slow case jump to cti_op_get_by_id_self_fail. That function can extend the inline cache to be polymorphic, as I will explain next.
(Assignee)

Comment 3

8 years ago
Starting at the end of comment 1, after MIC (monomorphic inline cache) code for the self-property case has been generated, the slow path calls cti_op_get_by_id_self_fail. This function tries to extend the MIC to a PIC, as follows.

First, if the base value is not a vanilla value or is uncacheable, the slow path will be patched to call cti_op_get_by_id_generic, which is simply a general case, non-caching property accessor (i.e., the PIC gives up in that case).

Otherwise, it calls compileGetByIdSelfList, which does the actual code generation. This will generate code like this:

  // JITted stub for second fast path in PIC
  pic_part_2:
    if |base.shape != shape2|:
        goto slow
    load property value from fslots
    goto done

It then patches the original code so that the shape guard failure jumps to this new stub. To make it explicit, the inline code is now this:

  // 2-SHAPE POLYMORPHIC INLINE CACHE OPTIMIZED COMPILED VERSION
  // compiled op_get_by_id, inputs are base [object] and id [name]
  entry:
    if |base| is not a vanilla JS value:
        goto slow
    if |base.shape != shape1|
        goto pic_part_2
    // fast path
    load reg0 := base.dslots
    load reg0 := base.dslots[slot1]
  done:
    ... [continue]

  slow:
    call stub cti_op_get_by_id_self_fail
    goto done

Note that the slow path still calls cti_op_get_by_id_self_fail, so if it is hit again, it will extend to a third entry in the same way, and then a fourth, and so on.

Of course, there is a limit to the cache size: currently it is 8. This means the biggest PIC will have the initial inlined fast path, then 7 chained stub fast paths. After the last allowed fast path is generated, the slow path is patched to call cti_op_get_by_id_generic. This just means it won't try to extend the PIC any more. 

It does not appear that the PIC ever goes megamorphic (i.e., discards all fast paths in favor of a call to the general case)--it seems that a base is declared uncacheable only if a property is deleted.
(Assignee)

Comment 4

8 years ago
Finally, I'll go over how the op_get_by_id_proto case is optimized with the PIC. This part goes back to JITThunks::tryCacheGetByID. If the property is found on the immediate prototype, this function will call compileGetByIdProto to optimize.

compileGetByIdProto generates this code:

  pic_proto_stub:
    if |base.shape != shape_p|:
        goto slow
    if |base.proto.shape != shape_pp|:
        goto slow
    reg0 := load from base.proto.(f|d)slots
    goto done

and it patches:

  - the original shape guard failure to jump to pic_proto_stub.
  - the original slow case to call cti_op_get_by_id_proto_list.

So far this is identical to the self case except:

  - the JITted access path uses the proto instead of the object itself (duh)
  - the fast path inlined in the original code is not being used. Instead, the original shape guard just always fails
  - the new slow case is cti_op_get_by_id_proto_list

cti_op_get_by_id_proto_list is similar to the corresponding function in the self case (cti_op_get_by_id_self_fail). Here's how it works. First, if the current state can't be cached as a proto get (the base value is not vanilla and cacheable, or the value holding the property is the self object, or the value holding the property is not on the prototype chain, then the call to this function is patched to call cti_op_get_by_id_proto_fail instead. Otherwise, in the usual case, compileGetByIdProtoList is called to extend the PIC. There is a third case, in which the object holding the property is not the proto but is on the proto chain. In that case, compileGetByIdChainList is called to extend the PIC with a more general optimized access path. compileGetByIdProtoList extends the PIC by generating another stub similar to the one above. The final thing to note is that once the PIC is full, the slow path is changed to call cti_op_get_by_id_proto_full.

cti_op_get_by_id_proto_fail (called for non-cacheable things in the proto case) just does a general lookup--it seems it is its own op purely for informative and codegen-control purposes. cti_op_get_by_id_proto_list_full is the same. (The code really is identical.)

Everything else is pretty much what you would expect it to be given the foregoing. I should add a note about compileGetByIdChain(List). It's pretty much the same as compileGetByIdProto except that it walks up multiple steps inline--the JIT generates code to walk up |k| steps, with shape checks at each step.
(Assignee)

Updated

8 years ago
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
See bug 411630 comment 2 for some instrumentation results I gathered in early 2008 when reviving the property cache (v1 of that cache was from the late '90s).

The gmail deep proto chains made me want a constant test of two words, hence the caching of tag > 1 property cache hits (grand-proto or further, or scope chain and proto chain searching of any degree > 0) using the direct object identity as key. Bug 497789 now has a working patch that uses direct object shape, with the price that shapes depend on prototype object identities.

IIRC this is how JSC's Structures work. Please correct me if I'm misremembering.

/be
(Assignee)

Comment 6

8 years ago
(In reply to comment #5)
> IIRC this is how JSC's Structures work. Please correct me if I'm
> misremembering.

I will look into it. I didn't actually read the Structure code in detail. They do generate a shape guard for each link in the prototype chain, which would make me think that that Structure is only a property of the object itself, but I'll check.
(Assignee)

Comment 7

8 years ago
More on WebKit Structures (their analogue to our shape):

Logically, a Structure consists of a prototype and a property map (so Brendan is right in comment 5). Given this, I'm not sure why they guard on each step in the prototype chain.

New objects. When an object |o| is created with a prototype |p|, |o.m_structure| is set to |p.m_inheritorID|. |m_inheritorID| is a property of the object that is lazily initialized to a new structure created with Structure::create. Thus, empty objects have the same structure iff they have the same prototype (which I think is what we also do). If a new object is created without an explicit prototype (e.g., {}), its structure starts as globalObject->emptyObjectStructure(). That appears to be another structure that is created newly for each global object. So, no prototype is essential treated as a special null prototype, which makes sense.

Adding properties. When a property is added to an object, its structure must change. Here's how it works:

  Step 1. Call Structure::addPropertyTransitionToExistingStructure. This searches the "transition table" in the current structure for a Structure that matches the new format. The "transition table" is a property of Structure: it is a mapping from property to structure t_s(p) -> s', with the meaning that adding property |p| to |s| yields structure |s'|. Thus, it has the same purpose as our JSScope::table. 

  Step 2. If addPropertyTransitionToExistingStructure returned a new structure, we are done: that is the structure to use.

  Step 3. Otherwise, call Structure::addPropertyTransition, which creates a new structure and adds it to the transition table.

Structure has other methods for "transitioning" to a new, related structure:

  removePropertyTransition
      This returns a Structure with a given property removed. This is done by getting a dictionary-implemented Structure equivalent to the input Structure, then removing the property from that Structure.

  changePrototypeTransition
      This returns a Structure equivalent to the input Structure with a different prototype. This is done simply by creating a new structure that is a copy, except with changed prototype.

  despecifyFunctionTransition
      This returns a Structure equivalent to the input but with a given function property "unbranded" (in our terms, i.e., the identity of the function is not embedded in the structure). This is done by creating a new structure with the given function unbranded. Interestingly, they also have a counter of the number of times this happens, and they unbrand all function properties when the counter reaches a threshold (currently 3).

  getterSetterTransition
      I believe this returns a copy of the input Structure that is set to have getters or setters. 

  toCacheableDictionaryTransition
  toUncachableDictionaryTransition
      These create a new structure that is copy of the current one with a dictionary implementation and its DictionaryKind set to either cacheable or uncacheable.
You need to log in before you can comment on or make changes to this bug.