Open Bug 1689434 Opened 4 years ago Updated 2 years ago

Experiment with length-3 TaggedParserAtomIndex in Parser

Categories

(Core :: JavaScript Engine, task, P2)


People

(Reporter: tcampbell, Unassigned)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [sp3])

Attachments

(2 files)

After Bug 1687428, we will have tagged atom indices in the Parser which can be directly computed for tiny strings without any lookup or pointer chasing. Currently, we support length-1 and length-2 tiny strings (with limited character sets) because the GC has reserved JSAtoms that we can quickly translate to.

For length-3 strings, we could introduce a new tag mode that does not have reserved ParserAtoms/JSAtoms, but can still directly construct the tagged atom index. The idea here is that the parser makes many more atomization calls than instantiate-to-JSAtom calls. Even without reserved JSAtoms, we should be able to improve parsing of large minified JS files with thousands of identifiers (which a lot of big websites ship).

See the TaggedParserAtomIndex representation. We could easily support strings of up to length 5 made of Base64 characters.
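The packing idea can be sketched as follows. This is a hypothetical illustration, not the actual TaggedParserAtomIndex encoding: the tag value, bit layout, and 64-entry "small char" alphabet (here, identifier-ish Base64: A-Z, a-z, 0-9, $, _) are all assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Assumed 2-bit tag in the top bits marking "length-3 static string".
constexpr uint32_t kLength3Tag = 0x3u << 30;

// Map a char to its 6-bit code if it is in the assumed 64-char alphabet.
inline std::optional<uint32_t> smallCharCode(char c) {
  if (c >= 'A' && c <= 'Z') return uint32_t(c - 'A');
  if (c >= 'a' && c <= 'z') return uint32_t(c - 'a' + 26);
  if (c >= '0' && c <= '9') return uint32_t(c - '0' + 52);
  if (c == '$') return 62u;
  if (c == '_') return 63u;
  return std::nullopt;
}

// Directly compute a tagged index for a length-3 string, with no table
// lookup or pointer chasing; fails if any char is outside the alphabet.
inline std::optional<uint32_t> tryTagLength3(const char* s) {
  auto a = smallCharCode(s[0]), b = smallCharCode(s[1]), c = smallCharCode(s[2]);
  if (!a || !b || !c) return std::nullopt;
  return kLength3Tag | (*a << 12) | (*b << 6) | *c;
}
```

Since the characters are recoverable from the index itself, atomization of such a string needs no reserved-atom table at parse time.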

Looks like for sites like Netflix and Gmail, we can save 30k+ atomization calls in their largest script alone.

See Also: → 1689892
See Also: → 1690634
Depends on: 1689892

Length-3 atoms should be simple enough; we should move the length-3 entries from the FOR_EACH_NONTINY_COMMON_PROPERTYNAME list to a new FOR_EACH_LENGTH3_PROPERTYNAME list.

But length-4 is problematic because of "null", given that it conflicts with the requirement here:
https://searchfox.org/mozilla-central/rev/40205f1b55f1ba8a26fcbd362c20d843e8718a87/js/src/vm/CommonPropertyNames.h#528-539
If we can remove the well-known parser atoms instance (bug 1690634), this should become simpler.

There's one issue with CompilationAtomCache::getExistingAtomAt:
the consumer expects the return value to be non-null,
so if we introduce "static" strings without pre-allocated JSAtoms, the function becomes fallible, or we have to
introduce yet another cache.

Tried length-3/4 strings.

(NOTE: the following results are without the length-4 cache working properly)

patch stack: https://hg.mozilla.org/try/pushloghtml?changeset=dcbe74ad27b96188dd114b5405d24ad3bf16443b

The result of large files in modified parsemark: https://arai-a.github.io/raptor-graph/pgraph.html

  • parse_full: full parse
  • dumpStencil_full: full parse + compile to stencil
  • dumpStencil_lazy: lazy parse + compile to stencil
  • compile_full: full parse + compile to stencil + instantiate
  • compile_lazy: lazy parse + compile to stencil + instantiate
  • decode_full: decode of (full parse + compile to stencil + encode) + instantiate
  • decode_lazy: decode of (lazy parse + compile to stencil + encode) + instantiate

across files

  • parse and compile-to-stencil are improved in most cases:
    • gmail
      • parse_full: -1.2ms (25.67 => 24.47, -4.67%)
      • dumpStencil_full: -2.13ms (61.7 => 59.57, -3.45%)
    • MochiKit
      • parse_full: -0.34ms (4.48 => 4.14, -7.59%)
      • dumpStencil_full: -0.22ms (8.89 => 8.67, -2.47%)
  • instantiate regresses
    • the improvement for compile is smaller than the improvement for dumpStencil
      • gmail
        • dumpStencil_full: -2.13ms (61.7 => 59.57)
        • compile_full: -0.92ms (67.28 => 66.36)
      • MochiKit
        • dumpStencil_full: -0.22ms (8.89 => 8.67)
        • compile_full: -0.14ms (9.48 => 9.34)
    • in some cases, compile becomes a regression
      • twitter
        • dumpStencil_full: -0.27ms (12.27 => 12)
        • compile_full: +0.05ms (12.69 => 12.74)
  • decode + instantiation regresses
    • gmail
      • decode_full: +0.59ms (8.59 => 9.18)
      • decode_lazy: +0.5ms (3.4 => 3.9)
    • MochiKit
      • decode_full: +0.09ms (1.14 => 1.23)
      • decode_lazy: +0.02ms (0.54 => 0.56)
    • twitter
      • decode_full: +0.01ms (1.83 => 1.84)
      • decode_lazy: +0.02ms (0.68 => 0.7)

The regression is most likely caused by getExistingAtomAt (which affects the instantiation step):

  • before the patch, length-3/4 strings were a simple pointer access to an already-instantiated JSAtom in atomCache
  • after the patch, it becomes:
    • a switch-case for well-known atom indices
    • a hash map lookup for atom index -> JSAtom
    • JSAtom atomization
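The three-step fallback chain above can be sketched roughly like this. Everything here is a simulation for illustration: the atom is stood in for by std::string, and the tag values and wellKnown() entries are made up, not the real well-known atom table.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Simulated three-tier getExistingAtomAt-style lookup for length-3/4
// tagged indices: well-known switch, then hash map cache, then atomize.
struct FakeAtomLookup {
  std::unordered_map<uint32_t, std::string> cache;  // index -> instantiated "atom"
  int atomizeCalls = 0;

  // Step 1: a few reserved indices answered without any lookup (assumed entries).
  static const char* wellKnown(uint32_t index) {
    switch (index) {
      case 1: return "for";
      case 2: return "var";
      default: return nullptr;
    }
  }

  // `chars` stands in for decoding the characters back out of the tagged
  // index; the real code would atomize them into a JSAtom.
  std::string getAtom(uint32_t index, const std::string& chars) {
    if (const char* w = wellKnown(index)) return w;  // step 1: switch-case
    auto it = cache.find(index);
    if (it != cache.end()) return it->second;        // step 2: hash map hit
    ++atomizeCalls;                                  // step 3: atomize + cache
    cache.emplace(index, chars);
    return chars;
  }
};
```

Compared with the pre-patch path (one pointer load from atomCache), every call here pays at least a hash computation, which matches the per-call cost difference measured below.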

(updated the result for length-4, after fixing the cache for length-4 atoms)

The number of toJSAtom calls (including getExistingAtomAt) for length-3/4 strings, for gmail_combined.js:

mode : length                  well-known atom   hashmap      atomize
                               (no atomize)      cache hit    + cache     total
parse_full : length-3                0                0           0           0
parse_full : length-4                0                0           0           0
dumpStencil_full : length-3          0                0           0           0
dumpStencil_full : length-4          0                0           0           0
compile_full : length-3            137            19699        9241       29077
compile_full : length-4             98              475         555        1128
decode_full : length-3             137            19686        9241       29064
decode_full : length-4              98              478         555        1131

Tried the benchmark after "fixing" the length-4 cache (putting the atom into the hash map); the result regresses.
So apparently the hash map's overhead is larger than its benefit.

Will check without the hash map, using only the well-known atom check.

After removing the hash map cache, compared unpatched vs. length-3 vs. length-3/4.
In most cases, both length-3 and length-4 static strings contribute to the parse / compile-to-stencil improvement,
but both regress instantiation.

on gmail_combined.js

mode               before    length-3    length-3/4
parse_full          26.39       24.03         23.35
dumpStencil_full    59.64       59.08         57.9
dumpStencil_lazy    28.35       26.61         26.93
compile_full        67.28       65.88         65.55
compile_lazy        30.8        29.35         29.48
decode_full          8.36        9             8.99
decode_lazy          3.97        3.91          3.96

on twitter_combined.js

mode               before    length-3    length-3/4
parse_full           5.56        5.74          5.52
dumpStencil_full    11.9        11.82         11.64
dumpStencil_lazy     5.83        5.54          5.48
compile_full        13.08       13.2          12.96
compile_lazy         5.83        5.88          5.77
decode_full          1.74        1.75          1.84
decode_lazy          0.74        0.73          0.74

Then, compared the rough time taken by a single getExistingAtomAt call during instantiation, on gmail_combined.js.
(The result may contain the cost of the PRMJ_Now call, which I tried to exclude, so the absolute values aren't reliable.)

first-time compilation (the first call for each length-3/4 string calls AtomizeChars)

kind              total time    hit count    per single call
ParserAtomIndex      134 us         5070          26 ns
well-known           157 us         4203          37 ns
length-1             529 us        18504          29 ns
length-2             827 us        26498          31 ns
length-3            3549 us        29077         122 ns
length-4             147 us         1128         130 ns

second-time compilation (JSAtoms for length-3/4 are expected to already be cached in the VM atom cache)

kind              total time    hit count    per single call
ParserAtomIndex      155 us         5070          31 ns
well-known           121 us         4203          29 ns
length-1             597 us        18504          32 ns
length-2             944 us        26498          36 ns
length-3            2509 us        29077          86 ns
length-4              95 us         1128          84 ns

To be clear, the cost of getExistingAtomAt cannot become completely equivalent between length-3/4 and ParserAtomIndex,
given that the AtomizeChars call for the ParserAtomIndex case is performed before that point (in the InstantiateMarkedAtoms call).

In the length-3 case above, 36% of the calls would still require AtomizeChars even if we introduced some kind of cache local to the current compilation,
given that there are ~10000 unique length-3 atoms.

so, the problems would be:

  • cost of hash calculation for each call
    • the above patch doesn't statically calculate hashes for length-3/4 atoms
  • cost of subsequent AtomizeChars calls, for atoms used twice or more inside the stencil
    • extra hash calculation
    • extra lookup
  • CPU cache efficiency
    • instruction cache:
      • for ParserAtomIndex case, AtomizeChars is called for all of them in InstantiateMarkedAtoms
      • for length-3/4 case, AtomizeChars is called in between instantiating other GC objects
    • data cache:
      • for the ParserAtomIndex case, the atom's pointer for the current compilation is stored into CompilationAtomCache.atoms_
        and the access is a simple pointer calculation
      • for length-3/4 case, atom's pointer is retrieved from per-zone atom cache, or global AtomsTable

(In reply to Tooru Fujisawa [:arai] from comment #9)

 * the above patch doesn't statically calculate hash for length-3/4 atoms

a table of precomputed hashes for all length-3 static atoms (small-char x 3) would be 1MB (64 * 64 * 64 * 4 bytes),
which doesn't sound reasonable to calculate statically.
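The size estimate works out like this: one 4-byte hash per possible length-3 small-char string, with 64 choices per character, gives exactly 1 MiB.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One precomputed 4-byte hash per possible length-3 small-char string.
constexpr size_t kSmallChars = 64;                                    // 6-bit alphabet
constexpr size_t kEntries = kSmallChars * kSmallChars * kSmallChars;  // 64^3 = 262144
constexpr size_t kTableBytes = kEntries * sizeof(uint32_t);           // * 4 bytes
static_assert(kTableBytes == size_t(1) << 20, "exactly 1 MiB");
```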

(In reply to Tooru Fujisawa [:arai] from comment #9)

  • cost of subsequent AtomizeChars calls, for atoms used twice or more inside the stencil
    • extra hash calculation
    • extra lookup

Thanks! This is the aspect I was missing in my thinking. To see benefit, we'd need to have more parser-atomize calls than copies of the tagged index in the final stencil. With your data in https://bugzilla.mozilla.org/show_bug.cgi?id=1687094#c13 I am less convinced there is as much of a win here if JSAtoms don't natively support these. Hmm..

Okay, I'll put this on the backlog for now.
We can revisit this later by integrating the length-3/4 atom optimization on the VM side as well.

Anyway, I think some refactoring/simplification patches (bug 1691107, bug 1691029, bug 1691014, and some of bug 1689892) are worth landing.
That will help with investigating bug 1690634 (which is mostly about data reduction).

Blocks: stencil-backlog
No longer blocks: 1609486
Blocks: 1803803
Whiteboard: [sp3]