Experiment with length-3 TaggedParserAtomIndex in Parser
Categories
(Core :: JavaScript Engine, task, P2)
People
(Reporter: tcampbell, Unassigned)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [sp3])
Attachments
(2 files)
After Bug 1687428, we will have tagged atom indices in the Parser which can be directly computed for tiny strings without any lookup / pointer chasing. Currently, we support length-1 and length-2 tiny strings (with limited character sets) because the GC has reserved JSAtoms that we can quickly translate to.
For length-3 strings, we could introduce a new tag mode that does not have reserved ParserAtom/JSAtoms, but still can directly construct the tagged-atom-index. The idea here is that the parser makes many more atomization calls than instantiate-to-JSAtom calls. Even without reserved JSAtoms, we should be able to improve parsing for large minified JS files with thousands of identifiers (which happens on a lot of big websites).
Reporter
Comment 1•4 years ago
See the TaggedParserAtomIndex representation. We could easily support strings up to length 5 over Base64 characters.
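The direct-construction idea can be sketched as follows. This is a hypothetical encoding for illustration only (the alphabet, tag value, and function names are assumptions, not the real TaggedParserAtomIndex layout): at 6 bits per character, a length-3 string needs 18 payload bits, and even length 5 needs only 30, which is why up to length 5 fits alongside tag bits in a 32-bit index.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Map a character of a 64-symbol identifier alphabet to a 6-bit code.
// The exact alphabet here is an assumption for illustration.
static std::optional<uint32_t> charCode(char c) {
  if (c >= '0' && c <= '9') return uint32_t(c - '0');       // 0..9
  if (c >= 'a' && c <= 'z') return uint32_t(c - 'a') + 10;  // 10..35
  if (c >= 'A' && c <= 'Z') return uint32_t(c - 'A') + 36;  // 36..61
  if (c == '$') return 62u;
  if (c == '_') return 63u;
  return std::nullopt;
}

// Directly compute a tagged index for a length-3 string: 3 x 6 = 18 payload
// bits, leaving room for tag bits in a 32-bit index. No hashing, no table
// lookup, no pointer chasing.
static std::optional<uint32_t> length3Index(const std::string& s) {
  if (s.size() != 3) return std::nullopt;
  uint32_t bits = 0;
  for (char c : s) {
    auto code = charCode(c);
    if (!code) return std::nullopt;  // outside the supported character set
    bits = (bits << 6) | *code;
  }
  constexpr uint32_t kLength3Tag = 0x3u << 30;  // hypothetical tag bits
  return kLength3Tag | bits;
}
```

The point of the sketch is that the parser can produce this index for every occurrence of a tiny identifier without consulting any atom table.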
Reporter
Comment 2•4 years ago
Looks like for sites like Netflix and Gmail, we can save 30k+ atomization calls in their largest script alone.
Comment 3•4 years ago
Length-3 atoms should be simple enough: we should move the length-3 entries from the FOR_EACH_NONTINY_COMMON_PROPERTYNAME list to a new FOR_EACH_LENGTH3_PROPERTYNAME list.
But length-4 is problematic because of null, which conflicts with the requirement here:
https://searchfox.org/mozilla-central/rev/40205f1b55f1ba8a26fcbd362c20d843e8718a87/js/src/vm/CommonPropertyNames.h#528-539
If we can remove the well-known parser atoms instance (bug 1690634), this should become simpler.
Comment 4•4 years ago
There's one issue with CompilationAtomCache::getExistingAtomAt.
The consumer expects the return value to be non-null, so if we introduce "static" strings without pre-allocated JSAtoms, either the function becomes fallible or we have to introduce yet another cache.
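A minimal sketch of that contract problem, using hypothetical simplified types (the names are modeled on the real CompilationAtomCache, but the real one stores GC pointers and is more involved):

```cpp
#include <cstdint>
#include <vector>

struct JSAtom {};  // stand-in for the real GC type

struct CompilationAtomCache {
  std::vector<JSAtom*> atoms_;  // one pre-instantiated JSAtom per index

  // Old contract: every valid index was instantiated up-front, so callers
  // treat the result as non-null.
  JSAtom* getExistingAtomAt(uint32_t index) const { return atoms_[index]; }

  // With tag-only "static" strings that have no pre-allocated JSAtom, the
  // same lookup can miss, so either it becomes fallible like this...
  JSAtom* getAtomOrNull(uint32_t index) const {
    return index < atoms_.size() ? atoms_[index] : nullptr;
  }
  // ...or every caller has to consult yet another cache on a miss.
};
```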
Comment 5•4 years ago
Tried length-3/4 strings.
(NOTE: the following results are without the length-4 cache working properly)
Patch stack: https://hg.mozilla.org/try/pushloghtml?changeset=dcbe74ad27b96188dd114b5405d24ad3bf16443b
Results for large files in modified parsemark: https://arai-a.github.io/raptor-graph/pgraph.html
- parse_full: full parse
- dumpStencil_full: full parse + compile to stencil
- dumpStencil_lazy: lazy parse + compile to stencil
- compile_full: full parse + compile to stencil + instantiate
- compile_lazy: lazy parse + compile to stencil + instantiate
- decode_full: decode of (full parse + compile to stencil + encode) + instantiate
- decode_lazy: decode of (lazy parse + compile to stencil + encode) + instantiate
Across files:
- parse and compile to stencil are improved in most cases:
  - gmail
    - parse_full: -1.2ms (25.67 => 24.47, -4.67%)
    - dumpStencil_full: -2.13ms (61.7 => 59.57, -3.45%)
  - MochiKit
    - parse_full: -0.34ms (4.48 => 4.14, -7.59%)
    - dumpStencil_full: -0.22ms (8.89 => 8.67, -2.47%)
- instantiate is regressed:
  - the improvement for compile is less than the improvement for dumpStencil
    - gmail
      - dumpStencil_full: -2.13ms (61.7 => 59.57)
      - compile_full: -0.92ms (67.28 => 66.36)
    - MochiKit
      - dumpStencil_full: -0.22ms (8.89 => 8.67)
      - compile_full: -0.14ms (9.48 => 9.34)
  - in some cases, compile becomes a regression
    - twitter
      - dumpStencil_full: -0.27ms (12.27 => 12)
      - compile_full: +0.05ms (12.69 => 12.74)
- decode + instantiation is regressed:
  - gmail
    - decode_full: +0.59ms (8.59 => 9.18)
    - decode_lazy: +0.5ms (3.4 => 3.9)
  - MochiKit
    - decode_full: +0.09ms (1.14 => 1.23)
    - decode_lazy: +0.02ms (0.54 => 0.56)
  - twitter
    - decode_full: +0.01ms (1.83 => 1.84)
    - decode_lazy: +0.02ms (0.68 => 0.7)
The regression is most likely caused by getExistingAtomAt (which affects the instantiation step):
- before the patch, length-3/4 strings were a simple pointer access to the already-instantiated JSAtom in atomCache
- after the patch, it becomes:
  - a switch-case for well-known atom indices
  - a simple hash map lookup for atom index -> JSAtom
  - JSAtom atomization on a cache miss
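The post-patch lookup path can be sketched like this (hypothetical names and simplified types; the real code dispatches on the tagged index representation and allocates atoms in the GC atom table):

```cpp
#include <cstdint>
#include <unordered_map>

struct JSAtom {};

JSAtom gWellKnown[2];  // stand-ins for reserved well-known atoms

// Fallback: atomize the string for this tagged index and remember the result.
// (Leaks in this sketch; real code allocates in the GC atom table.)
JSAtom* atomizeChars(uint32_t taggedIndex,
                     std::unordered_map<uint32_t, JSAtom*>& cache) {
  JSAtom* atom = new JSAtom();
  cache[taggedIndex] = atom;
  return atom;
}

JSAtom* getAtomForStaticIndex(uint32_t taggedIndex,
                              std::unordered_map<uint32_t, JSAtom*>& cache) {
  switch (taggedIndex) {  // 1) switch-case for well-known atom indices
    case 0: return &gWellKnown[0];
    case 1: return &gWellKnown[1];
    default: break;
  }
  auto it = cache.find(taggedIndex);  // 2) hash map: atom index -> JSAtom
  if (it != cache.end()) return it->second;
  return atomizeChars(taggedIndex, cache);  // 3) atomize + cache on a miss
}
```

Compared with the pre-patch single pointer access, every call now pays at least a branch plus a hash lookup, which is where the instantiation regression comes from.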
Comment 6•4 years ago
(updated the results for length-4, after fixing the cache for length-4 atoms)
The number of toJSAtom calls (including getExistingAtomAt) for length-3/4, for gmail_combined.js:
| | well-known atom (no atomize) | hashmap cache hit | atomize + cache | total |
|---|---|---|---|---|
| parse_full : length-3 | 0 | 0 | 0 | 0 |
| parse_full : length-4 | 0 | 0 | 0 | 0 |
| dumpStencil_full : length-3 | 0 | 0 | 0 | 0 |
| dumpStencil_full : length-4 | 0 | 0 | 0 | 0 |
| compile_full : length-3 | 137 | 19699 | 9241 | 29077 |
| compile_full : length-4 | 98 | 475 | 555 | 1128 |
| decode_full : length-3 | 137 | 19686 | 9241 | 29064 |
| decode_full : length-4 | 98 | 478 | 555 | 1131 |
Comment 7•4 years ago
Tried the benchmark after "fixing" the length-4 cache (putting the atom into the hash map); the result regresses.
So apparently the hashmap's overhead is larger than the benefit.
Will check without the hashmap, with only the well-known atom check.
Comment 8•4 years ago
After removing the hashmap cache, compared unpatched, length-3, and length-3/4.
In most cases, both length-3 and length-4 static strings contribute to the parse / compile-to-stencil improvement, but both regress instantiate.
On gmail_combined.js:
| mode | before | length-3 | length-3/4 |
|---|---|---|---|
| parse_full | 26.39 | 24.03 | 23.35 |
| dumpStencil_full | 59.64 | 59.08 | 57.9 |
| dumpStencil_lazy | 28.35 | 26.61 | 26.93 |
| compile_full | 67.28 | 65.88 | 65.55 |
| compile_lazy | 30.8 | 29.35 | 29.48 |
| decode_full | 8.36 | 9 | 8.99 |
| decode_lazy | 3.97 | 3.91 | 3.96 |
On twitter_combined.js:

| mode | before | length-3 | length-3/4 |
|---|---|---|---|
| parse_full | 5.56 | 5.74 | 5.52 |
| dumpStencil_full | 11.9 | 11.82 | 11.64 |
| dumpStencil_lazy | 5.83 | 5.54 | 5.48 |
| compile_full | 13.08 | 13.2 | 12.96 |
| compile_lazy | 5.83 | 5.88 | 5.77 |
| decode_full | 1.74 | 1.75 | 1.84 |
| decode_lazy | 0.74 | 0.73 | 0.74 |
Then, compared the rough time taken by a single getExistingAtomAt call during instantiation, on gmail_combined.js.
(the result may contain PRMJ_Now call cost, though I tried to exclude it, so the absolute values aren't reliable)
First-time compilation (the first call for each length-3/4 atom calls AtomizeChars):
| kind | total time | hit count | per single call |
|---|---|---|---|
| ParserAtomIndex | 134 us | 5070 | 26 ns |
| well-known | 157 us | 4203 | 37 ns |
| length-1 | 529 us | 18504 | 29 ns |
| length-2 | 827 us | 26498 | 31 ns |
| length-3 | 3549 us | 29077 | 122 ns |
| length-4 | 147 us | 1128 | 130 ns |
Second-time compilation (JSAtoms for length-3/4 are expected to be cached in the VM atom cache):
| kind | total time | hit count | per single call |
|---|---|---|---|
| ParserAtomIndex | 155 us | 5070 | 31 ns |
| well-known | 121 us | 4203 | 29 ns |
| length-1 | 597 us | 18504 | 32 ns |
| length-2 | 944 us | 26498 | 36 ns |
| length-3 | 2509 us | 29077 | 86 ns |
| length-4 | 95 us | 1128 | 84 ns |
Comment 9•4 years ago
To be clear, the cost of getExistingAtomAt cannot become completely equivalent between length-3/4 and ParserAtomIndex, given that the AtomizeChars call for the ParserAtomIndex case is performed before that point (in the InstantiateMarkedAtoms call).
In the length-3 case above, 36% of the calls would still require AtomizeChars even if we introduced some kind of cache local to the current compilation, given there are ~10000 unique length-3 atoms.
So the problems would be:
- cost of hash calculation for each call
  - the above patch doesn't statically calculate the hash for length-3/4 atoms
- cost of subsequent AtomizeChars calls, for atoms used twice or more inside the stencil
  - extra hash calculation
  - extra lookup
- CPU cache efficiency
  - instruction cache:
    - for the ParserAtomIndex case, AtomizeChars is called for all of them in InstantiateMarkedAtoms
    - for the length-3/4 case, AtomizeChars is called in between instantiating other GC objects
  - data cache:
    - for the ParserAtomIndex case, the atom's pointer for the current compilation is stored into CompilationAtomCache.atoms_ and the access is a simple pointer calculation
    - for the length-3/4 case, the atom's pointer is retrieved from the per-zone atom cache, or the global AtomsTable
Comment 10•4 years ago
(In reply to Tooru Fujisawa [:arai] from comment #9)
> the above patch doesn't statically calculate the hash for length-3/4 atoms

The hash table for the entire set of length-3 static atoms (small-char x 3) becomes 1MB (64 * 64 * 64 * 4 bytes), so precomputing it statically doesn't sound reasonable.
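The arithmetic behind that estimate, as a quick compile-time sketch (assuming a 64-character alphabet and one 4-byte hash per entry, as stated above):

```cpp
#include <cstdint>

// One precomputed hash for every possible length-3 small-char string.
constexpr uint64_t kAlphabetSize = 64;
constexpr uint64_t kEntries = kAlphabetSize * kAlphabetSize * kAlphabetSize;
constexpr uint64_t kBytesPerHash = 4;
constexpr uint64_t kTableBytes = kEntries * kBytesPerHash;

static_assert(kEntries == 262144, "64^3 possible length-3 strings");
static_assert(kTableBytes == 1048576, "exactly 1 MiB of precomputed hashes");
```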
Reporter
Comment 11•4 years ago
(In reply to Tooru Fujisawa [:arai] from comment #9)
> - cost of subsequent AtomizeChars calls, for atoms used twice or more inside the stencil
>   - extra hash calculation
>   - extra lookup
Thanks! This is the aspect I was missing in my thinking. To see benefit, we'd need to have more parser-atomize calls than copies of the tagged index in the final stencil. With your data in https://bugzilla.mozilla.org/show_bug.cgi?id=1687094#c13 I am less convinced there is as much of a win here if JSAtoms don't natively support these. Hmm..
Comment 12•4 years ago
Okay, I'll put this on the backlog for now.
We can revisit it later when integrating a length-3/4 atom optimization on the VM side.
In any case, I think some refactoring/simplification patches (bug 1691107, bug 1691029, bug 1691014, and some of bug 1689892) are worth landing.
That will help with investigating bug 1690634 (it's mostly data reduction).
Comment 13•4 years ago
Comment 14•4 years ago
Depends on D107353