wasm-gc: omit IL->OOL data pointer from WasmStructObject when not needed
Categories: Core :: JavaScript: WebAssembly, enhancement, P3
People: Reporter: jseward, Assigned: jseward
References: Blocks 1 open bug
Attachments: 2 files, 2 obsolete files
The proposed JS3 benchmark "j2cl-box2d-wasm" has a D1 miss count more than
twice as high on SM as it does on V8. This is believed to be partly because
SM's WasmStructObjects are larger than V8's, and j2cl-box2d-wasm allocates a
lot of objects. For example, V8 generates
vmovss 19(%r8), %xmm0
vmovss %xmm0, 19(%r12)
where SM generates
vmovss 44(%rdx), %xmm1
vmovss %xmm1, 44(%r10)
Assuming r8/r12 carry a low-bit tag of 0b001, V8's true field offset is 20, so
SM's offset of 44 is 24 bytes larger.
For SM we can get some idea of the sensitivity of this benchmark to
WasmStructObject overhead size by artificially increasing the size of the
object with extra words:
N = # extra words in WasmStructObject
N           insns         cycles   IPC      Dmisses
0  11,888,388,519  4,791,157,013  2.48  202,898,145
1  11,922,566,989  4,998,100,118  2.39  252,951,255
2  11,941,307,319  5,027,953,406  2.37  254,930,242
3  11,961,214,343  5,010,693,654  2.39  255,991,355
4  11,952,698,814  5,083,673,556  2.35  263,415,340
5  12,026,518,835  5,336,168,889  2.25  309,611,179
6  12,005,779,475  5,331,609,412  2.25  308,848,077
7  12,077,991,732  5,419,249,772  2.23  310,656,130
8  12,065,427,639  5,543,931,938  2.18  320,761,034
9  12,270,819,557  6,219,587,233  1.97  365,207,339
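To put the table in relative terms, here is a small sketch that computes the growth in D1 misses against the N = 0 row. The helper name percentIncrease and the constant names are made up for illustration; the numbers are copied from the table. With these figures, one extra word costs roughly a quarter more D1 misses and nine extra words roughly 80% more. The growth is not smooth: the large jumps are at N = 1, 5 and 9.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: percentage by which `value` exceeds `base`
// (negative if `value` is smaller than `base`).
inline double percentIncrease(uint64_t base, uint64_t value) {
  assert(base != 0);
  return 100.0 * (static_cast<double>(value) - static_cast<double>(base)) /
         static_cast<double>(base);
}

// D1 miss counts copied from the table for N = 0, 1 and 9.
constexpr uint64_t kMissesN0 = 202898145;
constexpr uint64_t kMissesN1 = 252951255;
constexpr uint64_t kMissesN9 = 365207339;
```

For example, percentIncrease(kMissesN0, kMissesN1) gives the relative cost of a single extra word.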
It would help to omit dataPointer_ when it is not needed. But that means that,
when it is needed, it must move to the end of the inline (IL) data elements, so
that those elements do not move relative to the structure base pointer as
inheritance adds more fields. This is a tedious but manageable extra
complication.
It also means we need some other way to indicate whether or not the object has
out-of-line (OOL) data. We could steal the lowest bit of superTypeVector_ for
that purpose.
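As an illustration of the bit-stealing idea only, here is a minimal sketch. It is not SM's actual code: SuperTypeVectorStub and TaggedSTV are hypothetical names, and the sketch assumes the pointed-to vector is at least 2-byte aligned so that bit 0 of its address is always zero and can carry the "has OOL data" flag.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the real super type vector; only its
// alignment matters for this sketch.
struct alignas(8) SuperTypeVectorStub {
  int dummy;
};

// Hypothetical header word: a pointer whose lowest bit doubles as the
// "this object has OOL data" flag.
class TaggedSTV {
  uintptr_t bits_;
  static constexpr uintptr_t kHasOOLBit = 1;

 public:
  TaggedSTV(const SuperTypeVectorStub* stv, bool hasOOL)
      : bits_(reinterpret_cast<uintptr_t>(stv) | (hasOOL ? kHasOOLBit : 0)) {
    // The scheme only works if the real pointer never uses bit 0.
    assert((reinterpret_cast<uintptr_t>(stv) & kHasOOLBit) == 0);
  }

  // True if the flag bit is set.
  bool hasOOLData() const { return (bits_ & kHasOOLBit) != 0; }

  // The original pointer, with the flag bit masked off.
  const SuperTypeVectorStub* pointer() const {
    return reinterpret_cast<const SuperTypeVectorStub*>(bits_ & ~kHasOOLBit);
  }
};
```

The cost of this choice is that every use of the pointer must first mask off the flag bit, or otherwise compensate for it.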
Comment 1•2 months ago
vmovss 19(%r8), %xmm0
vs
vmovss 44(%rdx), %xmm1
19 and 44 are more than one word apart; does this mean some sort of packing is happening here?
Comment 2 (Assignee)•2 months ago
(In reply to Yury Delendik (:yury) from comment #1)
19 and 44 are more than one word apart; does this mean some sort of packing is happening here?
(As explained to me by Jan) V8 uses a scheme where the lowest bits
of a GC pointer contain some other info -- I don't know what. And
rather than mask off the bits before using the pointer, if it knows what
the bits are, it can just change the access offset to compensate.
In this example, I assume that the code "knows" that %r8 holds a value that
ends in the bits 001. So the access is really to
"20 + (%r8 with the bottom 3 bits masked off)".
The offsets (20, 44) are both 0 mod 4 because these fields have type float32.
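The offset compensation described above can be sketched as follows. This is an illustration only, not V8 code: the tag value 0b001 is the assumption made in this comment, and the names kTag, compensatedDisplacement and loadFloatTagged are invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed tag in the low 3 bits of the pointer (an assumption, see above).
constexpr uintptr_t kTag = 0b001;

// Displacement to encode in the instruction for a field at `trueOffset`
// from the untagged base. For trueOffset = 20 this is 19.
constexpr intptr_t compensatedDisplacement(intptr_t trueOffset) {
  return trueOffset - static_cast<intptr_t>(kTag);
}

// Load a float32 field through a tagged pointer without masking the tag:
// the known tag is folded into the displacement instead.
inline float loadFloatTagged(uintptr_t tagged, intptr_t trueOffset) {
  assert((tagged & 0b111) == kTag);
  uintptr_t addr =
      tagged + static_cast<uintptr_t>(compensatedDisplacement(trueOffset));
  float value;
  std::memcpy(&value, reinterpret_cast<const void*>(addr), sizeof value);
  return value;
}
```

Because the tag is a compile-time constant here, the subtraction costs nothing at run time; it just changes the immediate in the load, which matches the 19(%r8) seen in the V8 code.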
Comment 5 (Assignee)•2 months ago
The effectiveness of this layout rework is enhanced by the patch in bug
1991368, which adds js::gc::AllocKind::OBJECT6.
Here are some numbers for the resulting stack of 3 patches. Intel Tiger Lake,
best of 10 runs. "Total Score" values increase, the D1 miss rate decreases (as
intended), and overall I get the impression that about 1/2 to 2/3 of the cycle
count reduction is due to the reduction in cache misses, as was hoped for.
basis = basis
obje6 = basis + patch that adds AllocKind::OBJECT6 (bug 1991368)
exper = obje6 + "WIP Part 1" (abstractify layout algorithm, but don't change it)
bitma = exper + "WIP Part 2" (fancy layout algorithm)
Total Score:
             basis   obje6   exper   bitma  %-vs-basis
j2cl-box2d  155.83  156.75  158.84  156.38       +0.3%
Df-todomvc   23.19   23.34   23.54   23.76       +2.5%
Ko-compose    3.95    3.98    3.97    4.06       +2.8%
Df-complex    4.26    4.33    4.35    4.46       +4.7%

cycles:u, millions
             basis   obje6   exper   bitma  %-vs-basis
j2cl-box2d   4,771   4,746   4,724   4,718       -1.1%
Df-todomvc  24,091  23,908  23,946  23,507       -2.5%
Ko-compose  79,711  79,024  78,937  77,641       -2.6%
Df-complex  75,605  74,710  73,945  71,291       -6.0%

instructions:u, millions
              basis    obje6    exper    bitma  %-vs-basis
j2cl-box2d   11,626   11,637   11,606   11,616       -0.1%
Df-todomvc   29,177   29,166   29,007   28,966       -0.7%
Ko-compose  115,327  114,916  114,983  114,400       -0.8%
Df-complex  107,670  107,718  106,288  103,862       -3.7%

L1-dcache-load-misses:u, millions
            basis  obje6  exper  bitma  %-vs-basis
j2cl-box2d    201    197    196    189       -6.3%
Df-todomvc    598    594    599    576       -3.8%
Ko-compose  1,860  1,855  1,874  1,806       -3.0%
Df-complex  1,650  1,656  1,650  1,540       -7.1%