Open Bug 1986858 Opened 2 months ago Updated 1 month ago

wasm-gc: omit IL->OOL data pointer from WasmStructObject when not needed

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

People

(Reporter: jseward, Assigned: jseward)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 2 obsolete files)

The proposed JS3 benchmark "j2cl-box2d-wasm" has a D1 miss count more than twice as
high on SM as on V8. This is believed to be partly because SM's
WasmStructObjects are larger than V8's, and j2cl-box2d-wasm allocates a lot of
objects. For example, V8 generates

        vmovss   19(%r8), %xmm0
        vmovss   %xmm0, 19(%r12)

where SM generates

        vmovss   44(%rdx), %xmm1
        vmovss   %xmm1, 44(%r10)

Assuming r8/r12 have low-end tags of 0b001, that's a 24-byte difference.

For SM we can get some idea of the sensitivity of this benchmark to
WasmStructObject overhead size by artificially increasing the size of the
object with extra words:

N = # extra words in WasmStructObject

N            insns          cycles    IPC       Dmisses

0   11,888,388,519   4,791,157,013   2.48   202,898,145
1   11,922,566,989   4,998,100,118   2.39   252,951,255
2   11,941,307,319   5,027,953,406   2.37   254,930,242
3   11,961,214,343   5,010,693,654   2.39   255,991,355
4   11,952,698,814   5,083,673,556   2.35   263,415,340
5   12,026,518,835   5,336,168,889   2.25   309,611,179
6   12,005,779,475   5,331,609,412   2.25   308,848,077
7   12,077,991,732   5,419,249,772   2.23   310,656,130
8   12,065,427,639   5,543,931,938   2.18   320,761,034
9   12,270,819,557   6,219,587,233   1.97   365,207,339
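
To make the trend in the table easier to read, here is a small sketch that expresses each row as a percentage increase over the N = 0 row. The numbers are copied from the table above; the helper name `relative_increase` is made up for this illustration.

```python
# (extra_words, instructions, cycles, d1_misses), copied from the table above.
ROWS = [
    (0, 11_888_388_519, 4_791_157_013, 202_898_145),
    (1, 11_922_566_989, 4_998_100_118, 252_951_255),
    (2, 11_941_307_319, 5_027_953_406, 254_930_242),
    (3, 11_961_214_343, 5_010_693_654, 255_991_355),
    (4, 11_952_698_814, 5_083_673_556, 263_415_340),
    (5, 12_026_518_835, 5_336_168_889, 309_611_179),
    (6, 12_005_779_475, 5_331_609_412, 308_848_077),
    (7, 12_077_991_732, 5_419_249_772, 310_656_130),
    (8, 12_065_427_639, 5_543_931_938, 320_761_034),
    (9, 12_270_819_557, 6_219_587_233, 365_207_339),
]


def relative_increase(rows, column):
    """Map N -> percentage increase of rows[N][column] over the N = 0 row."""
    base = rows[0][column]
    return {row[0]: (row[column] - base) / base * 100.0 for row in rows}


insns_pct = relative_increase(ROWS, 1)
cycles_pct = relative_increase(ROWS, 2)
misses_pct = relative_increase(ROWS, 3)

for n, *_ in ROWS:
    print(f"N={n}: insns {insns_pct[n]:+.1f}%  "
          f"cycles {cycles_pct[n]:+.1f}%  D1 misses {misses_pct[n]:+.1f}%")
```

At N = 9 this gives roughly +3% instructions, +30% cycles and +80% D1 misses, so the cycle count tracks the miss count far more closely than it tracks the instruction count.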

It would help to omit dataPointer_ when it is not needed. But that means that,
when it is needed, it has to move to the end of the IL (inline) data elements,
so that those elements do not move relative to the structure base pointer as
inheritance adds more fields. This is a tedious but manageable extra complication.
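
As an illustration only, here is one simple way to get that property. It is not necessarily what the eventual patch does. The sketch assumes a hypothetical inline capacity of 32 bytes, reserves the last inline word for the OOL pointer, assumes naturally aligned fields, and sends every field after the first spill out of line. All names and constants are invented for the example.

```python
PTR_BYTES = 8
INLINE_CAPACITY = 32  # hypothetical: bytes available for inline data


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def layout(field_sizes):
    """Place fields in declaration order.

    Returns (placements, inline_bytes, has_ool). Each placement is
    ("IL", offset) or ("OOL", offset), relative to the start of the
    inline or out-of-line data area. Fields are naturally aligned.
    """
    ool_ptr_offset = INLINE_CAPACITY - PTR_BYTES  # reserved last inline word
    placements = []
    il_end = 0
    ool_end = 0
    spilled = False
    for size in field_sizes:
        start = align_up(il_end, size)
        if not spilled and start + size <= ool_ptr_offset:
            placements.append(("IL", start))
            il_end = start + size
        else:
            spilled = True  # assumption: later fields all go out of line
            start = align_up(ool_end, size)
            placements.append(("OOL", start))
            ool_end = start + size
    # Without OOL data the object only pays for the inline bytes it uses;
    # with OOL data it pays for the full inline area, pointer included.
    inline_bytes = INLINE_CAPACITY if spilled else il_end
    return placements, inline_bytes, spilled


parent_fields = [4, 4, 8]
child_fields = parent_fields + [8, 8, 4]  # subtype appends fields

parent_layout, parent_bytes, parent_ool = layout(parent_fields)
child_layout, child_bytes, child_ool = layout(child_fields)
print(parent_layout, parent_bytes, parent_ool)
print(child_layout, child_bytes, child_ool)
```

Because each placement depends only on the fields before it, the subtype's layout starts with exactly the parent's layout, and the parent, which needs no OOL data, carries no pointer word at all.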

It also means we need some other way to indicate whether or not the object has
OOL (out-of-line) data. We could steal the lowest bit in superTypeVector_ for
that purpose.
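
A minimal sketch of the bit-stealing idea, modelling the pointer as a plain integer. It assumes the pointed-to object is at least 2-byte aligned, so that bit 0 of its address is always zero and free to reuse. The function names and the example address are hypothetical.

```python
FLAG_HAS_OOL = 0b1  # hypothetical: bit 0 of the stored word means "has OOL data"


def pack(super_type_vector_addr, has_ool):
    """Combine an aligned address and the has-OOL flag into one word."""
    if super_type_vector_addr & FLAG_HAS_OOL:
        raise ValueError("address must be at least 2-byte aligned")
    return super_type_vector_addr | (FLAG_HAS_OOL if has_ool else 0)


def has_ool_data(word):
    """Read the flag back out of the stored word."""
    return bool(word & FLAG_HAS_OOL)


def super_type_vector(word):
    """Recover the real address by clearing the stolen bit."""
    return word & ~FLAG_HAS_OOL


addr = 0x7F00_1000  # made-up, suitably aligned address
word = pack(addr, has_ool=True)
print(hex(word), has_ool_data(word), hex(super_type_vector(word)))
```

The cost of such a scheme is that every use of the stored word as a pointer has to clear the flag bit first, or otherwise account for it.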

vmovss 19(%r8), %xmm0
vs
vmovss 44(%rdx), %xmm1

19 and 44 are more than one word apart; does this mean some sort of packing is happening here?

(In reply to Yury Delendik (:yury) from comment #1)

19 and 44 are more than one word apart; does this mean some sort of packing is happening here?

(As explained to me by Jan) V8 uses a scheme where the lowest bits
of a GC pointer contain some other info; I don't know what. Rather
than masking off those bits before using the pointer, if it knows what
the bits are, it can just adjust the access offset to compensate.
In this example, I assume that the code "knows" that %r8 holds a
value that ends in the bits 001. So the access is really to
"20 + (%r8 with the bottom 3 bits masked off)".

The offsets (20, 44) are both 0 mod 4, that is, 4-byte aligned, because these fields have type float32.
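
The arithmetic can be checked with a small sketch. It takes the tag value 0b001 as an assumption, as the comment above does, and models pointers as integers. The helper names and the example address are made up.

```python
ASSUMED_V8_TAG = 0b001  # assumption from the comment above, not confirmed


def untagged_offset(displacement, tag):
    """Field offset from the real object start, given the displacement
    encoded in the instruction and the known low tag bits of the pointer."""
    return displacement + tag


def load_address(pointer, displacement):
    """Address the instruction actually accesses: pointer + displacement."""
    return pointer + displacement


v8_offset = untagged_offset(19, ASSUMED_V8_TAG)  # vmovss 19(%r8), %xmm0
sm_offset = untagged_offset(44, 0)               # vmovss 44(%rdx), %xmm1

base = 0x1000                  # made-up, 8-byte-aligned object address
tagged = base | ASSUMED_V8_TAG

# Folding the tag into the displacement reaches the same address as
# masking the tag off first and using the real field offset.
same = load_address(tagged, 19) == load_address(tagged & ~0b111, v8_offset)

print(v8_offset, sm_offset, sm_offset - v8_offset, same)
```

Under that assumption the real offsets are 20 and 44, both 4-byte aligned, and the difference is the 24 bytes mentioned in the description.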

Assignee: nobody → jseward
Attachment #9516795 - Attachment is patch: true
Depends on: 1991368

The effectiveness of this layout rework is enhanced by the patch in bug
1991368, which adds js::gc::AllocKind::OBJECT6.

Here are some numbers for the resulting stack of 3 patches, measured on Intel
Tiger Lake, best of 10 runs. "Total Score" values increase and the D1 miss count
decreases (as intended). Overall I get the impression that about 1/2 to 2/3 of
the cycle count reduction is due to the reduction in cache misses, as was hoped for.

basis = baseline (none of the patches below applied)
obje6 = basis + patch that adds AllocKind::OBJECT6 (bug 1991368)
exper = obje6 + "WIP Part 1" (abstractify layout algorithm, but don't change it)
bitma = exper + "WIP Part 2" (fancy layout algorithm)


Total Score:
               basis     obje6     exper     bitma   %-vs-basis
j2cl-box2d    155.83    156.75    158.84    156.38   +0.3%
Df-todomvc     23.19     23.34     23.54     23.76   +2.5%
Ko-compose      3.95      3.98      3.97      4.06   +2.8%
Df-complex      4.26      4.33      4.35      4.46   +4.7%


cycles:u, millions
               basis     obje6     exper     bitma   %-vs-basis
j2cl-box2d     4,771     4,746     4,724     4,718   -1.1%
Df-todomvc    24,091    23,908    23,946    23,507   -2.5%
Ko-compose    79,711    79,024    78,937    77,641   -2.6%
Df-complex    75,605    74,710    73,945    71,291   -6.0%


instructions:u, millions
               basis     obje6     exper     bitma   %-vs-basis
j2cl-box2d    11,626    11,637    11,606    11,616   -0.1%
Df-todomvc    29,177    29,166    29,007    28,966   -0.7%
Ko-compose   115,327   114,916   114,983   114,400   -0.8%
Df-complex   107,670   107,718   106,288   103,862   -3.7%


L1-dcache-load-misses:u, millions
               basis     obje6     exper     bitma   %-vs-basis
j2cl-box2d       201       197       196       189   -6.3%
Df-todomvc       598       594       599       576   -3.8%
Ko-compose     1,860     1,855     1,874     1,806   -3.0%
Df-complex     1,650     1,656     1,650     1,540   -7.1%

Attachment #9516794 - Attachment is obsolete: true
Attachment #9516795 - Attachment is obsolete: true
Attachment #9518032 - Attachment description: WIP: Bug 1986858-1-refactor-struct-layout.diff → WIP: Bug 1986858 part 1: refactor wasm structure layout machinery.
Attachment #9518033 - Attachment description: WIP: Bug 1986858-2-bitmap-based-layout.diff → WIP: Bug 1986858 part 2: do structure layout with backfilling and OOLPtr avoidance.
Attachment #9518032 - Attachment description: WIP: Bug 1986858 part 1: refactor wasm structure layout machinery. → Bug 1986858 part 1: refactor wasm structure layout machinery. r=bvisness.
Attachment #9518033 - Attachment description: WIP: Bug 1986858 part 2: do structure layout with backfilling and OOLPtr avoidance. → Bug 1986858 part 2: do structure layout with backfilling and OOLPtr avoidance. r=bvisness.