Open Bug 1345476 Opened 7 years ago Updated 2 years ago

[exploration] Do size-profiling of EpicZenGarden codegen to look for size-reduction opportunities

Categories

(Core :: JavaScript: WebAssembly, task, P3)

People

(Reporter: luke, Unassigned)

References

(Blocks 1 open bug, )

Attachments

(1 file, 4 obsolete files)

...especially on 32-bit, where code-memory size is more limited (bug 1345205).  For a 44mb .wasm file, we're currently getting about 105mb of executable code.  Bug 1334504 should help here by removing the separate "profiling" prologue/epilogues.  I think a good start might be measuring aggregate size of generateOutOfLineCode() and doing per-op-type profiling if that's high.  It'd also be good to sanity check that the size of stubs/glue code added between functions (by ModuleGenerator) is only a small fraction.
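One way to get the kind of per-bucket numbers proposed above is to record the assembler's byte offset before and after each code-generation phase and attribute the delta to a named bucket. The sketch below is a toy illustration of that bookkeeping only: `FakeAssembler`, `SizeProfile` and the bucket names are hypothetical stand-ins, not SpiderMonkey APIs.

```python
from collections import defaultdict
from contextlib import contextmanager


class FakeAssembler:
    """Hypothetical stand-in for a macroassembler: it only stores emitted bytes."""

    def __init__(self):
        self.buffer = bytearray()

    def emit(self, encoded: bytes) -> None:
        self.buffer += encoded

    def current_offset(self) -> int:
        return len(self.buffer)


class SizeProfile:
    """Accumulates emitted bytes per named bucket using offset deltas."""

    def __init__(self, masm):
        self.masm = masm
        self.bytes_by_bucket = defaultdict(int)

    @contextmanager
    def bucket(self, name):
        start = self.masm.current_offset()
        try:
            yield
        finally:
            # Everything emitted inside the `with` block is charged to `name`.
            self.bytes_by_bucket[name] += self.masm.current_offset() - start


masm = FakeAssembler()
profile = SizeProfile(masm)

with profile.bucket("prologue"):
    masm.emit(b"\x55")            # push rbp (1 byte)
with profile.bucket("body"):
    masm.emit(b"\x48\x89\xe5")    # mov rbp, rsp (3 bytes)
with profile.bucket("out-of-line"):
    masm.emit(b"\x0f\x0b")        # ud2 (2 bytes)

total = masm.current_offset()
for name, size in sorted(profile.bytes_by_bucket.items(), key=lambda kv: -kv[1]):
    print(f"{name:12} {size:3} bytes  {100 * size / total:5.1f}%")
```

Because every byte lands in exactly one bucket, the bucket sizes sum to the final offset, which makes it easy to spot code that no bucket accounts for.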
Summary: Baldr: do size-profiling of EpicZenGarden codegen to look for → Baldr: do size-profiling of EpicZenGarden codegen to look for size-reduction opportunities
mbebenita observed at one point that the baseline code was 2x the size of the Ion code and with the removal of patching that can only have gotten worse.  Tiering is likely to make the code management problem worse.
Ugh, that's a good point; like 3x worse.  A mitigation could be to not do baseline compilation for wasm modules bigger than a certain threshold on 32-bit.  Really, a 44mb .wasm module (14mb gzipped) is a bit ridiculous no matter how you slice it; Unity's done a lot more work on code size and is only 12mb (3.6mb gzipped).
And yet it's presumably these large modules that would benefit the most from baseline compilation...  And we'll need an answer for debugging needs, probably?  (Could be as ugly and easy as not limiting code size in Dev Ed, I suppose, but that's scant help for a dev trying to diagnose a customer problem in a release browser.)
Right, that could be a reason to increase the max-code-bytes quota further on 32-bit.  For debugging, a 64-bit browser would work fine.
Another idea would be to switch baseline, in these gigantor cases, to do per-function JIT compilation.  If all baseline calls went through a table (which could be the same mechanism for tiering into Ion), then that table could be pre-populated with stubs that compile when called.  I still think we'd have a smoother experience (and be able to take advantage of streaming+parallel compilation) with AOT baseline, though, so probably we'd want to prefer this when not prohibited by memory.
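The table-of-stubs idea can be sketched in a few lines. This is a toy model, assuming a hypothetical `LazyFunctionTable` whose entries start out as stubs and are patched with the "compiled" function on first call; the use of Python's `eval` as the compiler is purely illustrative and has nothing to do with how a real baseline compiler would be invoked.

```python
class LazyFunctionTable:
    """Hypothetical call table whose entries compile themselves on first use."""

    def __init__(self, sources, compile_fn):
        self.compile_fn = compile_fn
        self.compiled_count = 0
        # Pre-populate every slot with a stub that compiles when called.
        self.entries = [self._make_stub(i, src) for i, src in enumerate(sources)]

    def _make_stub(self, index, source):
        def stub(*args):
            code = self.compile_fn(source)
            self.compiled_count += 1
            # Patch the table so later calls skip the stub entirely.
            self.entries[index] = code
            return code(*args)
        return stub

    def call(self, index, *args):
        # All calls go through the table, as in the proposal above.
        return self.entries[index](*args)


# Toy "compiler": turns a source string into a Python callable.
table = LazyFunctionTable(["lambda x: x + 1", "lambda x: x * 2"], compile_fn=eval)

first = table.call(0, 41)    # compiles function 0, then runs it
second = table.call(0, 41)   # runs the already-patched entry
print(first, second, table.compiled_count)
```

Only functions that are actually called pay for compilation and code memory, which is the trade-off against ahead-of-time baseline compilation discussed above. The same table slot could later be re-patched to point at a higher-tier version of the function.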
Priority: -- → P3
As an update: due to size-reductions in the .wasm file itself (bug 1341633), and improvements in codegen (bug 1338217 and bug 1334504), a 32-bit Ion build is 76mb and a 32-bit --wasm-always-baseline build is 113mb.  So one of these should fit in the now-140mb quota (as increased by bug 1345205).
So I took a look at the EpicZenGarden .wasm file and it looks like the name section is left in there which takes up 8,602,868 bytes. Not super important, but useful to know when doing size comparisons. Also, I'm not quite sure which .wasm file you are looking at, mine is 39,510,398 bytes.

Here's a breakdown of all the instructions we emit and the total number of bytes used to encode them:

cdq                  2
cqo                  4
div                  10
idiv                 12
sqrtsd               24
bsr                  45
cvttss2si            45
andpd                55
bsf                  141
jo                   174
cvttsd2si            230
movq                 245
xchg                 258
roundsd              270
js                   330
setbe                359
setge                1605
cvtsi2sd             2138
setle                2489
divsd                2557
movapd               2873
cvtsd2ss             2892
ucomisd              3002
movsxd               3009
addsd                3605
roundss              4141
subsd                4282
jnp                  4872
xorpd                5533
mulsd                5706
setb                 6571
sqrtss               7473
ja                   9818
andps                10832
setg                 12000
sar                  13151
jb                   13182
jp                   13866
jg                   15780
cvtss2sd             16047
setae                16141
pcmpeqw              18960
setl                 19922
psllq                22752
cvtsi2ss             23316
divss                25684
movd                 28364
seta                 36983
setne                44202
jbe                  45066
movsd                46027
shr                  46723
sete                 47028
xorps                52630
jl                   54798
ucomiss              61401
jge                  79826
lea                  82510
or                   88758
imul                 88899
jle                  92686
cmove                111055
ret                  112293
subss                127990
movaps               153815
nop                  224586
movsx                230824
movabs               233030
shl                  263491
movzx                302992
addss                313032
jne                  370436
xor                  376958
and                  448103
sub                  497382
mulss                507326
test                 543892
jmp                  786174
cmp                  836107
je                   890184
jae                  896890
movss                2074553
call                 2777591
add                  2802569
mov                  22145761

Total                39219338

I didn't use any command line args, so I'm guessing it's a 64-bit Ion build on OSX. So many moves :|
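To put "so many moves" in proportion, the byte counts can be turned into shares of the total. A small sketch using only the four largest rows and the total from the table above (the variable and function names are illustrative):

```python
TOTAL_BYTES = 39_219_338  # "Total" row from the table above

# The four largest rows of the table, in bytes.
top_rows = {
    "mov": 22_145_761,
    "add": 2_802_569,
    "call": 2_777_591,
    "movss": 2_074_553,
}


def share(num_bytes: int, total: int = TOTAL_BYTES) -> float:
    """Return num_bytes as a percentage of total."""
    return 100.0 * num_bytes / total


for name, num_bytes in top_rows.items():
    print(f"{name:6} {num_bytes:>10}  {share(num_bytes):5.1f}%")

print(f"top 4  {sum(top_rows.values()):>10}  {share(sum(top_rows.values())):5.1f}%")
```

By this measure `mov` alone is roughly 56% of all emitted bytes, and the four largest rows together are about three quarters of the total.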
Attached file Epic ZenGarden OpStats - Chart & Data (obsolete)
Some more data related to the emitted code in ion is attached.
Both the chart and the table are ordered by the average byte size per op.
Thanks!  (cc'ing a few people who might be interested in this raw data)

After scanning the list a few times, I can't really see anything unexpected here: the "flabby" ops (with more than ~9 bytes/op) all use an insignificant % of total bytes.  (Speaking of, what was the sum total code size?  It'd be nice to have a column next to totalBytes which has % of total.)

The comparisons (*AndBranch) are all a bit bigger than expected, but I expect this is due to immediates.

One interesting thing is that the single biggest op (at 11.2mb) is MoveGroup which is inserted by the register allocator when spilling is necessary; given all the existing work on this, I doubt there is any low-hanging fruit here.  There are quite a lot of calls here (578k) so the lack of non-volatile registers could be a significant contributing factor.
A few more ideas, at a quick look:

WrapInt64ToInt32 is currently a `movl`; with care this could often be optimized to a no-op, since 32-bit instructions only read the low 32 bits of their inputs.

NegD/NegF would be smaller with a constant-pool load instead of materializing the constant manually.

Branch immediates: the macroassembler currently always uses 32-bit immediates for forward branches. In the case of branches within individual LIR opcodes, the code generator may be able to declare that the destination is within range for an 8-bit immediate.
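For the branch-immediate point, the potential saving per branch follows from the standard x86 encodings: a short `jmp rel8` or `jcc rel8` is 2 bytes, a near `jmp rel32` is 5 bytes, and a near `jcc rel32` is 6 bytes. The sketch below just encodes that size rule; the function names are illustrative, and how the code generator would prove the destination is in range is left out.

```python
def fits_in_int8(displacement: int) -> bool:
    """True if the displacement is encodable as a signed 8-bit immediate."""
    return -128 <= displacement <= 127


def branch_size(displacement: int, conditional: bool, allow_short: bool = True) -> int:
    """Encoded size in bytes of an x86 jmp/jcc with the given displacement."""
    if allow_short and fits_in_int8(displacement):
        return 2                      # opcode + rel8
    return 6 if conditional else 5    # 0F 8x + rel32, or E9 + rel32


# A forward conditional branch over 20 bytes of code within one LIR op.
always_rel32 = branch_size(20, conditional=True, allow_short=False)
with_rel8 = branch_size(20, conditional=True)
print(always_rel32, with_rel8, always_rel32 - with_rel8)
```

Each conditional branch that can be proven short saves 4 bytes, and each unconditional one saves 3.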
Attached file Epic ZenGarden OpStats - Chart & Data (obsolete)
Sheets updated with explicit percentages.
Attachment #8884488 - Attachment is obsolete: true
Thanks!  So scanning the whole list again, it seems like all ops are in one of 3 categories: (1) insignificant % of total (<2%), (2) already optimized and thus not a likely source of low-hanging fruit (call, load, store, movegroup), (3) just super-hot and not flabby (comparisons, add).  So unfortunately no clear action here, just "good job Ion!".

Next, it'd be useful to profile the size of the OutOfLineCodes emitted by generateOutOfLineCode().
Attached file Epic ZenGarden OpStats w/ OOL (obsolete)
Updated data with OOL operations.
Attachment #8885005 - Attachment is obsolete: true
Wow, so out-of-line is pretty light then, mostly just out-of-line switch jump tables.  So the total size here is 37.7mb whereas about:memory reports 56.7mb for the total code allocation.  The two significant remaining buckets are wasm trap out-of-line paths (emitted by masm.wasmEmitTrapOutOfLineCode()) and prologue/epilogue code (emitted by GenerateFunction(Prologue|Epilogue)).  Just to make sure we've covered the whole 56.7mb, could you measure and include these as well?
(In reply to Luke Wagner [:luke] from comment #14)

> GenerateFunction(Prologue|Epilogue)).  Just to make sure we've covered the
> whole 56.7mb, could you measure and include these as well?

The grandtotal is around 49~MB, so I'm wondering: waaaat else might be taking the remaining 7~MB??
Yeah, that sounds interesting.  Other buckets: padding (functions are aligned to 16 byte boundaries, iirc), and the stubs at the end (emitted by ModuleGenerator::finishCodegen).  (Note: I _think_ about:memory is measuring MB not MiB.)
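The padding bucket is easy to estimate once per-function code sizes are known. A minimal sketch, assuming 16-byte function alignment as recalled above; the function sizes in the example are made up for illustration and are not measurements from Zen Garden.

```python
def padding_for(size: int, alignment: int = 16) -> int:
    """Bytes of padding needed after `size` bytes to reach the next boundary."""
    return (-size) % alignment


def total_padding(function_sizes, alignment: int = 16) -> int:
    """Sum of alignment padding over all functions."""
    return sum(padding_for(size, alignment) for size in function_sizes)


# Hypothetical function body sizes in bytes.
sizes = [37, 64, 150, 9]
print([padding_for(s) for s in sizes], total_padding(sizes))
```

If function sizes are spread evenly modulo 16, the average padding is 7.5 bytes per function, so the total scales with the number of functions rather than with code size.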
> (Note: I _think_ about:memory is measuring MB not MiB.)

Hm, looks like MiB: https://hg.mozilla.org/mozilla-central/annotate/e0b0865639cebc1b5afa0268a4b073fcdde0e69c/toolkit/components/aboutmemory/content/aboutMemory.js#l1600 (Aside: TIL verbose mode gives you the exact bytes!)
D'oh!  Sorry about that; knowing njn, I had assumed MiB would've been used if MiB was meant.
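The unit mix-up matters for the accounting, because the same figure differs by almost 5% depending on whether it is read as MB or MiB. A small conversion sketch (helper names are illustrative):

```python
MIB = 1024 * 1024   # bytes per mebibyte
MB = 1000 * 1000    # bytes per megabyte


def mib_to_mb(value_mib: float) -> float:
    """Convert a size in MiB to decimal MB."""
    return value_mib * MIB / MB


reported_mib = 56.7  # figure quoted from about:memory earlier in the thread
as_mb = mib_to_mb(reported_mib)
print(f"{reported_mib} MiB = {as_mb:.2f} MB (difference {as_mb - reported_mib:.2f} MB)")
```

Read as MiB, the 56.7 figure is about 59.45 million bytes, so byte totals summed from the profiling data need to be compared against that larger number.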
(In reply to Luke Wagner [:luke] from comment #17)

> Yeah, that sounds interesting.  Other buckets: padding (functions are
> aligned to 16 byte boundaries, iirc), and the stubs at the end (emitted by
> ModuleGenerator::finishCodegen).  (Note: I _think_ about:memory is measuring
> MB not MiB.)

That's interesting: I've left my Linux laptop at the office today and I ran the data-gathering patch on my personal Mac (including the stubs).

We're getting 3 generated stubs that add up to roughly 10MB. I'm seeing pretty bizarre numbers here, though; I have to recheck everything and re-run the numbers with the same baseline tomorrow.
Alright: I've found the origin of those bizarre numbers I was getting for some emitted ops (RTTI dark magic, fwiw :)).
There are only two stubs and they add up to around 5MB (5122589 bytes). Those, along with the 16-byte alignment padding, should account for everything, at least... I guess. :)
Here's the updated table, including the stubs and the relevant byte size measurements of the compiled code merges.

All in all we're almost there in accounting for everything (the grand total is "almost" 56MB). We're still missing a couple of megabytes, as the compiled-code merge ("asmMergeWith") calls add up to exactly the 56MB we're seeing in about:memory.

Luke, do you think this may be due to the aligns?
Attachment #8887280 - Attachment is obsolete: true
(In reply to Michelangelo De Simone [:mds] from comment #22)
Ah, ok, so there's still a 59.3MB-56.4MB=2.9MB hiding somewhere.  It should be possible to track this down by slicing up the remainder of CodeGenerator::generateWasm() (e.g., I see a masm.flush() call).
Component: JavaScript Engine: JIT → Javascript: WebAssembly
Depends on: 1590305
Blocks: 1590305
No longer depends on: 1590305

This is probably still worth doing, though probably not for x86-32. The latest measurement for wasm baseline (bug 1715459 comment 4) on x86-64 is 75MB of code (reducible to 71MB by pinning the TLS), but we also care about Ion. And these days there are much bigger test cases than Zen Garden.

Type: enhancement → task
Summary: Baldr: do size-profiling of EpicZenGarden codegen to look for size-reduction opportunities → [exploration] Do size-profiling of EpicZenGarden codegen to look for size-reduction opportunities
Severity: normal → S3