Baldr: do size-profiling of EpicZenGarden codegen to look for size-reduction opportunities




9 months ago
5 months ago


(Reporter: luke, Unassigned)


9 months ago
...especially on 32-bit, where code-memory size is more limited (bug 1345205).  For a 44mb .wasm file, we're currently getting about 105mb of executable code.  Bug 1334504 should help here by removing the separate "profiling" prologues/epilogues.  I think a good start might be measuring the aggregate size of the code emitted by generateOutOfLineCode() and doing per-op-type profiling if that's high.  It'd also be good to sanity-check that the stubs/glue code added between functions (by ModuleGenerator) is only a small fraction of the total.


9 months ago
Comment 1

9 months ago
mbebenita observed at one point that the baseline code was 2x the size of the Ion code and with the removal of patching that can only have gotten worse.  Tiering is likely to make the code management problem worse.

Comment 2

9 months ago
Ugh, that's a good point; like 3x worse.  A mitigation could be to not do baseline compilation for wasm modules bigger than a certain threshold on 32-bit.  Really, a 44mb .wasm module (14mb gzipped) is a bit ridiculous no matter how you slice it; Unity's done a lot more work on code size and is only 12mb (3.6mb gzipped).

Comment 3

9 months ago
And yet it's presumably these large modules that would benefit the most from baseline compilation...  And we'll need an answer for debugging needs, probably?  (Could be as ugly and easy as not limiting code size in Dev Ed, I suppose, but that's scant help for a dev trying to diagnose a customer problem in a release browser.)

Comment 4

9 months ago
Right, that could be a reason to increase the max-code-bytes quota further on 32-bit.  For debugging, a 64-bit browser would work fine.

Comment 5

9 months ago
Another idea would be to switch baseline, in these gigantor cases, to do per-function JIT compilation.  If all baseline calls went through a table (which could be the same mechanism for tiering into Ion), then that table could be pre-populated with stubs that compile when called.  I still think we'd have a smoother experience (and be able to take advantage of streaming+parallel compilation) with AOT baseline, though, so probably we'd want to prefer this when not prohibited by memory.
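The table-of-stubs idea above can be illustrated with a toy model. This is only a sketch in Python, not SpiderMonkey code: `make_lazy_table` and `toy_compile` are hypothetical names, and "compiling" here just means calling a supplied function once per entry.

```python
compile_log = []  # records which functions were "compiled", for illustration


def toy_compile(source):
    """Hypothetical stand-in for a per-function baseline compile."""
    compile_log.append(source.__name__)
    return source  # pretend the source function is the compiled code


def make_lazy_table(sources, compile_fn):
    """Build a call table whose slots start out as compile-on-first-call stubs."""
    table = [None] * len(sources)

    def make_stub(index):
        def stub(*args):
            code = compile_fn(sources[index])
            table[index] = code  # patch the slot so later calls skip the stub
            return code(*args)
        return stub

    for i in range(len(sources)):
        table[i] = make_stub(i)
    return table


def double(x):
    return 2 * x


def square(x):
    return x * x


table = make_lazy_table([double, square], toy_compile)
first = table[0](21)   # goes through the stub, compiles, then runs
second = table[0](4)   # slot already patched, no second compile
```

All calls are made through `table[i]`, so only functions that are actually called cost any code memory; the same slot could later be repointed at a higher-tier version.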
Comment 6

9 months ago
As an update: due to size-reductions in the .wasm file itself (bug 1341633), and improvements in codegen (bug 1338217 and bug 1334504), a 32-bit Ion build is 76mb and a 32-bit --wasm-always-baseline build is 113mb.  So one of these should fit in the now-140mb quota (as increased by bug 1345205).
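A quick check of these numbers against the quota, as a sketch. The sizes and the 140mb quota are the figures quoted above; the assumption that both tiers' code would be resident at the same time under tiering, and counted against one quota, is hypothetical.

```python
QUOTA_MB = 140     # 32-bit quota quoted above
ION_MB = 76        # 32-bit Ion build
BASELINE_MB = 113  # 32-bit --wasm-always-baseline build


def fits(sizes_mb, quota_mb=QUOTA_MB):
    """True if the given code sizes fit in the quota together."""
    return sum(sizes_mb) <= quota_mb


ion_alone = fits([ION_MB])
baseline_alone = fits([BASELINE_MB])
both_tiers = fits([ION_MB, BASELINE_MB])  # 189mb in total
```

Either build fits on its own, but the two together do not, which is the concern raised earlier about tiering.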
So I took a look at the EpicZenGarden .wasm file and it looks like the name section is left in there, which takes up 8,602,868 bytes. Not super important, but useful to know when doing size comparisons. Also, I'm not quite sure which .wasm file you are looking at; mine is 39,510,398 bytes.

Here's a breakdown of all the instructions we emit and the total number of bytes used to encode them:

Some more data related to the emitted code in ion is attached.
Comment 9

5 months ago
Thanks!  (cc'ing a few people who might be interested in this raw data)

After scanning the list a few times, I can't really see anything unexpected here: the "flabby" ops (with more than ~9bytes/op) all use an insignificant % of total bytes.  (Speaking of, what was the sum total code size?  It'd be nice to have a column next to totalBytes which has % of total)
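The "% of total" column and the "flabby" check requested above could be computed along these lines. This is a sketch: the op names and numbers are made-up placeholders, not the measured data, and the 9 bytes/op threshold is the rough figure from the comment.

```python
def summarize(op_stats, flabby_threshold=9.0):
    """op_stats maps op name -> (count, total_bytes).

    Returns (total_bytes, rows), with rows sorted by totalBytes descending.
    """
    total = sum(nbytes for _, nbytes in op_stats.values())
    rows = []
    for name, (count, nbytes) in op_stats.items():
        bytes_per_op = nbytes / count
        rows.append({
            "op": name,
            "count": count,
            "totalBytes": nbytes,
            "bytesPerOp": bytes_per_op,
            "pctOfTotal": 100.0 * nbytes / total,
            "flabby": bytes_per_op > flabby_threshold,
        })
    rows.sort(key=lambda r: r["totalBytes"], reverse=True)
    return total, rows


# Placeholder data only; these are NOT the EpicZenGarden measurements.
sample = {"OpA": (1000, 4000), "OpB": (10, 200), "OpC": (500, 5800)}
total, rows = summarize(sample)
for r in rows:
    print(f'{r["op"]:8} {r["totalBytes"]:8} {r["bytesPerOp"]:6.1f} {r["pctOfTotal"]:5.1f}%')
```

An op only matters for size if it is both flabby and a significant share of the total, so the two columns are best read together.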

The comparisons (*AndBranch) are all a bit bigger than expected, but I expect this is due to immediates.

One interesting thing is that the single biggest op (at 11.2mb) is MoveGroup, which is inserted by the register allocator when spilling is necessary; given all the existing work on this, I doubt there is any low-hanging fruit here.  There are quite a lot of calls here (578k), so the lack of non-volatile registers could be a significant contributing factor.
A few more ideas, at a quick look:

WrapInt64ToInt32 is currently a `movl`; with care this could be optimized to a no-op in many cases, since 32-bit instructions only read the low 32 bits of their inputs.

NegD/NegF would be smaller with a constant-pool load instead of materializing the constant manually.

Branch immediates: the macroassembler currently always uses 32-bit immediates for forward branches. In the case of branches within individual LIR opcodes, the code generator may be able to declare that the destination is within range for an 8-bit immediate.
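To put a number on the branch-immediate idea, here is a sketch of the size difference on x86. The encodings are the standard ones (short `jcc`/`jmp` with an 8-bit displacement are 2 bytes; near `jcc` with a 32-bit displacement is 6 bytes and near `jmp` is 5 bytes). The function names and the sample displacements are hypothetical, and this does not model how the macroassembler actually binds labels.

```python
def fits_rel8(disp):
    """True if a displacement (measured from the end of the branch) fits in int8."""
    return -128 <= disp <= 127


def branch_size(disp, conditional=True):
    """Encoded size in bytes of an x86 branch with the smallest usable immediate."""
    if fits_rel8(disp):
        return 2                    # opcode + imm8
    return 6 if conditional else 5  # 0F 8x + imm32, or E9 + imm32


def bytes_saved(disps, conditional=True):
    """Bytes saved versus always using the 32-bit form."""
    long_size = 6 if conditional else 5
    return sum(long_size - branch_size(d, conditional) for d in disps)


# Hypothetical forward-branch displacements within a few LIR ops.
saved = bytes_saved([10, 200, 100])
```

Each conditional branch that can be proven short saves 4 bytes, so the payoff depends on how many intra-op branches there are.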
Comment 12

5 months ago
Thanks!  So scanning the whole list again, it seems like all ops are in one of 3 categories: (1) insignificant % of total (<2%), (2) already optimized and thus not a likely source of low-hanging fruit (call, load, store, movegroup), (3) just super-hot and not flabby (comparisons, add).  So unfortunately no clear action here, just "good job Ion!".

Next, it'd be useful to profile the size of the OutOfLineCodes emitted by generateOutOfLineCode().
Comment 14

5 months ago
Wow, so out-of-line is pretty light then, mostly just out-of-line switch jump tables.  So the total size here is 37.7mb whereas about:memory reports 56.7mb for the total code allocation.  The two significant remaining buckets are wasm trap out-of-line paths (emitted by masm.wasmEmitTrapOutOfLineCode()) and prologue/epilogue code (emitted by GenerateFunction(Prologue|Epilogue)).  Just to make sure we've covered the whole 56.7mb, could you measure and include these as well?
The grand total is around ~49MB, so I'm wondering: what else might be taking the remaining ~7MB?

Comment 17

5 months ago
Yeah, that sounds interesting.  Other buckets: padding (functions are aligned to 16 byte boundaries, iirc), and the stubs at the end (emitted by ModuleGenerator::finishCodegen).  (Note: I _think_ about:memory is measuring MB not MiB.)
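The padding bucket can be estimated from the function sizes alone. A minimal sketch, assuming the 16-byte alignment recalled above and assuming padding is added after each function so that the next one starts aligned; the sizes in the example are made up.

```python
def padding(size, align=16):
    """Bytes of padding needed after `size` bytes to reach the next multiple of `align`."""
    return (-size) % align


def total_padding(function_sizes, align=16):
    """Total padding if every function is padded up to the alignment boundary."""
    return sum(padding(s, align) for s in function_sizes)


# Made-up function sizes in bytes, for illustration only.
example = total_padding([10, 16, 33])
```

With sizes that are roughly uniform modulo 16, the average cost is about 7.5 bytes per function, so the bucket scales with the function count rather than with code size.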
> (Note: I _think_ about:memory is measuring MB not MiB.)

Hm, looks like MiB: (Aside: TIL verbose mode gives you the exact bytes!)

Comment 19

5 months ago
D'oh!  Sorry about that; knowing njn, I had assumed MiB would've been used if MiB was meant.
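For reference, the unit mix-up is worth a few megabytes at this scale. A small sketch of the conversion, using the 56.7 figure from about:memory; rounded numbers quoted elsewhere in the thread may differ slightly, since the exact byte count is not reproduced here.

```python
BYTES_PER_MIB = 1024 * 1024
BYTES_PER_MB = 1000 * 1000


def mib_to_mb(mib):
    """Convert mebibytes to (decimal) megabytes."""
    return mib * BYTES_PER_MIB / BYTES_PER_MB


reported_mib = 56.7
as_mb = mib_to_mb(reported_mib)
gap_mb = as_mb - reported_mib  # how much is lost by reading MiB as MB
print(f"{reported_mib} MiB is about {as_mb:.2f} MB (gap of about {gap_mb:.2f} MB)")
```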
(In reply to Luke Wagner [:luke] from comment #17)

> Yeah, that sounds interesting.  Other buckets: padding (functions are
> aligned to 16 byte boundaries, iirc), and the stubs at the end (emitted by
> ModuleGenerator::finishCodegen).  (Note: I _think_ about:memory is measuring
> MB not MiB.)

That's interesting: I've left my Linux laptop at the office today and I ran the data-gathering patch on my personal Mac (including the stubs).

We're getting 3 generated stubs that add up to roughly 10MB. I'm seeing pretty bizarre numbers here, though; I have to recheck everything and re-run the numbers with the same baseline tomorrow.
Alright: I've found the origin of those bizarre numbers I was getting for some emitted ops (RTTI dark magic, fwiw :)).
There are only two stubs and they add up to around ~5MB (5122589 bytes). Those, along with the 16-byte alignment padding, should account for everything, at least... I guess. :)
Created attachment 8892121 [details]
Epic ZenGarden OpStats w/ OOL and Prologues/Epilogues/Stubs

Here's the updated table, including the stubs and the relevant byte size measurements of the compiled code merges.

All in all we're almost there in accounting for everything (the grand total is "almost" 56MB). We're still missing a couple of megabytes, as the compiled-code merge ("asmMergeWith") calls add up to exactly the 56MB we're seeing in about:memory.

Luke, do you think this may be due to the aligns?
Comment 23

5 months ago
(In reply to Michelangelo De Simone [:mds] from comment #22)
Ah, ok, so there's still 59.3MB - 56.4MB = 2.9MB hiding somewhere.  It should be possible to track this down by slicing up the remainder of CodeGenerator::generateWasm() (e.g., I see a masm.flush() call).
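The bookkeeping here is just "reported total minus the sum of measured buckets". A sketch, using only the two figures from this comment; the function name and the single "measured so far" bucket are placeholders for whatever finer-grained breakdown gets measured.

```python
def unaccounted_mb(reported_mb, buckets_mb):
    """Code size reported by about:memory minus everything measured so far."""
    return reported_mb - sum(buckets_mb.values())


# Figures from the comment above; the bucket label is a placeholder.
remaining = unaccounted_mb(59.3, {"measured so far": 56.4})
print(f"still unaccounted for: {remaining:.1f} MB")
```

Adding each newly measured bucket to the dict (padding, flush-time data, and so on) should drive the remainder toward zero.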
