...especially on 32-bit, where code-memory size is more limited (bug 1345205). For a 44mb .wasm file, we're currently getting about 105mb of executable code. Bug 1334504 should help here by removing the separate "profiling" prologue/epilogues. I think a good start might be measuring aggregate size of generateOutOfLineCode() and doing per-op-type profiling if that's high. It'd also be good to sanity check that the size of stubs/glue code added between functions (by ModuleGenerator) is only a small fraction.
mbebenita observed at one point that the baseline code was 2x the size of the Ion code and with the removal of patching that can only have gotten worse. Tiering is likely to make the code management problem worse.
Ugh, that's a good point; like 3x worse. A mitigation could be to not do baseline compilation for wasm modules bigger than a certain threshold on 32-bit. Really, a 44mb .wasm module (14mb gzipped) is a bit ridiculous no matter how you slice it; Unity's done a lot more work on code size and is only 12mb (3.6mb gzipped).
And yet it's presumably these large modules that would benefit the most from baseline compilation... And we'll need an answer for debugging needs, probably? (Could be as ugly and easy as not limiting code size in Dev Ed, I suppose, but that's scant help for a dev trying to diagnose a customer problem in a release browser.)
Right, that could be a reason to increase the max-code-bytes quota further on 32-bit. For debugging, a 64-bit browser would work fine.
Another idea would be to switch baseline, in these gigantor cases, to do per-function JIT compilation. If all baseline calls went through a table (which could be the same mechanism for tiering into Ion), then that table could be pre-populated with stubs that compile when called. I still think we'd have a smoother experience (and be able to take advantage of streaming+parallel compilation) with AOT baseline, though, so probably we'd want to prefer this when not prohibited by memory.
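The table-of-stubs idea above can be sketched roughly as follows. This is only an illustration of the mechanism, not engine code: `compile_function` and `LazyCallTable` are hypothetical names, and a Python closure stands in for emitted machine code.

```python
from typing import Callable, List


def compile_function(index: int) -> Callable[[int], int]:
    # Hypothetical stand-in for a baseline compiler. A real engine would
    # emit machine code for function `index` here.
    return lambda x: x + index


class LazyCallTable:
    """Call table whose slots start out as stubs that compile on first call."""

    def __init__(self, num_functions: int) -> None:
        self.compiled_count = 0
        self.slots: List[Callable[[int], int]] = [
            self._make_stub(i) for i in range(num_functions)
        ]

    def _make_stub(self, index: int) -> Callable[[int], int]:
        def stub(arg: int) -> int:
            # First call: compile, patch the slot, then run the real code.
            code = compile_function(index)
            self.slots[index] = code
            self.compiled_count += 1
            return code(arg)

        return stub

    def call(self, index: int, arg: int) -> int:
        # Every call is indirect through the table, so a patched slot is
        # picked up by all callers without modifying them.
        return self.slots[index](arg)


table = LazyCallTable(3)
table.call(2, 10)  # compiles function 2 on demand
table.call(2, 5)   # reuses the already-compiled code
```

Only functions that are actually called pay for code memory in this scheme, which is the property that matters on 32-bit. The same slot-patching step could in principle install Ion code later, as the comment suggests for tiering.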
As an update: due to size-reductions in the .wasm file itself (bug 1341633), and improvements in codegen (bug 1338217 and bug 1334504), a 32-bit Ion build is 76mb and a 32-bit --wasm-always-baseline build is 113mb. So one of these should fit in the now-140mb quota (as increased by bug 1345205).
So I took a look at the EpicZenGarden .wasm file and it looks like the name section is left in there which takes up 8,602,868 bytes. Not super important, but useful to know when doing size comparisons. Also, I'm not quite sure which .wasm file you are looking at, mine is 39,510,398 bytes. Here's a breakdown of all the instructions we emit and the total number of bytes used to encode them:

cdq 2
cqo 4
div 10
idiv 12
sqrtsd 24
bsr 45
cvttss2si 45
andpd 55
bsf 141
jo 174
cvttsd2si 230
movq 245
xchg 258
roundsd 270
js 330
setbe 359
setge 1605
cvtsi2sd 2138
setle 2489
divsd 2557
movapd 2873
cvtsd2ss 2892
ucomisd 3002
movsxd 3009
addsd 3605
roundss 4141
subsd 4282
jnp 4872
xorpd 5533
mulsd 5706
setb 6571
sqrtss 7473
ja 9818
andps 10832
setg 12000
sar 13151
jb 13182
jp 13866
jg 15780
cvtss2sd 16047
setae 16141
pcmpeqw 18960
setl 19922
psllq 22752
cvtsi2ss 23316
divss 25684
movd 28364
seta 36983
setne 44202
jbe 45066
movsd 46027
shr 46723
sete 47028
xorps 52630
jl 54798
ucomiss 61401
jge 79826
lea 82510
or 88758
imul 88899
jle 92686
cmove 111055
ret 112293
subss 127990
movaps 153815
nop 224586
movsx 230824
movabs 233030
shl 263491
movzx 302992
addss 313032
jne 370436
xor 376958
and 448103
sub 497382
mulss 507326
test 543892
jmp 786174
cmp 836107
je 890184
jae 896890
movss 2074553
call 2777591
add 2802569
mov 22145761
Total 39219338

I didn't use any command line args, so I'm guessing it's a 64-bit Ion build on OSX. So many moves :|
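For comparing entries in the breakdown above, it can help to turn the raw byte counts into shares of the total. A minimal sketch, assuming the data is available as alternating "name bytes" tokens; the helper names `parse_flat_stats` and `top_shares` are made up for this example, and only the four largest rows from the comment are included as sample input.

```python
from typing import Dict, List, Tuple


def parse_flat_stats(text: str) -> Dict[str, int]:
    """Parse whitespace-separated 'mnemonic bytes' pairs into a dict."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating name/bytes tokens")
    return {name: int(count) for name, count in zip(tokens[::2], tokens[1::2])}


def top_shares(
    stats: Dict[str, int], total: int, n: int
) -> List[Tuple[str, int, float]]:
    """Return the n largest entries as (name, bytes, percent of total)."""
    ranked = sorted(stats.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return [(name, nbytes, 100.0 * nbytes / total) for name, nbytes in ranked]


# The four largest rows and the reported total, copied from the comment.
SAMPLE = "movss 2074553 call 2777591 add 2802569 mov 22145761"
TOTAL = 39219338

for name, nbytes, pct in top_shares(parse_flat_stats(SAMPLE), TOTAL, 4):
    print(f"{name:6s} {nbytes:>10d} {pct:5.1f}%")
```

With the figures from the comment, `mov` alone comes out at more than half of all encoded bytes.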
Created attachment 8884488 [details] Epic ZenGarden OpStats - Chart & Data Some more data related to the emitted code in ion is attached. Both the chart and the table are ordered by the average byte size per op.
Thanks! (cc'ing a few people who might be interested in this raw data) After scanning the list a few times, I can't really see anything unexpected here: the "flabby" ops (with more than ~9 bytes/op) all use an insignificant % of total bytes. (Speaking of, what was the sum total code size? It'd be nice to have a column next to totalBytes which has % of total.) The comparisons (*AndBranch) are all a bit bigger than expected, but I expect this is due to immediates. One interesting thing is that the single biggest op (at 11.2mb) is MoveGroup, which is inserted by the register allocator when spilling is necessary; given all the existing work on this, I doubt there is any low-hanging fruit here. There are quite a lot of calls here (578k), so the lack of non-volatile registers could be a significant contributing factor.
A few more ideas, at a quick look: WrapInt64ToInt32 is currently a `movl`; with care this could be optimized to a no-op in many cases, since 32-bit instructions only read the low 32 bits of their inputs. NegD/NegF would be smaller with a constant-pool load instead of materializing the constant manually. Branch immediates: the macroassembler currently always uses 32-bit immediates for forward branches. In the case of branches within individual LIR opcodes, the code generator may be able to declare that the destination is within range for an 8-bit immediate.
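To make the branch-immediate point concrete, here is a small sketch of the size difference between the two x86 conditional-jump encodings. The sizes are the standard ones (2 bytes for the rel8 form, 6 bytes for the rel32 form); the function names are made up for this example, and it assumes the target offset is already known, which is exactly what a forward branch lacks unless the code generator can promise the distance up front.

```python
JCC_REL8_SIZE = 2   # one opcode byte + 8-bit displacement
JCC_REL32_SIZE = 6  # two opcode bytes + 32-bit displacement


def fits_rel8(displacement: int) -> bool:
    """True if the displacement fits in a signed 8-bit immediate."""
    return -128 <= displacement <= 127


def jcc_size(branch_offset: int, target_offset: int) -> int:
    """Size in bytes of a conditional jump at branch_offset to target_offset.

    x86 displacements are relative to the end of the branch instruction,
    so the short form is tried first using its own 2-byte length.
    """
    short_disp = target_offset - (branch_offset + JCC_REL8_SIZE)
    return JCC_REL8_SIZE if fits_rel8(short_disp) else JCC_REL32_SIZE


# A branch over a few instructions inside one LIR op fits the short form.
near = jcc_size(branch_offset=0, target_offset=20)
far = jcc_size(branch_offset=0, target_offset=4096)
saved_per_branch = JCC_REL32_SIZE - JCC_REL8_SIZE
```

Each conditional branch that can use the short form saves 4 bytes.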
Created attachment 8885005 [details] Epic ZenGarden OpStats - Chart & Data Sheets updated with explicit percentages.
Thanks! So scanning the whole list again, it seems like all ops are in one of 3 categories: (1) insignificant % of total (<2%), (2) already optimized and thus not a likely source of low-hanging fruit (call, load, store, movegroup), (3) just super-hot and not flabby (comparisons, add). So unfortunately no clear action here, just "good job Ion!". Next, it'd be useful to profile the size of the OutOfLineCodes emitted by generateOutOfLineCode().
Created attachment 8886737 [details] Epic ZenGarden OpStats w/ OOL Updated data with OOL operations.
Wow, so out-of-line is pretty light then, mostly just out-of-line switch jump tables. So the total size here is 37.7mb whereas about:memory reports 56.7mb for the total code allocation. The two significant remaining buckets are wasm trap out-of-line paths (emitted by masm.wasmEmitTrapOutOfLineCode()) and prologue/epilogue code (emitted by GenerateFunction(Prologue|Epilogue)). Just to make sure we've covered the whole 56.7mb, could you measure and include these as well?
Created attachment 8887280 [details] Epic ZenGarden OpStats w/ OOL and Prologues/Epilogues
(In reply to Luke Wagner [:luke] from comment #14) > GenerateFunction(Prologue|Epilogue)). Just to make sure we've covered the > whole 56.7mb, could you measure and include these as well? The grand total is around ~49MB, so I'm wondering: waaaat else might be taking the remaining ~7MB??
Yeah, that sounds interesting. Other buckets: padding (functions are aligned to 16 byte boundaries, iirc), and the stubs at the end (emitted by ModuleGenerator::finishCodegen). (Note: I _think_ about:memory is measuring MB not MiB.)
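The padding bucket can be estimated from the list of function sizes. A minimal sketch, assuming each function start is rounded up to a 16-byte boundary (per the "iirc" above) and that only the gaps between functions count; `total_padding` is a hypothetical helper, not something in the tree.

```python
from typing import Iterable


def total_padding(function_sizes: Iterable[int], alignment: int = 16) -> int:
    """Sum the padding bytes needed to start every function on a boundary."""
    offset = 0
    padding = 0
    for size in function_sizes:
        # Bytes needed to round the current offset up to the next boundary.
        pad = (-offset) % alignment
        padding += pad
        offset += pad + size
    return padding


# Three functions of 10, 20 and 16 bytes: 6 bytes of padding after the
# first and 12 after the second, so 18 bytes in total.
example = total_padding([10, 20, 16])
```

Feeding the real per-function code sizes into something like this would show whether alignment alone is large enough to matter for the missing megabytes.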
> (Note: I _think_ about:memory is measuring MB not MiB.) Hm, looks like MiB: https://hg.mozilla.org/mozilla-central/annotate/e0b0865639cebc1b5afa0268a4b073fcdde0e69c/toolkit/components/aboutmemory/content/aboutMemory.js#l1600 (Aside: TIL verbose mode gives you the exact bytes!)
D'oh! Sorry about that; knowing njn, I had assumed MiB would've been used if MiB was meant.
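Since the unit matters for the accounting, here is the conversion spelled out as a small sketch. The 56.7 figure is the about:memory value quoted earlier in the thread; the exact byte count from verbose mode may differ slightly from this rounded value.

```python
MIB = 1024 * 1024  # bytes per mebibyte
MB = 1000 * 1000   # bytes per (decimal) megabyte


def mib_to_mb(value_mib: float) -> float:
    """Convert a size in MiB to decimal MB."""
    return value_mib * MIB / MB


reported_mib = 56.7
print(f"{reported_mib} MiB is about {mib_to_mb(reported_mib):.2f} MB")
```

Read as MiB, the reported total is a little over 59 decimal megabytes, so measurements summed in plain bytes have more to account for than the label first suggests.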
(In reply to Luke Wagner [:luke] from comment #17) > Yeah, that sounds interesting. Other buckets: padding (functions are > aligned to 16 byte boundaries, iirc), and the stubs at the end (emitted by > ModuleGenerator::finishCodegen). (Note: I _think_ about:memory is measuring > MB not MiB.) That's interesting: I've left my Linux laptop at the office today and I ran the data-gathering patch on my personal Mac (including the stubs). We're getting 3 generated stubs that add up to roughly 10MB. I'm seeing pretty bizarre numbers here, though; I have to recheck everything and re-run the numbers with the same baseline tomorrow.
Alright: I've found the origin of those bizarre numbers I was getting for some emitted ops (RTTI dark magic, fwiw :)). There are only two stubs and they add up to around ~5MB (5122589 bytes). Those, along with the 16-byte alignment padding, should account for everything, at least... I guess. :)
Created attachment 8892121 [details] Epic ZenGarden OpStats w/ OOL and Prologues/Epilogues/Stubs Here's the updated table, including the stubs and the relevant byte size measurements of the compiled code merges. All in all we're almost there in accounting for everything (the grand total is "almost" 56MB). We're still missing a couple of megabytes, as the compiled-code merge calls ("asmMergeWith") add up to exactly the 56MB we're seeing in about:memory. Luke, do you think this may be due to the aligns?
(In reply to Michelangelo De Simone [:mds] from comment #22) Ah, ok, so there's still a 59.3MB-56.4MB=2.9MB hiding somewhere. It should be possible to track this down by slicing up the remainder of CodeGenerator::generateWasm() (e.g., I see a masm.flush() call).