At the moment we sync() the value stack unconditionally before calls (this is simple) but there are probably many common cases where we could instead save/restore live registers and perform parallel assignment into argument registers. This may become especially important if we start keeping some locals in registers.
The sync() before function calls accounts for 50% of all syncs performed by the baseline compiler. Data today from a run of AngryBots, see bug 1316818 for code and an explanation: [Profiling] WASM BASELINE SYNC: total=377856 capacity=7 specific=10521 local=575 join=181835 call=184918
Priority: P3 → P2
Summary: Wasm baseline: Better register management around function calls → Wasm baseline: Register management and sync avoidance around function calls
I'm semi-actively working on this and would like to have it resolved for FF54. I think one can do much with little here. Getting rid of the syncs is probably not all that crucial for compiler performance because what we're going to have to do for calls is pretty similar to a sync anyway (plus sync does not really show up in the profiles on ARM, at least). However, run-time performance could be significantly affected, and this is also a blocker for any meaningful investigation of allocating locals to registers.
Assignee: nobody → lhansen
Status: NEW → ASSIGNED
Priority: P2 → P1
Created attachment 8835663 [details] [diff] [review] bug1316806-parallel-assignment-poc.patch Proof of concept. This implements a simplistic parallel assignment algorithm and uses it for call, callIndirect, and calls to math builtins. On the embenchen benchmarks, it shows a few speedups on substantial programs, this is on x64: box2d 4% binarytrees 3.6% scimark 2.7% Compilation time is likely not improved by this patch. This patch represents the simplest of two approaches: it syncs non-argument values on the eval stack to the CPU stack as before but does not sync values that will be used for arguments, so it saves on memory traffic for those values (but may have to do some extra moves to get them into registers instead). The more complicated approach would instead just spill non-argument register values and reload them after the call. Thus, we'd avoid syncing latent local values (a win) but might end up spilling and restoring values multiple times if they surround nested calls (a loss), except if we're clever (complex). Furthermore, once we keep locals in registers they'd have to be spilled anyway, adding yet more complexity. It is encouraging that there are some wins to be seen with the simple approach. However, more benchmarks are needed on more platforms, and it would be useful to also test the alternative approach. And compilation time could be an issue.
Unscientifically (just a few samples), this gains us 500ms on raybench, approximately 3.8%. Not bad. Demoting this to P3 for now. There are gains to be had here, but we'll want to do parallel assignment together with sync avoidance for if and block and together with keeping locals in registers, and that's a larger package for when we feel that baseline performance needs a boost.
Priority: P1 → P3
Per policy at https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Inactive_Bugs. If this bug is not an enhancement request or a bug not present in a supported release of Firefox, then it may be reopened.
Status: ASSIGNED → RESOLVED
Last Resolved: 3 months ago
Resolution: --- → INACTIVE
Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---
You need to log in before you can comment on or make changes to this bug.