Use ArmV8.1 LSE for atomic instructions when available
Categories
(Core :: JavaScript: WebAssembly, defect, P3)
Tracking
()
Tracking | Status | |
---|---|---|
firefox114 | --- | fixed |
People
(Reporter: rhunt, Assigned: yury)
References
Details
Attachments
(2 files, 1 obsolete file)
An issue reported on chromium tracker [1] shows that multithreading performance of a rayon demo is worse than single threading performance.
Some new numbers, since Safari now supports wasm threads:
https://rreverser.com/wasm-bindgen-rayon-demo/
On Chrome 111.0.5528.0 (Official Build) canary (arm64):
Single thread: ~190ms
Multi thread: usually over 12s
Single thread with profiler running: ~300ms
Multi thread with profiler running: ~65ms
On Safari Technology preview 160:
Single thread: ~180ms
Multi thread: ~24ms
On Firefox 108
Single thread: ~180ms
Multi thread: usually 1-5s
So, not as bad as Chrome, but seems to be a similar bug.
[1] https://bugs.chromium.org/p/chromium/issues/detail?id=1228686#c26
Comment 1•2 years ago
|
||
The severity field is not set for this bug.
:rhunt, could you have a look please?
For more information, please visit auto_nag documentation.
Reporter | ||
Updated•2 years ago
|
Comment 2•2 years ago
|
||
The problem seems to appear only on MacOS and ARM64.
I was not able to reproduce on Linux and x86_64 with Firefox 110.0.
I couldn't reproduce as well on Android and an ARM architecture and the latest Firefox mobile.
On MacOS and ARM64 with firefox 110.0 it's not as bad as reported but multi-threads should be faster:
single-thread: around 210 ms
multi-threads: around 185 ms
Comment 3•2 years ago
|
||
I also couldn't repro it on Linux/x86_64/Fx110.0 (low end Intel), with
either Ion or baseline. I guess it's some kind of thread contention
problem, either in Rayon itself or in our jemalloc; both of those have
sometimes shown excessive contention in a multithreaded scenario
in the past. Does the Gecko profiler shed any light on the problem?
It might be worth also testing on Linux/arm64. If it's still bad on that
then it might be something arm64 specific; if not maybe it's related to
the fallback paths for contended locks (sys_futex on Linux vs whatever
MacOS does).
Modifying the demo to make it run (eg) 10 x longer would probably make
it much easier to profile.
Reporter | ||
Comment 4•2 years ago
|
||
Comment 29 from Pierre Langlois [1] indicates that using newer Armv8.1 LSE extensions resolved this issue in their testing. The theory is that the problem is locking contention with using load-linked/store-conditional loops, and that it can be resolved using newer atomics instructions.
Another data point was that JSC did not use pthreads for their locking implementation while we and V8 do.
[1] https://bugs.chromium.org/p/chromium/issues/detail?id=1228686#c29
Reporter | ||
Comment 5•2 years ago
|
||
Looking at a profile there are two functions consuming 95% of self time on my machine: 281 and 236. Here's the code for them:
(func (;281;) (type 1) (param i32 i32) (result i32)
loop ;; label = @1
i32.const 0
i32.const 1
i32.atomic.rmw.xchg offset=1066168
br_if 0 (;@1;)
end
block (result i32) ;; label = @1
local.get 1
i32.const 9
i32.ge_u
if ;; label = @2
local.get 1
local.get 0
call 100
br 1 (;@1;)
end
local.get 0
call 33
end
i32.const 0
i32.const 0
i32.atomic.rmw.xchg offset=1066168
drop
)
(func (;236;) (type 13) (param i32 i32 i32 i32) (result i32)
loop ;; label = @1
i32.const 0
i32.const 1
i32.atomic.rmw.xchg offset=1066168
br_if 0 (;@1;)
end
block ;; label = @1
block ;; label = @2
local.get 2
i32.const 9
i32.ge_u
if ;; label = @3
local.get 2
local.get 3
call 100
local.tee 2
br_if 1 (;@2;)
i32.const 0
local.set 2
br 2 (;@1;)
end
local.get 0
local.get 3
call 50
local.set 2
br 1 (;@1;)
end
local.get 2
local.get 0
local.get 1
local.get 3
local.get 1
local.get 3
i32.lt_u
select
memory.copy
local.get 0
call 41
end
i32.const 0
i32.const 0
i32.atomic.rmw.xchg offset=1066168
drop
local.get 2
)
Note the 4 occurrences of i32.atomic.rmw.xchg offset=1066168
.
Reporter | ||
Comment 6•2 years ago
|
||
I can confirm that switching to use LSE for that specific wasm instruction fixes this issue. I have a rough patch that does this, but the full patch for all instructions will be more involved.
Reporter | ||
Updated•2 years ago
|
Reporter | ||
Comment 8•2 years ago
|
||
Comment 9•2 years ago
|
||
Not sure if it helps, but I could provide patches to update VIXL to a more recent version. Unfortunately we can't update to the latest version(s), because at some point they've changed their instruction decoder approach to use hash tables, which is quite slow and leads to time-outs on the try-servers.
Comment 10•2 years ago
|
||
This commit for V8 may be interesting as well which also includes more instructions https://chromium-review.googlesource.com/c/v8/v8/+/4181032
Reporter | ||
Comment 11•2 years ago
|
||
(In reply to elmdominic from comment #10)
This commit for V8 may be interesting as well which also includes more instructions https://chromium-review.googlesource.com/c/v8/v8/+/4181032
Yeah, that's the commit I based mine off of. I think it should be a pretty good guide for a complete implementation.
(In reply to André Bargull [:anba] from comment #9)
Not sure if it helps, but I could provide patches to update VIXL to a more recent version. Unfortunately we can't update to the latest version(s), because at some point they've changed their instruction decoder approach to use hash tables, which is quite slow and leads to time-outs on the try-servers.
Interesting, that could be useful. There were already constants for the opcodes we need, like swpal
[1], but the Assembler/MacroAssembler methods were missing. If those methods are in a newer version then that would help. Otherwise, I'm not sure what our policy is for making changes to the API ourselves, but that's the route we'd need to take.
I am going to be unable to work on this more for about a week with some traveling coming up.
[1] https://searchfox.org/mozilla-central/source/js/src/jit/arm64/vixl/Constants-vixl.h#1199
Comment 12•2 years ago
|
||
I can also confirm that the V8 fix also fixes the issues I was seeing with projects in WebContainer (here https://bugs.chromium.org/p/chromium/issues/detail?id=1356099). Looking forward for this fix to also land in Firefox.
Assignee | ||
Comment 13•2 years ago
|
||
Add instructions, and update copyright. The Aarch 8.1 changes were done
after license change.
Assignee | ||
Comment 14•2 years ago
|
||
Depends on D175781
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Comment 15•2 years ago
|
||
Main workhorse instructions of the patches are: Casal_, Swpal_, Ld___al_ and St___l_ instructions of Arm 8.1 extension. These are imported in our VIXL code. New code contributions are made under new copyright, so I update it in our code as well.
Ideally, we need update GenerateAtomicOperations.py, but it looks like we can get away with just using it as alternative encoding. The https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions does not provide too much detail, but it looks like it will be compatible with the existing encodings/behavior/barriers.
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Comment 16•2 years ago
|
||
Comment 17•2 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/4e1087df4931
https://hg.mozilla.org/mozilla-central/rev/9139e0e350e1
Description
•