Closed Bug 1809189 Opened 1 year ago Closed 1 year ago

Use ArmV8.1 LSE for atomic instructions when available

Categories

(Core :: JavaScript: WebAssembly, defect, P3)

ARM64
macOS
defect

Tracking

()

RESOLVED FIXED
114 Branch
Tracking Status
firefox114 --- fixed

People

(Reporter: rhunt, Assigned: yury)

References

Details

Attachments

(2 files, 1 obsolete file)

An issue reported on chromium tracker [1] shows that multithreading performance of a rayon demo is worse than single threading performance.

Some new numbers, since Safari now supports wasm threads:

https://rreverser.com/wasm-bindgen-rayon-demo/

On Chrome 111.0.5528.0 (Official Build) canary (arm64):

Single thread: ~190ms
Multi thread: usually over 12s
Single thread with profiler running: ~300ms
Multi thread with profiler running: ~65ms

On Safari Technology preview 160:

Single thread: ~180ms
Multi thread: ~24ms

On Firefox 108

Single thread: ~180ms
Multi thread: usually 1-5s

So, not as bad as Chrome, but seems to be a similar bug.

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=1228686#c26

See Also: → 1809405

The severity field is not set for this bug.
:rhunt, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(rhunt)
Severity: -- → S3
Flags: needinfo?(rhunt)

The problem seems to appear only on MacOS and ARM64.

I was not able to reproduce on Linux and x86_64 with Firefox 110.0.
I couldn't reproduce as well on Android and an ARM architecture and the latest Firefox mobile.

On MacOS and ARM64 with firefox 110.0 it's not as bad as reported but multi-threads should be faster:
single-thread: around 210 ms

multi-threads: around 185 ms

I also couldn't repro it on Linux/x86_64/Fx110.0 (low end Intel), with
either Ion or baseline. I guess it's some kind of thread contention
problem, either in Rayon itself or in our jemalloc; both of those have
sometimes shown excessive contention in a multithreaded scenario
in the past. Does the Gecko profiler shed any light on the problem?

It might be worth also testing on Linux/arm64. If it's still bad on that
then it might be something arm64 specific; if not maybe it's related to
the fallback paths for contended locks (sys_futex on Linux vs whatever
MacOS does).

Modifying the demo to make it run (eg) 10 x longer would probably make
it much easier to profile.

Comment 29 from Pierre Langlois [1] indicates that using newer Armv8.1 LSE extensions resolved this issue in their testing. The theory is that the problem is locking contention with using load-linked/store-conditional loops, and that it can be resolved using newer atomics instructions.

Another data point was that JSC did not use pthreads for their locking implementation while we and V8 do.

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=1228686#c29

Looking at a profile there are two functions consuming 95% of self time on my machine: 281 and 236. Here's the code for them:

(func (;281;) (type 1) (param i32 i32) (result i32)
    loop  ;; label = @1
      i32.const 0
      i32.const 1
      i32.atomic.rmw.xchg offset=1066168
      br_if 0 (;@1;)
    end
    block (result i32)  ;; label = @1
      local.get 1
      i32.const 9
      i32.ge_u
      if  ;; label = @2
        local.get 1
        local.get 0
        call 100
        br 1 (;@1;)
      end
      local.get 0
      call 33
    end
    i32.const 0
    i32.const 0
    i32.atomic.rmw.xchg offset=1066168
    drop
  )
(func (;236;) (type 13) (param i32 i32 i32 i32) (result i32)
    loop  ;; label = @1
      i32.const 0
      i32.const 1
      i32.atomic.rmw.xchg offset=1066168
      br_if 0 (;@1;)
    end
    block  ;; label = @1
      block  ;; label = @2
        local.get 2
        i32.const 9
        i32.ge_u
        if  ;; label = @3
          local.get 2
          local.get 3
          call 100
          local.tee 2
          br_if 1 (;@2;)
          i32.const 0
          local.set 2
          br 2 (;@1;)
        end
        local.get 0
        local.get 3
        call 50
        local.set 2
        br 1 (;@1;)
      end
      local.get 2
      local.get 0
      local.get 1
      local.get 3
      local.get 1
      local.get 3
      i32.lt_u
      select
      memory.copy
      local.get 0
      call 41
    end
    i32.const 0
    i32.const 0
    i32.atomic.rmw.xchg offset=1066168
    drop
    local.get 2
  )

Note the 4 occurrences of i32.atomic.rmw.xchg offset=1066168.

I can confirm that switching to use LSE for that specific wasm instruction fixes this issue. I have a rough patch that does this, but the full patch for all instructions will be more involved.

Summary: Performance issue with rayon demo → Use ArmV8.1 LSE for atomic instructions when available
Duplicate of this bug: 1809405

Not sure if it helps, but I could provide patches to update VIXL to a more recent version. Unfortunately we can't update to the latest version(s), because at some point they've changed their instruction decoder approach to use hash tables, which is quite slow and leads to time-outs on the try-servers.

This commit for V8 may be interesting as well which also includes more instructions https://chromium-review.googlesource.com/c/v8/v8/+/4181032

(In reply to elmdominic from comment #10)

This commit for V8 may be interesting as well which also includes more instructions https://chromium-review.googlesource.com/c/v8/v8/+/4181032

Yeah, that's the commit I based mine off of. I think it should be a pretty good guide for a complete implementation.

(In reply to André Bargull [:anba] from comment #9)

Not sure if it helps, but I could provide patches to update VIXL to a more recent version. Unfortunately we can't update to the latest version(s), because at some point they've changed their instruction decoder approach to use hash tables, which is quite slow and leads to time-outs on the try-servers.

Interesting, that could be useful. There were already constants for the opcodes we need, like swpal [1], but the Assembler/MacroAssembler methods were missing. If those methods are in a newer version then that would help. Otherwise, I'm not sure what our policy is for making changes to the API ourselves, but that's the route we'd need to take.

I am going to be unable to work on this more for about a week with some traveling coming up.

[1] https://searchfox.org/mozilla-central/source/js/src/jit/arm64/vixl/Constants-vixl.h#1199

I can also confirm that the V8 fix also fixes the issues I was seeing with projects in WebContainer (here https://bugs.chromium.org/p/chromium/issues/detail?id=1356099). Looking forward for this fix to also land in Firefox.

Assignee: nobody → ydelendik
See Also: → 1442534

Add instructions, and update copyright. The Aarch 8.1 changes were done
after license change.

Attachment #9329075 - Attachment description: WIP: Bug 1809189 - Update vixl with 8.1 atomic memory instructions. → Bug 1809189 - Update vixl with 8.1 atomic memory instructions. r?rhunt
Attachment #9329540 - Attachment description: WIP: Bug 1809189 - ARM64: Use 8.1 atomic instruction for atomic ops. → Bug 1809189 - ARM64: Use 8.1 atomic instruction for atomic ops. r?rhunt

Main workhorse instructions of the patches are: Casal_, Swpal_, Ld___al_ and St___l_ instructions of Arm 8.1 extension. These are imported in our VIXL code. New code contributions are made under new copyright, so I update it in our code as well.

Ideally, we need update GenerateAtomicOperations.py, but it looks like we can get away with just using it as alternative encoding. The https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions does not provide too much detail, but it looks like it will be compatible with the existing encodings/behavior/barriers.

Attachment #9329540 - Attachment description: Bug 1809189 - ARM64: Use 8.1 atomic instruction for atomic ops. r?rhunt → Bug 1809189 - ARM64: Use 8.1 atomic instruction for atomic ops. r?jseward
Attachment #9329075 - Attachment is obsolete: true
Attachment #9329075 - Attachment is obsolete: false
Attachment #9321963 - Attachment is obsolete: true
Pushed by ydelendik@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4e1087df4931
Update vixl with 8.1 atomic memory instructions. r=jseward
https://hg.mozilla.org/integration/autoland/rev/9139e0e350e1
ARM64: Use 8.1 atomic instruction for atomic ops. r=jseward
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 114 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: