Open Bug 1077027 Opened 10 years ago Updated 6 months ago

x86 atomics: Generate better code for atomic store

Status: NEW

People: (Reporter: lth, Unassigned)

References: (Blocks 2 open bugs)

Lars T Hansen [:lth] (Reporter) • Description • 10 years ago

Follow-up work on bug 979594. An atomic store is implemented as two instructions on x86: a regular store followed by a fence, which is either an "mfence" instruction (SSE2+) or a "lock add [esp], 0" instruction (older systems). Instead of those two instructions we can generate a single "xchg" instruction, which is implicitly locked when it has a memory operand. It performs the store and the barrier in one instruction and will never be any slower than the two separate instructions. Linux uses the same trick. The only reason not to perform this optimization is if the fence could be removed through other optimizations, typically memory barrier merging/elision.
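The choice described above can be sketched as a small decision function. This is only an illustrative model, not SpiderMonkey code: the function name `selectAtomicStoreSequence`, its option names, and the placeholder operands `[mem]` and `reg` are all hypothetical.

```javascript
// Hypothetical model of the instruction choice for an x86 atomic store.
// Returns the instruction sequence as an array of assembly-like strings.
function selectAtomicStoreSequence({ hasSSE2, fenceMayBeElided }) {
  if (fenceMayBeElided) {
    // Keep the store and the fence separate so that a later pass
    // (barrier merging/elision) still has a fence it can remove.
    const fence = hasSSE2 ? "mfence" : "lock add [esp], 0";
    return ["mov [mem], reg", fence];
  }
  // xchg with a memory operand is implicitly locked, so one instruction
  // performs both the store and the barrier.
  return ["xchg [mem], reg"];
}

console.log(selectAtomicStoreSequence({ hasSSE2: true, fenceMayBeElided: false }));
console.log(selectAtomicStoreSequence({ hasSSE2: false, fenceMayBeElided: true }));
```

The sketch treats "the fence could be elided" as a single flag; in a real compiler that information would come from whatever barrier-merging analysis is in place.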

Lars T Hansen [:lth] (Reporter) • Comment 1 • 10 years ago

Actually this looks like a bit of a minefield. A blog post from 2009 about the JVM notes that they have had to switch back and forth between mfence and lock;add for a fence: which of the two is faster has varied across architectures and over time. (The differences can be annoyingly large.) Part of the problem appears to be that MFENCE is (interpreted as being) loaded down with additional semantics, while LOCK;ADD is well optimized.

Blog post: https://blogs.oracle.com/dave/resource/NHM-Pipeline-Blog-V2.txt
Discussion: https://blogs.oracle.com/dave/entry/instruction_selection_for_volatile_fences

It may still be that a single XCHG is faster, though it needs a register for the garbage result and hence may incur a MOV to set things up or increase register pressure around it.
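The "garbage result" has a counterpart at the JavaScript level, which may help make the register concern concrete: Atomics.exchange hands back the previous contents of the cell, whereas Atomics.store hands back the value that was stored. A minimal sketch (the variable names are only for illustration):

```javascript
// One shared 32-bit cell.
const cell = new Int32Array(new SharedArrayBuffer(4));

cell[0] = 7;
// Atomics.store returns the value that was written.
const storeResult = Atomics.store(cell, 0, 42);

cell[0] = 7;
// Atomics.exchange returns the value that was previously in the cell.
// When exchange stands in for a store, this old value is the result
// that has to be produced somewhere and then thrown away.
const exchangeResult = Atomics.exchange(cell, 0, 42);

console.log(storeResult, exchangeResult, cell[0]);
```

In both cases the cell ends up holding the new value; only the returned value differs.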

Lars T Hansen [:lth] (Reporter) • Comment 2 • 9 years ago

Actually it would be worthwhile to expedite this. Consider this simple benchmark:

<script>
var ia = new Int32Array(new SharedArrayBuffer(1024));

function f() {
    for ( let i=0 ; i < 10000 ; i++ )
        Atomics.store(ia, 0, 0);
}

function g() {
    for ( let i=0 ; i < 1000 ; i++ )
        f();
}

var then = Date.now();
g();
var now = Date.now();
document.writeln((now - then) + "ms");
</script>

If Atomics.store is replaced by Atomics.exchange this speeds up from 137ms to 52ms on my system (late-2013 MBP, locally built Nightly from recent sources, but also in the JS shell). This is fairly dramatic. Inlining happens properly in both cases and the code is exactly as expected, as are execution counts for the hot instructions, so this is not the JIT's fault, but a hardware artifact.
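For anyone who wants to compare the two variants side by side without editing the script, here is a minimal sketch of the same benchmark that times both loops in one run. It is meant for any environment that provides SharedArrayBuffer, Atomics and console.log; the helper names (`storeLoop`, `exchangeLoop`, `timeLoop`, `runBenchmark`) are hypothetical and not part of the original test, and the timings will of course depend on the hardware and the engine.

```javascript
const ia = new Int32Array(new SharedArrayBuffer(1024));

// Inner loops: identical except for the atomic operation used.
function storeLoop(n) {
  for (let i = 0; i < n; i++) Atomics.store(ia, 0, 0);
}

function exchangeLoop(n) {
  for (let i = 0; i < n; i++) Atomics.exchange(ia, 0, 0);
}

// Run `loop(inner)` `outer` times and return the elapsed wall-clock ms.
function timeLoop(loop, outer, inner) {
  const then = Date.now();
  for (let i = 0; i < outer; i++) loop(inner);
  return Date.now() - then;
}

// Defaults match the iteration counts of the benchmark above.
function runBenchmark(outer = 1000, inner = 10000) {
  return {
    storeMs: timeLoop(storeLoop, outer, inner),
    exchangeMs: timeLoop(exchangeLoop, outer, inner),
  };
}

const result = runBenchmark();
console.log("store: " + result.storeMs + "ms, exchange: " + result.exchangeMs + "ms");
```

Running the store loop first means it also pays for JIT warm-up; swapping the order, or running each loop once before timing, is a cheap sanity check on the numbers.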

Priority: -- → P2

Lars T Hansen [:lth] (Reporter) • Comment 3 • 9 years ago

AMD FX4100 has a smaller difference (again, 64-bit code), but exchange is still about twice as fast:

Exchange: 139ms
Store: 274ms

Lars T Hansen [:lth] (Reporter) • Updated • 8 years ago

Priority: P2 → P3

Lars T Hansen [:lth] (Reporter) • Updated • 8 years ago

Assignee: lhansen → nobody

Lars T Hansen [:lth] (Reporter) • Updated • 8 years ago

Blocks: 1317626

Lars T Hansen [:lth] (Reporter) • Updated • 8 years ago

No longer blocks: shared-array-buffer

Lars T Hansen [:lth] (Reporter) • Updated • 3 years ago

Blocks: wasm-jit-bugs

BMO Automation • Updated • 2 years ago

Severity: normal → S3


Categories

(Core :: JavaScript Engine: JIT, defect, P3)
