Open Bug 1077027 Opened 5 years ago Updated 3 years ago

x86 atomics: Generate better code for atomic store


(Core :: JavaScript Engine: JIT, defect, P3)





(Reporter: lth, Unassigned)


(Blocks 1 open bug)


Followup work on bug 979594.

An atomic store is implemented as two instructions on x86: a regular store followed by a fence, which is either an "mfence" instruction (SSE2+) or a "lock add [esp], 0" instruction (older systems).

Instead of those two instructions we can generate an "xchg" instruction, which is implicitly locked.  It performs the store and the barrier in one instruction and will never be any slower than the two separate instructions.  Linux uses the same trick.

The only reason not to perform this optimization is if the fence could be removed through other optimizations, typically memory barrier merging/elision.
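
At the JS level the substitution is safe because an exchange whose result is thrown away leaves memory in the same state as a store.  A minimal sketch of that, assuming an environment where SharedArrayBuffer is enabled; the names `cell`, `storeResult`, `previous`, `viaStore` and `viaExchange` are illustrative and not from the patch queue:

```javascript
// Illustrative only: a store and a result-discarding exchange leave the
// same value in memory.  Two separate cells are used so that the two
// operations do not interfere with each other.
const cell = new Int32Array(new SharedArrayBuffer(8));

// Atomics.store returns the value that was stored.
const storeResult = Atomics.store(cell, 0, 7);
const viaStore = Atomics.load(cell, 0);

// Atomics.exchange returns the previous contents of the cell; for a
// plain store this value is simply ignored.
const previous = Atomics.exchange(cell, 1, 7);
const viaExchange = Atomics.load(cell, 1);
```

The ignored `previous` value is the old cell contents that the XCHG instruction hands back in its register operand.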
Actually this looks like a bit of a minefield.  A 2009 blog post about the JVM notes that its developers have had to switch back and forth between MFENCE and LOCK;ADD for a fence, with one being faster on some architectures at some points in time and the other being faster at other points.  (The differences can be annoyingly large.)  Part of the problem appears to be that MFENCE is (interpreted as being) loaded down with additional semantics, while LOCK;ADD is well optimized.

Blog post:

It may still be that a single XCHG is faster, though it needs a register for the garbage result and hence may incur a MOV to set things up or will increase register pressure around it.
Actually it would be worthwhile to expedite this.  Consider this simple benchmark:

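var ia = new Int32Array(new SharedArrayBuffer(1024));
function f() {
   for ( let i=0 ; i < 10000 ; i++ )
      Atomics.store(ia, 0, 0);
}
function g() {
   for ( let i=0 ; i < 1000 ; i++ )
      f();
}
// Date.now() is an assumption; the original timer expression is missing.
var then = Date.now();
g();
var now = Date.now();
document.writeln((now - then) + "ms");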

If Atomics.store is replaced by Atomics.exchange this speeds up from 137ms to 52ms on my system (late-2013 MBP, locally built Nightly from recent sources, but also in the JS shell).  This is fairly dramatic.

Inlining happens properly in both cases and the code is exactly as expected, as are execution counts for the hot instructions, so this is not the JIT's fault, but a hardware artifact.
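
For anyone who wants to repeat the comparison, the two variants can be timed side by side.  This is only a rough sketch under stated assumptions, not the harness used for the numbers in this bug: `storeLoop`, `exchangeLoop` and `time` are hypothetical helper names, Date.now() is an assumed timer, and console.log stands in for document.writeln so that it also runs outside a page.

```javascript
// Hypothetical harness: times Atomics.store against Atomics.exchange on
// the same shared cell.  Iteration counts mirror the benchmark in this bug.
const ia = new Int32Array(new SharedArrayBuffer(1024));

function storeLoop(n) {
  for (let i = 0; i < n; i++)
    Atomics.store(ia, 0, i);
}

function exchangeLoop(n) {
  for (let i = 0; i < n; i++)
    Atomics.exchange(ia, 0, i);   // result deliberately ignored
}

// Calls fn(inner) `outer` times and returns the elapsed milliseconds.
function time(fn, outer, inner) {
  const then = Date.now();
  for (let i = 0; i < outer; i++)
    fn(inner);
  return Date.now() - then;
}

const storeMs = time(storeLoop, 1000, 10000);
const exchangeMs = time(exchangeLoop, 1000, 10000);
console.log("Store: " + storeMs + "ms, Exchange: " + exchangeMs + "ms");
```

Absolute numbers will vary with hardware and build, so only the ratio between the two results is meaningful.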
Priority: -- → P2
AMD FX4100 has a smaller difference (again, 64-bit code) but exchange is still twice as fast:

  Exchange: 139ms
  Store: 274ms
Priority: P2 → P3
Assignee: lhansen → nobody
Blocks: 1317626
No longer blocks: shared-array-buffer