Closed Bug 1133738 Opened 5 years ago Closed 3 years ago

Support int64 atomics

Categories

(Core :: JavaScript Engine, defect)

Tracking

()

RESOLVED WONTFIX

People

(Reporter: lth, Unassigned)

References

(Blocks 1 open bug)

Details

Requirements for int64 atomics:

- should be usable from asm.js
- should have only moderate per-operation overhead

The requirements pretty much prohibit using objects of any kind to represent int64 values.  The remaining candidates are using explicit typed-array (TA) locations for the result, or obtaining the second part of the return value by a second function call.  I've gone with the latter.

A polyfill that can be replaced by fast native code is here: https://github.com/lars-t-hansen/parlib-simple/blob/master/src/int64atomics.js

(A native implementation will not need the extra two "coordination" arguments to all the methods and it might be useful to try to get rid of them for the polyfill too.)
Do we currently need this for the C++ ports?  I'd assume they would be mostly content with word-sized atomics and Emscripten compiles with a 32-bit word.

If there isn't a pressing need, I think we should try to build on value types and use an 'int64' value type as the argument to the atomic ops.  By comparison, SIMD.int32x4 is a value type and the whole point is that int32x4 only ever lives in scalar locals and is never GC-allocated.  This does introduce a performance cliff if your handwritten JS falls off the happy path, but the same problem exists for SIMD and the plan is to annotate all value type slow paths so that they show up clearly (with informative explanations) in the Jit Coach profiler view.  Of course with asm.js we'd require that int64 was only used in ways that avoid GC allocation.  This has been the rough plan for int64-in-asm.js for a while; it just hasn't been a high priority because int64 in C++ is generally not a high priority (though it does show up).

To wit, the int64-via-magic-global-property has been discussed before and people were not big fans of the magic property:
  https://esdiscuss.org/topic/proposal-for-efficient-64-bit-arithmetic-without-value-objects
(In reply to Luke Wagner [:luke] from comment #1)

> Do we currently need this for the C++ ports?  I'd assume they would be
> mostly content with word-sized atomics and Emscripten compiles with a 32-bit
> word.

Clearly Emscripten can generate code that implements 64-bit atomics by means of a spinlock, as the polyfill does.  But C++ does have 64-bit int atomics and Jukka brought it up recently when we were discussing float64 atomics that it would be useful to have good support: https://bugzilla.mozilla.org/show_bug.cgi?id=1131624#c2.  A native implementation will probably have significantly better performance than a spinlock solution.  Whether it's truly important, I don't know.

I'm not all that fond of the magic getter myself (Math.H is a much nicer name than mine though, I'll have to steal that).  I would have preferred multiple return values or a value type.  However, TC39 has failed to act on multiple return values for a decade, so far as I know, and value types...  Until something happens with those I'll probably keep exploring this space.
(In reply to Lars T Hansen [:lth] from comment #2)
> However, TC39 has failed to act on
> multiple return values for a decade, so far as I know, and value types... 

I think SIMD.js is making progress in TC39, with multiple browsers interested.  That would give us the first concrete value types, which should pave the way for int64.

> Until something happens with those I'll probably keep exploring this space.

Yeah, I don't mean to curtail exploration, but it'd be nice to have some measurements saying we need this before landing anything, even #ifdef NIGHTLY.
Jukka brought up int64 atomics a couple of times recently, since a partner of ours has code that uses these "a lot" (my paraphrasing), and right now, the only way to implement them is with a spinlock.

I get the sense Jukka uses a global or sharded spinlock, which is ugly but possibly unavoidable, depending on code base assumptions and/or memory pressure.  In principle an atomic<int64_t> can be implemented with a spinlock that is private to that atomic.  Not that that is necessarily performant.
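
To make the "spinlock that is private to that atomic" idea concrete, here is a minimal sketch in plain JS.  Everything in it is an assumption for illustration: the class name `SpinlockedInt64`, the four-word layout, and the fact that it returns a small object (which the requirements above would rule out for asm.js).  It is not what Emscripten does.

```javascript
// Hypothetical layout: each 64-bit cell occupies four int32 slots.
//   [base + 0] lock word (0 = free, 1 = held)
//   [base + 1] unused padding
//   [base + 2] low 32 bits
//   [base + 3] high 32 bits
class SpinlockedInt64 {
  constructor(i32, base) {
    this.i32 = i32;   // Int32Array, normally over a SharedArrayBuffer
    this.base = base; // index of this cell's private lock word
  }

  _acquire() {
    // Spin until we are the one who changes the lock word from 0 to 1.
    while (Atomics.compareExchange(this.i32, this.base, 0, 1) !== 0) { /* spin */ }
  }

  _release() {
    Atomics.store(this.i32, this.base, 0);
  }

  load() {
    this._acquire();
    const value = { lo: this.i32[this.base + 2], hi: this.i32[this.base + 3] };
    this._release();
    return value;
  }

  store(hi, lo) {
    this._acquire();
    this.i32[this.base + 2] = lo;
    this.i32[this.base + 3] = hi;
    this._release();
  }

  // Returns the old value; the new value is written only if the old one matched.
  compareExchange(expectedHi, expectedLo, newHi, newLo) {
    this._acquire();
    const old = { lo: this.i32[this.base + 2], hi: this.i32[this.base + 3] };
    if (old.lo === (expectedLo | 0) && old.hi === (expectedHi | 0)) {
      this.i32[this.base + 2] = newLo;
      this.i32[this.base + 3] = newHi;
    }
    this._release();
    return old;
  }
}
```

The cost is visible in the layout: every 8-byte value drags 8 more bytes of lock and padding with it, which is presumably the memory pressure that pushes people toward a global or sharded lock.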

If we are going to move on int64 atomics before we have value types, then in addition to the requirements listed above we have two more:

 - should not get in the way of using the existing atomic methods on future int64 value types
 - should be compatible (atomically speaking) with those existing methods on those future types

After racking my brain over the weekend, I find the two API options on the table are still: using explicit TA locations for the result, or having an accessor somewhere to obtain part of the result, as for Math.H.  The TA locations (if they are in the shared heap, which I think is the only option for asm.js, since that's the only TA it can talk about) must be thread-local, which is awkward: it means passing an index around to identify the location for a given thread.

Thus I think the best API for the non-value-type case is still:

   Atomics.load64(u8Array, aligned8Index) => resultLo
   Atomics.H => resultHi

   Atomics.store64(u8Array, aligned8Index, hiBits, loBits) => void

   Atomics.compareExchange64(u8Array, aligned8Index, expectedHi, expectedLo, newHi, newLo) => resultLo
   Atomics.H => resultHi
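
For illustration only, the shape of that API can be sketched as a spinlock-based polyfill.  The factory `makeAtomics64`, its two lock arguments, and the choice to put the low word at the lower address are all assumptions made for this sketch; a native implementation would hang these methods off `Atomics` and would need no lock at all.

```javascript
// lockI32/lockIndex name a lock word in shared memory that every worker agrees on
// (ASSUMPTION: one global lock, purely to keep the sketch short).
function makeAtomics64(lockI32, lockIndex) {
  let hi = 0; // worker-local: the high half never lives in the shared array

  const acquire = () => {
    while (Atomics.compareExchange(lockI32, lockIndex, 0, 1) !== 0) { /* spin */ }
  };
  const release = () => { Atomics.store(lockI32, lockIndex, 0); };

  // View the 8 bytes at aligned8Index as two int32 words, low word first.
  // A real implementation would avoid allocating a view per call.
  const words = (u8Array, aligned8Index) => {
    if (aligned8Index % 8 !== 0) throw new RangeError("index must be 8-byte aligned");
    return new Int32Array(u8Array.buffer, u8Array.byteOffset + aligned8Index, 2);
  };

  return {
    load64(u8Array, aligned8Index) {
      const w = words(u8Array, aligned8Index);
      acquire();
      const lo = w[0];
      hi = w[1];
      release();
      return lo;
    },
    // Read immediately after load64 or compareExchange64 to get the high half.
    get H() { return hi; },
    store64(u8Array, aligned8Index, hiBits, loBits) {
      const w = words(u8Array, aligned8Index);
      acquire();
      w[0] = loBits;
      w[1] = hiBits;
      release();
    },
    compareExchange64(u8Array, aligned8Index, expectedHi, expectedLo, newHi, newLo) {
      const w = words(u8Array, aligned8Index);
      acquire();
      const oldLo = w[0], oldHi = w[1];
      if (oldLo === (expectedLo | 0) && oldHi === (expectedHi | 0)) {
        w[0] = newLo;
        w[1] = newHi;
      }
      release();
      hi = oldHi;
      return oldLo;
    },
  };
}
```

A caller would write `lo = A.load64(heap, p) | 0; hi = A.H | 0;` with nothing in between, which is the pattern a JIT could in principle recognize.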

I'm not going to move on this, though, until we have more data:

Jukka, when you have time it would be useful to know more about what you've implemented (precisely) and how hot those functions are getting, any numbers at all would probably be a good start at this point.  If it looks plausible that it could be a performance bottleneck I can put together a prototype of the above API that you can experiment with; if it improves performance we can then discuss it further.
In a benchmark I did last week on 8 threads on a computer with 8 hardware cores (16 HT), I am seeing atomic 64-bit loads take 7.5% of total execution time. Here is a bottom-up call profile: http://clb.demon.fi/emcc/pthreads/WebGLBenchmark_PhysicsMeshes_8threads.png . Atomic u64 ops are used in lockless queue and stack implementations, as well as in a job queue implementation where the main thread submits jobs to a pool of job-executing threads.

The implementation of the u64 load and other ops is here: https://github.com/juj/emscripten/commit/69eda4c318aeafd368b68bee3ff923df1eb0367b . Memory is divided into multiple interleaved regions, each guarded by its own spinlock, to relieve contention. The calls emscripten_atomic_exchange_u32() and emscripten_atomic_store_u32() in the SPINLOCK_ACQUIRE() and SPINLOCK_RELEASE() macros become direct asm.js Atomics calls. The JavaScript compiled version looks something like this:

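function _emscripten_atomic_load_u64(i2) {
 i2 = i2 | 0; // i2: byte address of the 64-bit value
 var i1 = 0, i3 = 0;
 i1 = 4481984 + ((i2 >>> 3 & 255) << 2) | 0; // i1: address of one of 256 spinlocks, picked from the address bits
 // Acquire: spin until the lock word was 0 and we swapped in 1.
 do i3 = Atomics_load(HEAP32, i1 >> 2); while (!((i3 | 0) == (Atomics_compareExchange(HEAP32, i1 >> 2, i3, 1) | 0) & (i3 | 0) == 0));
 i3 = HEAP32[i2 >> 2] | 0; // low word
 i2 = HEAP32[i2 + 4 >> 2] | 0; // high word
 Atomics_store(HEAP32, i1 >> 2, 0); // release the spinlock
 tempRet0 = i2; // high word goes out through a global
 return i3 | 0; // low word is the return value
}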
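
Pulled out of the compiled code, the lock selection can be sketched as below.  The names `LOCK_TABLE_BASE`, `LOCK_COUNT` and `lockAddressFor` are made up for this sketch; the constants are read off the listing above.

```javascript
const LOCK_TABLE_BASE = 4481984; // byte address of the lock table in the listing above
const LOCK_COUNT = 256;          // implied by the "& 255" mask

// Map the address of a 64-bit value to the byte address of the spinlock guarding it.
function lockAddressFor(addr) {
  const shard = (addr >>> 3) & (LOCK_COUNT - 1); // drop the 3 alignment bits, keep the next 8
  return LOCK_TABLE_BASE + (shard << 2);         // each lock is one 4-byte word
}
```

Neighbouring 8-byte values get different locks, and two values contend on the same lock only if their addresses are a multiple of 2048 bytes apart.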

In my experience thread-local storage is very slow (in native Win32 at least), so I am not sure I like the TLS-based Atomics.H interface. But perhaps asm.js/ionmonkey codegen could be made to detect the pattern and omit any actual TLS ops?
Thanks, that's very helpful.  7.5% is a rather large number IMO; it'll be worth it to try to improve that.

When I say thread-local storage, all I mean is that the value needs to be stored somewhere not in the shared array.  The obvious-ish idea is to store it in the JSRuntime or even on a worker-local runtime that is closer to the execution path, if there is one.  I would also eventually like to consider whether there are optimizations available, so that a reference to Atomics.H immediately after Atomics.load64 would simply reference the appropriate output register of the CMPXCHG8B operation, or, if Atomics.H is observed "nearby", a local-variable shadow copy could be allocated for that result and just read, if access to the runtime is too expensive.
I guess with JSRuntime being shared among tabs, we'll need something slightly more sophisticated.  But the idea is still to do something fast, not to get involved with true TLS.
Jukka: by any chance, can those 64bit loads/stores be typedef'd to 32bit for Emscripten builds?  That is, is there a real need for 64 bits on a virtual architecture that only has 32-bit pointers?

Given that SIMD.js also blocks on value types, and recent affirmations that int64/value objects are coming:
  http://brendaneich.github.io/ModernWeb.tw-2015/#72
and the last reaction to a similar magic-property proposal:
  https://esdiscuss.org/topic/proposal-for-efficient-64-bit-arithmetic-without-value-objects
I'd say it's better to just push for int64/value objects with this as a driving use case.
My take on what Jukka wrote is that the program is already being compiled in 32-bit mode and uses 64-bit atomics as a stand-in for double-compare-and-swap to implement lockless data structures, I would assume by keeping a generation number to avoid ABA problems.
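
The generation-number trick can be sketched as follows.  This is a sketch under loud assumptions: it uses `BigUint64Array` with `Atomics.compareExchange`, which JavaScript only gained after this thread, and the helper names (`pack`, `ptrOf`, `tagOf`, `casTagged`) are invented here.  It shows why a 32-bit program wants a 64-bit CAS, not how Jukka's code is written.

```javascript
// Pack a 32-bit "pointer" (low half) and a 32-bit generation tag (high half) into one word.
const pack = (ptr, tag) => (BigInt(tag >>> 0) << 32n) | BigInt(ptr >>> 0);
const ptrOf = (word) => Number(word & 0xffffffffn);
const tagOf = (word) => Number(word >> 32n);

// Swing the pointer only if neither pointer nor tag changed since seenWord was read.
// The tag is bumped on every success, so "A, then B, then A again" is detected.
function casTagged(u64, index, seenWord, newPtr) {
  const next = pack(newPtr, (tagOf(seenWord) + 1) >>> 0);
  return Atomics.compareExchange(u64, index, seenWord, next) === seenWord;
}
```

With a 32-bit CAS on the pointer alone, a thread that read pointer A, stalled, and woke up after others changed it to B and back to A would succeed incorrectly; with the tag alongside, its stale word no longer matches.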
Luke: unfortunately not. I think the extra 32 bits are used as tags to avoid the ABA problem; they are not for 64-bit memory addressing. I'd be really happy to get the work towards int64 value objects actually started in Emscripten sooner rather than later, since most profiles of Emscripten apps tend to contain some amount of time spent in emulated i64 ops, and some applications are prohibitively slow in Emscripten due to this. JSMESS Jaguar CPU (iirc) emulation, Ogg Tremor and a custom 3rd party partner script language come to mind. Also a bit funny: SIMD.js is already adding code for an int64x2 type before having scalar int64 support functional. Perhaps we could just write the Atomics specification against the upcoming 64-bit int types already now?
Both are, I believe, aimed at ES7, so it would make sense.
SIMD.js is dead, TypedObjects are in a holding pattern, neither will be in ES2017 from what I can tell.  Let's revisit this later, perhaps in light of WebAssembly and the patterns that evolve in interacting between JS and WebAssembly.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
Makes sense. Fwiw, 64-bit atomics are extremely critical for us, they're used often in libraries, and currently we resort to emulating them with spinlocks to global memory in Emscripten.