Bug 1071024 implemented LDREX, STREX, and DMB in the simulator but in a very low-fidelity way: they do not actually synchronize. But now that we have the C++ abstractions for synchronization (in jit/Atomic-operations.h) we may be able to do something about that, or we may be able to add more C++ abstractions for it. This is actually important: Once we land test cases for multiprocessing in the DOM test suite (bug 1135026) the simulator will need to do something sane for these operations, or the tests will fail for sure. Probably we are only going to run the simulator on x86, so it is good enough to come up with a scheme that works on that architecture. For example, it is probably good enough for DMB to be an MFENCE and for LDREX and STREX to use a LOCK prefix, and then we need to remember the value loaded by LDREX and implement STREX as a CMPXCHG as outlined in bug 1071024. There are some issues with that but for the kind of code we are going to be generating it will be good enough. (Not sure what the ARM64 simulator does yet - I have a memory it does something useful. Should look to that for ideas; or open a separate bug for it if it does not implement LDREX/STREX in an interesting way.)
The ARM64 simulator simulates the local and global monitors (with simulated failures) and does some limited barriering for the acquire/release atomics and in one other case, but does not appear to actually synchronize properly for the purposes of multithreaded computation. It may be sufficient to replace the store code with an atomic compareExchange. But for multithreaded computation we really don't want to use the simulated "failed" monitor access in the ARM64 simulator, we want something closer to what I propose for the 32-bit simulator. The simulated failures may still be desirable; I'm not sure. See VisitLoadStoreExclusive() in jit/arm64/vixl/Simulator-vixl.cpp.
Created attachment 8671878 [details] [diff] [review] bug1211409-ldrex-strex.patch Basically a complete implementation, good enough for x86. This needs a little further work as follows: - Test cases that test mutual exclusion by using evalInWorker and atomics in actual shared memory. - The atomic operations used here (load, compareExchange) should be Relaxed but are currently SeqCst because we have no Relaxed operations in the AtomicOperations suite. I'm not sure if I'm going to worry too much about this; the implementation burden for AtomicOperations is already considerable and we don't need the Relaxed atomics anywhere else ATM. - Clang on 32-bit x86 does not implement 8-byte atomic load and compareExchange. Currently I've commented that code out since we don't generate LDREXD or STREXD yet. I could just make the failure permanent, until we need them. Jakob also points out that the store-exclusive only will fail on x86 if the value in the cell has changed, not if it has been updated to the same value - this in contrast to ARM, where the latter type of store also causes a conflict. That could possibly be fixed by using a sequence number - this is just an ABA problem - but that means we'll need a two-nonadjacent-word CAS operation, which is messy. Also, it is probably overkill for the simulator.
Jakob further points out that this simulation re-introduces the ABA problem to ARM, which they have worked so hard to avoid on that architecture :) Presumably that will break some programs, though it will not break any JS programs and should not be a problem for the browser...
Created attachment 8671992 [details] [diff] [review] bug1211409-ldrex-strex.patch Cleaner code + test case. (I have an idea for how to solve the ABA problem but I'll file that as a followup bug.)
Comment on attachment 8671992 [details] [diff] [review] bug1211409-ldrex-strex.patch Looks good. Is there a corresponding bug for arm64?
Comment on attachment 8671992 [details] [diff] [review] bug1211409-ldrex-strex.patch Review of attachment 8671992 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jit-test/tests/atomics/mutual-exclusion.js @@ +1,2 @@ > +// Let a few threads hammer on memory with atomics to make sure mutual > +// exclusion works properly. Should we add a comment here to the effect that this test /may/ sometimes pass even when there is a bug in the atomics implementation? @@ +84,5 @@ > + } > + return val; > +})(); > + > +console.log("This test takes a little while to run"); How long does it take? ::: js/src/jit/arm/Simulator-arm.cpp @@ +1489,5 @@ > + exclusiveMonitorHeld_ = true; > +} > + > +uint64_t > +Simulator::exclusiveMonitorGetAndClear(bool* held) The name suggests that this function clears the exclusive monitor, but the function body does not seem to do that. @@ +1532,5 @@ > +// operations anywhere else in the system, and the distinction is not > +// important to the simulation at the level where we're operating. > + > +#define loadRelaxed loadSeqCst > +#define compareExchangeRelaxed compareExchangeSeqCst If we do implement the relaxed versions one day, these #defines would silently disable them in this file. Perhaps use inline functions instead? That would cause compiler errors when these function are added to AtomicOperations.h. @@ +1626,5 @@ > + uint16_t value = AtomicOperations::loadRelaxed(ptr); > + exclusiveMonitorSet(value); > + return value; > + } > + printf("Unaligned unsigned halfword read at 0x%08x, pc=%p\n", addr, instr); "Unaligned atomic unsigned halfword... @@ +4319,2 @@ > case 4: // DSB > + // This implementation of DSB is incorrect, but we do not use DSB. Should we assert that the instruction never appears, then? @@ +4322,2 @@ > case 6: // ISB > + // This implementation of ISB is correct only for coherent-I/D-cache systems such as x86. We use ISB after generating jitted code, but I wonder if it would ever appear in code running in the simulator?
The ARM64 case is bug 1214199. In addition to the weaknesses pointed out in comment 6 (the failure to clear the monitor was embarrassing) we should double check either that the simulation is platform-independent (and I think it is) or that we assert on hardware where it is incorrect, if the atomic opcodes are used.
This has caused spidermonkey bustage
https://hg.mozilla.org/integration/mozilla-inbound/rev/315e05e75b350f1aa3b54dd75f8c95678b78b445 Bug 1211409 - fix include order ON CLOSED TREE.
The fix is in.
sorry had to back this out for test failures like https://treeherder.mozilla.org/logviewer.html#?job_id=15609395&repo=mozilla-inbound
I think this can land; the only failure in that run is SM-tc (arm) on Linux64 Opt, it looks like an infra issue and is not even built by default, as far as I can tell from treeherder.
(In reply to Lars T Hansen [:lth] from comment #15) > I think this can land; the only failure in that run is SM-tc (arm) on > Linux64 Opt, it looks like an infra issue and is not even built by default, > as far as I can tell from treeherder. Bug 1208118 - SpiderMonkey SM-tc try jobs in TaskCluster failing to find proper compiler