Closed Bug 869606 Opened 12 years ago Closed 5 years ago

OdinMonkey: combine overflow and alignment tests on array load / store, on x86

Categories

(Core :: JavaScript: WebAssembly, defect)

x86
All
defect
Not set
normal

Tracking


RESOLVED WONTFIX

People

(Reporter: bbouvier, Assigned: bbouvier)

Details

Attachments

(1 file, 7 obsolete files)

When doing a heap load or store in asm.js on x86 (for instance: array32[i >> 2] = x), the generated code contains a bitwise AND to clear the low bits of the index (here, the 2 low bits), followed by a comparison to check that the access does not overflow the heap:

    0xf7fd3011: and $0xfffffffc,%eax   // clear the two low bits of eax
    0xf7fd3018: cmp $0x4000,%eax       // in bounds if eax < heap length (here 16384)
    0xf7fd301e: jae 0xf7fd302a

These two operations can be combined into a single one, using a test instead of and + cmp. The test mask has to combine both the clearing mask of the AND and the overflow check of the CMP. For instance, the above code can be combined into:

    0xf7fd3011: test $0xffffc003,%eax
    0xf7fd301e: jne 0xf7fd302a

This saves one assembly instruction per heap access, which could lead to performance improvements.
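[Editor's note] A minimal standalone sketch, not taken from the patch, of how the combined mask is built and what the single test accepts; the function name and the concrete constants (16 KiB heap, Int32 access) are illustrative assumptions:

    // Hypothetical illustration, not SpiderMonkey code.
    #include <cassert>
    #include <cstdint>

    // The fast path accepts an access iff no bit of the combined mask is set,
    // i.e. iff the index is both aligned and in bounds.
    static bool accessIsFast(uint32_t index, uint32_t heapLength, uint32_t alignMask)
    {
        assert(heapLength != 0 && (heapLength & (heapLength - 1)) == 0); // power of two
        uint32_t testMask = ~(heapLength - 1) | alignMask;
        return (index & testMask) == 0;
    }

    int main()
    {
        // 16 KiB heap, 4-byte (Int32) accesses: the combined mask is 0xFFFFC003.
        assert(accessIsFast(0x1000, 0x4000, 0x3));   // aligned, in bounds
        assert(!accessIsFast(0x1001, 0x4000, 0x3));  // unaligned
        assert(!accessIsFast(0x4000, 0x4000, 0x3));  // out of bounds
        return 0;
    }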
Attached patch v1 (obsolete) — Splinter Review
A first working version. I'm not quite sure about the modification in AsmJS.cpp:CheckArrayAccess though: is the pointerDef the one that has to be saved instead of the MBitAnd? Benchmarks coming soon.
Attachment #746664 - Flags: feedback?(luke)
Instead of posting all the benchmarks, I will just summarize them. The AWFY asm.js tests (both micro-benchmarks and apps) were run 5 times on the same machine, on the unpatched and the patched build. Results on asm-microbenchmarks show improvements for 4 of them (fannkuch: 5.5%, skinning: 7%, fasta: 4%, copy: 3.5%) and regressions for 4 of them (life: 1%, memops: 1.5%, corrections: 16%, primes: 7%). Also, the microbenchmarks run way faster in debug mode than in release mode, at least on my machine. As the microbenchmarks use tiny loops with few instructions, the impact of instruction alignment might not be negligible. Moreover, as we remove one low-level instruction (the AND), we expect overall performance improvements, so the regressions observed on the microbenchmarks may well be noise. Results on asm-apps show improvements from 8 to 15%, on longer test runs. These results are more meaningful and fit what we would expect, as there is no regression.
Douglas/Marty: can the optimization be easily extended to ARM?
I think there is an ARM instruction to clear a range of bits and this could be used for a single instruction test. However, isn't the masking operation still needed, as the index still needs to be masked?
Comment on attachment 746664 [details] [diff] [review]
v1

Review of attachment 746664 [details] [diff] [review]:
-----------------------------------------------------------------

Great start! I think there is one important bug, however: if there is an unaligned access, then the patch as is will jump to the out-of-line path, which will return 0 or NaN for loads or ignore the store. What needs to happen is that the pointer is aligned (by masking off the bits) and the load/store is performed with this aligned pointer. The most obvious way I can think of is for the out-of-line path to be:

    testl index, #alignment-plus-out-of-bounds-mask
    jnz out-of-line-path
  before-load:
    mov dst, #base-address(index)
  rejoin:
    ... next instruction

  out-of-line-path:
    and index, #alignment-mask
    cmpl index, #heap-length
    jb before-load
    move 0 or NaN to dst
    j rejoin

This means saving the index Register and before-load Label in the OutOfLine object. (This is currently shared with visitLoadTypedArrayElementStatic, so you'll want to fork off your own AsmJS version.)

Note: the same also goes for the store path: an unaligned store needs to store at the masked location.

I think this should have been caught by running 'js/src/jit-tests/jit_tests.py $(SHELL) asm.js'; if it wasn't, could you add tests (to jit-tests/tests/asm.js/testHeapAccess) that would have caught the bug?

::: js/src/ion/AsmJSLink.cpp
@@ +198,4 @@
>     uint8_t *code = module.functionCode();
>     for (unsigned i = 0; i < module.numHeapAccesses(); i++) {
>         const AsmJSHeapAccess &access = module.heapAccess(i);
> +       JSC::X86Assembler::setPointer(access.patchLengthAt(code), access.testMask(heapLength));

Can you rename "patchLengthAt" to be "patchMaskAt"?

::: js/src/ion/RegisterSets.h
@@ +747,5 @@
> #if defined(JS_CPU_X86)
>     uint8_t cmpDelta_;
> +   ArrayBufferView::ViewType vt_;
> +#else
> +   uint8_t isFloat32Load_;

To minimize platform variation, can you remove isFloat32Load_ and move vt_ outside of the #if?

@@ +808,5 @@
> +
> +   void *testMask(uint32_t heapLength) const
> +   {
> +       // The test mask is the OR combination of the overflow mask and the
> +       // alignment mask.

I'd rather have this logic in AsmJSLink (next to where we call setPointer) and keep AsmJSHeapAccess a simple, logic-free datum. Perhaps you can move the mask computation to a commented static helper function right above DynamicallyLinkModule?

@@ +834,5 @@
> +           break;
> +         default:
> +           JS_NOT_REACHED("unknown array buffer view type");
> +           break;
> +       }

Can you compute the alignment mask using ((uint32_t(1) << TypedArrayShift(viewType)) - 1)?

@@ +843,5 @@
> +       // A heap of minimal length 4096 (== 0x1000) will have the
> +       // overflow mask 0xFFFFF000.
> +       uint32_t base = 0xFFFFF000;
> +       int shift;
> +       JS_FLOOR_LOG2(shift, heapLength);

Even better, we can JS_ASSERT(IsPowerOfTwo(heapLength)). With this property, we should be able to compute the mask using ~(heapLength - 1).
Attachment #746664 - Flags: feedback?(luke)
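[Editor's note] To make the bug described in the review concrete, here is a standalone sketch (my own illustration, not code from the patch; the constants are illustrative assumptions): for an unaligned but in-bounds index, the original and + cmp sequence accepts the access after masking, while the combined test rejects it, which is why the out-of-line path must mask the index and redo the access rather than return 0/NaN.

    #include <cassert>
    #include <cstdint>

    // Original sequence: mask off the low bits, then bounds-check the masked index.
    static bool oldPathAccepts(uint32_t index, uint32_t heapLength, uint32_t alignMask)
    {
        return (index & ~alignMask) < heapLength;
    }

    // Patched fast path: a single test against the combined mask.
    static bool newFastPathAccepts(uint32_t index, uint32_t heapLength, uint32_t alignMask)
    {
        return (index & (~(heapLength - 1) | alignMask)) == 0;
    }

    int main()
    {
        const uint32_t heapLength = 0x4000, alignMask = 0x3; // 16 KiB heap, Int32 access
        // Unaligned but in bounds: the old code accepted it (loading from index & ~3),
        // the new fast path does not -- so the OOL path must mask and retry the access.
        assert(oldPathAccepts(0x1001, heapLength, alignMask));
        assert(!newFastPathAccepts(0x1001, heapLength, alignMask));
        // Aligned and in bounds: both agree.
        assert(oldPathAccepts(0x1000, heapLength, alignMask));
        assert(newFastPathAccepts(0x1000, heapLength, alignMask));
        return 0;
    }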
(In reply to Douglas Crosher [:dougc] from comment #4)
The idea is to put the masking on the assumed-never-taken out-of-line path (as described in comment 5).
Attached patch v2 - following Luke's remarks (obsolete) — Splinter Review
Thanks for the feedback!

Indeed, I forgot to run the tests on this one: two of them were failing (testHeapAccess and testZOOB). This new version fixes the issue.

I will do some more benchmarks to see whether there are differences, especially for the micro-benchmarks.
Attachment #746664 - Attachment is obsolete: true
Attachment #747728 - Flags: review?(luke)
(In reply to Benjamin Bouvier [:bbouvier] from comment #7)
> Created attachment 747728 [details] [diff] [review]
> v2 - following Luke's remarks
>
> Thanks for the feedback!
>
> Indeed, I forgot to run the tests on this one: two of them were failing
> (testHeapAccess and testZOOB). This new version fixes the issue.
>
> I will do some more benchmarks to see whether there are differences,
> especially for the micro-benchmarks.

The index register is being destructively modified, and the register allocator may be assuming otherwise here?
(In reply to Luke Wagner [:luke] from comment #3)
> Douglas/Marty: can the optimization be easily extended to ARM?

The ARM 'BFC' (Bit Field Clear) instruction appears usable for masking, but unfortunately it cannot set the condition flags, so a separate compare would still be needed. Thus this would be a three-instruction sequence on ARM: BFC, CMP, branch. It's something to explore.
This patch fixes the issue raised in Comment 8: it ensures that the index register is not clobbered, by making a temporary copy of the index. Douglas, could you please confirm that this is what you had in mind in Comment 8?

Unfortunately, this affects performance. On the asmjs-apps benchmarks, box2d is better by 2%, bullet by up to 8%, and zlib is worse by up to 4%. A micro-benchmark that should compile down to just a STORE operation shows that the index register is copied into another register even when the index is dead. This implies that either something in the register coherency patch is wrong, or there might be a bug in the register allocator. As it also happens with the backtracking allocator, if the bug occurs during register allocation, it likely happens during liveness analysis. I will experiment with different register assignments (useRegister, useRegisterAtStart, etc.) and/or look at the liveness analysis code.
Attachment #751243 - Flags: feedback?(dtc-moz)
Attachment #747728 - Attachment is obsolete: true
Attachment #747728 - Flags: review?(luke)
It'd be great to see why tempCopy is needlessly making a register copy in both allocators and to fix that. However, if it is still a problem after that fix, we could always avoid the need for a tempCopy by performing the load in the out-of-line path; that way we don't need to mutate the input register.
Comment on attachment 751243 [details] [diff] [review]
v2 - Ensure registers are not clobbered

Review of attachment 751243 [details] [diff] [review]:
-----------------------------------------------------------------

Registers are scarce, so it would be better not to allocate an extra register for a temp that is rarely used. The register allocator is basic, but it might be a distraction in this case to start extending it.

You might need to find somewhere to store the index. One option might be to store state on the stack. Ion appears to reserve stack space to build arguments for calls, and perhaps it would be possible to ensure there is a reserve for use by emitted ops; then you could save the index to the stack, mask it, load/store, and load it back from the stack.

I am not so sure that the low-bit masking should be a slow OOL path, and the OOL code is growing in size and may not be negligible on low-end devices. Some code can make use of low-bit tagging and would want fast low-bit masking. The masking is already explicit in the JS; would a double mask be needed to actually request a fast-path masking operation? That might not be unrealistic, but it needs some thought. Also, if the bounds check is eliminated by other optimizations, then it is no longer possible to hide the masking operation in the bounds check, and we still need to be able to eliminate the masks and perhaps hoist them.

I suggest working to eliminate unnecessary masking operations, and perhaps hoist them, but with the assumption that the asm.js code writer (Emscripten etc.) can be expected to cooperate. I suggest not supporting costly optimizations in the asm.js assembler that the code writer could have performed itself. In the absence of contrary guidance, I suggest working this into the range analysis stage, at least as a start. The optimizations might be costly, but we can keep in mind the possibility of moving them to a separate, more limited stage specific to the asm.js assembler. The approach of hiding the masking operation in the bounds check is sound and might still find a place.
Attachment #751243 - Flags: feedback?(dtc-moz)
(In reply to Douglas Crosher [:dougc] from comment #12)
Oh right, pushing on the stack was the other good option, thanks for the reminder!

> I am not so sure that the low bit masking should be
> a slow OOL path, and the OOL code is growing in size
> and may not be negligible on low end devices.

Yes, we should measure the change in OOL code size. That path should be very cold, though, so I think we can tolerate a reasonable increase.

> Some code can make use of low bit tagging and would
> want fast low bit masking.

The primary use case here is generated code, so an unaligned load would mean that the underlying C++ code was performing the load without masking off the low bits. This should basically never happen. You are right that Emscripten *could* remove a preceding mask because of the masking effect of the shifts, but it doesn't. Note: LLVM marks potentially-unaligned memory accesses as such, and Emscripten in this case will emit byte loads and OR them together.

> In the absence of contrary guidance, I suggest working
> this into the range analysis stage, at least as a start.

I think we should *also* work on range analysis and hoisting in the backend. I don't think we will be able to remove so many alignment masks that this optimization would be unnecessary, though.
Thanks for your comments and remarks. Using the stack looks like a much simpler solution indeed. I can see two different ways of implementing it:

- The first one would be to push the index before the test and pop it back after the store/load, so that the index register is restored after the operation. I implemented this simple, naive solution, and the problem is that during loads, the register that contains the index can be clobbered by the loaded value, so the pop would then clobber the result. A solution would be to pop the index into a temp register, but that is equivalent to using a temp copy, which we want to avoid. Moreover, it introduces two new operations on the main path (push/pop), which can hide the speedup obtained by removing the AND.

- The other solution is to do the push and pop in the OOL path. This way, we ensure that the pop won't clobber any register used after the operation. But this implies having the load/store in the OOL path, and therefore the overflow test (which needs patching) in the OOL path as well. As a consequence, the OOL path may grow in size.

I will implement the second solution, unless anybody sees a better way to do it.
Actually, the second solution would suffer from the same issue as the first one: any time you pop from the stack into a register after a load, you risk clobbering the loaded value. I guess the use of a temporary register is unavoidable here.
(In reply to Benjamin Bouvier [:bbouvier] from comment #15)
> Actually, the second solution would suffer from the same issue as the first
> one: any time you pop from the stack into a register after a load, you risk
> clobbering the loaded value. I guess the use of a temporary register is
> unavoidable here.

Yes, good point. There are still options to explore. For example, the result can be stored on the stack too, and reloaded after restoring the index.

Inline code:
    Test index, #length-and-alignment-mask
    Branch-if-not-zero OOL
    Load Result, [index+#base]
  Done:

  OOL:
    Store [sp], index
    And index, #low-bit-mask
    Test index, #length-mask
    Branch-if-not-zero Out-Of-Bounds
    Load result, [index+#base]
    Store [sp+#wordSize], result
    Load index, [sp]
    Load result, [sp+#wordSize]
    Branch Done

  Out-of-Bounds:
    Load index, [sp]
    ...

It is also possible to specify that the index register argument remain live until the end of the operation so that it is not allocated to the same register as the result. However, this does have some cost, as it precludes the result being allocated to the same register as the index when possible and needed.
(In reply to Luke Wagner [:luke] from comment #13)
> (In reply to Douglas Crosher [:dougc] from comment #12)
...
> > I am not so sure that the low bit masking should be
> > a slow OOL path, and the OOL code is growing in size
> > and may not be negligible on low end devices.
>
> Yes, we should measure the change in OOL code size. It should be very cold,
> though, so I think we can tolerate a reasonable increase.

Let's wait for some measurements. Heap accesses can be frequent, the OOL code is sizeable, and there are more patches to come.

> > Some code can make use of low bit tagging and would
> > want fast low bit masking.
>
> The primary use case here is generated code so an unaligned load would mean
> that the underlying C++ code was performing the load without masking off the
> low bits. This should basically never happen.

The issue is not an unaligned access. C++ code can in practice explicitly mask a pointer to remove low tag bits before a load or store through that pointer. This style of optimization is common in dynamically typed languages, where it is used to store some immediate type information, and I would like to see asm.js support it well.

> You are right that
> Emscripten *could* remove a preceding mask because of masking effect of the
> shifts, but it doesn't.

But it could tomorrow, or someone could fork it and have it do this, and there is currently no valid reason not to make this optimization. Or people might be hand-optimizing their asm.js code and make this obvious optimization.

> Note: LLVM marks potentially-unaligned memory
> accesses as such and Emscripten in this case will emit byte loads and OR
> them together.

The concern is the translation of a mask followed by an aligned memory access into fast asm.js assembled code.

> > In the absence of contrary guidance, I suggest working
> > this into the range analysis stage, at least as a start.
>
> I think we should *also* work on the range analysis and hoisting in the
> backend. I don't think we will be able to remove so many alignment masks
> that this optimization would be unnecessary, though.

Perhaps, but to also optimize the mask/access path it may be necessary to use an asm.js syntactic distinction between fast-path and slow-path low-bit masking. For example, u32[(i & ~3)>>2] could represent a fast-path masking operation, and u32[(i - 1)>>2] a corrective alignment offset that could be hidden in the heap base offset on x86. This would need to be added to the asm.js specification so that code generators know how to express the fast/slow-path low-bit masking preference. It just seems unfortunate to do so now, before the results of masking elimination and hoisting are known.

The real fast path will be code with hoisted and eliminated low-bit masking and bounds checks. Performance-critical code will need to use such a programming style. It will be fast and compact, with no OOL code. LLVM/Emscripten could take as much of the burden as possible to keep the asm.js assembler as light as possible.
(In reply to Douglas Crosher [:dougc] from comment #16)
> Yes, good point. There are still options to explore. For example, the
> result can be stored on the stack too, and reloaded after restoring
> the index.

Thanks. I was trying to figure out whether there was a proper way to avoid patching at link time, but there doesn't seem to be one. I will implement it this way.
Here is the new version of the patch, with the load/store replicated in the OOL path. No register is clobbered in the process. A possible further optimization would be to use the destination register as the index register for loads of any non-Float32 element (and use the stack push/pop only for Float32), but the current version seems simpler this way. Benchmarks show the expected improvements (from 3% to 6% faster on the asmjs-apps AWFY benchmarks). I will add instrumentation to measure the growth in code size in a separate patch.
Attachment #751243 - Attachment is obsolete: true
Attachment #752374 - Flags: review?(luke)
I made some measurements on code size. At this point, the total code size can grow by up to ~45% compared to the unpatched build. The OOL code size grows by up to 4.5x, and with this patch the OOL code represents up to ~50% of the total code size. That is pretty high. It doesn't seem to affect compilation time yet.

This is explained by the fact that all store and load operations are replicated in the OOL paths: while the load OOL path initially took 2 instructions, it now takes 7, and the store OOL path initially had no cost and now takes 7 instructions with this patch.

A simple optimization that would reduce this number of instructions would be to skip the masking and comparison in the OOL path when the align mask is #FFFFFFFF (which happens when loading an 8-byte value). A quick measurement shows that 4% of the loads have such a mask in real-world applications (asmjs-apps and Epic Citadel), so this optimization would reduce the OOL code size by about 3%.

As Luke pointed out IRL, a lot of the generated OOL code is duplicated. If we manually generate an OOL stub function for each (input register, output register) pair, the OOL paths could contain just calls/jmps instead of the current 7 instructions, which would yield huge savings in code size. I'll try this tomorrow.
Attachment #752374 - Attachment is obsolete: true
Attachment #752374 - Flags: review?(luke)
This patch implements a map of call stubs keyed by (isLoad, ViewType, index register code, output register code) quadruples. These stub functions are called from the OOL paths, which keeps the code size from growing excessively. I also used the xor instruction for resetting registers to 0, as mentioned in bug 753934.

Speedups are still in the range of 3% to 7% for asmjs-apps with load factor 4. The interesting part is now the difference in total generated code size:
- for asmjs-apps, all apps have their code size grow by around 1%, except for bullet, which ends up with a smaller code size;
- for the Unreal Engine 3, the total code size grows by 1.5%.

In all cases, OOL code sizes grow much more, but the hoisted AND instructions for both actions (loads and stores) tend to compensate for the 2 instructions in each OOL path.
Attachment #753066 - Attachment is obsolete: true
Attachment #754014 - Flags: review?(luke)
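[Editor's note] For illustration only, a rough sketch of how such a stub cache could be keyed, with one lazily generated stub per (isLoad, view type, index register, output register) combination. All names here are hypothetical placeholders, not the patch's actual types:

    // Hypothetical sketch of the stub cache described above; the real patch
    // lives in the Ion/asm.js backend and uses its own types.
    #include <cstdint>
    #include <map>

    struct OutOfLineStubKey {
        bool isLoad;            // load vs. store path
        uint8_t viewType;       // e.g. Int8Array, Float32Array, ... (placeholder encoding)
        uint8_t indexRegCode;   // machine encoding of the index register
        uint8_t outputRegCode;  // machine encoding of the output/value register

        bool operator<(const OutOfLineStubKey &other) const {
            if (isLoad != other.isLoad) return isLoad < other.isLoad;
            if (viewType != other.viewType) return viewType < other.viewType;
            if (indexRegCode != other.indexRegCode) return indexRegCode < other.indexRegCode;
            return outputRegCode < other.outputRegCode;
        }
    };

    // Maps each quadruple to the offset of its generated stub, so every OOL path
    // sharing the same registers and view type can just call/jmp to one stub.
    using StubOffsetMap = std::map<OutOfLineStubKey, uint32_t>;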
(In reply to Benjamin Bouvier [:bbouvier] from comment #22)
> I also used the xor instruction for resetting registers to 0, as mentioned
> in bug 753934.

I was actually talking about bug 875917 (753934 must be the code size of one of the test programs...).
Attached patch rebased patch (obsolete) — Splinter Review
Attachment #754014 - Attachment is obsolete: true
Attachment #754014 - Flags: review?(luke)
Attachment #759473 - Flags: review?(luke)
I tried this patch on the new benchmarks Linpack and Lua-binary-trees:
- Linpack: 840 MFlops -> 880 MFlops (4.7% speedup)
- Lua: 9.8s -> 9s (8.1% speedup)
Attachment #759473 - Attachment is obsolete: true
Attachment #759473 - Flags: review?(luke)
Attachment #767909 - Flags: review?(luke)
Comment on attachment 767909 [details] [diff] [review]
rebased + auto fixed nits

(just discussed with bbouvier, but for the record) There are some alternative strategies for bounds-check removal (bug 897337) we're planning to try that would apply more generally, so we're going to table this patch for the moment until we see how the other plans work out. This is great work, though.
Attachment #767909 - Flags: review?(luke)
Per policy at https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Inactive_Bugs. If this bug is not an enhancement request or a bug not present in a supported release of Firefox, then it may be reopened.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INACTIVE
Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---

We probably won't ever land this: we're now working on Cranelift, and 32-bit x86 optimizations are becoming less and less relevant. That was a fun learning experiment back in the day :)

Status: REOPENED → RESOLVED
Closed: 7 years ago → 5 years ago
Resolution: --- → WONTFIX
Component: JavaScript Engine → Javascript: WebAssembly