Closed
Bug 869606
Opened 12 years ago
Closed 5 years ago
OdinMonkey: combine overflow and alignment tests on array load / store, on x86
Categories
(Core :: JavaScript: WebAssembly, defect)
Tracking
RESOLVED
WONTFIX
People
(Reporter: bbouvier, Assigned: bbouvier)
Details
Attachments
(1 file, 7 obsolete files)
35.26 KB, patch
When doing a heap load or store in asm.js on x86 (for instance: array32[i >> 2] = x), the generated code contains a bitwise AND to clear the low bits of the index (here, the last 2 bits), followed by a comparison to check that the access does not go past the end of the heap:
0xf7fd3011: and $0xfffffffc,%eax // clear the last two bits of eax
0xf7fd3018: cmp $0x4000,%eax // is eax below the heap size (here 16384)?
0xf7fd301e: jae 0xf7fd302a
These two operations can be combined into a single one, using a TEST instead of the AND + CMP pair. The test mask has to combine both the clearing mask from the AND and the bounds check from the CMP. For instance, the code above can be rewritten as:
0xf7fd3011: test $0xffffc003,%eax
0xf7fd301e: jne 0xf7fd302a
This saves one assembly instruction per heap access, which could improve performance.
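For the concrete numbers above, the combined mask works out as follows. This is a minimal sketch of the arithmetic only (not code from any attached patch), assuming the heap length is a power of two:

    #include <cstdint>
    #include <cstdio>

    int main() {
        uint32_t heapLength = 0x4000;            // 16384-byte heap
        uint32_t alignMask  = (1u << 2) - 1;     // 0x3: low bits that must be clear for an int32 access
        uint32_t boundsMask = ~(heapLength - 1); // 0xFFFFC000: any of these bits set means out of bounds
        uint32_t testMask   = alignMask | boundsMask;
        printf("test mask = 0x%08x\n", (unsigned) testMask); // prints 0xffffc003, as in the TEST above
        return 0;
    }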
Assignee
Comment 1•12 years ago
A first working version.
I'm not quite sure about the modification in AsmJS.cpp:CheckArrayAccess though: is the pointerDef the one that has to be saved instead of the MBitAnd?
Benchmarks coming soon.
Attachment #746664 - Flags: feedback?(luke)
Assignee
Comment 2•12 years ago
Instead of posting all the benchmarks, I will just summarize them.
I used the AWFY asm.js tests (both micro-benchmarks and apps), run 5 times each on the same machine, with and without the patch.
Results on asm-microbenchmarks show improvements for 4 of them (fannkuch: 5.5%, skinning: 7%, fasta: 4%, copy: 3.5%) and regressions for 4 of them (life: 1%, memops: 1.5%, corrections: 16%, primes: 7%). Also, the micro-benchmarks run much faster in debug mode than in release mode, at least on my machine. As the micro-benchmarks use tiny loops with few instructions, the impact of instruction alignment might not be negligible.
Moreover, since we remove one low-level instruction (the AND), we expect an overall performance improvement; the regressions observed on the micro-benchmarks may well be noise.
Results on asm-apps show improvements of 8 to 15% over longer test durations. These results are more meaningful and match expectations, as there is no regression.
Comment 3•12 years ago
Douglas/Marty: can the optimization be easily extended to ARM?
Comment 4•12 years ago
I think there is an ARM instruction to clear a range of bits, which could be used for a single-instruction test.
However, isn't the masking operation still needed, since the index still has to be masked before the access?
Comment 5•12 years ago
Comment on attachment 746664 [details] [diff] [review]
v1
Review of attachment 746664 [details] [diff] [review]:
-----------------------------------------------------------------
Great start!
I think there is one important bug, however: if there is an unaligned access, the patch as-is will jump to the out-of-line path, which returns 0 or NaN for loads and ignores stores. What needs to happen instead is that the pointer is aligned (by masking off the low bits) and the load/store is performed with this aligned pointer. The most obvious way I can think of is for the out-of-line path to be:
testl index, #alignment-plus-out-of-bounds-mask
jnz out-of-line-path
before-load:
mov dst, #base-address(index)
rejoin:
... next instruction
out-of-line-path:
and index, #alignment-mask
cmpl index, #heap-length
jb before-load
move 0 or NaN to dst
j rejoin
this means saving the index Register and before-load Label in the OutOfLine object. (This is currently shared with visitLoadTypedArrayElementStatic, so you'll want to fork off your own AsmJS version.)
Note: the same also goes for the store path: an unaligned store needs to store at the masked location.
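In C-like terms, the control flow suggested above corresponds roughly to the sketch below. This is illustrative only and not SpiderMonkey code; heapLength stands for the constant patched in at link time and is assumed to be a power of two:

    #include <cstdint>
    #include <cstring>

    // Rough model of the inline fast path plus the realign-and-retry out-of-line path.
    uint32_t loadInt32(const uint8_t* heapBase, uint32_t heapLength, uint32_t index) {
        const uint32_t alignMask = 0x3;                       // low bits for an int32 access
        const uint32_t testMask  = alignMask | ~(heapLength - 1);
        if (index & testMask) {                               // out-of-line path (expected to be cold)
            index &= ~alignMask;                              // realign the index first...
            if (index >= heapLength)                          // ...then redo the bounds check
                return 0;                                     // truly out of bounds: loads yield 0 (NaN for float views)
        }
        uint32_t value;
        std::memcpy(&value, heapBase + index, sizeof(value)); // aligned, in-bounds load
        return value;
    }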
I think this should have been caught by running 'js/src/jit-tests/jit_tests.py $(SHELL) asm.js'; if it wasn't, could you add tests (to jit-tests/tests/asm.js/testHeapAccess) that would have caught the bug?
::: js/src/ion/AsmJSLink.cpp
@@ +198,4 @@
> uint8_t *code = module.functionCode();
> for (unsigned i = 0; i < module.numHeapAccesses(); i++) {
> const AsmJSHeapAccess &access = module.heapAccess(i);
> + JSC::X86Assembler::setPointer(access.patchLengthAt(code), access.testMask(heapLength));
Can you rename "patchLengthAt" to be "patchMaskAt"?
::: js/src/ion/RegisterSets.h
@@ +747,5 @@
> #if defined(JS_CPU_X86)
> uint8_t cmpDelta_;
> + ArrayBufferView::ViewType vt_;
> +#else
> + uint8_t isFloat32Load_;
To minimize platform variation, can you remove isFloat32Load_ and move vt_ outside of the #if?
@@ +808,5 @@
> +
> + void *testMask(uint32_t heapLength) const
> + {
> + // The test mask is the OR combination of the overflow mask and the
> + // alignment mask.
I'd rather have this logic in AsmJSLink (next to where we call setPointer) and keep AsmJSHeapAccess a simple, logic-free datum. Perhaps you can move the mask computation to a commented static helper function right above DynamicallyLinkModule?
@@ +834,5 @@
> + break;
> + default:
> + JS_NOT_REACHED("unknown array buffer view type");
> + break;
> + }
Can you compute the alignment mask using ((uint32_t(1) << TypedArrayShift(viewType)) - 1)?
@@ +843,5 @@
> + // A heap of minimal length 4096 (== 0x1000) will have the
> + // overflow mask 0xFFFFF000.
> + uint32_t base = 0xFFFFF000;
> + int shift;
> + JS_FLOOR_LOG2(shift, heapLength);
Even better, we can JS_ASSERT(IsPowerOfTwo(heapLength)). With this property, we should be able to compute the mask using ~(heapLength - 1).
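Putting the two suggestions together, such a helper might look roughly like the sketch below. The function name and signature are hypothetical; in-tree it would use JS_ASSERT(IsPowerOfTwo(heapLength)) and TypedArrayShift as mentioned above:

    #include <cassert>
    #include <cstdint>

    // Hypothetical helper: compute the immediate to patch into the TEST.
    // A set bit in the result means "must be zero in the index": the low bits
    // enforce alignment, the high bits enforce the bounds check.
    static uint32_t
    ComputeHeapAccessTestMask(uint32_t heapLength, unsigned typedArrayShift)
    {
        assert(heapLength != 0 && (heapLength & (heapLength - 1)) == 0); // power of two
        uint32_t alignmentMask = (uint32_t(1) << typedArrayShift) - 1;   // e.g. 0x3 for an Int32Array access
        uint32_t overflowMask  = ~(heapLength - 1);                      // e.g. 0xFFFFC000 for a 16 KiB heap
        return alignmentMask | overflowMask;                             // e.g. 0xFFFFC003
    }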
Attachment #746664 - Flags: feedback?(luke)
Comment 6•12 years ago
(In reply to Douglas Crosher [:dougc] from comment #4)
The idea is to put the masking on the assumed-never-taken out-of-line path (as described in comment 5).
Assignee
Comment 7•12 years ago
Thanks for the feedback!
Indeed, I forgot to run the tests for this one: two of them were failing (testHeapAccess and testZOOB). This new version fixes the issue.
I will run some more benchmarks to see whether there are differences, especially for the micro-benchmarks.
Attachment #746664 - Attachment is obsolete: true
Attachment #747728 - Flags: review?(luke)
Comment 8•12 years ago
(In reply to Benjamin Bouvier [:bbouvier] from comment #7)
The index register is being destructively modified, and the register allocator may be assuming otherwise here?
Comment 9•12 years ago
(In reply to Luke Wagner [:luke] from comment #3)
> Douglas/Marty: can the optimization be easily extended to ARM?
The ARM 'BFC' (Bit Field Clear) instruction appears usable for the masking, but unfortunately it cannot set the condition flags, so a separate compare would still be needed. This would therefore be a three-instruction sequence on ARM: BFC, CMP, branch. It's something to explore.
Assignee
Comment 10•12 years ago
This patch fixes the issue raised in Comment 8. It ensures that the index register is not clobbered by making a temporary copy of the index.
Douglas, could you please confirm that this is what you had in mind in Comment 8?
Unfortunately, this affects performance: on the asmjs-apps benchmarks, box2d is better by 2%, bullet by up to 8%, and zlib is worse by up to 4%.
A micro-benchmark that should compile down to just a STORE operation shows that the index register is copied into another register even when the index is dead afterwards.
This implies that either something in this patch's register constraints is wrong, or there is a bug in the register allocator. Since it also happens with the backtracking allocator, if the bug occurs during register allocation it probably happens in the liveness analysis they share. I will experiment with different register assignments (useRegister, useRegisterAtStart, etc.) and/or look at the liveness analysis code.
Attachment #751243 - Flags: feedback?(dtc-moz)
Updated•12 years ago
Attachment #747728 - Attachment is obsolete: true
Attachment #747728 - Flags: review?(luke)
Comment 11•12 years ago
It'd be great to see why tempCopy is needlessly making a register copy in both allocators, and to fix that. However, if it is still a problem after that fix, we could always avoid the need for a tempCopy by performing the load in the out-of-line path, so that we don't need to mutate the input register.
Comment 12•12 years ago
Comment on attachment 751243 [details] [diff] [review]
v2 - Ensure registers are not clobbered
Review of attachment 751243 [details] [diff] [review]:
-----------------------------------------------------------------
Registers are scarce, so it would be better not to allocate an extra register for a temp that is rarely used. The register allocator is basic, but it might be a distraction in this case to start extending it.
You might need to find somewhere to store the index. One option might be to store state on the stack. Ion appears to reserve stack space to build arguments for calls, and perhaps it would be possible to ensure there is a reserve for use by emitted ops; you could then save the index to the stack, mask it, do the load/store, then load it back from the stack.
I am not so sure that the low-bit masking should be a slow OOL path, and the OOL code is growing in size, which may not be negligible on low-end devices. Some code can make use of low-bit tagging and would want fast low-bit masking. The masking is already explicit in the JS; would a double mask be needed to actually request a fast-path masking operation? That might not be unrealistic, but it needs some thought.
Also, if the bounds check is eliminated by other optimizations, then it is no longer possible to hide the masking operation in the bounds check, and we still need to be able to eliminate the masks and perhaps hoist them.
I suggest working to eliminate unnecessary masking operations, and perhaps hoist them, but with the assumption that the asm.js code writer (Emscripten etc.) can be expected to cooperate. I suggest not supporting costly optimizations in the asm.js assembler that the code writer could have done itself.
In the absence of contrary guidance, I suggest working this into the range analysis stage, at least as a start. The optimizations might be costly, but we can keep in mind the possibility of moving them to a separate, more limited stage specific to the asm.js assembler.
The approach of hiding the masking operation in the bounds check is sound and might still find a place.
Attachment #751243 - Flags: feedback?(dtc-moz)
Comment 13•12 years ago
(In reply to Douglas Crosher [:dougc] from comment #12)
Oh right, pushing on the stack was the other good option, thanks for the reminder!
> I am not so sure that the low-bit masking should be a slow OOL path, and the OOL code is growing in size, which may not be negligible on low-end devices.
Yes, we should measure the change in OOL code size. It should be very cold, though, so I think we can tolerate a reasonable increase.
> Some code can make use of low-bit tagging and would want fast low-bit masking.
The primary use case here is generated code, so an unaligned load would mean that the underlying C++ code was performing the load without masking off the low bits. This should basically never happen. You are right that Emscripten *could* remove a preceding mask because of the masking effect of the shifts, but it doesn't. Note: LLVM marks potentially-unaligned memory accesses as such, and Emscripten in this case will emit byte loads and OR them together.
> In the absence of contrary guidance, I suggest working this into the range analysis stage, at least as a start.
I think we should *also* work on the range analysis and hoisting in the backend. I don't think we will be able to remove so many alignment masks that this optimization would be unnecessary, though.
Assignee
Comment 14•12 years ago
Thanks for your comments and remarks. Using the stack looks like a much simpler solution indeed.
I can see two different ways of implementing it:
- The first one is to push the index before the test and pop it back after the store/load, so that the index register is restored after the operation. I implemented this simple, naive solution; the problem is that during loads, the register that contains the index can be reused for the loaded value, so the pop would then clobber the result. A solution would be to pop the index into a temp register, but that is equivalent to using a temp copy, which we want to avoid. Moreover, it introduces two new operations on the main path (push/pop), which could cancel out the speedup obtained by removing the AND.
- The other solution is to have the push and pop take place in the OOL path. This ensures that the pop won't clobber any register used after the operation, but it implies having the load/store in the OOL path, and therefore the bounds test (which needs patching) in the OOL path as well. As a consequence, the OOL path grows in size.
I will implement the second solution, unless anybody sees a better way to do it.
Assignee
Comment 15•12 years ago
Actually, the second solution would suffer from the same issue as the first one: any time you pop from the stack into a register after a load, you risk clobbering the loaded value. I guess the use of a temporary register is unavoidable here.
Comment 16•12 years ago
(In reply to Benjamin Bouvier [:bbouvier] from comment #15)
> Actually, the second solution would suffer from the same issue as the first one: any time you pop from the stack into a register after a load, you risk clobbering the loaded value. I guess the use of a temporary register is unavoidable here.
Yes, good point. There are still options to explore. For example, the result can be stored on the stack too, and reloaded after restoring the index.
Inline code:
Test index, #length-and-alignment-mask
Branch-if-not-zero OOL
Load Result, [index+#base]
Done:
OOL:
Store [sp], index
And index, #low-bit-mask
Test index, #length-mask
Branch-if-not-zero Out-Of-Bounds
Load result, [index+#base]
Store [sp+#wordSize], result
Load index, [sp]
Load result, [sp+#wordSize]
Branch Done
Out-of-Bounds:
Load index, [sp]
...
It is also possible to specify that the index register argument remains live until the end of the operation, so that it is not allocated to the same register as the result. However, this has some cost, as it precludes the result being allocated to the same register as the index when that is possible and wanted.
Comment 17•12 years ago
(In reply to Luke Wagner [:luke] from comment #13)
> (In reply to Douglas Crosher [:dougc] from comment #12)
...
> > I am not so sure that the low-bit masking should be a slow OOL path, and the OOL code is growing in size, which may not be negligible on low-end devices.
>
> Yes, we should measure the change in OOL code size. It should be very cold, though, so I think we can tolerate a reasonable increase.
Let's wait for some measurements. Heap accesses can be frequent, the OOL code is sizeable, and there are more patch locations to record.
> > Some code can make use of low-bit tagging and would want fast low-bit masking.
>
> The primary use case here is generated code, so an unaligned load would mean that the underlying C++ code was performing the load without masking off the low bits. This should basically never happen.
The issue is not unaligned accesses. C++ code can in practice explicitly mask a pointer to remove low tag bits before loading or storing through it. This style of optimization is common in dynamically typed languages, where it is used to carry some immediate type information, and I would like to see asm.js support it well.
> You are right that Emscripten *could* remove a preceding mask because of the masking effect of the shifts, but it doesn't.
But it could tomorrow, or someone could fork it and have it do so, and there is currently no valid reason why they would not make this optimization. Or people might be hand-optimizing their asm.js code and make this obvious optimization themselves.
> Note: LLVM marks potentially-unaligned memory accesses as such, and Emscripten in this case will emit byte loads and OR them together.
The concern is the translation of a mask followed by an aligned memory access into fast asm.js-assembled code.
> > In the absence of contrary guidance, I suggest working this into the range analysis stage, at least as a start.
>
> I think we should *also* work on the range analysis and hoisting in the backend. I don't think we will be able to remove so many alignment masks that this optimization would be unnecessary, though.
Perhaps, but to also optimize the mask/access path it may be necessary to use an asm.js syntactic distinction to differentiate between fast-path and slow-path low-bit masking. For example, u32[(i & ~3)>>2] could represent a fast-path masking operation, and u32[(i - 1)>>2] could apply a corrective alignment offset that could be hidden in the heap base offset on x86. This would need to be added to the asm.js specification so that code generators know how to express the fast/slow-path low-bit masking preference. It just seems unfortunate to do so now, before the results of masking elimination and hoisting are known.
The real fast path will be code with hoisted and eliminated low-bit masking and bounds checks. Performance-critical code will need to use such a programming style; it will be fast and compact, with no OOL code. LLVM/Emscripten could take as much of the burden as possible to keep the asm.js assembler as light as possible.
Assignee
Comment 18•12 years ago
(In reply to Douglas Crosher [:dougc] from comment #16)
> Yes, good point. There are still options to explore. For example, the result can be stored on the stack too, and reloaded after restoring the index.
Thanks. I was trying to figure out whether there was a proper way to avoid patching at link time, but there doesn't seem to be one. I will implement it this way.
Assignee
Comment 19•12 years ago
Here is the new version of the patch, with the load/store replicated in the OOL path. No register is clobbered in the process.
A possible further optimization would be to use the destination register as the index register for loads of any non-Float32 element (and use the stack push/pop only for Float32). However, the current version is simpler this way.
Benchmarks show the expected improvements (from 3% to 6% faster on the asmjs-apps AWFY benchmarks).
I will add instrumentation to measure the growth in code size in a separate patch.
Attachment #751243 - Attachment is obsolete: true
Attachment #752374 - Flags: review?(luke)
Assignee
Comment 20•12 years ago
I made some measurements of the code size. At this point, the total code size can grow by up to ~45% compared with the unpatched build, and the OOL code size grows to up to 4.5 times its initial size. With this patch, the OOL code represents up to ~50% of the total code size, which is pretty high. This doesn't seem to affect compilation time, yet.
This is explained by the fact that all store and load operations are replicated in the OOL paths. While the LOAD OOL path initially took 2 instructions, it now takes 7; the STORE OOL path initially had no cost and now also takes 7 instructions.
A simple optimization that would reduce this number of instructions is to skip the masking and comparison in the OOL path when the alignment mask is #FFFFFFFF (which is the case for byte-sized accesses): if the inline test fails there, the access is genuinely out of bounds. A quick measurement shows that 4% of the loads have such a mask in real-world applications (asmjs-apps and Epic Citadel), so this optimization would reduce the OOL code size by about 3%.
As Luke pointed out IRL, a lot of the generated OOL code is duplicated. If we manually generate an OOL stub function for each (input register, output register) pair, the OOL paths could contain just calls/jmps instead of the current 7 instructions, which would mean huge savings in code size. I'll try this tomorrow.
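The stub-sharing idea amounts to a small cache keyed on the properties that vary between heap accesses. The sketch below is purely illustrative (the Label type and all names are stand-ins, not from the actual patch); the key follows the quadruple that comment 22 ends up using:

    #include <cstdint>
    #include <map>
    #include <tuple>

    // Stand-in for the assembler's label/offset type.
    struct Label { uint32_t offset = 0; };

    // Key: (is this a load?, typed-array view type, index register code, output register code).
    using StubKey = std::tuple<bool, int, int, int>;

    struct OutOfLineStubCache {
        std::map<StubKey, Label> stubs_;

        // Return the shared stub for this combination, generating it only once;
        // each out-of-line path then just calls/jumps here instead of repeating
        // the 7-instruction realign-and-retry sequence.
        Label getOrCreate(const StubKey& key, Label (*generateStub)(const StubKey& key)) {
            auto it = stubs_.find(key);
            if (it != stubs_.end())
                return it->second;
            Label stub = generateStub(key);
            stubs_.emplace(key, stub);
            return stub;
        }
    };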
Assignee
Comment 21•12 years ago
Attachment #752374 - Attachment is obsolete: true
Attachment #752374 - Flags: review?(luke)
Assignee
Comment 22•12 years ago
This patch implements a map of call stubs keyed on (isLoad, ViewType, index register code, output register code) quadruples. These stub functions are called from the OOL paths, which keeps the code size from growing excessively.
I also used the xor instruction for resetting registers to 0, as mentioned in bug 753934.
Speedups are still in the range of 3% to 7% for asmjs-apps with load factor 4.
The interesting part is now the difference in total generated code size:
- for asmjs-apps, all apps see their code size grow by around 1%, except for bullet, which actually ends up smaller;
- for the Unreal Engine 3, the total code size grows by 1.5%.
In all cases the OOL code grows much more, but the AND instructions removed from the inline paths (for both loads and stores) tend to compensate for the 2 extra instructions in each OOL path.
Attachment #753066 - Attachment is obsolete: true
Attachment #754014 - Flags: review?(luke)
Assignee
Comment 23•12 years ago
(In reply to Benjamin Bouvier [:bbouvier] from comment #22)
> I also used the xor instruction for resetting registers to 0, as mentioned in bug 753934.
I was talking about bug 875917 actually (this 753934 must be the code size of one of the test programs...)
Assignee
Comment 24•12 years ago
Attachment #754014 - Attachment is obsolete: true
Attachment #754014 - Flags: review?(luke)
Attachment #759473 - Flags: review?(luke)
Assignee
Comment 25•12 years ago
I tried this patch on the new benchmarks Linpack and Lua-binary-trees:
- Linpack: 840 MFlops -> 880 MFlops (4.7% speedup)
- Lua: 9.8s -> 9s (8.1% speedup)
Attachment #759473 - Attachment is obsolete: true
Attachment #759473 - Flags: review?(luke)
Attachment #767909 - Flags: review?(luke)
Comment 26•12 years ago
Comment on attachment 767909 [details] [diff] [review]
rebased + auto fixed nits
(just discussed with bbouvier, but for the record)
There are some alternative strategies for bounds-check removal (bug 897337) we're planning to try that would apply more generally, so we're going to table this patch for the moment until we see how the other plans work out. This is great work, though.
Attachment #767909 - Flags: review?(luke)
Comment 27•7 years ago
Per policy at https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Inactive_Bugs. If this bug is not an enhancement request or a bug not present in a supported release of Firefox, then it may be reopened.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INACTIVE
Assignee
Updated•7 years ago
Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---
Assignee
Comment 28•5 years ago
We will probably never land this: we're now working on Cranelift, and 32-bit x86 optimizations are becoming less and less relevant. This was a fun learning experiment back in the day :)
Status: REOPENED → RESOLVED
Closed: 7 years ago → 5 years ago
Resolution: --- → WONTFIX
Assignee
Updated•5 years ago
Component: JavaScript Engine → Javascript: WebAssembly