Odinmonkey: (ARM) use the 'compare immediate' instruction for bounds checks

RESOLVED FIXED in Firefox 26

Status

()

defect
RESOLVED FIXED
6 years ago
4 years ago

People

(Reporter: dougc, Assigned: dougc)

Tracking

(Blocks 1 bug)

unspecified
mozilla27
ARM
All
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(firefox26 fixed, firefox27 fixed)

Details

(Whiteboard: [games])

Attachments

(1 attachment, 6 obsolete attachments)

The ARM backend currently uses a logical shift and comparison to zero
for bounds checks but this only works for heap lengths that are a power
of two.  Using the 'compare immediate' instruction allows a wider
range of heap lengths.

The immediate value to this instruction is an eight bit value that
can be shifted in multiples of two bits.  Thus the quantum by which
the heap length can be increased depends on the heap length as follows:

up to  4080M:  16M
up to  1020M:   4M
up to   255M:   1M
up to 65280k: 256k
up to 16320k:  64k
up to  4080k:  16k
up to  1020k:   4k

It seems unfortunate to expose the asm.js spec. to such ARM constraints,
but it's better than the power-of-two restriction.  Perhaps a subset of
these could be exposed.

There is also a compare-negative-immediate instruction that could add
some other supported lengths but it's not clear that these would be
particularly useful.
Posted patch Backend support. (obsolete) — Splinter Review
This patch has the backend support and switches to using
the compare-immediate instruction.

TODO update the AsmJS linker to accept some range of
the supported heap lengths and add tests for these.
The max size of an ArrayBuffer is INT32_MAX at the moment, so I think that means we only need a quanta of 8MB to support the largest size.  I'm inclined to *always* require a quanta of 8MB since this is conveniently close and bigger than the "huge page" size option on Linux/Windows which conceivably some day we'd like to support (right now we'll just disable asm.js compilation on x64 if the system page size isn't 4KB).
Another question: is there any performance difference on the awfy asmjs-apps benchmarks?
(In reply to Luke Wagner [:luke] from comment #2)
> The max size of an ArrayBuffer is INT32_MAX at the moment, so I think that
> means we only need a quanta of 8MB to support the largest size.  I'm
> inclined to *always* require a quanta of 8MB since this is conveniently
> close and bigger than the "huge page" size option on Linux/Windows which
> conceivably some day we'd like to support (right now we'll just disable
> asm.js compilation on x64 if the system page size isn't 4KB).

The instruction encoding only supports shifts of the 8 bits that are a
multiple of two, so a quanta of 8MB might not achievable for a maximum
heap size of INT32_MAX.  E.g. the encoding options are:

abcdefgh 00000000 00000000 00000000
00abcdef gh000000 00000000 00000000
...
Ah, I missed the "multiple of two" part.  Yes, so then perhaps we could start with a simple 16MB quanta.
Support for non-power-of-two heap lengths can help with the optimization
of heap access, particularly when using masking for safe heap access.

For example, HEAP8[(i&0x07FFFFFF)+96] could be proven to be a safe heap
access if the heap length is 128M+1M.  The top 1M would be a buffer zone
that could still be accessed by a non-masked index.

Similarly for a base pointer plus an offset proven to have a limited range.
For example, HEAP8[(i&0x07FFFFFF)+(k&0xff)] could proven to be a safe heap
access.

From the perspective of this use, the table in comment 0 could be
rewritten to relate the mask heap length to the minimum possible
buffer zone size beyond the masked length:

Masked length,  minimum quanta
    2048M          16M  ??
    1024M          16M
     512M           4M
     256M           4M
     128M           1M
      64M           1M
      32M         256k
      16M         256k
       8M          64k
       4M          64k
       2M          16k
       1M          16k
     512k           4k

This scheme hits a problem for a 2048M masked heap length if the
maximum heap length is also 2048M.

An alternative is to use memory protection to implement such a
buffer zone beyond the heap length, and this would only be restricted
by the pages size.  Both approaches might be offered, giving the
programmer the option to choose an explicit buffer zone length or
use a default implicit buffer zone using memory protection.
(In reply to Luke Wagner [:luke] from comment #3)
> Another question: is there any performance difference on the awfy asmjs-apps
> benchmarks?

Testing on a Nexus 4 (Krait) shows a good gain in performance!
For example, almost a 10% reduction in run time on memops.
This is an unexpected result, but welcomed.

asmjs-ubench memops: mean 42.962 vs 38.955  std 0.29575 seconds

Double checked the code, and it looks ok.  Here's the hottest block:

   0xb5ffa614:	adds	r4, r0, r3
   0xb5ffa618:	cmp	r4, #134217728	; 0x8000000  <<< bounds check
   0xb5ffa61c:	ldrsbcc	r4, [r11, r4]
   0xb5ffa620:	movcs	r4, #0
   0xb5ffa624:	and	r5, r4, #1
   0xb5ffa628:	adds	r4, r5, r2
   0xb5ffa62c:	adds	r2, r3, #1
   0xb5ffa630:	cmp	r2, #1048576	; 0x100000
   0xb5ffa634:	bge	0xb5ffa644
Comment on attachment 797918 [details] [diff] [review]
Backend support.

Requesting approval to land the backend changes on the basis that
the change gives a useful performance improvement.

Could we keep the bug open and land a separate patch to enable
the use of some non-power-of-two heaps sizes once the spec.
changes are decided?
Attachment #797918 - Flags: review?(luke)
Comment on attachment 797918 [details] [diff] [review]
Backend support.

It appears we can do even better performance wise for the load-heap sequence.  Let me explore this and double check.
Attachment #797918 - Flags: review?(luke)
Testing shows that swapping the two load instructions gives
a further performance gain.  For the memops benchmark the gain
is around 10% and for a particular game the gain is 2% on
significant noise but still trending in the right direction.

Tested on the Krait processor in the Nexus 4.

This patch includes only the ARM backend changes related to
this bug.  It assumes bug 865516 has landed.
Assignee: general → dtc-moz
Attachment #797918 - Attachment is obsolete: true
Attachment #799465 - Flags: review?(mrosenberg)
Nice!  Could you describe what the new instruction sequence is?
(In reply to Luke Wagner [:luke] from comment #11)
> Nice!  Could you describe what the new instruction sequence is?

The ARM uses a test (now compare-immediate) followed by two
conditional loads, one load for the in-bounds case and another
for the out-of-bounds case. The latest change just swapped
the order of these two loads, so that the out-of-bounds conditional
load is now first.
Comment on attachment 799465 [details] [diff] [review]
Further optimize the load-heap instruction sequence.

Review of attachment 799465 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/arm/CodeGenerator-arm.cpp
@@ +1844,3 @@
>      } else {
> +        Register d = ToRegister(ins->output());
> +        masm.ma_mov(Imm32(0), d, NoSetCond, Assembler::AboveOrEqual);

I was thinking of just making this unconditional.  Can you see if that has any effect on perf?
Attachment #799465 - Flags: review?(mrosenberg) → review+
(In reply to Marty Rosenberg [:mjrosenb] from comment #13)
> Comment on attachment 799465 [details] [diff] [review]
> Further optimize the load-heap instruction sequence.
...
> > +        Register d = ToRegister(ins->output());
> > +        masm.ma_mov(Imm32(0), d, NoSetCond, Assembler::AboveOrEqual);
> 
> I was thinking of just making this unconditional.  Can you see if that has
> any effect on perf?

Yes, this was tried but slowed performance.
Posted patch Rebase ARM backend changes. (obsolete) — Splinter Review
Carry forward r+
Attachment #799465 - Attachment is obsolete: true
Attachment #801054 - Flags: review+
Posted patch Fix rebase. (obsolete) — Splinter Review
Sorry, rebase was broken. Carrying forward r+.
Attachment #801054 - Attachment is obsolete: true
Attachment #801059 - Flags: review+
Keywords: checkin-needed
Whiteboard: leave open
This seems to be the cause of 23% performance regression on the
awft asmjs-ubench-memops Tegra-2 benchmark.

Performance of the awft asm.js apps seems to be more positive:
asmjs-apps-bullet and asmjs-apps-zlib seem to have improved.  So
this does not appear to be as simple as reverting back to using
'mov' for some processors.

A patch that supports both mov and cmp for use as a bounds check
plus some instruction order variations has been code up, and I 
will test on a wider range of processors.
This patch enables use of all possible heap lengths as
constrained by the x64 page size restriction and the
ARM 'cmp immediate' instruction encoding.

If you want to enable use of just a subset then it should
be trivial so just let me know.

The heap length policy has been abstracted into two functions:
roundUpAsmJSHeapLength, and IsValidAsmJSHeapLength.

On a link failure due to an invalid heap length the message reports
the failing length and makes a suggestion by rounding up to the
next valid length.
Attachment #804559 - Flags: review?(luke)
Forgot to test for heap sizes less than 4096 bytes.  Added this
and reworked the code to hopefully be a little clearer.

Note the sparc has a page size of 8192 bytes, and perhaps it would
be prudent to increase the minimum to 8192 and depreciate the 4096
minimum heap length.

It's not clear how important this is to ff26 at this stage.
Attachment #804559 - Attachment is obsolete: true
Attachment #804559 - Flags: review?(luke)
Attachment #805144 - Flags: review?(luke)
Comment on attachment 805144 [details] [diff] [review]
Catch heap sizes less than 4096 bytes, and rework the code for improved clarity.

Review of attachment 805144 [details] [diff] [review]:
-----------------------------------------------------------------

Great job!

::: js/src/jit/AsmJS.cpp
@@ +929,5 @@
> +//  000000ab cdefgh00 00000000 00000000
> +//  00000000 abcdefgh 00000000 00000000
> +//  00000000 00abcdef gh000000 00000000
> +//  00000000 0000abcd efgh0000 00000000
> +//  ...

On the ellipsis: are there any other immediates supported?

@@ +932,5 @@
> +//  00000000 0000abcd efgh0000 00000000
> +//  ...
> +//
> +// ARM Thumb T2 encoding supports the following immediate constants and although
> +// this mode is not currently used it might be worth considering as a constraint:

If I'm reading correctly, ARM Thumb T2 seems like a strict generalization of the immediate constraints on ARMv7.  Is that right?  If so, then it seems like you could remove the chart and comment.

@@ +961,5 @@
> +//  Heap length 0x00100000 to 0x003fc000 quanta 0x00004000
> +//  Heap length 0x00001000 to 0x000ff000 quanta 0x00001000
> +//
> +uint32_t
> +js::roundUpAsmJSHeapLength(uint32_t length) {

{ on new line

@@ +977,5 @@
> +        return (length + 0x000fffff) & ~0x000fffff;
> +    if (length < 0x10000000u) // < 1024M quanta 4M
> +        return (length + 0x003fffff) & ~0x003fffff;
> +    // < 4096M quanta 16M.  Note zero is returned if over 0xff000000 but such
> +    // length are not currently valid.

Can you JS_ASSERT(length <= 0xff000000)?

@@ +982,5 @@
> +    return (length + 0x00ffffff) & ~0x00ffffff;
> +}
> +
> +bool
> +js::IsValidAsmJSHeapLength(uint32_t length) {

{ on new line

::: js/src/jit/AsmJS.h
@@ +79,5 @@
>  static const size_t AsmJSAllocationGranularity = 4096;
>  
> +// These functions define the valid heap lengths.
> +extern uint32_t
> +roundUpAsmJSHeapLength(uint32_t length);

Capitalize 'r' and also, how about 'RoundUpToNextValidAsmJSHeapLength'?  Otherwise 'round' is a bit ambiguous.

::: js/src/jit/AsmJSLink.cpp
@@ +214,5 @@
>              return LinkFail(cx, "bad ArrayBuffer argument");
>  
>          heap = &bufferVal.toObject().as<ArrayBufferObject>();
>  
> +        if (!IsValidAsmJSHeapLength(heap->byteLength()) || heap->byteLength() < AsmJSAllocationGranularity) {

Can you move the >= AsmJSAllocationGranularity check into IsValidAsmJSHeapLength?

@@ +216,5 @@
>          heap = &bufferVal.toObject().as<ArrayBufferObject>();
>  
> +        if (!IsValidAsmJSHeapLength(heap->byteLength()) || heap->byteLength() < AsmJSAllocationGranularity) {
> +            return LinkFail(cx, JS_smprintf("ArrayBuffer byteLength 0x%x is not a valid heap length. The next valid length is 0x%x",
> +                                            heap->byteLength(), roundUpAsmJSHeapLength(heap->byteLength())));

Nice message!
Attachment #805144 - Flags: review?(luke) → review+
Address reviewer feedback.

Carry forward r+
Attachment #805144 - Attachment is obsolete: true
Attachment #805612 - Flags: review+
(In reply to Luke Wagner [:luke] from comment #22)
> Comment on attachment 805144 [details] [diff] [review]
> Catch heap sizes less than 4096 bytes, and rework the code for improved
> clarity.
> 
> Review of attachment 805144 [details] [diff] [review]:
> -----------------------------------------------------------------

Thank you for the quick review.
 
> ::: js/src/jit/AsmJS.cpp
> @@ +929,5 @@
> > +//  000000ab cdefgh00 00000000 00000000
> > +//  00000000 abcdefgh 00000000 00000000
> > +//  00000000 00abcdef gh000000 00000000
> > +//  00000000 0000abcd efgh0000 00000000
> > +//  ...
> 
> On the ellipsis: are there any other immediates supported?

Yes, smaller immediates are supported but this is all we need
because the minimum granularity is 4096 or 0x1000.

There is also a compare-negative-immediate instruction that could
add some more, but it is not clear if they are useful.

> @@ +932,5 @@
> > +//  00000000 0000abcd efgh0000 00000000
> > +//  ...
> > +//
> > +// ARM Thumb T2 encoding supports the following immediate constants and although
> > +// this mode is not currently used it might be worth considering as a constraint:
> 
> If I'm reading correctly, ARM Thumb T2 seems like a strict generalization of
> the immediate constraints on ARMv7.  Is that right?  If so, then it seems
> like you could remove the chart and comment.

Ok, the table has been removed, and a small comment added noting that
it is also compatible with the Thumb T2 encodeing.
 
> @@ +961,5 @@
> > +//  Heap length 0x00100000 to 0x003fc000 quanta 0x00004000
> > +//  Heap length 0x00001000 to 0x000ff000 quanta 0x00001000
> > +//
> > +uint32_t
> > +js::roundUpAsmJSHeapLength(uint32_t length) {
> 
> { on new line

Done
 
> @@ +977,5 @@
> > +        return (length + 0x000fffff) & ~0x000fffff;
> > +    if (length < 0x10000000u) // < 1024M quanta 4M
> > +        return (length + 0x003fffff) & ~0x003fffff;
> > +    // < 4096M quanta 16M.  Note zero is returned if over 0xff000000 but such
> > +    // length are not currently valid.
> 
> Can you JS_ASSERT(length <= 0xff000000)?

Done.

> @@ +982,5 @@
> > +    return (length + 0x00ffffff) & ~0x00ffffff;
> > +}
> > +
> > +bool
> > +js::IsValidAsmJSHeapLength(uint32_t length) {
> 
> { on new line

Done.
 
> ::: js/src/jit/AsmJS.h
> @@ +79,5 @@
> >  static const size_t AsmJSAllocationGranularity = 4096;
> >  
> > +// These functions define the valid heap lengths.
> > +extern uint32_t
> > +roundUpAsmJSHeapLength(uint32_t length);
> 
> Capitalize 'r' and also, how about 'RoundUpToNextValidAsmJSHeapLength'? 
> Otherwise 'round' is a bit ambiguous.

Done.
 
> ::: js/src/jit/AsmJSLink.cpp
> @@ +214,5 @@
> >              return LinkFail(cx, "bad ArrayBuffer argument");
> >  
> >          heap = &bufferVal.toObject().as<ArrayBufferObject>();
> >  
> > +        if (!IsValidAsmJSHeapLength(heap->byteLength()) || heap->byteLength() < AsmJSAllocationGranularity) {
> 
> Can you move the >= AsmJSAllocationGranularity check into
> IsValidAsmJSHeapLength?

Good suggestion. Done.
Whiteboard: [leave open]
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/06f35795582a
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla27
Comment on attachment 805612 [details] [diff] [review]
Address reviewer feedback.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
User impact if declined: asm.js code will not be able to use the new non-power-of-two heap lengths and backwards compatibility requirements would prevent app developers using in future until ff26 is history.  This feature might help reduce memory usage and also improve performance of apps.
Testing completed (on m-c, etc.): includes tests
Risk to taking this patch (and alternatives if risky): low
String or IDL/UUID changes made by this patch:
Attachment #805612 - Flags: approval-mozilla-aurora?
Depends on: 917800
Whiteboard: [games]
Attachment #805612 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #801059 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.