Closed Bug 692042 Opened 13 years ago Closed 13 years ago

Optimize ByteArray: readByte, readInt, readDouble, etc

Categories

(Tamarin Graveyard :: Library, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED
Q1 12 - Brannan

People

(Reporter: lhansen, Assigned: lhansen)

References

Details

(Whiteboard: PACMAN)

Attachments

(1 file, 1 obsolete file)

The "stream" oriented methods on ByteArray are quite slow, involving many function calls (one of them virtual) and eventually ending up in memcpy to move a tiny number of bytes.  It is easy to do much better by cleaning up the C++ code.
Attached patch Patch (obsolete) — Splinter Review
Changes for the easy optimizations, ready for a buildbot run-through (no R? yet).

Performance profile for this change relative to TR 6547, Mac-32 Release, 2x2.93GHz Xeon, OS 10.6, 5 iterations.  Slowdown on bytearray-read-utfbytes-1 needs investigation but is probably noise.

                                        avm            avm2
test                           best     avg    best     avg   %dBst   %dAvg
Metric: iterations/second 
Dir: asmicro/
 bytearray-read-available-1   143.7   143.7   143.9   143.7     0.1     0.0   
  bytearray-read-bool-1         7.8     7.8    19.1    19.1   145.7   145.7 ++
  bytearray-read-byte-1         8.6     8.6    23.3    23.2   171.3   171.2 ++
  bytearray-read-byte-2       106.9   106.8   107.0   106.8     0.1     0.0   
  bytearray-read-byte-3        99.3    99.2    99.6    99.5     0.3     0.3   
  bytearray-read-bytes-1       16.7    16.6    16.7    16.7     0.2     0.2   
  bytearray-read-bytes-2      862.1   861.6   858.1   857.6    -0.5    -0.5 - 
  bytearray-read-double-1       6.2     6.2    16.0    15.9   156.6   156.6 ++
  bytearray-read-double-2       8.1     8.1    29.0    29.0   256.9   257.0 ++
  bytearray-read-float-1        7.5     7.5    19.8    19.5   165.1   161.0 ++
  bytearray-read-float-2        7.9     7.9    26.1    26.0   230.6   229.4 ++
  bytearray-read-int-1          8.4     8.4    24.4    24.3   190.2   190.1 ++
  bytearray-read-int-2          8.8     8.8    29.7    29.7   236.6   236.3 ++
  bytearray-read-length-1     170.8   170.7   170.8   170.6    -0.0    -0.0   
  bytearray-read-position-1   170.8   170.8   170.8   170.6    -0.0    -0.1   
  bytearray-read-short-1        7.8     7.7    28.1    28.1   257.9   263.5 ++
  bytearray-read-short-2        7.7     7.7    28.9    28.9   273.3   273.3 ++
  bytearray-read-ubyte-1        8.4     8.4    23.5    23.5   179.0   179.3 ++
  bytearray-read-uint-1         8.4     8.4    24.5    24.5   191.7   191.6 ++
  bytearray-read-uint-2         8.9     8.9    29.7    29.6   234.7   233.9 ++
  bytearray-read-ushort-1       7.8     7.8    28.0    28.0   261.8   261.4 ++
  bytearray-read-ushort-2       8.0     8.0    28.5    28.5   257.5   257.1 ++
  bytearray-read-utf-1         16.5    16.4    16.6    16.5     1.0     0.8   
  bytearray-read-utf-2          3.4     3.4     3.5     3.4     3.3     1.4 + 
  bytearray-read-utfbytes-1    18.7    18.7    17.5    17.5    -6.5    -6.7 --
  bytearray-read-utfbytes-2     3.4     3.4     3.5     3.5     4.1     2.0 + 
  bytearray-write-bool-1        6.2     6.1    19.1    19.1   210.0   210.3 ++
  bytearray-write-byte-1        6.2     6.2    14.6    14.5   133.2   133.1 ++
  bytearray-write-byte-2       81.6    81.5    85.4    84.1     4.7     3.2 + 
  bytearray-write-byte-3       81.6    81.5    81.8    81.6     0.2     0.1   
  bytearray-write-bytes-1      21.6    21.6    21.6    21.6    -0.0    -0.0   
  bytearray-write-bytes-2    1223.8  1222.4  1218.8  1217.8    -0.4    -0.4 - 
  bytearray-write-double-1      4.7     4.7    11.8    11.8   150.7   150.5 ++
  bytearray-write-double-2      5.6     5.6    13.3    13.3   136.2   136.0 ++
  bytearray-write-float-1       5.6     5.6    13.7    13.7   145.5   145.5 ++
  bytearray-write-float-2       5.9     5.9    13.3    13.3   124.2   124.4 ++
  bytearray-write-int-1         6.1     6.1    16.8    16.8   174.5   174.4 ++
  bytearray-write-int-2         6.2     6.1    16.6    16.5   169.9   168.5 ++
  bytearray-write-short-1       5.9     5.9    16.5    16.5   180.9   181.1 ++
  bytearray-write-short-2       5.8     5.8    15.4    15.4   166.9   167.0 ++
  bytearray-write-uint-1        6.1     6.1    16.9    16.8   176.7   176.0 ++
  bytearray-write-uint-2        6.2     6.1    15.8    15.8   157.0   156.4 ++
Priority: -- → P3
Blocks: 692046
Comment on attachment 564764 [details] [diff] [review]
Patch

Buildbot shows a MIPS failure on a Strings benchmark that's been judged spurious by Brent.

The performance drop on read-utfbytes is still being investigated.

Meanwhile, time for a review.
Attachment #564764 - Flags: review?(fklockii)
I judge that the performance discrepancy is noise, possibly icache placement related.  Consider these two runs with configurations / setup equivalent to the previous runs:

  bytearray-read-utfbytes-1    18.6    18.5    17.4    17.3    -6.4    -6.6 --

  bytearray-read-utfbytes-1    16.4    16.4    17.5    17.5     6.8     6.6 ++

The first run is with the benchmark as written.  The second benchmark is with two spurious "print" statements added before the main benchmark loop.  True, the effect is to slow down the old code by 13% without changing the performance of the new code, not to improve the new code, but the point stands: the change seen on this benchmark is almost certainly noise.
BTW, I've tested with the tests and bug fixes in bug #691251 and with the additional tests currently awaiting review in bug #687045 (in that order).  Those may be needed to apply the patch, should that be desirable.
There will have to be an adjustment to this patch: In bug #685441 it is revealed that not only will ARM compilers defeat the attempted use of unaligned uint32_t loads to load 32-bit floating point values (this was previously known and is commented upon in the code in the patch), but they will also defeat the attempted use of unaligned uint32_t loads to load 64-bit floating-point values.  (Ditto for stores.)

That's surprising and annoying - what's the point of the architecture allowing unaligned loads the compilers make them unreliable - but in practice ARM is the only non-x86 architecture that allows unaligned int loads at all (without using multi-instruction sequences), and furthermore on ARM the unaligned memory ops are so slow that they're not worth the hassle.

Anyway the impact is that those cases have to be removed from readDouble and writeDouble in this patch.  Please do the review assuming that that's the case.
Whiteboard: PACMAN
Attached patch Patch, v2Splinter Review
This patch is essentially identical to the previous one, but removes the inappropriate use of unaligned int access to load/store floating point, and removes one obsolete comment that should have been removed before.
Attachment #564764 - Attachment is obsolete: true
Attachment #564764 - Flags: review?(fklockii)
Attachment #565163 - Flags: review?(fklockii)
In ByteArrayGlue.cpp you mention that you do not know of any byte swap instructions for ARM.  Some googling indicates they added a REV instruction in ARMv6 which appears to behave as byteSwap32; see also NSS's Bug 600106 which also discusses the gcc intrinsic for this.
(I assume the way you have written your code, you'll just inherit the GCC intrinsic on ARM platforms, regardless of what your comment says.  Still, you may want to update the comment accordingly, and also double-check that we are emitting it on arm targets.)
(In reply to Felix S Klock II from comment #8)
> (I assume the way you have written your code, you'll just inherit the GCC
> intrinsic on ARM platforms, regardless of what your comment says.  Still,
> you may want to update the comment accordingly, and also double-check that
> we are emitting it on arm targets.)

Oops, no, that's not true, you won't magically inherit the intrinsic; I overlooked the outer guard:

#if defined VMCFG_IA32 || defined VMCFG_AMD64 || defined VMCFG_PPC || defined VMCFG_MIPS || defined VMCFG_SH4

So you'll have to fix the code too.  Unless I'm wrong in my interpretation of what REV does and/or what GCC provides.
(In reply to Felix S Klock II from comment #7)
> In ByteArrayGlue.cpp you mention that you do not know of any byte swap
> instructions for ARM.  Some googling indicates they added a REV instruction
> in ARMv6 which appears to behave as byteSwap32; see also NSS's Bug 600106
> which also discusses the gcc intrinsic for this.

Thanks!

I wonder why this did not show up in my searches in the ARM docs and on the web?
(In reply to Lars T Hansen from comment #10)
> (In reply to Felix S Klock II from comment #7)
> > In ByteArrayGlue.cpp you mention that you do not know of any byte swap
> > instructions for ARM.  Some googling indicates they added a REV instruction
> > in ARMv6 which appears to behave as byteSwap32; see also NSS's Bug 600106
> > which also discusses the gcc intrinsic for this.
> 
> Thanks!
> 
> I wonder why this did not show up in my searches in the ARM docs and on the
> web?

<shrug> It was among my first hits when googling "arm byte swap"
Comment on attachment 565163 [details] [diff] [review]
Patch, v2

Review of attachment 565163 [details] [diff] [review]:
-----------------------------------------------------------------

R+.  My comments below are either suggestions for a few more static assertions or issues of micro-optimization and/or lack of clarity/contradiction in the comments.

::: core/ByteArrayGlue.cpp
@@ +99,5 @@
> +// strictly less than 2^32.  See constraint in ByteArray::Grower::EnsureWritableCapacity().
> +//
> +// MAX_BYTEARRAY_SHORT_ACCESS_LENGTH is a value no smaller than 4095 and no greater
> +// than 2^32-1-MAX_BYTEARRAY_STORE_LENGTH.
> +//

Consider adding static asserts somewhere for the constraint 
uint64_t(MAX_BYTEARRAY_STORE_LENGTH) + uint64_t(MAX_BYTEARRAY_SHORT_ACCESS_LENGTH)
< uint64_t(2^32).

(You could just replace the comment stating this with such a static assert.)

(I do see the static_assert that is checking the exact value of MAX_BYTEARRAY_STORE_LENGTH, but that's a much stronger constraint than the one you are stating here.)

@@ +211,2 @@
>          if (minimumCapacity > (MMgc::GCHeap::kMaxObjectSize - MMgc::GCHeap::kBlockSize*2))
>              m_owner->ThrowMemoryError();

You might as well update the right-hand side of the comparison above to use the new constant, right?

@@ +597,5 @@
> +    // to further optimizations but that seems unreasonably short.)
> +    //
> +    // Observe that ByteArray::Grower::EnsureWritableCapacity() implements the appropriate
> +    // limit on m_capacity, and that is the only code that allocates memory for ByteArray.
> +    // Everywhere else we ensure m_length <= m_capactiy.

typo: m_capactiy

@@ +609,5 @@
> +        // m_position + nbytes does not overflow uint32_t in the second disjunct because:
> +        //
> +        //   m_position < m_length <= m_capacity <= MAX_BYTEARRAY_STORE_LENGTH 
> +        //   nbytes <= MAX_BYTEARRAY_SHORT_ACCESS_LENGTH
> +        //   MAX_BYTEARRAY_STORE_LENGTH + MAX_BYTEARRAY_SHORT_ACCESS_LENGTH < 2^32

(see note above; this is where I was suggesting replacing comment with static assert code.)

@@ +611,5 @@
> +        //   m_position < m_length <= m_capacity <= MAX_BYTEARRAY_STORE_LENGTH 
> +        //   nbytes <= MAX_BYTEARRAY_SHORT_ACCESS_LENGTH
> +        //   MAX_BYTEARRAY_STORE_LENGTH + MAX_BYTEARRAY_SHORT_ACCESS_LENGTH < 2^32
> +
> +        if (m_position >= m_length || m_position + nbytes > m_length)

Why is the first disjunct necessary?  Doesn't it imply the second disjunct (since nbytes > 0 and you've already argued it won't overflow) and therefore the second on its own would suffice?

@@ +628,5 @@
> +        // m_position + nbytes does not overflow uint32_t in the second disjunct because:
> +        //
> +        //   m_position < m_length <= m_capacity <= MAX_BYTEARRAY_STORE_LENGTH 
> +        //   nbytes <= MAX_BYTEARRAY_SHORT_ACCESS_LENGTH
> +        //   MAX_BYTEARRAY_STORE_LENGTH + MAX_BYTEARRAY_SHORT_ACCESS_LENGTH < 2^32

(and this is another candidate for the static assert treatment; if you're going to phrase it as a comment, you might as well make it concrete, IMO.)

@@ +631,5 @@
> +        //   nbytes <= MAX_BYTEARRAY_SHORT_ACCESS_LENGTH
> +        //   MAX_BYTEARRAY_STORE_LENGTH + MAX_BYTEARRAY_SHORT_ACCESS_LENGTH < 2^32
> +
> +        if (m_position >= m_length || m_position + nbytes > m_length)
> +            SetLength(m_position, nbytes);  // The addition would *not* be safe against overflow here

I'm feeling really dense.  Why doesn't the argument you used above (to show that m_position+nbytes does not overflow uint32_t)  apply in this case as well?
Attachment #565163 - Flags: review?(fklockii) → review+
> >          if (minimumCapacity > (MMgc::GCHeap::kMaxObjectSize - MMgc::GCHeap::kBlockSize*2))
> >              m_owner->ThrowMemoryError();
> 
> You might as well update the right-hand side of the comparison above to use
> the new constant, right?

I don't think I will, I had explicitly decided not to do so, because while this test happens to use the same value as the constant the justification for the value is different: it is based on internal properties of FixedMalloc.

> @@ +611,5 @@
> > +        //   m_position < m_length <= m_capacity <= MAX_BYTEARRAY_STORE_LENGTH 
> > +        //   nbytes <= MAX_BYTEARRAY_SHORT_ACCESS_LENGTH
> > +        //   MAX_BYTEARRAY_STORE_LENGTH + MAX_BYTEARRAY_SHORT_ACCESS_LENGTH < 2^32
> > +
> > +        if (m_position >= m_length || m_position + nbytes > m_length)
> 
> Why is the first disjunct necessary?

It is intended to establish the conditions under which the second disjunct is meaningful.  Note the comment above applies to the second disjunct; without the first disjunct the properties of the second disjunct would not hold.

> @@ +631,5 @@
> > +        //   nbytes <= MAX_BYTEARRAY_SHORT_ACCESS_LENGTH
> > +        //   MAX_BYTEARRAY_STORE_LENGTH + MAX_BYTEARRAY_SHORT_ACCESS_LENGTH < 2^32
> > +
> > +        if (m_position >= m_length || m_position + nbytes > m_length)
> > +            SetLength(m_position, nbytes);  // The addition would *not* be safe against overflow here
> 
> I'm feeling really dense.  Why doesn't the argument you used above (to show
> that m_position+nbytes does not overflow uint32_t)  apply in this case as
> well?

Ditto: the first disjunct sets the stage for the second.

I'll ponder what you've written but I don't think I understand your objections.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
changeset: 6657:5532f94bc66b
user:      Lars T Hansen <lhansen@adobe.com>
summary:   Fix 692042 - Optimize ByteArray: readByte, readInt, readDouble, etc (r=fklockii)

http://hg.mozilla.org/tamarin-redux/rev/5532f94bc66b
(In reply to Lars T Hansen from comment #13)
> Note the comment above applies to the second disjunct;
> without the first disjunct the properties of the second disjunct would not
> hold.

Ah, this is what I was missing in my review.  You're right; I was thinking the argument applied to the whole disjunction, but of course that is bogus.
(In reply to Felix S Klock II from comment #15)
> (In reply to Lars T Hansen from comment #13)
> > Note the comment above applies to the second disjunct;
> > without the first disjunct the properties of the second disjunct would not
> > hold.
> 
> Ah, this is what I was missing in my review.  You're right; I was thinking
> the argument applied to the whole disjunction, but of course that is bogus.

BTW I clarified the comments some, in anticipation of this resolution of the issue :-)
See Bug 700983

(I may reopen this, or may just log info in the above ticket; it depends on what my initial investigation indicates.)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: