692042 - Optimize ByteArray: readByte, readInt, readDouble, etc

Assignee

Description

•

13 years ago

The "stream" oriented methods on ByteArray are quite slow, involving many function calls (one of them virtual) and eventually ending up in memcpy to move a tiny number of bytes.  It is easy to do much better by cleaning up the C++ code.

Lars T Hansen

Assignee

Comment 1

•

13 years ago

Attached patch Patch (obsolete) — Details — Splinter Review

Changes for the easy optimizations, ready for a buildbot run-through (no R? yet).

Performance profile for this change relative to TR 6547, Mac-32 Release, 2x2.93GHz Xeon, OS 10.6, 5 iterations.  Slowdown on bytearray-read-utfbytes-1 needs investigation but is probably noise.

                                        avm            avm2
test                           best     avg    best     avg   %dBst   %dAvg
Metric: iterations/second 
Dir: asmicro/
 bytearray-read-available-1   143.7   143.7   143.9   143.7     0.1     0.0   
  bytearray-read-bool-1         7.8     7.8    19.1    19.1   145.7   145.7 ++
  bytearray-read-byte-1         8.6     8.6    23.3    23.2   171.3   171.2 ++
  bytearray-read-byte-2       106.9   106.8   107.0   106.8     0.1     0.0   
  bytearray-read-byte-3        99.3    99.2    99.6    99.5     0.3     0.3   
  bytearray-read-bytes-1       16.7    16.6    16.7    16.7     0.2     0.2   
  bytearray-read-bytes-2      862.1   861.6   858.1   857.6    -0.5    -0.5 - 
  bytearray-read-double-1       6.2     6.2    16.0    15.9   156.6   156.6 ++
  bytearray-read-double-2       8.1     8.1    29.0    29.0   256.9   257.0 ++
  bytearray-read-float-1        7.5     7.5    19.8    19.5   165.1   161.0 ++
  bytearray-read-float-2        7.9     7.9    26.1    26.0   230.6   229.4 ++
  bytearray-read-int-1          8.4     8.4    24.4    24.3   190.2   190.1 ++
  bytearray-read-int-2          8.8     8.8    29.7    29.7   236.6   236.3 ++
  bytearray-read-length-1     170.8   170.7   170.8   170.6    -0.0    -0.0   
  bytearray-read-position-1   170.8   170.8   170.8   170.6    -0.0    -0.1   
  bytearray-read-short-1        7.8     7.7    28.1    28.1   257.9   263.5 ++
  bytearray-read-short-2        7.7     7.7    28.9    28.9   273.3   273.3 ++
  bytearray-read-ubyte-1        8.4     8.4    23.5    23.5   179.0   179.3 ++
  bytearray-read-uint-1         8.4     8.4    24.5    24.5   191.7   191.6 ++
  bytearray-read-uint-2         8.9     8.9    29.7    29.6   234.7   233.9 ++
  bytearray-read-ushort-1       7.8     7.8    28.0    28.0   261.8   261.4 ++
  bytearray-read-ushort-2       8.0     8.0    28.5    28.5   257.5   257.1 ++
  bytearray-read-utf-1         16.5    16.4    16.6    16.5     1.0     0.8   
  bytearray-read-utf-2          3.4     3.4     3.5     3.4     3.3     1.4 + 
  bytearray-read-utfbytes-1    18.7    18.7    17.5    17.5    -6.5    -6.7 --
  bytearray-read-utfbytes-2     3.4     3.4     3.5     3.5     4.1     2.0 + 
  bytearray-write-bool-1        6.2     6.1    19.1    19.1   210.0   210.3 ++
  bytearray-write-byte-1        6.2     6.2    14.6    14.5   133.2   133.1 ++
  bytearray-write-byte-2       81.6    81.5    85.4    84.1     4.7     3.2 + 
  bytearray-write-byte-3       81.6    81.5    81.8    81.6     0.2     0.1   
  bytearray-write-bytes-1      21.6    21.6    21.6    21.6    -0.0    -0.0   
  bytearray-write-bytes-2    1223.8  1222.4  1218.8  1217.8    -0.4    -0.4 - 
  bytearray-write-double-1      4.7     4.7    11.8    11.8   150.7   150.5 ++
  bytearray-write-double-2      5.6     5.6    13.3    13.3   136.2   136.0 ++
  bytearray-write-float-1       5.6     5.6    13.7    13.7   145.5   145.5 ++
  bytearray-write-float-2       5.9     5.9    13.3    13.3   124.2   124.4 ++
  bytearray-write-int-1         6.1     6.1    16.8    16.8   174.5   174.4 ++
  bytearray-write-int-2         6.2     6.1    16.6    16.5   169.9   168.5 ++
  bytearray-write-short-1       5.9     5.9    16.5    16.5   180.9   181.1 ++
  bytearray-write-short-2       5.8     5.8    15.4    15.4   166.9   167.0 ++
  bytearray-write-uint-1        6.1     6.1    16.9    16.8   176.7   176.0 ++
  bytearray-write-uint-2        6.2     6.1    15.8    15.8   157.0   156.4 ++

Lars T Hansen

Assignee

Updated

•

13 years ago

Priority: -- → P3

Lars T Hansen

Assignee

Updated

•

13 years ago

Blocks: 692046

Lars T Hansen

Assignee

Comment 2

•

13 years ago

Comment on attachment 564764 [details] [diff] [review]
Patch

Buildbot shows a MIPS failure on a Strings benchmark that's been judged spurious by Brent.

The performance drop on read-utfbytes is still being investigated.

Meanwhile, time for a review.

Attachment #564764 - Flags: review?(fklockii)

Lars T Hansen

Assignee

Comment 3

•

13 years ago

I judge that the performance discrepancy is noise, possibly icache placement related.  Consider these two runs with configurations / setup equivalent to the previous runs:

  bytearray-read-utfbytes-1    18.6    18.5    17.4    17.3    -6.4    -6.6 --

  bytearray-read-utfbytes-1    16.4    16.4    17.5    17.5     6.8     6.6 ++

The first run is with the benchmark as written.  The second benchmark is with two spurious "print" statements added before the main benchmark loop.  True, the effect is to slow down the old code by 13% without changing the performance of the new code, not to improve the new code, but the point stands: the change seen on this benchmark is almost certainly noise.

Lars T Hansen

Assignee

Comment 4

•

13 years ago

BTW, I've tested with the tests and bug fixes in bug #691251 and with the additional tests currently awaiting review in bug #687045 (in that order).  Those may be needed to apply the patch, should that be desirable.

Lars T Hansen

Assignee

Comment 5

•

13 years ago

There will have to be an adjustment to this patch: In bug #685441 it is revealed that not only will ARM compilers defeat the attempted use of unaligned uint32_t loads to load 32-bit floating point values (this was previously known and is commented upon in the code in the patch), but they will also defeat the attempted use of unaligned uint32_t loads to load 64-bit floating-point values.  (Ditto for stores.)

That's surprising and annoying - what's the point of the architecture allowing unaligned loads the compilers make them unreliable - but in practice ARM is the only non-x86 architecture that allows unaligned int loads at all (without using multi-instruction sequences), and furthermore on ARM the unaligned memory ops are so slow that they're not worth the hassle.

Anyway the impact is that those cases have to be removed from readDouble and writeDouble in this patch.  Please do the review assuming that that's the case.

Lars T Hansen

Assignee

Updated

•

13 years ago

Whiteboard: PACMAN

Lars T Hansen

Assignee

Comment 6

•

13 years ago

Attached patch Patch, v2 — Details — Splinter Review

This patch is essentially identical to the previous one, but removes the inappropriate use of unaligned int access to load/store floating point, and removes one obsolete comment that should have been removed before.

Attachment #564764 - Attachment is obsolete: true

Attachment #564764 - Flags: review?(fklockii)

Attachment #565163 - Flags: review?(fklockii)

Felix S. Klock II [:pnkfelix, :fklock]

Comment 7

•

13 years ago

In ByteArrayGlue.cpp you mention that you do not know of any byte swap instructions for ARM.  Some googling indicates they added a REV instruction in ARMv6 which appears to behave as byteSwap32; see also NSS's Bug 600106 which also discusses the gcc intrinsic for this.

Felix S. Klock II [:pnkfelix, :fklock]

Comment 8

•

13 years ago

(I assume the way you have written your code, you'll just inherit the GCC intrinsic on ARM platforms, regardless of what your comment says.  Still, you may want to update the comment accordingly, and also double-check that we are emitting it on arm targets.)

Felix S. Klock II [:pnkfelix, :fklock]

Comment 9

•

13 years ago

(In reply to Felix S Klock II from comment #8)
> (I assume the way you have written your code, you'll just inherit the GCC
> intrinsic on ARM platforms, regardless of what your comment says.  Still,
> you may want to update the comment accordingly, and also double-check that
> we are emitting it on arm targets.)

Oops, no, that's not true, you won't magically inherit the intrinsic; I overlooked the outer guard:

#if defined VMCFG_IA32 || defined VMCFG_AMD64 || defined VMCFG_PPC || defined VMCFG_MIPS || defined VMCFG_SH4

So you'll have to fix the code too.  Unless I'm wrong in my interpretation of what REV does and/or what GCC provides.

Lars T Hansen

Assignee

Comment 10

•

13 years ago

(In reply to Felix S Klock II from comment #7)
> In ByteArrayGlue.cpp you mention that you do not know of any byte swap
> instructions for ARM.  Some googling indicates they added a REV instruction
> in ARMv6 which appears to behave as byteSwap32; see also NSS's Bug 600106
> which also discusses the gcc intrinsic for this.

Thanks!

I wonder why this did not show up in my searches in the ARM docs and on the web?

Felix S. Klock II [:pnkfelix, :fklock]

Comment 11

•

13 years ago

(In reply to Lars T Hansen from comment #10)
> (In reply to Felix S Klock II from comment #7)
> > In ByteArrayGlue.cpp you mention that you do not know of any byte swap
> > instructions for ARM.  Some googling indicates they added a REV instruction
> > in ARMv6 which appears to behave as byteSwap32; see also NSS's Bug 600106
> > which also discusses the gcc intrinsic for this.
> 
> Thanks!
> 
> I wonder why this did not show up in my searches in the ARM docs and on the
> web?

<shrug> It was among my first hits when googling "arm byte swap"

Felix S. Klock II [:pnkfelix, :fklock]

Comment 12

•

13 years ago

Comment on attachment 565163 [details] [diff] [review]
Patch, v2

Review of attachment 565163 [details] [diff] [review]:
-----------------------------------------------------------------

R+.  My comments below are either suggestions for a few more static assertions or issues of micro-optimization and/or lack of clarity/contradiction in the comments.

::: core/ByteArrayGlue.cpp
@@ +99,5 @@
> +// strictly less than 2^32.  See constraint in ByteArray::Grower::EnsureWritableCapacity().
> +//
> +// MAX_BYTEARRAY_SHORT_ACCESS_LENGTH is a value no smaller than 4095 and no greater
> +// than 2^32-1-MAX_BYTEARRAY_STORE_LENGTH.
> +//

Consider adding static asserts somewhere for the constraint 
uint64_t(MAX_BYTEARRAY_STORE_LENGTH) + uint64_t(MAX_BYTEARRAY_SHORT_ACCESS_LENGTH)
< uint64_t(2^32).

(You could just replace the comment stating this with such a static assert.)

(I do see the static_assert that is checking the exact value of MAX_BYTEARRAY_STORE_LENGTH, but that's a much stronger constraint than the one you are stating here.)

@@ +211,2 @@
>          if (minimumCapacity > (MMgc::GCHeap::kMaxObjectSize - MMgc::GCHeap::kBlockSize*2))
>              m_owner->ThrowMemoryError();

You might as well update the right-hand side of the comparison above to use the new constant, right?

@@ +597,5 @@
> +    // to further optimizations but that seems unreasonably short.)
> +    //
> +    // Observe that ByteArray::Grower::EnsureWritableCapacity() implements the appropriate
> +    // limit on m_capacity, and that is the only code that allocates memory for ByteArray.
> +    // Everywhere else we ensure m_length <= m_capactiy.

typo: m_capactiy

@@ +609,5 @@
> +        // m_position + nbytes does not overflow uint32_t in the second disjunct because:
> +        //
> +        //   m_position < m_length <= m_capacity <= MAX_BYTEARRAY_STORE_LENGTH 
> +        //   nbytes <= MAX_BYTEARRAY_SHORT_ACCESS_LENGTH
> +        //   MAX_BYTEARRAY_STORE_LENGTH + MAX_BYTEARRAY_SHORT_ACCESS_LENGTH < 2^32

(see note above; this is where I was suggesting replacing comment with static assert code.)

@@ +611,5 @@
> +        //   m_position < m_length <= m_capacity <= MAX_BYTEARRAY_STORE_LENGTH 
> +        //   nbytes <= MAX_BYTEARRAY_SHORT_ACCESS_LENGTH
> +        //   MAX_BYTEARRAY_STORE_LENGTH + MAX_BYTEARRAY_SHORT_ACCESS_LENGTH < 2^32
> +
> +        if (m_position >= m_length || m_position + nbytes > m_length)

Why is the first disjunct necessary?  Doesn't it imply the second disjunct (since nbytes > 0 and you've already argued it won't overflow) and therefore the second on its own would suffice?

@@ +628,5 @@
> +        // m_position + nbytes does not overflow uint32_t in the second disjunct because:
> +        //
> +        //   m_position < m_length <= m_capacity <= MAX_BYTEARRAY_STORE_LENGTH 
> +        //   nbytes <= MAX_BYTEARRAY_SHORT_ACCESS_LENGTH
> +        //   MAX_BYTEARRAY_STORE_LENGTH + MAX_BYTEARRAY_SHORT_ACCESS_LENGTH < 2^32

(and this is another candidate for the static assert treatment; if you're going to phrase it as a comment, you might as well make it concrete, IMO.)

@@ +631,5 @@
> +        //   nbytes <= MAX_BYTEARRAY_SHORT_ACCESS_LENGTH
> +        //   MAX_BYTEARRAY_STORE_LENGTH + MAX_BYTEARRAY_SHORT_ACCESS_LENGTH < 2^32
> +
> +        if (m_position >= m_length || m_position + nbytes > m_length)
> +            SetLength(m_position, nbytes);  // The addition would *not* be safe against overflow here

I'm feeling really dense.  Why doesn't the argument you used above (to show that m_position+nbytes does not overflow uint32_t)  apply in this case as well?

Felix S. Klock II [:pnkfelix, :fklock]

Updated

•

13 years ago

Attachment #565163 - Flags: review?(fklockii) → review+

Lars T Hansen

Assignee

Comment 13

•

13 years ago

> >          if (minimumCapacity > (MMgc::GCHeap::kMaxObjectSize - MMgc::GCHeap::kBlockSize*2))
> >              m_owner->ThrowMemoryError();
> 
> You might as well update the right-hand side of the comparison above to use
> the new constant, right?

I don't think I will, I had explicitly decided not to do so, because while this test happens to use the same value as the constant the justification for the value is different: it is based on internal properties of FixedMalloc.

> @@ +611,5 @@
> > +        //   m_position < m_length <= m_capacity <= MAX_BYTEARRAY_STORE_LENGTH 
> > +        //   nbytes <= MAX_BYTEARRAY_SHORT_ACCESS_LENGTH
> > +        //   MAX_BYTEARRAY_STORE_LENGTH + MAX_BYTEARRAY_SHORT_ACCESS_LENGTH < 2^32
> > +
> > +        if (m_position >= m_length || m_position + nbytes > m_length)
> 
> Why is the first disjunct necessary?

It is intended to establish the conditions under which the second disjunct is meaningful.  Note the comment above applies to the second disjunct; without the first disjunct the properties of the second disjunct would not hold.

> @@ +631,5 @@
> > +        //   nbytes <= MAX_BYTEARRAY_SHORT_ACCESS_LENGTH
> > +        //   MAX_BYTEARRAY_STORE_LENGTH + MAX_BYTEARRAY_SHORT_ACCESS_LENGTH < 2^32
> > +
> > +        if (m_position >= m_length || m_position + nbytes > m_length)
> > +            SetLength(m_position, nbytes);  // The addition would *not* be safe against overflow here
> 
> I'm feeling really dense.  Why doesn't the argument you used above (to show
> that m_position+nbytes does not overflow uint32_t)  apply in this case as
> well?

Ditto: the first disjunct sets the stage for the second.

I'll ponder what you've written but I don't think I understand your objections.

Lars T Hansen

Assignee

Updated

•

13 years ago

Status: ASSIGNED → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Tamarin Bot

Comment 14

•

13 years ago

changeset: 6657:5532f94bc66b
user:      Lars T Hansen <lhansen@adobe.com>
summary:   Fix 692042 - Optimize ByteArray: readByte, readInt, readDouble, etc (r=fklockii)

http://hg.mozilla.org/tamarin-redux/rev/5532f94bc66b

Felix S. Klock II [:pnkfelix, :fklock]

Comment 15

•

13 years ago

(In reply to Lars T Hansen from comment #13)
> Note the comment above applies to the second disjunct;
> without the first disjunct the properties of the second disjunct would not
> hold.

Ah, this is what I was missing in my review.  You're right; I was thinking the argument applied to the whole disjunction, but of course that is bogus.

Lars T Hansen

Assignee

Comment 16

•

13 years ago

(In reply to Felix S Klock II from comment #15)
> (In reply to Lars T Hansen from comment #13)
> > Note the comment above applies to the second disjunct;
> > without the first disjunct the properties of the second disjunct would not
> > hold.
> 
> Ah, this is what I was missing in my review.  You're right; I was thinking
> the argument applied to the whole disjunction, but of course that is bogus.

BTW I clarified the comments some, in anticipation of this resolution of the issue :-)

Felix S. Klock II [:pnkfelix, :fklock]

Comment 17

•

13 years ago

See Bug 700983

(I may reopen this, or may just log info in the above ticket; it depends on what my initial investigation indicates.)

Patch 13 years ago Lars T Hansen 28.84 KB, patch		Details \| Diff \| Splinter Review
Patch, v2 13 years ago Lars T Hansen 28.92 KB, patch	pnkfelix : review+	Details \| Diff \| Splinter Review