Closed Bug 586838 Opened 9 years ago Closed 2 years ago

Add NEON versions of LossyConvertEncoding

Categories

(Core :: String, defect)

ARM
All
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox59 --- fixed

People

(Reporter: justin.lebar+bug, Assigned: m_kato)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 3 obsolete files)

See bug 586698.  We should be able to do the same for NEON.
Depends on: 586698
Assignee: nobody → m_kato
Status: NEW → ASSIGNED
Attached patch fix v1 (obsolete) — Splinter Review
Attached patch fix v1.1 (obsolete) — Splinter Review
Attachment #559444 - Attachment is obsolete: true
Attachment #559445 - Flags: review?(justin.lebar+bug)
Comment on attachment 559445 [details] [diff] [review]
fix v1.1

In the future, can you please include more context in your patches?  This has just three lines of context.  Add

   [diff]
   showfunc = 1
   unified = 8

to your hgrc.

> +#if defined(MOZILLA_MAY_SUPPORT_NEON)
> +        if (mozilla::supports_neon())
> +          {
> +            p = write_ascii_neon(start, N);
> +          }
> +#endif

Unlike the lossy versions, this method stops when it sees a non-ascii
character.  Please add a comment explaining that here.

> +#include "nscore.h"
> +#include "nsUTF8Utils.h"
> +
> +const ConvertUTF8toUTF16::value_type*
> +ConvertUTF8toUTF16::write_ascii_neon( const value_type* start, PRUint32 N )
> +{
> +  const value_type* p = start;
> +  const value_type* end = start + N;
> +  buffer_type* out = mBuffer;
> +
> +  // Although ARMv7 supports unaligned aceess, unaligned access
> +  // is slow. So we need aligned acess to get performance

s/aceess/access/ (twice).  Also, please end the sentence with a period.

> +        "pld      [%1, #8]\n"

Like in the other bug, do you know that #8 here, or #16 / #32 elsewhere, is right?

> +    if (done_writing >= aSource + 8) {
> +      __asm__(
> +        "pld       [%0, #16]\n"

Do you need a pld here, and in the similar code above, since we're not in a
loop?

The rest of it looks good to me.  My only other question is, do you need to annotate the vldm / vstm instructions to indicate that your pointers are aligned?  In bug 548664, you have

> + "vst1.16   {q1}, [%0, :128]!\n".  Do you need to add similar things to this patch?
Attachment #559445 - Flags: review?(justin.lebar+bug) → review+
Attachment #559445 - Attachment is obsolete: true
Blocks: 1420369
Attachment #8933542 - Flags: review?(erahm)
arm32 might not allow unaligned access (depends on hardware configuration), so we have to consider unsigned access.

When using Cortex-A53 board (arm32 mode), this improves as twice as fast.
(In reply to Makoto Kato [:m_kato] (slow due to PTO?) from comment #6)
> arm32 might not allow unaligned access (depends on hardware configuration),
> so we have to consider unsigned access.
> 
> When using Cortex-A53 board (arm32 mode), this improves as twice as fast.

Just a heads up this might take me a little bit, I'm getting up to speed on neon intrinsics. Really happy to see this patch though!
no mozreview version.  Eric, if you cannot review this, please change better reviewer.
Attachment #8933542 - Attachment is obsolete: true
Attachment #8933542 - Flags: review?(erahm)
Attachment #8936391 - Flags: review?(erahm)
Comment on attachment 8936391 [details] [diff] [review]
Add NEON versions of LossyConvertEncoding

Review of attachment 8936391 [details] [diff] [review]:
-----------------------------------------------------------------

This looks like a good start. I have high level question: does the store actually require 16-byte alignment? Would it be possible to reduce the restriction on the alignment of the destination buffer like we do in the SSE code so that we can use this in more cases?

::: xpcom/string/nsUTF8UtilsNEON.cpp
@@ +15,5 @@
> +                                      uint32_t aSourceLength)
> +{
> +  char* dest = mDestination;
> +
> +  // Align source and destination to a 16-byte boundary.

This comment is slightly misleading. We're trying to align the source to a 16-byte boundary and the dest to an 8-byte boundary.

@@ +17,5 @@
> +  char* dest = mDestination;
> +
> +  // Align source and destination to a 16-byte boundary.
> +  uint32_t i = 0;
> +  while (((NS_PTR_TO_INT32(aSource + i) & 0xf) ||

The SIMD code actually negates NS_PTR_TO_INT32, I'm not sure if that matters here (maybe because we're using an intptr_t cast instead of a uintptr_t cast).

@@ +19,5 @@
> +  // Align source and destination to a 16-byte boundary.
> +  uint32_t i = 0;
> +  while (((NS_PTR_TO_INT32(aSource + i) & 0xf) ||
> +          (NS_PTR_TO_INT32(dest + i) & 0x7)) &&
> +         i < aSourceLength) {

But if we can't align `aSrc` by 16 and `aDst` by 8 we do all the work here.

I'm not sure we should be so strict here, I don't think `vst1[q]_u8` actually requires dest to be aligned (at least judging by some stack overflow comments [1]). If we were hand writing ASM then we could specify an alignment requirement. Does that seem correct to you? What does the disassembly look like for this?

[1] https://stackoverflow.com/a/37537993

@@ +61,5 @@
> +                                      uint32_t aSourceLength)
> +{
> +  char16_t* dest = mDestination;
> +
> +  // Align source and destination to a 16-byte boundary.

This is the reverse, but the same comments above apply. I think we should still probably align on `aSource`.

@@ +78,5 @@
> +  }
> +
> +  // Walk 16 bytes at a time.
> +  while (aSourceLength - i > 15) {
> +    uint8x16_t s = vld1q_u8(reinterpret_cast<const uint8_t*>(aSource + i));

load src into a q reg...why don't we just load into d reg, ie:

> uint16x8_t low  = vmovl_u8(vld1_u8(...));
> uint16x8_t high = vmovl_u8(vld1_u8(...));

It seems like that's one less instruction, but maybe working with a q reg will be quicker?
Attachment #8936391 - Flags: review?(erahm) → feedback+
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #9)
> Comment on attachment 8936391 [details] [diff] [review]
> Add NEON versions of LossyConvertEncoding
> 
> Review of attachment 8936391 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks like a good start. I have high level question: does the store
> actually require 16-byte alignment? Would it be possible to reduce the
> restriction on the alignment of the destination buffer like we do in the SSE
> code so that we can use this in more cases?

It looks like there is `__builtin_assume_aligned` for gcc and clang. We can use that to tell the compiler that it should use optimized assembly instructions. Perhaps we can use that for the source at least?

> ::: xpcom/string/nsUTF8UtilsNEON.cpp
> @@ +15,5 @@
> > +                                      uint32_t aSourceLength)
> > +{
> > +  char* dest = mDestination;
> > +
> > +  // Align source and destination to a 16-byte boundary.
> 
> This comment is slightly misleading. We're trying to align the source to a
> 16-byte boundary and the dest to an 8-byte boundary.
> 
> @@ +17,5 @@
> > +  char* dest = mDestination;
> > +
> > +  // Align source and destination to a 16-byte boundary.
> > +  uint32_t i = 0;
> > +  while (((NS_PTR_TO_INT32(aSource + i) & 0xf) ||
> 
> The SIMD code actually negates NS_PTR_TO_INT32, I'm not sure if that matters
> here (maybe because we're using an intptr_t cast instead of a uintptr_t
> cast).

I worked out what's going on here, the SSE version is a bit-twiddling hack. This version is okay, we should just be using uintptr_t directly instead of the macro, ie:

> while (uintptr_t(aSource + 1) & 0xf)
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #9)
> Comment on attachment 8936391 [details] [diff] [review]
> Add NEON versions of LossyConvertEncoding
> 
> Review of attachment 8936391 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks like a good start. I have high level question: does the store
> actually require 16-byte alignment? Would it be possible to reduce the
> restriction on the alignment of the destination buffer like we do in the SSE
> code so that we can use this in more cases?

It depends on hardware configuration (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344f/Cihejdic.html) for unalignment access on arm32.  SSE has unaligned access instruction, but no one on arm32.

Also, most current arm device that uses android, it allows unaligned access.
 
> ::: xpcom/string/nsUTF8UtilsNEON.cpp
> @@ +15,5 @@
> > +                                      uint32_t aSourceLength)
> > +{
> > +  char* dest = mDestination;
> > +
> > +  // Align source and destination to a 16-byte boundary.
> 
> This comment is slightly misleading. We're trying to align the source to a
> 16-byte boundary and the dest to an 8-byte boundary.
> 
> @@ +17,5 @@
> > +  char* dest = mDestination;
> > +
> > +  // Align source and destination to a 16-byte boundary.
> > +  uint32_t i = 0;
> > +  while (((NS_PTR_TO_INT32(aSource + i) & 0xf) ||
> 
> The SIMD code actually negates NS_PTR_TO_INT32, I'm not sure if that matters
> here (maybe because we're using an intptr_t cast instead of a uintptr_t
> cast).
> 
> @@ +19,5 @@
> > +  // Align source and destination to a 16-byte boundary.
> > +  uint32_t i = 0;
> > +  while (((NS_PTR_TO_INT32(aSource + i) & 0xf) ||
> > +          (NS_PTR_TO_INT32(dest + i) & 0x7)) &&
> > +         i < aSourceLength) {
> 
> But if we can't align `aSrc` by 16 and `aDst` by 8 we do all the work here.
> 
> I'm not sure we should be so strict here, I don't think `vst1[q]_u8`
> actually requires dest to be aligned (at least judging by some stack
> overflow comments [1]). If we were hand writing ASM then we could specify an
> alignment requirement. Does that seem correct to you? What does the
> disassembly look like for this?
> 
> [1] https://stackoverflow.com/a/37537993

According to http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344f/Cihejdic.html, "If an alignment qualifier is not specified, and A=1, the alignment fault is taken if it is not aligned to element size."

> 
> @@ +61,5 @@
> > +                                      uint32_t aSourceLength)
> > +{
> > +  char16_t* dest = mDestination;
> > +
> > +  // Align source and destination to a 16-byte boundary.
> 
> This is the reverse, but the same comments above apply. I think we should
> still probably align on `aSource`.
> 
> @@ +78,5 @@
> > +  }
> > +
> > +  // Walk 16 bytes at a time.
> > +  while (aSourceLength - i > 15) {
> > +    uint8x16_t s = vld1q_u8(reinterpret_cast<const uint8_t*>(aSource + i));
> 
> load src into a q reg...why don't we just load into d reg, ie:
> 
> > uint16x8_t low  = vmovl_u8(vld1_u8(...));
> > uint16x8_t high = vmovl_u8(vld1_u8(...));
> 
> It seems like that's one less instruction, but maybe working with a q reg
> will be quicker?

Qx reg is Dx reg pair on NEON.  So this code will be compiled like

  86:   f964 0a2d       vld1.8  {d16-d17}, [r4 :128]!
  8a:   f105 0620       add.w   r6, r5, #32
  8e:   3b10            subs    r3, #16
  90:   ffc8 2a30       vmovl.u8        q9, d16
  94:   ffc8 0a31       vmovl.u8        q8, d17
  98:   f945 2a6d       vst1.16 {d18-d19}, [r5 :128]!
  9c:   f945 0a6f       vst1.16 {d16-d17}, [r5 :128]

(this code uses __builtin_assume_aligned, so alignment qualifier is added)
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #10)
> (In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment
> #9)
> > Comment on attachment 8936391 [details] [diff] [review]
> > Add NEON versions of LossyConvertEncoding
> > 
> > Review of attachment 8936391 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > This looks like a good start. I have high level question: does the store
> > actually require 16-byte alignment? Would it be possible to reduce the
> > restriction on the alignment of the destination buffer like we do in the SSE
> > code so that we can use this in more cases?
> 
> It looks like there is `__builtin_assume_aligned` for gcc and clang. We can
> use that to tell the compiler that it should use optimized assembly
> instructions. Perhaps we can use that for the source at least?

__builtin_assume_aligned can add alignment qualifier.  I should use it.
(In reply to Makoto Kato [:m_kato] from comment #14)
> Created attachment 8940745 [details] [diff] [review]
> Add NEON versions of LossyConvertEncoding v2

Thanks for the update, I should be able to look at this tomorrow.
(In reply to Makoto Kato [:m_kato] from comment #11)

> According to
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344f/
> Cihejdic.html, "If an alignment qualifier is not specified, and A=1, the
> alignment fault is taken if it is not aligned to element size."

I think "element size" is the key here, I took that to mean `alignmentof(uint8_t)` for `vst1q_u8`. I am by no means an expert on ARM, so if you know I'm wrong or feel strongly about keeping dst aligned I'm fine with that, you don't need to prove it to me.

That said, it would be nice if we could test on a device with A=1. 

> Qx reg is Dx reg pair on NEON.  So this code will be compiled like

I just learned something very cool :)
Comment on attachment 8940745 [details] [diff] [review]
Add NEON versions of LossyConvertEncoding v2

Review of attachment 8940745 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, I'm still not totally sure we need 8/16 byte alignment for the dst but I'm okay with this as-is.

::: xpcom/string/moz.build
@@ +60,5 @@
>      SOURCES['nsReadableUtilsSSE2.cpp'].flags += CONFIG['SSE2_FLAGS']
>  
> +if CONFIG['BUILD_ARM_NEON'] or CONFIG['CPU_ARCH'] == 'aarch64':
> +    SOURCES += ['nsUTF8UtilsNEON.cpp']
> +    SOURCES['nsUTF8UtilsNEON.cpp'].flags += CONFIG['NEON_FLAGS']

Do we need neon flags for 'aarch64'? I'm guessing it doesn't matter, but double checking.

::: xpcom/string/nsUTF8Utils.h
@@ +664,5 @@
>        write_sse2(aSource, aSourceLength);
>        return;
>      }
>  #endif
> +#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(MOZ_LITTLE_ENDIAN)

Why only little endian?

::: xpcom/string/nsUTF8UtilsNEON.cpp
@@ +36,5 @@
> +            vmovn_u16(s));
> +    i += 8;
> +  }
> +
> +  // Align source and destination to a 16-byte boundary.

I wonder if we could just declare aligned versions here and avoid the `_builtin_assume_aligned`s below, ie:

> const char16_t* src = reinterpret_cast<const uint16_t*>(__builtin_assume_aligned(aSource + i, 16);
> dest = __builtin_assume_aligned(dest + i, 16)
> 
>  while (aSourceLength - i > 15) {
>    uint16x8_t low = vld1q_u16(src + i);
>    uint16x8_t low = vld1q_u16(src + i + 8);
>    ...

I assume the compiler can tell we're going in 16-byte increments.
Attachment #8940745 - Flags: review?(erahm) → review+
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #17)
> Comment on attachment 8940745 [details] [diff] [review]
> Add NEON versions of LossyConvertEncoding v2
> 
> Review of attachment 8940745 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good, I'm still not totally sure we need 8/16 byte alignment for the
> dst but I'm okay with this as-is.
> 
> ::: xpcom/string/moz.build
> @@ +60,5 @@
> >      SOURCES['nsReadableUtilsSSE2.cpp'].flags += CONFIG['SSE2_FLAGS']
> >  
> > +if CONFIG['BUILD_ARM_NEON'] or CONFIG['CPU_ARCH'] == 'aarch64':
> > +    SOURCES += ['nsUTF8UtilsNEON.cpp']
> > +    SOURCES['nsUTF8UtilsNEON.cpp'].flags += CONFIG['NEON_FLAGS']
> 
> Do we need neon flags for 'aarch64'? I'm guessing it doesn't matter, but
> double checking.

Unnecessary on aarch64.  But, NEON_FLAGS is useful macro to use NEON on both 32bit and 64bit (bug 1298569).
 
> ::: xpcom/string/nsUTF8Utils.h
> @@ +664,5 @@
> >        write_sse2(aSource, aSourceLength);
> >        return;
> >      }
> >  #endif
> > +#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(MOZ_LITTLE_ENDIAN)
> 
> Why only little endian?

I have no environment for big endian.  vcombine_u8 should be replace high with low for big endian. I will follow up it by another bug.
 
> ::: xpcom/string/nsUTF8UtilsNEON.cpp
> @@ +36,5 @@
> > +            vmovn_u16(s));
> > +    i += 8;
> > +  }
> > +
> > +  // Align source and destination to a 16-byte boundary.
> 
> I wonder if we could just declare aligned versions here and avoid the
> `_builtin_assume_aligned`s below, ie:
> 
> > const char16_t* src = reinterpret_cast<const uint16_t*>(__builtin_assume_aligned(aSource + i, 16);
> > dest = __builtin_assume_aligned(dest + i, 16)
> > 
> >  while (aSourceLength - i > 15) {
> >    uint16x8_t low = vld1q_u16(src + i);
> >    uint16x8_t low = vld1q_u16(src + i + 8);
> >    ...
> 
> I assume the compiler can tell we're going in 16-byte increments.

When using it, alignment qualifier isn't added on clang-5.0 unfortunately.
Pushed by m_kato@ga2.so-net.ne.jp:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3db5dd3a6fe6
Add NEON versions of LossyConvertEncoding. r=erahm
https://hg.mozilla.org/mozilla-central/rev/3db5dd3a6fe6
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Do I understand correctly that we build this code by default when compiling on an ARMv7 GNU/Linux host?
Flags: needinfo?(m_kato)
(In reply to Henri Sivonen (:hsivonen) from comment #21)
> Do I understand correctly that we build this code by default when compiling
> on an ARMv7 GNU/Linux host?

When target is Linux/arm that compiler supports NEON, this is compiled.  This code is tested on Android/arm and Ubuntu/armeabihf 16.04.
Flags: needinfo?(m_kato)
You need to log in before you can comment on or make changes to this bug.