586838 - Add NEON versions of LossyConvertEncoding

Reporter

Description

•

14 years ago

See bug 586698.  We should be able to do the same for NEON.

Makoto Kato [:m_kato]

Assignee

Updated

•

14 years ago

Depends on: 586698

Makoto Kato [:m_kato]

Assignee

Updated

•

13 years ago

Assignee: nobody → m_kato

Status: NEW → ASSIGNED

Makoto Kato [:m_kato]

Assignee

Comment 1

•

13 years ago

Attached patch fix v1 (obsolete) — Details — Splinter Review

Makoto Kato [:m_kato]

Assignee

Comment 2

•

13 years ago

Attached patch fix v1.1 (obsolete) — Details — Splinter Review

Attachment #559444 - Attachment is obsolete: true

Makoto Kato [:m_kato]

Assignee

Updated

•

13 years ago

Attachment #559445 - Flags: review?(justin.lebar+bug)

Justin Lebar (not reading bugmail)

Reporter

Comment 3

•

13 years ago

Comment on attachment 559445 [details] [diff] [review]
fix v1.1

In the future, can you please include more context in your patches?  This has just three lines of context.  Add

   [diff]
   showfunc = 1
   unified = 8

to your hgrc.

> +#if defined(MOZILLA_MAY_SUPPORT_NEON)
> +        if (mozilla::supports_neon())
> +          {
> +            p = write_ascii_neon(start, N);
> +          }
> +#endif

Unlike the lossy versions, this method stops when it sees a non-ascii
character.  Please add a comment explaining that here.

> +#include "nscore.h"
> +#include "nsUTF8Utils.h"
> +
> +const ConvertUTF8toUTF16::value_type*
> +ConvertUTF8toUTF16::write_ascii_neon( const value_type* start, PRUint32 N )
> +{
> +  const value_type* p = start;
> +  const value_type* end = start + N;
> +  buffer_type* out = mBuffer;
> +
> +  // Although ARMv7 supports unaligned aceess, unaligned access
> +  // is slow. So we need aligned acess to get performance

s/aceess/access/ (twice).  Also, please end the sentence with a period.

> +        "pld      [%1, #8]\n"

Like in the other bug, do you know that #8 here, or #16 / #32 elsewhere, is right?

> +    if (done_writing >= aSource + 8) {
> +      __asm__(
> +        "pld       [%0, #16]\n"

Do you need a pld here, and in the similar code above, since we're not in a
loop?

The rest of it looks good to me.  My only other question is, do you need to annotate the vldm / vstm instructions to indicate that your pointers are aligned?  In bug 548664, you have

> + "vst1.16   {q1}, [%0, :128]!\n".  Do you need to add similar things to this patch?

Attachment #559445 - Flags: review?(justin.lebar+bug) → review+

Makoto Kato [:m_kato]

Assignee

Updated

•

7 years ago

Attachment #559445 - Attachment is obsolete: true

Makoto Kato [:m_kato]

Assignee

Updated

•

7 years ago

Blocks: 1420369

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/204486/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/204486/

Comment hidden (mozreview-request)

Comment on attachment 8933542 [details]
Bug 586838 - Add NEON versions of LossyConvertEncoding.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/204486/diff/1-2/

Makoto Kato [:m_kato]

Assignee

Updated

•

7 years ago

Attachment #8933542 - Flags: review?(erahm)

Makoto Kato [:m_kato]

Assignee

Comment 6

•

7 years ago

arm32 might not allow unaligned access (depends on hardware configuration), so we have to consider unsigned access.

When using Cortex-A53 board (arm32 mode), this improves as twice as fast.

Eric Rahm [:erahm]

Comment 7

•

7 years ago

(In reply to Makoto Kato [:m_kato] (slow due to PTO?) from comment #6)
> arm32 might not allow unaligned access (depends on hardware configuration),
> so we have to consider unsigned access.
> 
> When using Cortex-A53 board (arm32 mode), this improves as twice as fast.

Just a heads up this might take me a little bit, I'm getting up to speed on neon intrinsics. Really happy to see this patch though!

Makoto Kato [:m_kato]

Assignee

Comment 8

•

7 years ago

Attached patch Add NEON versions of LossyConvertEncoding — Details — Splinter Review

no mozreview version.  Eric, if you cannot review this, please change better reviewer.

Attachment #8933542 - Attachment is obsolete: true

Attachment #8933542 - Flags: review?(erahm)

Attachment #8936391 - Flags: review?(erahm)

Eric Rahm [:erahm]

Comment 9

•

7 years ago

Comment on attachment 8936391 [details] [diff] [review]
Add NEON versions of LossyConvertEncoding

Review of attachment 8936391 [details] [diff] [review]:
-----------------------------------------------------------------

This looks like a good start. I have high level question: does the store actually require 16-byte alignment? Would it be possible to reduce the restriction on the alignment of the destination buffer like we do in the SSE code so that we can use this in more cases?

::: xpcom/string/nsUTF8UtilsNEON.cpp
@@ +15,5 @@
> +                                      uint32_t aSourceLength)
> +{
> +  char* dest = mDestination;
> +
> +  // Align source and destination to a 16-byte boundary.

This comment is slightly misleading. We're trying to align the source to a 16-byte boundary and the dest to an 8-byte boundary.

@@ +17,5 @@
> +  char* dest = mDestination;
> +
> +  // Align source and destination to a 16-byte boundary.
> +  uint32_t i = 0;
> +  while (((NS_PTR_TO_INT32(aSource + i) & 0xf) ||

The SIMD code actually negates NS_PTR_TO_INT32, I'm not sure if that matters here (maybe because we're using an intptr_t cast instead of a uintptr_t cast).

@@ +19,5 @@
> +  // Align source and destination to a 16-byte boundary.
> +  uint32_t i = 0;
> +  while (((NS_PTR_TO_INT32(aSource + i) & 0xf) ||
> +          (NS_PTR_TO_INT32(dest + i) & 0x7)) &&
> +         i < aSourceLength) {

But if we can't align `aSrc` by 16 and `aDst` by 8 we do all the work here.

I'm not sure we should be so strict here, I don't think `vst1[q]_u8` actually requires dest to be aligned (at least judging by some stack overflow comments [1]). If we were hand writing ASM then we could specify an alignment requirement. Does that seem correct to you? What does the disassembly look like for this?

[1] https://stackoverflow.com/a/37537993

@@ +61,5 @@
> +                                      uint32_t aSourceLength)
> +{
> +  char16_t* dest = mDestination;
> +
> +  // Align source and destination to a 16-byte boundary.

This is the reverse, but the same comments above apply. I think we should still probably align on `aSource`.

@@ +78,5 @@
> +  }
> +
> +  // Walk 16 bytes at a time.
> +  while (aSourceLength - i > 15) {
> +    uint8x16_t s = vld1q_u8(reinterpret_cast<const uint8_t*>(aSource + i));

load src into a q reg...why don't we just load into d reg, ie:

> uint16x8_t low  = vmovl_u8(vld1_u8(...));
> uint16x8_t high = vmovl_u8(vld1_u8(...));

It seems like that's one less instruction, but maybe working with a q reg will be quicker?

Attachment #8936391 - Flags: review?(erahm) → feedback+

Eric Rahm [:erahm]

Comment 10

•

6 years ago

(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #9)
> Comment on attachment 8936391 [details] [diff] [review]
> Add NEON versions of LossyConvertEncoding
> 
> Review of attachment 8936391 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks like a good start. I have high level question: does the store
> actually require 16-byte alignment? Would it be possible to reduce the
> restriction on the alignment of the destination buffer like we do in the SSE
> code so that we can use this in more cases?

It looks like there is `__builtin_assume_aligned` for gcc and clang. We can use that to tell the compiler that it should use optimized assembly instructions. Perhaps we can use that for the source at least?

> ::: xpcom/string/nsUTF8UtilsNEON.cpp
> @@ +15,5 @@
> > +                                      uint32_t aSourceLength)
> > +{
> > +  char* dest = mDestination;
> > +
> > +  // Align source and destination to a 16-byte boundary.
> 
> This comment is slightly misleading. We're trying to align the source to a
> 16-byte boundary and the dest to an 8-byte boundary.
> 
> @@ +17,5 @@
> > +  char* dest = mDestination;
> > +
> > +  // Align source and destination to a 16-byte boundary.
> > +  uint32_t i = 0;
> > +  while (((NS_PTR_TO_INT32(aSource + i) & 0xf) ||
> 
> The SIMD code actually negates NS_PTR_TO_INT32, I'm not sure if that matters
> here (maybe because we're using an intptr_t cast instead of a uintptr_t
> cast).

I worked out what's going on here, the SSE version is a bit-twiddling hack. This version is okay, we should just be using uintptr_t directly instead of the macro, ie:

> while (uintptr_t(aSource + 1) & 0xf)

Makoto Kato [:m_kato]

Assignee

Comment 11

•

6 years ago

(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #9)
> Comment on attachment 8936391 [details] [diff] [review]
> Add NEON versions of LossyConvertEncoding
> 
> Review of attachment 8936391 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks like a good start. I have high level question: does the store
> actually require 16-byte alignment? Would it be possible to reduce the
> restriction on the alignment of the destination buffer like we do in the SSE
> code so that we can use this in more cases?

It depends on hardware configuration (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344f/Cihejdic.html) for unalignment access on arm32.  SSE has unaligned access instruction, but no one on arm32.

Also, most current arm device that uses android, it allows unaligned access.
 
> ::: xpcom/string/nsUTF8UtilsNEON.cpp
> @@ +15,5 @@
> > +                                      uint32_t aSourceLength)
> > +{
> > +  char* dest = mDestination;
> > +
> > +  // Align source and destination to a 16-byte boundary.
> 
> This comment is slightly misleading. We're trying to align the source to a
> 16-byte boundary and the dest to an 8-byte boundary.
> 
> @@ +17,5 @@
> > +  char* dest = mDestination;
> > +
> > +  // Align source and destination to a 16-byte boundary.
> > +  uint32_t i = 0;
> > +  while (((NS_PTR_TO_INT32(aSource + i) & 0xf) ||
> 
> The SIMD code actually negates NS_PTR_TO_INT32, I'm not sure if that matters
> here (maybe because we're using an intptr_t cast instead of a uintptr_t
> cast).
> 
> @@ +19,5 @@
> > +  // Align source and destination to a 16-byte boundary.
> > +  uint32_t i = 0;
> > +  while (((NS_PTR_TO_INT32(aSource + i) & 0xf) ||
> > +          (NS_PTR_TO_INT32(dest + i) & 0x7)) &&
> > +         i < aSourceLength) {
> 
> But if we can't align `aSrc` by 16 and `aDst` by 8 we do all the work here.
> 
> I'm not sure we should be so strict here, I don't think `vst1[q]_u8`
> actually requires dest to be aligned (at least judging by some stack
> overflow comments [1]). If we were hand writing ASM then we could specify an
> alignment requirement. Does that seem correct to you? What does the
> disassembly look like for this?
> 
> [1] https://stackoverflow.com/a/37537993

According to http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344f/Cihejdic.html, "If an alignment qualifier is not specified, and A=1, the alignment fault is taken if it is not aligned to element size."

> 
> @@ +61,5 @@
> > +                                      uint32_t aSourceLength)
> > +{
> > +  char16_t* dest = mDestination;
> > +
> > +  // Align source and destination to a 16-byte boundary.
> 
> This is the reverse, but the same comments above apply. I think we should
> still probably align on `aSource`.
> 
> @@ +78,5 @@
> > +  }
> > +
> > +  // Walk 16 bytes at a time.
> > +  while (aSourceLength - i > 15) {
> > +    uint8x16_t s = vld1q_u8(reinterpret_cast<const uint8_t*>(aSource + i));
> 
> load src into a q reg...why don't we just load into d reg, ie:
> 
> > uint16x8_t low  = vmovl_u8(vld1_u8(...));
> > uint16x8_t high = vmovl_u8(vld1_u8(...));
> 
> It seems like that's one less instruction, but maybe working with a q reg
> will be quicker?

Qx reg is Dx reg pair on NEON.  So this code will be compiled like

  86:   f964 0a2d       vld1.8  {d16-d17}, [r4 :128]!
  8a:   f105 0620       add.w   r6, r5, #32
  8e:   3b10            subs    r3, #16
  90:   ffc8 2a30       vmovl.u8        q9, d16
  94:   ffc8 0a31       vmovl.u8        q8, d17
  98:   f945 2a6d       vst1.16 {d18-d19}, [r5 :128]!
  9c:   f945 0a6f       vst1.16 {d16-d17}, [r5 :128]

(this code uses __builtin_assume_aligned, so alignment qualifier is added)

Makoto Kato [:m_kato]

Assignee

Comment 12

•

6 years ago

(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #10)
> (In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment
> #9)
> > Comment on attachment 8936391 [details] [diff] [review]
> > Add NEON versions of LossyConvertEncoding
> > 
> > Review of attachment 8936391 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > This looks like a good start. I have high level question: does the store
> > actually require 16-byte alignment? Would it be possible to reduce the
> > restriction on the alignment of the destination buffer like we do in the SSE
> > code so that we can use this in more cases?
> 
> It looks like there is `__builtin_assume_aligned` for gcc and clang. We can
> use that to tell the compiler that it should use optimized assembly
> instructions. Perhaps we can use that for the source at least?

__builtin_assume_aligned can add alignment qualifier.  I should use it.

Makoto Kato [:m_kato]

Assignee

Comment 13

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a29d8c64eb6d55b8349d96ee1be3cefbee0b6219

Makoto Kato [:m_kato]

Assignee

Comment 14

•

6 years ago

Attached patch Add NEON versions of LossyConvertEncoding v2 — Details — Splinter Review

Attachment #8940745 - Flags: review?(erahm)

Eric Rahm [:erahm]

Comment 15

•

6 years ago

(In reply to Makoto Kato [:m_kato] from comment #14)
> Created attachment 8940745 [details] [diff] [review]
> Add NEON versions of LossyConvertEncoding v2

Thanks for the update, I should be able to look at this tomorrow.

Eric Rahm [:erahm]

Comment 16

•

6 years ago

(In reply to Makoto Kato [:m_kato] from comment #11)

> According to
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344f/
> Cihejdic.html, "If an alignment qualifier is not specified, and A=1, the
> alignment fault is taken if it is not aligned to element size."

I think "element size" is the key here, I took that to mean `alignmentof(uint8_t)` for `vst1q_u8`. I am by no means an expert on ARM, so if you know I'm wrong or feel strongly about keeping dst aligned I'm fine with that, you don't need to prove it to me.

That said, it would be nice if we could test on a device with A=1. 

> Qx reg is Dx reg pair on NEON.  So this code will be compiled like

I just learned something very cool :)

Eric Rahm [:erahm]

Comment 17

•

6 years ago

Comment on attachment 8940745 [details] [diff] [review]
Add NEON versions of LossyConvertEncoding v2

Review of attachment 8940745 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, I'm still not totally sure we need 8/16 byte alignment for the dst but I'm okay with this as-is.

::: xpcom/string/moz.build
@@ +60,5 @@
>      SOURCES['nsReadableUtilsSSE2.cpp'].flags += CONFIG['SSE2_FLAGS']
>  
> +if CONFIG['BUILD_ARM_NEON'] or CONFIG['CPU_ARCH'] == 'aarch64':
> +    SOURCES += ['nsUTF8UtilsNEON.cpp']
> +    SOURCES['nsUTF8UtilsNEON.cpp'].flags += CONFIG['NEON_FLAGS']

Do we need neon flags for 'aarch64'? I'm guessing it doesn't matter, but double checking.

::: xpcom/string/nsUTF8Utils.h
@@ +664,5 @@
>        write_sse2(aSource, aSourceLength);
>        return;
>      }
>  #endif
> +#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(MOZ_LITTLE_ENDIAN)

Why only little endian?

::: xpcom/string/nsUTF8UtilsNEON.cpp
@@ +36,5 @@
> +            vmovn_u16(s));
> +    i += 8;
> +  }
> +
> +  // Align source and destination to a 16-byte boundary.

I wonder if we could just declare aligned versions here and avoid the `_builtin_assume_aligned`s below, ie:

> const char16_t* src = reinterpret_cast<const uint16_t*>(__builtin_assume_aligned(aSource + i, 16);
> dest = __builtin_assume_aligned(dest + i, 16)
> 
>  while (aSourceLength - i > 15) {
>    uint16x8_t low = vld1q_u16(src + i);
>    uint16x8_t low = vld1q_u16(src + i + 8);
>    ...

I assume the compiler can tell we're going in 16-byte increments.

Attachment #8940745 - Flags: review?(erahm) → review+

Makoto Kato [:m_kato]

Assignee

Comment 18

•

6 years ago

(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #17)
> Comment on attachment 8940745 [details] [diff] [review]
> Add NEON versions of LossyConvertEncoding v2
> 
> Review of attachment 8940745 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good, I'm still not totally sure we need 8/16 byte alignment for the
> dst but I'm okay with this as-is.
> 
> ::: xpcom/string/moz.build
> @@ +60,5 @@
> >      SOURCES['nsReadableUtilsSSE2.cpp'].flags += CONFIG['SSE2_FLAGS']
> >  
> > +if CONFIG['BUILD_ARM_NEON'] or CONFIG['CPU_ARCH'] == 'aarch64':
> > +    SOURCES += ['nsUTF8UtilsNEON.cpp']
> > +    SOURCES['nsUTF8UtilsNEON.cpp'].flags += CONFIG['NEON_FLAGS']
> 
> Do we need neon flags for 'aarch64'? I'm guessing it doesn't matter, but
> double checking.

Unnecessary on aarch64.  But, NEON_FLAGS is useful macro to use NEON on both 32bit and 64bit (bug 1298569).
 
> ::: xpcom/string/nsUTF8Utils.h
> @@ +664,5 @@
> >        write_sse2(aSource, aSourceLength);
> >        return;
> >      }
> >  #endif
> > +#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(MOZ_LITTLE_ENDIAN)
> 
> Why only little endian?

I have no environment for big endian.  vcombine_u8 should be replace high with low for big endian. I will follow up it by another bug.
 
> ::: xpcom/string/nsUTF8UtilsNEON.cpp
> @@ +36,5 @@
> > +            vmovn_u16(s));
> > +    i += 8;
> > +  }
> > +
> > +  // Align source and destination to a 16-byte boundary.
> 
> I wonder if we could just declare aligned versions here and avoid the
> `_builtin_assume_aligned`s below, ie:
> 
> > const char16_t* src = reinterpret_cast<const uint16_t*>(__builtin_assume_aligned(aSource + i, 16);
> > dest = __builtin_assume_aligned(dest + i, 16)
> > 
> >  while (aSourceLength - i > 15) {
> >    uint16x8_t low = vld1q_u16(src + i);
> >    uint16x8_t low = vld1q_u16(src + i + 8);
> >    ...
> 
> I assume the compiler can tell we're going in 16-byte increments.

When using it, alignment qualifier isn't added on clang-5.0 unfortunately.

Pulsebot

Comment 19

•

6 years ago

Pushed by m_kato@ga2.so-net.ne.jp:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3db5dd3a6fe6
Add NEON versions of LossyConvertEncoding. r=erahm

Natalia Csoregi [:nataliaCs]

Comment 20

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/3db5dd3a6fe6

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

status-firefox59: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla59

Henri Sivonen (:hsivonen)

Comment 21

•

6 years ago

Do I understand correctly that we build this code by default when compiling on an ARMv7 GNU/Linux host?

Flags: needinfo?(m_kato)

Makoto Kato [:m_kato]

Assignee

Comment 22

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #21)
> Do I understand correctly that we build this code by default when compiling
> on an ARMv7 GNU/Linux host?

When target is Linux/arm that compiler supports NEON, this is compiled.  This code is tested on Android/arm and Ubuntu/armeabihf 16.04.

Flags: needinfo?(m_kato)

Nobody; OK to take it and work on it

Updated

•

3 years ago

Component: String → XPCOM

fix v1 13 years ago Makoto Kato [:m_kato] 8.84 KB, patch		Details \| Diff \| Splinter Review
fix v1.1 13 years ago Makoto Kato [:m_kato] 8.47 KB, patch	justin.lebar+bug : review+	Details \| Diff \| Splinter Review
Bug 586838 - Add NEON versions of LossyConvertEncoding. 7 years ago Makoto Kato [:m_kato] 59 bytes, text/x-review-board-request		Details
Add NEON versions of LossyConvertEncoding 7 years ago Makoto Kato [:m_kato] 7.28 KB, patch	erahm : feedback+	Details \| Diff \| Splinter Review
Add NEON versions of LossyConvertEncoding v2 6 years ago Makoto Kato [:m_kato] 8.25 KB, patch	erahm : review+	Details \| Diff \| Splinter Review