Closed Bug 548664 Opened 11 years ago Closed 11 months ago

NEON support for Optimize UTF8 to UTF16 conversion

Categories

(Core :: String, defect)

ARM
All
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: dougt, Assigned: m_kato)

References

Details

Attachments

(1 file, 2 obsolete files)

in bug 506430, we added sse support for the utf8 -> utf16 conversion.  we should try using NEON (which has 128-bit registers).
NEON may be not very suitable for this operation. Branching based on the value in NEON register is extremely slow (~20 cycles penalty).

So the following approaches can be tried:
1. Pure ARMv6 implementation using UXTB16 instruction which can convert a pair of 8-bit values to 16-bit.
2. Mixed ARM and NEON, with ARM used for testing high bits and loop breaking, and with NEON used just for 8-bit -> 16-bit extending.

ARMv6 and newer ARM cores support unaligned memory accesses (at least in linux), which is also handy.

Also for this particular use case (conversion of a huge buffer with ascii text, which does not even fit L2 cache), memory bandwidth is going to be a limitation. So explicit prefetch is required (Cortex-A8 and older ARM cores do not have a hardware prefetch).
When I tested 2., it could get 20% faster then current on i.mx515.

Also, it is no good to use 128-bit register because comparing cost vmov.32 are too expensive.
s/comparing cost vmov.32/comparing cost and vmov.32/
Attached patch test patch (obsolete) — Splinter Review
64-bit compare is high cost on Cortex-A8.  string length is 8 or 16, it is good performance to use 64-bit compare version with 64bit->128bit NEON conversion.  But on other situation such as length is 21 and etc, it is not good performance...

So we may use 32-bit compare version like this patch.
__ARM_NEON__ is unfortunately not useful, because we make those decisions at runtime instead of at compile time; but the patch looks good in any case, just need to figure out how to get it picking the right code at runtime..
(In reply to comment #5)
> __ARM_NEON__ is unfortunately not useful, because we make those decisions at
> runtime instead of at compile time; but the patch looks good in any case, just
> need to figure out how to get it picking the right code at runtime..

I agree.  We should add NEON detection to SSE.h at first...
Depends on: 583958
Comment on attachment 458627 [details] [diff] [review]
test patch

>+    __asm__ __volatile__(
>+      "vdup.32   d0, %0\n\t"
>+      "vmovl.u8  q1, d0\n\t"
>+      "vst1.8    {d2}, [%1]!\n\t"
>+      : "+r"(in), "+r"(dst32)
>+      :
>+      : "%d0", "%d2", "%q1");

Seeing as you're modifying %1 (i.e. dst32) with the vst1.8, shouldn't you add it to the list of written registers?  If you did that, the asm might not have to be volatile.
"vst1.8 {d2}, [%1]!" means "vst1.8 {d2}, [%1]" and %1 register (dst32) += 8.  So dst32 is overwritten.
Right.  My comment is just on the list of registers defined after the assembly.  Looking up the syntax again, I see that +r indicates an input/output register.  Since |in| is only read, perhaps it should be:

__asm__ (
      ...
      : "+r"(dst32)                     /* output */
      : "=r"(in)                        /* input */
      : "%d0", "%d2", "%q1", "memory"); /* clobber */

Of course, you'd have to switch %0 and %1, too.  Or maybe you could do

__asm__ (
      "vdup.32   d0, %[in]\n\t"
      "vmovl.u8  q1, d0\n\t"
      "vst1.8    {d2}, %[dst]!\n\t"
      : [dst] "+r"(dst32)
      : [in]  "=r"(in)
      : "%d0", "%d2", "%q1", "memory");

http://www.ethernut.de/en/documents/arm-inline-asm.html

IIRC gcc doesn't understand that clobbering q1 implies clobbering d2 and d3, so it might be more correct to add "%d3" to the clobber list too.  But I don't think that matters here.
Assignee: nobody → m_kato
Status: NEW → ASSIGNED
OS: Mac OS X → All
Hardware: x86 → ARM
Attached patch fix (obsolete) — Splinter Review
Attachment #458627 - Attachment is obsolete: true
Attachment #533542 - Flags: review?(justin.lebar+bug)
>+    while (len > 4) {
>+      PRUint32 in = *src32++;
>+      if (in & 0x80808080U) {
>+        src32--;
>+        break;
>+      }
>+
>+      __asm__(
>+        ".fpu neon\n"
>+        "vdup.32   d0, %1\n"
>+        "vmovl.u8  q1, d0\n"
>+        "vst1.8    {d2}, [%0]!\n"
>+        : "+r"(dst32)
>+        : "r"(in)
>+        : "d0", "d2", "q1", "memory");
>+      len -= 4;
>+    }

Do you know if this is any faster than the original code?

Would it be faster to do 64 bits at a time?

  PRUint32 in1 = *src32++;
  PRUint32 in2 = *src32++;

  if (in1 & 0x80808080U || in2 & 0x80808080U) {
    src32 -= 2;
    break;
  }

  vmov d0, in1, in2
  vmovl.u8 q1, d0
  vst1.8 q1, [%0]!

You could even try loading into d0 first, then moving into general purpose registers to do the if.  Something like:

  Before loop, load 0x808080 into d31.

  Load [src32] into d0.
  d1 = d0 & d31
  vmov r0, r1, d0
  check that r0 == r1 == 0
  finish up

I suspect this will be slower than loading into the general-purpose registers first, but it might be worth a shot.
(In reply to comment #11)

> Do you know if this is any faster than the original code?

It is 20% faster then original code on Cortex-A8 board.

> Would it be faster to do 64 bits at a time?
> 
>   PRUint32 in1 = *src32++;
>   PRUint32 in2 = *src32++;
> 
>   if (in1 & 0x80808080U || in2 & 0x80808080U) {
>     src32 -= 2;
>     break;
>   }
> 
>   vmov d0, in1, in2
>   vmovl.u8 q1, d0
>   vst1.8 q1, [%0]!

Oh.  This is more 10% faster than my fix on Cortex-A8 or Cortex-A9.

 
> You could even try loading into d0 first, then moving into general purpose
> registers to do the if.  Something like:
> 
>   Before loop, load 0x808080 into d31.
> 
>   Load [src32] into d0.
>   d1 = d0 & d31
>   vmov r0, r1, d0
>   check that r0 == r1 == 0
>   finish up

This was slower than original.
Attachment #533542 - Flags: review?(justin.lebar+bug)
You may get an additional speedup by adding prefetch instructions.
Attached patch fix v2Splinter Review
- aligned access to get performace.  (aligned copy is 1 cycle vs unaligned copy is 2 or 3 cycle)
- use pld for prefetch
Attachment #533542 - Attachment is obsolete: true
Attachment #559426 - Flags: review?(justin.lebar+bug)
Comment on attachment 559426 [details] [diff] [review]
fix v2

> +    // Although ARMv7 can support unaligned access, We use aligned acess
> +    // for distnation to get more performance

"access", not "acess".  "destination", not "distnation".  Please end the sentence with a period, and please don't capitalize "We" in the middle of the sentence.

In fact, I'd reword it to say:

  Although ARMv7 supports unaligned loads and stores, aligning the src and dst pointers here gives us a speedup. 

> +    while (((NS_PTR_TO_UINT32(dst) & 15) || (NS_PTR_TO_UINT32(src) & 3))

Suppose src = 0x1002, dst=0x100.  Then the following are the values that src and dst will take on as we loop.  I'll put a * next to the src values aligned to 4 bits and the dst values aligned to 16 bits.  We need *s next to both numbers to exit the loop.

   src     dst
0x1002   0x100*
0x1004*  0x101
0x1006   0x102
0x1008*  0x103
...
0x1032*  0x10E
0x1034   0x110*

In other words, src and dst are out of phase with one another.  src is aligned on every odd iteration (i=1, i=3, etc), and dst is aligned when the iteration count is a multiple of 16 (i=0, i=16, etc.).  We'll never exit this loop and run the NEON code.

This is what the current code is checking for when it says "with some alignments, we'd never break out of the slow loop."

I think the right thing to do is to check whether src and dst are in phase or not.  If they're in phase, run the code in this patch.  If not, run the code in your earlier version, which assumed only that src was aligned.

(You could just do what the current version does and fall back to a slow path when the data is misaligned.  But looking at how Convert_ascii_run is used, it seems to me that we'll run the out-of-phase version often.)

> +        "pld       [%1, #8]\n"

Did you try different prefetch distances?  You may get a bigger speedup by using a larger offset than #8.

Sorry to send this back for another round.
Attachment #559426 - Flags: review?(justin.lebar+bug) → review-
> But looking at how Convert_ascii_run is used, it seems to me that 
> we'll run the out-of-phase version often.

... because if there's a Japanese character in the UTF-8 string, for instance, we'll move the UTF-8 pointer forward 3 bytes but move the UTF-16 pointer forward only 2 bytes.  This will change the two pointers' relative phase.
(In reply to Justin Lebar [:jlebar] from comment #15)
> Comment on attachment 559426 [details] [diff] [review]
> fix v2
> 
> > +    // Although ARMv7 can support unaligned access, We use aligned acess
> > +    // for distnation to get more performance
> 
> "access", not "acess".  "destination", not "distnation".  Please end the
> sentence with a period, and please don't capitalize "We" in the middle of
> the sentence.
> 
> In fact, I'd reword it to say:
> 
>   Although ARMv7 supports unaligned loads and stores, aligning the src and
> dst pointers here gives us a speedup. 

Thanks.

 
> > +    while (((NS_PTR_TO_UINT32(dst) & 15) || (NS_PTR_TO_UINT32(src) & 3))
> 
> Suppose src = 0x1002, dst=0x100.  Then the following are the values that src
> and dst will take on as we loop.  I'll put a * next to the src values
> aligned to 4 bits and the dst values aligned to 16 bits.  We need *s next to
> both numbers to exit the loop.
> 
>    src     dst
> 0x1002   0x100*
> 0x1004*  0x101
> 0x1006   0x102
> 0x1008*  0x103
> ...
> 0x1032*  0x10E
> 0x1034   0x110*

???  src* is char and dst is PRUnicode*.  It is incorrect.

> In other words, src and dst are out of phase with one another.  src is
> aligned on every odd iteration (i=1, i=3, etc), and dst is aligned when the
> iteration count is a multiple of 16 (i=0, i=16, etc.).  We'll never exit
> this loop and run the NEON code.

I should add more optimization like the following...

while (src & 3) {
  // not using NEON
}

while (dst & 15) {
  // using unaligned NEON (src grows 4 bytes per loop)
}

// using aligned NEON

> > +        "pld       [%1, #8]\n"
> 
> Did you try different prefetch distances?  You may get a bigger speedup by
> using a larger offset than #8.

If we grows chunk for src per loop, it is more faster.  But my concern is a cost of comparing.
> I should add more optimization like the following.

Yes, that's what I had in mind.

> If we grows chunk for src per loop, it is more faster. But my concern is a cost of comparing.

Sorry, I don't understand what you mean.  If you changed the 8 to 16 in the pld, what would be the extra cost?

string conversion uses encoding_rs now. It uses SIMD

Status: ASSIGNED → RESOLVED
Closed: 11 months ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.