Closed
Bug 586838
Opened 14 years ago
Closed 6 years ago
Add NEON versions of LossyConvertEncoding
Categories
(Core :: XPCOM, defect)
Tracking
()
RESOLVED
FIXED
mozilla59
Tracking | Status | |
---|---|---|
firefox59 | --- | fixed |
People
(Reporter: justin.lebar+bug, Assigned: m_kato)
References
(Blocks 1 open bug)
Details
Attachments
(2 files, 3 obsolete files)
7.28 KB,
patch
|
erahm
:
feedback+
|
Details | Diff | Splinter Review |
8.25 KB,
patch
|
erahm
:
review+
|
Details | Diff | Splinter Review |
See bug 586698. We should be able to do the same for NEON.
Assignee | ||
Updated•13 years ago
|
Assignee: nobody → m_kato
Status: NEW → ASSIGNED
Assignee | ||
Comment 1•13 years ago
|
||
Assignee | ||
Comment 2•13 years ago
|
||
Attachment #559444 -
Attachment is obsolete: true
Assignee | ||
Updated•13 years ago
|
Attachment #559445 -
Flags: review?(justin.lebar+bug)
Reporter | ||
Comment 3•13 years ago
|
||
Comment on attachment 559445 [details] [diff] [review] fix v1.1 In the future, can you please include more context in your patches? This has just three lines of context. Add [diff] showfunc = 1 unified = 8 to your hgrc. > +#if defined(MOZILLA_MAY_SUPPORT_NEON) > + if (mozilla::supports_neon()) > + { > + p = write_ascii_neon(start, N); > + } > +#endif Unlike the lossy versions, this method stops when it sees a non-ascii character. Please add a comment explaining that here. > +#include "nscore.h" > +#include "nsUTF8Utils.h" > + > +const ConvertUTF8toUTF16::value_type* > +ConvertUTF8toUTF16::write_ascii_neon( const value_type* start, PRUint32 N ) > +{ > + const value_type* p = start; > + const value_type* end = start + N; > + buffer_type* out = mBuffer; > + > + // Although ARMv7 supports unaligned aceess, unaligned access > + // is slow. So we need aligned acess to get performance s/aceess/access/ (twice). Also, please end the sentence with a period. > + "pld [%1, #8]\n" Like in the other bug, do you know that #8 here, or #16 / #32 elsewhere, is right? > + if (done_writing >= aSource + 8) { > + __asm__( > + "pld [%0, #16]\n" Do you need a pld here, and in the similar code above, since we're not in a loop? The rest of it looks good to me. My only other question is, do you need to annotate the vldm / vstm instructions to indicate that your pointers are aligned? In bug 548664, you have > + "vst1.16 {q1}, [%0, :128]!\n". Do you need to add similar things to this patch?
Attachment #559445 -
Flags: review?(justin.lebar+bug) → review+
Assignee | ||
Updated•7 years ago
|
Attachment #559445 -
Attachment is obsolete: true
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Updated•7 years ago
|
Attachment #8933542 -
Flags: review?(erahm)
Assignee | ||
Comment 6•7 years ago
|
||
arm32 might not allow unaligned access (depends on hardware configuration), so we have to consider unsigned access. When using Cortex-A53 board (arm32 mode), this improves as twice as fast.
Comment 7•7 years ago
|
||
(In reply to Makoto Kato [:m_kato] (slow due to PTO?) from comment #6) > arm32 might not allow unaligned access (depends on hardware configuration), > so we have to consider unsigned access. > > When using Cortex-A53 board (arm32 mode), this improves as twice as fast. Just a heads up this might take me a little bit, I'm getting up to speed on neon intrinsics. Really happy to see this patch though!
Assignee | ||
Comment 8•7 years ago
|
||
no mozreview version. Eric, if you cannot review this, please change better reviewer.
Attachment #8933542 -
Attachment is obsolete: true
Attachment #8933542 -
Flags: review?(erahm)
Attachment #8936391 -
Flags: review?(erahm)
Comment 9•7 years ago
|
||
Comment on attachment 8936391 [details] [diff] [review] Add NEON versions of LossyConvertEncoding Review of attachment 8936391 [details] [diff] [review]: ----------------------------------------------------------------- This looks like a good start. I have high level question: does the store actually require 16-byte alignment? Would it be possible to reduce the restriction on the alignment of the destination buffer like we do in the SSE code so that we can use this in more cases? ::: xpcom/string/nsUTF8UtilsNEON.cpp @@ +15,5 @@ > + uint32_t aSourceLength) > +{ > + char* dest = mDestination; > + > + // Align source and destination to a 16-byte boundary. This comment is slightly misleading. We're trying to align the source to a 16-byte boundary and the dest to an 8-byte boundary. @@ +17,5 @@ > + char* dest = mDestination; > + > + // Align source and destination to a 16-byte boundary. > + uint32_t i = 0; > + while (((NS_PTR_TO_INT32(aSource + i) & 0xf) || The SIMD code actually negates NS_PTR_TO_INT32, I'm not sure if that matters here (maybe because we're using an intptr_t cast instead of a uintptr_t cast). @@ +19,5 @@ > + // Align source and destination to a 16-byte boundary. > + uint32_t i = 0; > + while (((NS_PTR_TO_INT32(aSource + i) & 0xf) || > + (NS_PTR_TO_INT32(dest + i) & 0x7)) && > + i < aSourceLength) { But if we can't align `aSrc` by 16 and `aDst` by 8 we do all the work here. I'm not sure we should be so strict here, I don't think `vst1[q]_u8` actually requires dest to be aligned (at least judging by some stack overflow comments [1]). If we were hand writing ASM then we could specify an alignment requirement. Does that seem correct to you? What does the disassembly look like for this? [1] https://stackoverflow.com/a/37537993 @@ +61,5 @@ > + uint32_t aSourceLength) > +{ > + char16_t* dest = mDestination; > + > + // Align source and destination to a 16-byte boundary. This is the reverse, but the same comments above apply. I think we should still probably align on `aSource`. @@ +78,5 @@ > + } > + > + // Walk 16 bytes at a time. > + while (aSourceLength - i > 15) { > + uint8x16_t s = vld1q_u8(reinterpret_cast<const uint8_t*>(aSource + i)); load src into a q reg...why don't we just load into d reg, ie: > uint16x8_t low = vmovl_u8(vld1_u8(...)); > uint16x8_t high = vmovl_u8(vld1_u8(...)); It seems like that's one less instruction, but maybe working with a q reg will be quicker?
Attachment #8936391 -
Flags: review?(erahm) → feedback+
Comment 10•6 years ago
|
||
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #9) > Comment on attachment 8936391 [details] [diff] [review] > Add NEON versions of LossyConvertEncoding > > Review of attachment 8936391 [details] [diff] [review]: > ----------------------------------------------------------------- > > This looks like a good start. I have high level question: does the store > actually require 16-byte alignment? Would it be possible to reduce the > restriction on the alignment of the destination buffer like we do in the SSE > code so that we can use this in more cases? It looks like there is `__builtin_assume_aligned` for gcc and clang. We can use that to tell the compiler that it should use optimized assembly instructions. Perhaps we can use that for the source at least? > ::: xpcom/string/nsUTF8UtilsNEON.cpp > @@ +15,5 @@ > > + uint32_t aSourceLength) > > +{ > > + char* dest = mDestination; > > + > > + // Align source and destination to a 16-byte boundary. > > This comment is slightly misleading. We're trying to align the source to a > 16-byte boundary and the dest to an 8-byte boundary. > > @@ +17,5 @@ > > + char* dest = mDestination; > > + > > + // Align source and destination to a 16-byte boundary. > > + uint32_t i = 0; > > + while (((NS_PTR_TO_INT32(aSource + i) & 0xf) || > > The SIMD code actually negates NS_PTR_TO_INT32, I'm not sure if that matters > here (maybe because we're using an intptr_t cast instead of a uintptr_t > cast). I worked out what's going on here, the SSE version is a bit-twiddling hack. This version is okay, we should just be using uintptr_t directly instead of the macro, ie: > while (uintptr_t(aSource + 1) & 0xf)
Assignee | ||
Comment 11•6 years ago
|
||
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #9) > Comment on attachment 8936391 [details] [diff] [review] > Add NEON versions of LossyConvertEncoding > > Review of attachment 8936391 [details] [diff] [review]: > ----------------------------------------------------------------- > > This looks like a good start. I have high level question: does the store > actually require 16-byte alignment? Would it be possible to reduce the > restriction on the alignment of the destination buffer like we do in the SSE > code so that we can use this in more cases? It depends on hardware configuration (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344f/Cihejdic.html) for unalignment access on arm32. SSE has unaligned access instruction, but no one on arm32. Also, most current arm device that uses android, it allows unaligned access. > ::: xpcom/string/nsUTF8UtilsNEON.cpp > @@ +15,5 @@ > > + uint32_t aSourceLength) > > +{ > > + char* dest = mDestination; > > + > > + // Align source and destination to a 16-byte boundary. > > This comment is slightly misleading. We're trying to align the source to a > 16-byte boundary and the dest to an 8-byte boundary. > > @@ +17,5 @@ > > + char* dest = mDestination; > > + > > + // Align source and destination to a 16-byte boundary. > > + uint32_t i = 0; > > + while (((NS_PTR_TO_INT32(aSource + i) & 0xf) || > > The SIMD code actually negates NS_PTR_TO_INT32, I'm not sure if that matters > here (maybe because we're using an intptr_t cast instead of a uintptr_t > cast). > > @@ +19,5 @@ > > + // Align source and destination to a 16-byte boundary. > > + uint32_t i = 0; > > + while (((NS_PTR_TO_INT32(aSource + i) & 0xf) || > > + (NS_PTR_TO_INT32(dest + i) & 0x7)) && > > + i < aSourceLength) { > > But if we can't align `aSrc` by 16 and `aDst` by 8 we do all the work here. > > I'm not sure we should be so strict here, I don't think `vst1[q]_u8` > actually requires dest to be aligned (at least judging by some stack > overflow comments [1]). If we were hand writing ASM then we could specify an > alignment requirement. Does that seem correct to you? What does the > disassembly look like for this? > > [1] https://stackoverflow.com/a/37537993 According to http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344f/Cihejdic.html, "If an alignment qualifier is not specified, and A=1, the alignment fault is taken if it is not aligned to element size." > > @@ +61,5 @@ > > + uint32_t aSourceLength) > > +{ > > + char16_t* dest = mDestination; > > + > > + // Align source and destination to a 16-byte boundary. > > This is the reverse, but the same comments above apply. I think we should > still probably align on `aSource`. > > @@ +78,5 @@ > > + } > > + > > + // Walk 16 bytes at a time. > > + while (aSourceLength - i > 15) { > > + uint8x16_t s = vld1q_u8(reinterpret_cast<const uint8_t*>(aSource + i)); > > load src into a q reg...why don't we just load into d reg, ie: > > > uint16x8_t low = vmovl_u8(vld1_u8(...)); > > uint16x8_t high = vmovl_u8(vld1_u8(...)); > > It seems like that's one less instruction, but maybe working with a q reg > will be quicker? Qx reg is Dx reg pair on NEON. So this code will be compiled like 86: f964 0a2d vld1.8 {d16-d17}, [r4 :128]! 8a: f105 0620 add.w r6, r5, #32 8e: 3b10 subs r3, #16 90: ffc8 2a30 vmovl.u8 q9, d16 94: ffc8 0a31 vmovl.u8 q8, d17 98: f945 2a6d vst1.16 {d18-d19}, [r5 :128]! 9c: f945 0a6f vst1.16 {d16-d17}, [r5 :128] (this code uses __builtin_assume_aligned, so alignment qualifier is added)
Assignee | ||
Comment 12•6 years ago
|
||
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #10) > (In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment > #9) > > Comment on attachment 8936391 [details] [diff] [review] > > Add NEON versions of LossyConvertEncoding > > > > Review of attachment 8936391 [details] [diff] [review]: > > ----------------------------------------------------------------- > > > > This looks like a good start. I have high level question: does the store > > actually require 16-byte alignment? Would it be possible to reduce the > > restriction on the alignment of the destination buffer like we do in the SSE > > code so that we can use this in more cases? > > It looks like there is `__builtin_assume_aligned` for gcc and clang. We can > use that to tell the compiler that it should use optimized assembly > instructions. Perhaps we can use that for the source at least? __builtin_assume_aligned can add alignment qualifier. I should use it.
Assignee | ||
Comment 13•6 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a29d8c64eb6d55b8349d96ee1be3cefbee0b6219
Assignee | ||
Comment 14•6 years ago
|
||
Attachment #8940745 -
Flags: review?(erahm)
Comment 15•6 years ago
|
||
(In reply to Makoto Kato [:m_kato] from comment #14) > Created attachment 8940745 [details] [diff] [review] > Add NEON versions of LossyConvertEncoding v2 Thanks for the update, I should be able to look at this tomorrow.
Comment 16•6 years ago
|
||
(In reply to Makoto Kato [:m_kato] from comment #11) > According to > http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344f/ > Cihejdic.html, "If an alignment qualifier is not specified, and A=1, the > alignment fault is taken if it is not aligned to element size." I think "element size" is the key here, I took that to mean `alignmentof(uint8_t)` for `vst1q_u8`. I am by no means an expert on ARM, so if you know I'm wrong or feel strongly about keeping dst aligned I'm fine with that, you don't need to prove it to me. That said, it would be nice if we could test on a device with A=1. > Qx reg is Dx reg pair on NEON. So this code will be compiled like I just learned something very cool :)
Comment 17•6 years ago
|
||
Comment on attachment 8940745 [details] [diff] [review] Add NEON versions of LossyConvertEncoding v2 Review of attachment 8940745 [details] [diff] [review]: ----------------------------------------------------------------- Looks good, I'm still not totally sure we need 8/16 byte alignment for the dst but I'm okay with this as-is. ::: xpcom/string/moz.build @@ +60,5 @@ > SOURCES['nsReadableUtilsSSE2.cpp'].flags += CONFIG['SSE2_FLAGS'] > > +if CONFIG['BUILD_ARM_NEON'] or CONFIG['CPU_ARCH'] == 'aarch64': > + SOURCES += ['nsUTF8UtilsNEON.cpp'] > + SOURCES['nsUTF8UtilsNEON.cpp'].flags += CONFIG['NEON_FLAGS'] Do we need neon flags for 'aarch64'? I'm guessing it doesn't matter, but double checking. ::: xpcom/string/nsUTF8Utils.h @@ +664,5 @@ > write_sse2(aSource, aSourceLength); > return; > } > #endif > +#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(MOZ_LITTLE_ENDIAN) Why only little endian? ::: xpcom/string/nsUTF8UtilsNEON.cpp @@ +36,5 @@ > + vmovn_u16(s)); > + i += 8; > + } > + > + // Align source and destination to a 16-byte boundary. I wonder if we could just declare aligned versions here and avoid the `_builtin_assume_aligned`s below, ie: > const char16_t* src = reinterpret_cast<const uint16_t*>(__builtin_assume_aligned(aSource + i, 16); > dest = __builtin_assume_aligned(dest + i, 16) > > while (aSourceLength - i > 15) { > uint16x8_t low = vld1q_u16(src + i); > uint16x8_t low = vld1q_u16(src + i + 8); > ... I assume the compiler can tell we're going in 16-byte increments.
Attachment #8940745 -
Flags: review?(erahm) → review+
Assignee | ||
Comment 18•6 years ago
|
||
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #17) > Comment on attachment 8940745 [details] [diff] [review] > Add NEON versions of LossyConvertEncoding v2 > > Review of attachment 8940745 [details] [diff] [review]: > ----------------------------------------------------------------- > > Looks good, I'm still not totally sure we need 8/16 byte alignment for the > dst but I'm okay with this as-is. > > ::: xpcom/string/moz.build > @@ +60,5 @@ > > SOURCES['nsReadableUtilsSSE2.cpp'].flags += CONFIG['SSE2_FLAGS'] > > > > +if CONFIG['BUILD_ARM_NEON'] or CONFIG['CPU_ARCH'] == 'aarch64': > > + SOURCES += ['nsUTF8UtilsNEON.cpp'] > > + SOURCES['nsUTF8UtilsNEON.cpp'].flags += CONFIG['NEON_FLAGS'] > > Do we need neon flags for 'aarch64'? I'm guessing it doesn't matter, but > double checking. Unnecessary on aarch64. But, NEON_FLAGS is useful macro to use NEON on both 32bit and 64bit (bug 1298569). > ::: xpcom/string/nsUTF8Utils.h > @@ +664,5 @@ > > write_sse2(aSource, aSourceLength); > > return; > > } > > #endif > > +#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(MOZ_LITTLE_ENDIAN) > > Why only little endian? I have no environment for big endian. vcombine_u8 should be replace high with low for big endian. I will follow up it by another bug. > ::: xpcom/string/nsUTF8UtilsNEON.cpp > @@ +36,5 @@ > > + vmovn_u16(s)); > > + i += 8; > > + } > > + > > + // Align source and destination to a 16-byte boundary. > > I wonder if we could just declare aligned versions here and avoid the > `_builtin_assume_aligned`s below, ie: > > > const char16_t* src = reinterpret_cast<const uint16_t*>(__builtin_assume_aligned(aSource + i, 16); > > dest = __builtin_assume_aligned(dest + i, 16) > > > > while (aSourceLength - i > 15) { > > uint16x8_t low = vld1q_u16(src + i); > > uint16x8_t low = vld1q_u16(src + i + 8); > > ... > > I assume the compiler can tell we're going in 16-byte increments. When using it, alignment qualifier isn't added on clang-5.0 unfortunately.
Comment 19•6 years ago
|
||
Pushed by m_kato@ga2.so-net.ne.jp: https://hg.mozilla.org/integration/mozilla-inbound/rev/3db5dd3a6fe6 Add NEON versions of LossyConvertEncoding. r=erahm
Comment 20•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/3db5dd3a6fe6
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
status-firefox59:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Comment 21•6 years ago
|
||
Do I understand correctly that we build this code by default when compiling on an ARMv7 GNU/Linux host?
Flags: needinfo?(m_kato)
Assignee | ||
Comment 22•6 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #21) > Do I understand correctly that we build this code by default when compiling > on an ARMv7 GNU/Linux host? When target is Linux/arm that compiler supports NEON, this is compiled. This code is tested on Android/arm and Ubuntu/armeabihf 16.04.
Flags: needinfo?(m_kato)
Updated•3 years ago
|
Component: String → XPCOM
You need to log in
before you can comment on or make changes to this bug.
Description
•