Add SSE versions of LossyConvertEncoding

RESOLVED FIXED in mozilla5

Status: RESOLVED FIXED
Type: defect

People

(Reporter: justin.lebar+bug, Assigned: justin.lebar+bug)

Tracking

Keywords: perf
Version: unspecified
Target Milestone: mozilla5

Attachments

(2 attachments, 7 obsolete attachments)

I think this could be vectorized on both ARM and x86.  It's used by nsTextFragment.cpp -- see bug 585978.
OS: Windows XP → All
Hardware: x86 → All
Posted patch Patch v1 (obsolete) — Splinter Review
Here's an x86/SSE2 patch.  I think this would be pretty easy to do in NEON, but I don't currently have a setup for that.

One issue: The non-vectorized code completely ignores the high-order bytes when converting from UTF-16 to ASCII.  But this patch does a saturating conversion -- i.e., if a high-order byte is non-zero, the corresponding output byte will be 0xFF.

If this matters, we can mask out the high-order bytes.  But since we should only be calling this lossy conversion function after first checking that all the high-order bytes are in fact 0, I'm not sure we care.
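The two behaviors can be illustrated with a minimal sketch (the helper name `Pack16To8` is hypothetical, not from the patch): with the mask applied, high bytes are ignored exactly as in the scalar code; without it, `packuswb` saturates.

```cpp
#include <emmintrin.h>
#include <cstdint>

// Hypothetical helper: pack 8 UTF-16 code units down to 8 bytes.
// With maskHighBytes, the high byte of each unit is discarded first,
// matching the scalar truncating behavior. Without it, packuswb
// treats inputs as signed 16-bit values and saturates to [0, 255].
static void Pack16To8(const uint16_t* src, uint8_t* dst, bool maskHighBytes) {
  __m128i in = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
  if (maskHighBytes) {
    in = _mm_and_si128(in, _mm_set1_epi16(0x00FF));
  }
  __m128i packed = _mm_packus_epi16(in, in);  // saturating 16->8 pack
  _mm_storel_epi64(reinterpret_cast<__m128i*>(dst), packed);
}
```

For example, input 0x0141 becomes 0x41 when masked but 0xFF when saturated, and 0xFF41 (negative as signed 16-bit) saturates to 0x00.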
Is it a big perf hit to mask the high bytes? I'm definitely worried about changing behavior for a function as central as this one.
Posted patch Patch v1.1 (obsolete) — Splinter Review
New patch, which (I think) replaces all the loops in nsTextFragment with calls into nsUTF8Utils.

This is a lot faster.  On my benchmark from bug 585978 [1], I get

Trunk:              220ms (first run), 155ms (later runs)
Bug 585978's patch: 180ms (first run), 122ms (later runs)
This bug's patch:   118ms (first run),  54ms (later runs)

I'll try adding in the masking as suggested above and report back.

[1] https://bugzilla.mozilla.org/attachment.cgi?id=464450
Attachment #465405 - Attachment is obsolete: true
Posted patch Patch v1.2 (obsolete) — Splinter Review
Runtime is the same with the extra ANDs.
Attachment #465421 - Attachment is obsolete: true
Keywords: perf
Summary: Vectorize LossyConvertEncoding<PRUnichar, char> in nsUTF8Utils.h → Vectorize LossyConvertEncoding in nsUTF8Utils.h
Posted patch Patch v2 (obsolete) — Splinter Review
This one adds a vectorization of the 8- to 16-bit conversion.
Assignee: nobody → justin.lebar+bug
Attachment #465431 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Patch v2 doesn't speed up this benchmark a lot on my machine, but there's a definite improvement.

For some reason this benchmark exhibits much more variance than the last one.  I'm reporting min and max observed times here.

          min    max
Trunk:    320ms  350ms
Patch v2: 270ms  320ms
Posted patch Patch v2.1 (obsolete) — Splinter Review
Ready for review.
Attachment #465478 - Attachment is obsolete: true
Attachment #465486 - Flags: review?(vladimir)
Summary: Vectorize LossyConvertEncoding in nsUTF8Utils.h → Add SSE versions of LossyConvertEncoding
See Also: → 586838
See Also: 586838
I filed bug 586838 for adding NEON support for these routines.
I looked over the patch for bug 506430 and I noticed that it does some amount of work to avoid using unaligned loads and stores.

I experimented a bit and didn't see any perf difference between these two on my machine, but perhaps it makes a difference on older machines.  I'd be OK adding the extra code to avoid the unaligned instructions, but only if we can show that it's actually faster.
Vlad tells me that he's probably not the right person to review this and the few other SSE patches I've posted.  He suggests that the usual suspects, cjones and jrmuizel, are booked pretty hard right now.

How important is this speedup?  How much do we want it for FF4?
Comment on attachment 465486 [details] [diff] [review]
Patch v2.1

Vlad says he doesn't think he's the right person to review here; untagging him.
Attachment #465486 - Flags: review?(vladimir) → review?
Posted patch Patch v2.2 (obsolete) — Splinter Review
Fixing Makefile logic for Linux 32-bit.
Attachment #465486 - Attachment is obsolete: true
Attachment #468487 - Flags: review?(jones.chris.g)
Attachment #465486 - Flags: review?
Hi Justin, I'm pretty swamped with fennec b1 work and won't be able to do this review until probably the end of next week; it's going to need a very careful read.  If you need the review sooner than that, please request from someone else.  Sorry!
Hard to say how important this is... We sort of need all the help we can get.  :(
After observing bz optimizing a similar function on IRC, I may be able to speed this up a bit by using indexed loads as opposed to pointer arithmetic.

I'll have a look.
The function I was referring to above is in bug 592786.
Huh; it looks like gcc 4.4 with -O3 (bug 590181) speeds up the existing code a fair bit on my Linux x64 box.

I see no difference with or without the patch on this bug's benchmark; we're at 200ms in either case (which is 1.3x faster than the fastest I saw that benchmark go before -O3).

On the benchmark from bug 585978 [1] (compare with comment 3), I get

Trunk, -O3:
 187ms (first run)
 127ms (later runs)
Patched (this bug plus bug 585978's patch), -O3:
 108ms (first run)
 44ms  (later runs)

It looks like -O3 is adding some SSE2 instructions, but most of the win comes from using the x86 string instructions.

[1] https://bugzilla.mozilla.org/attachment.cgi?id=464450
Depends on: 585978
> It looks like -O3 is adding some SSE2 instructions, but most of the win comes
> from using the x86 string instructions.

Worded better: -O3 is adding some SSE2 instructions, but they're not on the hot path of the benchmarks.
Well, GCC definitely generates tighter code when I use indexing rather than pointer arithmetic, but it doesn't impact my benchmarks, at least not on my Linux 64 box.
This version uses offsets instead of pointer arithmetic.

I see no difference in execution speed on my machine, but I'd be curious if this was any faster on someone else's.

My instinct is to say that we should use this version since it generates tighter code.  But a scientific comparison to the other patch is probably in order.
cc'ing derf, who may have an opinion about which version to use.
Compilers generally find array-style indexing in loops much easier to reason about than pointer arithmetic. It's been standard advice to write loops this way in Intel and AMD optimization manuals for years. This is less true on ARM, which can do pre- and post-incrementing of the pointer as part of the load/store instruction, but this code isn't meant for ARM.
Note that one counter-example we've found is that gcc apparently flubs array-style indexing if calls to inlined member functions are in the loop.  See https://bugzilla.mozilla.org/show_bug.cgi?id=587257#c20

Not an issue here, of course.
Blocks: 586838
derf, do you think you could review the SSE here and bug 585978?  Finding someone who knows SSE and has a few spare cycles seems to be the only thing that's holding this patch back.
Comment on attachment 473246 [details] [diff] [review]
Patch v2.3 (offsets instead of pointer arithmetic)

> +#ifdef MOZILLA_MAY_SUPPORT_SSE2

I'm not really qualified to review build-system changes, but it seems dangerous to me to have the call to write_sse2() guarded by this macro, yet the compilation of write_sse2() guarded by

> +ifneq (,$(INTEL_ARCHITECTURE))

> +  while (i < aSourceLength && (NS_PTR_TO_UINT32(aSource + i) & 0xf)) {

How long are typical inputs to this function?

If they're often smaller than the alignment amount, it may be worth optimizing this test, e.g., computing the loop bound up front with something along the lines of
  alignLength = min(aSourceLength, -NS_PTR_TO_UINT32(aSource) & 0xf);
  while (i < alignLength ) { ...

> +  // Walk 32 bytes (two XMM registers) at a time.

If they're often longer than this, it may be worth unrolling this loop more. packuswb is very slow (4 cycles) on Intel CPUs before Penryn, and it's helpful to have something else to do while it's executing. You've got plenty of registers for it.

The same comment applies to punpck[lh]bw in LossyConvertEncoding8to16::write_sse2() as well.

> +  // Align destination to 16-byte boundary.

I realize since there's twice as many stores as loads, by raw cycle counts, it's better to use aligned stores. But you don't need to wait for the stores to complete before you go on to the next iteration, while you _do_ need to wait for the loads to complete before you can finish unpacking. So I suspect it's still better to align the loads (feel free to benchmark and prove me wrong).

> +    __m128i source1 = _mm_load_si128((__m128i*)(aSource + i));
> ...
> +    __m128i source2 = _mm_load_si128((__m128i*)(aSource + i + 8));
> ...
> +    __m128i source = _mm_loadu_si128((__m128i*)(aSource + i));

Same comments as bug 585978 about the C-style cast and discarding the const qualifier.
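The loop-bound suggestion above can be sketched as follows (the function name is illustrative, not from the patch): compute up front how many elements precede the next 16-byte boundary, instead of testing the pointer's alignment on every head-loop iteration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// How many bytes of the scalar head loop are needed before aSource
// reaches a 16-byte boundary, capped at the input length. (-addr) & 0xf
// is the distance up to the next multiple of 16.
size_t AlignPrologueLength(const char* aSource, size_t aSourceLength) {
  uintptr_t addr = reinterpret_cast<uintptr_t>(aSource);
  size_t toAlign = static_cast<size_t>(-addr) & 0xf;
  return std::min(aSourceLength, toAlign);
}
```

The head loop then runs `while (i < alignLength)` with no per-iteration pointer test.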
(In reply to comment #25)
> Comment on attachment 473246 [details] [diff] [review]
> Patch v2.3 (offsets instead of pointer arithmetic)
> 
> > +#ifdef MOZILLA_MAY_SUPPORT_SSE2
> 
> I'm not really qualified to review build-system changes, but it seems dangerous
> to me to have the call to write_sse2() guarded by this macro, yet the
> compilation of write_sse2() guarded by
> 
> > +ifneq (,$(INTEL_ARCHITECTURE))

Hm, that's a good point.  Ted, what do you think?  Should we just use INTEL_ARCHITECTURE (maybe renamed) in both the Makefiles and the code?  I guess it's not a huge deal either way, because the worst that happens if we mess it up is the tree burns.

> How long are typical inputs to this function?

I instrumented the code and went to a few websites.  Most invocations process fewer than 10 characters, but some process hundreds, and a few process one or two thousand.  The mean is about 30.

> If they're often longer than this, it may be worth unrolling this loop more.
> packuswb is very slow (4 cycles) on Intel CPUs before Penryn, and it's helpful
> to have something else to do while it's executing. You've got plenty of
> registers for it.
> 
> The same comment applies to punpck[lh]bw in
> LossyConvertEncoding8to16::write_sse2() as well.

Unrolling doesn't make a noticeable difference on my modern machine, so if you think it'll be faster on some older cores, I'm happy to leave it in.

> I realize since there's twice as many stores as loads, by raw cycle counts,
> it's better to use aligned stores. But you don't need to wait for the stores to
> complete before you go on to the next iteration, while you _do_ need to wait
> for the loads to complete before you can finish unpacking. So I suspect it's
> still better to align the loads (feel free to benchmark and prove me wrong).

Again, no real difference on my machine, but I'm happy with this option.

Patch in a moment.
Posted patch Patch v3Splinter Review
Given that so many invocations are relatively small, maybe I should have a step after the unrolled loop that processes just one XMM register.
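The structure being discussed can be sketched like this (a sketch with illustrative names, not the patch itself): a scalar head until the source is aligned, an unrolled two-register body, a single-XMM step for medium inputs, then a scalar tail. Note this sketch uses the saturating pack in the SIMD paths while the scalar paths truncate; as discussed above, a consistent implementation would mask the high bytes.

```cpp
#include <emmintrin.h>
#include <cstddef>
#include <cstdint>

void LossyConvert16To8(const uint16_t* aSource, char* aDest, size_t aLength) {
  size_t i = 0;
  // Scalar head: advance until aSource + i is 16-byte aligned.
  while (i < aLength && (reinterpret_cast<uintptr_t>(aSource + i) & 0xf)) {
    aDest[i] = static_cast<char>(aSource[i]);  // truncates high byte
    ++i;
  }
  // Unrolled body: 16 code units (two aligned XMM loads) per iteration.
  for (; i + 16 <= aLength; i += 16) {
    __m128i lo = _mm_load_si128(reinterpret_cast<const __m128i*>(aSource + i));
    __m128i hi = _mm_load_si128(reinterpret_cast<const __m128i*>(aSource + i + 8));
    __m128i packed = _mm_packus_epi16(lo, hi);  // saturating 16->8 pack
    _mm_storeu_si128(reinterpret_cast<__m128i*>(aDest + i), packed);
  }
  // Single-register step, so small-to-medium inputs still get one SIMD pass.
  if (i + 8 <= aLength) {
    __m128i lo = _mm_load_si128(reinterpret_cast<const __m128i*>(aSource + i));
    __m128i packed = _mm_packus_epi16(lo, lo);
    _mm_storel_epi64(reinterpret_cast<__m128i*>(aDest + i), packed);
    i += 8;
  }
  // Scalar tail for the last few code units.
  for (; i < aLength; ++i) {
    aDest[i] = static_cast<char>(aSource[i]);
  }
}
```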
Attachment #468487 - Attachment is obsolete: true
Attachment #473246 - Attachment is obsolete: true
Attachment #492908 - Flags: review?(tterribe)
Attachment #468487 - Flags: review?(jones.chris.g)
cc'ing ted, re comment 26.
In general it's good to make your #ifdefs match your Makefile conditions. In this case, MAY_SUPPORT_SSE2 is a subset of INTEL_ARCHITECTURE, so it's probably not harmful.
Comment on attachment 492908 [details] [diff] [review]
Patch v3

r+=me
Attachment #492908 - Flags: review?(tterribe) → review+
Comment on attachment 492908 [details] [diff] [review]
Patch v3

I think this needs a review from someone in content, too.  jst?
Attachment #492908 - Flags: review?(jst)
Comment on attachment 492908 [details] [diff] [review]
Patch v3

r=jst with a license header added to nsUTF8UtilsSSE2.cpp
Attachment #492908 - Flags: review?(jst) → review+
Comment on attachment 492908 [details] [diff] [review]
Patch v3

Requesting a2.0.

This plus bug 585978 yields a significant perf improvement when creating text
nodes.  It's a self-contained change which shouldn't have strange interactions
with other code, although of course it's string processing, so mistakes would
be fatal.
Attachment #492908 - Flags: approval2.0?
Attachment #492908 - Flags: approval2.0? → approval2.0+
Although this is now (finally) unblocked, I think it's too late to check this in for 4.0, so I'm clearing the a2.0 flag.  Feel free to renom if you disagree!
Attachment #492908 - Flags: approval2.0+
Whiteboard: [check-in-after-branch]
Blocks: post2.0

No longer blocks: post2.0
Depends on: post2.0
Ehsan checked this in to Cedar on Sunday:

http://hg.mozilla.org/projects/cedar/rev/d0b98f8c4734
Whiteboard: [check-in-after-branch] → [fixed-in-cedar][check-in-after-branch]
Whiteboard: [fixed-in-cedar][check-in-after-branch] → [fixed-in-cedar]

http://hg.mozilla.org/mozilla-central/rev/d0b98f8c4734
Status: ASSIGNED → RESOLVED
Last Resolved: 8 years ago
No longer depends on: post2.0
Resolution: --- → FIXED
Whiteboard: [fixed-in-cedar]
Target Milestone: --- → mozilla2.2