Add SSE versions of LossyConvertEncoding

RESOLVED FIXED in mozilla5

Status: RESOLVED FIXED
Type: defect

People

(Reporter: justin.lebar+bug, Assigned: justin.lebar+bug)

Tracking

Keywords: perf
Version: unspecified
Target Milestone: mozilla5

Attachments

(2 attachments, 7 obsolete attachments)

I think this could be vectorized on both ARM and x86.  It's used by nsTextFragment.cpp -- see bug 585978.
OS: Windows XP → All
Hardware: x86 → All
Posted patch Patch v1 (obsolete) — Splinter Review
Here's an x86/SSE2 patch.  I think this would be pretty easy to do in NEON, but I don't currently have a setup for that.

One issue: The non-vectorized code completely ignores the high-order bytes when converting from UTF-16 to ASCII.  But this patch does a saturating conversion -- i.e., if a high-order byte is non-zero, the corresponding output byte will be 0xFF.

If this matters, we can mask out the high-order bytes.  But since we should only be calling this lossy conversion function after first checking that all the high-order bytes are in fact 0, I'm not sure we care.
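The two behaviors can be illustrated with a minimal sketch (the helper name `Pack16To8` is hypothetical, not from the patch): with the mask applied, high bytes are ignored exactly as in the scalar code; without it, `packuswb` saturates.

```cpp
#include <emmintrin.h>
#include <cstdint>

// Hypothetical helper: pack 8 UTF-16 code units down to 8 bytes.
// With maskHighBytes, the high byte of each unit is discarded first,
// matching the scalar truncating behavior. Without it, packuswb
// treats inputs as signed 16-bit values and saturates to [0, 255].
static void Pack16To8(const uint16_t* src, uint8_t* dst, bool maskHighBytes) {
  __m128i in = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
  if (maskHighBytes) {
    in = _mm_and_si128(in, _mm_set1_epi16(0x00FF));
  }
  __m128i packed = _mm_packus_epi16(in, in);  // saturating 16->8 pack
  _mm_storel_epi64(reinterpret_cast<__m128i*>(dst), packed);
}
```

For example, input 0x0141 becomes 0x41 when masked but 0xFF when saturated, and 0xFF41 (negative as signed 16-bit) saturates to 0x00.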
Is it a big perf hit to mask the high bytes? I'm definitely worried about changing behavior for a function as central as this one.
Posted patch Patch v1.1 (obsolete) — Splinter Review
New patch, which (I think) replaces all the loops in nsTextFragment with calls into nsUTF8Utils.

This is a lot faster.  On my benchmark from bug 585978 [1], I get

Trunk:              220ms (first run), 155ms (later runs)
Bug 585978's patch: 180ms (first run), 122ms (later runs)
This bug's patch:   118ms (first run),  54ms (later runs)

I'll try adding in the masking as suggested above and report back.

[1] https://bugzilla.mozilla.org/attachment.cgi?id=464450
Attachment #465405 - Attachment is obsolete: true
Posted patch Patch v1.2 (obsolete) — Splinter Review
Runtime is the same with the extra ANDs.
Attachment #465421 - Attachment is obsolete: true
Keywords: perf
Summary: Vectorize LossyConvertEncoding<PRUnichar, char> in nsUTF8Utils.h → Vectorize LossyConvertEncoding in nsUTF8Utils.h
Posted patch Patch v2 (obsolete) — Splinter Review
This one adds a vectorization of the 8- to 16-bit conversion.
Assignee: nobody → justin.lebar+bug
Attachment #465431 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Patch v2 doesn't speed up this benchmark a lot on my machine, but there's a definite improvement.

For some reason this benchmark exhibits much more variance than the last one.  I'm reporting min and max observed times here.

          min    max
Trunk:    320ms  350ms
Patch v2: 270ms  320ms
Posted patch Patch v2.1 (obsolete) — Splinter Review
Ready for review.
Attachment #465478 - Attachment is obsolete: true
Attachment #465486 - Flags: review?(vladimir)
Summary: Vectorize LossyConvertEncoding in nsUTF8Utils.h → Add SSE versions of LossyConvertEncoding
See Also: → 586838
See Also: 586838
I filed bug 586838 for adding NEON support for these routines.
I looked over the patch for bug 506430 and I noticed that it does some amount of work to avoid using unaligned loads and stores.

I experimented a bit and didn't see any perf difference between these two on my machine, but perhaps it makes a difference on older machines.  I'd be OK adding the extra code to avoid the unaligned instructions, but only if we can show that it's actually faster.
Vlad tells me that he's probably not the right person to review this and the few other SSE patches I've posted.  He suggests that the usual suspects, cjones and jrmuizel, are booked pretty hard right now.

How important is this speedup?  How much do we want it for FF4?
Comment on attachment 465486 [details] [diff] [review]
Patch v2.1

Vlad says he doesn't think he's the right person to review here; untagging him.
Attachment #465486 - Flags: review?(vladimir) → review?
Posted patch Patch v2.2 (obsolete) — Splinter Review
Fixing Makefile logic for Linux 32-bit.
Attachment #465486 - Attachment is obsolete: true
Attachment #468487 - Flags: review?(jones.chris.g)
Attachment #465486 - Flags: review?
Hi Justin, I'm pretty swamped with fennec b1 work and won't be able to do this review until probably the end of next week; it's going to need a very careful read.  If you need the review sooner than that, please request from someone else.  Sorry!
Hard to say how important this is... We sort of need all the help we can get.  :(
After observing bz optimizing a similar function on IRC, I may be able to speed this up a bit by using indexed loads as opposed to pointer arithmetic.

I'll have a look.
The function I was referring to above is in bug 592786.
Huh; it looks like gcc 4.4 with -O3 (bug 590181) speeds up the existing code a fair bit on my Linux x64 box.

I see no difference with or without the patch on this bug's benchmark; we're at 200ms in either case (which is 1.3x faster than the fastest I saw that benchmark go before -O3).

On the benchmark from bug 585978 [1] (compare with comment 3), I get

Trunk, -O3:
 187ms (first run)
 127ms (later runs)
Patched (this bug plus bug 585978's patch), -O3:
 108ms (first run)
 44ms  (later runs)

It looks like -O3 is adding some SSE2 instructions, but most of the win comes from using the x86 string instructions.

[1] https://bugzilla.mozilla.org/attachment.cgi?id=464450
Depends on: 585978
> It looks like -O3 is adding some SSE2 instructions, but most of the win comes
> from using the x86 string instructions.

Worded better: -O3 is adding some SSE2 instructions, but they're not on the hot path of the benchmarks.
Well, GCC definitely generates tighter code when I use indexing rather than pointer arithmetic, but it doesn't impact my benchmarks, at least not on my Linux 64 box.
This version uses offsets instead of pointer arithmetic.

I see no difference in execution speed on my machine, but I'd be curious if this was any faster on someone else's.

My instinct is to say that we should use this version since it generates tighter code.  But a scientific comparison to the other patch is probably in order.
cc'ing derf, who may have an opinion about which version to use.
Compilers generally find array-style indexing in loops much easier to reason about than pointer arithmetic. It's been standard advice to write loops this way in Intel and AMD optimization manuals for years. This is less true on ARM, which can do pre- and post-incrementing of the pointer as part of the load/store instruction, but this code isn't meant for ARM.
Note that one counter-example we've found is that gcc apparently flubs array-style indexing if calls to inlined member functions are in the loop.  See https://bugzilla.mozilla.org/show_bug.cgi?id=587257#c20

Not an issue here, of course.
Blocks: 586838
derf, do you think you could review the SSE here and bug 585978?  Finding someone who knows SSE and has a few spare cycles seems to be the only thing that's holding this patch back.
Comment on attachment 473246 [details] [diff] [review]
Patch v2.3 (offsets instead of pointer arithmetic)

> +#ifdef MOZILLA_MAY_SUPPORT_SSE2

I'm not really qualified to review build-system changes, but it seems dangerous to me to have the call to write_sse2() guarded by this macro, yet the compilation of write_sse2() guarded by

> +ifneq (,$(INTEL_ARCHITECTURE))

> +  while (i < aSourceLength && (NS_PTR_TO_UINT32(aSource + i) & 0xf)) {

How long are typical inputs to this function?

If they're often smaller than the alignment amount, it may be worth optimizing this test, e.g., computing the loop bound up front with something along the lines of
  alignLength = min(aSourceLength, -NS_PTR_TO_UINT32(aSource) & 0xf);
  while (i < alignLength ) { ...

> +  // Walk 32 bytes (two XMM registers) at a time.

If they're often longer than this, it may be worth unrolling this loop more. packuswb is very slow (4 cycles) on Intel CPUs before Penryn, and it's helpful to have something else to do while it's executing. You've got plenty of registers for it.

The same comment applies to punpck[lh]bw in LossyConvertEncoding8to16::write_sse2() as well.

> +  // Align destination to 16-byte boundary.

I realize since there's twice as many stores as loads, by raw cycle counts, it's better to use aligned stores. But you don't need to wait for the stores to complete before you go on to the next iteration, while you _do_ need to wait for the loads to complete before you can finish unpacking. So I suspect it's still better to align the loads (feel free to benchmark and prove me wrong).

> +    __m128i source1 = _mm_load_si128((__m128i*)(aSource + i));
> ...
> +    __m128i source2 = _mm_load_si128((__m128i*)(aSource + i + 8));
> ...
> +    __m128i source = _mm_loadu_si128((__m128i*)(aSource + i));

Same comments as bug 585978 about the C-style cast and discarding the const qualifier.
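The loop-bound suggestion above can be sketched as follows (the function name is illustrative, not from the patch): compute up front how many elements precede the next 16-byte boundary, instead of testing the pointer's alignment on every head-loop iteration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// How many bytes of the scalar head loop are needed before aSource
// reaches a 16-byte boundary, capped at the input length. (-addr) & 0xf
// is the distance up to the next multiple of 16.
size_t AlignPrologueLength(const char* aSource, size_t aSourceLength) {
  uintptr_t addr = reinterpret_cast<uintptr_t>(aSource);
  size_t toAlign = static_cast<size_t>(-addr) & 0xf;
  return std::min(aSourceLength, toAlign);
}
```

The head loop then runs `while (i < alignLength)` with no per-iteration pointer test.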
(In reply to comment #25)
> Comment on attachment 473246 [details] [diff] [review]
> Patch v2.3 (offsets instead of pointer arithmetic)
> 
> > +#ifdef MOZILLA_MAY_SUPPORT_SSE2
> 
> I'm not really qualified to review build-system changes, but it seems dangerous
> to me to have the call to write_sse2() guarded by this macro, yet the
> compilation of write_sse2() guarded by
> 
> > +ifneq (,$(INTEL_ARCHITECTURE))

Hm, that's a good point.  Ted, what do you think?  Should we just use INTEL_ARCHITECTURE (maybe renamed) in both the Makefiles and the code?  I guess it's not a huge deal either way, because the worst that happens if we mess it up is the tree burns.

> How long are typical inputs to this function?

I instrumented the code and went to a few websites.  Most invocations process fewer than 10 characters, but some process hundreds, and a few process one or two thousand.  The mean is about 30.

> If they're often longer than this, it may be worth unrolling this loop more.
> packuswb is very slow (4 cycles) on Intel CPUs before Penryn, and it's helpful
> to have something else to do while it's executing. You've got plenty of
> registers for it.
> 
> The same comment applies to punpck[lh]bw in
> LossyConvertEncoding8to16::write_sse2() as well.

Unrolling doesn't make a noticeable difference on my modern machine, so if you think it'll be faster on some older cores, I'm happy to leave it in.

> I realize since there's twice as many stores as loads, by raw cycle counts,
> it's better to use aligned stores. But you don't need to wait for the stores to
> complete before you go on to the next iteration, while you _do_ need to wait
> for the loads to complete before you can finish unpacking. So I suspect it's
> still better to align the loads (feel free to benchmark and prove me wrong).

Again, no real difference on my machine, but I'm happy with this option.

Patch in a moment.
Posted patch Patch v3Splinter Review
Given that so many invocations are relatively small, maybe I should have a step after the unrolled loop that processes just one XMM register.
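The structure being discussed can be sketched like this (a sketch with illustrative names, not the patch itself): a scalar head until the source is aligned, an unrolled two-register body, a single-XMM step for medium inputs, then a scalar tail. Note this sketch uses the saturating pack in the SIMD paths while the scalar paths truncate; as discussed above, a consistent implementation would mask the high bytes.

```cpp
#include <emmintrin.h>
#include <cstddef>
#include <cstdint>

void LossyConvert16To8(const uint16_t* aSource, char* aDest, size_t aLength) {
  size_t i = 0;
  // Scalar head: advance until aSource + i is 16-byte aligned.
  while (i < aLength && (reinterpret_cast<uintptr_t>(aSource + i) & 0xf)) {
    aDest[i] = static_cast<char>(aSource[i]);  // truncates high byte
    ++i;
  }
  // Unrolled body: 16 code units (two aligned XMM loads) per iteration.
  for (; i + 16 <= aLength; i += 16) {
    __m128i lo = _mm_load_si128(reinterpret_cast<const __m128i*>(aSource + i));
    __m128i hi = _mm_load_si128(reinterpret_cast<const __m128i*>(aSource + i + 8));
    __m128i packed = _mm_packus_epi16(lo, hi);  // saturating 16->8 pack
    _mm_storeu_si128(reinterpret_cast<__m128i*>(aDest + i), packed);
  }
  // Single-register step, so small-to-medium inputs still get one SIMD pass.
  if (i + 8 <= aLength) {
    __m128i lo = _mm_load_si128(reinterpret_cast<const __m128i*>(aSource + i));
    __m128i packed = _mm_packus_epi16(lo, lo);
    _mm_storel_epi64(reinterpret_cast<__m128i*>(aDest + i), packed);
    i += 8;
  }
  // Scalar tail for the last few code units.
  for (; i < aLength; ++i) {
    aDest[i] = static_cast<char>(aSource[i]);
  }
}
```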
Attachment #468487 - Attachment is obsolete: true
Attachment #473246 - Attachment is obsolete: true
Attachment #492908 - Flags: review?(tterribe)
Attachment #468487 - Flags: review?(jones.chris.g)
cc'ing ted, re comment 26.
In general it's good to make your #ifdefs match your Makefile conditions. In this case, MAY_SUPPORT_SSE2 is a subset of INTEL_ARCHITECTURE, so it's probably not harmful.
Comment on attachment 492908 [details] [diff] [review]
Patch v3

r+=me
Attachment #492908 - Flags: review?(tterribe) → review+
Comment on attachment 492908 [details] [diff] [review]
Patch v3

I think this needs a review from someone in content, too.  jst?
Attachment #492908 - Flags: review?(jst)
Comment on attachment 492908 [details] [diff] [review]
Patch v3

r=jst with a license header added to nsUTF8UtilsSSE2.cpp
Attachment #492908 - Flags: review?(jst) → review+
Comment on attachment 492908 [details] [diff] [review]
Patch v3

Requesting a2.0.

This plus bug 585978 yields a significant perf improvement when creating text
nodes.  It's a self-contained change which shouldn't have strange interactions
with other code, although of course it's string processing, so mistakes would
be fatal.
Attachment #492908 - Flags: approval2.0?
Attachment #492908 - Flags: approval2.0? → approval2.0+
Although this is now (finally) unblocked, I think it's too late to check this in for 4.0, so I'm clearing the a2.0 flag.  Feel free to renom if you disagree!
Attachment #492908 - Flags: approval2.0+
Whiteboard: [check-in-after-branch]
Blocks: post2.0

No longer blocks: post2.0
Depends on: post2.0
Ehsan checked this in to Cedar on Sunday:

http://hg.mozilla.org/projects/cedar/rev/d0b98f8c4734
Whiteboard: [check-in-after-branch] → [fixed-in-cedar][check-in-after-branch]
Whiteboard: [fixed-in-cedar][check-in-after-branch] → [fixed-in-cedar]

http://hg.mozilla.org/mozilla-central/rev/d0b98f8c4734
Status: ASSIGNED → RESOLVED
Last Resolved: 8 years ago
No longer depends on: post2.0
Resolution: --- → FIXED
Whiteboard: [fixed-in-cedar]
Target Milestone: --- → mozilla2.2