Closed Bug 506430 Opened 15 years ago Closed 15 years ago

Optimize UTF8 to UTF16 conversion

Categories

(Core :: XPCOM, defect)


Tracking


RESOLVED FIXED

People

(Reporter: vlad, Assigned: vlad)

References

(Blocks 1 open bug)

Details

(Whiteboard: [ts])

Attachments

(9 files, 2 obsolete files)

1.90 KB, patch
897.29 KB, text/plain
4.99 KB, patch
14.50 KB, text/plain
1.99 KB, patch
9.64 KB, patch
6.60 KB, patch
jst: review+
4.90 KB, patch
487 bytes, patch
vlad: review+
ConvertUTF8_to_UTF16 shows up at about 5-8% of warm start time.  Our implementation is straightforward C code; I bet we can find a much faster implementation using SSE, NEON, whatever's available.
There is u8u16, but I don't think its license is GPL-compatible.
Quick and dirty test:

inBufSize: 2155589
Convert_1: 4.78326 ms
Convert_dumb: 2.2116 ms
Convert_dumb2: 2.8301 ms
Convert_u8u16: 1.90388 ms

the input is all ASCII, about 2MB worth.

* Convert_1 is our current routine. 
* Convert_dumb just does u8->u16 straight; it's not a true utf8->utf16 conversion.
* Convert_dumb2 walks the input string and looks for any high bits; if no high bits are set, it does a straight u8->u16 conversion, otherwise it calls Convert_1 (so it basically times the overhead of checking for any high bits set).
* Convert_u8u16 uses u8u16.

Convert_dumb could be accelerated even more using SSE; my impl is just a straight C impl.

I'd suggest that we take the dumb2 approach in our current code and keep searching for something faster.
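For reference, a minimal sketch of the dumb2 idea. The function name is hypothetical, and to keep the sketch self-contained it reports failure to the caller rather than calling Convert_1 (the existing converter) directly:

// Sketch of the dumb2 approach: scan for any byte with the high bit set.
// If none is found the input is pure ASCII and each byte widens directly to
// a UTF-16 code unit; otherwise report failure so the caller can run the
// existing full converter (Convert_1 in the benchmark).
static bool TryConvertAsciiOnly(const char *src, int srcLen, unsigned short *dst)
{
  for (int i = 0; i < srcLen; ++i)
    if (src[i] & 0x80)
      return false;                 // non-ASCII byte: caller falls back
  for (int i = 0; i < srcLen; ++i)  // pure ASCII: straight u8 -> u16 widen
    dst[i] = (unsigned short)(unsigned char)src[i];
  return true;
}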
Er, sorry:

Convert_1: 7.22484 ms
Convert_dumb: 2.18996 ms
Convert_dumb2: 2.84106 ms
Convert_u8u16: 1.8791 ms

The 4.7ms number was with some hacks applied.
Attached patch: sample patch
Convert_1: 6.70116 ms
Convert_2: 2.73848 ms
Convert_dumb: 2.12909 ms
Convert_dumb2: 2.73505 ms
Convert_u8u16: 1.77669 ms

Here's an implementation of Convert_2.  Note that I'm trying to optimize for ASCII here, and also that this doesn't touch the utf8->utf16 conversion that happens in JS, and maybe not the one for XBL/XML either.  Not sure.

This needs some SSE2 love, and it should probably also reuse the work it already does to find the initial all-ASCII portion of the string by doing the fast copy for that segment...
blocking1.9.1: needed → ---
Flags: blocking1.9.2?
Maybe mmoy can help?
I ran a comparison of char-to-short widening using C code and SSE2 intrinsics and came up with 625 for C and 484 for intrinsics. This was the best case, with input and output both 16-byte aligned. To make this work in the real world there would have to be some special code at the beginning and end to handle the alignment, which would consume some cycles.

I can code up something to do the high-bit testing too just to get comparison times.

A few questions:

- Can I assume that in/out are aligned on dwords?
- What is the typical number of characters to be processed?
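To make the alignment handling concrete, here is a hedged sketch of the kind of loop being described. The structure (scalar prologue until the source reaches a 16-byte boundary, aligned SSE2 body, scalar tail) is an assumption, not mmoy's actual code:

#include <emmintrin.h>
#include <stdint.h>

// Widen an all-ASCII buffer from bytes to 16-bit units: scalar until src is
// 16-byte aligned, then 16 bytes per iteration with aligned SSE2 loads, then
// a scalar tail for whatever is left.
static void WidenAscii(const char *src, unsigned short *dst, int len)
{
  while (len > 0 && ((uintptr_t)src & 15) != 0) {   // scalar alignment prologue
    *dst++ = (unsigned short)(unsigned char)*src++;
    --len;
  }
  const __m128i zero = _mm_setzero_si128();
  while (len >= 16) {                               // aligned SSE2 body
    __m128i bytes = _mm_load_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst,       _mm_unpacklo_epi8(bytes, zero));
    _mm_storeu_si128((__m128i *)(dst + 8), _mm_unpackhi_epi8(bytes, zero));
    src += 16; dst += 16; len -= 16;
  }
  while (len-- > 0)                                 // scalar tail
    *dst++ = (unsigned short)(unsigned char)*src++;
}

Note that the destination stores here are still unaligned (movdqu), which is exactly the penalty discussed above for the output side.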
(In reply to comment #6)
> I ran a comparison of char to short using c code and sse2 intrinsics and came
> up with 625 for c and 484 for intrinsics. This was with the best case of input
> and output aligned 16. To get this to work in the real world, there would have
> to be some special code at the beginning and end to handle the alignment stuff
> and this would consume some cycles.
> 
> I can code up something to do the high-bit testing too just to get comparison
> times.

Great!

> A few questions:
> 
> - Can I assume that in/out are aligned on dwords?

I /think/ so.  If not, we may be able to change things around so that it's always true.

> - What is the typical number of characters to be processed?

Good question.  Someone should do some instrumentation and figure that out... it would probably be helpful to know the distribution of run lengths, and also, for each run, how many ASCII chars come before the first char with the high bit set.
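A hypothetical instrumentation hook of the kind being suggested (the name and output format are made up):

#include <stdio.h>

// For each conversion, log the buffer length and the length of its leading
// ASCII run, to get a feel for the distribution asked about above.
static void LogConversionStats(const char *src, int len)
{
  int asciiPrefix = 0;
  while (asciiPrefix < len && !(src[asciiPrefix] & 0x80))
    ++asciiPrefix;
  fprintf(stderr, "utf8->utf16: len=%d leading_ascii=%d\n", len, asciiPrefix);
}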
Blocks: 506431
(In reply to comment #0)
> ConvertUTF8_to_UTF16 shows up at about 5-8% of warm start time.  Our
> implementation is straightforward C code; I bet we can find a much faster
> implementation using SSE, NEON, whatever's available.

How much data is being converted? 5% of 900ms is 45ms which would be like 12MB of data, which seems like a lot of text... Is all of that unique text or are we converting the same text more than once?
> How much data is being converted?

Well, let's see.  Quickly instrumenting the sync case in nsXBLService::FetchBindingDocument shows us loading about 630KB of UTF-8 encoded XBL at startup.

Doing the same for nsStreamLoader::OnDataAvailable (which should really only be hit for CSS and maybe scripts) I see us loading about 1.1MB of data that way.

Then there are the actual XUL documents, of course, not to mention whatever internal conversions we might be doing.
About 900k worth, nothing obviously duplicated. I don't know if any of the stuff is unnecessary, but there sure is a lot of it.
Attachment #391128 - Attachment mime type: application/octet-stream → text/plain
I instrumented an image with the patch and it came up with 748671 bytes converted. But doing a lot of other things generates conversions too. As far as startup time goes, is precomputing the converted text a possibility?
Possibly; I looked into doing a quick-and-dirty iconv conversion of the relevant files, but that caused other problems since that switches the default charset of a bunch of documents (with a utf16 bom).  There's also additional IO cost.  But if someone can think of a way to do it without breaking anything, it'd be good to evaluate that cost.
However, a lot of this is JS modules and XBL, which we want to fastload...
I have an SSE2 implementation of the one to two byte conversion working but it uses MOVDQU to write the output doublequadwords. MOVDQU is an unaligned write so there's a performance penalty there. I did some additional instrumenting and it looks like both source and destination are doubleword aligned which means that I can add code to do aligned writes with a little effort. I have to add some code to turn this on for Mac OS X and Windows too.
This patch has been tested on Mac OS X and Windows and it appears to work (returns correct results). There is special code to handle the cases where the offset between the source and destination is 0 or 8 after the source has been double quadword aligned. These two cases are seen about 99% of the time. Other alignments are handled by the movdqu (unaligned move) default code.

Could someone run a performance test to determine if this code helps?
Warm start on my fast win7 box went from 343-344ms down to 332-334ms.  Not a huge boost, but noticeable.  I also added a block at the start of the high-bit-set checking to align the pointer to 4 bytes, so that we don't blow up on ARM.

However... after just doing a bunch more startups, I started seeing some 342's again.  Now I'm starting to doubt my measurement accuracy; I'll try this again on my slower Mac tomorrow.

Also, u8u16 is really crazy fast!  I'll plug your code into the hack benchmark that I wrote tomorrow as well to see how it compares.
On my mac -- goes from 825-843 to 780-790.  I'll take 5-6%!
You could also probably use _mm_movemask_epi8 instead of & 0x80808080 to pick up another couple percent.
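For comparison, a hedged standalone sketch of the two chunk tests being discussed (not the benchmark code itself):

#include <emmintrin.h>
#include <stdint.h>

// Scalar test: OR four dwords of the 16-byte chunk together and check the
// per-byte high bits with the 0x80808080 mask.
static bool ChunkHasHighBitScalar(const uint32_t *chunk /* 4 dwords */)
{
  return ((chunk[0] | chunk[1] | chunk[2] | chunk[3]) & 0x80808080u) != 0;
}

// SSE2 test: _mm_movemask_epi8 gathers the top bit of each of the 16 bytes
// into one integer, so any nonzero result means a non-ASCII byte.
static bool ChunkHasHighBitSSE2(const char *chunk /* 16 bytes */)
{
  return _mm_movemask_epi8(_mm_loadu_si128((const __m128i *)chunk)) != 0;
}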
Attached file standalone benchmark
Just added the sse code into the benchmark -- it gets us in the right ballpark, g++ -O2:

                orig: 8.06701 ms
                 chk: 2.87505 ms
          dumb_ascii: 2.26799 ms
                 sse: 2.86604 ms
           sse_ascii: 2.23546 ms
               u8u16: 1.8919 ms

"chk" is basically the same as the sse case, where it checks for high byte, but then instead of using sse it just does a simple byte-to-short copy loop.  The two _ascii variants just do straight copies and don't do the high bit check.  Kinda unfortunate that the sse variant is basically the same speed.  But, it doesn't hurt.

So, what I'm trying to say: let's get this patch in (but with the additional alignment fix).

I've also attached the benchmark; apologies for horrible code.  Some more bits:

g++ -O2 -mtune=prescott (gcc 4.0.1):

                orig: 6.72993 ms
                 chk: 2.87276 ms
          dumb_ascii: 2.23084 ms
                 sse: 2.90849 ms
           sse_ascii: 2.23746 ms
               u8u16: 1.8933 ms

similar for g++-4.2 with -mtune=core2.

g++ -O3 -mtune=prescott (gcc 4.0.1):

                orig: 7.47574 ms
                 chk: 2.87053 ms
          dumb_ascii: 2.23784 ms
                 sse: 2.90917 ms
           sse_ascii: 2.24879 ms
               u8u16: 1.91424 ms

(yes, slower than -O2, even worse slowdown with g++-4.2)

g++ -Os -mtune=prescott:

                orig: 9.78382 ms
                 chk: 2.86463 ms
          dumb_ascii: 2.25211 ms
                 sse: 3.0451 ms
           sse_ascii: 2.33603 ms
               u8u16: 2.01193 ms
Just the 4-byte alignment at the start in this patch.  Let's get this in, and someone can write the followup bug for linux (really, the only thing that's missing is having an sse2_available equivalent defined somewhere on linux).
Attachment #395743 - Flags: superreview?
Attachment #395743 - Flags: review?(mmoy)
Attachment #395743 - Attachment is patch: true
Attachment #395743 - Attachment mime type: application/octet-stream → text/plain
Attachment #395743 - Flags: superreview? → superreview?(jst)
And while I'm here, here's the benchmark for real utf8 data (longer than the earlier set, so they're not directly comparable... this is 3145728 bytes, original was 2155589):

g++ -O2:

                orig: 15.4942 ms
                 chk: 15.4288 ms
          dumb_ascii: 3.77464 ms
                 sse: 15.4654 ms
           sse_ascii: 3.78217 ms
               u8u16: 7.89827 ms

g++ -O2 -mtune=prescott:

                orig: 14.4634 ms
                 chk: 14.2659 ms
          dumb_ascii: 3.78667 ms
                 sse: 14.2653 ms
           sse_ascii: 3.77649 ms
               u8u16: 7.92226 ms
Replacing "& 0x80808080" with "_mm_movemask_epi8(*(__m128i *) src)" gives me the following results on the benchmark (I used my own pure ASCII text, so obviously the results aren't directly comparable, and I commented out u8u16 since it doesn't seem to want to build on x64)

Before
kshuey@linux-5wgr:~/Downloads> ./t.o
inBufSize: 512262
                orig: 5.40125 ms
                 chk: 2.95589 ms
          dumb_ascii: 2.64377 ms
                 sse: 1.31553 ms
           sse_ascii: 0.82156 ms

After
kshuey@linux-5wgr:~/Downloads> ./t.o
inBufSize: 512262
                orig: 5.54781 ms
                 chk: 2.9853 ms
          dumb_ascii: 2.5621 ms
                 sse: 0.994294 ms
           sse_ascii: 0.81584 ms


That works out to a relative savings of 25%.  Given your number of 5-6% of startup time overall, that shaves off a little over 1% more.
A few comments:

A code comment that said something like "Use the real converter if there are any bytes with the high bit set, otherwise just expand ASCII to UTF-16" might have saved me a few minutes of head-scratching.

The "return NS_OK" at the end doesn't seem like it's equivalent to what the other code does; don't you need to return NS_OK_UDEC_MOREOUTPUT in some cases?

Finally, the UTF8-UTF16 conversion code in xpcom/string has been optimized a good bit more than this code (although it's also a little less tolerant of errors, since it's designed for "sanitized" data... although I think that's actually mostly been fixed recently).  I suspect ConvertReal could be made a good bit faster; one thing I might try would be loading the member variables into locals at the start of the function and saving them back into the member variables at the end.
Shouldn't the |#endif ! | be |#endif // | instead? Why are you using two different *_SSE2 macros when you are really only using them together?

Are you planning to address Linux (or, in general, other GCC platforms) in a similar way (in another bug)? I hope I adapted your benchmark correctly for my Linux x86_64 system (taking the MAC_SSE2 route without u8u16); I get these numbers for comparison:

inBufSize: 918825
                orig: 3.12586 ms
                 chk: 2.68073 ms
          dumb_ascii: 0.815924 ms
                 sse: 2.68309 ms
           sse_ascii: 0.446675 ms
I think your patch should not define MAC_SSE2.  gcc defines __SSE2__ for Mac OS X and x86_64, so you should use __SSE2__ instead of a Mac-specific define.
So I did some testing of vlad's benchmark from attachment 395742 [details] using attachment 391128 [details] as lipsum.txt.

My understanding is that the changes in this patch are supposed to move us from "orig" to "sse".

On my laptop (Linux, x86-64, Thinkpad), I see:

                orig: 2.97091 ms
                 chk: 2.73285 ms
          dumb_ascii: 0.91355 ms
                 sse: 2.74453 ms
           sse_ascii: 0.83265 ms

on mrbkap's laptop (x86 Mac), I actually see chk and sse being slightly slower than orig (I guess the cost of the extra pass varies).

Now, I wrote a patch to the original code to:
 * not use member variables during the conversion
 * add a nested inner loop

and on my laptop I now get:

                orig: 1.34414 ms

(I think these numbers only really have two significant figures, though.)


So I think maybe it would be better to optimize the original code and avoid the two-pass approach.
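A minimal standalone sketch of those two changes, under the assumption that the real converter keeps its decoding state in member variables; the struct and names here are illustrative, and the multibyte path is reduced to a placeholder:

#include <stddef.h>
#include <stdint.h>

// Stand-in for the converter's member variables.
struct Utf8State { uint32_t pending; int bytesLeft; };

static size_t ConvertChunk(Utf8State &aState, const unsigned char *src,
                           size_t len, unsigned short *dst)
{
  // (1) Copy member-like state into locals so the hot loop never reloads
  // through a pointer; write it back once at the end.
  uint32_t pending   = aState.pending;
  int      bytesLeft = aState.bytesLeft;
  unsigned short *out = dst;

  size_t i = 0;
  while (i < len) {
    if (bytesLeft == 0 && src[i] < 0x80) {
      // (2) Nested inner loop: copy a whole ASCII run with no per-byte
      // state-machine bookkeeping.
      do { *out++ = src[i++]; } while (i < len && src[i] < 0x80);
    } else {
      // Placeholder: the real converter runs its multibyte state machine
      // here; the sketch just emits U+FFFD per non-ASCII byte and moves on.
      *out++ = 0xFFFD;
      pending = 0; bytesLeft = 0;
      ++i;
    }
  }

  aState.pending   = pending;    // write the locals back into the "members"
  aState.bytesLeft = bytesLeft;
  return (size_t)(out - dst);
}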
Except I realize I'm supposed to be testing with pure ASCII data, since vlad's test doesn't do any chunking.  With that, on mrbkap's laptop (with SSE), I see:

original:


inBufSize: 2300000
                orig: 8.06885 ms
                 chk: 2.68913 ms
          dumb_ascii: 2.11991 ms
                 sse: 2.10505 ms
           sse_ascii: 1.70828 ms


with my patch:

inBufSize: 2300000
                orig: 2.92254 ms
                 chk: 2.52946 ms
          dumb_ascii: 2.0997 ms
                 sse: 1.9211 ms
           sse_ascii: 1.64253 ms
At the risk of further cluttering the bug with benchmarks and attachments ...

I've been working on further optimizing the sse variation by eliminating the dual loop structure (i.e. combining the scanning loop and the processing loop).

What I came up with is "sseoneloop".  It scans and converts at the same time as long as it finds only ASCII.  Once it finds non-ASCII it calls ConvertReal.  Right now it calls ConvertReal to do the whole thing so theoretically it could be worse than orig (if you had a bunch of ASCII text with one multibyte character at the very end, runtime would be approximately sseoneloop + orig) but it's fairly simple to just call ConvertReal on what remains (given that you already have a nice character boundary).

I'll clean up the code and attach it in patch form in a bit.  As for numbers:

orig, chk, dumb_ascii, and sse_ascii are all what they are in vlad's benchmark.  dbaron is dbaron's code.  sseoneloop is as above.

Benchmark results: x86-64 linux
inBufSize: 3073578
                orig: 30.6071 ms
                 chk: 18.392 ms
          dumb_ascii: 14.9552 ms
              dbaron: 20.4106 ms
                 sse: 8.57576 ms
          sseoneloop: 4.90909 ms
           sse_ascii: 5.37327 ms

I find it interesting that all of your numbers seem to come out with sse about equal to chk, when every time I run it, sse comes out almost twice as fast as chk.  Also, I don't believe there are even two significant figures in these results; from what I've seen they fluctuate within a roughly one-millisecond interval.
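A hedged sketch of the one-loop shape. To keep it self-contained it returns how many bytes were consumed, rather than calling ConvertReal (the existing slow-path converter) directly, so a caller can hand the remainder over:

#include <emmintrin.h>
#include <stddef.h>

// Scan and convert in the same pass: widen 16-byte chunks while they are
// pure ASCII and stop at the first chunk containing a high bit. Returns the
// number of input bytes converted; the caller finishes the rest with the
// real UTF-8 decoder, starting on a clean byte boundary.
static size_t ConvertAsciiPrefixSSE2(const char *src, size_t len,
                                     unsigned short *dst)
{
  const __m128i zero = _mm_setzero_si128();
  size_t done = 0;
  while (len - done >= 16) {
    __m128i bytes = _mm_loadu_si128((const __m128i *)(src + done));
    if (_mm_movemask_epi8(bytes) != 0)
      break;                                  // non-ASCII somewhere in here
    _mm_storeu_si128((__m128i *)(dst + done),     _mm_unpacklo_epi8(bytes, zero));
    _mm_storeu_si128((__m128i *)(dst + done + 8), _mm_unpackhi_epi8(bytes, zero));
    done += 16;
  }
  return done;
}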
This patch adds dbaron's code to the benchmark and adds sseoneloop as described in my previous comment.
If you're on a 32-bit system you'll probably want to add u8u16 back in after you apply that patch, as I forgot to restore it before diffing.
And with -O2

inBufSize: 3073578
                orig: 14.5715 ms
                 chk: 6.72878 ms
          dumb_ascii: 5.47907 ms
              dbaron: 6.89872 ms
                 sse: 5.41505 ms
          sseoneloop: 0.82828 ms
           sse_ascii: 4.41576 ms

And -O3

inBufSize: 3073578
                orig: 14.3186 ms
                 chk: 6.87903 ms
          dumb_ascii: 5.33305 ms
              dbaron: 6.39502 ms
                 sse: 4.88152 ms
          sseoneloop: 0.706781 ms
           sse_ascii: 3.99238 ms
Ignore the last optimized benchmarks; I was missing a piece of the routine (it was just throwing away its output, and the compiler knew that).

Corrected benchmarks:
-O0
inBufSize: 3073578
                orig: 30.5634 ms
                 chk: 18.0327 ms
          dumb_ascii: 14.9367 ms
              dbaron: 20.4797 ms
                 sse: 8.27054 ms
          sseoneloop: 5.60232 ms
           sse_ascii: 5.32999 ms

-O2
inBufSize: 3073578
                orig: 14.182 ms
                 chk: 5.98159 ms
          dumb_ascii: 5.13692 ms
              dbaron: 6.23665 ms
                 sse: 4.77082 ms
          sseoneloop: 3.89594 ms
           sse_ascii: 3.86942 ms

-O3
inBufSize: 3073578
                orig: 14.1013 ms
                 chk: 6.66146 ms
          dumb_ascii: 4.79984 ms
              dbaron: 6.18853 ms
                 sse: 4.75576 ms
          sseoneloop: 3.81217 ms
           sse_ascii: 3.801 ms

Sorry for the incorrect numbers earlier.  I should have realized it's absurd for it to perform faster than the version that doesn't even check whether there are multibyte characters.  The overhead of checking for those characters is incredibly small, though: only 10 microseconds at -O3 according to this.
Attachment #396042 - Attachment is obsolete: true
Kyle, thanks for your help!  It looks like a combination of dbaron's work (for when we have multibyte chars) and yours (for optimizing the common case of where we don't) is what we want.  I'll cook up a patch for the actual code later on today.
Attachment #395743 - Attachment is obsolete: true
Attachment #395743 - Flags: superreview?(jst)
Attachment #395743 - Flags: review?(mmoy)
(In reply to comment #34)
> Kyle, thanks for your help!  It looks like a combination of dbaron's work (for
> when we have multibyte chars) and yours (for optimizing the common case of
> where we don't) is what we want.  I'll cook up a patch for the actual code
> later on today.

No problem.  Sounds great.  If you don't get around to it and want me to turn it into a patch just let me know.  I probably couldn't get around to it until tomorrow, but would be happy to do it if wanted.
Attached patch: final patch
Ok, here is what I think is a good final patch.  It includes dbaron's optimization as the core piece, and then two customized ascii run converters, one for SSE2, and one for ARM.

Here's x86, with the SSE code...

ascii:
                orig: 10.4181 ms (ok)
              dbaron: 4.79316 ms (ok)
               final: 2.34829 ms (ok)

utf8 content (wikipedia utf8 page, so short html runs and long utf8 runs):

                orig: 15.1058 ms (ok)
              dbaron: 12.4791 ms (ok)
               final: 13.4203 ms (ok)

so a little slower than the simple loop, due to a few extra bits of code for short runs.

ARM:

ascii:
                orig: 164.045 ms (ok) 006d 006f 0064 006f
               final: 80.1377 ms (ok) 006d 006f 0064 006f

utf8:

                orig: 177.403 ms (ok) 006f 006e 0069 0063
             dbaron2: 117.875 ms (ok) 006f 006e 0069 0063

(3MB in all cases)
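To make the overall shape concrete, here is a hedged sketch of the structure described above: a core loop that hands each ASCII run to a Convert_ascii_run fast path (a scalar stand-in below; the patch supplies SSE2 and ARM variants), and otherwise falls into the ordinary multibyte handling, which is reduced to a placeholder here:

// Scalar stand-in for the platform-specific fast path; the real patch
// provides SSE2 and ARM versions behind ifdefs. Advances src and dst.
static inline void Convert_ascii_run(const char *&src, unsigned short *&dst,
                                     int len)
{
  while (len-- > 0)
    *dst++ = (unsigned short)(unsigned char)*src++;
}

// Core loop shape: measure the ASCII run starting at src, convert it with
// the fast path, then deal with the next non-ASCII byte.
static void ConvertBuffer(const char *src, int srcLen, unsigned short *dst)
{
  const char *end = src + srcLen;
  while (src < end) {
    const char *p = src;
    while (p < end && !(*p & 0x80))     // find the end of the ASCII run
      ++p;
    Convert_ascii_run(src, dst, (int)(p - src));
    if (src < end) {
      // Placeholder: the real code runs the optimized multibyte state
      // machine here; the sketch emits U+FFFD and skips one byte.
      *dst++ = 0xFFFD;
      ++src;
    }
  }
}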
Attachment #396576 - Flags: superreview?(jst)
Attachment #396576 - Flags: review?(jst)
Assignee: nobody → vladimir
OS: Mac OS X → All
Hardware: x86 → All
Some nits:
Instead of  if ((*src & 0x80) == 0), one can also do if (*src >= 0).
Also, instead of if ((*in & 0xE0) == 0xC0), testing a single bit with if (*in & 0x40) is faster.
The first is sometimes (but not always!) done by a smart optimizing compiler, but not the second.

The attached patch reduces the non-ascii loop from 88 asm lines to 71, for example from:
; Line 152
	mov	al, cl
	and	al, 224					; 000000e0H
	cmp	al, 192					; 000000c0H
	jne	SHORT $LN25@Convert
; Line 155
	and	ecx, 31					; 0000001fH
	shl	ecx, 6
	mov	DWORD PTR [edx+12], ecx
; Line 156
	mov	BYTE PTR [edx+16], 1
; Line 157
	mov	BYTE PTR [edx+17], 2
	jmp	$LN30@Convert

to:
; Line 161
	test	cl, 32					; 00000020H
	jne	SHORT $LN25@Convert
; Line 164
	and	ecx, 31					; 0000001fH
	shl	ecx, 6
; Line 165
	mov	BYTE PTR [edx+17], 2
	jmp	SHORT $LN45@Convert
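For clarity, a standalone sketch of the cascading single-bit tests being proposed; as the follow-up comments note, this is only equivalent to the mask-and-compare form because the bits are tested in this exact order:

// Classify a UTF-8 byte with successive single-bit tests instead of
// mask-and-compare. Each test relies on the bits already ruled out above it.
static int Utf8SequenceLength(unsigned char c)
{
  if (!(c & 0x80)) return 1;   // 0xxxxxxx: ASCII
  if (!(c & 0x40)) return 0;   // 10xxxxxx: continuation byte, not a lead
  if (!(c & 0x20)) return 2;   // 110xxxxx: lead of a 2-byte sequence
  if (!(c & 0x10)) return 3;   // 1110xxxx: lead of a 3-byte sequence
  if (!(c & 0x08)) return 4;   // 11110xxx: lead of a 4-byte sequence
  return 0;                    // 11111xxx: invalid lead byte
}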
Vlad, shouldn't it be "include <emmintrin.h>"?

Also, are we handling support for this on linux at a later date? (presumably since there's no easy way to get at __sse2_available)

Other than that it looks good to me.  I built firefox with it on windows 7 and I think it loads noticeably faster, though I haven't done any quantitative measurements.
We have a few functions already in the tree that determine SSE2 functionality on Linux, e.g. CheckForSSE2() in jstracer.cpp and sse2_available() in qcms, so Linux on x86 shouldn't be a problem.

As I hinted in comment 24, at least the Linux x86_64 case could very well be covered by MAC_SSE2.
Yeah, the issue with linux isn't getting it to work, it's figuring out where to put the detection code.  I could just copy sse2_available somewhere, but was hoping to end up with a libxul-global spot... but maybe that's not feasible.

I missed comment 24 -- I'll change the ifdefs to __SSE2__.  We should also take the further optimizations in comment 37, though merging the patches is going to get tricky; Alfred, can you remerge after I land the ascii bit?
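For reference, a minimal sketch of the kind of runtime check under discussion, using GCC's <cpuid.h> on x86/x86-64; the in-tree versions are sse2_available() in qcms and CheckForSSE2() in jstracer, not this code:

#include <cpuid.h>   // GCC/Clang only; provides __get_cpuid

// CPUID leaf 1, EDX bit 26 reports SSE2 support.
static bool HaveSSE2()
{
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    return false;
  return (edx & (1u << 26)) != 0;
}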
(In reply to comment #37)
> Instead of  if ((*src & 0x80) == 0), one can also do if (*src >= 0).

You can't do this; you're assuming that |char| == |signed char|, but |char| can be either |signed char| or |unsigned char|.  (And I thought it was usually |unsigned char|.)

> Also instead of if ((*in & 0xE0) == 0xC0), testing bits with if (*in & 0x40) is
> also faster.

But not equivalent, since you're not checking that the bit below is unset (i.e., you want to require (*in & 0x40) && !(*in & 0x20)), which probably leads to security bugs.
(In reply to comment #41)
> But not equivalent, since you're not checking that that the bit below is unset
> (i.e., you want to require (*in & 0x40) && !(*in & 0x20)), which probably leads
> to security bugs.

Er, actually, given the order you check it, it is ok, so never mind.
(In reply to comment #38)
> Also, are we handling support for this on linux at a later date? (presumably
> since there's no easy way to get at __sse2_available)

It actually looks like there are a few tricky things about this, but I filed bug 513422 on some changes that I think will get us a bit closer.
Attachment #396576 - Flags: superreview?(jst)
Attachment #396576 - Flags: superreview+
Attachment #396576 - Flags: review?(jst)
Attachment #396576 - Flags: review+
Comment on attachment 396576 [details] [diff] [review]
final patch

+#if defined(MAC_SSE2) || defined(WIN_SSE2)
+
+static inline void
+Convert_ascii_run (const char *&src,
+                   PRUnichar *&dst,
+                   PRInt32 len)
+{
+  if (__sse2_available && len > 15) {

Reversing the order of those two checks would avoid checking if sse2 is available for short strings. Not a big deal, but would save us a few instructions when converting short strings.

r+sr=jst
http://hg.mozilla.org/mozilla-central/rev/00d2664f70d4

Checked in with the check reversed as suggested.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Backed out due to maybe-orange, and I gotta sleep so I can't watch the second cycle.  Will reland tomorrow if it ends up not being this patch's fault.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
re #40: I will look at my part after this is landed safely.
re #41: we can always cast the char pointer to signed char within the function to be sure of the signedness. Also, testing & 0x40 alone is indeed not the same, but in my patch the high bit is checked first and then the next bit.
Created bug 514140 for the non-ascii part
Blocks: 514140
(In reply to comment #46)
> backed out due to maybe-orange, and I gotta sleep so can't watch the second
> cycle.  will reland tomorrow if it ends up being not this patch's fault

These were all known intermittent oranges.
Maybe a silly question at this point in the game given that you just relanded this, but what optimization options are we using on this now?
Whatever the default is, which I think is -Os -- we should probably switch to at least -O2 here, but that can be a separate bug.
Status: REOPENED → RESOLVED
Closed: 15 years ago15 years ago
Resolution: --- → FIXED
Attachment #398357 - Flags: review?(vladimir)
Attachment #398357 - Flags: review?(vladimir) → review+
Comment on attachment 398357 [details] [diff] [review]
VC7.1 bustage fix

Pushed changeset 8bb68d6639b7 to mozilla-central.
Note: it seems that nsUnicharInputStream doesn't use this converter, but rather ConvertUTF8toUTF16 (xpcom/string/public/nsUTF8Utils.h).
Also, the CountValidUTF8Bytes function is not really needed, as one can safely assume destLen = srcLen + 1, since a UTF-8 to UTF-16 conversion never produces more code units than the number of input bytes
(see http://mxr.mozilla.org/mozilla-central/source/intl/uconv/src/nsUTF8ToUnicode.cpp#113).
The easiest way would be to change nsUnicharInputStream to use the nsUTF8ToUnicode converter instead of the one in nsUTF8Utils.h.

ConvertUTF8toUTF16 is also used in:
# xpcom/string/src/nsReadableUtils.cpp
# toolkit/xre/nsWindowsRestart.cpp

So we should also redirect these to use the optimized nsUTF8toUnicode converter, and remove the ConvertUTF8toUTF16 version completely.
That would mean moving nsUTF8toUnicode out of intl and into xpcom.  All the things you list live in xpcom and can't depend on intl.

I would be all for not having multiple UTF8-to-UTF16 conversion codepaths around...
(In reply to comment #55)

> I would be all for not having multiple UTF8-to-UTF16 conversion codepaths
> around...

A hearty "me too"!! Note however that there are some functional differences between the two converters: ConvertUTF8toUTF16 is less fault-tolerant than nsUTF8toUnicode, and can only safely be used when the input is known to be valid UTF-8. 

See also bug 497204
Blocks: 459117
Blocks: 515969
Blocks: 515956
Not blocking, but if a roll-up patch were to be nominated, we should take it.
Flags: wanted1.9.2+
Flags: blocking1.9.2?
Flags: blocking1.9.2-
Filed bug 548664 for NEON support.
Component: String → XPCOM