Closed Bug 129387 Opened 23 years ago Closed 7 years ago

speed up Unicode en/decoders for large character sets(CJK)

Categories

(Core :: Internationalization, defect)

x86
Linux
defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla56

People

(Reporter: nicolas, Assigned: hsivonen)

References


Details

(Keywords: intl, perf, Whiteboard: [fixed by encoding_rs])

Attachments

(2 files, 6 obsolete files)

Gtk platform. Any page with Korean characters uses a lot of CPU. During page loading, several seconds are spent in nsFontGTKNormal::GetWidth(). A page with charset "euc-kr" spends on average 50 to 100 times more time in this function than a Western one.
This is closely related to the performance of uMapCode() in intl/src/uconv/umap.c. Within that function, a sequential search is used to find the correct mapping in the *.ut/*.uf tables, so the larger the table, the poorer the performance. I tried using binary search instead and got a great improvement, but it slightly breaks the mapping tables that are not sorted by the srcBegin field. That can be fixed by regenerating the tables sorted by srcBegin. However, I could not find the raw tables. Can anyone point me to where to get them?
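For illustration only, a minimal sketch of the suggested binary search, assuming a simplified cell layout sorted by srcBegin; the struct and function names below are hypothetical stand-ins, not the actual uMapCell/uMapCode definitions in umap.c:

#include <stdint.h>

/* Hypothetical, simplified stand-in for a mapping cell: each cell covers the
   source range [srcBegin, srcEnd] (srcEnd == srcBegin for single chars). */
typedef struct {
  uint16_t srcBegin;
  uint16_t srcEnd;
  uint16_t destBegin;
} MapCellSketch;

/* Binary search over cells sorted by srcBegin with non-overlapping ranges;
   returns the index of the cell containing 'in', or -1 if unmapped. */
static int FindCellBinary(const MapCellSketch *cells, int n, uint16_t in)
{
  int lo = 0, hi = n - 1;
  while (lo <= hi) {
    int mid = lo + (hi - lo) / 2;
    if (in < cells[mid].srcBegin)
      hi = mid - 1;
    else if (in > cells[mid].srcEnd)
      lo = mid + 1;
    else
      return mid;
  }
  return -1;
}

With a table of n entries this replaces an O(n) scan with O(log n) probes, which is why the gain grows with table size (largest for the CJK tables).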
Keywords: intl
QA Contact: ruixu → ylong
assign to smontagu for performance
Assignee: yokoyama → smontagu
Simon: I'd recommend you chat with Frank about this.
Confirmed that the referenced URL's page does use a lot of CPU on RH7.2.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: perf
The performance issue resembles bug 113549 (Thai (?) fonts). When I loaded http://www.nstda.or.th/newsroom/pr/pr010998.html at the beginning of December, it took around two minutes to load. When I loaded it again in the middle of February, it took over 5 minutes. (Linux, builds from around the time of testing.)
Status: NEW → ASSIGNED
Blocks: 157673
Probably adding caching code to the Korean converter could solve this issue.
Reassigning back to ftang. I won't fix this issue for a while; currently, Korean Linux is not a priority for my team. Anyone who wants to improve it is welcome to take it (hint hint jshin, katakai).

I think the way to improve it is to:
1. rewrite the Korean converters the way we did the Simplified Chinese one, or
2. add caching code for the umap (I have some half-finished stuff there) and use a 32-, 64-, or 128-entry array to cache the mapping result based on the lower 5, 6, or 7 bits of the input code (see the sketch after this comment).

Marking it as Future for now and taking out the 157673 blocker. People who care about Linux/Unix performance or Korean performance are welcome to grab this bug and work on it. This is not a priority for my group right now.
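A rough sketch, for illustration only, of the caching idea in option 2: a small direct-mapped cache in front of the slow table lookup, indexed by the low bits of the input code. LookupInTable() and all names here are placeholders, not actual Mozilla code:

#include <stdint.h>

#define CACHE_BITS 6                    /* 64 slots; 5 or 7 bits also possible */
#define CACHE_SIZE (1 << CACHE_BITS)

typedef struct {
  uint16_t src;   /* input code cached in this slot */
  uint16_t dest;  /* its mapping result */
  int      valid;
} MapCacheEntry;

static MapCacheEntry gCache[CACHE_SIZE];

/* Placeholder for the existing (slow) table lookup, e.g. uMapCode(). */
extern uint16_t LookupInTable(uint16_t in);

static uint16_t CachedMap(uint16_t in)
{
  MapCacheEntry *slot = &gCache[in & (CACHE_SIZE - 1)]; /* low bits pick slot */
  if (slot->valid && slot->src == in)
    return slot->dest;               /* hit: skip the table scan entirely */
  slot->src = in;                    /* miss: do the lookup, then remember it */
  slot->dest = LookupInTable(in);
  slot->valid = 1;
  return slot->dest;
}

How well this pays off depends on how often nearby input codes collide in the same slot, which is the trade-off discussed later in this bug.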
Assignee: smontagu → ftang
No longer blocks: 157673
Status: ASSIGNED → NEW
Target Milestone: --- → Future
> Gtk platform. Any page with Korean characters uses a lot of CPU. During page
> loading, several seconds are spent in nsFontGTKNormal::GetWidth(). A page
> with charset "euc-kr" spends on average 50 to 100 times more time in this
> function than a Western one.

Thank you for catching this problem. I hadn't noticed it, partly because I usually use ksc5601.1992-3 fonts instead of ksc5601.1987-0 fonts. Using ksc5601.1992-3 fonts reduces the delay significantly because only the decoding from EUC-KR to Unicode goes through uMapCode(), while the conversion in the other direction is algorithmic (for Hangul syllables, which make up more than 90% of Korean text); see the sketch after this comment. Still, I've seen Mozilla take up ~25% of CPU time (on my P III 800MHz box) while loading Korean pages in EUC-KR.

> 1. rewrite the Korean converters the way we did the Simplified Chinese one,

A couple of years ago, when I knew little about uscan.c/umap.c/ugen.c, I considered doing this because I had already done a similar job for the glibc iconv module. Now I'm a bit reluctant to take this path because I would have to take care of all the nuisances I've already handled in uscan.c for various edge cases (of Korean pages) all over again.

> or
> 2. add caching code for the umap (I have some half-finished stuff there) and
> use a 32-, 64-, or 128-entry array to cache the mapping result based on the
> lower 5, 6, or 7 bits of the input code.

This is certainly more universal than 1 in that it will speed up other converters as well. Could you attach your half-finished stuff here so that I can build upon it? In addition, we can also implement binary search as suggested by Jacky in comment #1. Of course, in that case, umaptable.c has to be modified to sort items in srcBegin order, and all the *.ut/*.uf tables have to be regenerated with the new umaptable.c. I've just taken a look at umaptable.c and found it's doable.
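As background on why the Unicode-to-font direction can be algorithmic for Hangul: modern syllables occupy one contiguous Unicode block (U+AC00..U+D7A3) and decompose arithmetically into jamo indices, so a jamo-organized font encoding such as ksc5601.1992-3 needs no lookup table for them. A sketch of the standard Unicode arithmetic (not Mozilla code):

#include <stdint.h>

#define SBASE  0xAC00  /* first precomposed Hangul syllable */
#define LCOUNT 19      /* leading consonants */
#define VCOUNT 21      /* vowels */
#define TCOUNT 28      /* trailing consonants, including "none" */
#define NCOUNT (VCOUNT * TCOUNT)  /* 588 syllables per leading consonant */

/* Decomposes a precomposed syllable into jamo indices; returns 0 if 'ch'
   is not in the Hangul Syllables block. */
static int DecomposeHangul(uint32_t ch, int *l, int *v, int *t)
{
  uint32_t idx;
  if (ch < SBASE || ch >= SBASE + LCOUNT * NCOUNT)
    return 0;
  idx = ch - SBASE;
  *l = idx / NCOUNT;             /* leading consonant index   */
  *v = (idx % NCOUNT) / TCOUNT;  /* vowel index               */
  *t = idx % TCOUNT;             /* trailing index (0 = none) */
  return 1;
}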
Status: NEW → ASSIGNED
I've just tried Jacky's idea (binary search) and got about a 4- to 5-fold speed-up (2.0 sec : 0.55 sec, 1.6 sec : 0.30 sec, 0.80 sec : 0.21 sec, 0.52 sec : 0.14 sec) loading local documents (made only of text cut and pasted from several real-life Korean documents) in EUC-KR. Graphic-rich pages at remote locations would not see this much reduction in loading time (in terms of the reduction factor), but I guess it's still worth putting in.

I used a ksc5601.1987-0 font to amplify the speed differential; this way, both encoding and decoding (to/from Unicode) heavily use uMapCode() in umap.c. With ksc5601.1992-3 fonts under Linux/Unix/X11 and TrueType fonts (with a Unicode cmap) under MS-Windows/MacOS, only decoding depends heavily on uMapCode(). Nonetheless, making uMapCode() faster helps not only Korean Linux/Unix but any encoder/decoder that uses it, on all platforms.

Simply sorting the entries in ut/uf files by srcBegin does not work, because format 2 and format 0 entries following a format 1 entry can be completely within the range covered by the preceding format 1 entry. Suppose a ut/uf file has the following entries and the key we're looking up is 0x2145:

0x2121, 0x217e, 0x0100, /* format 1 : entry 20 */
0x2135, 0x2140, 0x0300, /* format 0 : entry 21 */
0x2162, 0,      0xf320, /* format 2 : entry 22 */

If the middle point happens to be entry 21, the beginning of the next search bracket would be set to 22 (i.e., moving away from entry 20, where 0x2145 actually belongs). To work around this problem, I split MapCellArray into two parts: one composed of format 0 and format 2 entries, followed by the other made of format 1 entries (see the sketch after this comment).

There is also a Korean-specific optimization I haven't tried yet. Currently, the Korean converters use monolithic mapping tables for KS X 1001. The problem with this is that Hanja and special symbol characters take up the bulk of the table, slowing down the lookup of the Hangul syllables that make up most Korean web pages. By breaking up u20kscgl.ut/uf into three separate tables, lookup of Hangul syllables can get faster. To a much lesser degree, breaking up u20cp949.ut/uf may help a bit (although there's a little trade-off). These break-ups, along with binary search in uMapCode(), would, I think, have roughly the same effect as rewriting the Korean converters the way the SC converters were rewritten in 2000. On top of these, caching can be implemented for a further speed-up.
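For illustration, a sketch of the two-section lookup just described, reusing the hypothetical MapCellSketch/FindCellBinary from the earlier sketch; the section order and names are illustrative assumptions, not the actual patch:

#include <stdint.h>

typedef struct { uint16_t srcBegin, srcEnd, destBegin; } MapCellSketch;
int FindCellBinary(const MapCellSketch *cells, int n, uint16_t in);

/* Each section is sorted by srcBegin and its ranges do not nest, so binary
   search is safe inside a section; the wide format 1 ranges can no longer
   break the sort order of the single-entry (format 0/2) cells. */
static int FindCellSplit(const MapCellSketch *fmt02, int n02,
                         const MapCellSketch *fmt1,  int n1,
                         uint16_t in, int *inFormat1Section)
{
  int idx = FindCellBinary(fmt02, n02, in);  /* format 0/2 section first */
  if (idx >= 0) {
    *inFormat1Section = 0;
    return idx;
  }
  idx = FindCellBinary(fmt1, n1, in);        /* then the format 1 ranges */
  *inFormat1Section = (idx >= 0);
  return idx;
}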
I need to regenerate the ut/uf files for JA/TW/CN with a new umaptable.c to make my patch (for umap.c) work with the JA/TW/CN mapping tables. For some of them, the source mapping tables are obvious, but for others I can't figure out where they come from. Frank, could you tell me what source mapping tables you used to generate hkscs.uf|ut (HKSCS has a few 'variants') and jis0208ext.uf? Did you use CP950.TXT from the Unicode ftp archive for big5.uf|ut?
I'll take another look at this in a few days. In the meantime, I just want to note that this bug may not be as critical as before, because Xft is rapidly being deployed on Linux/*BSD and X11 core fonts will be used less as time goes on (hopefully commercial Unixen will follow soon...). With Xft-enabled Mozilla, 'encoding' (from Unicode to font encoding) is no longer necessary in most cases (custom-cmapped TTFs are exceptions), as on Windows and MacOS (with TTFs with a Unicode cmap). However, this still needs to be taken care of not only for Linux but also for Windows/MacOS, because I observed Mozilla momentarily take up about 90% of CPU time on Win2k as well while rendering some Korean pages. This is not to say that the sole reason for the spike in CPU usage is inefficient converters; there are other factors, like the very heavy use of Flash plugins on many Korean pages. To figure out what is most to blame, I'll try some text-only pages and gather performance statistics with profiling enabled.
> I observed Mozilla momentarily take up about 90% of CPU time on Win2k as
> well while rendering some Korean pages. This is not to say that the sole
> reason for the spike in CPU usage is inefficient converters; there are other
> factors, like the very heavy use of Flash plugins on many Korean pages.

With the Flash plugin removed, several Korean pages that took up 90% CPU time on my PIII 750MHz box while being loaded never take up more than 50% CPU time. This was observed with an optimized Linux build using X11 core fonts (Gtk). I also ran jprof on it (without the Flash plugin) and found that Mozilla indeed spent the most time in |uMapCode|. According to jprof, it spent about 5.5% (the average of three runs) of its total CPU time in |uMapCode|. With attachment 95359 applied, the percentage of CPU time spent in |uMapCode| and |uMapBsearch| dropped to 0.6% (again the average of three runs). Although I didn't do any systematic study, it seems obvious that binary search alone speeds things up a lot.

Using binary search is roughly equivalent to the first solution Frank suggested in comment #7. Now the question is whether we need caching on top of this. That depends on the overhead of caching, the frequency of character occurrence, and the sizes of the maps looked up, among other things. The performance with caching will also be a function of the time elapsed since Mozilla launched. Frank, what would you say? How about just going ahead with the binary search patch first? To do that, I need to regenerate the mapping tables. As I wrote in comment #12, can you tell me where to get the source files for hkscs, big5, and jis0208ext?
I tracked down all the sources of the CJK ut|uf files with LXR and Bugzilla and regenerated them all with a new version of umaptable.c.
Attachment #95359 - Attachment is obsolete: true
I used this script to download source tables and regenerate uf|ut files for CJK converters. This script also documents all the sources and related bugs for CJK converters.
Somehow the source mapping tables for hkscs.uf|ut produced by following the procedure given by Anthony in bug 182089 are different from those he attached to that bug, so this script directly downloads them (the source mapping files) and feeds them to umaptable.c. The new uf|ut files I generated are available at http://jshin.net/moztest/129387.umaps.tar.gz; I'll upload them here if necessary (it's about 1.9MB).
Attachment #109630 - Attachment is obsolete: true
I'm adding some people who contributed to the CJK converters in the past (bug 108136, bug 54135, bug 182089, bug 79275, bug 61422) because my patch requires regenerating the uf/ut files for the CJK converters to enable binary search. I faithfully followed the procedure described in the bugs listed to regenerate the source mapping tables and uf/ut files, so there should be no regression. I tried some test pages given in bug 182089 and bug 61422 as well as random CJK web sites and have not come across any regression so far. Nonetheless, it would be nice if the newly generated uf/ut files (along with my patch) were tested by those more familiar with cases that can unveil flaws in the new uf|ut files (especially the cases that led you to report the bugs listed above).

Basically, what my patch does is split the uf|ut files into two sections, one with format 0 and format 2 entries and the other with format 1 entries, with the entries in each section sorted. This way, binary search can be used by |uMapCode| in umap.c, speeding up the lookup by a significant factor for large umap tables (for CJK). I set the cut-off for linear search at 0x30 entries (this is not optimal), but I used that value to avoid regenerating the ut|uf files in ucvlatin and ucvmath (there are some medium-size uf|ut files with as many as 0x2e entries). Regenerating them as well would increase the QA requirement (although I'm pretty sure there would be no regression) without much gain. A sketch of the cut-off idea follows this comment.

As I wrote in comment #12, I ran jprof on Mozilla with X11 core fonts with my patch, and the percentage of CPU time spent in uMapCode() dropped significantly. This would help mostly Mozilla with X11 core fonts, which spends quite a lot of time in the conversion from Unicode to font encoding; somehow Mozilla spends a lot less time in uMapCode() in the other direction.

BTW, could someone change the summary line to 'make Unicode converter (or uMapCode look-up) more efficient'? This is not only for Korean but also for other encodings with large uf|ut tables. Thank you in advance for testing.
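For illustration, a sketch of the cut-off dispatch mentioned above; the constant and names are placeholders, and FindCellBinary is the hypothetical helper from the earlier sketches, not the real patch code:

#include <stdint.h>

typedef struct { uint16_t srcBegin, srcEnd, destBegin; } MapCellSketch;
int FindCellBinary(const MapCellSketch *cells, int n, uint16_t in);

#define LINEAR_SEARCH_CUTOFF 0x30  /* small tables keep the old linear scan */

static int FindCell(const MapCellSketch *cells, int n, uint16_t in)
{
  if (n < LINEAR_SEARCH_CUTOFF) {
    /* Small, possibly unsorted tables (e.g. ucvlatin/ucvmath) stay as-is. */
    int i;
    for (i = 0; i < n; i++)
      if (in >= cells[i].srcBegin && in <= cells[i].srcEnd)
        return i;
    return -1;
  }
  /* Large, regenerated (sorted) tables get the binary search. */
  return FindCellBinary(cells, n, in);
}

The cut-off keeps the patch's QA surface limited to the regenerated CJK tables while still giving the big tables the logarithmic lookup.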
Summary: Korean font : 90% CPU in nsFontGTKNormal::GetWidth() → speed up Unicode en/decoders for large character sets(CJK)
Now that the ftang@netscape.com address doesn't work any more, I'm taking this.
Assignee: ftang → jshin
Blocks: 116645
Status: ASSIGNED → NEW
Blocks: 226183
Attached patch updated patch (obsolete) — Splinter Review
got rid of #if 0'd blocks
Attached patch updated patch — Splinter Review
got rid of #if 0'd blocks
Here are the instructions for testing the patch (as I promised bz in bug 227106):
1. Apply the patch.
2. Download this attachment and move it to intl/uconv/tools.
3. cd to intl/uconv/tools.
4. Run the script. (If you're behind a firewall, you may need to edit the script to run 'wget' with the '--passive-ftp' option.)
5. Recompile intl/uconv.
6. Take a profile of a Mozilla run during which you visit some CJK sites in legacy encodings (http://tw.yahoo.com, http://cn.yahoo.com, http://www.yahoo.co.jp, http://www.yahoo.co.kr).
Attachment #109628 - Attachment is obsolete: true
Attachment #109633 - Attachment is obsolete: true
Attachment #114005 - Attachment is obsolete: true
Attachment #136679 - Attachment is obsolete: true
The patch for the umap files is too big to upload here; that's why I uploaded a shell script to generate them. Anyway, it's available at http://i18nl10n.com/mozilla/129387.umaps.patch.bz2 (it's > 800kB bzip2-compressed).
Status: NEW → ASSIGNED
QA Contact: amyy → i18n
Fixed and wontfixed by bug 1261841. Decode got faster for the legacy CJK encodings. Encode got faster for EUC-KR. Encode got slower for the ideographic parts of the legacy Chinese and Japanese encodings, because we no longer need that functionality for perf-sensitive work (it is no longer used for fonts, as reported here), so those were optimized for data size instead.
Assignee: jshin1987 → hsivonen
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by encoding_rs]
Target Milestone: Future → mozilla56