Closed Bug 129387 Opened 20 years ago Closed 5 years ago

speed up Unicode en/decoders for large character sets(CJK)

Categories

(Core :: Internationalization, defect)

x86
Linux
defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla56

People

(Reporter: nicolas, Assigned: hsivonen)

References


Details

(Keywords: intl, perf, Whiteboard: [fixed by encoding_rs])

Attachments

(2 files, 6 obsolete files)

Gtk platform.
Any page with Korean characters will use a lot of CPU.
During page loading, several seconds are spent in nsFontGTKNormal::GetWidth().
Any page with the charset "euc-kr" will spend, on average, 50 to 100 times more
time in this function than a western one.
It is highly related to the performance of the function uMapCode() in
intl/src/uconv/umap.c.

Within the function, a sequential search is used to find the correct mapping in
the *.ut/*.uf tables. So, the larger the table, the poorer the performance. I have
tried using binary search instead. I got a great improvement, but it slightly broke
those mapping tables that are not sorted by the srcBegin field. That can be fixed by
regenerating the tables sorted by srcBegin. However, I couldn't find the raw tables.
Can anyone point me to where to get them?
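The linear-vs-binary trade-off described above can be sketched as follows. This is a hypothetical, simplified model of range-style entries; the real uTable/MapCell layout in umap.c differs (it has several entry formats), so the names and struct here are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t srcBegin;  /* first source code covered by this range */
    uint16_t srcEnd;    /* last source code covered by this range  */
    uint16_t destBegin; /* target for srcBegin; the rest follow linearly */
} MapCell;

/* Linear scan: O(n) per lookup, the behavior this bug complains about
 * for large CJK tables. */
static int mapLinear(const MapCell *cells, size_t n, uint16_t in, uint16_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (in >= cells[i].srcBegin && in <= cells[i].srcEnd) {
            *out = (uint16_t)(cells[i].destBegin + (in - cells[i].srcBegin));
            return 1;
        }
    }
    return 0;
}

/* Binary search: O(log n), assuming entries are sorted by srcBegin and
 * the ranges do not overlap. */
static int mapBsearch(const MapCell *cells, size_t n, uint16_t in, uint16_t *out)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (in < cells[mid].srcBegin) {
            hi = mid;
        } else if (in > cells[mid].srcEnd) {
            lo = mid + 1;
        } else {
            *out = (uint16_t)(cells[mid].destBegin + (in - cells[mid].srcBegin));
            return 1;
        }
    }
    return 0;
}
```

For a KS X 1001-sized table with thousands of entries, the binary variant does a dozen or so comparisons per lookup instead of hundreds on average, which is where the reported speed-up comes from.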
Keywords: intl
QA Contact: ruixu → ylong
assign to smontagu for performance
Assignee: yokoyama → smontagu
Simon: I'd recommend you chat with Frank about this.
Confirmed that the referred URL page does use a lot of CPU on RH7.2.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: perf
The performance issue resembles bug 113549 (thai (?) fonts)
When I loaded http://www.nstda.or.th/newsroom/pr/pr010998.html at the beginning
of December, it took around two minutes to load.
When I loaded it again in the middle of February, it took over 5 minutes.
(Linux, builds from around the time of testing)
Status: NEW → ASSIGNED
Blocks: 157673
Probably adding cache code to the Korean converter could solve this issue.
Reassigning back to ftang.
I won't fix this issue for a while; currently, Korean Linux is not a
priority for my team. Anyone who wants to improve it is welcome to take it (hint
hint jshin, katakai). I think the ways to improve it are to
1. rewrite the Korean converters the same way we did the Simplified Chinese one, or
2. add cache code for the umap (I have some half-finished stuff there) and use a
32-, 64-, or 128-entry array to cache the mapping result based on the lower 5, 6,
or 7 bits of the input code.
Marking it as Future for now; taking out blocking 157673.
People who care about Linux/Unix performance or Korean performance are welcome to
grab this bug and work on it. This is not a priority for my group right now.
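Option 2 (a small array cache keyed on the low bits of the input) might look like the sketch below. All names are illustrative, not code from the tree; also note that the real uMapCode() takes a table argument, so a real cache would have to be per-table (or flushed when the table changes), which this sketch glosses over.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_BITS 6                 /* 64-entry cache, indexed by low 6 bits */
#define CACHE_SIZE (1 << CACHE_BITS)

typedef struct {
    uint16_t in;    /* cached input code     */
    uint16_t out;   /* cached mapping result */
    uint8_t  valid; /* slot has been filled  */
} CacheSlot;

static CacheSlot gCache[CACHE_SIZE];

/* Stand-in for the expensive table walk in uMapCode(); a fake mapping
 * is used here so the sketch is self-contained. */
static uint16_t slowMap(uint16_t in)
{
    return (uint16_t)(in + 0x1000);
}

static uint16_t cachedMap(uint16_t in)
{
    CacheSlot *slot = &gCache[in & (CACHE_SIZE - 1)];
    if (slot->valid && slot->in == in)
        return slot->out;            /* hit: skip the table walk entirely */
    /* miss or collision: recompute and overwrite the slot */
    slot->in = in;
    slot->out = slowMap(in);
    slot->valid = 1;
    return slot->out;
}
```

The direct-mapped design means two codes sharing low bits evict each other, but since text tends to reuse a small working set of characters, even a tiny cache can absorb most lookups.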
Assignee: smontagu → ftang
No longer blocks: 157673
Status: ASSIGNED → NEW
Target Milestone: --- → Future
> Gtk platform.
> Any page with korean characters will use a lot of CPU.
> During the page loading, several seconds are spent in 
> nsFontGTKNormal::GetWidth().
> Any page with a charset "euc-kr" will spend on average 50 to 100 times more
> time in this function than if it is a western one.

Thank you for catching this problem. I hadn't noticed it, partly because
I usually use ksc5601.1992-3 fonts instead of ksc5601.1987-0 fonts.
Using ksc5601.1992-3 fonts reduces the delay significantly,
because only the decoding from
euc-kr to Unicode goes through uMapCode(), while the conversion in
the other direction is algorithmic (for Hangul syllables, which make up
more than 90% of Korean text). Still, I've seen Mozilla take up
~25% of CPU time (on my P III 800 MHz box) while loading Korean pages
in EUC-KR.

> 1. rewrite the korean converters use similar way we did in simp chinese one,
A couple of years ago, when I knew little about uscan.c/umap.c/ugen.c,
I considered doing this because I had already done a similar job
for the glibc iconv module. Now I'm a bit reluctant to take this path,
because I would have to take care of all the nuisances I've already
taken care of in uscan.c for various edge cases (of Korean
pages) all over again.

> or
 
> 2. add a cache code for the umap (I have some half stuff there) and use either
> 32, 64, or 128 array to cache the mapping result based on the lower 5,6,7 bits
> of the input field.

 This is certainly more universal than 1. in that it'll speed up other
converters as well. Could you attach your half-finished stuff here so that
I can build upon it?

In addition to this, we can also implement
binary search as suggested by Jacky in comment #1. Of course,
in that case, umaptable.c has to be modified to sort
items in srcBegin order, and all the *.ut/*.uf
tables have to be regenerated with the new umaptable.c. I've just
taken a look at umaptable.c and found that it's doable.

Status: NEW → ASSIGNED
I've just tried Jacky's idea (binary search) and got about a 4-to-5-fold
speed-up (2.0 s → 0.55 s, 1.6 s → 0.30 s, 0.80 s → 0.21 s,
0.52 s → 0.14 s) loading local documents (made only
of text cut-and-pasted from several real-life Korean documents) in EUC-KR.
Graphic-rich pages at remote locations would not see this much
reduction in loading time (in terms of the reduction factor),
but I guess it's still worth putting in.

I used a ksc5601.1987-0 font to amplify the speed differential.
This way, both encoding and decoding (to/from Unicode) heavily use
uMapCode() in umap.c. With ksc5601.1992-3 fonts under Linux/Unix/X11
and truetype fonts (with a Unicode cmap) under MS-Windows/MacOS, only decoding
depends heavily on uMapCode(). Nonetheless, making uMapCode()
faster benefits not only Korean Linux/Unix but also
any encoder/decoder that uses it, on all platforms.


Simply sorting entries in ut/uf files by srcBegin does not
work, because format 2 and format 0 entries following a format 1 entry
can be completely within the range covered by the preceding format 1
entry. Suppose a ut/uf file contains the following entries,
and the key we're looking up is 0x2145:
0x2121, 0x217e, 0x0100,   /* format 1 : entry 20*/
0x2135, 0x2140, 0x0300,   /* format 0 : entry 21 */
0x2162, 0,      0xf320,   /* format 2 : entry 22 */

If the middle point happens to be entry 21, the beginning of
the next search bracket would be set to entry 22 (i.e.,
moving away from entry 20, where 0x2145 actually belongs).

To work around this problem, I split MapCellArray into two parts:
one composed of format 0 and format 2 entries, followed by
another made of format 1 entries.
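A minimal sketch of that split, under simplifying assumptions: here both sections hold sorted, non-overlapping [srcBegin, srcEnd] ranges, a format 2 single entry is stored as a one-code range, and the format 0/2 section is consulted before the format 1 section. The real umap.c formats carry more structure, so names and layout are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t srcBegin;  /* first source code in the range           */
    uint16_t srcEnd;    /* last source code (== srcBegin if single) */
    uint16_t destBegin; /* target for srcBegin; rest follow linearly */
} MapCell;

/* Binary search over one sorted, non-overlapping section. */
static int findInSection(const MapCell *c, size_t n, uint16_t in, uint16_t *out)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (in < c[mid].srcBegin) {
            hi = mid;
        } else if (in > c[mid].srcEnd) {
            lo = mid + 1;
        } else {
            *out = (uint16_t)(c[mid].destBegin + (in - c[mid].srcBegin));
            return 1;
        }
    }
    return 0;
}

/* Two-section lookup: try the format 0/2 section first, then fall back
 * to the format 1 ranges, so entries nested inside a format 1 range no
 * longer derail the binary search. */
static int mapSplit(const MapCell *f02, size_t n02,
                    const MapCell *f1, size_t n1,
                    uint16_t in, uint16_t *out)
{
    return findInSection(f02, n02, in, out) || findInSection(f1, n1, in, out);
}
```

With entries 20-22 from the example split this way, looking up 0x2145 misses the format 0/2 section cleanly and lands on entry 20 in the format 1 section, instead of being bracketed away from it.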
  
There is also a Korean-specific optimization I haven't tried yet.
Currently, the Korean converters use monolithic mapping tables
for KS X 1001. The problem with this is that Hanja and special
symbol characters take up the bulk of the space in the table,
slowing down the look-up of Hangul syllables, which make up
most Korean web pages. By breaking up u20kscgl.ut/uf into
three separate tables, look-up of Hangul syllables can get
faster. To a much lesser degree, breaking up u20cp949.ut/uf may
help a bit (although there's a little trade-off).
These break-ups, along with binary search in uMapCode(), would,
I think, have roughly the same effect as rewriting the Korean
converters the way the SC converters were rewritten in 2000.

On top of these, caching can be implemented for a further
speed-up.
I need to regenerate the ut/uf files for JA/TW/CN with a new umaptable.c
to make my patch (for umap.c)
work with the JA/TW/CN mapping tables. For some of them, the source mapping
tables are obvious, but for others I can't figure out where the source
mapping tables come from.

Frank,
Could you tell me what source mapping tables you used to generate
hkscs.uf|ut (HKSCS has a few 'variants') and jis0208ext.uf? Did you use
CP950.TXT from the Unicode ftp archive for big5.uf|ut?
I'll take another look at this in a few days. In the meantime,
I just want to note that this bug may not be as critical as before,
because Xft is rapidly being deployed for Linux/*BSD, and X11 core
fonts will be used less as time goes on (hopefully
commercial Unixen will follow soon...). On Xft-enabled
Mozilla, 'encoding' (from Unicode to font encoding) is no longer necessary
in most cases (custom-cmapped TTFs are exceptions),
as on Windows and MacOS (with TTFs with a Unicode cmap).
However, this still needs to be taken care of, not only for
Linux but also for Windows/MacOS, because I observed Mozilla
momentarily take up about 90% of CPU time on Win2k as well while rendering
some Korean pages. This is not to say that the sole reason
for the spike in CPU usage is inefficient converters;
there are other factors, like the excessive use of flash
plugins on many Korean pages. To figure out what is
most to blame, I'll try some 'text only' pages and gather some performance
statistics with profiling enabled.
> I observed Mozilla
> take up about 90% of CPU time momentarily on Win2k as well while rendering
> some Korean pages. This is not to say that the sole reason
> for spiking up of CPU usage is inefficient converters.
> There are other factors like very excessive use of flash 
> plugins in many Korean pages.  

  With the flash plugin removed, several Korean pages that took up 90% CPU
time on my P III 750 MHz box while being loaded never take up more than 50% CPU
time.
This was observed with an optimized Linux build with X11 core fonts (Gtk).

I also ran jprof on it (without the flash plugin)
and found that Mozilla indeed spent the longest time
in |uMapCode|. According to jprof, it spent about 5.5% (the average of
three runs) of the total CPU time it used in |uMapCode|.
With attachment 95359 applied, the percentage of CPU
time spent in |uMapCode| and |uMapBsearch| dropped to 0.6% (again
the average of three runs). Although I didn't do any systematic study,
it seems obvious that binary search alone speeds things up a lot.
Using binary search is roughly equivalent to the first
solution Frank suggested in comment #7.

Now, the question is whether we need caching on top of this.
It depends on the overhead of caching, the frequency of character
occurrence, and the sizes of the maps looked up, among other things. The performance
with caching will also be a function of the time elapsed since Mozilla launched.

Frank, what would you say?
How about just going ahead with the binary search patch first?
To do that, I need to regenerate the mapping tables. As I wrote in
comment #12, can you tell me where to get the source
files for hkscs, big5, and jis0208ext?

I tracked down all the sources of the CJK uf|ut files with LXR and Bugzilla and
regenerated them all with a new version of umaptable.c.
Attachment #95359 - Attachment is obsolete: true
I used this script to download the source tables and regenerate the
uf|ut files for the CJK converters. The script also documents
all the sources and related bugs for the CJK converters.
Somehow, the source mapping tables for hkscs.uf|ut produced by
following the procedure given by Anthony in bug 182089 are different
from those he attached to that bug, so this script downloads
them (the source mapping files) directly and feeds them to umaptable.c.

The new uf|ut files I generated are available at
http://jshin.net/moztest/129387.umaps.tar.gz
I'll upload them here if necessary (it's about 1.9 MB).
Attachment #109630 - Attachment is obsolete: true
I'm adding some people who contributed to the CJK converters in the past
(bug 108136, bug 54135, bug 182089, bug 79275, bug 61422),
because my patch requires regenerating the uf/ut files for the CJK converters
to enable binary search.
I faithfully followed the procedures described in the
bugs listed to regenerate the source mapping
tables and uf/ut files, so there should be no regression.
I tried some test pages given in bugs 182089 and 61422,
as well as random CJK web sites, and have not come across any regression so
far. Nonetheless, it'd be nice if the newly generated uf/ut files
(along with my patch) were tested by those more familiar with
cases that can unveil flaws in new uf|ut files (especially
the cases that led you to report the bugs listed above).

Basically, what my patch does is split uf|ut files into
two sections, one with format 0 and format 2 entries and the other
with format 1 entries, with the entries in each section sorted.
This way, binary search can be used by |uMapCode| in umap.c,
speeding up the look-up by a significant factor for large umap tables
(for CJK). I set the cut-off for linear search at 0x30 entries
(this is not optimal), but I used that value to avoid regenerating
the ut|uf files in ucvlatin and ucvmath (there are some medium-size
uf|ut files with as many as 0x2e entries).
Regenerating them as well would increase the QA requirement (although
I'm pretty sure there'd be no regression) without much
gain.
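The cut-off described above could be sketched as a hybrid lookup: below a size threshold the old linear scan is kept (so small, possibly unsorted tables need not be regenerated), and at or above it the table is assumed sorted and binary-searched. The threshold value mirrors the 0x30 mentioned; the entry layout and names are simplified stand-ins, not the umap.c originals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINEAR_SEARCH_CUTOFF 0x30  /* tables smaller than this stay linear */

typedef struct {
    uint16_t srcBegin, srcEnd, destBegin;
} MapCell;

static int mapLookup(const MapCell *c, size_t n, uint16_t in, uint16_t *out)
{
    if (n < LINEAR_SEARCH_CUTOFF) {
        /* Small table: linear scan works even if entries are unsorted,
         * so these tables need not be regenerated. */
        for (size_t i = 0; i < n; i++) {
            if (in >= c[i].srcBegin && in <= c[i].srcEnd) {
                *out = (uint16_t)(c[i].destBegin + (in - c[i].srcBegin));
                return 1;
            }
        }
        return 0;
    }
    /* Large table: assumed sorted by srcBegin with non-overlapping
     * ranges, so binary search applies. */
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (in < c[mid].srcBegin) {
            hi = mid;
        } else if (in > c[mid].srcEnd) {
            lo = mid + 1;
        } else {
            *out = (uint16_t)(c[mid].destBegin + (in - c[mid].srcBegin));
            return 1;
        }
    }
    return 0;
}
```

For a ~0x2e-entry table the linear scan is already cheap, which is why raising the cut-off to cover those tables costs little while sparing their regeneration.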
 
As I wrote in comment #12, I ran jprof on Mozilla with X11 core fonts with
my patch, and the percentage of CPU time spent in uMapCode()
dropped significantly. This would mostly help
Mozilla with X11 core fonts, which spends quite a lot of time
in the conversion from Unicode to font encoding. Somehow,
Mozilla spends a lot less time in uMapCode() the other way around.

BTW, could someone change the summary line to 'make Unicode
converter (or uMapCode look-up) more efficient'? This is not only
for Korean but also for others with large uf|ut tables.

Thank you in advance for testing.
Summary: Korean font : 90% CPU in nsFontGTKNormal::GetWidth() → speed up Unicode en/decoders for large character sets(CJK)
Now that the ftang@netscape.com address doesn't work any more, I'm taking this.
Assignee: ftang → jshin
Blocks: 116645
Status: ASSIGNED → NEW
Blocks: 226183
Attached patch updated patch (obsolete) — Splinter Review
got rid of #if 0'd blocks
Attached patch updated patch — Splinter Review
got rid of #if 0'd blocks
Here are the instructions for testing the patch (as I promised bz
in bug 227106):

1. apply the patch
2. download this attachment and move it to intl/uconv/tools
3. cd to intl/uconv/tools
4. run the script
   (if you're behind a firewall, you may need to edit the script
    to run 'wget' with the '--passive-ftp' option)
5. recompile intl/uconv
6. take a profile of a Mozilla run during which you visit
   some CJK sites in legacy encodings (http://tw.yahoo.com,
   http://cn.yahoo.com, http://www.yahoo.co.jp, http://www.yahoo.co.kr)
Attachment #109628 - Attachment is obsolete: true
Attachment #109633 - Attachment is obsolete: true
Attachment #114005 - Attachment is obsolete: true
Attachment #136679 - Attachment is obsolete: true
The patch for the umap files is too big to upload here; that's why I uploaded a
shell script to generate them. Anyway, it's available at
http://i18nl10n.com/mozilla/129387.umaps.patch.bz2 (it's > 800 kB in bzip2).
Status: NEW → ASSIGNED
QA Contact: amyy → i18n
Depends on: encoding_rs
Fixed and wontfixed by bug 1261841. Decode got faster for legacy CJK encodings. Encode got faster for EUC-KR. Encode got slower for the ideographic parts of legacy Chinese and Japanese encodings, because we no longer need that functionality for perf-sensitive stuff (it is no longer used for fonts, as reported here), so those were optimized for data size instead.
Assignee: jshin1987 → hsivonen
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by encoding_rs]
Target Milestone: Future → mozilla56