Closed Bug 116030 Opened 23 years ago Closed 23 years ago

provide a way to identify langGroup/rang for a unicode char

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla0.9.9

People

(Reporter: shanjian, Assigned: shanjian)

References

Details

Attachments

(1 file, 8 obsolete files)

In mozilla, if the glyph of a unicode character could not be found in local font and 
generic font, we will try global fonts (and preference fonts after bug 115392) rather
randomly. Loading each font and getting its cmap is very expensive, it really hurts 
our performance. We can narrow the font search range by identifying the langGroup 
(or range).
What milestone should this bug target?

/be
Target at 0.9.8. Patch on my local tree is almost done. 
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.8
so you want a 
Unicode -> list of langGroup 
operation ?

Will this approach help performance of normal page after we fix 115392 ?
the FindPrefFont should already shorter the search list for most of the page we
care. If we want some tune base on the PRUnichar, we should use it to change the
order of using gCharsetInfo in the FindPrefFont
for example, if the PRUnichar is between 0x3400-0x9FFF we should try "ja",
"zh-CN", "zh-TW", "ko" first before other pref. 
if the PRUnichar is between 0xAC00-0xD7FF we should try "ko" first. 
shanjian, I am not sure which page (give me the URL) will be benefit from this
tuning. (I think we may find some, but probably won't be the common case in the
top 100 eng and top 100 intl page). I think 115393 is more important. Because I
saw it on every (including English) page. 
Frank, 
It does not make sense to try cyrillic font for a chinese characters. Bug 115392
might reduce font search list to 50%, this bug should reduce it further. The 
range information can also be used to help reduce global font search list. We 
have language coverage information with each font. We should utilize this information
to narrow our search. 

I will address bug 115393 in 0.9.8. 
Target Milestone: mozilla0.9.8 → mozilla0.9.9
Attached patch patch (obsolete) — Splinter Review
rbs, could you review my patch? thx!
Comment on attachment 67137 [details] [diff] [review]
patch

+  PRUint8 unicodeRange = FindCharUnicodeRange(aChar);
+  if (kRangFrequentNum > unicodeRange) { 

|kRangFrequentNum| is a rather uninformative variable name to be
tested the way you are doing

+    // spcific language is identified
typo: specific

+    AppendGenericFontFromPref(font.name,
LangGroupFromUnicodeRange(unicodeRange), 
+				 NS_ConvertUCS2toUTF8(mGeneric).get());
+  } else if (kRangSetLatin == unicodeRange) { 
+    // latin language set, try western and central european
+    // It will be a surprise if it omes here when mLangGroup is western or
central euro

typo: s/if it omes/to come/
(leaving other typos to brendan :-)

Index: nsUnicodeRange.cpp
===================================================================
RCS file: nsUnicodeRange.cpp

+ * Contributor(s):
+ *   Roger B. Sidje <rbs@maths.uq.edu.au>
+ *   Pierre Phaneuf <pp@ludusdesign.com>

remnants of copy-paste in both .h/.cpp to be removed

+
+static PRUint8 UnicodeSubrangeTable[NUM_OF_SUBTABLES][SUBTABLE_SIZE] =

customary to prefix global variables with 'g'

+{ 
+  { // table for X---
+    kRangTableBase+1,	//u0xxx
+    kRangTableBase+2,	//u1xxx
+    kRangTableBase+3,	//u2xxx
+    kRangTableBase+4,	//uaxxx
+    kRangTableBase+5,	//udxxx  
+    kRangTableBase+6	//ufxxx
[...]
uncommented genuflexions here..., yet important to document what is going on:
+  if (range < kRangTableBase)
+    return range;
+  table = range - kRangTableBase;

might be more readable to just s/kRang/kRange/g everywhere

r=rbs
Attachment #67137 - Flags: review+
Attached patch update patch, (obsolete) — Splinter Review
Attachment #67137 - Attachment is obsolete: true
Comment on attachment 67436 [details] [diff] [review]
update patch,

carry over review
Attachment #67436 - Flags: review+
Attachment #67137 - Flags: review+
brendan, could you sr?
Comment on attachment 67436 [details] [diff] [review]
update patch,

>+  PRUint8 unicodeRange = FindCharUnicodeRange(aChar);
>+  if (kRangeSpecificItemNum > unicodeRange) {

Style nit: this makes visualizing a number line hard -- why not put the
constant on the right and use <?  Your call, I'm just wondering.  The dogma
that constants must go on the left of all operators doesn't work for me (it
comes from the fear that someone may typo variable == kFoo as variable = kFoo
in a condition, but we have compiler warnings for such things, and they're so
unlikely in practice, from what I can tell, that it seems a shame to obfuscate
*all* relational and equality ops).

>+    // a single language is identified
>+    AppendGenericFontFromPref(font.name, LangGroupFromUnicodeRange(unicodeRange), 
>+                                NS_ConvertUCS2toUTF8(mGeneric).get());

Style nit: here and below, maybe make the second line of actual parameters
indent so the params line up with the first param (start in the same column)?

>+  } else if (kRangeSetLatin == unicodeRange) { 
>+    // latin language set, try western and central european
>+    // It will be a surprise to come here when mLangGroup is western or central euro
>+    AppendGenericFontFromPref(font.name, "x-western",
>+                                NS_ConvertUCS2toUTF8(mGeneric).get());
>+    AppendGenericFontFromPref(font.name, "x-central-euro",
>+                                NS_ConvertUCS2toUTF8(mGeneric).get());
>+  } else if (kRangeSetCJK == unicodeRange) { 
>+    // CJK, we have to be careful about the order, use locale info as hint
>+    
>+    // try user locale first, if it is CJK
>+    if ((gUsersLocale != mLangGroup) && IsCJKLangGroupAtom(gUsersLocale)) {
>+      nsAutoString usersLocaleLangGroup;
>+      gUsersLocale->ToString(usersLocaleLangGroup);
>+      AppendGenericFontFromPref(font.name, NS_ConvertUCS2toUTF8(usersLocaleLangGroup).get(), 
>                                 NS_ConvertUCS2toUTF8(mGeneric).get());

Why the need for an nsAutoString here?	Why not just NS_ConvertUCS2toUTF8 on
the shared (not copied) result of gSystemLocale->GetUnicode?

>     }
>-    NS_IF_RELEASE(langGroup);
>-  }
>+    
>+    // then system locale (os language)
>+    if ((gSystemLocale != mLangGroup) && (gSystemLocale != gUsersLocale) && IsCJKLangGroupAtom(gSystemLocale)) {
>+      nsAutoString systemLocaleLangGroup;
>+      gSystemLocale->ToString(systemLocaleLangGroup);
>+      AppendGenericFontFromPref(font.name, NS_ConvertUCS2toUTF8(systemLocaleLangGroup).get(), 
>+                                NS_ConvertUCS2toUTF8(mGeneric).get());

Same question here.

>+PRUint8 FindCharUnicodeRange(PRUnichar ch)
>+{
>+  PRUint8 range;
>+  PRUint32 table = 0;
>+
>+  //search the first table
>+  range = gUnicodeSubrangeTable[table][ch >> 12];

Why not use 0 here instead of table?

>+  
>+  if (range < kRangeTableBase)
>+    // we got a specific range 
>+    return range;

(Nice < number-line order of operands here! :-)

>+
>+  // otherwise, we have one more table to look at
>+  table = range - kRangeTableBase;
>+  range = gUnicodeSubrangeTable[table][(ch & 0x0f00) >> 8];

Ok, I see this in the table of 7 sub-tables:

+  { //table for 0X--
+    kRangeSetLatin,	      //u00xx
+    kRangeSetLatin,	      //u01xx
+    kRangeSetLatin,	      //u02xx
+    kRangeTableBase+8,       //u03xx
+    kRangeCyrillic,	      //u04xx
+    kRangeTableBase+9,       //u05xx

Why won't those +8 and +9 terms cause an out-of-bounds array index?

>+const PRUint8   kRangeUnassgined           = 35;

Unassigned misspelled here.

/be
Attachment #67436 - Flags: superreview+
Attachment #67436 - Flags: needs-work+
Attachment #67436 - Flags: superreview+
Attachment #67436 - Attachment is obsolete: true
Attachment #67490 - Flags: review+
Attachment #67490 - Attachment is obsolete: true
Attachment #67494 - Flags: review+
Shanjian, those last two patches are identical to the one I reviewed.

/be
Attachment #67494 - Attachment is obsolete: true
Sorry brendan. I always copy patches to a spefic directory before I post it. 
A typo in my copy command cause it to be put to a different directory. Now it 
should be OK. 
Attachment #67507 - Flags: review+
Comment on attachment 67507 [details] [diff] [review]
repost patch, the last two are outdated ones.

I think using PRUint8 for the type of range, and for the return type of
FindCharUnicodeRange, may result in extra instructions being generated to widen
(and narrow).  I think unsigned int (or PRUintn for naming consonance) is
better -- same goes for the PRUint32 table; variable, which you could eliminate
as it is set once and used once.

sr=brendan@mozilla.org, make the above change if it helps shrink compiled code
(compare object file sizes before and after).

/be
Attachment #67507 - Flags: superreview+
Attached patch comment clarification (obsolete) — Splinter Review
As per Timeless in #mozillazine, the following comment:
+ // latin language set, try western and central european
+ // It will be a surprise if it omes here when mLangGroup is western or
central euro

has been changed to read more clearly, and now reads as follows:

+ // Latin language is now set, so we'll try western and central european
+ // If mLangGroup is western or central european, this most probably will not
be
+ // used, but is here as a fallback scenario.
Attachment #67507 - Attachment is obsolete: true
Attached patch more comment clarification (obsolete) — Splinter Review
Attachment #67684 - Attachment is obsolete: true
Previous comment modification has some misunderstanding (caused by my poor 
English). Rewrote it.

to brendan,
Changing from PRUint8 to PRUint32 will increase the size from 110592 bytes to 
111104 bytes. The reason of choosing PRUint8 is to shrink the size of 
gUnicodeSubrangeTable. Comparison of optimized dll file shows that does make 
some difference. 
Attachment #67767 - Flags: superreview+
Attachment #67767 - Flags: review+
shanjian, I was *not* proposing that you change the type of any table from
PRUint8 to unsigned int.  I was *only* proposing that you change the type of
local variables loaded from such table elements.  Using narrow types for local
variables and arguments can be suboptimal and have undesired "wraparound" effects.

/be
Attachment #67767 - Attachment is obsolete: true
brendan, is that what you suggested? There is no difference made with windows
compiler. This might not be true for other platform/compiler. Since there is 
no negative impact, I made the change. 
While the code is identical to the previous patch, as per request from
Timeless, the comments are cleaned up to try to be as clear and correct as
possible.

Note to Shanjian: I wanted to make clear that I meant no insult towards your
english. I understand it's not your native language, and it's near impossible
to learn, but you seem to have an excellent grip on it. I was just
asked/volunteered to help clear it up a bit more. :)
Attachment #67825 - Attachment is obsolete: true
Comment on attachment 67855 [details] [diff] [review]
comment-edited patch, code identical to last patch

carry over r/sr
Attachment #67855 - Flags: superreview+
Attachment #67855 - Flags: review+
fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Changed QA contact to shanjian@netscape.com. Shanjian, please verify this.
QA Contact: teruko → shanjian
An errata was checked in for this bug in bug 184848.
Depends on: 184848
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: