Closed Bug 177877 (non_unicode_wide) Opened 22 years ago Closed 22 years ago

extending non-Unicode (custome-encoding) TTF support for multibyte converters

Categories

(Core :: Internationalization, defect)

x86
Windows 2000
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jshin1987, Assigned: jshin1987)

Details

Attachments

(2 files, 8 obsolete files)

While trying to enable the rendering of pre-1933 Korean text represented
in U+1100 Hangul Jamos with fonts with custom-encodings (Oxxx and Nxxx
fonts mentioned in bug 176315) under MS Windows, I found 
the following problems in the current implementation in nsFontMetricWin.cpp
for custom-encoded non-Unicode TTFs. 

1. |nsFontNonUnicode| in nsFontMetricWin.cpp assumes that 
all non-Unicode TTFs (with custome/hack encoding) have 
single byte converters. Therefore, it uses 'A'nsi functions
like ExtTextOutA and GetTextExtentPoint32A. It also assumes
that the result of charset. conversion has the same length
as the source (which is the case for single byte converters
for MathML and webdings). Unfortunately, these assumption
don't hold for fonts with more than 256 glyphs and double-byte
converters.

  There are two problems to solve here. One is how to specify
that a font needs a double byte converter in fontEncoding.properties
file. The other is modify or subclass nsFontNonUnicode.  

One way to solve the former is add '.double' postfix to the 
encoding name (value in fontEncoding.properties file). The postfix
 has to be stripped away right after setting the fonttype (say,
eFontType_NonUnicodeDouble). 
 

     

2. Win32 API functions for getting font names return family names in
the lang. corresponding to the current locale. For TeX, Mathematica,
Webding fonts, this is not an issue because they usually have only 
English names and Win32 APIs fall back to English names if they're
the only names specified in fonts. (This could be an issue even for them 
in the future if somebody add locale/language-specific names to them).
However, two sets of Korean fonts I'd like to use mentioned in
bug 176315 have both English and Korean names and Korean names
are returned when Mozilla is running under Korean locale. 
Therefore, specifying English names in fontEncoding.properties
file doesn't work. 

I thought specifying both Korean and English names might
work. However,it turned out that it doesn't because a shortcut is
taken to convert the locale character encoding 
to UTF-16 assuming that font names are representable
in US-ASCII. Using a full-pledged  converter from
the locale charset to UTF-16(I guess we can even use 
MultiByteToWideChar Win32 API here) might be an easy
solution (albeit a bit expensive). 

A better approach I like to implement is to figure out a way
to get English-only (fully representable in US-ASCII) family
name regardless of the current locale and use that throughout.
(when names are exposed in UI, of course, the locale/language
specific names had better be used. Even better would be doing
it based on lang/locale pack currently in use.) 
This will also eliminate the need for an ugly hack to identify
MS P Gothic ( Shift_JIS string for 'MS P Gothic' is hard-coded).
A quick look at Win32 APIs didn't give me anything
that looks promising. (Xft/fontconfig offer API calls
for getting family/face names in various lang/locale and
English. I wish there were a similar API call under Windows) 
Any help from Windows guru would be nice.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Attached patch a patch (obsolete) — Splinter Review
I added nsFontNonUnicodeWide class. It's mostly identical to
nsFontNonUnicode class except that it  uses 'W'ide functions
instead of 'A'NSI functions and converters for this class 
are assumed to return the result in UCS2. Before calling 'W'ide functions,
this result is converted to UCS2 in native byte order. 

To work around the problem of family name in ANSI code page of
the current locale(I couldn't figure out how to get family name
in US-ASCII- English- regardless of the current locale. Win32 API doesn't
offer this functionality), I had to add a kludge to check the current codepage
and call MultiByteToWideChar. fontEncoding.properties have two entries
for each font-encoding pair (one with family name in English and
the other with family name in native language in UTF-8). 

rbs, could you take a look?
Alias: non_unicode_wide
rbs, if you can take a look, it'd be nice. well, bugzilla will
be down soon...
Attachment #104997 - Attachment is obsolete: true
Now nsFontWinNonUnicodeWide directly inherits nsFontWin instead
of indirectly inheriting it via  nsFontWinNonUnicode. GetBoundingMetrics
is implemented for nsFontWinNonUnicodeWide in case it's used by MathML
in some cases.
Attachment #105658 - Attachment is obsolete: true
Shanjian and rbs,
Could you review my latest patch? 
Comment on attachment 106538 [details] [diff] [review]
patch v2 (nsFontWinNonUnicode directly inherits nsFontWin)

rbs, thank you in advance...
Attachment #106538 - Flags: review?(rbs)
Comment on attachment 106538 [details] [diff] [review]
patch v2 (nsFontWinNonUnicode directly inherits nsFontWin)

Saving rbs for sr and asking
Shanjian for r :-)

Not be a Trojian horse, I have
to reveal that	eventually
this patch will increase the 'disk'
footprint of libuconv.so/uconv.dll
by about 45k - 60k. However,
according to what alecf wrote
when combining all ucv* into
a single so/dll,  this should not
be a problem for those who would never view old Korean pages.
Attachment #106538 - Flags: review?(rbs) → review?(shanjian)
Sorry for spamming. My previous comment about the size of uconv was not for this
bug.
It's for bug 176315. 
I took a look just now and I am going to have a close look tomorrow. Jshin, can
you give me some idea why we have such a issue? Is there truetype font available
for those characters? Or those characters are just not encoded in unicode? It is
a little bit of odd to add non-unicode support now. I think non-unicode font is
dying. Are those fonts you are trying to support available on other platforms? 
Shanjia, thank you for lookinag at it.
Pls, note that this bug is one of a series of bugs I filed for Hangul Jamo
rendering.
Other bugs in the series are bug 176290(which is also for
MathML and 'symbol'  fonts) and bug 176315.
(you may also take a look at http://bugzilla.gnome.org/show_bug.cgi?id=95708) 

Here's some background to help you understand the issue at hand.

Hangul Jamo(alphabet) is just like Indic scripts. 
Having __nominal__ glyphs for them  at their Unicode codepoint is  not
sufficient at all.
(that is, fonts like MS Arial Unicode and Cyberbit are totally useless for
Hangul Jamo rendering other than the Unicode code chart because 
what they have for U+1100 Jamo is *non-combining* glyphs for Jamo)
They have to be put together in a certain way to form syllables. 
Therefore, the proper way to support them is with opentype fonts with gsub/gpos
tables.
Unfortunately, no *opentype* font for Hangul Jamos is available for free and
it might take pretty long to have freely available opentype fonts (as opposed
to mere truetype fonts) with gsub/gpos and other bells and whistles for
Hangul Jamos. 
The only opentype for Hangul Jamos is included in Korean version of MS Office XP
and virtually none in opensource community has access to   it. (I bought English
MS Office XP for the sole purpose of accesing it because I never use MS Office 
but it turned out that English MS Office XP doesn't include it)

 
In the meantime, two fonts I want to support are at least downloadable on the net
although its license has some restrcition(I'm contacting the foundry Hanyang
System in Korea to ask for a more liberal license).  
They're truetype fonts with Unicode cmap with glyphs for Hangul Jamo composition
in PUA. The mapping from a __sequence__ of Hangul Jamos
to a sequence of glyph codepoints in two fonts (note that the mapping is
not 1 to 1 but n to m) has to be done somewhere. 

That mapping is dealt with in bug 176315. 176315 also solves Hangul Jamo
rendering problemm for Mozilla-X11core font, Xprint(because GTK,X11, and
Xprint modules in gfx/src already have a mechanism to deal with custom-encoded
fonts. See, for instance,
http://lxr.mozilla.org/seamonkey/source/gfx/src/gtk/nsFontMetricsGTK.cpp#421
The array defined there play the role of fontEncoding.properties file for
gtk,X11,Xprint
gfx modules) and possibly MacOS (for MathML, custom-encoding font handling was
added to MacOS gfx and that may just work for 'wide' custom-encoding fonts as well.)
This bug sits on top of bug 176315 to make use of two converters I added
there to render Hangul Conjoining Jamos (Mozilla-Windows already has
code for handling custom-encoded fonts as used by MathML and some 'symbol'
fonts, but they don't work for 'wide' custom encoded fonts)
Bug 176290 is to make Mozilla-Xft take advantage of two converters 
added in bug 176315 as well as  MathML(and 'symbol' font) converters. 


Another thing to note is that currently Mozilla-Unix/Linux doesn't make use of
opentype or AAT fonts even for Indic scripts for which free opentype fonts began
to emerge. (I'm a bit confused about the status of Mozilla-Windows and
Mozilla-MacOS in this aspect. Do they use Opentype fonts or
AAT fonts to render complex scripts? If they do, do they use OS-provided APIs
such as
Uniscribe and ATSUI or roll out in-house rendering? It seems like
neither of them do any of this, but I'm not sure at the moment. If the answer
is negative, my patch can also be used to render Indic scripts wih
still widely available custom-encoded truetype fonts for Indic scripts in
Mozilla-Windows if converters for them are added)  That means 
even if opentype fonts for Hangul Jamo are available, 
Mozilla-Unix/Linux wouldn't work for Hangul Jamos. 

Mozilla's ctl extension (mozilla/extensions/ctl) renders
Indic and Thai scripts with basically the same approach I'm taking
here and in bug 176315. In other words, Mozilla/Unix/Linux rely on
custom-encoded fonts for Indic scripts to render Indic scripts. 
Hangul Jamos are a bit simpler than
Indic scripts so that  just adding converters (in bug 176315) makes
Mozilla-X11core and Xprint work. However, Mozilla-Windows and Mozilla-Xft
need a bit more work and this bug is for Mozilla-Windows and bug 176290 is
for Mozilla-Xft.

Hopefully, this long comment helped you understand my intent in this bug
and bug 176290 and bug 176315. 
I finished looking through the patch. I have 2 suggestions.
1, The indentation is incorrect in several places, probably tab is used there.
+  // encoding in fontEncoding.properties for a wide NonUnicode font
+  // has '.wide' at the end. 
+	        if (Substring(encoding, encoding.Length() - 5, 5) ==
NS_LITERAL_STRING(".wide"))
+            aFontType = eFontType_NonUnicodeWide;
+					else 
+            aFontType = eFontType_NonUnicode;

2, the following code appears several time in nsFontWinNonUnicode implementation.
+  char str_local[CHAR_BUFFER_SIZE];
+  char* str = str_local;
+  PRInt32 destLength;
+  if (NS_FAILED(mConverter->GetMaxLength(aString, aLength, &destLength)))
+      return 0;
+  if (destLength > CHAR_BUFFER_SIZE) {
+    str = new char[destLength];
+    if (!str) return 0;
+  }
+
+  PRInt32 srcLength = aLength;
+  mConverter->Convert(aString, &srcLength, str, &destLength);
+
+
+#ifdef IS_LITTLE_ENDIAN
+  // convert BigEndian UCS2 to little endian UCS2
+	char* pstr = str;
+  while (pstr < str + destLength) {
+    PRUint8 tmp = pstr[0];
+    pstr[0] = pstr[1];
+    pstr[1] = tmp;
+    pstr += 2; // swap every two bytes
+  }
+#endif
and 
+  if (str != str_local) {
+    delete[] str;
+  }

It is possible to have a help class to simply this. memory deallocation can be
handled in object destruction. 


Shanjian,
Can you review this update? 
As you pointed out, I added a helper class for temp. buffer
allocation and removed tabs.
Attachment #106538 - Attachment is obsolete: true
Attachment #108981 - Flags: review?(shanjian)
Comment on attachment 108981 [details] [diff] [review]
a new patch addressing Shanjian's concerns

>@@ -1468,7 +1532,12 @@
>         nsAutoString encoding;
>         if (NS_SUCCEEDED(GetEncoding(aShortName, encoding))) {
>           aCharset = DEFAULT_CHARSET;
>-          aFontType = eFontType_NonUnicode;
>+  // encoding in fontEncoding.properties for a wide NonUnicode font
>+  // has '.wide' at the end. 

The indentation of these 2 line does not seems right. Please double check.
Attachment #108981 - Flags: review?(shanjian) → review+
Comment on attachment 108981 [details] [diff] [review]
a new patch addressing Shanjian's concerns

Thank you for r, Shanjian.
I've fixed the indentation in my copy.
Now asking rbs	for sr.
Attachment #108981 - Flags: superreview?(rbs)
The approach in the patch is OK but it seems to me that it would be much simpler
just to build upon the existing code with a simple |if| as follows:

  /* yep, I know that IsWide() doesn't exist, but it is generally good to be
     able to query the biodata of objects, and functions to GetState/Behavior()
     should have been added in the first place in nsIUnicodeEncoder. But now you
     can initialize as:
     GetConverter(converter, &isWide);
     font = new nsFontWinNonUnicode(&logFont, hfont, ccmap, converter, isWide);
     and use a member bolean variable mIsWide for such a purpose.
 */
  if (converter->IsWide()) // or if (mIsWide)
    // use 'W':
    ::ExtTextOutW(aDC, aX, aY, 0, NULL, (PRUnichar*) str, destLength / 2, NULL);
  else
    // use 'A':
    ::ExtTextOutA(aDC, aX, aY, 0, NULL, str, destLength, NULL);

The reason I am suggesting this is because the code you are adding is differing
from the existing code by just this type of one-liners, and so the benefits
would be to cut down the excessive C++ and ease maintenance. (A direct effect
would be that most of your code would be kept in the testing radars even without
those rare Jamo fonts. Presently, your code is almost never going to be tested
given that most people won't have those fonts.) The auto memory buffer is a
welcome addition, though. BTW, rename |nsCharBufferHdl| to |nsAutoCharBuffer|.
It is customary in the Mozilla codebase to use the prefix |auto| for this type
of scope-based data structures.

+  PRInt32 destLength;
+  if (NS_FAILED(mConverter->GetMaxLength(aString, aLength, &destLength)))
+      return 0;
+
+  nsCharBufferHdl buffHdl;
+  char* str = buffHdl.GetBuffer(destLength); 
+  if (!str) return 0;
+
+  PRInt32 srcLength = aLength;
+  mConverter->Convert(aString, &srcLength, str, &destLength);
+
+#ifdef IS_LITTLE_ENDIAN
+  // Convert BE UCS2 to LE UCS2
+  buffHdl.ByteSwap16(destLength);
+#endif

The above chunk is redundant and can be put in a helper function, e.g.,
ConvertUnicodeToGlyphs(..., nsAutoCharBuffer& aResult) or something.

As shanjian wondered, I have also been wondering if there is a wide demand for
bending backward this way rather than leading the way forward. Is this patch
just for the sake of it, or is there a sizeable demand for supporting this
feature other than your own needs?

Finally, there isn't much point in landing this patch without landing bug 176315
first.
Attachment #108981 - Attachment is obsolete: true
I found that |nsAutoFontDataBuffer| already does
|nsAutoCharBuffer| except that its static buffer
is larger and the type is unsigned. Template is
not supposed to be used for xpcode, but can't I
use even in platform specific code for which the only
compiler we use(VC++) is known to work well with it?
I also found that there are several places where PRUnichar[]
is used similarly to the same way  char[] is used for which I
made a helper class per Shanjian's suggestion. So, I extended
the prev. patch to deal with the redundancy there as well.
This is actually outside the scope of this bug, but I thought
I might do it while I was at it. This also uses template
so that it depends on the Mozilla coding policy.
Comment on attachment 109259 [details] [diff] [review]
a similar patch using template to unify two nsAuto* classes

I'm asking for r and sr of
the second in a series of three
(attachment 1092588, attachment 1092598, attachment 1092600),
but please feel free to pick one
you think of as the most appropriate.
Attachment #109259 - Flags: superreview?(rbs)
Attachment #109259 - Flags: review?(shanjian)
Thank you for review and suggestion. I made changes as you suggested although
to me having a separate subclass appears to be cleaner despite some redundancy.
(there's a method for 'wide' font that is more like the one for Unicode font
class than the one for 'narrow' non-unicode font).

As for your question about the demand for this feature, my short answer would be
definitely yes. The leader of Jikji Project (Korean equivalent of
Gutenberg Project) is very much interested in this work because the project
got kinda stuck due to the lack of portable/cross-platform support
for old Korean on the web. 
It also has to be noted that this patch is not only for
Korean but could be used for other complex scripts for which opentype fonts
are not readily available yet. That category includes a lot of South/South
Asian scripts. Moreover, having opentype fonts doesn't make Mozilla
all of sudden render them correctly. Opentype fonts are intelligent
but certainly NOT to the degree that they can do all the magic for complex scripts
without 'clients' of opentype fonts doing anything. As far as I can tell,
there's no trace of opentype font support in Mozilla code, yet. Therefore,
even though I don't  have any doubt whatsoever about
a thesis that opentype font support is the way of the future, I wouldn't use 
that as an argument against implementing the only feasible way to support some 
complex scripts represented NOT in hacked text encoding BUT in the proper
Unicode way. To me, being able to render complex scripts encoded in their
Unicode codeposition (instead of PUA code points or hacked **text** encoding)
is not a step backward but a big step forward, which would accelerate 
opentype font developments with gsub/gpos and other opentype tables for them. 
 

I wrote a very long response to your question, but decided not to clutter
bugzilla. Instead, I put it up at http://jshin.net/i18n/korean/177877.comment.txt.
It's incomplete, not so well organized and is in plain text.  Nonetheless,
you're welcome to read it. 
Comment on attachment 109260 [details] [diff] [review]
a more aggresive patch to remove redundancy with temp. buffer alloc/free.  

This one is my favorite. I only have the few nits mentioned below.

s=rbs applies from now on.
========================
   name.StripWhitespace();
   ToLowerCase(name);

+
   // if we have not init the property yet, init it right now.
----------
     mMat.eM12 = mMat.eM21 = zero;
-    mMat.eM11 = mMat.eM22 = one;	
+    mMat.eM11 = mMat.eM22 = one;    
----------------
@@ -5247,7 +5308,6 @@

   return 0;
 }
-
 NS_IMETHODIMP

nit/whitespace: no need to mess the patch with these useless changes.


================================
+// a wrapper function overloading |GetGlyphIndices| defined above.
+static nsresult
+GetGlyphIndices(HDC		     aDC,
+		 nsCharacterMap**    aCMAP,
+		 const PRUnichar*    aString, 
+		 PRUint32	     aLength,
+		 nsAutoChar16Buffer& aBuffer)
+{  
+  PRUnichar* str = aBuffer.GetArray(aLength);
+  if (!str)
+    return NS_ERROR_OUT_OF_MEMORY;
+  if (!GetGlyphIndices(aDC, aCMAP, aString, aLength, str, aLength))
+    return NS_ERROR_UNEXPECTED;
+  return NS_OK;
+}

What about just changing the definition of the old function
so that it accepts |nsAutoChar16Buffer& aBuffer| as argument.
This way you can remove the internal alloc that is done therein.

================================
+  PRBool mIsWide; 
+
 #ifdef MOZ_MATHML
   virtual nsresult
   GetBoundingMetrics(HDC		 aDC,
@@ -1959,6 +2098,7 @@
   nsCOMPtr<nsIUnicodeEncoder> mConverter;
 };

readability: just move the declaration at the end.
Attachment #109260 - Flags: superreview+
Attachment #108981 - Flags: superreview?(rbs) → superreview-
Attachment #109259 - Flags: superreview?(rbs) → superreview-
rbs, can you take a final look and transfer your sr stamp here ?
Thanks.
Attachment #109258 - Attachment is obsolete: true
Attachment #109259 - Attachment is obsolete: true
Attachment #109260 - Attachment is obsolete: true
My own 'nit' detector found unnecessary '// if output.....' and I've just removed. 
----------------
     // if output buffer too small, allocate a bigger array -- the caller should
free
-    PRUint16* result = aBuffer;
-    if (aLength > aBufferLength) {
-      result = new PRUint16[aLength];
-      if (!result) return nsnull;
+    PRUint16* result = aBuffer.GetArray(aLength);
--------------
 

Just to save time for rbs, the only diff. brounght in in this patch is
the following along with the change in GetGlyphIndices() as suggested.
( I thought there are several invocations like this and made an
overloading wrapper, but it turned out there's only one...) 

-    GetGlyphIndices(aDC, nsnull, &aChar, 1, &aGlyphIndex, 1);
+    nsAutoChar16Buffer gbuffer;
+    GetGlyphIndices(aDC, nsnull, &aChar, 1, gbuffer);
+    aGlyphIndex = *(gbuffer.GetArray());
   }

 
Comment on attachment 109300 [details] [diff] [review]
a new patch addressing rbs' concerns

=============
+static nsresult
+GetGlyphIndices(HDC		     aDC,
+		 nsCharacterMap**    aCMAP,
+		 const PRUnichar*    aString, 
+		 PRUint32	     aLength,
+		 nsAutoChar16Buffer& aBuffer) <--------
readability: rename aBuffer -> aResult

+static nsresult <--------- so don't return |nsnull| later
[...]
+  if (!aString)
     return nsnull;

return NS_ERROR_NULL_POINTER

=============
+    nsAutoChar16Buffer gbuffer; <-------------
+    GetGlyphIndices(aDC, nsnull, &aChar, 1, gbuffer);
+    aGlyphIndex = *(gbuffer.GetArray());

Just drop the prefix 'g' throughout your patch -- it
is is often used for global variables (e.g., use |buf| and
|buffer| if you need two names)
Error check for robustness:
  if (NS_SUCCEEDED(GetGlyphIndices(aDC, nsnull, &aChar, 1, buffer)))
    aGlyphIndex = *(buffer.GetArray());

=============
+  PRBool mIsWide;  <--------------
+
 #ifdef MOZ_MATHML
   virtual nsresult
   GetBoundingMetrics(HDC		 aDC,
@@ -1959,6 +2080,7 @@
   nsCOMPtr<nsIUnicodeEncoder> mConverter;
 };

I mentioned earlier to do:
   nsCOMPtr<nsIUnicodeEncoder> mConverter;
+  PRBool mIsWide;
 };

=============
+    nsresult rv = GetGlyphIndices(aDC, &mCMAP, aString, aLength, buffer);
+    if (NS_FAILED(rv)) {
       return NS_ERROR_UNEXPECTED; <---- |rv| is lost
     }

use: return rv; (2 places)

=============
   // we reach here if we couldn't transliterate, so fallback to question marks 
-  if (aLength > aBufferLength) {
-    // allocate a bigger array that the caller should free
-    result = new PRUnichar[aLength];
-    if (!result) return nsnull;
-  }
+  result = aBuffer.GetArray(*aCount);	<--------------------
+  if (!result) return NS_ERROR_OUT_OF_MEMORY;
   for (PRUint32 i = 0; i < aLength; i++) {
     result[i] = NS_REPLACEMENT_CHAR;
   }
   *aCount = aLength;  <-------------------------------------

you are using an uninitialized varibale.  
Use |result = aBuffer.GetArray(aLength)| instead.
Attachment #109300 - Flags: superreview-
> readability: rename aBuffer -> aResult

ignore this, it isn't worth the trouble. It will just make the patch bigger
without much gain.
hopefully this is the last iteration :-)
Attachment #109300 - Attachment is obsolete: true
Attachment #109379 - Flags: superreview?(rbs)
Comment on attachment 109379 [details] [diff] [review]
a new patch with rbs' concerns addressed

sr=rbs
Attachment #109379 - Flags: superreview?(rbs) → superreview+
Attachment #106538 - Flags: review?(shanjian)
Comment on attachment 109259 [details] [diff] [review]
a similar patch using template to unify two nsAuto* classes

Sorry for spamming. I'm cleaning
up obsolete r request.

Fix just checked in.
Thank you all.
Attachment #109259 - Flags: review?(shanjian)
-> fixed
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Jungshik Shin: could you please verify this change? otherwise, please provide
the test case/steps then I can verify it.  Thanks!
Yuying,
You have to wait until my patch for bug 176315 is checked in.
I'll let you know when that's done. 
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: