Closed Bug 182877 Opened 22 years ago Closed 21 years ago

Xft-enabled Mozilla : Needs non-BMP (plane 1 and beyond) rendering support

Categories

(Core :: Internationalization, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jshin1987, Assigned: jshin1987)

References

()

Details

(Keywords: testcase)

Attachments

(5 files, 11 obsolete files)

84.43 KB, image/jpeg
Details
636 bytes, text/html
Details
79.91 KB, image/jpeg
Details
106.17 KB, image/jpeg
Details
16.67 KB, patch
Details | Diff | Splinter Review
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021028
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021028

Xft-enabled Mozilla currently can't render non-BMP characters.
As plane 1 and plane 2 have been being filled pretty rapidly with 
quite a lot of characters, the demand for non-BMP character rendering
will go higher. With a very little change, Xft-enabled Mozilla
can render non-BMP characters. I've already made a patch and
it works pretty well except for RTL script(see bug 120114).


Reproducible: Always

Steps to Reproduce:
1.Go to the URL specified above
2.See how it gets rendered.
3.

Actual Results:  
Each non-BMP character is rendered with 
a pair of unknown glyph chars(with 4 hex digit inside
a box representing surrogate code point). That's because
Mozilla's internal Unicode representation is UTF-16.

Expected Results:  
All non-BMP characters have to be rendered if a font
for them is present. 
If not, they have to be rendered with a single unknown glyph character
(6 hex digits inside a box) instead of two unknown glyph characters.

I have a patch. Xft and Freetype2 library also need a small patch.
 Making it depend on bug 172768 (Xft-tracking bug).

Simon, as I mentioned in the report, it's time to get in the patch
you and Shanjian made for bug 120114. RTL/BIDI is now necessary
for non-BMP characters as well :-)

I don't know why I can't assign this to myself. Simon, I'm sorry
to bother you, but can you give it to me? 


Blocks: xft_tracking
Attached patch patch v1 (obsolete) — Splinter Review
This, by itself, does not work. Xft and Freetyp2 also have
to be patched. I'll post patches for them here as well
as sending them to the maintainers of two libraries.
I'm gonna send this to Keith Packard. If I can make my case
well, this is likely to be included in upcoming XFree86 4.3.0 release.
Alternatively, he can make XftTextExtentsUtf16(), but from Mozilla's
point of view, extending XftTextExtents16() to support UTF-16
is better because Mozilla's internal string is UTF-16.
This patch is necessary because Freetype2 doesn't handle
well truetype fonts (e.g. Code2001) with multiple Unicode cmaps
of which only one is UCS-4 (UTF-32) cmap while other two are
UCS-2 cmap. Cmap search routine in FT2 (FT_Select_Charmap) 
returns the first Unicode cmap it find, but most of time
it's not UCS-4 cmap but UCS-2 cmap although fonts like Code2001
have non-BMP characters. FcCharSetHasChar(fontconfig API)
used by Mozilla fails to recognize that non-BMP characters
are covered by CODE2001 because FT2 (used by fontconfig)
fails the way described. 

With this patch applied, 'ft-cache' has to be run afresh
to rebuild on-disk-cache (used by fontconfig and in turn
by Xft and Mozilla) of chara. coverage map of fonts.
Etruscan is broken because it's RTL script encoded in Plane 1. 
Simon has a patch that I believe will solve this problem.
Attached file another test page
Testing this enabled me to catch a bug in my
previous patch. In both attachment 107850 [details] [diff] [review] and 
attachment 107851 [details] [diff] [review], I used bitwise-or where
I should have used arithmetic '+'.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: patch, testcase
Attached patch patch v2 (obsolete) — Splinter Review
fixed a bug in surr. pair to UCS4 macro
and removed a mix-up from my patch to another bug.
Attachment #107850 - Attachment is obsolete: true
Attached patch Xft patch (revised) (obsolete) — Splinter Review
replacing attachment 107851 [details] [diff] [review]
Attachment #107851 - Attachment is obsolete: true
I should have taken a look at the patch for 
bug 120114 to avoid saying something stupid....
Bug 120114 is about solving the problem in Windows..
Now, I have to figure out what Resolve(For|Back)ward()
are for... nsFontMetricsXft didn't implement them,
but it seems like they're used for BIDI. 
Simon, could you tell me a little about the role of
those two methods?
Perhaps, I have to rework my patch from the scratch or ...
The patch for bug 122800 solved the problem with RTL
text as shown by the screenshot.

 There still is a mystery for which I don't have an explanation
at the moment. Somehow, some characters are rendered as hollow
boxes at 100%/120%. When I change the zooming factor to a
higher value, they get rendered correctly. Some metric-related
problem?...
Is this change about hacking nsAString to hold something other than UCS-2 like 
some Windows code does?  If that is the case, we need to seriously reconsider 
our string strategy in general instead of hacking around it all over...
From the very early days of Mozilla development, the internal representation 
used has been not UCS-2 but UTF-16. Therefore, the content of 
nsAString has always been UTF-16 (!= UCS-2). I'm not making up anything here :-)
 
Circa 1998, I argued that it'd be easier to use UTF-32, but perhaps due to
concerns about memory consumption and other reasons, Mozilla team
decided to use UTF-16 ( != UCS-2). Therefore, handling of surrogate
pairs have to be implemented wherever necessary. This is not the only
place but there are some other places where surrogate pairs are dealt with.
A simple search with 'IS_HIGH_SURROGATE'  or similar yields (maybe a dozen or so)
hits. Anyway, UTF-16 is not so bad as you may think (although I prefer
to work in UTF-32 ). Firstly, UTF-16 is  much easier to deal with than old
multibyte encodings (CJK encodings) .  In some CJK encodings, 
in the worst case, you have to back off all the way to the beginning of
a string to figure out the 'character boundary' correctly. In UTF-16, 
it's much simpler than that. 
Secondly, even if we use UTF-32,
we have to deal with combining characters and many other subtleties of
Unicode so that  one codepoint for one  'grapheme'  model doesn't hold even in
that case. (for instance, see DUTR #29 at Unicode web site). 
Thirdly, Win32 and Java(and to a lesser extent, 
Javascript) have demonstrated that UTF-16  could be used without much 
disruption and much difficulty. Win32 'W' APIs use UTF-16 extensively
and applications built on top of those APIs handle the full repertoire
of Unicode (upto U+10FFFF) very well. Java's 'string' is also a string
of UTF-16 (NOT UCS-2).  And, Mozilla itself is another good example
that UTF-16 can be used rather easily. If you're not yet convinced that
UTF-16 works rather well (hope you're :-)) , you can read a lot more at Unicode
mailing list archive at http://www.unicode.org.  

At this point,  IMHO, switching over to UTF-32 is too risky and costs too much
with little gain. (again, I'm not a big fan of UTF-16 and I wish UTF-32 had been
adopted at the beginning, but  it's not that bad so I can live with it.) 
Moreover, the distinction between UTF-16 and UCS-2 
has to be made only in limited areas like this.   

If you read some of old news articles in
mozilla.* hierarchy, you'll find that Erik (not at netscape any more) definitely
stated that Mozilla string is UTF-16 (not UCS-2). He also made it clear to those
involved in 'String' class development
that they wouldn't have to concern themselves with this issue, to which they
agreed.   
> not UCS-2 but UTF-16

In that case, please explain the following:

1)  Why there is no way to create a UTF-16 string (we have ways to convert
    ASCII or UTF-8 to UCS-2, but not to UTF-16).
2)  Why nsIUnicodeDecoder/Encoder seem to work with UCS-2 (if they do not,
    the "number of Unicode characters written/read" params on Convert() are
    sort of useless).
3)  Why the string documentation keeps referring to UCS-2 (eg see
    http://www.mozilla.org/projects/xpcom/string-guide.html)

The basic issue is that right now almost everyone working on Mozilla is firmly 
in the "nsAString is UCS-2 with one PRUnichar per one character except in some 
weird Windows-only text-measurement code or something".  If this viewpoint is 
incorrect, perhaps an effort should be made to educate people.  Properly 
commenting the string classes and unicode decoder/encoder interfaces would be a 
good start.

I'm not saying we should switch to UTF-32 or anything like that.  I'm just 
saying that we can't have half the app assuming that nsAString is UTF-16 and 
the other half assuming it's UCS-2.  We have to choose one or the other and use 
it.  Otherwise, what's the point of having nsAString?
This is really a dupe of bug 127713, but since a lot of work has happened here
already, for the time being I am mutually cc-ing the respective owners of the
two bugs so you can work together and decide which one to close.
Assignee: smontagu → jshin
Simon,
Thanks for giving it to me. 

BTW,   as Brian noted in bug 127713, Xft-Mozilla and FT2-Mozilla have separate
codepaths so that they're not dupe of each other. 



Boris, I've just filed bug 183048 for UTF-16 vs UCS-2 documentation change. I
think you raised a very important issue here.
BTW, despite unfortunate choices of names like NS_ConvertUTF8toUCS2, I believe
they work correctly for non-BMP characters.  Therefore, the issue is mainly
documentation
and 'education'. Anyway, I'll try to check if  they indeed do.
 
That bug does not address item #2 in comment 14.  If NS_Convert*toUCS2 actually
convert to UTF-16, we should rename them to make that clear.  If developers
cannot depend on the "one PRUnichar == one rendered glyph" mapping, I bet there
is some code that needs rethinking.
Actually, I mentioned (though rather passingly) about the need to rename related
functions/classes/methods in bug 183048. Do you think we need a separate bug
to replace UCS2 in function/class/method names with UTF16? Perhaps, we do.
That may be a huge replace/search operation that canbe gradually done with
macro definitions. 

BTW, I believe all *UCS2* functions/classes are written to be 'UTF-16 clean'(see
for instance  |ConvertUTF8toUCS2| in nsString2.h 
(http://lxr.mozilla.org/seamonkey/source/string/obsolete/nsString2.h#587)
or |ConvertUCS2toUTF8|  in nsString.cpp.  As for the lack 
of 1-to-1 correspondence between 'codepoints'  and 'glyphs'/graphemes/whatever,
 no matter
what representation we use for Unicode string, that NEVER holds true
for even seemingly simple Latin/Greek/Cyrillic letters. Virtually all Unicode
characters are 'complex' and have to be treated as such. Surrogate-pari
handling can be regarded as a special case of 'combining character sequences'.
I believe that that has been well noted by Mozilla developers. Of course, this
doesn't mean
that everything is perfect as of now (for instance, Latin/Greek/Cyrillic
diacritic marks are not well supported at the moment) and that
there's no need for rethinking. 

Let's move this discussion to bug 183048 or better(?) still, i18n newsgroup. 
Attached patch a third patch (obsolete) — Splinter Review
I haven't _yet_ managed to convince Keith that extending UCS2 API to support
UTF-16 is harmless. Instead, he added XftTextExtentsUtf16()(I guess
it'll go into XFree86 CVS soon). This patch
calls that API.  As for Freetype2 patch, a patch that does what I intended to
do has been committed to the CVS of FT2. 

Because this patch calls XftTextExtentsUtf16 API (not yet available
in any released version of Xft), it can't go in now. Nonetheless,
Simon, can you review it?   On my machine, it works well.
Attachment #107857 - Attachment is obsolete: true
Comment on attachment 108478 [details] [diff] [review]
a third patch

Why not included unicharutil.h and use the macros from there instead of
defining your own IS_HIGH_SURROGATE IS_LOW_SURROGATE and SURROGATE_TO_UCS4?

Why do you have several times the pattern

+	 if (! IS_LOW_SURROGATE(c)) { 
+	     if (IS_HIGH_SURROGATE(c)) {  

Unless I am missing something, the first |if| is implied by the second.

I would rather use a macro with an informative name like IS_NON_BMP, or a
boolean, instead of testing |c >> 16| all the time.

Apart from that, patch looks OK.
Attached patch patch v4 (obsolete) — Splinter Review
Simon, thanks for reviewing it.
This patch addressed your concern.
As for two if's, the first 'if' is not implied by the second if because
the set of codepoints satisfying the first condition contains not 
only high surrogates but also regular codepoints. See comments added in this
patch. 
Can you reivew it again?
Attachment #108478 - Attachment is obsolete: true
Comment on attachment 108551 [details] [diff] [review]
patch v4

r=smontagu, sorry for my confusion about the conditions before.
Attachment #108551 - Flags: review+
Comment on attachment 108551 [details] [diff] [review]
patch v4

+		     goto FoundFont; // for speed -- avoid "if" statement
you could simply break; here...
Attached patch patch v5 (obsolete) — Splinter Review
Thank you Simon for r. 
Thanks, Christian. I should have done that way. It's clearer. Then, it wouldn't

have confused Simon :-) 
Boris or David, can I get sr?
Attachment #108551 - Attachment is obsolete: true
Um, isn't it the |goto NotFound;|s that you should change to |break;| rather
than the |goto FoundFont;|s?
urg, I think I might have misread the patch and read "FoundFont" as "NotFound".

damn I hate gotos.

In that light, I think it might be better to ignore what I said.
I don't like goto, either ;-)  Anyway, patch v4 and patch v5 do exactly the same
thing. 
When unpaired LOW SURR. is found, it goes to 'NotFound'. In patch v4, it's
implicite while in patch v5 it's explicite and less confusing to human readers.  
Note that both NotFound and FoundFont blocks are _inside_ the (outer)
for-loop(walking
thru |aString[ ]|). Whether a font for char.
is found or not and whether illegal surrogate is come across or not, the loop
has to keep on going.
Yeah, reading only the patch, I might have been confused, too. 
 
I realized that using XftTextExtentsUtf16 (that was committted to the XFree86
CVS a couple
of days ago) can be a 'deployment headache' because it'll take a while for
a new Xft lib. with XftTextExtentsUtf16 to get widely deployed. 
Following Keith's suggestion, I added a run-time
check for the presence of XftTextExtentsUtf16 and made Mozilla use
the internally defined version if it's not available. 

I tested under four combinations (compiled with old Xft/new Xft and run with
old Xft/new Xft)
and in all four cases, it worked fine.
Attachment #107858 - Attachment is obsolete: true
Attachment #108582 - Attachment is obsolete: true
Comment on attachment 109358 [details] [diff] [review]
a new patch with the run-time check for XftTextExtentsUtf16()

Simon, can you review again?
This is basically the same
patch as the one you reivewed.
The only significant difference
is the addition of run-time
check for XftTextExtentsUtf16().
Attachment #109358 - Flags: review?(smontagu)
  Instead of the following line in attachment 109358 [details] [diff] [review],

  +        gXftTextExtentsUtf16 = PR_FindSymbolAndLibrary("XftTextExtentsUtf16",
&lib);

I now have the following: 

        // ISO C++ forbids casting between pointer-to-object(void *) and 
        // pointer-to-function. g++ allows it nonetheless, but some compilers
        // may not. To be on the safe side, we need to do this double
        // casting. It depends on ptrdiff_t type being of the same size
        // as pointer, which I believe is the case on all platforms.
        // I found this trick with google at 
        // http://www.geocrawler.com/archives/3/1452/2001/6/0/5924216/
        gXftTextExtentsUtf16 = NS_REINTERPRET_CAST(void (*)(Display*, XftFont*,
                const FcChar8*, FcEndian, int, XGlyphInfo*),
                NS_REINTERPRET_CAST(ptrdiff_t,
                PR_FindSymbolAndLibrary("XftTextExtentsUtf16", &lib)));
Comment on attachment 109358 [details] [diff] [review]
a new patch with the run-time check for XftTextExtentsUtf16()

On Win16 (Windows 3.1), ptrdiff_t was 16 bits, signed.	I think this was just
broken, never mind any ISO standards -- you can't express a pointer difference
within a 64K segment with 15 bits.

Anyway, I'm not a language laywer, but I wonder if PRWord wouldn't be a better
type than ptrdiff_t here?

/be
 Brendan,

 Your comment prompted to do some more digging and I found 
 two threads in comp.lang.c++.* about casting between
pointer to object and pointer to function. ( 'Casting void* to function pointer'
in 1999 and 'g++ 3.1: ISO C++ forbids casting between pointer-to-function and
pointer-to-object' this year). I was naive (as you pointed out by an example
of Win16) to assume what I assumed in my prev. comment. Anyway,
opinions varied as to the best way to work around that they could be of different
sizes and that dlsym() returns |void *| for both function and object. Using
a union as a generic pointer appears to be safest, but I feel like
that's a bit too overkill for this. 

Someone cited POSIX document (http://www.opengroup.org/austin/docs/austin_110r2.txt)
that stipulates that an object of type void * be able to hold 
a pointer to a function on a system implementing POSIX although 
it's not required by C/C++ standard.  It also suggested the following to
'silence' a compiler complian to the provision of ISO C++/C that
casting between ptr-to-obj and ptr-to-fcnt is forbidden.


*(void **)(&fptr) = dlsym(handle, "my_function");

So, I tried the above and the following and both worked fine on my ix86 Linux box,
which, of course,  says little about the portability with which
we're concerned.

*NS_REINTERPRET_CAST(void **, &gXftTextExtentsUtf16) = 
PR_FindSymbolAndLibrary("XftTextExtentsUtf16", &lib);

This looks at least less ugly than double casting...
What would you say?  
No, PRUptrdiff need not be big enough to hold a pointer.  PRWord is good enough;
I don't mind the *(void **) hack, it's equivalent but IMHO uglier.  A PRWord is
the size of a void*, and we (following POSIX) stipulate that function pointers
fit in void pointers.

/be
My NSPR-fu is old school (NSPR1, back in 1995, something kipp and I wrote). 
PRWord and PRUword are deprecated by comments in prtypes.h, and PRPtrdiff and
PRUptrdiff have comments claiming they're big enough for pointers, as well as
for pointer differences.

/be
I thought I added my comment this morning, but I forgot to commit.
Anyway, as two persons have converged to double casting
with PRUptrdiff as an intermediary and prtypes.h claims that
it's big enough for pointers, I'll go with it as below:

        gXftTextExtentsUtf16 = NS_REINTERPRET_CAST(void (*)(Display*, XftFont*,
                const FcChar8*, FcEndian, int, XGlyphInfo*),
                NS_REINTERPRET_CAST(PRUptrdiff,
                PR_FindSymbolAndLibrary("XftTextExtentsUtf16", &lib)));


Simon, can you review attachment 109358 [details] [diff] [review] with that in mind?   
Comment on attachment 109358 [details] [diff] [review]
a new patch with the run-time check for XftTextExtentsUtf16()

r=smontagu, with the caveat that you need specific sr approval for xft zone
(i.e. blizzard) and for pointer fu (but comment 36 seems to imply that
already).
Attachment #109358 - Flags: review?(smontagu) → review+
Comment on attachment 109358 [details] [diff] [review]
a new patch with the run-time check for XftTextExtentsUtf16()

Thank you Simon for r
and Brendan and Christopher
for help with casting. Asking
Blizzard for sr.
Attachment #109358 - Flags: superreview?(blizzard)
Chris,
Can we move this forward? It's been two months..
With this, we can boast that Mozilla-Xft
is sorta on par with Mozill-Win in terms of non-BMP support.
I know, I've just been swamped.  I've been catching up on older reviews than
this (if you can believe that.)
Just to make it a bit easier for Chris when he finally catches up with
old reviews, I made a new patch against the current CVS
with the casting issue fixed as we discussed.

BTW, bug 127713 (non-BMP support for Mozilla with FT2) has been fixed
recently.
Attachment #109358 - Attachment is obsolete: true
Sorry I diddn't realize nsFontMetricsXft.cpp had changed yesterday.
This is a patch against the current cvs.
Attachment #115714 - Attachment is obsolete: true
Comment on attachment 115741 [details] [diff] [review]
an updated patch(against the cvs)

assuming Simon's r is carried over and asking
for sr.
Attachment #115741 - Flags: superreview?(blizzard)
Attached patch patchSplinter Review
This is the patch that I checked in.  It's a little different than the last
patch, mostly in style.  I cleaned up a lot of the braces and indenting, fixed
a couple extra else/return places and some other random things.  No big
functionality changes, as tempting as they were.

Thanks for the patch, and sorry it took so long to get into the tree.
Attachment #107852 - Attachment is obsolete: true
Attachment #115741 - Attachment is obsolete: true
Fixed.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Comment on attachment 116090 [details] [diff] [review]
patch

>+void
>+MozXftTextExtentsUtf16 (Display *aDisplay, XftFont *aFont, 
>+                        _Xconst FcChar8 *aString, FcEndian aEndian, 
>+                        int aLen, XGlyphInfo *aExtents)
>+{
>+    FT_UInt *glyphs, glyphs_local[NUM_LOCAL_GLYPHS];
>+    int i;
>+    int nglyphs = 0;
>+
>+    // This implementation is not for generic use, but is only to be
>+    // used where we know for sure that 'string' can be cast to 16bit
>+    // shorts.
>+    _Xconst FcChar16 *wstr = (_Xconst FcChar16 *) aString;
>+
>+    aLen = aLen / 2;


Hey, you forgot what I said over IRC about needing to make aLen unsigned, or
else sticking with >> (or both).  Using a signed int a / will result in slight
code and average-executed-instruction bloat due to the need to check for a
negative left operand before strength-reducing to >> 1 (as the original patch
did by hand).

/be
> Hey, you forgot what I said over IRC about needing to make aLen unsigned, or
> else sticking with >> (or both).  Using a signed int a / will result in slight
> code and average-executed-instruction bloat due to the need to check for a
> negative left operand before strength-reducing to >> 1 (as the original patch
> did by hand).
> 

I couldn't change the definition of the function since it has to be the same as
the one included in the Xft header file, and that function uses a regular int. 
I could have just used an unsigned int in the function and assigned to it from
the argument, but I figured that would have just added more code bloat than
using division instead of the shift.
Assigning an int to an unsigned takes zero instructions.  But you might just
leave the int and use >> as the original patch did.  I don't think it's a good
idea to use say aLen /= 2 (I forgot to pick on the aLen = aLen / 2 failure to
use /=, but that's not a big deal compared to what the compiler produces :-)
with a signed int type.

/be
Attachment #115741 - Flags: superreview?(blizzard)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: