Closed
Bug 130441
Opened 23 years ago
Closed 23 years ago
Unable to lay out surrogate in text-align: justify style
Categories
(Core :: Internationalization, defect, P2)
Tracking
()
VERIFIED
FIXED
mozilla1.0
People
(Reporter: ji, Assigned: ftang)
References
Details
(Keywords: dataloss, intl, Whiteboard: adt2)
Attachments
(3 files, 5 obsolete files)
113 bytes,
text/html
|
Details | |
6.25 KB,
patch
|
shanjian
:
review+
|
Details | Diff | Splinter Review |
10.51 KB,
patch
|
kinmoz
:
superreview+
asa
:
approval+
|
Details | Diff | Splinter Review |
When using MS Word file to generate html file, the file contains "text-align:
justify" style. With the current build, mozilla is unable to display surrogate
in this style.
Testcase to follow.
Comment 3•23 years ago
|
||
In justify style, we are send each utf16 short separately, and surrogate pair
will be separated this way.
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.0
Assignee | ||
Comment 4•23 years ago
|
||
shanjian- I probably can come out a patch quickly. But I cannot test my patch
since my winnt4j do not display them anyway. Can you test my patch ?
Assignee | ||
Comment 5•23 years ago
|
||
shanjian- can you verify this on your window which can display surrogate ?
Comment 6•23 years ago
|
||
Updated•23 years ago
|
Attachment #74349 -
Attachment is obsolete: true
Comment 7•23 years ago
|
||
frank,
Your patch could lead to correct display of those glyph, but the spacing between
those glyph are inconsistent and incorrect. I posted a new patch to address the
problem. I reassign it back to you for your consideration.
Assignee: shanjian → ftang
Status: ASSIGNED → NEW
Assignee | ||
Comment 8•23 years ago
|
||
Currently, we have 1:1 mapping between the PRUnichar* and the space array.
Shanjian's patch will break that- which will cause issue on Mac and Linux.
We should produce the same X for the surrogate low and skip it on Window
Assignee | ||
Comment 9•23 years ago
|
||
I realy want to make sure that
1. make this bug only focus on window display issue
2. do not change the contract beteen layout and gfx
Therefore, I would like to spin off the layout change into a different bug and
propose to take my origional patch.
Assignee | ||
Updated•23 years ago
|
Attachment #74806 -
Attachment is obsolete: true
Assignee | ||
Comment 10•23 years ago
|
||
Comment on attachment 74349 [details] [diff] [review]
testing patch. Not sure will fix the proble.
remove the obsolete flag.
Attachment #74349 -
Attachment is obsolete: false
Assignee | ||
Comment 12•23 years ago
|
||
rbs- can you r= attachment 74349 [details] [diff] [review] ?
Comment 14•23 years ago
|
||
Actually shanjian's iteration in attachment 74806 [details] [diff] [review] seems iteresting to me.
>1. make this bug only focus on window display issue
Adding #ifdef _WIN32 in what shanjian did in nsTextFrame would avoid troubles.
>2. do not change the contract beteen layout and gfx
It reminds me of the handling of the German's sslig that used some genuflexions.
I did a search in LXR, and I am seeing the following defined and redefined here
and there.
+#define IS_HIGH_SURROGATE(u)
+#define IS_LOW_SURROGATE(u)
Might be best to consolidate them in one place, e.g., nsICharRepresentable.h.
I will give r=rbs on shanjian's iteration with the above consolidation and the
#ifdef _WIN32 for protection of other platforms.
Assignee | ||
Comment 15•23 years ago
|
||
/intl/lwbrk/src/nsJISx4501LineBreaker.cpp, line 377 -- #define
IS_HIGH_SURROGATE(u) ((PRUnichar)(u) >= (PRUnichar)0xd800 && (PRUnichar)(u) <=
(PRUnichar)0xdbff)
/intl/uconv/ucvcn/nsUnicodeToGBK.cpp, line 152 -- #define IS_HIGH_SURROGATE(u)
(((PRUnichar)0xD800 <= (u)) && ((u) <= (PRUnichar)0xDBFF))
/gfx/src/windows/nsFontMetricsWin.cpp, line 3477 -- #define IS_HIGH_SURROGATE(u)
((PRUnichar)(u) >= (PRUnichar)0xd800 && (PRUnichar)(u) <= (PRUnichar)0xdbff)
I think we should put into xpcom/ds/nsCRT.h
what do you think ?
Assignee | ||
Comment 16•23 years ago
|
||
>>2. do not change the contract beteen layout and gfx
>It reminds me of the handling of the German's sslig that used some genuflexions.
The German's sslig does *sp++
I want to make sure there are a 1:1 mapping between PRUnichar* and the spacing
array. I don't want to break that even for surrogate .
Assignee | ||
Comment 17•23 years ago
|
||
we should put the width of the surrogate pair into the space for the high
surrogate and 0 for the low surrogate
Comment 18•23 years ago
|
||
>I think we should put into xpcom/ds/nsCRT.h
>what do you think ?
Fine with me since other definitions are there too.
In addition to IS_HIGH_SURROGATE and IS_LOW_SURROGATE, nsFontMetricsWin.cpp
also has SURROGATE_TO_UCS4 which needs to be moved there too.
#define IS_HIGH_SURROGATE(u) ((PRUnichar)(u) >= (PRUnichar)0xd800 &&
(PRUnichar)(u) <= (PRUnichar)0xdbff)
#define IS_LOW_SURROGATE(u) ((PRUnichar)(u) >= (PRUnichar)0xdc00 &&
(PRUnichar)(u) <= (PRUnichar)0xdfff)
#define SURROGATE_TO_UCS4(h, l) ((((PRUint32)(h)-(PRUint32)0xd800) << 10) + \
(PRUint32)(l) - (PRUint32)(0xdc00) +
0x10000)
>I want to make sure there are a 1:1 mapping between PRUnichar* and the spacing
>array. I don't want to break that even for surrogate .
I am afraid I still don't see why this is a critical aspect to you.
>we should put the width of the surrogate pair into the space for the high
>surrogate and 0 for the low surrogate
Since it is not known in advance whether the string will have surrogates, that
would require to always allocate twice the space (i.e., what is allocated for
the German sslig in SmallCaps mode would have to become the default).
Assignee | ||
Updated•23 years ago
|
Whiteboard: adt2
Assignee | ||
Comment 19•23 years ago
|
||
>In addition to IS_HIGH_SURROGATE and IS_LOW_SURROGATE, nsFontMetricsWin.cpp
>also has SURROGATE_TO_UCS4 which needs to be moved there too.
Will do.
>>I want to make sure there are a 1:1 mapping between PRUnichar* and the spacing
>>array. I don't want to break that even for surrogate .
>I am afraid I still don't see why this is a critical aspect to you.
rbs- even you don't see this is a critical, is there any important issue to
against that?
we have more and more #ifdef platform in the layout code now. I really want to
1. bring that down one by one,
2. don't introudce a new one if there are no major win/reason.
From my point of view, there are no major reason that we should introduce one
for this purpose. Do you agree ?
Priority: -- → P2
Assignee | ||
Comment 20•23 years ago
|
||
Attachment #74349 -
Attachment is obsolete: true
Comment 21•23 years ago
|
||
seems to be the wrong patch.
I don't agree that the 1:1 mapping between PRUnichar* and the spacing array is
crucial. But I agree with your other points about risk management. In fact, I
have been wondering why you were keen to fix this rather obscure and very
minimal bug (in terms of its impact/benefit) -- I would have given it a very low
priority, but...
Comment 22•23 years ago
|
||
If the bug was futured, perhaps a cleaner integration could be done with
feedback from the string savvy people, e.g., using a custom sink/string iterator
so as to limit the risk of littering the tree with error-prone tests for
IS_HIGH_SURROGATE() and IS_LOW_SURROGATE().
If we're moving (despite initial design assumptions, and by a process of just
making a change and then finding the bugs as they pop up, it seems) to an
"extended UCS2" or a UTF-16 encoding where we're not guaranteed any character
size, why shouldn't we just use UTF-8 and save space for typical pages?
Wasn't the initial rationale for the decision to use UCS2 internally that we
*wouldn't* have to worry about multiple-unit characters, and were willing to
limit ourselves to the characters within Plane 0? Or is there some other
rationale for UCS2 that I'm completely missing? Why was this objection ignored
with absolutely no discussion? (See bug 118000 comment 19 to the end of the bug.)
If we're going to litter this kind of stuff all over the tree, why can't we
litter the UTF-8 versions of it around instead?
(To respond to rbs more directly: the string library doesn't support 2-byte
strings with variable length characters. If we wanted to do that we'd probably
need another string hierarchy (nsAUCS4String, etc.).)
Assignee | ||
Comment 24•23 years ago
|
||
>If we're going to litter this kind of stuff all over the tree, why can't
>we litter the UTF-8 versions of it around instead?
This kind of thing won't be all over the tree. It will only be limited in some
part of layout and some part of gfx.
a big difference between THIS and UTF8 is all the Sysem API is also moving to
UTF-16, not UTF8. For example, all Win32/MacOS API are moving to UTF-16 and use
UTF8 will simply introduce additional conversion. Making the code work from UCS2
to UTF-16 only require very small amount of changes. Using UTF-8 require big
code chagne plus performance / footprint overhead against all the new system API
designed by Microsoft, Apple and other platforms.
Assignee | ||
Comment 25•23 years ago
|
||
David Baron : if you do not agree with me, then discuss this with
unicode@unicode.org mailling list or international@w3c.org mailling list. Most
of the people will tell you use UTF-16 instead of UTF-8 is the right thing to do.
Assignee | ||
Comment 26•23 years ago
|
||
please focus this bug in this particular problem.
The justify layout routine is not very well design for BMP neither. The current
algorithm assume one PRUnichar can represent one text element (read unicode
standard about text element) which is not true in BMP neither. For example, (1)
this algorightm currently break connected Arabic into disconnected part when we
display. (2) Also, this algorithm put combining mark and base character
sepearately for even latin characters. (and also Greek and Hebrew with points)
(3) This will break Thai and indic languages. (4) The algorithm have problem
with surrogate too. With in all the problem I described above, fixing (4) is the
fastest and easiest one because it only need a range check. To fix the rest need
table lookup so we don't want to do it now.
Please understand, we don't need to intrduce such surrogate check here if this
algorithm itself design against the text element concept in the unicode 3.1
standard.
We are very very close the have a good support of surrogate on window, we
already support surrogate fonts on window (with this bug). I already have a fix
for selection and cursor movement in bug 122584. I will work with jfrancis to
find out how to fix editor delete/backspace in near future. After m1.0, I will
add MacOS X support and Masaki Kataki from are currently working on adding
Surrogate surppot with FreeType code on GTK GFX. (plan for post m1.0)
>If the bug was futured, perhaps a cleaner integration could be done with
>feedback from the string savvy people, e.g., using a custom sink/string
>iterator so as to limit the risk of littering the tree with error-prone
>tests for IS_HIGH_SURROGATE() and IS_LOW_SURROGATE().
Stirng iterator won't solve any of this particular problem since the problem is
what we really need to iterate through is "text element" not unicode character
(neither UCS2 nor UTF-16). It happen that all the current defined surrogate
characters ARE one "text element". But in BMP a 'A' follow by a combining '
shoudl form a "text element"
>In fact, I have been wondering why you were keen to fix this rather obscure
> and very minimal bug (in terms of its impact/benefit) -- I would have given
> it a very low priority, but...
ok, let me explain to you why we want to fix this bug NOW.
The following are some FACTS you should know first
1. Unicode 3.1 define plan 2 as Han Ideograph Extension B and contains many
Chinese characters
2. China govement define GB18030 standard as a new standard character set for PR
China and they give a FIXED date for all vendor to migrate their support to
GB18030 from GB2312 support
3. GB18030 encode all unicode BMP characters and surrogate characters in
unicode. and we know China govement will look at the surrogate support when they
want to verify their GB18030 support.
4. The internet population in China is now 33.7 M (see
http://www.cnnic.net.cn/develst/rep200201-e.shtml )
5. MS support surrogate in Win XP and Office XP. They already ship fonts which
contains Han ideographs Extension B. (both IE and mozilla can display by using
it. please read http://www.microsoft.com/globaldev/drintl/015/default.asp ) They
already ship chinese input method which support surrogate input .
6. When we test Mozilla's surrogate support by viewing the the html generated by
Simplified Chinese Office2000, we found this bug because by default the
simplfied chinese Office2000 generate text-align: justify and all the surrogate
characters now are display as blank under this condiction. IE display them
correctly. We want to fix that so when we try to convience some of our Chinese
embedding partner that Gecko is better than IE we won't got push back because of
this.
I need to fix this for both external and internal customers who care about China
markets.
Do you think impacting 33.7 M internet users have minimun impact/benefit?
Assignee | ||
Comment 27•23 years ago
|
||
Attachment #76056 -
Attachment is obsolete: true
Assignee | ||
Comment 28•23 years ago
|
||
>seems to be the wrong patch.
thanks rbs, I just put in the RIGHT patch. :)
Comment 29•23 years ago
|
||
frank, I don't understand why you have this line here:
+ glyphWidth = charWidth + 2 * aTextStyle.mLetterSpacing;
The character represented by surrogate pair should be no different from
normal characters. It shouldn't have more space following it. I understood
that if this aTextStyle.mLetterSpacing is calculated instead of specified
directly by CSS, we might have problem in alignment. If that is the case,
it should be a separate problem, a problem in calculating mLetterSpacing.
Assignee | ||
Comment 30•23 years ago
|
||
good point. I will resubmit my patch
Assignee | ||
Comment 31•23 years ago
|
||
Attachment #76215 -
Attachment is obsolete: true
Comment 32•23 years ago
|
||
The patch looks fine to me. If you could replace "aLength >=1 " with
"aLength > 0" in following line, it would be better.
(The latter comparison may take less clock cycle in some machine.).
r=shanjian
+ if (IS_HIGH_SURROGATE(ch) && aLength >= 1 &&
+ IS_LOW_SURROGATE(*(aBuffer+1))) {
Assignee | ||
Comment 33•23 years ago
|
||
Attachment #76275 -
Attachment is obsolete: true
Comment 34•23 years ago
|
||
Comment on attachment 76280 [details] [diff] [review]
v4 of patch
r=shanjian
Attachment #76280 -
Flags: review+
Comment 35•23 years ago
|
||
Comment on attachment 76280 [details] [diff] [review]
v4 of patch
The changes look ok to me with the exception of where you are moving the
SURROGATE macros to ... I don't think nsCRT.h is the correct place for it. Do
we have a UnicodeUtils header somewhere?
Also it would be nice if you put comments near the surrogate code you are
adding that explain what a surrogate pair is, and why you have to get the width
and draw them together. It could help others that look at this code later,
understand it better.
Comment 36•23 years ago
|
||
>Stirng iterator won't solve any of this particular problem since the problem is
>what we really need to iterate through is "text element" not unicode character
>(neither UCS2 nor UTF-16).
That's what I was actually alluding to. Having a specialized iterator to help
encapsulate these aspects, but this needs more thinking/care, so it is a bit
blur in my mind as I speak...
>we found this bug because by default the
>simplfied chinese Office2000 generate text-align: justify and all the surrogate
>characters now are display as blank under this condiction.
I buy this argument, dataloss, which you should have mentioned earlier. I was
curious why "text-align: justify" was a show stopper given that if the text was
rendered as left-align as most documents on the web, people wouldn't even
notice/care about it.
Keywords: dataloss
Assignee | ||
Comment 37•23 years ago
|
||
Comment 38•23 years ago
|
||
Comment on attachment 76327 [details] [diff] [review]
patch v5, include fix of 122584
sr=kin@netscape.com assuming you make the necessary mac project changes so that
they pick up the header file that defines your surrogate macros.
Attachment #76327 -
Flags: superreview+
Assignee | ||
Comment 39•23 years ago
|
||
>which you should have mentioned earlier
read the origional bug report:
>When using MS Word file to generate html file, the file contains "text-align:
>justify" style. With the current build, mozilla is unable to display surrogate
>in this style.
Assignee | ||
Comment 40•23 years ago
|
||
chat with sfraser, he said I don't need to do anything to make the mac build.
Comment 41•23 years ago
|
||
re: read the origional bug report:
means it is doing align-left...
Updated•23 years ago
|
Attachment #76327 -
Flags: approval+
Comment 42•23 years ago
|
||
Comment on attachment 76327 [details] [diff] [review]
patch v5, include fix of 122584
a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Assignee | ||
Comment 43•23 years ago
|
||
fixed and check in
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Comment 44•23 years ago
|
||
Fixed verified on 03-27 trunk build / WinXP-SimpChinese.
Status: RESOLVED → VERIFIED
You need to log in
before you can comment on or make changes to this bug.
Description
•