Closed Bug 130441 Opened 23 years ago Closed 23 years ago

Unable to lay out surrogate in text-align: justify style

Categories

(Core :: Internationalization, defect, P2)

x86
Windows XP
defect

Tracking

()

VERIFIED FIXED
mozilla1.0

People

(Reporter: ji, Assigned: ftang)

References

Details

(Keywords: dataloss, intl, Whiteboard: adt2)

Attachments

(3 files, 5 obsolete files)

When using MS Word file to generate html file, the file contains "text-align: justify" style. With the current build, mozilla is unable to display surrogate in this style. Testcase to follow.
Attached file A testing html file.
Nominating for nsbeta1.
Keywords: intl, nsbeta1
QA Contact: ruixu → ylong
In justify style, we are send each utf16 short separately, and surrogate pair will be separated this way.
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.0
shanjian- I probably can come out a patch quickly. But I cannot test my patch since my winnt4j do not display them anyway. Can you test my patch ?
shanjian- can you verify this on your window which can display surrogate ?
Attachment #74349 - Attachment is obsolete: true
frank, Your patch could lead to correct display of those glyph, but the spacing between those glyph are inconsistent and incorrect. I posted a new patch to address the problem. I reassign it back to you for your consideration.
Assignee: shanjian → ftang
Status: ASSIGNED → NEW
Currently, we have 1:1 mapping between the PRUnichar* and the space array. Shanjian's patch will break that- which will cause issue on Mac and Linux. We should produce the same X for the surrogate low and skip it on Window
I realy want to make sure that 1. make this bug only focus on window display issue 2. do not change the contract beteen layout and gfx Therefore, I would like to spin off the layout change into a different bug and propose to take my origional patch.
Attachment #74806 - Attachment is obsolete: true
Comment on attachment 74349 [details] [diff] [review] testing patch. Not sure will fix the proble. remove the obsolete flag.
Attachment #74349 - Attachment is obsolete: false
nsbeta1+ per i18n triage
Keywords: nsbeta1nsbeta1+
add rbs@maths.uq.edu.au to cc list
Status: NEW → ASSIGNED
Actually shanjian's iteration in attachment 74806 [details] [diff] [review] seems iteresting to me. >1. make this bug only focus on window display issue Adding #ifdef _WIN32 in what shanjian did in nsTextFrame would avoid troubles. >2. do not change the contract beteen layout and gfx It reminds me of the handling of the German's sslig that used some genuflexions. I did a search in LXR, and I am seeing the following defined and redefined here and there. +#define IS_HIGH_SURROGATE(u) +#define IS_LOW_SURROGATE(u) Might be best to consolidate them in one place, e.g., nsICharRepresentable.h. I will give r=rbs on shanjian's iteration with the above consolidation and the #ifdef _WIN32 for protection of other platforms.
/intl/lwbrk/src/nsJISx4501LineBreaker.cpp, line 377 -- #define IS_HIGH_SURROGATE(u) ((PRUnichar)(u) >= (PRUnichar)0xd800 && (PRUnichar)(u) <= (PRUnichar)0xdbff) /intl/uconv/ucvcn/nsUnicodeToGBK.cpp, line 152 -- #define IS_HIGH_SURROGATE(u) (((PRUnichar)0xD800 <= (u)) && ((u) <= (PRUnichar)0xDBFF)) /gfx/src/windows/nsFontMetricsWin.cpp, line 3477 -- #define IS_HIGH_SURROGATE(u) ((PRUnichar)(u) >= (PRUnichar)0xd800 && (PRUnichar)(u) <= (PRUnichar)0xdbff) I think we should put into xpcom/ds/nsCRT.h what do you think ?
>>2. do not change the contract beteen layout and gfx >It reminds me of the handling of the German's sslig that used some genuflexions. The German's sslig does *sp++ I want to make sure there are a 1:1 mapping between PRUnichar* and the spacing array. I don't want to break that even for surrogate .
we should put the width of the surrogate pair into the space for the high surrogate and 0 for the low surrogate
>I think we should put into xpcom/ds/nsCRT.h >what do you think ? Fine with me since other definitions are there too. In addition to IS_HIGH_SURROGATE and IS_LOW_SURROGATE, nsFontMetricsWin.cpp also has SURROGATE_TO_UCS4 which needs to be moved there too. #define IS_HIGH_SURROGATE(u) ((PRUnichar)(u) >= (PRUnichar)0xd800 && (PRUnichar)(u) <= (PRUnichar)0xdbff) #define IS_LOW_SURROGATE(u) ((PRUnichar)(u) >= (PRUnichar)0xdc00 && (PRUnichar)(u) <= (PRUnichar)0xdfff) #define SURROGATE_TO_UCS4(h, l) ((((PRUint32)(h)-(PRUint32)0xd800) << 10) + \ (PRUint32)(l) - (PRUint32)(0xdc00) + 0x10000) >I want to make sure there are a 1:1 mapping between PRUnichar* and the spacing >array. I don't want to break that even for surrogate . I am afraid I still don't see why this is a critical aspect to you. >we should put the width of the surrogate pair into the space for the high >surrogate and 0 for the low surrogate Since it is not known in advance whether the string will have surrogates, that would require to always allocate twice the space (i.e., what is allocated for the German sslig in SmallCaps mode would have to become the default).
Whiteboard: adt2
>In addition to IS_HIGH_SURROGATE and IS_LOW_SURROGATE, nsFontMetricsWin.cpp >also has SURROGATE_TO_UCS4 which needs to be moved there too. Will do. >>I want to make sure there are a 1:1 mapping between PRUnichar* and the spacing >>array. I don't want to break that even for surrogate . >I am afraid I still don't see why this is a critical aspect to you. rbs- even you don't see this is a critical, is there any important issue to against that? we have more and more #ifdef platform in the layout code now. I really want to 1. bring that down one by one, 2. don't introudce a new one if there are no major win/reason. From my point of view, there are no major reason that we should introduce one for this purpose. Do you agree ?
Priority: -- → P2
Attached patch v2 of patch (obsolete) — Splinter Review
Attachment #74349 - Attachment is obsolete: true
seems to be the wrong patch. I don't agree that the 1:1 mapping between PRUnichar* and the spacing array is crucial. But I agree with your other points about risk management. In fact, I have been wondering why you were keen to fix this rather obscure and very minimal bug (in terms of its impact/benefit) -- I would have given it a very low priority, but...
If the bug was futured, perhaps a cleaner integration could be done with feedback from the string savvy people, e.g., using a custom sink/string iterator so as to limit the risk of littering the tree with error-prone tests for IS_HIGH_SURROGATE() and IS_LOW_SURROGATE().
If we're moving (despite initial design assumptions, and by a process of just making a change and then finding the bugs as they pop up, it seems) to an "extended UCS2" or a UTF-16 encoding where we're not guaranteed any character size, why shouldn't we just use UTF-8 and save space for typical pages? Wasn't the initial rationale for the decision to use UCS2 internally that we *wouldn't* have to worry about multiple-unit characters, and were willing to limit ourselves to the characters within Plane 0? Or is there some other rationale for UCS2 that I'm completely missing? Why was this objection ignored with absolutely no discussion? (See bug 118000 comment 19 to the end of the bug.) If we're going to litter this kind of stuff all over the tree, why can't we litter the UTF-8 versions of it around instead? (To respond to rbs more directly: the string library doesn't support 2-byte strings with variable length characters. If we wanted to do that we'd probably need another string hierarchy (nsAUCS4String, etc.).)
>If we're going to litter this kind of stuff all over the tree, why can't >we litter the UTF-8 versions of it around instead? This kind of thing won't be all over the tree. It will only be limited in some part of layout and some part of gfx. a big difference between THIS and UTF8 is all the Sysem API is also moving to UTF-16, not UTF8. For example, all Win32/MacOS API are moving to UTF-16 and use UTF8 will simply introduce additional conversion. Making the code work from UCS2 to UTF-16 only require very small amount of changes. Using UTF-8 require big code chagne plus performance / footprint overhead against all the new system API designed by Microsoft, Apple and other platforms.
David Baron : if you do not agree with me, then discuss this with unicode@unicode.org mailling list or international@w3c.org mailling list. Most of the people will tell you use UTF-16 instead of UTF-8 is the right thing to do.
please focus this bug in this particular problem. The justify layout routine is not very well design for BMP neither. The current algorithm assume one PRUnichar can represent one text element (read unicode standard about text element) which is not true in BMP neither. For example, (1) this algorightm currently break connected Arabic into disconnected part when we display. (2) Also, this algorithm put combining mark and base character sepearately for even latin characters. (and also Greek and Hebrew with points) (3) This will break Thai and indic languages. (4) The algorithm have problem with surrogate too. With in all the problem I described above, fixing (4) is the fastest and easiest one because it only need a range check. To fix the rest need table lookup so we don't want to do it now. Please understand, we don't need to intrduce such surrogate check here if this algorithm itself design against the text element concept in the unicode 3.1 standard. We are very very close the have a good support of surrogate on window, we already support surrogate fonts on window (with this bug). I already have a fix for selection and cursor movement in bug 122584. I will work with jfrancis to find out how to fix editor delete/backspace in near future. After m1.0, I will add MacOS X support and Masaki Kataki from are currently working on adding Surrogate surppot with FreeType code on GTK GFX. (plan for post m1.0) >If the bug was futured, perhaps a cleaner integration could be done with >feedback from the string savvy people, e.g., using a custom sink/string >iterator so as to limit the risk of littering the tree with error-prone >tests for IS_HIGH_SURROGATE() and IS_LOW_SURROGATE(). Stirng iterator won't solve any of this particular problem since the problem is what we really need to iterate through is "text element" not unicode character (neither UCS2 nor UTF-16). It happen that all the current defined surrogate characters ARE one "text element". But in BMP a 'A' follow by a combining ' shoudl form a "text element" >In fact, I have been wondering why you were keen to fix this rather obscure > and very minimal bug (in terms of its impact/benefit) -- I would have given > it a very low priority, but... ok, let me explain to you why we want to fix this bug NOW. The following are some FACTS you should know first 1. Unicode 3.1 define plan 2 as Han Ideograph Extension B and contains many Chinese characters 2. China govement define GB18030 standard as a new standard character set for PR China and they give a FIXED date for all vendor to migrate their support to GB18030 from GB2312 support 3. GB18030 encode all unicode BMP characters and surrogate characters in unicode. and we know China govement will look at the surrogate support when they want to verify their GB18030 support. 4. The internet population in China is now 33.7 M (see http://www.cnnic.net.cn/develst/rep200201-e.shtml ) 5. MS support surrogate in Win XP and Office XP. They already ship fonts which contains Han ideographs Extension B. (both IE and mozilla can display by using it. please read http://www.microsoft.com/globaldev/drintl/015/default.asp ) They already ship chinese input method which support surrogate input . 6. When we test Mozilla's surrogate support by viewing the the html generated by Simplified Chinese Office2000, we found this bug because by default the simplfied chinese Office2000 generate text-align: justify and all the surrogate characters now are display as blank under this condiction. IE display them correctly. We want to fix that so when we try to convience some of our Chinese embedding partner that Gecko is better than IE we won't got push back because of this. I need to fix this for both external and internal customers who care about China markets. Do you think impacting 33.7 M internet users have minimun impact/benefit?
Attached patch the real v2 of patch (obsolete) — Splinter Review
Attachment #76056 - Attachment is obsolete: true
>seems to be the wrong patch. thanks rbs, I just put in the RIGHT patch. :)
frank, I don't understand why you have this line here: + glyphWidth = charWidth + 2 * aTextStyle.mLetterSpacing; The character represented by surrogate pair should be no different from normal characters. It shouldn't have more space following it. I understood that if this aTextStyle.mLetterSpacing is calculated instead of specified directly by CSS, we might have problem in alignment. If that is the case, it should be a separate problem, a problem in calculating mLetterSpacing.
good point. I will resubmit my patch
Attached patch v3 of patch (obsolete) — Splinter Review
Attachment #76215 - Attachment is obsolete: true
The patch looks fine to me. If you could replace "aLength >=1 " with "aLength > 0" in following line, it would be better. (The latter comparison may take less clock cycle in some machine.). r=shanjian + if (IS_HIGH_SURROGATE(ch) && aLength >= 1 && + IS_LOW_SURROGATE(*(aBuffer+1))) {
Attached patch v4 of patchSplinter Review
Attachment #76275 - Attachment is obsolete: true
Comment on attachment 76280 [details] [diff] [review] v4 of patch r=shanjian
Attachment #76280 - Flags: review+
Comment on attachment 76280 [details] [diff] [review] v4 of patch The changes look ok to me with the exception of where you are moving the SURROGATE macros to ... I don't think nsCRT.h is the correct place for it. Do we have a UnicodeUtils header somewhere? Also it would be nice if you put comments near the surrogate code you are adding that explain what a surrogate pair is, and why you have to get the width and draw them together. It could help others that look at this code later, understand it better.
>Stirng iterator won't solve any of this particular problem since the problem is >what we really need to iterate through is "text element" not unicode character >(neither UCS2 nor UTF-16). That's what I was actually alluding to. Having a specialized iterator to help encapsulate these aspects, but this needs more thinking/care, so it is a bit blur in my mind as I speak... >we found this bug because by default the >simplfied chinese Office2000 generate text-align: justify and all the surrogate >characters now are display as blank under this condiction. I buy this argument, dataloss, which you should have mentioned earlier. I was curious why "text-align: justify" was a show stopper given that if the text was rendered as left-align as most documents on the web, people wouldn't even notice/care about it.
Keywords: dataloss
Comment on attachment 76327 [details] [diff] [review] patch v5, include fix of 122584 sr=kin@netscape.com assuming you make the necessary mac project changes so that they pick up the header file that defines your surrogate macros.
Attachment #76327 - Flags: superreview+
>which you should have mentioned earlier read the origional bug report: >When using MS Word file to generate html file, the file contains "text-align: >justify" style. With the current build, mozilla is unable to display surrogate >in this style.
chat with sfraser, he said I don't need to do anything to make the mac build.
Blocks: 104060
re: read the origional bug report: means it is doing align-left...
Attachment #76327 - Flags: approval+
Comment on attachment 76327 [details] [diff] [review] patch v5, include fix of 122584 a=asa (on behalf of drivers) for checkin to the 1.0 trunk
fixed and check in
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Fixed verified on 03-27 trunk build / WinXP-SimpChinese.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: