Closed Bug 162949 Opened 23 years ago Closed 23 years ago

Line wrapping: suspension points are placed at the beginning of a new line.

Categories

(Core :: Internationalization, defect, P1)

x86
Windows XP
defect

Tracking

()

VERIFIED FIXED
mozilla1.0.1

People

(Reporter: ji, Assigned: shanjian)

References

Details

(Keywords: intl, topembed, Whiteboard: [adt2 RTM] [ETA 08/23] [Ready for Mozilla1.0.2])

Attachments

(4 files, 3 obsolete files)

This is reported by one of our customers. Gecko has some problems with line wrapping. Steps to reproduce: 1. Open attached testing html page. 2. Resize the window to certain point, you can see the suspension points are placed at the head of a new line, which doesn't following line wrapping rules.. IE doesn't have this problem.
Attached file A testing html page
Keywords: intl, nsbeta1, topembed
Attached image A screenshot
QA Contact: ruixu → ylong
Xianglan and I found there are two puncuations will has this problem: "бн" and "б├"
Blocks: 154896
Severity: normal → critical
Priority: -- → P1
Whiteboard: [adt2 RTM] [ETA 08/21] [Needs ADT & Drivers' Approval]
Target Milestone: --- → mozilla1.0.1
We might want to change the class of those characters. Need to check the standard to make sure we are not breaking anything.
Status: NEW → ASSIGNED
Unicode Standard Annex #14 Line Breaking Properties http://www.unicode.org/unicode/reports/tr14/ Data file: http://www.unicode.org/Public/3.2-Update/LineBreak-3.2.0.txt ... 2026;IN # HORIZONTAL ELLIPSIS ... 2236;AI # RATIO ... From the above TechReport: AI - Ambiguous (Alphabetic or Ideographic) - Characters with Ambiguous East Asian Width - act like AL when the resolved EAW is N otherwise act as ID IN - Inseparable - Leaders - allow only indirect line breaks between pairs.
Attached patch patch (obsolete) — Splinter Review
forget about the commented line, it will be removed if agreed with this approach.
For character u+2026, we are handling it based on JISX4051. Chinese users might expecting something different. We can change to what IE does (see patch above). All other character belong to the same category will be changed as well. (ie, u+2024, u+2025). This is also the same as IE. u+2236 can be treated as colon, but it is defined as RATIO. RATIO should allow line break before it, and IE does the same thing. We probably don't want to do anything for it.
Whiteboard: [adt2 RTM] [ETA 08/21] [Needs ADT & Drivers' Approval] → [adt2 RTM] [ETA 08/22] [Needs Reviews]
I am thinking about a fix that could make use line break depend on locale. Not sure we can do it now or not. Shanjian: can you look into the following? 1. in nsLWBreakerFImp::GetBreaker, do we got different aParam depend on the charset ? 2. if we change gCnNoBegin and add 2026 and 2236 to gCnNoBegin, will that pass to the nsJISx4501LineBreaker constructor 3. can we add code in nsJISx4501LineBreaker constructor to remember the passed in aNoBegin, aNoEnd characters in a member data mNoBegin and mNoEnd 4. in nsJISx4501LineBreaker GetClass, if we add code in the beginning to match the mNoBegin , if the characters is in it, return class 5 if the characters match mNoEnd, return class 1.
> 1. in nsLWBreakerFImp::GetBreaker, do we got different aParam depend on the > charset ? No, we didn't pass anything in existing code. > 2. if we change gCnNoBegin and add 2026 and 2236 to gCnNoBegin, will that pass > to the nsJISx4501LineBreaker constructor Yes, it will if we fix No. 1. > 4. in nsJISx4501LineBreaker GetClass, if we add code in the beginning to match > the mNoBegin , if the characters is in it, return class 5 > if the characters match mNoEnd, return class 1. u+2026 has to be treated differently from other punctuation like period '.'. Two connecting u+2026 should not be separated, otherwise it is an even worse problem. It doesn't seems really necessary to me to treat line break differently for ja and cn. Unicode tr14 doesn't, IE doesn't , why should we? It is reasonable to me not to put u+2026 in the beginning of line in japanese context as well.
pls help, this is wanted asap by a major embeddor. cc'ing a few folks to get reviews.
From comment #9 > It doesn't seems really necessary to me to treat line break differently for > ja and cn. Unicode tr14 doesn't, IE doesn't , why should we? It is > reasonable to me not to put u+2026 in the beginning of line in japanese > context as well. I'd tend to agree. I think having an extra break at a breakable point (for Japanese) is better than breaking at a non-breakable point (for Chinese). Also, maybe JIS will update this later.
>It is reasonable to me not to put u+2026 in the beginning of line in japanese > context as well. I disagree with the following reason 1. in JISx4501-1995, a. appendix 1 clearly defined that character class 7 are those characters defined in appendix 8, b. and in appendix 8, it define the following 4 characters:U+2014 EM DASH, U+2024 ONE DOT LEADER, U+202 TWO DOT LEADER, U+2026 HORIZONTAL ELLIPSIS c. table 3 clearly define that class 7 could not break if the previous characters are class 7 or class 1 (which mean U+0028 LEFT PARENTHESIS, U+005b LEFT SQUARE BRACKET, U+007B LEFT CURLY BRACKET, U+2018 LEFT SINGLE QUOTATION MARK, U+ 201B SINGLE HIGH-REVERSED 9 QUOTATION MARK U+201C LEFT DOUBLE QUOTATION MARK, U+201F DOUBLE HIGH-REVSERED 9 QUOTATUON MARK U+3008, U+300A, U+300C, U+300E, U+3010, U+3014, U+3016, U+3018, U+301A, U+301D for other characters, it can break after them Therefore, it will violate JISx4501 if we make the japanese page no breakable before U+2026 2. In the Developing International Software for Window 95, Nardin's book page 239, it said Japanese Line breaking is based on kinsoku rule- you can break lines between any two characters, with sevearl exceptions. ... and U+2026 are NOT listed under the 3 lists of exception later in the chinese case, U+2026 IS listed under the 3 exception I think we should not and don't need to trade off jisx4501 comformance with the fix for this bug. There are no reason we cannot or should not make japanse page comform to jisx4501 AND the nardin's book and in the mean time make chinese page comform to Nardin's book
ie seems does not line break before U+2026. However, if you try Ms Word if you set the text as Japanese, then it is breakable before U+2026, but if you set the text as Chinese or English , then it is not breakable before U+2026
>4. in nsJISx4501LineBreaker GetClass, if we add code in the beginning to match >the mNoBegin , if the characters is in it, return class 5 >if the characters match mNoEnd, return class 1. I think I said it wrong, it should be if it match characters in mNoBegin, return 1 if it match characters in mNoEnd, return 5 currently value 1 (which is [a] ) represent the jisx4501 class 2,3,4,5,6 and value 5 (which is [b]) represent the jisx4501 class 10,11,12,17 >u+2026 has to be treated differently from other punctuation why should we? why should we treat U+2026 different from U+3008 in the case of chinese ? Yes, we should not break between two U+2026, we should not break between two U+3008 in chinese neither.
let me summarize my position & believe about this bug: 1. currently we display japanese page according to jis x4501, I don't want to change that. 2. it is ok to change the line wrapping behavior to chinese page now but I don't want to impact japanese 3. from 2, it seems we need a charset sensitive line breaking behavior. We do not have it now but we do have half of the work/design there 4. I don't want to touch gPair in nsJISx4501LineBreaker.cpp at all. This is not designed to be change or subclassed. 5. it is OK to change the return value of GetClass based on the langgroup of the document.
This is what I have in mind to be changed inside intl/lwbrk of course, we need more code in mozilla/content/base/src/nsDocument.cpp to pass in the parameter Bascailly, if we can get a nsIPresContext in the nsDocument.cpp where we call GetBreaker, then we can call the nsIPresContext::GetLangauge to got the lang group and pass in the argument.
from lxr.mozilla.org, it looks like we need to add the nsIPresContext::GetLanguage code in the following places which we call GetBreaker * editor/libeditor/text/nsInternetCiter.cpp: o View change log or Blame annotations line 238 * editor/libeditor/text/nsWrapUtils.cpp: o View change log or Blame annotations line 72 # content/base/src/nsDocument.cpp: * View change log or Blame annotations line 1076 # content/base/src/nsPlainTextSerializer.cpp: * View change log or Blame annotations line 180 # content/xul/document/src/nsXULDocument.cpp: * View change log or Blame annotations line 981
If we really think it is too much risk to make it charset depend on branch, then I prefer the following fix then the change in the gPair (see later patch)
so. this patch make it the same across from ja,cn,ko,tc I don't think this is the right fix for trunk I am willing to take this fix for branch after we ship machv I think we should make the trunk build charset dependent but for branch, that could be too risky. In that case, I rather we change the character to class mapping instead of the pair table notice this patch look big, But the real change is only two lines in the jisx4501class.h. The rest is fixing up the generation tools to match up the mpl version in the current jisx4501class.h and also change the source to the table to change the class definitation of 4 characters. (plus one which we chnage in the h but does not chagne in the source file )
frank, change the character class is more evil than the gPairs table. You must keep u+2026 u+2026 inseparatable. Break between бнбн will not be accepted in either Chinese or Japanese users. How will normal japanese user treat this бн? Can you grab a native japanese user and find out the answer? How do they think if we do not allow break before бн ? Is it preferred, or don't care, or unacceptable?
I think we should take out the following changes from the patch RCS file: /cvsroot/mozilla/intl/lwbrk/tools/jisx4501class.txt,v -2014;;7 +2014;;2 and RCS file: /cvsroot/mozilla/intl/lwbrk/src/jisx4501class.h,v -0x88828888, // U+2010 - U+2017 +0x88818888, // U+2010 - U+2017
Attachment #96224 - Attachment is obsolete: true
Attachment #96285 - Attachment is obsolete: true
Attachment #96288 - Attachment is obsolete: true
notice that jisx4501class.h is machine generated by lwbrk/tools/anzx4501.pl if we don't change the mpl in lwbrk/tools/anzx4501.pl then the diff will show up in jisx4501class.h
Comment on attachment 96366 [details] [diff] [review] patch which only change U+2024,2025,2026 class definitation I talked with frank over aim and reached consesus on the approach. r=shanjian
Attachment #96366 - Flags: review+
This may be too late but let me add this comment any way. It seems to me that we have here a break between what is expected of the standard like JIS X 4051 and what web developers expect to see. 1. One problem is that Class 7 characters like u+2026 are not part of the Kinsoku characters according to JIS X 4051. 2. However, by the rules of Separation, class 7 characters cannot be sperated after and before certain sets of characters. Because of this they look like Kinsoku characters in some cases. 3. Kano's and Ken Lunde's books do not list Class 7 characters as Kinsoku characters. This is true. 4. However, there are some popular books and web pages that list u+2026 and other Class 7 characters as prohibited from beginning a line. 5. It is also true that JIS x 4051 prohibts breaking up u+2026 u+2026 sequence. It seems to me that we cannot deal with these issues unless we have full imeplementation of JIS X 4051 including Rules of Separation, Elongation, and Spacing in addition to Line Breaking. From an apperance point of view, I agree with position in #4 above. Having u+2026 at the beginning of a line does not look good and lots of people would agree with that regardless of what JIS X 4051 says. For this reason, I think it would be Ok to us to treat u+2026 and other Class 7 characters as Kinsoku characters.
Comment on attachment 96366 [details] [diff] [review] patch which only change U+2024,2025,2026 class definitation rs=rbs
Attachment #96366 - Flags: superreview+
answser to momoi #25, We are doing exactly that with exception of u+2014. (People probably don't want to see u+2014 in kinsoku.) Frank just used a different approach than my original one. The result should be the same.
checked into the trunk so that we can test it before branch landing.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
ylong: pls verify this as fixed on the trunk. thanks!
Keywords: mozilla1.0.2
Whiteboard: [adt2 RTM] [ETA 08/22] [Needs Reviews] → [adt2 RTM] [ETA 08/23] [Ready for Mozilla1.0.2]
ruixu- can you verify on trunk ?
I still see the problem of suspension points with 2002-08-26-04 trunk. Reopen. It is still reproducible with another style of suspension points, which is made by several full stops. And also, the similar problem happens on the followings: single byte "-", "." and "\" Test case attached.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached file one more test case
Test case for "...","-","." and "\"
This patch only fixed u+2024, u+2025, u+2026. Those single byte specific punctuation like '.' and '\' is not used in chinese or japanese, so they are not a problem.
Status: REOPENED → RESOLVED
Closed: 23 years ago23 years ago
Resolution: --- → FIXED
Comment on attachment 96366 [details] [diff] [review] patch which only change U+2024,2025,2026 class definitation a=chofmann for 1.0.2
Attachment #96366 - Flags: approval+
patch checked to 1.02 branch.
I got confused. With Simplied Chinese IME, we definitely can use the following single byte signs: "...","-","." and "\". At least, the single byte minus sign "-" is used very popular in Simplied Chinese. Should we separate these issues to another bugs?
Marking as "mozilla1.0.2+" per Comment #34 From chris hofmann.
mark it as fixed1.0.2
Rui, please file a new bug and we can discuss the issue there. Those things can only be considered case by case.
OK, separated other issues to bug 164759. Mark this bug as VERIFIED (with 2002-08-26-04 trunk).
Status: RESOLVED → VERIFIED
verified as fixed with 2002-08-28-08-1.0 build.
If you were changing the license anyway we should have switched this to the MPL tri-license form. Keep that in mind for next time.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: