Closed Bug 162949 Opened 22 years ago Closed 22 years ago

Line wrapping: suspension points are placed at the beginning of a new line.

Categories

(Core :: Internationalization, defect, P1)

x86
Windows XP
defect

Tracking

()

VERIFIED FIXED
mozilla1.0.1

People

(Reporter: ji, Assigned: shanjian)

References

Details

(Keywords: intl, topembed, Whiteboard: [adt2 RTM] [ETA 08/23] [Ready for Mozilla1.0.2])

Attachments

(4 files, 3 obsolete files)

This is reported by one of our customers. Gecko has some problems with line
wrapping.

Steps to reproduce:
1. Open attached testing html page.
2. Resize the window to certain point, you can see the suspension points are
placed at the head of a new line, which doesn't following line wrapping rules..
IE doesn't have this problem.
Attached file A testing html page
Keywords: intl, nsbeta1, topembed
Attached image A screenshot
QA Contact: ruixu → ylong
Xianglan and I found there are two puncuations will has this problem:
"бн" and "б├" 
Blocks: 154896
Severity: normal → critical
Priority: -- → P1
Whiteboard: [adt2 RTM] [ETA 08/21] [Needs ADT & Drivers' Approval]
Target Milestone: --- → mozilla1.0.1
We might want to change the class of those characters. Need to check the
standard to make sure we are not breaking anything. 
Status: NEW → ASSIGNED
Unicode Standard Annex #14 Line Breaking Properties
http://www.unicode.org/unicode/reports/tr14/

Data file: http://www.unicode.org/Public/3.2-Update/LineBreak-3.2.0.txt
...
2026;IN # HORIZONTAL ELLIPSIS
...
2236;AI # RATIO
...

From the above TechReport:
AI
    - Ambiguous (Alphabetic or Ideographic)
    - Characters with Ambiguous East Asian Width
    - act like AL when the resolved EAW is N otherwise act as ID
IN
    - Inseparable
    - Leaders
    - allow only indirect line breaks between pairs.
Attached patch patch (obsolete) — Splinter Review
forget about the commented line, it will be removed if agreed with this
approach.
For character u+2026, we are handling it based on JISX4051. Chinese users might
expecting something different. We can change to what IE does (see patch above).
All other character belong to the same category will be changed as well. (ie,
u+2024, u+2025). This is also the same as IE. 

u+2236 can be treated as colon, but it is defined as RATIO. RATIO should allow
line break before it, and IE does the same thing. We probably don't want to do
anything for it.
Whiteboard: [adt2 RTM] [ETA 08/21] [Needs ADT & Drivers' Approval] → [adt2 RTM] [ETA 08/22] [Needs Reviews]
I am thinking about a fix that could make use line break depend on locale. Not
sure we can do it now or not.
Shanjian: can you look into the following?
1. in nsLWBreakerFImp::GetBreaker, do we got different aParam depend on the
charset ?
2. if we change gCnNoBegin and add 2026 and 2236 to gCnNoBegin, will that pass
to the nsJISx4501LineBreaker constructor
3. can we add code in nsJISx4501LineBreaker constructor to remember the passed
in aNoBegin, aNoEnd characters in a member data mNoBegin and mNoEnd
4. in nsJISx4501LineBreaker GetClass, if we add code in the beginning to match
the mNoBegin , if the characters is in it, return class 5
if the characters match mNoEnd, return class 1.
> 1. in nsLWBreakerFImp::GetBreaker, do we got different aParam depend on the
> charset ?
No, we didn't pass anything in existing code. 

> 2. if we change gCnNoBegin and add 2026 and 2236 to gCnNoBegin, will that pass
> to the nsJISx4501LineBreaker constructor
Yes, it will if we fix No. 1.

> 4. in nsJISx4501LineBreaker GetClass, if we add code in the beginning to match
> the mNoBegin , if the characters is in it, return class 5
> if the characters match mNoEnd, return class 1.

u+2026 has to be treated differently from other punctuation like period '.'. 
Two connecting u+2026 should not be separated, otherwise it is an even worse
problem. 

It doesn't seems really necessary to me to treat line break differently for ja
and cn. Unicode tr14 doesn't, IE doesn't , why should we? It is reasonable to me
not to put u+2026 in the beginning of line in japanese context as well. 
pls help, this is wanted asap by a major embeddor. cc'ing a few folks to get
reviews. 
From comment #9
> It doesn't seems really necessary to me to treat line break differently for
> ja and cn. Unicode tr14 doesn't, IE doesn't , why should we? It is
> reasonable to me not to put u+2026 in the beginning of line in japanese
> context as well. 

I'd tend to agree.  I think having an extra break at a breakable point (for
Japanese) is better than breaking at a non-breakable point (for Chinese).

Also, maybe JIS will update this later.
>It is reasonable to me not to put u+2026 in the beginning of line in japanese
> context as well. 
I disagree with the following reason
1. in JISx4501-1995, 
a. appendix 1 clearly defined that character class 7 are those characters
defined in appendix 8, 
b. and in appendix 8, it define the following 4 characters:U+2014 EM DASH,
U+2024 ONE DOT LEADER, U+202 TWO DOT LEADER, U+2026 HORIZONTAL ELLIPSIS
c. table 3 clearly define that class 7 could not break if the previous
characters are class 7 or class 1 (which mean U+0028 LEFT PARENTHESIS, U+005b
LEFT SQUARE BRACKET, U+007B LEFT CURLY BRACKET, U+2018 LEFT SINGLE QUOTATION
MARK, U+ 201B SINGLE HIGH-REVERSED 9 QUOTATION MARK U+201C LEFT DOUBLE QUOTATION
MARK, U+201F DOUBLE HIGH-REVSERED 9 QUOTATUON MARK U+3008, U+300A, U+300C,
U+300E, U+3010, U+3014, U+3016, U+3018, U+301A, U+301D

for other characters, it can break after them

Therefore, it will violate JISx4501 if we make the japanese page no breakable
before U+2026

2. In the Developing International Software for Window 95, Nardin's book page
239, it said
Japanese Line breaking is based on kinsoku rule- you can break lines between any
two characters, with sevearl exceptions. ...
and U+2026 are NOT listed under the 3 lists of exception 
later in the chinese case, U+2026 IS listed under the 3 exception

I think we should not and don't need to trade off jisx4501 comformance with the
fix for this bug. There are no reason we cannot or should not make japanse page
comform to jisx4501 AND the nardin's book and in the mean time make chinese page
comform to Nardin's book
ie seems does not line break before U+2026. However, if you try Ms Word
if you set the text as Japanese, then it is breakable before U+2026, but if you
set the text as Chinese or English , then it is not breakable before U+2026
>4. in nsJISx4501LineBreaker GetClass, if we add code in the beginning to match
>the mNoBegin , if the characters is in it, return class 5
>if the characters match mNoEnd, return class 1.
I think I said it wrong, it should be
if it match characters in mNoBegin, return  1
if it match characters in mNoEnd, return 5

currently value 1 (which is [a] ) represent the jisx4501 class 2,3,4,5,6
and value 5 (which is [b]) represent the jisx4501 class 10,11,12,17

>u+2026 has to be treated differently from other punctuation
why should we? why should we treat U+2026 different from U+3008 in the case of
chinese ? Yes, we should not break between two U+2026, we should not break
between two U+3008 in chinese neither. 



let me summarize my position & believe about this bug:
1. currently we display japanese page according to jis x4501, I don't want to
change that.
2. it is ok to change the line wrapping behavior to chinese page now but I don't
want to impact japanese
3. from 2, it seems we need a charset sensitive line breaking behavior. We do
not have it now but we do have half of the work/design there
4. I don't want to touch gPair in nsJISx4501LineBreaker.cpp at all. This is not
designed to be change or subclassed. 
5. it is OK to change the return value of GetClass based on the langgroup of the
document.
This is what I have in mind to be changed inside intl/lwbrk
of course, we need more code in mozilla/content/base/src/nsDocument.cpp 
to pass in the parameter
Bascailly, if we can get a nsIPresContext in the nsDocument.cpp where we call
GetBreaker, then we can call the nsIPresContext::GetLangauge to got the lang
group
and pass in the argument.
from lxr.mozilla.org, it looks like we need to add the
nsIPresContext::GetLanguage code in the following places which we call GetBreaker

    * editor/libeditor/text/nsInternetCiter.cpp:
    o View change log or Blame annotations line 238
    * editor/libeditor/text/nsWrapUtils.cpp:
    o View change log or Blame annotations line 72
# content/base/src/nsDocument.cpp:

* View change log or Blame annotations line 1076 
# content/base/src/nsPlainTextSerializer.cpp:

* View change log or Blame annotations line 180
# content/xul/document/src/nsXULDocument.cpp:

* View change log or Blame annotations line 981 
If we really think it is too much risk to make it charset depend on branch, then
I prefer the following fix then the change in the gPair (see later patch)
so. this patch make it the same across from ja,cn,ko,tc
I don't think this is the right fix for trunk
I am willing to take this fix for branch after we ship machv
I think we should make the trunk build charset dependent
but for branch, that could be too risky. In that case, I rather we change the
character to class mapping instead of the pair table
notice this patch look big, But the real change is only two lines in the
jisx4501class.h. The rest is fixing up the generation tools to match up the mpl
version in the current jisx4501class.h and also change the source to the table
to change the class definitation of 4 characters. (plus one which we chnage in
the h but does not chagne in the source file )
frank, 
change the character class is more evil than the gPairs table. You must keep
u+2026 u+2026 inseparatable. Break between бнбн will not be accepted in either
Chinese or Japanese users.

How will normal japanese user treat this бн? Can you grab a native japanese user
and find out the answer? How do they think if we do not allow break before бн
? Is it preferred, or don't care, or unacceptable?    





I think we should take out the following changes from the patch
RCS file: /cvsroot/mozilla/intl/lwbrk/tools/jisx4501class.txt,v
-2014;;7
+2014;;2

and
RCS file: /cvsroot/mozilla/intl/lwbrk/src/jisx4501class.h,v
-0x88828888, // U+2010 - U+2017
+0x88818888, // U+2010 - U+2017

Attachment #96224 - Attachment is obsolete: true
Attachment #96285 - Attachment is obsolete: true
Attachment #96288 - Attachment is obsolete: true
notice that jisx4501class.h is machine generated by lwbrk/tools/anzx4501.pl if
we don't change the mpl in lwbrk/tools/anzx4501.pl then the diff will show up in
jisx4501class.h
Comment on attachment 96366 [details] [diff] [review]
patch which only change U+2024,2025,2026 class definitation

I talked with frank over aim and reached consesus on the approach. r=shanjian
Attachment #96366 - Flags: review+
This may be too late but let me add this comment any way.
It seems to me that we have here a break between what is expected of
the standard like JIS X 4051 and what web developers expect to see.

1. One problem is that Class 7 characters like u+2026 are not part of 
   the Kinsoku characters according to JIS X 4051.
2. However, by the rules of Separation, class 7 characters cannot
   be sperated after and before certain sets of characters. 
   Because of this they look like Kinsoku characters in some cases.
3. Kano's and Ken Lunde's books do not list Class 7 characters 
   as Kinsoku characters. This is true.
4. However, there are some popular books and web pages that 
   list u+2026 and other Class 7 characters as prohibited 
   from beginning a line.
5. It is also true that JIS x 4051 prohibts breaking up u+2026 u+2026
   sequence.

It seems to me that we cannot deal with these issues unless we have
full imeplementation of JIS X 4051 including Rules of Separation, Elongation,
and Spacing in addition to Line Breaking. 

From an apperance point of view, I agree with position in #4 above. Having
u+2026 at the beginning of a line does not look good and lots 
of people would agree with that regardless of what JIS X 4051 
says. For this reason, I think it would be Ok to us to treat 
u+2026 and other Class 7 characters as Kinsoku characters. 
Comment on attachment 96366 [details] [diff] [review]
patch which only change U+2024,2025,2026 class definitation

rs=rbs
Attachment #96366 - Flags: superreview+
answser to momoi #25,

We are doing exactly that with exception of u+2014. (People probably don't want
to see u+2014 in kinsoku.) Frank just used a different approach than my original
one.
The result should be the same. 
checked into the trunk so that we can test it before branch landing. 
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
ylong: pls verify this as fixed on the trunk. thanks!
Keywords: mozilla1.0.2
Whiteboard: [adt2 RTM] [ETA 08/22] [Needs Reviews] → [adt2 RTM] [ETA 08/23] [Ready for Mozilla1.0.2]
ruixu- can you verify on trunk ?
I still see the problem of suspension points with 2002-08-26-04 trunk. 
Reopen.

It is still reproducible with another style of suspension points, which is made 
by several full stops.
And also, the similar problem happens on the followings:
single byte "-", "." and "\"

Test case attached.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached file one more test case
Test case for "...","-","." and "\"
This patch only fixed u+2024, u+2025, u+2026. Those single byte specific
punctuation like '.' and '\' is not used in chinese or japanese, so they are not
a problem.
Status: REOPENED → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → FIXED
Comment on attachment 96366 [details] [diff] [review]
patch which only change U+2024,2025,2026 class definitation

a=chofmann for 1.0.2
Attachment #96366 - Flags: approval+
patch checked to 1.02 branch. 
I got confused. With Simplied Chinese IME, we definitely can use 
the following single byte signs: "...","-","." and "\". 
At least, the single byte minus sign "-" is used very popular 
in Simplied Chinese.

Should we separate these issues to another bugs?
Marking as "mozilla1.0.2+" per Comment #34 From chris hofmann.
mark it as fixed1.0.2
Rui, please file a new bug and we can discuss the issue there. Those things can
only be considered case by case. 
OK, separated other issues to bug 164759.
Mark this bug as VERIFIED (with 2002-08-26-04 trunk).
Status: RESOLVED → VERIFIED
verified as fixed with 2002-08-28-08-1.0 build.
If you were changing the license anyway we should have switched this to the MPL
tri-license form. Keep that in mind for next time.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: