Open Bug 164759 Opened 17 years ago Updated 2 years ago

Line wrapping issues with some punctuations

Categories

(Core :: Internationalization, defect)

defect
Not set

Tracking

()

ASSIGNED

People

(Reporter: ruixu, Assigned: jshin1987)

References

(Blocks 1 open bug)

Details

(Keywords: fixed1.4, intl)

Attachments

(4 files, 2 obsolete files)

Build: 2002-08-26-04 trunk

This is seperated from bug 162949.
Line wrappings are incorrect with some punctuations, e.g. 
"...", "-", "." and "\" in single byte.

Please refer to the test case attached. (better to compare with IE)
Attached file test case
Keywords: intl
QA Contact: ruixu → ylong
*** Bug 202883 has been marked as a duplicate of this bug. ***
See attachment 121366 [details] for a bit more extensive list of test cases.

In case of U+002E (full stop), it's easy to fix. In the following,
the condition (U_SPACE != next) in line 365
(http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#365)
is too broad. It need to be removed all together or has to be narrowed down.
U+002E is tricky because even in CJK text, somebody might use it 
between CJK Ideographs or Hangul syllables for abbreviation (as in
'e.g.'). If we can put it aside as too edge a case, we can just remove
line 365 and 366 unless there's a special provision about U+002E
in JIS X 4501 that I don't know about. Does JIS X 4501 stipulate
that U+002E cannot end a line if it's followed by anything other
than SPACE? It could be, but I guess it doesn't allow it to 
start a line no matter what. Currently, line 366 (assigning
class 8 to U+002E) effectively allows line break both before and after U+002E. 

350 PRInt8  nsJISx4501LineBreaker::ContextualAnalysis(
351    PRUnichar prev, PRUnichar cur, PRUnichar next
352 )
353 {
.....
360    else if( U_PERIOD == cur)
361    {
362          if( (IS_ASCII_DIGIT (prev) || (0x0020 == prev) )&& 
363              IS_ASCII_DIGIT (next) )
364            return NUMERIC_CLASS;
365          if( U_SPACE != next )
366            return CHARACTER_CLASS;
367    }
......
374    return this->GetClass(cur);


As for U+005C(reverse solidus), it's class 3(in Mozilla's reduced class scheme
( see 
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/jisx4501class.h and
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/jisx4051pairtable.txt )
so that it can start a line. To me, it's odd to see a line starting with
U+005C. (btw, U+002F is class 8 and it can both start and end a line.
although not so much as U+005C, it's a bit strange to start a line with
U+002F.) Anyway, we'd not want to break away from JIS X 4501 for
Japanese.  Therefore, I think
we have to special-case U+005C (and possibly U+002F) for zh and ko.

Other punctuation marks have be dealt with in lang/locale-specific
line breakers, IMO (see bug 203016 and bug 56652)
 Instead of just removing two lines
(which would make lines break at the first full stop in cases 
like 'i.e.' or 'e.g.'), I assigned CHARACTER_CLASS if both prev. 
and next chars are CHARACTER_CLASS.
Comment on attachment 121406 [details] [diff] [review]
a simple fix for full stop only. 

Asking for review. The patch is to allow break after a full stop when it's
followed by a character that can be broken with class 1(when 0 is the first
class), full stop's 'native' class unless it's between
class 8 characters (as in 'e.g.').

BTW, to short-circuit the most common case
(next == SPACE)
we may use

if (next != U_SPACE &&
    GetClass(prev) ==  ...... &&
    GetClass(next) == .......)

instead of

+	  if( GetClass(prev) == CHARACTER_CLASS && 
+	      GetClass(next) == CHARACTER_CLASS )

Let's deal with the 'simplest' of all (full stop). It seems like JIS X 4051 and
UTR #14 are in disagreement on other punctuation marks and they'd better be
dealt with in a broader context (see my comment in bug 56652).
Attachment #121406 - Flags: superreview?(bzbarsky)
Attachment #121406 - Flags: review?(yokoyama)
Comment on attachment 121406 [details] [diff] [review]
a simple fix for full stop only. 

sr=bzbarsky, but including just a _little_ more context in that diff would have
saved me having to open the file and read the surrounding code...
Attachment #121406 - Flags: superreview?(bzbarsky) → superreview+
Attachment #121406 - Flags: review?(yokoyama) → review+
I'm really sorry to take back my patch after getting r and sr, but I'd
rather be sorry now than to repatch it after commiting a not-so-right
patch.  The more I think about it, the more like a mine field it becomes ;-).  
My patch doesn't cover cases like 'a.3', '3.a', or '<cjk>.3' (as used in section
numbers
and equation numbers). At least in the first two cases,
we don't want to break after the full stop. If our bottom line is
that a full stop can never start a line, perhaps a better condition
is 'GetClass(next) == 6 || GetClass(next) == 8' (simply put, class 6 is
numeric digits and class 8 is alphabetic letters).

For myself and others, here I'm summarizing the difference between
the native class of the character in question(full stop) class 1 and
CHARACTER_CLASS (class 8).  (I'm using the actual class numbers
used in C++ code instead of comments)
(see
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#169)

 1. breaking before 

 - No line breaking is allowed before class 1 
 - No line breaking is allowed before class 8 when the it's preceeded by 
   a character of class 0, 6(numeric), 7, and 8(alphabetic letters). 
   That is, a character of class 8 can start a line if it's following
   a character of class 1 - 5.
 
 - difference : if prev == class [1..5], class 8 starts a line but
   class 1 cannot even in that case. 
 
 - interim conclusion : we don't have to special case 'full stop'
   *based on prev. char.* because we don't want it to start a line.
   The only exception is when it follows space and is followed by
   a numeric digit, which is taken care of before the fragment
   of the code we're discussing. (see
  
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#362
)
   
   
 2. breaking after
   - line breaking is allowed after class 1 except when  followed by
     class 1
   - line breaking is allowed after class 8 except when followed by
     class 1, 6, 7, and 8
   
   - difference : class 8 cannot end a line before class [6..8]
     while class 1  can before class [6..8]. [1]

Based on this comparison and  the fact that we don't want to break
after '.' in cases like 'a.3', '3.a' and probably '<cjk>.a' as well
(i.e. we don't want to break after '.' if it's followed by 
class 6 and 8) and we don't want to start a line with '.'
unless it's at the beginning of fractions ( .357 ). 
I propose to replace the condition 'next != U_SPACE' in
line #365
(http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#365)
with a new condition 'GetClass(next) == 6 || GetClass(next) == 8'

 
Sorry again and thanks. 
  
[1] The only class 7 character is U+2126 (Ohm sign). I wonder
what's so special about it to make it a class of its own :-)
Attached patch a new patch (obsolete) — Splinter Review
> 1. break before
  .....
> - difference : if prev == class [1..5], class 8 starts a line but
>   class 1 cannot even in that case. 
 
> - interim conclusion : we don't have to special case 'full stop'
>   *based on prev. char.* because we don't want it to start a line.

Actually, we have to preserve this difference.Therefore, if prev. char. class
is [1..5], we should keep the native class of the full stop. In my
previous comment, I suggested using (next_class==6 || next_class=8).
Here I'm using (next_class > 5), which is equivalent because
class 9 is Thai and it breaks with everything and class 7(a class
made of a single character) behaves the same way as class 6/8
when following full-stop.
Attachment #121406 - Attachment is obsolete: true
Comment on attachment 122108 [details] [diff] [review]
a new patch

Shanjian, I'm really sorry to bother you again with this.(I'm aware that you're
now in other projects). But I couldn't help because I discovered a problem with
my prev. patch.
I guess this is as good as I can make it within JIS X 0451 framework.

Boris, I'm sorry to you, too. There's a bit
more context in this patch and 
I gave a rather long rationale for my 
patch in the bugzilla.

Thank you !
Attachment #122108 - Flags: superreview?(bzbot)
Attachment #122108 - Flags: review?(shanjian)
Comment on attachment 122108 [details] [diff] [review]
a new patch

Sorry for spamming.
I realized that bz is away. I'm asking dbaron 
for sr.
Attachment #122108 - Flags: superreview?(bzbot) → superreview?(dbaron)
Comment on attachment 122108 [details] [diff] [review]
a new patch

Shanjian seems too busy (he's not any more on mozilla-side). Because Simon has
doen a lot of work on linebreaking, I'm asking him for review. 
Thanks
Attachment #122108 - Flags: superreview?(heikki)
Attachment #122108 - Flags: superreview?(dbaron)
Attachment #122108 - Flags: review?(smontagu)
Attachment #122108 - Flags: review?(shanjian)
Depends on: line-breaking
Comment on attachment 122108 [details] [diff] [review]
a new patch

>+         if( pc > 5 && pc < 1 && GetClass(next) > 5 )

This can't be right since this if-clause can never be true (pc can never be
more than 5 and less than 1 at the same time).
Attachment #122108 - Flags: superreview?(heikki) → superreview-
Attached file another test case
In place of (pc < 1 && pc>5), I have ( pc==0 || pc > 5). It's beyond me how
that blunder slipped in (perhaps I hand-edited the patch ....)
Attachment #122108 - Attachment is obsolete: true
Attachment #122108 - Flags: review?(smontagu)
Comment on attachment 123780 [details] [diff] [review]
a new patch with the stupid mistake fixed

sorry for unnecessary spamming due to my mistake.
Attachment #123780 - Flags: superreview?(heikki)
Attachment #123780 - Flags: review?(smontagu)
Comment on attachment 123780 [details] [diff] [review]
a new patch with the stupid mistake fixed

r=smontagu
Attachment #123780 - Flags: review?(smontagu) → review+
jshin: I am assigning this to a right owner. I hope you don't mind.
Assignee: yokoyama → jshin
Roy, no problem. Accepting.
Status: NEW → ASSIGNED
Comment on attachment 123780 [details] [diff] [review]
a new patch with the stupid mistake fixed

>Index: intl/lwbrk/src/nsJISx4501LineBreaker.cpp
>===================================================================
>+     // character/numeric class (class 8, 6, and 7. class 9 (Thai) 

Wouldn't it make more sense to say classes 6-8?

sr=heikki
Attachment #123780 - Flags: superreview?(heikki) → superreview+
Comment on attachment 123780 [details] [diff] [review]
a new patch with the stupid mistake fixed

Thank you for r/sr. I'll change the comment as you suggested.

Asking for a1.4. This is a low risk patch that will make Mozilla more compliant
to JIS X 4501 and UTR #10.
Attachment #123780 - Flags: approval1.4?
Have we been referring to JIS X 4051 as JIS X 4501?

I don't think there is such a standard as "JIS X 4501". Gosh,
we even have a file name including "4501"? Shouldn't this
error be corrected in the source?
Thank you for the note.
I'm always confused as to which is the correct name, JIS X 4051 or JIS X 4501,
partly due to the fact that Mozilla uses both names. We can rename 
*JISX4501*cpp. The only downside is that we'll lose the cvs change record for
the old file (, which is a weak point of CVS.)

BTW, in comment #20, it should have been UTR #14 instead of UTR #10.
fix checked in to the trunk. waiting for a1.4
OS: Windows 2000 → All
Hardware: PC → All
Please don't include spurious style/spacing changes in patches; it makes them
harder to review, and also makes cvs blame much less functional.
Or if you do clean up spacing, attacha diff -wu as well.

/be
Comment on attachment 123780 [details] [diff] [review]
a new patch with the stupid mistake fixed

a=asa (on behalf of drivers) for checkin to the 1.4 branch.
Attachment #123780 - Flags: approval1.4? → approval1.4+
Attached file Quotation marks woes
Here is an additional testcase with regard to quotation marks (ubiquotous ' and
" as well as paired ones; U+2018, U+2019, U+201C, U+201D). It is encoded in
UTF-8, and modified from attachment 121366 [details].
Keywords: fixed1.4
asa, sorry I forgot to note that the patch had been checked in for 1.4branch.

as for attachment 124625 [details], thanks for Issac for alerting me to this issue. It's
really curious that U+2019 (single-high 9) and U+201D(double-high-9) behave 
differently (case 4 and case 5). A quick glance at intl/lwbrk tells me a
different story (they're both class 1 and are not supposed to start a line.).
Depending on the cause, we may have to open a new bug.
I don't know how I forgot that U+2019 is also subjected 
to contextual analaysis. It's dealt with in the 'if-clause' right after that for
full-stop. It may be handled similarly, but that may not be the best. 
No longer depends on: line-breaking
jshin:

Why is this bug still opened? I think this bug should be marked as fixed. If there are other bugs, we should file new bugs for them.
QA Contact: amyy → i18n
You need to log in before you can comment on or make changes to this bug.