Closed Bug 56652 Opened 24 years ago Closed 9 months ago

More intelligent Unicode-compatible linebreaking algorithms (UAX #14) needed

Categories

(Core :: Internationalization, enhancement, P4)

Tracking

()

RESOLVED DUPLICATE of bug 1719535
Future

People

(Reporter: cmattar, Assigned: m_kato)

References

(Blocks 2 open bugs)

Details

(Keywords: intl, testcase, Whiteboard: [p-ie])

Attachments

(1 file)

Currently lines break only on whitespace. There are some more cases where
linebreaking wouldn't hurt (and IE seems to agree with me there); see the
attachment for an example.

I tested this using Mozilla M18 on Win98, but I don't think this is a 
regression.

Assigning to Layout, don't know if this bug's right there.
Should this go to Internationalization?
Unicode has a zero-width space for this kind of thing -- are we sure this
is valid? I'm always dubious of heuristics, and this would seem to fall 
into that category...
Reassigning to Buster and marking future.
Assignee: clayton → buster
Target Milestone: --- → Future
Depends on: 99457
Build reassigning Buster's bugs to Marc.
Assignee: buster → attinasi
FWIW, <URL:http://www.unicode.org/unicode/reports/tr14/> provides some guidance 
on line-breaking implementation.
*** Bug 147836 has been marked as a duplicate of this bug. ***
As far as I can see, this bug is two years old. I ran into it again today and
reported it as bug 147836.

There probably isn't a simple solution for all languages, but what do you think
about starting with simple word splitting using a fixed maximum number of
characters for every word? It would solve the worst cases of this bug.

I have seen this on a page with many paragraphs; all were turned into lines 10
monitors wide by this bug.
Layout doesn't even break lines on hyphens, for crying out loud.
OS: Windows 98 → All
Hardware: PC → All
There are two types of hyphens according to the HTML4.01 spec:

http://www.w3.org/TR/html401/struct/text.html#h-9.3.3

In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen.
The plain hyphen should be interpreted by a user agent as just another
character (i.e., no special breaking behavior). The soft hyphen tells the user
agent where a line break can occur.

Those browsers that interpret soft hyphens must observe the following semantics:
If a line is broken at a soft hyphen, a hyphen character must be displayed at
the end of the first line. If a line is not broken at a soft hyphen, the user
agent must not display a hyphen character. For operations such as searching and
sorting, the soft hyphen should always be ignored.

In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;).
The soft hyphen is represented by the character entity reference &shy; (&#173;
or &#xAD;)
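
The soft-hyphen semantics quoted above can be sketched in a few lines of Python. This is only an illustration, not Gecko code: the function names `wrap_at_soft_hyphens` and `search_key` are made up for this example, and the wrapping is a simple greedy fit of a single word measured in characters rather than rendered width.

```python
SOFT_HYPHEN = "\u00ad"  # the character behind &shy; / &#173; / &#xAD;


def wrap_at_soft_hyphens(word: str, width: int) -> list[str]:
    """Greedily wrap one word, breaking only at soft hyphens (hypothetical helper)."""
    pieces = word.split(SOFT_HYPHEN)
    lines: list[str] = []
    current = ""
    for index, piece in enumerate(pieces):
        is_last = index == len(pieces) - 1
        # Unless this is the final piece, keep one column free for the
        # visible hyphen that must appear if the line is broken here.
        needed = len(current) + len(piece) + (0 if is_last else 1)
        if current and needed > width:
            lines.append(current + "-")  # break taken: display a hyphen
            current = piece
        else:
            current += piece  # break not taken: display nothing
    lines.append(current)
    return lines


def search_key(text: str) -> str:
    """Soft hyphens are ignored for searching and sorting."""
    return text.replace(SOFT_HYPHEN, "")


word = "hy\u00adphen\u00ada\u00adtion"
print(wrap_at_soft_hyphens(word, 7))   # narrow column: one break is used
print(wrap_at_soft_hyphens(word, 20))  # wide column: no hyphen is shown
```

A plain "-" is not touched by this sketch at all; whether it should offer a break opportunity is exactly what the following comments argue about.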
"(i.e., no special breaking behavior)" isn't part of the spec and isn't really
what the spec means.  Rather, the relevant section is lower, in 9.3.5, which
uses Western scripts as an example and gives incorrect rules.  I'd say the
example is non-normative and can be ignored.
dbaron: Should we be breaking a long repeating line of hyphens? 

This is an issue in http://bugscape.mcom.com/show_bug.cgi?id=15288
Blocks: 168902
See also bug 157967: Mac OS X needs to use the ATSUI services, which, among
other things, will do the line breaking for you.
Altering summary: Unicode seems to lay down some normative linebreaking
behavior, although they don't define a full linebreaking algorithm per se.
<URL:http://www.unicode.org/unicode/reports/tr14/>. (More precisely, it defines
in what places lines may, must, or must not break; whether or not the layout
takes advantage of a possible linebreak is left to a higher-level algorithm.)
Keywords: testcase
Summary: More intelligent linebreaking algorithms needed → More intelligent Unicode-compatible linebreaking algorithms needed
*** Bug 175578 has been marked as a duplicate of this bug. ***
IE breaks on slashes (as the Unicode standard recommends), Mozilla does not.
Adding [p-ie] to whiteboard.
Whiteboard: [p-ie]
ATSUI claims to support Unicode 3.2 (which defines the line breaks as noted
above, http://www.unicode.org/unicode/reports/tr14/):

http://developer.apple.com/techpubs/macosx/Carbon/text/ATSUI/
ATSUI_Concepts/atsui_app_unicode/index.html

(yeah I inserted a space ... ;-)

"ATSUI provides full layout support for Unicode 3.2 and supports text rendering
for all the features required by scripts included with version 2.1 of the
Unicode standard or later. "

So using ATSUI to render at the paragraph level on Mac OS X would fix this bug
on Mac OS X at least (obviously this == fixing bug 157967).
Mozilla currently implements JIS X 4051 and Thai linebreaking. 
(see files in http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/).
There are several differences between JIS X 4051 (the only linebreaker
implemented so far) and UTR #14.  They include (but are not limited to)

   - treatment of NBSP, ZWNBSP, CGJ: the current linebreaker doesn't
     implement 'do not break after; do not break before, either'

      UTR #14>
      GL * Non-breaking (“Glue”) NBSP, ZWNBSP, CGJ prohibit line breaks before
      or after

     Currently, Mozilla breaks before (after) NBSP if what follows (precedes)
     it is a CJK Ideograph or Hangul syllable.
    
   -  In (current) JIS X 4051 (implementation), Euro (U+20AC) and
      other currency signs  are class 8 while Yen(U+00A5) and Pound(U+00A3) 
      are class 3. UTR #14 stipulates that they be treated consistently.
  
   -   comma is treated per UTR, but fullstop is not (see bug 164759.
       A simple 2-line patch will fix this). Other characters in 
       UTR#14 IS category need to be taken care of. 
 
       UTR> IS - Numeric Separator (Infix) (XB)
       Characters that usually occur inside a numerical expression may
       not be separated from following numeric characters, unless space
       character intervenes. Since they are otherwise sentence ending
       punctuation, they prevent breaks before.
      
  - UTR #14 prohibits break before ‘]’ or ‘!’ or ‘;’ or ‘/’,  even after spaces,
    but JIS X 4051 allows break before '/'. 

FYI, other bugs on linebreaking (we may need a tracking bug opened) are:

bug 193212, bug 203016, bug 178290, bug 172052, bug 164759(dup: bug 202833),
bug 162049(closed), bug 162940 and more.

It has to be noted that *not all* rules in UTR #14 are
normative, and we can tailor the non-normative rules
as we see fit (per lang/locale or based on other criteria).
Keywords: intl
> fullstop is not (see bug 164759. A simple 2-line patch will fix this).

Actually, a bit more work is necessary. We have to add a new class 
(break neither before nor after) and assign that class to fullstop in 
some context (for instance, between 'e' and 'g' in 'e.g.').
We need that class anyway for NBSP/ZWNBSP/CGJ.
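
A minimal sketch of what such a "break neither before nor after" class would do, in Python for brevity. The names `GLUE`, `is_cjk_ideograph` and `break_opportunities` are invented for this illustration, and the surrounding rules (break after a space, break next to a CJK ideograph) are a drastic simplification, not the JIS X 4051 or UTR #14 tables. The context-dependent fullstop case ('e.g.') is not covered.

```python
# Assumed members of the glue class: NBSP, ZWNBSP, CGJ.
GLUE = {"\u00a0", "\ufeff", "\u034f"}


def is_cjk_ideograph(ch: str) -> bool:
    """Rough check covering only the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def break_opportunities(text: str) -> list[int]:
    """Return every index i such that a break is allowed before text[i]."""
    allowed = []
    for i in range(1, len(text)):
        before, after = text[i - 1], text[i]
        if before in GLUE or after in GLUE:
            continue  # glue: break neither before nor after
        if after == " ":
            continue  # never break in front of a space
        if before == " " or is_cjk_ideograph(before) or is_cjk_ideograph(after):
            allowed.append(i)
    return allowed


# The NBSP next to the ideograph yields no break on either side of it.
print(break_opportunities("日本\u00a0語 a\u00a0b c"))
```

The point of checking the glue class first is that it overrides the ideograph rule, which is the behavior reported as missing for NBSP followed or preceded by CJK text.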
Will Mozilla's linebreaking algorithm include BPH -- break permitted here?  (Or
something equivalent.)

Does bug 172819 belong to this "bug family" too?
Sorry for spamming.
Adding a few more to CC. BTW, note that I uploaded a fix for fullstop case in
bug 164759 (attachment 121406 [details] [diff] [review]). 
Depends on: line-breaking
->Fonts & Text
Assignee: attinasi → font
Component: Layout → Layout: Fonts and Text
QA Contact: petersen → ian
Priority: P3 → --
Target Milestone: Future → ---
This is a serious usability problem, because it causes pages with long
hyperlinks to grow excessively wide. If fixing to Unicode is going to be
futured, then a workaround to line-break on slashes and hyphens should be put in
place in the meantime.
Priority: -- → P4
Target Milestone: --- → Future
No longer depends on: line-breaking
re: comment 23 (Simon Woodside):  Breaking after ASCII hyphen (hyphen-minus) is 
now bug 95067; breaking after slash is bug 218580.  Also, handling of 
soft-hyphen is bug 9101.

re: comment 20 (Torsten Bronger):  I am unfamiliar with "BPH", but Unicode does
provide a zero-width space (U+200B -- note that 'zwsp' is not a defined entity
name) to use as a non-visible author-specified break point -- and Mozilla does
in fact handle this correctly.
I had a discussion about U+200B in comp.text.xml last year.  I don't like it
because the Unicode specs say that it "may expand in justification", which is
totally unacceptable of course.  But with this Unicode description I must assume
that some XML interpreters do/will treat it like this.
*** Bug 222057 has been marked as a duplicate of this bug. ***
This bug hasn't seen any activity in a couple years, but it still seems to be a problem.  What's up?
Lack of resources.  Patches accepted.
Blocks: 359179
Blocks: 346969
smontagu asked me to comment here -- basically I agree with Chris Hoess's assessment in comment 14.

Most of UAX #14 is non-normative. (This will be even clearer in the next revision.) The normative rules deal mainly with line breaking control characters, and should be implemented as specced in the proposed update.

The rest of it is tailorable, and many of those rules are impractical unless we also implement prioritization. E.g. we should allow breaks after hyphens as suggested, but only if they are at a lower priority than spaces.

IMHO UAX#14's non-normative rules should be viewed more as hints on how to do things right rather than a specification for how to do things right. It collects together a lot of hard-to-find and useful information about line breaking, but it's not a complete usable algorithm, its heuristics are not always the best, and sometimes it's just wrong.

So, to summarize, work on line breaking at punctuation other than spaces should
  a) implement prioritization
  b) use UAX14 as a starting point but also
  c) use common sense, expert opinion, and/or research to support any changes
     from what we do today, not just blindly implement UAX14's pairs table
  d) use the latest proposed update to UAX 14 [1], as it fixes some substantial
     errors in the latest approved version

[1] http://unicode.org/reports/tr14/tr14-20.html
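
To make the prioritization idea concrete, here is a rough Python sketch. It assumes just two priority levels (spaces first, then hyphens and slashes) and measures width in characters. The names `candidates` and `choose_break` are hypothetical, and nothing here reflects the actual Gecko line breaker.

```python
SPACE_PRIORITY = 0   # preferred
PUNCT_PRIORITY = 1   # used only when no space break fits


def candidates(text: str) -> list[tuple[int, int]]:
    """List (index, priority) pairs; a break is allowed before text[index]."""
    found = []
    for i in range(1, len(text)):
        if text[i] == " ":
            continue  # never break in front of a space
        previous = text[i - 1]
        if previous == " ":
            found.append((i, SPACE_PRIORITY))
        elif previous in "-/":
            found.append((i, PUNCT_PRIORITY))
    return found


def choose_break(text: str, width: int):
    """Pick where the first line ends, or None if no allowed break fits."""
    fitting = [
        (i, p) for i, p in candidates(text)
        if len(text[:i].rstrip()) <= width  # trailing spaces do not count
    ]
    if not fitting:
        return None
    best = min(p for _, p in fitting)                  # best priority that fits
    return max(i for i, p in fitting if p == best)     # latest break at that level


print(choose_break("see well-known docs", 12))                  # space wins over hyphen
print(choose_break("http://example.org/a-very-long-path", 20))  # no spaces: falls back
```

With a space available the hyphen in "well-known" is left alone, while a long URL with no spaces still gets a break after a slash. That is the trade-off described above: punctuation breaks exist, but at a lower priority than spaces.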
This bug is affected by the recent fix to bug 95067 (which, I suppose, is actually a duplicate, although at a little more specific level). The fix allows linebreaking in connection with hyphens, slashes and a number of other characters, mainly by imitating the linebreaking behavior of WinIE 7. See the comparison table between Gecko, IE 7 and Opera 9.2:
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/tools/spec_table.html

While offering a solution to the lay-out problems caused by URLs and other very long strings, the fix seems to introduce rather undesirable side-effects. For example, linebreaking is allowed after the slash in "c/o", and both before and after the parentheses in "colo(u)ring".

I was considering filing bugs for some of the new issues but I haven't had the opportunity to test them properly. And then I found this bug and realized that basically they all concern the same subject, so perhaps the discussion should continue here.

I don't think imitating IE's over-simplified linebreaking algorithms is the right thing to do. Mozilla has made its reputation by being better than IE, even if doing so caused some web-sites that were optimized for IE to look bad. Now, the competition has finally forced even Microsoft to bring its browser to the 21st century. This is not the time to lower the standards and start trailing them.
By the way, when considering the applicability of UAX #14, the general criticism by Jukka Korpela might be worth taking into account (although it isn't quite up to date with the most recent revisions):
http://www.cs.tut.fi/~jkorpela/unicode/linebr.html

There is even a more extensive article about word division in IE and the problems it causes especially from the point of web-authoring:
http://www.cs.tut.fi/~jkorpela/html/nobr.html
Can we have an update on this bug?

By the way, bug 346969 is now fixed, but I cannot close it.
Bug 450088 is related to this issue.  I also have a zlib-licensed implementation of UAX #14 available at:

http://vimgadgets.cvs.sourceforge.net/vimgadgets/common/tools/linebreak/
Assignee: layout.fonts-and-text → nobody
QA Contact: ian → layout.fonts-and-text
I think that this is now important for compatibility with other browsers. I'll try to implement this as a new class that can be selected with a pref. I think that when we enable the new class by default, we should remove the pref and the current implementation.
Assignee: nobody → masayuki
Severity: minor → normal
Component: Layout: Text → Internationalization
Priority: P4 → --
Summary: More intelligent Unicode-compatible linebreaking algorithms needed → More intelligent Unicode-compatible linebreaking algorithms (UAX #14) needed
If we import libicu (bug 724531 and bug 820261) into our code, we can handle this more easily instead of creating new table.
(In reply to Makoto Kato from comment #35)
> If we import libicu (bug 724531 and bug 820261) into our code, we can handle
> this more easily instead of creating new table.

Is it enough for our requirements? If we implement UAX #14 strictly, we will probably break compatibility with a lot of websites. So we would need to add customizations similar to those in the current line breaker. Is that possible?
Looks like it's not capable of CSS3 text, such as line-break. I don't think that we should use 3rd party's library for line breaker because it's too sensitive for compatibility and performance.
(In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #37)
> Looks like it's not capable of CSS3 text, such as line-break.

What exactly do you mean here?  Do you think we need to change the spec?  If so, which part (5.1? 5.2?)
It seems that Chromium also uses its own table for compatibility:
http://mxr.mozilla.org/chromium/source/src/third_party/WebKit/Source/WebCore/rendering/break_lines.cpp#71

(In reply to John Daggett (:jtd) from comment #38)
> (In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #37)
> > Looks like it's not capable of CSS3 text, such as line-break.
> 
> What exactly do you mean here?  Do you think we need to change the spec?  If
> so, which part (5.1? 5.2?)

No. If we would use ICU line breaker, the library should have all behavior defined by CSS3 Text and the behavior should have compatibility with current Gecko and other browsers moderately, especially in ASCII character range.
If ICU supports complex line breaking script, it's worthwhile to use ICU only for them, I think. Currently, we use native API's line breaker for them. So, Gecko doesn't behave same on all platforms for such language users.
(In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #40)
> If ICU supports complex line breaking script, it's worthwhile to use ICU
> only for them, I think. Currently, we use native API's line breaker for
> them. So, Gecko doesn't behave same on all platforms for such language users.

The platform's line breaker may not find correct line break positions for complex languages such as Khmer.  We should use another approach (e.g. libicu) for these languages.

Also, actually, even if not complex script, line breaker isn't compatible on each browser implementation. See http://w3c-test.org/framework/results/i18n-css3-text/.
Hmm, Chromium might use ICU as a fallback for non-ASCII characters. But I'm not sure whether the build option (ICU_UNICODE) is enabled by default. And if it is enabled, I'm not sure how they plan to support the line-break property in the future.

(In reply to Makoto Kato from comment #41)
> Also, actually, even if not complex script, line breaker isn't compatible on
> each browser implementation. See
> http://w3c-test.org/framework/results/i18n-css3-text/.

Yes, but I think we can improve the compatibility in non-ASCII range since we have never used UAX #14 yet.
And probably, if we use ICU, it becomes more difficult to fix bug 389710.
Whiteboard: [p-ie] → [p-ie] [platform-rel-Intel]
Whiteboard: [p-ie] [platform-rel-Intel] → [p-ie]
Assignee: masayuki → nobody
Severity: normal → S3
Type: defect → enhancement
Priority: -- → P4
No longer blocks: 359179
Assignee: nobody → m_kato
Status: NEW → ASSIGNED
Status: ASSIGNED → RESOLVED
Closed: 9 months ago
Duplicate of bug: 1719535
Resolution: --- → DUPLICATE

We've integrated ICU4X line segmenter in bug 1719535, which is UAX 14 compatible.
