Open
Bug 56652
Opened 23 years ago
Updated 2 years ago
More intelligent Unicode-compatible linebreaking algorithms (UAX #14) needed
Categories
(Core :: Internationalization, enhancement, P4)
Tracking
()
ASSIGNED
Future
People
(Reporter: cmattar, Assigned: m_kato)
References
(Blocks 2 open bugs)
Details
(Keywords: intl, testcase, Whiteboard: [p-ie])
Attachments
(1 file)
967 bytes,
text/html
Currently, lines break only at whitespace. There are more cases where a line break wouldn't hurt (and IE seems to agree with me there); see the attachment for examples. I tested this using Mozilla M18 on Win98, but I don't think this is a regression. Assigning to Layout, though I don't know if that is the right component for this bug.
Reporter | ||
Comment 1•23 years ago
Comment 2•23 years ago
Should this go to Internationalization?
Comment 3•23 years ago
UNICODE has a zero-width space for this kind of thing -- are we sure this is valid? I'm always dubious of heuristics, and this would seem to fall into that category...
Comment 4•23 years ago
Reassigning to Buster and marking future.
Assignee: clayton → buster
Target Milestone: --- → Future
Comment 6•22 years ago
FWIW, <URL:http://www.unicode.org/unicode/reports/tr14/> provides some guidance on line-breaking implementation.
Comment 7•21 years ago
*** Bug 147836 has been marked as a duplicate of this bug. ***
Comment 8•21 years ago
As far as I can see, this bug is 2 years old. I ran into it again today and reported it as bug 147836. There may be no simple solution for all languages, but what do you think about starting with simple word splitting, using a fixed maximum number of characters per word? It would solve the worst cases of this bug. I have seen it on a page with many paragraphs, all of which were turned into lines 10 monitors wide.
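For illustration only, the fixed-maximum word splitting suggested in this comment could be sketched as below. The function name `split_long_words` and the default limit of 40 are hypothetical, not anything from Gecko, and the sketch counts code points, so it may split a base character from a following combining mark.

```python
def split_long_words(text: str, max_len: int = 40) -> list[str]:
    """Split on whitespace, then chop any token longer than max_len.

    Hypothetical helper: a crude fallback, not a Unicode line breaker.
    """
    pieces = []
    for token in text.split():
        # Emit the token in max_len-sized chunks; a short token yields one chunk.
        for start in range(0, len(token), max_len):
            pieces.append(token[start:start + max_len])
    return pieces
```

Each returned piece is at most `max_len` characters long, so a layout pass could treat every piece boundary as a permitted break.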
Comment 9•21 years ago
Layout doesn't even break lines on hyphens, for crying out loud.
OS: Windows 98 → All
Hardware: PC → All
Comment 10•21 years ago
There are two types of hyphens according to the HTML 4.01 spec: http://www.w3.org/TR/html401/struct/text.html#h-9.3.3
In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character (i.e. no special breaking behavior). The soft hyphen tells the user agent where a line break can occur.
Those browsers that interpret soft hyphens must observe the following semantics:
- If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line.
- If a line is not broken at a soft hyphen, the user agent must not display a hyphen character.
For operations such as searching and sorting, the soft hyphen should always be ignored.
In HTML, the plain hyphen is represented by the "-" character (&amp;#45; or &amp;#x2D;). The soft hyphen is represented by the character entity reference &amp;shy; (&amp;#173; or &amp;#xAD;).
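The soft-hyphen semantics quoted above can be sketched roughly as follows. This is a minimal illustration that measures width in characters and breaks a single word only at U+00AD; the names `render_with_soft_hyphens` and `search_key` are made up for the example.

```python
SOFT_HYPHEN = "\u00ad"

def render_with_soft_hyphens(word: str, width: int) -> list[str]:
    """Break `word` only at soft hyphens, keeping lines within `width` where possible."""
    segments = word.split(SOFT_HYPHEN)
    lines = []
    current = segments[0]
    for index, segment in enumerate(segments[1:], start=1):
        is_last = index == len(segments) - 1
        # Reserve one column for a visible hyphen unless this is the final segment.
        reserve = 0 if is_last else 1
        if len(current) + len(segment) + reserve <= width:
            # Break not taken: the soft hyphen is not displayed.
            current += segment
        else:
            # Break taken: a visible hyphen ends the line.
            lines.append(current + "-")
            current = segment
    lines.append(current)
    return lines

def search_key(word: str) -> str:
    """Soft hyphens are ignored for searching and sorting."""
    return word.replace(SOFT_HYPHEN, "")
```

A segment that is itself wider than `width` still overflows, since the sketch never breaks anywhere other than at a soft hyphen.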
Comment 11•21 years ago
"(i.e., no special breaking behavior)" isn't part of the spec and isn't really what the spec means. Rather, the relevant section is lower, in 9.3.5, which uses Western scripts as an example and gives incorrect rules. I'd say the example is non-normative and can be ignored.
Comment 12•21 years ago
dbaron: Should we be breaking a long repeating line of hyphens? This is an issue in http://bugscape.mcom.com/show_bug.cgi?id=15288
Comment 13•21 years ago
see also bug 157967 : Mac OS X needs to use the ATSUI-services which, among other things, will do the line breaking for you.
Comment 14•21 years ago
Altering summary: Unicode seems to lay down some normative linebreaking behavior, although it doesn't define a full linebreaking algorithm per se. <URL:http://www.unicode.org/unicode/reports/tr14/>. (More precisely, it defines in what places lines may, must, or must not break; whether or not the layout takes advantage of a possible linebreak is left to a higher-level algorithm.)
Keywords: testcase
Summary: More intelligent linebreaking algorithms needed → More intelligent Unicode-compatible linebreaking algorithms needed
Comment 15•21 years ago
*** Bug 175578 has been marked as a duplicate of this bug. ***
Comment 16•21 years ago
IE breaks on slashes (as the unicode standard recommends), Mozilla does not. Adding [p-ie] to whiteboard.
Whiteboard: [p-ie]
Comment 17•21 years ago
ATSUI claims to support Unicode 3.2 (which defines the line breaks as noted above, http://www.unicode.org/unicode/reports/tr14/): http://developer.apple.com/techpubs/macosx/Carbon/text/ATSUI/ ATSUI_Concepts/atsui_app_unicode/index.html (yeah I inserted a space ... ;-) "ATSUI provides full layout support for Unicode 3.2 and supports text rendering for all the features required by scripts included with version 2.1 of the Unicode standard or later. " So using ATSUI to render at the paragraph level on Mac OS X would fix this bug on Mac OS X at least (obviously this == fixing bug 157967).
Comment 18•20 years ago
Mozilla currently implements JIS X 4051 and Thai linebreaking (see the files in http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/). There are several differences between JIS X 4051 (the only linebreaker implemented so far) and UTR #14. They include (but are not limited to):
- Treatment of NBSP, ZWNBSP, CGJ: the current linebreaker doesn't implement 'do not break after, do not break before, either'.
  UTR #14> GL * Non-breaking (“Glue”) NBSP, ZWNBSP, CGJ prohibit line breaks before or after
  Currently, Mozilla breaks before (after) NBSP if what follows (precedes) it is a CJK ideograph or a Hangul syllable.
- In the (current) JIS X 4051 implementation, Euro (U+20AC) and other currency signs are class 8, while Yen (U+00A5) and Pound (U+00A3) are class 3. UTR #14 stipulates that they be treated consistently.
- Comma is treated per UTR, but fullstop is not (see bug 164759. A simple 2-line patch will fix this). Other characters in the UTR #14 IS category need to be taken care of.
  UTR> IS - Numeric Separator (Infix) (XB) Characters that usually occur inside a numerical expression may not be separated from following numeric characters, unless space character intervenes. Since they are otherwise sentence ending punctuation, they prevent breaks before.
- UTR #14 prohibits break before ‘]’ or ‘!’ or ‘;’ or ‘/’, even after spaces, but JIS X 4051 allows break before '/'.
FYI, other bugs on linebreaking (we may need a tracking bug opened) are: bug 193212, bug 203016, bug 178290, bug 172052, bug 164759 (dup: bug 202833), bug 162049 (closed), bug 162940 and more.
It has to be noted that *not all* rules in UTR #14 are normative, and we can tailor the non-normative rules as we see fit (per lang/locale or based on other criteria).
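To make the GL ("Glue") point concrete, here is a minimal sketch of a pair check in which glue characters veto a break on either side, even next to CJK ideographs or Hangul syllables. The GLUE set is taken from the UTR #14 quote in this comment; the function names and the very small rule set are assumptions for illustration and do not reflect the actual code in intl/lwbrk.

```python
# NBSP, ZWNBSP, CGJ, as listed in the UTR #14 quote above.
GLUE = {"\u00a0", "\ufeff", "\u034f"}

def is_cjk_or_hangul(ch: str) -> bool:
    """CJK Unified Ideographs (U+4E00..U+9FFF) or Hangul syllables (U+AC00..U+D7A3)."""
    return "\u4e00" <= ch <= "\u9fff" or "\uac00" <= ch <= "\ud7a3"

def may_break_between(before: str, after: str) -> bool:
    """Decide whether a line may break between two adjacent characters."""
    # Glue prohibits a break both before and after itself; checked first so
    # that it overrides the CJK/Hangul rule below.
    if before in GLUE or after in GLUE:
        return False
    if after == " ":
        return False  # never break before a space
    if before == " ":
        return True   # break opportunity after a space
    # Between ideographs or syllables a break is generally allowed.
    return is_cjk_or_hangul(before) or is_cjk_or_hangul(after)
```

With this ordering, `may_break_between("\u00a0", "\u6f22")` is false, which is the behavior the comment says the current linebreaker lacks.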
Keywords: intl
Comment 19•20 years ago
> fullstop is not (see bug 164759. A simple 2-line patch will fix this).
Actually, a bit more work is necessary. We have to add a new class
(break neither before nor after) and assign that class to fullstop in
some context (for instance, between 'e' and 'g' in 'e.g.').
We need that class anyway for NBSP/ZWNBSP/CGJ.
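A rough sketch of that context-dependent classification, assuming the rule "a full stop directly between two letters or digits gets the no-break class". The function name and the class labels "glue" and "default" are hypothetical; the attached patch may use different criteria.

```python
def classify_full_stop(text: str, i: int) -> str:
    """Return "glue" (break neither before nor after) or "default" for text[i]."""
    if text[i] != ".":
        return "default"
    has_neighbours = 0 < i < len(text) - 1
    # Only a full stop squeezed between two alphanumerics is treated as glue,
    # e.g. the first '.' in 'e.g.' or the '.' in '3.14'.
    if has_neighbours and text[i - 1].isalnum() and text[i + 1].isalnum():
        return "glue"
    return "default"
```

A sentence-final full stop followed by a space stays in the default class, so normal breaking after the space is unaffected.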
Comment 20•20 years ago
Will Mozilla's linebreaking algorithm include BPH -- break permitted here? (Or something equivalent.) Does bug 172819 belong to this "bug family" too?
Comment 21•20 years ago
Sorry for spamming. Adding a few more to CC. BTW, note that I uploaded a fix for fullstop case in bug 164759 (attachment 121406 [details] [diff] [review]).
Updated•20 years ago
Depends on: line-breaking
Comment 22•20 years ago
->Fonts & Text
Assignee: attinasi → font
Component: Layout → Layout: Fonts and Text
QA Contact: petersen → ian
Updated•20 years ago
Priority: P3 → --
Target Milestone: Future → ---
Comment 23•20 years ago
This is a serious usability problem, because it causes pages with long hyperlinks to grow excessively wide. If fixing to Unicode is going to be futured, then a workaround to line-break on slashes and hyphens should be put in place in the meantime.
Updated•20 years ago
Priority: -- → P4
Target Milestone: --- → Future
Updated•20 years ago
Blocks: line-breaking
No longer depends on: line-breaking
Comment 24•20 years ago
re: comment 23 (Simon Woodside): Breaking after ASCII hyphen (hyphen-minus) is now bug 95067; breaking after slash is bug 218580. Also, handling of soft-hyphen is bug 9101.
re: comment 20 (Torsten Bronger): I am unfamiliar with "BPH", but Unicode does provide a zero-width space (U+200B -- note that 'zwsp' is not a defined entity name) to use as a non-visible author-specified break point -- and Mozilla does in fact handle this correctly.
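As an illustration of author-specified break points, the sketch below wraps a string only at zero-width spaces. The function name `wrap_at_zwsp` is hypothetical, width is measured in characters, and the ZWSP itself is dropped from the output because it has no visible glyph.

```python
ZWSP = "\u200b"

def wrap_at_zwsp(text: str, width: int) -> list[str]:
    """Greedily wrap `text`, breaking only where the author placed a ZWSP."""
    lines = []
    current = ""
    for chunk in text.split(ZWSP):
        if current and len(current) + len(chunk) > width:
            # The next chunk does not fit: take the break at this ZWSP.
            lines.append(current)
            current = chunk
        else:
            # The chunk fits (or the line is empty): the ZWSP stays invisible.
            current += chunk
    lines.append(current)
    return lines
```

For example, a long URL with a ZWSP after each slash wraps at the slashes instead of forcing the page wider.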
Comment 25•20 years ago
I had a discussion about U+200B in comp.text.xml last year. I don't like it because the Unicode specs say that it "may expand in justification", which is totally unacceptable of course. But given this Unicode description, I must assume that some XML interpreters do or will treat it like this.
Comment 26•20 years ago
*** Bug 222057 has been marked as a duplicate of this bug. ***
Comment 27•18 years ago
This bug hasn't seen any activity in a couple years, but it still seems to be a problem. What's up?
Comment 28•18 years ago
Lack of resources. Patches accepted.
Comment 29•16 years ago
smontagu asked me to comment here -- basically I agree with Chris Hoess's assessment in comment 14. Most of UAX #14 is non-normative. (This will be even clearer in the next revision.) The normative rules deal mainly with line breaking control characters, and should be implemented as specced in the proposed update. The rest of it is tailorable, and many of those rules are impractical unless we also implement prioritization. E.g. we should allow breaks after hyphens as suggested, but only if they are at a lower priority than spaces.
IMHO UAX #14's non-normative rules should be viewed more as hints on how to do things right rather than a specification for how to do things right. It collects together a lot of hard-to-find and useful information about line breaking, but it's not a complete usable algorithm, its heuristics are not always the best, and sometimes it's just wrong.
So, to summarize, work on line breaking at punctuation other than spaces should
a) implement prioritization
b) use UAX #14 as a starting point, but also
c) use common sense, expert opinion, and/or research to support any changes from what we do today, not just blindly implement UAX #14's pairs table
d) use the latest proposed update to UAX #14 [1], as it fixes some substantial errors in the latest approved version
[1] http://unicode.org/reports/tr14/tr14-20.html
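One way to read the prioritization point is sketched below: when choosing where to end a line, prefer the last space that fits, and fall back to a break after a hyphen only when no space is available. The function name `break_line` and the two-level priority scheme are assumptions for illustration; a real implementation would need more levels and proper width measurement.

```python
def break_line(text: str, width: int) -> tuple[str, str]:
    """Return (line, rest), preferring breaks at spaces over breaks after hyphens."""
    if len(text) <= width:
        return text, ""
    # Priority 1: the last space at or before the width limit (the space is dropped).
    space = text.rfind(" ", 0, width + 1)
    if space > 0:
        return text[:space], text[space + 1:]
    # Priority 2: the last hyphen that still fits on the line (the hyphen is kept).
    hyphen = text.rfind("-", 0, width)
    if hyphen > 0:
        return text[:hyphen + 1], text[hyphen + 1:]
    # No opportunity at all: emergency break at the width limit.
    return text[:width], text[width:]
```

With a width of 12, "a state-of-the-art" breaks after "a" rather than after "a state-of-", because the space outranks the later hyphen; the remaining "state-of-the-art" then has no space and breaks after a hyphen.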
Comment 30•16 years ago
This bug is affected by the recent fix to bug 95067 (which, I suppose, is actually a duplicate, although at a slightly more specific level). The fix allows linebreaking in connection with hyphens, slashes and a number of other characters, mainly by imitating the linebreaking behavior of WinIE 7. See the comparison table between Gecko, IE 7 and Opera 9.2: http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/tools/spec_table.html
While offering a solution to the layout problems caused by URLs and other very long strings, the fix seems to introduce rather undesirable side effects. For example, linebreaking is allowed after the slash in "c/o", and both before and after the parentheses in "colo(u)ring". I was considering filing bugs for some of the new issues but I haven't had the opportunity to test them properly. And then I found this bug and realized that basically they all concern the same subject, so perhaps the discussion should continue here.
I don't think imitating IE's over-simplified linebreaking algorithms is the right thing to do. Mozilla has made its reputation by being better than IE, even if doing so caused some web sites that were optimized for IE to look bad. Now, the competition has finally forced even Microsoft to bring its browser to the 21st century. This is not the time to lower the standards and start trailing them.
Comment 31•16 years ago
By the way, when considering the applicability of UAX #14, the general criticism by Jukka Korpela might be worth taking into account (although it isn't quite up to date with the most recent revisions): http://www.cs.tut.fi/~jkorpela/unicode/linebr.html There is even a more extensive article about word division in IE and the problems it causes especially from the point of web-authoring: http://www.cs.tut.fi/~jkorpela/html/nobr.html
Comment 32•15 years ago
Can we have an update on this bug? By the way, bug 346969 is now fixed, but I cannot close it.
Comment 33•15 years ago
Bug 450088 is related to this issue. I also have a zlib-licensed implementation of UAX #14 available at: http://vimgadgets.cvs.sourceforge.net/vimgadgets/common/tools/linebreak/
Updated•14 years ago
Assignee: layout.fonts-and-text → nobody
QA Contact: ian → layout.fonts-and-text
Comment 34•11 years ago
I think that this is now important for compatibility with other browsers. I'll try to implement it as a new class that can be selected with a pref. Once we enable the new class by default, I think we should remove the pref and the current implementation.
Assignee: nobody → masayuki
Severity: minor → normal
Component: Layout: Text → Internationalization
Priority: P4 → --
Summary: More intelligent Unicode-compatible linebreaking algorithms needed → More intelligent Unicode-compatible linebreaking algorithms (UAX #14) needed
Assignee | ||
Comment 35•11 years ago
If we import libicu (bug 724531 and bug 820261) into our code, we can handle this more easily instead of creating new table.
Comment 36•11 years ago
(In reply to Makoto Kato from comment #35)
> If we import libicu (bug 724531 and bug 820261) into our code, we can handle
> this more easily instead of creating new table.
Is it enough for our requirements? If we implement UAX #14 strictly, we will probably break compatibility with a lot of websites. So we would need to add customizations similar to those in the current line breaker. Is that possible?
Comment 37•11 years ago
Looks like it's not capable of CSS3 text, such as line-break. I don't think we should use a third-party library for the line breaker, because line breaking is too sensitive for compatibility and performance.
Comment 38•11 years ago
(In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #37)
> Looks like it's not capable of CSS3 text, such as line-break.
What exactly do you mean here? Do you think we need to change the spec? If so, which part (5.1? 5.2?)
Comment 39•11 years ago
It seems that Chromium also uses its own table for compatibility: http://mxr.mozilla.org/chromium/source/src/third_party/WebKit/Source/WebCore/rendering/break_lines.cpp#71
(In reply to John Daggett (:jtd) from comment #38)
> (In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #37)
> > Looks like it's not capable of CSS3 text, such as line-break.
>
> What exactly do you mean here? Do you think we need to change the spec? If
> so, which part (5.1? 5.2?)
No. If we used the ICU line breaker, the library would have to provide all the behavior defined by CSS3 Text, and that behavior would have to be reasonably compatible with current Gecko and other browsers, especially in the ASCII character range.
Comment 40•11 years ago
If ICU supports complex line breaking script, it's worthwhile to use ICU only for them, I think. Currently, we use native API's line breaker for them. So, Gecko doesn't behave same on all platforms for such language users.
Assignee | ||
Comment 41•11 years ago
(In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #40)
> If ICU supports complex line breaking script, it's worthwhile to use ICU
> only for them, I think. Currently, we use native API's line breaker for
> them. So, Gecko doesn't behave same on all platforms for such language users.
The platform's line breaker may not find the correct line break positions for complex languages such as Khmer. We should use another approach (e.g. libicu) for these languages.
Also, actually, even if not complex script, line breaker isn't compatible on each browser implementation. See http://w3c-test.org/framework/results/i18n-css3-text/.
Comment 42•11 years ago
Hmm, Chromium might use ICU for the fallback class of non-ASCII characters. But I'm not sure whether the build option (ICU_UNICODE) is enabled by default. And if it is, I'm not sure how they plan to support the line-break property in the future.
(In reply to Makoto Kato from comment #41)
> Also, actually, even if not complex script, line breaker isn't compatible on
> each browser implementation. See
> http://w3c-test.org/framework/results/i18n-css3-text/.
Yes, but I think we can improve compatibility in the non-ASCII range, since we have never used UAX #14 so far.
Comment 43•11 years ago
And probably, if we use ICU, it becomes more difficult to fix bug 389710.
Updated•7 years ago
Whiteboard: [p-ie] → [p-ie] [platform-rel-Intel]
Updated•7 years ago
Whiteboard: [p-ie] [platform-rel-Intel] → [p-ie]
Updated•3 years ago
Assignee: masayuki → nobody
Severity: normal → S3
Type: defect → enhancement
Priority: -- → P4
Assignee | ||
Updated•2 years ago
Assignee: nobody → m_kato
Status: NEW → ASSIGNED