Closed
Bug 397597
Opened 17 years ago
Closed 17 years ago
Firefox ignores Unicode linebreak semantics needed for Tibetan script
Categories
(Core :: Layout: Text and Fonts, defect)
RESOLVED
FIXED
People
(Reporter: dalias, Assigned: roc)
Details
(Keywords: intl)
Attachments
(1 file, 1 obsolete file)
6.12 KB, patch | masayuki: review+ | dbaron: approval1.9+
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a9pre) Gecko/2007092504 Minefield/3.0a9pre
Build Identifier: Minefield 3.0a9pre

Tibetan script does not use spaces but instead delimits word-like units with a "tsheg", U+0F0B. This character has Unicode line break properties such that line breaking should be possible after it unless it's followed by certain other punctuation, but Firefox is unable to break long runs of Tibetan text and instead forces horizontal scrolling.

I believe this is a symptom of a much larger problem, that the developers seem to believe line breaking is a matter of spaces, CJK, and Thai. In reality there are many more languages/scripts that either need special-casing or proper application of the Unicode line breaking semantics. The latter is of course a lot more sane than trying to figure it all out all over again without consulting experts in countless languages.

If fixing this correctly in time for Firefox 3 is not feasible, could you at least include special casing for the Tibetan tsheg and check to see if there are any other languages with similar requirements?

Reproducible: Always

Steps to Reproduce:
View a long run of Tibetan text, e.g. paste the following string over and over in an HTML file and then view it: "བོད་སྐད་"

Actual Results:
Horizontal scroll bar appears or text overruns its block element.

Expected Results:
Text should wrap after the tsheg "་" mark when it gets to the end of a line.
Assignee • Comment 1 • 17 years ago
> I believe this is a symptom of a much larger problem, that the developers seem
> to believe line breaking is a matter of spaces, CJK, and Thai.

Not at all. To fix this, we just need to add Tibetan characters (and other characters that require complex treatment) to the IS_COMPLEX macro here. (This was always part of the plan; that's why IS_COMPLEX isn't called IS_THAI.)
http://mxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#419

Then runs of Tibetan characters will be broken by Uniscribe, Pango or ATSUI. I'm assuming those engines are good enough.

Alternatively we could make IS_COMPLEX return true except for ranges of characters that we know how to break (Latin, CJK, and some other ranges). The problem with being conservative in that direction is that it increases the number of situations where we could have differences in line breaking between platforms, which is something we want to minimize.
Sounds like the line-breaking spec is simple enough. If the punctuations are only in Tibetan range (U+0F00 - U+0FFF), we can implement the line-breaking ourselves.
dalias: Can you list all the punctuation marks that suppress line breaking?
If you really want to do the implementation yourself, look in Unicode's LineBreak.txt and at UAX#14:
http://www.unicode.org/reports/tr14/tr14-19.html#BA
http://www.unicode.org/reports/tr14/tr14-19.html#EX
http://www.unicode.org/reports/tr14/tr14-19.html#GL
http://www.unicode.org/reports/tr14/tr14-19.html#TibetanLinebreaking

The characters which would suppress breaking after tsheg are the ones of class EX. Getting something sufficiently correct to be "usable" is as simple as I originally described, but correctly handling classical texts and making the optimal linebreak choices for them requires dealing with a good many UAX#14 line break classes, and I suspect it would be preferable to apply it fully. I am not an expert in this sort of text so I really cannot advise on what is and isn't needed, but if you wish to pursue that path I can potentially get you in touch with an expert.

And yes, all the punctuation relevant to linebreaking for Tibetan should be in the range U+0F00 - U+0FFF. CJK double-angle-bracket quotation marks and some western punctuation are also used at times, but I can't think of any interaction these have with Tibetan aside from their usual line breaking properties. Just wanted to mention that in case it is relevant to the way your code works.

Also apologies for being presumptuous about the developers' treatment of line breaking. The only source I was going on was an old Bugzilla thread on the matter where I got that impression. Glad to see it's not the case!
Some other characters (defined in BA or EX) should be breakable in Tibetan, right? And if we can support them, is it enough for Tibetan text?

BA:
0F0B TIBETAN MARK INTERSYLLABIC TSHEG
0F85 TIBETAN MARK PALUTA
0F34 TIBETAN MARK BSDUS RTAGS
0F7F TIBETAN SIGN RNAM BCAD
0FBE TIBETAN KU RU KHA
0FBF TIBETAN KU RU KHA BZHI MIG CAN

EX:
0F0D TIBETAN MARK SHAD
0F0E TIBETAN MARK NYIS SHAD
0F0F TIBETAN MARK TSHEG SHAD
0F10 TIBETAN MARK NYIS TSHEG SHAD
0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD
0F14 TIBETAN MARK GTER TSHEG

# I cannot trust the UAX#14 without being confirmed by the readers of the language :-(
Assignee • Comment 6 • 17 years ago
(In reply to comment #2)
> Sounds like the line-breaking spec is simple enough. If the punctuations are
> only in Tibetan range (U+0F00 - U+0FFF), we can implement the line-breaking
> ourselves.

I don't think we should do this. That creates extra work for us with no benefit. We should just add the Tibetan range to IS_COMPLEX.
The typical hack websites seem to use for line breaking is to insert the (invalid HTML) <wbr> element after each tsheg (U+0F0B) that's not immediately followed by a shad. The blog at http://tibetanportal.com/myownblog/?lp_lang_pref=bo has JS code to perform this on the client side, and if you read that you'll find that <wbr> is inserted after any tsheg not followed by U+0F0D. This does not respect the other types of shads (of class EX) and will probably cause problems, but it's okay for simple modern writing.

If you want a naive algorithm that's more robust, I would say to consider breaking after U+0F0B only when it's followed by a Tibetan letter. That might miss a few valid line break opportunities but it will prevent lines from overrunning the browser window and will never give false positives (breaks where they shouldn't be). This should be extremely easy to implement too.
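For illustration only, here is a minimal C++ sketch of that naive rule. The function names are hypothetical, and "Tibetan letter" is approximated by the range U+0F40 to U+0F6C, which is an assumption on my part rather than anything settled in this bug:

```cpp
// Sketch of the naive rule: allow a break after a tsheg only when the next
// character is a Tibetan letter. All names here are hypothetical.

constexpr char32_t kTsheg = 0x0F0B;  // TIBETAN MARK INTERSYLLABIC TSHEG

// Assumption: treat U+0F40..U+0F6C as "Tibetan letters". This range also
// covers the unassigned code point U+0F48, which is harmless for a sketch.
constexpr bool IsTibetanLetter(char32_t c) {
  return c >= 0x0F40 && c <= 0x0F6C;
}

// True if a line break may be placed between `current` and `next`.
constexpr bool NaiveBreakAfterTsheg(char32_t current, char32_t next) {
  return current == kTsheg && IsTibetanLetter(next);
}
```

Because a break is only ever allowed before a letter, a tsheg followed by a shad (or by any other punctuation) is never split, at the cost of missing some legitimate break opportunities.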
Assignee • Comment 8 • 17 years ago
This fixes it. I strongly believe this is better than coding up our own handling of Tibetan and goodness knows what else.
Assignee: nobody → roc
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Attachment #282484 -
Flags: review?
Assignee • Updated • 17 years ago
Attachment #282484 -
Flags: review? → review?(masayuki)
The comment here is wrong because the range excludes Lao:

+  return (0x0e01 <= aChar && aChar <= 0x0e5b) || // Thai, Lao
+         (0x0f00 <= aChar && aChar <= 0x0fff);   // Tibetan

Is that intentional? If Lao is supposed to be included, then you should just make it all one range since the three scripts are adjacent, and that will negate any potential performance loss from testing multiple ranges.
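A rough sketch of what the single-range version could look like. This is not the attached patch (which is not shown in this bug); `IsComplexChar` is a hypothetical stand-in for the IS_COMPLEX check, and the lower bound 0x0e01 is simply carried over from the quoted lines:

```cpp
// Hypothetical stand-in for IS_COMPLEX, written as one contiguous range.
// Thai (U+0E00..U+0E7F), Lao (U+0E80..U+0EFF) and Tibetan (U+0F00..U+0FFF)
// are adjacent Unicode blocks, so a single comparison pair covers all three.
constexpr bool IsComplexChar(char16_t aChar) {
  return 0x0e01 <= aChar && aChar <= 0x0fff;  // Thai, Lao, Tibetan
}
```

One side effect of merging the ranges is that the unassigned code points between the end of the assigned Thai characters and the Lao block are also treated as complex, which should not matter in practice.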
Assignee • Comment 10 • 17 years ago
Updated with that fix
Attachment #282484 -
Attachment is obsolete: true
Attachment #282508 -
Flags: review?
Attachment #282484 -
Flags: review?(masayuki)
Assignee • Updated • 17 years ago
Attachment #282508 -
Attachment is patch: true
Attachment #282508 -
Attachment mime type: application/text → text/plain
Attachment #282508 -
Flags: review? → review?(masayuki)
(In reply to comment #6)
> (In reply to comment #2)
> > Sounds like the line-breaking spec is simple enough. If the punctuations are
> > only in Tibetan range (U+0F00 - U+0FFF), we can implement the line-breaking
> > ourselves.
>
> I don't think we should do this. That creates extra work for us with no
> benefit. We should just add the Tibetan range to IS_COMPLEX.

hmm.. I think that we can fix this with changing the table of the character class. So, I think the risk is low if UAX#14 is a good spec. And are the three platforms supporting Tibetan?
Assignee • Comment 12 • 17 years ago
> I think that we can fix this with changing the table of the character
> class.

We can implement the complete UAX #14 spec for Tibetan by changing the character class table? It's still extra work for us, now and in maintenance later. If we're going to write our own code I think we need a positive reason to do so. I don't think guaranteed cross-platform consistency of Tibetan line breaking is a good enough reason.

> And are the three platforms supporting Tibetan?

This patch works for me on Mac, so I guess Mac does at least. If someone tested this on Linux or Windows and found it doesn't work, that would be a good reason to have our own Tibetan support.
Comment on attachment 282508
updated fix

>> I think that we can fix this with changing the table of the character
>> class.
>
> We can implement the complete UAX #14 spec for Tibetan by changing the
> character class table?

I think so. BA and EX are CLASS_CLOSE and GL are CLASS_NON_BREAKABLE. And other combinable characters are CLASS_CLOSE_LIKE_CHARACTER. And other characters are CLASS_CHARACTER.

> It's still extra work for us, now and in maintenance later. If we're going to
> write our own code I think we need a positive reason to do so. I don't think
> guaranteed cross-platform consistency of Tibetan line breaking is a good enough
> reason.

But if you don't want to work on it in the current cycle, I don't have any objection. Let's go with this. But if this doesn't work well on Win/Linux, let's try to implement it ourselves. And I hope that the Tibetan implementation will be our own code in the next cycle (Mozilla2 or later).
Attachment #282508 -
Flags: review?(masayuki) → review+
Reporter • Comment 14 • 17 years ago
> BA and EX are CLASS_CLOSE and GL are CLASS_NON_BREAKABLE.
This cannot be right because BA and EX have very different semantics. BA allows a break after and EX inhibits a break before (even if it would otherwise be allowed). I'm pretty sure CLASS_CLOSE would do nothing useful whatsoever for Tibetan; syllables would still be stuck together in one never-ending long line.
If the proposed patch above (to use platform-native line breaking) gets into the nightly builds, I can test it on Linux, and ask someone else to test it on Windows. But I don't have sufficient hardware to build the beast myself. :(
(In reply to comment #14)
> > BA and EX are CLASS_CLOSE and GL are CLASS_NON_BREAKABLE.
>
> This cannot be right because BA and EX have very different semantics. BA allows
> a break after and EX inhibits a break before (even if it would otherwise be
> allowed). I'm pretty sure CLASS_CLOSE would do nothing useful whatsoever for
> Tibetan; syllables would still be stuck together in one never-ending long line.

There are no differences between them on our implementation.
(We cannot prohibit the break around spaces.)

The current implementation:
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#51
Assignee • Comment 16 • 17 years ago
Comment on attachment 282508
updated fix

Rather simple and low-risk fix to make Tibetan line breaking work.
Attachment #282508 -
Flags: approval1.9?
Reporter • Comment 17 • 17 years ago
> There are no differences between them on our implementation.
> (We cannot prohibit the break around spaces.)
But you can prohibit the break after a Tibetan tsheg, which is the whole point. Tsheg needs to allow a break after, except when followed by class EX characters. Allowing the break is the absolutely essential thing that's missing right now; without the break, huge runs of text will stick together all on one line. However once the break is allowed, you then need to compensate to avoid breaking up punctuation...
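To make the BA/EX distinction concrete, here is a small C++ sketch of just that pair rule, using the Tibetan BA and EX code points listed earlier in this bug. The function names are hypothetical, and this deliberately ignores every other UAX#14 class (GL, spaces, and so on), so it is an illustration of the semantics rather than a usable line breaker:

```cpp
// Class BA (break after), restricted to the Tibetan code points listed
// earlier in this bug.
constexpr bool IsTibetanBA(char32_t c) {
  return c == 0x0F0B || c == 0x0F34 || c == 0x0F7F ||
         c == 0x0F85 || c == 0x0FBE || c == 0x0FBF;
}

// Class EX (no break before), restricted to the Tibetan code points listed
// earlier in this bug: U+0F0D..U+0F11 and U+0F14.
constexpr bool IsTibetanEX(char32_t c) {
  return (c >= 0x0F0D && c <= 0x0F11) || c == 0x0F14;
}

// A break opportunity exists after a BA character, unless the following
// character is EX, which inhibits a break before itself.
constexpr bool MayBreakBetween(char32_t before, char32_t after) {
  return IsTibetanBA(before) && !IsTibetanEX(after);
}
```

With only a single shared class for BA and EX there is no way to express both halves of `MayBreakBetween`, which is the concern raised above.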
OS: All → Linux
Version: Trunk → unspecified
Reporter • Comment 18 • 17 years ago
So, is there anything holding up this patch? Anything I (or others familiar with Tibetan) can do to hurry it along?
Assignee • Comment 19 • 17 years ago
No, we'll just have to wait for approval.
Comment on attachment 282508
updated fix

a1.9=dbaron
Attachment #282508 -
Flags: approval1.9? → approval1.9+
Assignee • Comment 21 • 17 years ago
checked in.
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED