Closed Bug 397597 Opened 17 years ago Closed 17 years ago

Firefox ignores Unicode linebreak semantics needed for Tibetan script

Categories

(Core :: Layout: Text and Fonts, defect)

All
Linux
defect
Not set
normal

RESOLVED FIXED

People

(Reporter: dalias, Assigned: roc)

Details

(Keywords: intl)

Attachments

(1 file, 1 obsolete file)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a9pre) Gecko/2007092504 Minefield/3.0a9pre
Build Identifier: Minefield 3.0a9pre

Tibetan script does not use spaces but instead delimits word-like units with a "tsheg", U+0F0B. This character has Unicode line break properties such that line breaking should be possible after it unless it's followed by certain other punctuation, but Firefox is unable to break long runs of Tibetan text and instead forces horizontal scrolling.

I believe this is a symptom of a much larger problem, that the developers seem to believe line breaking is a matter of spaces, CJK, and Thai. In reality there are many more languages/scripts that either need special-casing or proper application of the Unicode line breaking semantics. The latter is of course a lot more sane than trying to figure it all out all over again without consulting experts in countless languages.

If fixing this correctly in time for Firefox 3 is not feasible, could you at least include special casing for the Tibetan tsheg and check to see if there are any other languages with similar requirements?


Reproducible: Always

Steps to Reproduce:
View a long run of Tibetan text, e.g. paste the following string over and over in an html file and then view it:
"བོད་སྐད་"

Actual Results:  
Horizontal scroll bar appears or text overruns its block element.

Expected Results:  
Text should wrap after the tsheg "་" mark when it gets to the end of a line.
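For anyone who wants to reproduce this without pasting by hand, here is a minimal sketch of a test-page generator. The function name `makeTibetanTestPage` and the page skeleton are made up for illustration; only the sample string comes from the report.

```cpp
#include <string>

// Hypothetical helper: builds a small HTML page containing `repeats`
// back-to-back copies of the sample string from the report (no spaces).
std::string makeTibetanTestPage(int repeats) {
    // UTF-8 bytes for U+0F56 U+0F7C U+0F51 U+0F0B U+0F66 U+0F90 U+0F51 U+0F0B.
    const std::string sample =
        "\xE0\xBD\x96\xE0\xBD\xBC\xE0\xBD\x91\xE0\xBC\x8B"
        "\xE0\xBD\xA6\xE0\xBE\x90\xE0\xBD\x91\xE0\xBC\x8B";

    std::string page =
        "<html><head><meta http-equiv=\"Content-Type\" "
        "content=\"text/html; charset=utf-8\"></head><body><p>";
    for (int i = 0; i < repeats; ++i) {
        page += sample;  // each copy ends in a tsheg, a legal break point
    }
    page += "</p></body></html>\n";
    return page;
}
```

Writing the result of `makeTibetanTestPage(500)` to a file and opening it should be enough to show the horizontal scroll bar on an affected build.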
> I believe this is a symptom of a much larger problem, that the developers seem
> to believe line breaking is a matter of spaces, CJK, and Thai.

Not at all.

To fix this, we just need to add Tibetan characters (and other characters that require complex treatment) to the IS_COMPLEX macro here. (This was always part of the plan, that's why IS_COMPLEX isn't called IS_THAI.)
http://mxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#419
Then runs of Tibetan characters will be broken by Uniscribe, Pango or ATSUI. I'm assuming those engines are good enough.

Alternatively we could make IS_COMPLEX return true except for ranges of characters that we know how to break (Latin, CJK, and some other ranges). The problem with being conservative in that direction is that it increases the number of situations where we could have differences in line breaking between platforms, which is something we want to minimize.
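As a rough illustration of that inverted approach, the sketch below treats every character as complex unless it falls in a range the built-in breaker is assumed to handle. The function names and the specific ranges are illustrative assumptions, not the contents of any real table.

```cpp
// Hypothetical allow-list of ranges the built-in breaker is assumed to
// handle. These ranges are examples only, not an audited list.
constexpr bool isKnownSimpleRange(char32_t c) {
    return (c <= 0x024F)                 // Basic Latin through Latin Extended-B
        || (0x3040 <= c && c <= 0x30FF)  // Hiragana and Katakana
        || (0x4E00 <= c && c <= 0x9FFF); // CJK Unified Ideographs
}

// Inverted predicate: anything outside the allow-list goes to the
// platform engine (Uniscribe, Pango or ATSUI).
constexpr bool isComplexInverted(char32_t c) {
    return !isKnownSimpleRange(c);
}
```

With this shape, Tibetan and Thai are routed to the platform engine without being named, which is exactly why cross-platform differences become more likely.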
Sounds like the line-breaking spec is simple enough. If the punctuations are only in Tibetan range (U+0F00 - U+0FFF), we can implement the line-breaking ourselves.
dalias:

Can you list all punctuations which suppress the line-breaking?
If you really want to do the implementation yourself, look in Unicode's LineBreak.txt and at UAX#14,
http://www.unicode.org/reports/tr14/tr14-19.html#BA
http://www.unicode.org/reports/tr14/tr14-19.html#EX
http://www.unicode.org/reports/tr14/tr14-19.html#GL
http://www.unicode.org/reports/tr14/tr14-19.html#TibetanLinebreaking

The characters which would suppress breaking after tsheg are the ones of class EX.

Getting something sufficiently correct to be "usable" is as simple as I originally described, but correctly handling classical texts and making the optimal linebreak choices for them requires dealing with a good many UAX#14 line break classes, and I suspect it would be preferable to apply it fully. I am not an expert in these sorts of texts so I really cannot advise on what is and isn't needed, but if you wish to pursue that path I can potentially get you in touch with an expert.

And yes, all the punctuation relevant to linebreaking for Tibetan should be in the range U+0F00 - U+0FFF. CJK double-angle-bracket quotation marks and some western punctuation are also used at times, but I can't think of any interaction these have with Tibetan aside from their usual line breaking properties. Just wanted to mention that in case it is relevant to the way your code works.

Also apologies for being presumptuous about the developers' treatment of line breaking. The only source I was going on was an old Bugzilla thread on the matter where I got that impression. Glad to see it's not the case!
Some other characters (defined in BA or EX) should be breakable in Tibetan, right? And if we can support them, is it enough for Tibetan text?

BA:
0F0B  TIBETAN MARK INTERSYLLABIC TSHEG
0F85  TIBETAN MARK PALUTA
0F34  TIBETAN MARK BSDUS RTAGS
0F7F  TIBETAN SIGN RNAM BCAD
0FBE  TIBETAN KU RU KHA
0FBF  TIBETAN KU RU KHA BZHI MIG CAN

EX:
0F0D  TIBETAN MARK SHAD
0F0E  TIBETAN MARK NYIS SHAD
0F0F  TIBETAN MARK TSHEG SHAD
0F10  TIBETAN MARK NYIS TSHEG SHAD
0F11  TIBETAN MARK RIN CHEN SPUNGS SHAD
0F14  TIBETAN MARK GTER TSHEG

# I cannot trust UAX#14 until readers of the language confirm it :-(
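The two lists above could be captured in a small classifier along these lines. This is only a sketch: the enum and function names are hypothetical, and the code points are copied from this comment rather than checked against LineBreak.txt.

```cpp
// Hypothetical classes covering only the characters listed in this comment.
enum class TibetanBreakClass { BA, EX, Other };

// Maps a code point to the class it was listed under above.
inline TibetanBreakClass classifyTibetan(char32_t c) {
    switch (c) {
        case 0x0F0B:  // INTERSYLLABIC TSHEG
        case 0x0F34:  // BSDUS RTAGS
        case 0x0F7F:  // RNAM BCAD
        case 0x0F85:  // PALUTA
        case 0x0FBE:  // KU RU KHA
        case 0x0FBF:  // KU RU KHA BZHI MIG CAN
            return TibetanBreakClass::BA;
        case 0x0F0D:  // SHAD
        case 0x0F0E:  // NYIS SHAD
        case 0x0F0F:  // TSHEG SHAD
        case 0x0F10:  // NYIS TSHEG SHAD
        case 0x0F11:  // RIN CHEN SPUNGS SHAD
        case 0x0F14:  // GTER TSHEG
            return TibetanBreakClass::EX;
        default:
            return TibetanBreakClass::Other;
    }
}
```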
(In reply to comment #2)
> Sounds like the line-breaking spec is simple enough. If the punctuations are
> only in Tibetan range (U+0F00 - U+0FFF), we can implement the line-breaking
> ourselves.

I don't think we should do this. That creates extra work for us with no benefit. We should just add the Tibetan range to IS_COMPLEX.
The typical hack websites seem to do for line breaking is to insert the (invalid html) <wbr> element after each tsheg (U+0F0B) that's not immediately followed by a shad. The blog at http://tibetanportal.com/myownblog/?lp_lang_pref=bo has JS code to perform this on the client side, and if you read that you'll find that <wbr> is inserted after any tsheg not followed by U+0F0D. This does not respect the other types of shads (of class EX) and will probably cause problems, but it's okay for simple modern writing.
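To make the workaround concrete, here is a minimal sketch of the same transformation on a UTF-32 string. It is not the blog's script; the function name `insertWbrAfterTsheg` is made up, and it checks only U+0F0D, as described.

```cpp
#include <cstddef>
#include <string>

// Sketch of the site-side hack: append "<wbr>" after every tsheg (U+0F0B)
// that is not immediately followed by a shad (U+0F0D). Other EX-class
// marks are deliberately ignored, mirroring the limitation noted above.
std::u32string insertWbrAfterTsheg(const std::u32string& text) {
    const std::u32string wbr = U"<wbr>";
    std::u32string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        out += text[i];
        const bool isTsheg = text[i] == 0x0F0B;
        const bool nextIsShad =
            i + 1 < text.size() && text[i + 1] == 0x0F0D;
        if (isTsheg && !nextIsShad) {
            out += wbr;
        }
    }
    return out;
}
```

A tsheg at the very end of the input also gets a `<wbr>`, which should be harmless.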

If you want a naive algorithm that's more robust, I would say to consider breaking after U+0F0B only when it's followed by a Tibetan letter. That might miss a few valid line break opportunities but it will prevent lines from overrunning the browser window and will never give false positives (breaks where they shouldn't be). This should be extremely easy to implement too.
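A minimal sketch of that naive rule follows. The names are hypothetical, and "Tibetan letter" is assumed here to mean the consonant range U+0F40..U+0F6C, which is an approximation (it includes at least one unassigned code point).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: treat U+0F40..U+0F6C as "Tibetan letters".
constexpr bool isTibetanLetter(char32_t c) {
    return 0x0F40 <= c && c <= 0x0F6C;
}

// Returns every index i such that a line break is allowed between
// text[i - 1] and text[i]: only after a tsheg that is directly
// followed by a Tibetan letter.
std::vector<std::size_t> naiveTibetanBreaks(const std::u32string& text) {
    std::vector<std::size_t> breaks;
    for (std::size_t i = 1; i < text.size(); ++i) {
        if (text[i - 1] == 0x0F0B && isTibetanLetter(text[i])) {
            breaks.push_back(i);
        }
    }
    return breaks;
}
```

Because a break is only reported when a letter follows, a tsheg followed by a shad or any other punctuation never produces a break, at the cost of missing some legitimate opportunities.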
Attached patch fix (obsolete)
This fixes it. I strongly believe this is better than coding up our own handling of Tibetan and goodness knows what else.
Assignee: nobody → roc
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Attachment #282484 - Flags: review?
Attachment #282484 - Flags: review? → review?(masayuki)
The comment here is wrong because the range excludes Lao:

+  return (0x0e01 <= aChar && aChar <= 0x0e5b) || // Thai, Lao
+         (0x0f00 <= aChar && aChar <= 0x0fff);   // Tibetan

Is that intentional? If Lao is supposed to be included, then you should just make it all one range since the three scripts are adjacent, and that will negate any potential performance loss from testing multiple ranges.
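To show the difference, here is a sketch comparing the two-range test from the patch with the single-range version being suggested. Both function names are hypothetical stand-ins, not the names used in the tree.

```cpp
// Two ranges, as in the patch excerpt above: covers Thai and Tibetan,
// but skips the Lao block (U+0E80..U+0EFF).
constexpr bool isComplexTwoRanges(char32_t c) {
    return (0x0e01 <= c && c <= 0x0e5b) ||
           (0x0f00 <= c && c <= 0x0fff);
}

// One contiguous range: Thai, Lao and Tibetan together, with a single
// pair of comparisons.
constexpr bool isComplexOneRange(char32_t c) {
    return 0x0e01 <= c && c <= 0x0fff;
}

// U+0E81 (a Lao letter) is the kind of character the two-range test misses.
static_assert(!isComplexTwoRanges(0x0E81), "Lao is excluded by two ranges");
static_assert(isComplexOneRange(0x0E81), "Lao is included by one range");
```

The single range also sweeps in the unassigned code points between the blocks, which should not matter in practice.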
Attached patch updated fix
Updated with that fix
Attachment #282484 - Attachment is obsolete: true
Attachment #282508 - Flags: review?
Attachment #282484 - Flags: review?(masayuki)
Attachment #282508 - Attachment is patch: true
Attachment #282508 - Attachment mime type: application/text → text/plain
Attachment #282508 - Flags: review? → review?(masayuki)
(In reply to comment #6)
> (In reply to comment #2)
> > Sounds like the line-breaking spec is simple enough. If the punctuations are
> > only in Tibetan range (U+0F00 - U+0FFF), we can implement the line-breaking
> > ourselves.
> 
> I don't think we should do this. That creates extra work for us with no
> benefit. We should just add the Tibetan range to IS_COMPLEX.

hmm.. I think that we can fix this with changing the table of the character class. So, I think the risk is low if UAX#14 is a good spec. And are the three platforms supporting Tibetan?
> I think that we can fix this with changing the table of the character
> class.

We can implement the complete UAX #14 spec for Tibetan by changing the character class table?

It's still extra work for us, now and in maintenance later. If we're going to write our own code I think we need a positive reason to do so. I don't think guaranteed cross-platform consistency of Tibetan line breaking is a good enough reason.

> And are the three platforms supporting Tibetan?

This patch works for me on Mac, so I guess Mac does at least. If someone tested this on Linux or Windows and found it doesn't work, that would be a good reason to have our own Tibetan support.
Comment on attachment 282508 [details] [diff] [review]
updated fix

>> I think that we can fix this with changing the table of the character
>> class.
> 
> We can implement the complete UAX #14 spec for Tibetan by changing the
> character class table?

I think so.

BA and EX are CLASS_CLOSE and GL are CLASS_NON_BREAKABLE. And other combinable characters are CLASS_CLOSE_LIKE_CHARACTER. And other characters are CLASS_CHARACTER.

> It's still extra work for us, now and in maintenance later. If we're going to
> write our own code I think we need a positive reason to do so. I don't think
> guaranteed cross-platform consistency of Tibetan line breaking is a good enough
> reason.

But if you don't want to take on that work in the current cycle, I have no objection. Let's go with this.

But if this doesn't work well on Win/Linux, let's try to implement it ourselves. And I hope that Tibetan line breaking will be our own code in the next cycle (Mozilla2 or later).
Attachment #282508 - Flags: review?(masayuki) → review+
> BA and EX are CLASS_CLOSE and GL are CLASS_NON_BREAKABLE.

This cannot be right because BA and EX have very different semantics. BA allows a break after and EX inhibits a break before (even if it would otherwise be allowed). I'm pretty sure CLASS_CLOSE would do nothing useful whatsoever for Tibetan; syllables would still be stuck together in one never-ending long line.

If the proposed patch above (to use platform-native line breaking) gets into the nightly builds, I can test it on Linux, and ask someone else to test it on Windows. But I don't have sufficient hardware to build the beast myself. :(
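The BA/EX distinction described in this comment can be sketched as a pair rule over adjacent characters. This is a deliberately reduced model with hypothetical names; it covers only the two classes discussed here and is not an implementation of UAX#14.

```cpp
// Reduced set of classes, just enough to show why BA and EX differ.
enum class LineBreakClass { BA, EX, Other };

// Decides whether a break is allowed between two adjacent characters,
// given the class of the one before and the one after the position.
inline bool breakAllowedBetween(LineBreakClass before, LineBreakClass after) {
    if (after == LineBreakClass::EX) {
        return false;  // EX inhibits a break before it, even after BA
    }
    return before == LineBreakClass::BA;  // BA allows a break after it
}
```

Mapping both classes to one value, as with CLASS_CLOSE, would make the two branches indistinguishable, so either the tsheg never yields a break or a shad can be orphaned at the start of a line.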
(In reply to comment #14)
> > BA and EX are CLASS_CLOSE and GL are CLASS_NON_BREAKABLE.
> 
> This cannot be right because BA and EX have very different semantics. BA allows
> a break after and EX inhibits a break before (even if it would otherwise be
> allowed). I'm pretty sure CLASS_CLOSE would do nothing useful whatsoever for
> Tibetan; syllables would still be stuck together in one never-ending long line.

There are no differences between them on our implementation. (We cannot prohibit the break around spaces.)
The current implementation: http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#51
Keywords: intl
OS: Linux → All
Version: unspecified → Trunk
Comment on attachment 282508 [details] [diff] [review]
updated fix

rather simple and low-risk fix to make Tibetan linebreaking work
Attachment #282508 - Flags: approval1.9?
> There are no differences between them on our implementation.
> (We cannot prohibit the break around spaces.)

But you can prohibit the break after a Tibetan tsheg, which is the whole point. Tsheg needs to allow a break after, except when followed by class EX characters. Allowing the break is the absolutely essential thing that's missing right now; without the break, huge runs of text will stick together all on one line. However once the break is allowed, you then need to compensate to avoid breaking up punctuation...
OS: All → Linux
Version: Trunk → unspecified
So, is there anything holding up this patch? Anything I (or others familiar with Tibetan) can do to hurry it along?
No, we'll just have to wait for approval.
Comment on attachment 282508 [details] [diff] [review]
updated fix

a1.9=dbaron
Attachment #282508 - Flags: approval1.9? → approval1.9+
checked in.
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED