Closed Bug 820261 Opened 12 years ago Closed 2 years ago

Use ICU line breaker instead of

Categories

(Core :: Internationalization, defect)

x86
Windows 8
defect

Tracking


RESOLVED DUPLICATE of bug 1719535

People

(Reporter: m_kato, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

If we use libicu, we should replace our line breaker with ICU's.
Blocks: 933631
Depends on: 864843, 866301
ICU isn't compatible with our line breaker, so some tests in our tree fail. Also, although ICU implements UAX#14, the spec has bugs: http://unicode.org/pipermail/unicode/2015-April/001522.html (reported today)
I think we should keep using the current line breaker at least for ASCII characters. It includes a lot of hacks for typical web pages, e.g., breaking in long URLs, not breaking in emoticons, not breaking between arguments of a command line. IIRC, WebKit/Blink also use their own rules for ASCII characters.
(In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #3)
> I think we should keep using the current line breaker at least for ASCII
> characters. It includes a lot of hacks for typical web pages, e.g., breaking
> in long URLs, not breaking in emoticons, not breaking between arguments of a
> command line.

Yes. We should keep the current 8-bit character code path for performance. Unicode is very complex, and its break rules are defined in UAX#14. At a minimum, we should use UAX#14 to implement the loose and strict values of the line-break property.
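To make the kinds of hacks described above concrete, here is a toy sketch (not Gecko's actual code; the emoticon set and rules are invented for illustration) of ASCII-specific heuristics: breaks are allowed after spaces and after '/' in URL-like runs, but never inside a known emoticon.

```python
# Hypothetical, tiny sample set of emoticons that must never be split.
EMOTICONS = {":-)", ":-(", ";-)"}

def break_opportunities(text):
    """Return indices where a line break would be allowed (toy rules)."""
    breaks = []
    i = 0
    while i < len(text):
        # Rule: never break inside a known emoticon.
        emo = next((e for e in EMOTICONS if text.startswith(e, i)), None)
        if emo:
            i += len(emo)
            continue
        ch = text[i]
        if ch == " ":
            breaks.append(i + 1)  # ordinary break after a space
        elif ch == "/" and i > 0 and text[i - 1] != " ":
            breaks.append(i + 1)  # allow breaking inside long URL-like runs
        i += 1
    return breaks
```

A real implementation layers many more such special cases on top of the generic rules, which is why the comment argues for keeping the ASCII path.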
Depends on: 1215247
Depends on: 1215252

Has there been any news on this front? Has ICU line breaking been enabled for non-ASCII characters?

I just used the latest version of Firefox to test laobible.sea-sda.org (content that doesn't use zero-width spaces/ZWSP between words), and Firefox still doesn't tokenize the individual words. If I double-click, I typically get highlights of whole phrases. Then, after resizing the window to be much smaller, I found that words are often cut in the middle by line breaks (vowels of syllables placed on the next line, words cut in half, etc.).

Both word tokenization and proper line-breaking are important in Lao. Mobile sites will show horizontal scrolling, and desktop sites will show poorly wrapped text. The ability to double click on the word, without zwsp or spaces being used in the phrase, is also extremely convenient.

This effectively continues to render Firefox incompatible with the Lao language for most users. Few ordinary Lao people type with zero-width spaces, especially on mobile. Though there are places for ZWSP, it shouldn't become a requirement for the general public.
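For readers unfamiliar with ZWSP: U+200B is an invisible character that marks word boundaries in scripts like Lao that are written without spaces between words. A trivial sketch (the function name is mine) of what a browser gets for free when ZWSP is present, and what it loses when it isn't:

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, an invisible break-opportunity marker

def words_from_zwsp(text):
    """Split text on ZWSP markers; with no markers, it stays one run."""
    return [w for w in text.split(ZWSP) if w]

# With markers, words separate cleanly; without them, the whole run is
# one token, and segmentation needs a dictionary-based break iterator.
```

This is exactly why content authored without ZWSP (the common case) needs real segmentation support in the engine.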

Anyways, just my thoughts. Let me know if I can be of any help!

I had hoped to work on this further but have been blocked due to lack of try server access:
https://bugzilla.mozilla.org/show_bug.cgi?id=1636341

The attached patch solves the Tibetan issues I was having, but has not been verified with the try server.
A limited set of relevant tests passed on my local machine.

Attachment #9152759 - Attachment is patch: true

We won't turn on ICU's break iterator due to the size issue (over 5 MB in the package). I am considering a new implementation (https://github.com/makotokato/uax14_rs) using UAX#14-compatible code in Rust. Also, ICU doesn't have a Latin-1-only breaker, so it might cause a perf issue for Latin-1-only sites.
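For reference, a minimal sketch (not the uax14_rs code) of the pair-table approach UAX#14 describes: classify each character into a line-break class, then consult a table of class pairs to decide whether a break is allowed between adjacent characters. The classes and table below are a tiny hypothetical subset of the real spec.

```python
# Hypothetical subset of UAX#14 line-break classes.
def lb_class(ch):
    if ch == " ":
        return "SP"   # space
    if ch == "-":
        return "HY"   # hyphen
    return "AL"       # alphabetic (catch-all in this sketch)

# (class before, class after) -> is a break allowed between them?
PAIR_TABLE = {
    ("SP", "AL"): True,   # break after a space
    ("HY", "AL"): True,   # break after a hyphen
}

def breaks(text):
    """Indices where a break is allowed, per the simplified pair table."""
    return [i for i in range(1, len(text))
            if PAIR_TABLE.get((lb_class(text[i - 1]), lb_class(text[i])),
                              False)]
```

The real spec has dozens of classes and ordered rules rather than a flat table, but the data-driven shape is the same, which is what makes a compact Rust implementation plausible.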

Also, using the UAX#14 rules causes failures in our unit tests.

So this should move to bug 1290022 or new bug.

I know that 5MB is a lot of data for some devices.

Many SE Asian scripts rely on the break iterators that ICU provides for their browser (currently Chrome or Safari). That 5 MB of break iterator data helps make it possible to type natively in Lao (7 million speakers), Thai (70 million), Khmer (16 million), and Burmese (53 million): over 146 million people in total who can type on their phones using standard keyboard input.

Desktops benefit from this too: users who don't want to mess with invisible spaces can type away and not be shown broken sites (Facebook and Twitter, for example). I know that most users in the region don't type ZWSPs when they post on FB, Twitter, etc. (because they use Chrome/Safari, and Chrome/Safari ship ICU break iterators), and those sites probably don't add ZWSPs to the content on the server end, so that would cause all kinds of text layout issues for anyone browsing that content in Firefox. And because Tor Browser uses Firefox as a base, all of those users have a harder time typing in their own language too.

Or would that new implementation in Rust still provide break iteration for these languages without the 5MB? How would that be accomplished? The only industry standard way currently is through a combination of dictionary lookups and algorithms that require expert understanding of each language's word construction. Are you all aiming to tackle that?
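The "dictionary lookups and algorithms" approach mentioned above can be sketched in miniature (this is not ICU's implementation, which uses weighted tries and language-specific heuristics; the dictionary here is a hypothetical sample): greedy longest-match against a word list, with a single-character fallback for unknown text.

```python
SAMPLE_DICT = {"ab", "abc", "cd", "d"}  # hypothetical dictionary entries

def greedy_segment(text, dictionary):
    """Greedy longest-match segmentation; unknown chars emitted alone."""
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary match starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # no match: fall back to one character
            i += 1
    return words
```

The hard part is not this loop but the dictionaries themselves, plus the per-language rules that resolve ambiguous matches, which is what the 5 MB of ICU data encodes.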

Or can't you all ship two builds (a 'Latin' build and a rest of the world build)?

Sorry if this is controversial, but I really don't understand some of the restrictions.

As part of bug 1423593 ICU's break iterator data will be included anyway, so the size issue shouldn't matter anymore.

Severity: normal → S3

We've integrated ICU4X segmenter in bug 1719535, which supports word boundaries for Chinese, Japanese, and SE Asian languages (including Lao).

Status: NEW → RESOLVED
Closed: 2 years ago
Duplicate of bug: 1719535
Resolution: --- → DUPLICATE