Closed Bug 820261 Opened 12 years ago Closed 2 years ago

Use ICU line breaker instead of

Categories

(Core :: Internationalization, defect)

x86
Windows 8
defect

Tracking


RESOLVED DUPLICATE of bug 1719535

People

(Reporter: m_kato, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

If we use libicu, we should replace our line breaker with ICU's.
Blocks: 933631
Depends on: 864843, 866301
ICU isn't compatible with our line breaker, so some tests in our tree fail. Also, although ICU implements UAX#14, the spec has bugs: http://unicode.org/pipermail/unicode/2015-April/001522.html (reported today)
I think we should keep using the current line breaker at least for ASCII characters. It includes a lot of hacks for typical web pages, e.g., breaking in long URLs, not breaking in emoticons, not breaking between arguments of a command line. IIRC, WebKit/Blink also use their own rules for ASCII characters.
(In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #3)
> I think we should keep using the current line breaker at least for ASCII
> characters. It includes a lot of hacks for typical web pages, e.g., breaking
> in long URLs, not breaking in emoticons, not breaking between arguments of a
> command line.

Yes. We should keep the current 8-bit character code path for performance. Unicode is very complex, and its break rules are defined in UAX#14. At a minimum, we should use UAX#14 to implement the loose and strict values of the line-break property.
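To make the kinds of hacks described above concrete, here is a toy sketch (not Gecko's actual code; the emoticon set and rules are invented for illustration) of ASCII-specific heuristics: breaks are allowed after spaces and after '/' in URL-like runs, but never inside a known emoticon.

```python
# Hypothetical, tiny sample set of emoticons that must never be split.
EMOTICONS = {":-)", ":-(", ";-)"}

def break_opportunities(text):
    """Return indices where a line break would be allowed (toy rules)."""
    breaks = []
    i = 0
    while i < len(text):
        # Rule: never break inside a known emoticon.
        emo = next((e for e in EMOTICONS if text.startswith(e, i)), None)
        if emo:
            i += len(emo)
            continue
        ch = text[i]
        if ch == " ":
            breaks.append(i + 1)  # ordinary break after a space
        elif ch == "/" and i > 0 and text[i - 1] != " ":
            breaks.append(i + 1)  # allow breaking inside long URL-like runs
        i += 1
    return breaks
```

A real implementation layers many more such special cases on top of the generic rules, which is why the comment argues for keeping the ASCII path.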
Depends on: 1215247
Depends on: 1215252

Has there been any news on this front? Has ICU line breaking been enabled for non-ASCII characters?

I just used the latest version of Firefox to test laobible.sea-sda.org (content that doesn't use zero-width spaces/ZWSP between words), and Firefox still doesn't tokenize the individual words. If I double-click, I typically get highlights of whole phrases. Then, after resizing the window to be much smaller, I found that words are often cut in the middle by line breaks (vowels of syllables placed on the next line, words cut in half, etc.).

Both word tokenization and proper line-breaking are important in Lao. Mobile sites will show horizontal scrolling, and desktop sites will show poorly wrapped text. The ability to double click on the word, without zwsp or spaces being used in the phrase, is also extremely convenient.

This effectively continues to render Firefox incompatible with the Lao language for most users. Few ordinary Lao people type with zero-width spaces, especially on mobile. Though there are places for ZWSP, it shouldn't become a requirement for the general public.
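For readers unfamiliar with ZWSP: U+200B is an invisible character that marks word boundaries in scripts like Lao that are written without spaces between words. A trivial sketch (the function name is mine) of what a browser gets for free when ZWSP is present, and what it loses when it isn't:

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, an invisible break-opportunity marker

def words_from_zwsp(text):
    """Split text on ZWSP markers; with no markers, it stays one run."""
    return [w for w in text.split(ZWSP) if w]

# With markers, words separate cleanly; without them, the whole run is
# one token, and segmentation needs a dictionary-based break iterator.
```

This is exactly why content authored without ZWSP (the common case) needs real segmentation support in the engine.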

Anyways, just my thoughts. Let me know if I can be of any help!

I had hoped to work on this further but have been blocked due to lack of try server access:
https://bugzilla.mozilla.org/show_bug.cgi?id=1636341

The attached patch solves the Tibetan issues I was having, but has not been verified with the try server.
A limited set of relevant tests passed on my local machine.

Attachment #9152759 - Attachment is patch: true

We won't turn on ICU's break iterator due to the size issue (over 5 MB in the package). I am considering a new implementation (https://github.com/makotokato/uax14_rs) using UAX#14-compatible code in Rust. Also, ICU doesn't have a Latin-1-only breaker, so it might cause a perf issue for Latin-1-only sites.
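For reference, a minimal sketch (not the uax14_rs code) of the pair-table approach UAX#14 describes: classify each character into a line-break class, then consult a table of class pairs to decide whether a break is allowed between adjacent characters. The classes and table below are a tiny hypothetical subset of the real spec.

```python
# Hypothetical subset of UAX#14 line-break classes.
def lb_class(ch):
    if ch == " ":
        return "SP"   # space
    if ch == "-":
        return "HY"   # hyphen
    return "AL"       # alphabetic (catch-all in this sketch)

# (class before, class after) -> is a break allowed between them?
PAIR_TABLE = {
    ("SP", "AL"): True,   # break after a space
    ("HY", "AL"): True,   # break after a hyphen
}

def breaks(text):
    """Indices where a break is allowed, per the simplified pair table."""
    return [i for i in range(1, len(text))
            if PAIR_TABLE.get((lb_class(text[i - 1]), lb_class(text[i])),
                              False)]
```

The real spec has dozens of classes and ordered rules rather than a flat table, but the data-driven shape is the same, which is what makes a compact Rust implementation plausible.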

Also, using the UAX#14 rules causes failures in our unit tests.

So this should move to bug 1290022 or new bug.

I know that 5MB is a lot of data for some devices.

Many SE Asian scripts rely on the break iterators that ICU provides for their browser (currently Chrome or Safari). That 5 MB of break iterator data helps make it possible to type natively in Lao (7 million speakers), Thai (70 million), Khmer (16 million), and Burmese (53 million): over 146 million people in total who can type on their phones using standard keyboard input.

Desktops benefit from this too: users who don't want to mess with invisible spaces can type away and not be shown broken sites (Facebook and Twitter, for example). I know that most users in the region don't type ZWSPs when they post on FB, Twitter, etc. (because they use Chrome/Safari, and Chrome/Safari ship ICU break iterators), and those sites probably don't add ZWSPs to the content on the server end, so that would cause all kinds of text layout issues for anyone browsing that content in Firefox. And because Tor Browser uses Firefox as a base, all of those users have a harder time typing in their own language too.

Or would that new implementation in Rust still provide break iteration for these languages without the 5MB? How would that be accomplished? The only industry standard way currently is through a combination of dictionary lookups and algorithms that require expert understanding of each language's word construction. Are you all aiming to tackle that?
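The "dictionary lookups and algorithms" approach mentioned above can be sketched in miniature (this is not ICU's implementation, which uses weighted tries and language-specific heuristics; the dictionary here is a hypothetical sample): greedy longest-match against a word list, with a single-character fallback for unknown text.

```python
SAMPLE_DICT = {"ab", "abc", "cd", "d"}  # hypothetical dictionary entries

def greedy_segment(text, dictionary):
    """Greedy longest-match segmentation; unknown chars emitted alone."""
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary match starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # no match: fall back to one character
            i += 1
    return words
```

The hard part is not this loop but the dictionaries themselves, plus the per-language rules that resolve ambiguous matches, which is what the 5 MB of ICU data encodes.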

Or can't you all ship two builds (a 'Latin' build and a rest of the world build)?

Sorry if this is controversial, but I really don't understand some of the restrictions.

As part of bug 1423593 ICU's break iterator data will be included anyway, so the size issue shouldn't matter anymore.

Severity: normal → S3

We've integrated ICU4X segmenter in bug 1719535, which supports word boundaries for Chinese, Japanese, and SE Asian languages (including Lao).

Status: NEW → RESOLVED
Closed: 2 years ago
Duplicate of bug: 1719535
Resolution: --- → DUPLICATE