Use icu line breaker instead of
Categories
(Core :: Internationalization, defect)
Tracking
()
People
(Reporter: m_kato, Unassigned)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
6.73 MB,
patch
Updated•11 years ago
Reporter
Comment 2•10 years ago
Comment 3•10 years ago
Reporter
Comment 4•10 years ago
Comment 5•6 years ago
Has there been any news on this front? Has ICU line breaking been enabled for non-ASCII characters?
I just used the latest version of Firefox to test laobible.sea-sda.org (content that doesn't use zero-width spaces (ZWSP) between words), and Firefox still doesn't tokenize the individual words. If I double-click, I typically get whole phrases highlighted. Then, after resizing the window to be much smaller, I found that words are often cut in the middle by line breaks (vowels of syllables placed on the next line, words cut in half, etc.).
Both word tokenization and proper line breaking are important in Lao. Mobile sites will show horizontal scrolling, and desktop sites will show poorly wrapped text. The ability to double-click on a word, without ZWSP or spaces being used in the phrase, is also extremely convenient.
This effectively continues to render Firefox incompatible with the Lao language for most users. Few ordinary Lao people type with zero-width spaces, especially on mobile. Though there are places for ZWSP, it shouldn't become a requirement for the general public.
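To illustrate why the absence of ZWSP matters, here is a minimal sketch of a breaker that only knows about explicit separators (spaces and U+200B). It is not Firefox's actual line breaker; the function name and the sample strings are assumptions chosen for illustration.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, an invisible explicit break opportunity


def break_opportunities(text: str) -> list[int]:
    """Return the indices at which a line may break, using only explicit separators.

    A break is allowed right after a space or ZWSP, but not at the very end.
    """
    return [
        i + 1
        for i, ch in enumerate(text)
        if ch in (" ", ZWSP) and i + 1 < len(text)
    ]


# Hypothetical sample: the same two Lao words, with and without a ZWSP between them.
with_zwsp = "ພາສາ" + ZWSP + "ລາວ"
without_zwsp = "ພາສາລາວ"

print(break_opportunities(with_zwsp))     # one break opportunity, after the ZWSP
print(break_opportunities(without_zwsp))  # no break opportunities at all
```

Without a dictionary or language-aware rules, text typed the way most people type it is one unbreakable run, so a breaker like this can only overflow the line or split it at an arbitrary point.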
Anyways, just my thoughts. Let me know if I can be of any help!
Comment 6•5 years ago
I had hoped to work on this further but have been blocked by a lack of try server access:
https://bugzilla.mozilla.org/show_bug.cgi?id=1636341
The attached patch solves the Tibetan issues I was having, but has not been verified with the try server.
A limited set of relevant tests passed on my local machine.
Updated•5 years ago
Reporter
Comment 7•5 years ago
We won't turn on ICU's break iterator because of its size (over 5MB in the package). I am considering a new implementation (https://github.com/makotokato/uax14_rs) using UAX14-compatible code written in Rust. Also, ICU doesn't have a Latin-1-only breaker, so it might cause a performance issue for Latin-1-only sites.
In addition, using the UAX14 rules actually causes failures in our unit tests.
So this should move to bug 1290022 or a new bug.
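The Latin-1 concern above can be sketched roughly as follows: text that fits entirely in Latin-1 takes a cheap path, and everything else goes to the full breaker. This is only an illustration of the idea, not Gecko or ICU code; `is_latin1` and `choose_breaker` are hypothetical names.

```python
def is_latin1(text: str) -> bool:
    """True if every code point fits in one byte (U+0000..U+00FF)."""
    return all(ord(ch) <= 0xFF for ch in text)


def choose_breaker(text: str) -> str:
    """Pick a breaking strategy for a run of text.

    "fast" stands in for a simple Latin-1-only breaker,
    "full" for a complete UAX14-style (or dictionary-based) breaker.
    """
    return "fast" if is_latin1(text) else "full"


print(choose_breaker("Hello, world"))  # fast
print(choose_breaker("ພາສາລາວ"))        # full
```

A breaker without such a fast path would pay the full cost even on pages that contain nothing but Latin-1 text.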
Comment 8•5 years ago
I know that 5MB is a lot of data for some devices.
Many SE Asian scripts rely on the Break Iterators that ICU provides for their browser (which is currently Chrome or Safari). That 5MB of Break Iterator data helps make it possible to type natively for Lao (7 million), Thai (70 million), Khmer (16 million), and Burmese (53 million): a combined total of over 146 million people who can type on their phones using standard keyboard inputs.
Desktops benefit from this too, since users who don't want to mess with invisible spaces can type away and not be shown broken sites (Facebook and Twitter, for example). I know that most users in the region don't type ZWSPs when they post on FB, Twitter, etc. (because they use Chrome/Safari, and Chrome/Safari ship ICU Break Iterators), and those sites probably don't add ZWSPs to the content on the server end, so that would cause all kinds of text layout issues for anyone browsing that content in Firefox. And because Tor Browser uses Firefox as a base, all of those users have a harder time typing in their own language too.
Or would that new implementation in Rust still provide break iteration for these languages without the 5MB? How would that be accomplished? The only industry standard way currently is through a combination of dictionary lookups and algorithms that require expert understanding of each language's word construction. Are you all aiming to tackle that?
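For readers unfamiliar with the dictionary approach mentioned above, here is a toy sketch of greedy longest-match segmentation. It is not ICU's algorithm, which is considerably more sophisticated; the function name and the Latin-letter dictionary are stand-ins chosen so the example is easy to follow.

```python
def segment(text: str, dictionary: set[str]) -> list[str]:
    """Split unspaced text into words by greedy longest dictionary match.

    Characters that start no known word are emitted one at a time.
    """
    max_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]  # unknown character: emit it alone
        words.append(candidate)
        i += len(candidate)
    return words


# Toy dictionary standing in for a real Lao/Thai/Khmer/Burmese word list.
toy_dictionary = {"the", "cat", "sat"}
print(segment("thecatsat", toy_dictionary))
```

Even this toy version shows why the data is large: the quality of the breaks depends directly on how complete the word list is for each language.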
Or can't you all ship two builds (a 'Latin' build and a rest of the world build)?
Sorry if this is controversial, but I really don't understand some of the restrictions.
Comment 9•5 years ago
As part of bug 1423593, ICU's break iterator data will be included anyway, so the size issue shouldn't matter anymore.
Updated•2 years ago
Comment 10•2 years ago
We've integrated the ICU4X segmenter in bug 1719535, which supports word boundaries for Chinese, Japanese, and SE Asian languages (including Lao).