Status

enhancement, P3, normal
Opened a year ago; last updated 23 days ago

People

(Reporter: anba, Unassigned)

Tracking

(Blocks 3 bugs, {dev-doc-needed})
Trunk

Firefox Tracking Flags

(firefox59 affected)

Attachments

(2 attachments)

(Reporter)

Comment 1

a year ago
Adds all break-iterator data to icudt for testing the WIP patch in part 2. Adding all break-iterator data increases icudt by 3.57MB, which makes it kind of unlikely to be approved by release drivers. :-)
(Reporter)

Comment 2

a year ago
WIP patch for Intl.Segmenter.
Jonathan - the added 3.5MB is probably a no-go unless we can unify this API with what we do internally: either reuse the internal API here, or use ICU both here and in Gecko and remove the internal API.

WDYT?
Flags: needinfo?(jfkthame)
I guess quite a lot of the data size here is to support proper line- and word-segmentation in languages like Thai, Khmer and Lao where text is written without word separators, yet breaks are not allowed just "anywhere" like in Chinese (to a first approximation) or derived from simple character-based rules; the only way to get it right is a dictionary-based algorithm.
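
As a rough illustration of what the dictionary-based approach involves, here's a minimal sketch (not taken from any of the patches here) using ICU's C++ BreakIterator on a short Thai string. The sample text and the use of the root locale are just illustrative, and it assumes an ICU build that actually ships the Thai dictionary data this bug is about:

#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>

#include <iostream>
#include <memory>
#include <string>

int main() {
  UErrorCode status = U_ZERO_ERROR;
  // Word boundaries; for Thai/Khmer/Lao the segmentation is dictionary-driven
  // even under the root locale, provided the dictionary data is in icudt.
  std::unique_ptr<icu::BreakIterator> bi(
      icu::BreakIterator::createWordInstance(icu::Locale::getRoot(), status));
  if (U_FAILURE(status)) return 1;

  // "Hello, world" written in Thai, with no spaces between words
  // (the source file is assumed to be UTF-8).
  icu::UnicodeString text = icu::UnicodeString::fromUTF8("สวัสดีชาวโลก");
  bi->setText(text);

  int32_t start = bi->first();
  for (int32_t end = bi->next(); end != icu::BreakIterator::DONE;
       start = end, end = bi->next()) {
    icu::UnicodeString segment;
    text.extractBetween(start, end, segment);
    std::string utf8;
    segment.toUTF8String(utf8);
    std::cout << "[" << utf8 << "]\n";  // one dictionary-recognized word per line
  }
  return 0;
}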

We currently don't have support for this built into Gecko, as seen for example in bug 448650. On desktop platforms, we solve the line-breaking issue by calling host platform APIs (Uniscribe on Windows; CoreFoundation on macOS; Pango on Linux) for text in "complex" scripts, which means that we're not responsible for packaging/delivering the bulky data that underlies it, but it also means our behavior is likely to vary somewhat between platforms, depending on the support provided by the libraries we're using.

So in principle, we could switch from those separate platform-specific implementations to a single ICU-based one, and gain more consistent behavior across platforms (and fix the behavior on Android, where I assume we currently just fail, per bug 448650); but this would mean accepting the data size as part of Gecko instead of depending on the platform to provide it. Whether that's a worthwhile trade-off is a product-drivers call, I guess.

We also have a bunch of custom line-breaking support in nsJISx4051LineBreaker for non-Thai/Lao/Khmer text, which could perhaps be replaced by ICU, though this would need careful examination as we don't necessarily follow the default Unicode algorithm in all cases. Our current breaker has a lot of customizations to better handle breaking around things like URLs, Japanese-style emoticon sequences ¯\_(ツ)_/¯ etc., and we'd need to be wary of regressing these cases.
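
To make that concrete, here's a similar sketch (again, not the Gecko integration itself) that just prints the boundary offsets ICU's stock line-break iterator reports for a made-up string containing a URL; the default UAX #14 rules typically allow breaks after the slashes in the path, which is exactly the kind of case where nsJISx4051LineBreaker's customizations differ and where we'd need regression coverage:

#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>

#include <iostream>
#include <memory>

int main() {
  UErrorCode status = U_ZERO_ERROR;
  std::unique_ptr<icu::BreakIterator> bi(
      icu::BreakIterator::createLineInstance(icu::Locale::getRoot(), status));
  if (U_FAILURE(status)) return 1;

  // ASCII-only sample string, just for illustration.
  icu::UnicodeString text("See https://example.org/some/long/path for details.");
  bi->setText(text);

  // Print every boundary offset ICU reports (0 and the text length are included).
  for (int32_t pos = bi->first(); pos != icu::BreakIterator::DONE; pos = bi->next()) {
    std::cout << pos << " ";
  }
  std::cout << "\n";
  return 0;
}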

There's also word-boundary segmentation, used for things like double-click selection; here, our current implementation is quite rudimentary (it's nsSampleWordBreaker, which sounds suspiciously like it was an initial place-holder that never got any further love!), and I expect ICU's segment iterator is substantially better. If nothing else, it'll be up-to-date with the current Unicode character repertoire, whereas it doesn't look like our code has been updated since about Unicode 3.0 -- and it's completely missing any attempt to support languages like Thai properly.
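
And for the double-click case specifically, a sketch of how word selection could sit on top of the same ICU word iterator: given a caret offset (the offset and sample string below are hypothetical), following()/previous() yield the enclosing word's boundaries. Purely illustrative; this is not our actual selection code:

#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>

#include <iostream>
#include <memory>
#include <string>

int main() {
  UErrorCode status = U_ZERO_ERROR;
  std::unique_ptr<icu::BreakIterator> bi(
      icu::BreakIterator::createWordInstance(icu::Locale::getRoot(), status));
  if (U_FAILURE(status)) return 1;

  icu::UnicodeString text("double-click selects a word");
  bi->setText(text);

  int32_t click = 17;                  // hypothetical caret offset, inside "selects"
  int32_t end = bi->following(click);  // first word boundary after the click
  int32_t start = bi->previous();      // boundary just before that, i.e. the word start

  icu::UnicodeString word;
  text.extractBetween(start, end, word);
  std::string utf8;
  word.toUTF8String(utf8);
  std::cout << "selected: [" << utf8 << "]\n";  // prints: selected: [selects]
  return 0;
}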

Overall, I see several benefits to switching intl/lwbrk to be backed by ICU segmentation: we'd get consistent behavior across desktop platforms, and fix the currently-broken behavior on mobile for Thai etc.; we'd get improved word selection with up-to-date Unicode support and correct behavior (I presume) for the "difficult" Southeast Asian languages where we currently fail; and we'd get to drop maintenance of separate platform-specific line-breaking backends. (In particular, I suspect the Uniscribe-based implementation we use on Windows may become problematic as we continue to tighten sandboxing restrictions.)

It's clear, though, that the amount of existing code and data that we'd get to drop in the process is pretty small, so I don't expect it would make a significant dent in overall size to offset the increase in ICU data. Can/should we live with that?

(FWIW, I think the opposite approach -- implementing Intl.Segmenter on top of our current line- and word-breaking code (AFAIK we don't have any implementation of "sentence segmentation" at all) -- isn't really worthwhile. Our implementations aren't good/comprehensive enough, or sufficiently consistent across platforms, to be exposed to web authors, who IMO should be able to rely on consistent behavior from this API if they're going to write code that depends on it.)
Flags: needinfo?(jfkthame)
See also bug 425915, which is about Thai word-segmentation. The current patches there leverage our existing platform-specific line-break implementations, so AFAICT they would not fix that bug very well for Android, where we use the (much more limited) nsRuleBreaker rather than a dictionary-backed API. Adopting the ICU segmentation support would give a better result.
Priority: -- → P3
Comment 6

a year ago

Thanks for the explanation, Jonathan!

I think it all makes sense, but unfortunately it puts us at odds with the Intl.Segmenter proposal.

Is there any chance it would make sense for us to adopt the ICU algorithms and turn this problem into a table selection problem?

In other words, move us to a unified hyphenation/segmentation/line-breaking backend built on ICU algorithms, both for Gecko and the Intl API, but select only the tables that we have right now?

If that were possible, my naive understanding is that we'd gain:

 - better algorithms than the ones we currently maintain
 - lower maintenance cost for us, with the ability to remove our own algorithms
 - unified API for the web and Gecko

Costs:

 - One of our APIs (line-breaking) is probably more fine-tuned for the Web than what ICU does
 - Potentially, even the subset of tables we'd need in order to match our current data set would cost something in size (?)

Is that a good summary? Would it be something potentially realistic?
Flags: needinfo?(jfkthame)

Comment 7

a year ago
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #6)
> Is there any chance it would make sense for us to adopt the ICU algorithms
> and turn this problem into a table selection problem?

I'm not sure what a "table selection problem" is.
Comment 8

a year ago

(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #6)
> In other words, move us to a unified hyphenation/segmentation/line-breaking
> backend built on ICU algorithms, both for Gecko and the Intl API, but select
> only the tables that we have right now?

(Don't include "hyphenation" in the discussion here; AFAIK there's no hyphenation support in ICU. We implement that entirely separately, using libhyphen.)

I don't know how much of the 3.5MB we could save by excluding some of the data; we could exclude the sentence-break rules, as we don't currently have any such functionality, and we could trim any localized versions of line- and word-breaking, as we don't currently implement language-sensitive behavior (that's bug 203016!). But my guess (without having actually tried) is that the savings would be relatively small. Most of the bulk probably comes from properties for the entire Unicode repertoire, which are needed even if we only support the root locale, plus the dictionaries that are needed to support Thai (etc) word-break recognition.

Maybe André can try creating a version of the data patch with trimmed content and see how much different it turns out? But I suspect it won't make a dramatic difference, and if we want this functionality at all we're going to have to accept a several-MB size increase.

(Remember that currently we're relying on data provided by the OS for substantial parts of this, which means we don't save any data size within Gecko by dropping the old implementations.)
Flags: needinfo?(jfkthame)
(Reporter)

Comment 9

a year ago
(In reply to Jonathan Kew (:jfkthame) from comment #8)
> Maybe André can try creating a version of the data patch with trimmed
> content and see how much different it turns out? But I suspect it won't make
> a dramatic difference, and if we want this functionality at all we're going
> to have to accept a several-MB size increase.

The total increase is 3,745,072 bytes: 2,882,016 bytes for the dictionaries (Burmese, Chinese-Japanese, Khmer, Lao, Thai) and 863,056 bytes of non-dictionary data.
See Also: → 196175