Closed Bug 1423593 Opened 7 years ago Closed 4 months ago

Add Intl.Segmenter API

Categories

(Core :: JavaScript: Internationalization API, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
122 Branch
Tracking Status
firefox59 --- wontfix
firefox122 --- fixed

People

(Reporter: anba, Assigned: anba)

References

(Blocks 4 open bugs, )

Details

(Keywords: dev-doc-complete)

Attachments

(7 files, 10 obsolete files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
Attached file bug-1423593-part1-icu-data.zip (obsolete) —
Adds all break-iterator data to icudt for testing the WIP patch in part 2. Adding all break-iterator data increases icudt by 3.57MB which makes it kind of unlikely to be approved by release drivers. :-)
WIP patch for Intl.Segmenter.
Jonathan - the 3.5MB added is probably a no-go unless we can unify what we do internally with this API and either reuse the internal API here, or use ICU here and for Gecko and remove the internal API.

WDYT?
Flags: needinfo?(jfkthame)
I guess quite a lot of the data size here is to support proper line- and word-segmentation in languages like Thai, Khmer and Lao where text is written without word separators, yet breaks are not allowed just "anywhere" like in Chinese (to a first approximation) or derived from simple character-based rules; the only way to get it right is a dictionary-based algorithm.

We currently don't have support for this built in to Gecko, as seen for example in bug 448650. On desktop platforms, we solve the line-breaking issue by calling host platform APIs (Uniscribe on Windows; CoreFoundation on macOS; pango on Linux) for text in "complex" scripts, which means that we're not responsible for packaging/delivering the bulky data that underlies it, but it also means our behavior is likely to vary somewhat between platforms, depending on the support provided by the libraries we're using.

So in principle, we could switch from those separate platform-specific implementations to a single ICU-based one, and gain more consistent behavior across platforms (and fix the behavior on Android, where I assume we currently just fail, per bug 448650); but this would mean accepting the data size as part of Gecko instead of depending on the platform to provide it. Whether that's a worthwhile trade-off is a product-drivers call, I guess.

We also have a bunch of custom line-breaking support in nsJISx4051LineBreaker for non-Thai/Lao/Khmer text, which could perhaps be replaced by ICU, though this would need careful examination as we don't necessarily follow the default Unicode algorithm in all cases. Our current breaker has a lot of customizations to better handle breaking around things like URLs, Japanese-style emoticon sequences ¯\_(ツ)_/¯ etc., and we'd need to be wary of regressing these cases.

There's also word-boundary segmentation, used for things like double-click selection; here, our current implementation is quite rudimentary (it's nsSampleWordBreaker, which sounds suspiciously like it was an initial place-holder that never got any further love!), and I expect ICU's segment iterator is substantially better. If nothing else, it'll be up-to-date with the current Unicode character repertoire, whereas it doesn't look like our code has been updated since about Unicode 3.0 -- and it's completely missing any attempt to support languages like Thai properly.

Overall, I see several benefits to switching intl/lwbrk to be backed by ICU segmentation: we'd get consistent behavior across desktop platforms, and fix the currently-broken behavior on mobile for Thai etc; we'd get improved word selection with up-to-date Unicode support and correct behavior (I presume) for the "difficult" SEAsian languages where we currently fail; and we'd get to drop maintenance of separate platform-specific line-breaking backends. (In particular, I suspect the Uniscribe-based implementation we use on Windows may become problematic as we continue to tighten sandboxing restrictions.)

It's clear, though, that the amount of existing code and data that we'd get to drop in the process is pretty small, so I don't expect it would make a significant dent in overall size to offset the increase in ICU data. Can/should we live with that?

(FWIW, I think the opposite approach, trying to implement Intl.Segmenter on top of our current line- and word-breaking code (AFAIK we don't have any implementation of "sentence segmentation"), isn't really worthwhile because our implementations aren't good/comprehensive enough, or sufficiently consistent across platforms, to be exposed to web authors who IMO should be able to rely on consistency from this API if they're going to write code that depends on it.)
Flags: needinfo?(jfkthame)
See also bug 425915, which is about Thai word-segmentation. The current patches there are leveraging our existing platform-specific line-break implementations, so AFAICT would not fix the bug very well for Android, where we use the (much more limited) nsRuleBreaker rather than a dictionary-backed API. Adopting the ICU segmentation support would give a better result.
Priority: -- → P3
Thanks for the explanation Jonathan!

I think it all makes sense, however unfortunately it puts us at odds with the Intl.Segmenter proposal.

Is there any chance it would make sense for us to adopt the ICU algorithms and turn this problem into a table selection problem?

In other words, direct us into a unified hypenation/segmentation/line-breaking backed by ICU algorithms both for Gecko and Intl API, but select only tables that we have right now?

If that would be possible, my naive understanding is that we'd gain:

 - better algorithms to the ones we have to maintain
 - lower maintenance cost for us with ability to remove our own algos
 - unified API for the web and Gecko

Costs:

 - One of our APIs is probably more fine-tuned for the Web than what ICU is doing (line-breaking)
 - Potentially, even the subset of tables we'd want to have to match our current data set would cost in size (?) 

Is that a good summary? Would it be something potentially realistic?
Flags: needinfo?(jfkthame)
I'm not sure what "table selection problem" is.

(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #6)
> Thanks for the explanation Jonathan!
> 
> I think it all makes sense, however unfortunately it puts us at odds with
> the Intl.Segmenter proposal.
> 
> Is there any chance it would make sense for us to adopt the ICU algorithms
> and turn this problem into a table selection problem?
> 
> In other words, direct us into a unified
> hypenation/segmentation/line-breaking backed by ICU algorithms both for
> Gecko and Intl API, but select only tables that we have right now?
> 
> If that would be possible, my naive understanding is that we'd gain:
> 
>  - better algorithms to the ones we have to maintain
>  - lower maintenance cost for us with ability to remove our own algos
>  - unified API for the web and Gecko
> 
> Costs:
> 
>  - One of our APIs is probably more fine-tuned for the Web than what ICU is
> doing (line-breaking)
>  - Potentially, even the subset of tables we'd want to have to match our
> current data set would cost in size (?) 
> 
> Is that a good summary? Would it be something potentially realistic?
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #6)
> Thanks for the explanation Jonathan!
> 
> I think it all makes sense, however unfortunately it puts us at odds with
> the Intl.Segmenter proposal.
> 
> Is there any chance it would make sense for us to adopt the ICU algorithms
> and turn this problem into a table selection problem?
> 
> In other words, direct us into a unified
> hypenation/segmentation/line-breaking backed by ICU algorithms both for
> Gecko and Intl API, but select only tables that we have right now?

(Don't include "hyphenation" in discussion here, AFAIK there's no hyphenation support in ICU. We implement that entirely separately, using libhyphen.)

I don't know how much of the 3.5MB we could save by excluding some of the data; we could exclude the sentence-break rules, as we don't currently have any such functionality, and we could trim any localized versions of line- and word-breaking, as we don't currently implement language-sensitive behavior (that's bug 203016!). But my guess (without having actually tried) is that the savings would be relatively small. Most of the bulk probably comes from properties for the entire Unicode repertoire, which are needed even if we only support the root locale, plus the dictionaries that are needed to support Thai (etc) word-break recognition.

Maybe André can try creating a version of the data patch with trimmed content and see how much different it turns out? But I suspect it won't make a dramatic difference, and if we want this functionality at all we're going to have to accept a several-MB size increase.

(Remember that currently we're relying on data provided by the OS for substantial parts of this, which means we don't save any data size within Gecko by dropping the old implementations.)
Flags: needinfo?(jfkthame)
(In reply to Jonathan Kew (:jfkthame) from comment #8)
> Maybe André can try creating a version of the data patch with trimmed
> content and see how much different it turns out? But I suspect it won't make
> a dramatic difference, and if we want this functionality at all we're going
> to have to accept a several-MB size increase.

The total increase in bytes is 3.745.072: 2.882.016 bytes for dictionaries (burmese, chinese-japanese, khmer lao, thai) and 863.056 bytes (non-dictionary data).
See Also: → 196175

Any idea when Intl.Segmenter will be added?

Flags: needinfo?(jwalden)

Includes all break iteration data, including line segmentation even though it's
unused, because ICU data tailoring doesn't support to exclude line segmentation
data.

Filtered break iteration and the accompanying "exceptions" data tree are
disabled, because the Intl.Segmenter proposal doesn't support sentence break
suppressions through the "ss" Unicode extension key.

Implement the standard Intl object code, including the constructor and the
resolvedOptions() method. Intl.Segmenter.prototype.segment() will be added
in the next patch.

Depends on D88432

Intl.Segmenter.prototype.segment() returns a SegmentsObject tied to a specific
string. The SegmentsObject supports finding segmentation boundaries through its
containing() method.

Each SegmentsObject holds a separate UBreakIterator instead of using a
shared UBreakIterator in SegmenterObject to allow ICU to precompute and cache
segmentation boundaries for a single string.

The "Segment Data" object is created in self-hosted code to ensure we get stable
shapes.

Depends on D88433

In addition to the containing() method, Segments objects also support
iteration. Segment Iterator objects always start the segment iteration at the
start of the string and are independent of the current position of the Segments
object's UBreakIterator. That means each Segment Iterator needs to have yet
another copy of the UBreakIterator.

Depends on D88434

Test262 tests cover the various error conditions, so the local tests are mostly
restricted to functional tests and SpiderMonkey specifics like cross-compartment
wrappers.

Depends on D88435

Attachment #8934993 - Attachment is obsolete: true
Attachment #8934995 - Attachment is obsolete: true

The patches are ready, but I haven't yet assigned a reviewer, because I'm not sure who is currently available.

Assignee: nobody → andrebargull
Status: NEW → ASSIGNED

Add a fast-path for Latin-1 strings to avoid the Latin-1 to Two-Byte string
conversion before calling into ICU. This gives a 50x speed-up for the cases
when AutoStableStringChars needs to malloc memory to hold the converted
characters. It also provides a 25-40% speed-up compared to ICU (when comparing
without the malloc overhead).
Word break rules and properties for Latin-1 shouldn't observe any changes in
future UAX #29 resp. Unicode versions, so hard-coding the rules and properties
shouldn't lead to any issues.
All rules are annotated to allow comparing them with UAX #29. Rules which don't
apply to Latin-1 characters (e.g. Hebrew letters or Katagana) were omitted.

Depends on D88438

Similar to part 7, add a fast-path for Latin-1 strings. In addition to avoiding
the malloc overhead, the fast-path is 3-4x faster than ICU (when comparing
without the malloc overhead).

Depends on D89373

I'm available to review stuff. At least for now I plan on continuing to do stuff on an as-I-feel-like-it basis, and while I didn't start doing intl stuff by deliberate and intended choice, over time I've come to find it reasonably interesting in its own right. So intl stuff is part of the SpiderMonkey things I will keep doing. Therefore, throw 'em at me if you want. :-)

Can't make guarantees about turnaround time, if I decide to spend any particular tomorrow biking all day instead of doing reviewing that's just how it'll be. But then I couldn't well offer guarantees before, either, could I, when other projects were my primary goals. So it's not really too much different. (And with recent bike bustage, before I could get to repairing it, I seem to have lost a bunch of fitness, so I gotta budget in rest days anyway.)

(In reply to me from comment #10)

Any idea when Intl.Segmenter will be added?

Not terribly precisely, no. Patches need reviewing, we need to decide just how to deal with the size increases (if we haven't already chosen to accept it), and I'm not 100% certain what exactly we're doing about either having unified segmentation support used throughout Gecko, or having different implementations for Gecko for layout and for JS for Intl. If you need segmentation now, or soon, you should plan on shipping and using your own implementation.

Flags: needinfo?(jwalden)

(In reply to Jeff Walden [:Waldo] from comment #20)

Therefore, throw 'em at me if you want. :-)

\o/

Attachment #9172419 - Attachment description: Bug 1423593 - Part 1: Enable break iteration data in ICU. → Bug 1423593 - Part 1: Enable break iteration data in ICU. r=jwalden!
Attachment #9172420 - Attachment description: Bug 1423593 - Part 2: Implement standard Intl constructor code for Intl.Segmenter. → Bug 1423593 - Part 2: Implement standard Intl constructor code for Intl.Segmenter. r=jwalden!
Attachment #9172421 - Attachment description: Bug 1423593 - Part 3: Implement Segmenter.prototype.segment and the SegmentsObject. → Bug 1423593 - Part 3: Implement Segmenter.prototype.segment and the SegmentsObject. r=jwalden!
Attachment #9172422 - Attachment description: Bug 1423593 - Part 4: Implement Segment Iterator objects. → Bug 1423593 - Part 4: Implement Segment Iterator objects. r=jwalden!
Attachment #9172423 - Attachment description: Bug 1423593 - Part 5: Add tests for Intl.Segmenter. → Bug 1423593 - Part 5: Add tests for Intl.Segmenter. r=jwalden!
Attachment #9172426 - Attachment description: Bug 1423593 - Part 6: Enable test262 tests for Intl.Segmenter. → Bug 1423593 - Part 6: Enable test262 tests for Intl.Segmenter. r=jwalden!
Attachment #9174218 - Attachment description: Bug 1423593 - Part 7: Add fast-path for word boundary detection on Latin-1 strings. → Bug 1423593 - Part 7: Add fast-path for word boundary detection on Latin-1 strings. r=jwalden!
Attachment #9174220 - Attachment description: Bug 1423593 - Part 8: Add fast-path for sentence boundary detection on Latin-1 strings. → Bug 1423593 - Part 8: Add fast-path for sentence boundary detection on Latin-1 strings. r=jwalden!

I can be a secondary reviewer here.

That would be great. Thanks! :-)

I spoke to a few people regarding using ICU (Zibi and Ted mostly) for this. What the patch does now is valid and would be web compatible, but would result in an increased binary size and (perhaps more importantly) inconsistent behaviour between Gecko and Spidermonkey. The solution that is being discussed would be a shared segmenter library, and there is a plan for this that will probably kick of in early next year.

My initial thought was, if we could simply replace ICU with this library, then we could go forward with the work you did here behind a build flag, however it looks like the shared library will likely incorporate some of the functionality of the the Intl.Segmenter proposal. So, for now, we should wait.

I have set this to "requested changes" on the first patch for the time being, since there is no way to communicate "we are waiting on changes elsewhere that will impact this patch" on phabricator in case anyone else is looking at ongoing work or wants to know what the status of Intl.Segmenter is. We might land a modified version of this once things become clearer in how Gecko and SM will share Intl.Segmenter functionality.

When we talked in Berlin about this topic, I thought we concluded that Gecko isn't affected by Intl.Segmenter, because Intl.Segmenter doesn't include line segmentation? And we got the okay to increase the binary size across team (desktop and mobile).

Andre - A month ago I sent you a proposal for the project we're kicking off to rework how segmentation will work to allow us to fit into this spec without paying the payload cost.
I also shared a couple more proposals related to unification of internationalization APIs between Gecko and SpiderMonkey.

I used your bugmail email. If there's a better way to share those with you, please, let me know.

Flags: needinfo?(andrebargull)

Thanks. I've read the proposals yesterday. It looks like the preparatory work for Intl.Segmenter will take a while, so I'll unassign myself for now, because I guess we shouldn't give the impression that Intl.Segmenter will land in the next few months.

Assignee: andrebargull → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(andrebargull)
Attachment #9172419 - Attachment is obsolete: true
Attachment #9172420 - Attachment is obsolete: true
Attachment #9172421 - Attachment is obsolete: true
Attachment #9172422 - Attachment is obsolete: true
Attachment #9172423 - Attachment is obsolete: true
Attachment #9172426 - Attachment is obsolete: true
Attachment #9174218 - Attachment is obsolete: true
Attachment #9174220 - Attachment is obsolete: true
Severity: normal → S3

Hi. I would like to add a note about the potential increase of usage (and also, potential web-compat breakages) of this proposal, which may worth a reconsideration of the discussion priority:

An npm package, string-width [1], started to use this Intl.Segmenter API since its 6.0.0 version released 2 weeks ago [2]. This package currently has ~101M weekly downloads, and might be eligible as a widely-used package. Therefore, subsequent dependency upgrades of various sites might start to cause (a lot of) problems on browsers that don't have this implemented.

  1. https://www.npmjs.com/package/string-width
  2. https://github.com/sindresorhus/string-width/releases/tag/v6.0.0
Depends on: 1719535

This is planned for later this year, but we're not able to work on this until the new ICU4x backed segmenter lands in Bug 1719535.

Dan, with Bug 1719535 landing, is this API effort back on the radar screen here in the near term?

Flags: needinfo?(dminor)

We're planning this for Q4, once some of the current standards related work is completed. It's great to see Bug 1719535 land :)

Flags: needinfo?(dminor)

Implement the standard Intl object code, including the constructor and the
resolvedOptions() method. Intl.Segmenter.prototype.segment() will be added
in the next patch.

Intl.Segmenter.prototype.segment() returns a SegmentsObject tied to a specific
string. The SegmentsObject supports finding segmentation boundaries through its
containing() method.

The "Segment Data" object is created in self-hosted code to ensure we get stable
shapes.

Depends on D186425

In addition to the containing() method, Segments objects also support
iteration. Segment Iterator objects always start the segment iteration at the
start of the string.

Depends on D186426

Test262 tests cover the various error conditions, so the local tests are mostly
restricted to functional tests and SpiderMonkey specifics like cross-compartment
wrappers.

Depends on D186427

Depends on D186428

Depends on D186429

Running the tests uncovered some issues which look like bugs in the ICU4X implementation. (ICU4C computes the expected results). Should I file this at https://github.com/unicode-org/icu4x?


WordBreakIterator::is_word_like() seems to return a wrong result when the input ends with single full stop.

For example "World" isn't marked as word-like in this example:

js> [...new Intl.Segmenter("en", {granularity:"word"}).segment("Hello World.")]
[
  {segment:"Hello", index:0, input:"Hello World.", isWordLike:true},
  {segment:" ", index:5, input:"Hello World.", isWordLike:false},
  {segment:"World", index:6, input:"Hello World.", isWordLike:false},
  {segment:".", index:11, input:"Hello World.", isWordLike:false}
]

Replacing "." with "!" leads to computing the correct word-like result:

js> [...new Intl.Segmenter("en", {granularity:"word"}).segment("Hello World!")]
[
  {segment:"Hello", index:0, input:"Hello World!", isWordLike:true},
  {segment:" ", index:5, input:"Hello World!", isWordLike:false},
  {segment:"World", index:6, input:"Hello World!", isWordLike:true},
  {segment:"!", index:11, input:"Hello World!", isWordLike:false}
]

SentenceBreakIterator seems to have issues with emojis.

The string "Hello World. " is correctly classified as a single sentence.

js> [...new Intl.Segmenter(undefined, {granularity: "sentence"}).segment("Hello World. ")].map(s => s.segment)
["Hello World. "]

But when followed by 🤥 (U+1F925), the string is suddenly split into three segments. (The expected result is ["Hello World. ", "\uD83E\uDD25"].)

js> [...new Intl.Segmenter(undefined, {granularity: "sentence"}).segment("Hello World. " + "\u{1F925}")].map(s => s.segment)
["Hello World", ".", " \uD83E\uDD25"]

When replacing the full stop with an exclamation mark, only the trailing whitespace is incorrectly attributed to the next segment.

js> [...new Intl.Segmenter(undefined, {granularity: "sentence"}).segment("Hello World! " + "\u{1F925}")].map(s => s.segment)
["Hello World!", " \uD83E\uDD25"]
Flags: needinfo?(m_kato)

WordBreakIterator::is_word_like() seems to return a wrong result when the input ends with single full stop.

I will look it. I guess ICU4X bug. Where is is_word_like test? I would like to import/convert it for ICU4X. https://github.com/tc39/test262/blob/main/test/intl402/Segmenter/prototype/segment/containing/iswordlike.js?

SentenceBreakIterator seems to have issues with emojis.

Ah, although this is SB11 rules, this seems a bug of segmenter rule. I will fix it in ICU4X.

Flags: needinfo?(m_kato)

(In reply to Makoto Kato [:m_kato] from comment #40)

WordBreakIterator::is_word_like() seems to return a wrong result when the input ends with single full stop.

I will look it. I guess ICU4X bug. Where is is_word_like test? I would like to import/convert it for ICU4X. https://github.com/tc39/test262/blob/main/test/intl402/Segmenter/prototype/segment/containing/iswordlike.js?

The tests are in https://phabricator.services.mozilla.com/D186428. The word-segmenter test is "js/src/tests/non262/Intl/Segmenter/word.js", it has this commented-out assertion which succeeds in ICU4C, but fails in ICU4X.

// The non-word parts in the samples are either punctuators or space separators.
// assertEq(obj.isWordLike, !/^(\p{gc=P}|\p{gc=Zs})+$/u.test(obj.segment), obj.segment);

SentenceBreakIterator seems to have issues with emojis.

Ah, although this is SB11 rules, this seems a bug of segmenter rule. I will fix it in ICU4X.

Thanks!

comment #39 will be fixed by ICU4X 1.3

Do I read the patches correctly that supportedLocalesOf responds based on what locales are explicitly known to other Intl objects?

(I don't have a good suggestion of how supportedLocalesOf for Intl.Segmenter should behave when the back end ignores the locale and in that sense "supported locale" isn't a good conceptual match for the back end. Chrome seems to have particularly involved way of coming up with supported/not-supported claims for its Intl.Segmenter, which while locale-sensitive, seems to be only minimally so. (Chrome seems to report sv-FI as supported but fi-SE as unsupported, which makes no sense considering what the ICU4C segmenter data contains, as it contains a common exception for fi and sv together ignoring the region for both, but instead seems to be based on knowledge that sv-FI has more status as a well-known thing in non-segmentation manner than fi-SE.) It seems a bit sad if for Web compat supportedLocalesOf becomes or, considering what Chrome does, has already become a particularly complex thing that has little to do with how the back end behaves.)

Flags: needinfo?(andrebargull)

Yes, the supportedLocalesOf implementation uses the set of locales supported by ICU4C (as determined by uloc_getAvailable). This matches the implementation in both V8 and JSC. This also explains why Intl.Segmenter.supportedLocalesOf(["sv-FI", "fi-SE"]) returns ["sv-FI", "fi-SE"] in both engines:

Flags: needinfo?(andrebargull)

Hi Anba, I was wondering if we are still blocked on this. ICU4X 1.3 has been released (https://github.com/unicode-org/icu4x/tree/release/1.3) although we haven't updated our in-tree version. Are we waiting on anything else to continue this work?

I can take a look at doing the ICU4X 1.3 update if that would help. Or do the update and also rebase the patches here if you're busy with other things :)

Flags: needinfo?(andrebargull)
Depends on: 1867586

Makoto is updating ICU4X segmenter to latest 1.4 in bug 1854031.

Depends on: 1854031
No longer depends on: 1867586

(In reply to Dan Minor [:dminor] from comment #45)

Are we waiting on anything else to continue this work?

No, we're just waiting for the ICU4X segmenter update to land.

Flags: needinfo?(andrebargull)
Assignee: nobody → andrebargull
Attachment #9349349 - Attachment description: WIP: Bug 1423593 - Part 1: Implement standard Intl constructor code for Intl.Segmenter. → Bug 1423593 - Part 1: Implement standard Intl constructor code for Intl.Segmenter. r=dminor!
Status: NEW → ASSIGNED
Attachment #9349350 - Attachment description: WIP: Bug 1423593 - Part 2: Implement Segmenter.prototype.segment and the SegmentsObject. → Bug 1423593 - Part 2: Implement Segmenter.prototype.segment and the SegmentsObject. r=dminor!
Attachment #9349351 - Attachment description: WIP: Bug 1423593 - Part 3: Implement Segment Iterator objects. → Bug 1423593 - Part 3: Implement Segment Iterator objects. r=dminor!
Attachment #9349352 - Attachment description: WIP: Bug 1423593 - Part 4: Add tests for Intl.Segmenter. → Bug 1423593 - Part 4: Add tests for Intl.Segmenter. r=dminor!
Attachment #9349353 - Attachment description: WIP: Bug 1423593 - Part 5: Enable test262 tests. → Bug 1423593 - Part 5: Enable test262 tests. r=dminor!
Attachment #9349354 - Attachment description: WIP: Bug 1423593 - Part 6: Reimport test262. → Bug 1423593 - Part 6: Reimport test262. r=dminor!

Recreating the ICU4X segmenters and break iterators is too slow when compared
to other engines, so we have to cache the ICU4X object for better performance.

Depends on D186430

@anba - do you have any rough estimates as to how much caching (part 7) saves? How does pre/post compares to other engines?

Flags: needinfo?(andrebargull)

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #49)

@anba - do you have any rough estimates as to how much caching (part 7) saves? How does pre/post compares to other engines?

Grapheme segmentation over the string "The quick brown fox jumps over the lazy dog.".

Engine Factor
SM (base) 1
SM (cache) ~5x faster
JSC ~2.5 faster
V8 ~1.5 slower (!)

Word segmentation over the string "The quick brown fox jumps over the lazy dog.".

Engine Factor
SM (base) 1
SM (cache) ~7x faster
JSC ~3.5 faster
V8 1 (!)

There's an even larger improvement for larger strings, because we're basically comparing a linear in string length against a quadratic in string length solution. (Quadratic because we always restart from index zero.)

I don't know why V8 is so slow when compared to JSC, maybe they're doing something wrong when calling into ICU?

var seg = new Intl.Segmenter("en", {granularity:"grapheme"})
var str = "The quick brown fox jumps over the lazy dog.";

var r = 0;
var t = performance.now();
for (var i = 0; i < 100_000; ++i) {
  r += [...seg.segment(str)].length;
}
console.log(performance.now() - t, r);
Flags: needinfo?(andrebargull)
Pushed by andre.bargull@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/5fb85ed28b12
Part 1: Implement standard Intl constructor code for Intl.Segmenter. r=dminor
https://hg.mozilla.org/integration/autoland/rev/ea86c2b391f7
Part 2: Implement Segmenter.prototype.segment and the SegmentsObject. r=dminor
https://hg.mozilla.org/integration/autoland/rev/ef3254d46ab7
Part 3: Implement Segment Iterator objects. r=dminor
https://hg.mozilla.org/integration/autoland/rev/e06dfac91b17
Part 4: Add tests for Intl.Segmenter. r=dminor
https://hg.mozilla.org/integration/autoland/rev/ec5883dc144d
Part 5: Enable test262 tests. r=dminor
https://hg.mozilla.org/integration/autoland/rev/3308537ba48a
Part 6: Reimport test262. r=dminor
https://hg.mozilla.org/integration/autoland/rev/4b401e7231ca
Part 7: Cache internal segmenters and break iterators. r=dminor
Regressions: 1869755

:anba could you consider nominating this for a release note? (Process info)

Flags: needinfo?(andrebargull)

Release Note Request (optional, but appreciated)
[Why is this notable]: A new web platform feature
[Affects Firefox for Android]: Yes
[Suggested wording]: Add support for Unicode text segmentation to JavaScript. Currently only enabled for Nightly builds.
[Links (documentation, blog post, etc)]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter

relnote-firefox: --- → ?
Flags: needinfo?(andrebargull)

Thanks, added to the Fx122 nightly release notes, please allow 30 minutes for the site to update.

FF122 docs work for this can be tracked in https://github.com/mdn/content/issues/31098. The API is already documented, so this was just an MDN release note and update to browser compatibility data.

Q: Based on the intl.icu4x.segmenter.enabled set to true in StaticPrefList.yaml, does it mean that Intl.Segmenter API will be shipped in 122 enabled by default?

(In reply to Vadim Makeev from comment #57)

Q: Based on the intl.icu4x.segmenter.enabled set to true in StaticPrefList.yaml, does it mean that Intl.Segmenter API will be shipped in 122 enabled by default?

No, Intl.Segmenter is still restricted to Nightly due to this ifdef.

(In reply to André Bargull [:anba] from comment #58)

(In reply to Vadim Makeev from comment #57)

Q: Based on the intl.icu4x.segmenter.enabled set to true in StaticPrefList.yaml, does it mean that Intl.Segmenter API will be shipped in 122 enabled by default?

No, Intl.Segmenter is still restricted to Nightly due to this ifdef.

What's blocking the removal of the ifdef and letting Intl.Segmenter ride the trains?

Flags: needinfo?(andrebargull)

I didn't want to let it ride the trains right away in case there are any major differences between ICU4C segmentation (which is used by JSC/V8) and ICU4X segmentation which can lead to web-compat issues. Minor differences which are unlikely to cause issues are for example:

We didn't yet receive any bug reports, so we could consider enabling Intl.Segmenter by default for either Firefox 124 or 125, I guess.

Flags: needinfo?(andrebargull)

Have we considered improving the implementation of the segmenter?

it will break on abbrevations when splitting, where most NLP libs don't.

Example with NTLK:

>>> import nltk
>>> nltk.sent_tokenize("""E. T. A. Hoffmann (1776–1822) wrote a short story in 1816 titled Der Sandmann, which showed how sinister such a character could be made. According to the protagonist's nurse, he threw sand in the eyes of children who would not sleep, with the result of those eyes falling out and being collected by the Sandman, who then takes the eyes to his iron nest on the Moon and uses them to feed his children. The protagonist of the story grows to associate this nightmarish creature with the genuinely sinister figure of his father's associate, Coppelius. In Romanian folklore, there is a similar character, Moș Ene (Ene the Elder). Hoffmann's version of the sandman is also similar to the French Canadian character known as the Bonhomme Sept Heures (Goodmaan Seven O’Clock), who, in some versions, throws sand in children's eyes to blind them so that he may capture them. Contrarily to the sandman, his bag is the place where he traps children who do not go to bed.""")

['E. T. A. Hoffmann (1776–1822) wrote a short story in 1816 titled Der Sandmann, which showed how sinister such a character could be made.', "According to the protagonist's nurse, he threw sand in the eyes of children who would not sleep, with the result of those eyes falling out and being collected by the Sandman, who then takes the eyes to his iron nest on the Moon and uses them to feed his children.", "The protagonist of the story grows to associate this nightmarish creature with the genuinely sinister figure of his father's associate, Coppelius.", 'In Romanian folklore, there is a similar character, Moș Ene (Ene the Elder).', "Hoffmann's version of the sandman is also similar to the French Canadian character known as the Bonhomme Sept Heures (Goodmaan Seven O’Clock), who, in some versions, throws sand in children's eyes to blind them so that he may capture them.", 'Contrarily to the sandman, his bag is the place where he traps children who do not go to bed.']

Intl will split at E. T. A.

Flags: needinfo?(andrebargull)

That's an ICU4X-level question. ICU4C suppresses sentence breaks for well-known abbreviations in the best-known languages, but only those. https://github.com/unicode-org/icu4x/issues/3284 is about whether ICU4X should do something like that.

At present ICU4C as used in Chrome treats "E.", "T.", and "A." as sentences, since they aren't well-known abbreviations even though English is one of the best-known languages.

Thanks Henri, will dig in that direction. It breaks for all points as far as I can tell. e.g. "Jr." in English.

In addition to the ICU4X issue linked in comment #62, the following ICU4X and ECMA-402 issues are also related:

Flags: needinfo?(andrebargull)

This has been enabled on Nightly for 3 cycles now. Per our policy, removing this from the Nightly release notes. Feel free to nominate again for train-specific release notes when this is ready to ride the trains to Release.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: