Add Intl.Segmenter API
Categories
(Core :: JavaScript: Internationalization API, enhancement, P3)
Tracking
()
People
(Reporter: anba, Assigned: anba)
References
(Blocks 4 open bugs, )
Details
(Keywords: dev-doc-complete)
Attachments
(7 files, 10 obsolete files)
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review |
The Intl.Segmenter proposal moved to stage 3 in September 2017 [2]. [1] https://github.com/tc39/proposal-intl-segmenter [2] https://github.com/tc39/tc39-notes/blob/master/es8/2017-09/sep-26.md#12ib-intlsegmenter-for-stage-3
Assignee | ||
Comment 1•7 years ago
|
||
Adds all break-iterator data to icudt for testing the WIP patch in part 2. Adding all break-iterator data increases icudt by 3.57MB which makes it kind of unlikely to be approved by release drivers. :-)
Assignee | ||
Comment 2•7 years ago
|
||
WIP patch for Intl.Segmenter.
Updated•7 years ago
|
Comment 3•7 years ago
|
||
Jonathan - the 3.5MB added is probably a no-go unless we can unify what we do internally with this API and either reuse the internal API here, or use ICU here and for Gecko and remove the internal API. WDYT?
Comment 4•7 years ago
|
||
I guess quite a lot of the data size here is to support proper line- and word-segmentation in languages like Thai, Khmer and Lao where text is written without word separators, yet breaks are not allowed just "anywhere" like in Chinese (to a first approximation) or derived from simple character-based rules; the only way to get it right is a dictionary-based algorithm. We currently don't have support for this built in to Gecko, as seen for example in bug 448650. On desktop platforms, we solve the line-breaking issue by calling host platform APIs (Uniscribe on Windows; CoreFoundation on macOS; pango on Linux) for text in "complex" scripts, which means that we're not responsible for packaging/delivering the bulky data that underlies it, but it also means our behavior is likely to vary somewhat between platforms, depending on the support provided by the libraries we're using. So in principle, we could switch from those separate platform-specific implementations to a single ICU-based one, and gain more consistent behavior across platforms (and fix the behavior on Android, where I assume we currently just fail, per bug 448650); but this would mean accepting the data size as part of Gecko instead of depending on the platform to provide it. Whether that's a worthwhile trade-off is a product-drivers call, I guess. We also have a bunch of custom line-breaking support in nsJISx4051LineBreaker for non-Thai/Lao/Khmer text, which could perhaps be replaced by ICU, though this would need careful examination as we don't necessarily follow the default Unicode algorithm in all cases. Our current breaker has a lot of customizations to better handle breaking around things like URLs, Japanese-style emoticon sequences ¯\_(ツ)_/¯ etc., and we'd need to be wary of regressing these cases. There's also word-boundary segmentation, used for things like double-click selection; here, our current implementation is quite rudimentary (it's nsSampleWordBreaker, which sounds suspiciously like it was an initial place-holder that never got any further love!), and I expect ICU's segment iterator is substantially better. If nothing else, it'll be up-to-date with the current Unicode character repertoire, whereas it doesn't look like our code has been updated since about Unicode 3.0 -- and it's completely missing any attempt to support languages like Thai properly. Overall, I see several benefits to switching intl/lwbrk to be backed by ICU segmentation: we'd get consistent behavior across desktop platforms, and fix the currently-broken behavior on mobile for Thai etc; we'd get improved word selection with up-to-date Unicode support and correct behavior (I presume) for the "difficult" SEAsian languages where we currently fail; and we'd get to drop maintenance of separate platform-specific line-breaking backends. (In particular, I suspect the Uniscribe-based implementation we use on Windows may become problematic as we continue to tighten sandboxing restrictions.) It's clear, though, that the amount of existing code and data that we'd get to drop in the process is pretty small, so I don't expect it would make a significant dent in overall size to offset the increase in ICU data. Can/should we live with that? (FWIW, I think the opposite approach, trying to implement Intl.Segmenter on top of our current line- and word-breaking code (AFAIK we don't have any implementation of "sentence segmentation"), isn't really worthwhile because our implementations aren't good/comprehensive enough, or sufficiently consistent across platforms, to be exposed to web authors who IMO should be able to rely on consistency from this API if they're going to write code that depends on it.)
Comment 5•7 years ago
|
||
See also bug 425915, which is about Thai word-segmentation. The current patches there are leveraging our existing platform-specific line-break implementations, so AFAICT would not fix the bug very well for Android, where we use the (much more limited) nsRuleBreaker rather than a dictionary-backed API. Adopting the ICU segmentation support would give a better result.
Updated•7 years ago
|
Updated•7 years ago
|
Comment 6•7 years ago
|
||
Thanks for the explanation Jonathan! I think it all makes sense, however unfortunately it puts us at odds with the Intl.Segmenter proposal. Is there any chance it would make sense for us to adopt the ICU algorithms and turn this problem into a table selection problem? In other words, direct us into a unified hypenation/segmentation/line-breaking backed by ICU algorithms both for Gecko and Intl API, but select only tables that we have right now? If that would be possible, my naive understanding is that we'd gain: - better algorithms to the ones we have to maintain - lower maintenance cost for us with ability to remove our own algos - unified API for the web and Gecko Costs: - One of our APIs is probably more fine-tuned for the Web than what ICU is doing (line-breaking) - Potentially, even the subset of tables we'd want to have to match our current data set would cost in size (?) Is that a good summary? Would it be something potentially realistic?
Comment 7•7 years ago
|
||
I'm not sure what "table selection problem" is. (In reply to Zibi Braniecki [:gandalf][:zibi] from comment #6) > Thanks for the explanation Jonathan! > > I think it all makes sense, however unfortunately it puts us at odds with > the Intl.Segmenter proposal. > > Is there any chance it would make sense for us to adopt the ICU algorithms > and turn this problem into a table selection problem? > > In other words, direct us into a unified > hypenation/segmentation/line-breaking backed by ICU algorithms both for > Gecko and Intl API, but select only tables that we have right now? > > If that would be possible, my naive understanding is that we'd gain: > > - better algorithms to the ones we have to maintain > - lower maintenance cost for us with ability to remove our own algos > - unified API for the web and Gecko > > Costs: > > - One of our APIs is probably more fine-tuned for the Web than what ICU is > doing (line-breaking) > - Potentially, even the subset of tables we'd want to have to match our > current data set would cost in size (?) > > Is that a good summary? Would it be something potentially realistic?
Comment 8•7 years ago
|
||
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #6) > Thanks for the explanation Jonathan! > > I think it all makes sense, however unfortunately it puts us at odds with > the Intl.Segmenter proposal. > > Is there any chance it would make sense for us to adopt the ICU algorithms > and turn this problem into a table selection problem? > > In other words, direct us into a unified > hypenation/segmentation/line-breaking backed by ICU algorithms both for > Gecko and Intl API, but select only tables that we have right now? (Don't include "hyphenation" in discussion here, AFAIK there's no hyphenation support in ICU. We implement that entirely separately, using libhyphen.) I don't know how much of the 3.5MB we could save by excluding some of the data; we could exclude the sentence-break rules, as we don't currently have any such functionality, and we could trim any localized versions of line- and word-breaking, as we don't currently implement language-sensitive behavior (that's bug 203016!). But my guess (without having actually tried) is that the savings would be relatively small. Most of the bulk probably comes from properties for the entire Unicode repertoire, which are needed even if we only support the root locale, plus the dictionaries that are needed to support Thai (etc) word-break recognition. Maybe André can try creating a version of the data patch with trimmed content and see how much different it turns out? But I suspect it won't make a dramatic difference, and if we want this functionality at all we're going to have to accept a several-MB size increase. (Remember that currently we're relying on data provided by the OS for substantial parts of this, which means we don't save any data size within Gecko by dropping the old implementations.)
Assignee | ||
Comment 9•7 years ago
|
||
(In reply to Jonathan Kew (:jfkthame) from comment #8) > Maybe André can try creating a version of the data patch with trimmed > content and see how much different it turns out? But I suspect it won't make > a dramatic difference, and if we want this functionality at all we're going > to have to accept a several-MB size increase. The total increase in bytes is 3.745.072: 2.882.016 bytes for dictionaries (burmese, chinese-japanese, khmer lao, thai) and 863.056 bytes (non-dictionary data).
Updated•5 years ago
|
Updated•4 years ago
|
Assignee | ||
Comment 11•4 years ago
|
||
Includes all break iteration data, including line segmentation even though it's
unused, because ICU data tailoring doesn't support to exclude line segmentation
data.
Filtered break iteration and the accompanying "exceptions" data tree are
disabled, because the Intl.Segmenter proposal doesn't support sentence break
suppressions through the "ss" Unicode extension key.
Assignee | ||
Comment 12•4 years ago
|
||
Implement the standard Intl object code, including the constructor and the
resolvedOptions()
method. Intl.Segmenter.prototype.segment()
will be added
in the next patch.
Depends on D88432
Assignee | ||
Comment 13•4 years ago
|
||
Intl.Segmenter.prototype.segment()
returns a SegmentsObject tied to a specific
string. The SegmentsObject supports finding segmentation boundaries through its
containing()
method.
Each SegmentsObject
holds a separate UBreakIterator
instead of using a
shared UBreakIterator
in SegmenterObject
to allow ICU to precompute and cache
segmentation boundaries for a single string.
The "Segment Data" object is created in self-hosted code to ensure we get stable
shapes.
Depends on D88433
Assignee | ||
Comment 14•4 years ago
|
||
In addition to the containing()
method, Segments objects also support
iteration. Segment Iterator objects always start the segment iteration at the
start of the string and are independent of the current position of the Segments
object's UBreakIterator
. That means each Segment Iterator needs to have yet
another copy of the UBreakIterator
.
Depends on D88434
Assignee | ||
Comment 15•4 years ago
|
||
Test262 tests cover the various error conditions, so the local tests are mostly
restricted to functional tests and SpiderMonkey specifics like cross-compartment
wrappers.
Depends on D88435
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 16•4 years ago
|
||
Depends on D88436
Assignee | ||
Comment 17•4 years ago
|
||
The patches are ready, but I haven't yet assigned a reviewer, because I'm not sure who is currently available.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 18•4 years ago
|
||
Add a fast-path for Latin-1 strings to avoid the Latin-1 to Two-Byte string
conversion before calling into ICU. This gives a 50x speed-up for the cases
when AutoStableStringChars
needs to malloc memory to hold the converted
characters. It also provides a 25-40% speed-up compared to ICU (when comparing
without the malloc overhead).
Word break rules and properties for Latin-1 shouldn't observe any changes in
future UAX #29 resp. Unicode versions, so hard-coding the rules and properties
shouldn't lead to any issues.
All rules are annotated to allow comparing them with UAX #29. Rules which don't
apply to Latin-1 characters (e.g. Hebrew letters or Katagana) were omitted.
Depends on D88438
Assignee | ||
Comment 19•4 years ago
|
||
Similar to part 7, add a fast-path for Latin-1 strings. In addition to avoiding
the malloc overhead, the fast-path is 3-4x faster than ICU (when comparing
without the malloc overhead).
Depends on D89373
Comment 20•4 years ago
|
||
I'm available to review stuff. At least for now I plan on continuing to do stuff on an as-I-feel-like-it basis, and while I didn't start doing intl stuff by deliberate and intended choice, over time I've come to find it reasonably interesting in its own right. So intl stuff is part of the SpiderMonkey things I will keep doing. Therefore, throw 'em at me if you want. :-)
Can't make guarantees about turnaround time, if I decide to spend any particular tomorrow biking all day instead of doing reviewing that's just how it'll be. But then I couldn't well offer guarantees before, either, could I, when other projects were my primary goals. So it's not really too much different. (And with recent bike bustage, before I could get to repairing it, I seem to have lost a bunch of fitness, so I gotta budget in rest days anyway.)
(In reply to me from comment #10)
Any idea when
Intl.Segmenter
will be added?
Not terribly precisely, no. Patches need reviewing, we need to decide just how to deal with the size increases (if we haven't already chosen to accept it), and I'm not 100% certain what exactly we're doing about either having unified segmentation support used throughout Gecko, or having different implementations for Gecko for layout and for JS for Intl. If you need segmentation now, or soon, you should plan on shipping and using your own implementation.
Assignee | ||
Comment 21•4 years ago
|
||
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment 22•4 years ago
|
||
I can be a secondary reviewer here.
Assignee | ||
Comment 23•4 years ago
|
||
That would be great. Thanks! :-)
Comment 24•4 years ago
•
|
||
I spoke to a few people regarding using ICU (Zibi and Ted mostly) for this. What the patch does now is valid and would be web compatible, but would result in an increased binary size and (perhaps more importantly) inconsistent behaviour between Gecko and Spidermonkey. The solution that is being discussed would be a shared segmenter library, and there is a plan for this that will probably kick of in early next year.
My initial thought was, if we could simply replace ICU with this library, then we could go forward with the work you did here behind a build flag, however it looks like the shared library will likely incorporate some of the functionality of the the Intl.Segmenter proposal. So, for now, we should wait.
I have set this to "requested changes" on the first patch for the time being, since there is no way to communicate "we are waiting on changes elsewhere that will impact this patch" on phabricator in case anyone else is looking at ongoing work or wants to know what the status of Intl.Segmenter is. We might land a modified version of this once things become clearer in how Gecko and SM will share Intl.Segmenter functionality.
Assignee | ||
Comment 25•4 years ago
•
|
||
When we talked in Berlin about this topic, I thought we concluded that Gecko isn't affected by Intl.Segmenter
, because Intl.Segmenter
doesn't include line segmentation? And we got the okay to increase the binary size across team (desktop and mobile).
Comment 26•4 years ago
•
|
||
Andre - A month ago I sent you a proposal for the project we're kicking off to rework how segmentation will work to allow us to fit into this spec without paying the payload cost.
I also shared a couple more proposals related to unification of internationalization APIs between Gecko and SpiderMonkey.
I used your bugmail email. If there's a better way to share those with you, please, let me know.
Assignee | ||
Comment 27•4 years ago
|
||
Thanks. I've read the proposals yesterday. It looks like the preparatory work for Intl.Segmenter
will take a while, so I'll unassign myself for now, because I guess we shouldn't give the impression that Intl.Segmenter
will land in the next few months.
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment 28•3 years ago
|
||
Is it possible to update this one to block https://bugzilla.mozilla.org/show_bug.cgi?id=1435795 now that Stage 4 has been reached (https://github.com/tc39/notes/blob/master/meetings/2021-10/oct-25.md#conclusionresolution-1)?
Updated•3 years ago
|
Updated•2 years ago
|
Comment 29•1 year ago
|
||
Hi. I would like to add a note about the potential increase of usage (and also, potential web-compat breakages) of this proposal, which may worth a reconsideration of the discussion priority:
An npm package, string-width
[1], started to use this Intl.Segmenter
API since its 6.0.0 version released 2 weeks ago [2]. This package currently has ~101M weekly downloads, and might be eligible as a widely-used package. Therefore, subsequent dependency upgrades of various sites might start to cause (a lot of) problems on browsers that don't have this implemented.
Comment 30•1 year ago
|
||
This is planned for later this year, but we're not able to work on this until the new ICU4x backed segmenter lands in Bug 1719535.
Comment 31•10 months ago
|
||
Dan, with Bug 1719535 landing, is this API effort back on the radar screen here in the near term?
Comment 32•10 months ago
|
||
We're planning this for Q4, once some of the current standards related work is completed. It's great to see Bug 1719535 land :)
Assignee | ||
Comment 33•10 months ago
|
||
Implement the standard Intl object code, including the constructor and the
resolvedOptions()
method. Intl.Segmenter.prototype.segment()
will be added
in the next patch.
Assignee | ||
Comment 34•10 months ago
|
||
Intl.Segmenter.prototype.segment()
returns a SegmentsObject tied to a specific
string. The SegmentsObject supports finding segmentation boundaries through its
containing()
method.
The "Segment Data" object is created in self-hosted code to ensure we get stable
shapes.
Depends on D186425
Assignee | ||
Comment 35•10 months ago
|
||
In addition to the containing()
method, Segments objects also support
iteration. Segment Iterator objects always start the segment iteration at the
start of the string.
Depends on D186426
Assignee | ||
Comment 36•10 months ago
|
||
Test262 tests cover the various error conditions, so the local tests are mostly
restricted to functional tests and SpiderMonkey specifics like cross-compartment
wrappers.
Depends on D186427
Assignee | ||
Comment 37•10 months ago
|
||
Depends on D186428
Assignee | ||
Comment 38•10 months ago
|
||
Depends on D186429
Assignee | ||
Comment 39•10 months ago
|
||
Running the tests uncovered some issues which look like bugs in the ICU4X implementation. (ICU4C computes the expected results). Should I file this at https://github.com/unicode-org/icu4x?
WordBreakIterator::is_word_like()
seems to return a wrong result when the input ends with single full stop.
For example "World"
isn't marked as word-like in this example:
js> [...new Intl.Segmenter("en", {granularity:"word"}).segment("Hello World.")]
[
{segment:"Hello", index:0, input:"Hello World.", isWordLike:true},
{segment:" ", index:5, input:"Hello World.", isWordLike:false},
{segment:"World", index:6, input:"Hello World.", isWordLike:false},
{segment:".", index:11, input:"Hello World.", isWordLike:false}
]
Replacing "."
with "!"
leads to computing the correct word-like result:
js> [...new Intl.Segmenter("en", {granularity:"word"}).segment("Hello World!")]
[
{segment:"Hello", index:0, input:"Hello World!", isWordLike:true},
{segment:" ", index:5, input:"Hello World!", isWordLike:false},
{segment:"World", index:6, input:"Hello World!", isWordLike:true},
{segment:"!", index:11, input:"Hello World!", isWordLike:false}
]
SentenceBreakIterator
seems to have issues with emojis.
The string "Hello World. "
is correctly classified as a single sentence.
js> [...new Intl.Segmenter(undefined, {granularity: "sentence"}).segment("Hello World. ")].map(s => s.segment)
["Hello World. "]
But when followed by 🤥 (U+1F925), the string is suddenly split into three segments. (The expected result is ["Hello World. ", "\uD83E\uDD25"]
.)
js> [...new Intl.Segmenter(undefined, {granularity: "sentence"}).segment("Hello World. " + "\u{1F925}")].map(s => s.segment)
["Hello World", ".", " \uD83E\uDD25"]
When replacing the full stop with an exclamation mark, only the trailing whitespace is incorrectly attributed to the next segment.
js> [...new Intl.Segmenter(undefined, {granularity: "sentence"}).segment("Hello World! " + "\u{1F925}")].map(s => s.segment)
["Hello World!", " \uD83E\uDD25"]
Comment 40•10 months ago
|
||
WordBreakIterator::is_word_like() seems to return a wrong result when the input ends with single full stop.
I will look it. I guess ICU4X bug. Where is is_word_like test? I would like to import/convert it for ICU4X. https://github.com/tc39/test262/blob/main/test/intl402/Segmenter/prototype/segment/containing/iswordlike.js?
SentenceBreakIterator seems to have issues with emojis.
Ah, although this is SB11 rules, this seems a bug of segmenter rule. I will fix it in ICU4X.
Assignee | ||
Comment 41•10 months ago
|
||
(In reply to Makoto Kato [:m_kato] from comment #40)
WordBreakIterator::is_word_like() seems to return a wrong result when the input ends with single full stop.
I will look it. I guess ICU4X bug. Where is is_word_like test? I would like to import/convert it for ICU4X. https://github.com/tc39/test262/blob/main/test/intl402/Segmenter/prototype/segment/containing/iswordlike.js?
The tests are in https://phabricator.services.mozilla.com/D186428. The word-segmenter test is "js/src/tests/non262/Intl/Segmenter/word.js", it has this commented-out assertion which succeeds in ICU4C, but fails in ICU4X.
// The non-word parts in the samples are either punctuators or space separators.
// assertEq(obj.isWordLike, !/^(\p{gc=P}|\p{gc=Zs})+$/u.test(obj.segment), obj.segment);
SentenceBreakIterator seems to have issues with emojis.
Ah, although this is SB11 rules, this seems a bug of segmenter rule. I will fix it in ICU4X.
Thanks!
Comment 42•10 months ago
|
||
comment #39 will be fixed by ICU4X 1.3
Comment 43•8 months ago
|
||
Do I read the patches correctly that supportedLocalesOf
responds based on what locales are explicitly known to other Intl
objects?
(I don't have a good suggestion of how supportedLocalesOf
for Intl.Segmenter
should behave when the back end ignores the locale and in that sense "supported locale" isn't a good conceptual match for the back end. Chrome seems to have particularly involved way of coming up with supported/not-supported claims for its Intl.Segmenter
, which while locale-sensitive, seems to be only minimally so. (Chrome seems to report sv-FI
as supported but fi-SE
as unsupported, which makes no sense considering what the ICU4C segmenter data contains, as it contains a common exception for fi
and sv
together ignoring the region for both, but instead seems to be based on knowledge that sv-FI
has more status as a well-known thing in non-segmentation manner than fi-SE
.) It seems a bit sad if for Web compat supportedLocalesOf
becomes or, considering what Chrome does, has already become a particularly complex thing that has little to do with how the back end behaves.)
Assignee | ||
Comment 44•8 months ago
|
||
Yes, the supportedLocalesOf
implementation uses the set of locales supported by ICU4C (as determined by uloc_getAvailable
). This matches the implementation in both V8 and JSC. This also explains why Intl.Segmenter.supportedLocalesOf(["sv-FI", "fi-SE"])
returns ["sv-FI", "fi-SE"]
in both engines:
sv-FI
is supported, because ICU4C contains (general) localisation support forsv-FI
(https://github.com/unicode-org/icu/blob/main/icu4c/source/data/locales/sv_FI.txt), and thereforenew Intl.Segmenter("sv-FI").resolvedOptions().locale === "sv-FI"
.fi-SE
is supported through the fallback to the parent localefi
(https://github.com/unicode-org/icu/blob/main/icu4c/source/data/locales/fi.txt), and thereforenew Intl.Segmenter("fi-SE").resolvedOptions().locale === "fi"
.
Comment 45•7 months ago
|
||
Hi Anba, I was wondering if we are still blocked on this. ICU4X 1.3 has been released (https://github.com/unicode-org/icu4x/tree/release/1.3) although we haven't updated our in-tree version. Are we waiting on anything else to continue this work?
I can take a look at doing the ICU4X 1.3 update if that would help. Or do the update and also rebase the patches here if you're busy with other things :)
Comment 46•7 months ago
|
||
Makoto is updating ICU4X segmenter to latest 1.4 in bug 1854031.
Assignee | ||
Comment 47•6 months ago
|
||
(In reply to Dan Minor [:dminor] from comment #45)
Are we waiting on anything else to continue this work?
No, we're just waiting for the ICU4X segmenter update to land.
Updated•6 months ago
|
Updated•6 months ago
|
Updated•6 months ago
|
Updated•6 months ago
|
Updated•6 months ago
|
Updated•6 months ago
|
Assignee | ||
Comment 48•6 months ago
|
||
Recreating the ICU4X segmenters and break iterators is too slow when compared
to other engines, so we have to cache the ICU4X object for better performance.
Depends on D186430
Comment 49•6 months ago
|
||
@anba - do you have any rough estimates as to how much caching (part 7) saves? How does pre/post compares to other engines?
Assignee | ||
Comment 50•6 months ago
|
||
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #49)
@anba - do you have any rough estimates as to how much caching (part 7) saves? How does pre/post compares to other engines?
Grapheme segmentation over the string "The quick brown fox jumps over the lazy dog."
.
Engine | Factor |
---|---|
SM (base) | 1 |
SM (cache) | ~5x faster |
JSC | ~2.5 faster |
V8 | ~1.5 slower (!) |
Word segmentation over the string "The quick brown fox jumps over the lazy dog."
.
Engine | Factor |
---|---|
SM (base) | 1 |
SM (cache) | ~7x faster |
JSC | ~3.5 faster |
V8 | 1 (!) |
There's an even larger improvement for larger strings, because we're basically comparing a linear in string length against a quadratic in string length solution. (Quadratic because we always restart from index zero.)
I don't know why V8 is so slow when compared to JSC, maybe they're doing something wrong when calling into ICU?
var seg = new Intl.Segmenter("en", {granularity:"grapheme"})
var str = "The quick brown fox jumps over the lazy dog.";
var r = 0;
var t = performance.now();
for (var i = 0; i < 100_000; ++i) {
r += [...seg.segment(str)].length;
}
console.log(performance.now() - t, r);
Comment 51•6 months ago
|
||
Pushed by andre.bargull@gmail.com: https://hg.mozilla.org/integration/autoland/rev/5fb85ed28b12 Part 1: Implement standard Intl constructor code for Intl.Segmenter. r=dminor https://hg.mozilla.org/integration/autoland/rev/ea86c2b391f7 Part 2: Implement Segmenter.prototype.segment and the SegmentsObject. r=dminor https://hg.mozilla.org/integration/autoland/rev/ef3254d46ab7 Part 3: Implement Segment Iterator objects. r=dminor https://hg.mozilla.org/integration/autoland/rev/e06dfac91b17 Part 4: Add tests for Intl.Segmenter. r=dminor https://hg.mozilla.org/integration/autoland/rev/ec5883dc144d Part 5: Enable test262 tests. r=dminor https://hg.mozilla.org/integration/autoland/rev/3308537ba48a Part 6: Reimport test262. r=dminor https://hg.mozilla.org/integration/autoland/rev/4b401e7231ca Part 7: Cache internal segmenters and break iterators. r=dminor
Comment 52•6 months ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/5fb85ed28b12
https://hg.mozilla.org/mozilla-central/rev/ea86c2b391f7
https://hg.mozilla.org/mozilla-central/rev/ef3254d46ab7
https://hg.mozilla.org/mozilla-central/rev/e06dfac91b17
https://hg.mozilla.org/mozilla-central/rev/ec5883dc144d
https://hg.mozilla.org/mozilla-central/rev/3308537ba48a
https://hg.mozilla.org/mozilla-central/rev/4b401e7231ca
Comment 53•6 months ago
|
||
:anba could you consider nominating this for a release note? (Process info)
Assignee | ||
Comment 54•6 months ago
|
||
Release Note Request (optional, but appreciated)
[Why is this notable]: A new web platform feature
[Affects Firefox for Android]: Yes
[Suggested wording]: Add support for Unicode text segmentation to JavaScript. Currently only enabled for Nightly builds.
[Links (documentation, blog post, etc)]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter
Comment 55•6 months ago
|
||
Thanks, added to the Fx122 nightly release notes, please allow 30 minutes for the site to update.
Comment 56•6 months ago
|
||
FF122 docs work for this can be tracked in https://github.com/mdn/content/issues/31098. The API is already documented, so this was just an MDN release note and update to browser compatibility data.
Comment 57•5 months ago
|
||
Q: Based on the intl.icu4x.segmenter.enabled
set to true
in StaticPrefList.yaml, does it mean that Intl.Segmenter API will be shipped in 122 enabled by default?
Assignee | ||
Comment 58•5 months ago
|
||
(In reply to Vadim Makeev from comment #57)
Q: Based on the
intl.icu4x.segmenter.enabled
set totrue
in StaticPrefList.yaml, does it mean that Intl.Segmenter API will be shipped in 122 enabled by default?
No, Intl.Segmenter
is still restricted to Nightly due to this ifdef.
Comment 59•5 months ago
|
||
(In reply to André Bargull [:anba] from comment #58)
(In reply to Vadim Makeev from comment #57)
Q: Based on the
intl.icu4x.segmenter.enabled
set totrue
in StaticPrefList.yaml, does it mean that Intl.Segmenter API will be shipped in 122 enabled by default?No,
Intl.Segmenter
is still restricted to Nightly due to this ifdef.
What's blocking the removal of the ifdef and letting Intl.Segmenter
ride the trains?
Assignee | ||
Comment 60•5 months ago
|
||
I didn't want to let it ride the trains right away in case there are any major differences between ICU4C segmentation (which is used by JSC/V8) and ICU4X segmentation which can lead to web-compat issues. Minor differences which are unlikely to cause issues are for example:
- https://github.com/unicode-org/icu4x/issues/4365
- https://github.com/unicode-org/icu4x/issues/4417
- https://github.com/unicode-org/icu4x/issues/4446
We didn't yet receive any bug reports, so we could consider enabling Intl.Segmenter
by default for either Firefox 124 or 125, I guess.
Comment 61•5 months ago
|
||
Have we considered improving the implementation of the segmenter?
it will break on abbrevations when splitting, where most NLP libs don't.
Example with NTLK:
>>> import nltk
>>> nltk.sent_tokenize("""E. T. A. Hoffmann (1776–1822) wrote a short story in 1816 titled Der Sandmann, which showed how sinister such a character could be made. According to the protagonist's nurse, he threw sand in the eyes of children who would not sleep, with the result of those eyes falling out and being collected by the Sandman, who then takes the eyes to his iron nest on the Moon and uses them to feed his children. The protagonist of the story grows to associate this nightmarish creature with the genuinely sinister figure of his father's associate, Coppelius. In Romanian folklore, there is a similar character, Moș Ene (Ene the Elder). Hoffmann's version of the sandman is also similar to the French Canadian character known as the Bonhomme Sept Heures (Goodmaan Seven O’Clock), who, in some versions, throws sand in children's eyes to blind them so that he may capture them. Contrarily to the sandman, his bag is the place where he traps children who do not go to bed.""")
['E. T. A. Hoffmann (1776–1822) wrote a short story in 1816 titled Der Sandmann, which showed how sinister such a character could be made.', "According to the protagonist's nurse, he threw sand in the eyes of children who would not sleep, with the result of those eyes falling out and being collected by the Sandman, who then takes the eyes to his iron nest on the Moon and uses them to feed his children.", "The protagonist of the story grows to associate this nightmarish creature with the genuinely sinister figure of his father's associate, Coppelius.", 'In Romanian folklore, there is a similar character, Moș Ene (Ene the Elder).', "Hoffmann's version of the sandman is also similar to the French Canadian character known as the Bonhomme Sept Heures (Goodmaan Seven O’Clock), who, in some versions, throws sand in children's eyes to blind them so that he may capture them.", 'Contrarily to the sandman, his bag is the place where he traps children who do not go to bed.']
Intl will split at E. T. A.
Comment 62•5 months ago
|
||
That's an ICU4X-level question. ICU4C suppresses sentence breaks for well-known abbreviations in the best-known languages, but only those. https://github.com/unicode-org/icu4x/issues/3284 is about whether ICU4X should do something like that.
At present ICU4C as used in Chrome treats "E.", "T.", and "A." as sentences, since they aren't well-known abbreviations even though English is one of the best-known languages.
Comment 63•5 months ago
|
||
Thanks Henri, will dig in that direction. It breaks for all points as far as I can tell. e.g. "Jr." in English.
Assignee | ||
Comment 64•5 months ago
|
||
In addition to the ICU4X issue linked in comment #62, the following ICU4X and ECMA-402 issues are also related:
- https://github.com/unicode-org/icu4x/issues/3927: Add support for sentence break suppressions.
- https://github.com/tc39/ecma402/pull/783: Allow sentence break suppressions for
Intl.Segmenter
. - https://github.com/tc39/ecma402/issues/656: How URLs, email addresses, and acronyms should be handled by
Intl.Segmenter
.
Comment 65•4 months ago
|
||
This has been enabled on Nightly for 3 cycles now. Per our policy, removing this from the Nightly release notes. Feel free to nominate again for train-specific release notes when this is ready to ride the trains to Release.
Description
•