Closed Bug 1423593 Opened 7 years ago Closed 6 months ago

Add Intl.Segmenter API

Tracking

()

Status:

RESOLVED FIXED

Milestone:

122 Branch

Tracking Flags:

Tracking

Status

firefox59

---

wontfix

firefox122

---

fixed

People

(Reporter: anba, Assigned: anba)

References

(Blocks 4 open bugs,
URL
)

Details

(Keywords: dev-doc-complete)

Attachments

(7 files, 10 obsolete files)

bug-1423593-part1-icu-data.zip 7 years ago André Bargull [:anba] 8.00 MB, application/x-zip-compressed		Details
bug-1423593-part2-segmenter.patch 7 years ago André Bargull [:anba] 80.73 KB, patch		Details \| Diff \| Splinter Review
Bug 1423593 - Part 1: Enable break iteration data in ICU. r=jwalden! 4 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 2: Implement standard Intl constructor code for Intl.Segmenter. r=jwalden! 4 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 3: Implement Segmenter.prototype.segment and the SegmentsObject. r=jwalden! 4 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 4: Implement Segment Iterator objects. r=jwalden! 4 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 5: Add tests for Intl.Segmenter. r=jwalden! 4 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 6: Enable test262 tests for Intl.Segmenter. r=jwalden! 4 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 7: Add fast-path for word boundary detection on Latin-1 strings. r=jwalden! 4 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 8: Add fast-path for sentence boundary detection on Latin-1 strings. r=jwalden! 4 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 1: Implement standard Intl constructor code for Intl.Segmenter. r=dminor! 10 months ago André Bargull [:anba] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 2: Implement Segmenter.prototype.segment and the SegmentsObject. r=dminor! 10 months ago André Bargull [:anba] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 3: Implement Segment Iterator objects. r=dminor! 10 months ago André Bargull [:anba] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 4: Add tests for Intl.Segmenter. r=dminor! 10 months ago André Bargull [:anba] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 5: Enable test262 tests. r=dminor! 10 months ago André Bargull [:anba] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 6: Reimport test262. r=dminor! 10 months ago André Bargull [:anba] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1423593 - Part 7: Cache internal segmenters and break iterators. r=dminor! 6 months ago André Bargull [:anba] 48 bytes, text/x-phabricator-request		Details \| Review

André Bargull [:anba]

Assignee

Description

•

7 years ago

The Intl.Segmenter proposal moved to stage 3 in September 2017 [2].


[1] https://github.com/tc39/proposal-intl-segmenter
[2] https://github.com/tc39/tc39-notes/blob/master/es8/2017-09/sep-26.md#12ib-intlsegmenter-for-stage-3

André Bargull [:anba]

Assignee

Comment 1

•

7 years ago

Attached file bug-1423593-part1-icu-data.zip (obsolete) — Details

Adds all break-iterator data to icudt for testing the WIP patch in part 2. Adding all break-iterator data increases icudt by 3.57MB which makes it kind of unlikely to be approved by release drivers. :-)

André Bargull [:anba]

Assignee

Comment 2

•

7 years ago

Attached patch bug-1423593-part2-segmenter.patch (obsolete) — Details — Splinter Review

WIP patch for Intl.Segmenter.

Florian Scholz (Open Web Docs)

Updated

•

7 years ago

Keywords: dev-doc-needed

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 3

•

7 years ago

Jonathan - the 3.5MB added is probably a no-go unless we can unify what we do internally with this API and either reuse the internal API here, or use ICU here and for Gecko and remove the internal API.

WDYT?

Flags: needinfo?(jfkthame)

Jonathan Kew [:jfkthame]

Comment 4

•

7 years ago

I guess quite a lot of the data size here is to support proper line- and word-segmentation in languages like Thai, Khmer and Lao where text is written without word separators, yet breaks are not allowed just "anywhere" like in Chinese (to a first approximation) or derived from simple character-based rules; the only way to get it right is a dictionary-based algorithm.

We currently don't have support for this built in to Gecko, as seen for example in bug 448650. On desktop platforms, we solve the line-breaking issue by calling host platform APIs (Uniscribe on Windows; CoreFoundation on macOS; pango on Linux) for text in "complex" scripts, which means that we're not responsible for packaging/delivering the bulky data that underlies it, but it also means our behavior is likely to vary somewhat between platforms, depending on the support provided by the libraries we're using.

So in principle, we could switch from those separate platform-specific implementations to a single ICU-based one, and gain more consistent behavior across platforms (and fix the behavior on Android, where I assume we currently just fail, per bug 448650); but this would mean accepting the data size as part of Gecko instead of depending on the platform to provide it. Whether that's a worthwhile trade-off is a product-drivers call, I guess.

We also have a bunch of custom line-breaking support in nsJISx4051LineBreaker for non-Thai/Lao/Khmer text, which could perhaps be replaced by ICU, though this would need careful examination as we don't necessarily follow the default Unicode algorithm in all cases. Our current breaker has a lot of customizations to better handle breaking around things like URLs, Japanese-style emoticon sequences ¯\_(ツ)_/¯ etc., and we'd need to be wary of regressing these cases.

There's also word-boundary segmentation, used for things like double-click selection; here, our current implementation is quite rudimentary (it's nsSampleWordBreaker, which sounds suspiciously like it was an initial place-holder that never got any further love!), and I expect ICU's segment iterator is substantially better. If nothing else, it'll be up-to-date with the current Unicode character repertoire, whereas it doesn't look like our code has been updated since about Unicode 3.0 -- and it's completely missing any attempt to support languages like Thai properly.

Overall, I see several benefits to switching intl/lwbrk to be backed by ICU segmentation: we'd get consistent behavior across desktop platforms, and fix the currently-broken behavior on mobile for Thai etc; we'd get improved word selection with up-to-date Unicode support and correct behavior (I presume) for the "difficult" SEAsian languages where we currently fail; and we'd get to drop maintenance of separate platform-specific line-breaking backends. (In particular, I suspect the Uniscribe-based implementation we use on Windows may become problematic as we continue to tighten sandboxing restrictions.)

It's clear, though, that the amount of existing code and data that we'd get to drop in the process is pretty small, so I don't expect it would make a significant dent in overall size to offset the increase in ICU data. Can/should we live with that?

(FWIW, I think the opposite approach, trying to implement Intl.Segmenter on top of our current line- and word-breaking code (AFAIK we don't have any implementation of "sentence segmentation"), isn't really worthwhile because our implementations aren't good/comprehensive enough, or sufficiently consistent across platforms, to be exposed to web authors who IMO should be able to rely on consistency from this API if they're going to write code that depends on it.)

Flags: needinfo?(jfkthame)

Jonathan Kew [:jfkthame]

Comment 5

•

7 years ago

See also bug 425915, which is about Thai word-segmentation. The current patches there are leveraging our existing platform-specific line-break implementations, so AFAICT would not fix the bug very well for Android, where we use the (much more limited) nsRuleBreaker rather than a dictionary-backed API. Adopting the ICU segmentation support would give a better result.

Jonathan Kew [:jfkthame]

Updated

•

7 years ago

Blocks: 448650, 425915

Jason Orendorff [:jorendorff]

Updated

•

7 years ago

Priority: -- → P3

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 6

•

7 years ago

Thanks for the explanation Jonathan!

I think it all makes sense, however unfortunately it puts us at odds with the Intl.Segmenter proposal.

Is there any chance it would make sense for us to adopt the ICU algorithms and turn this problem into a table selection problem?

In other words, direct us into a unified hypenation/segmentation/line-breaking backed by ICU algorithms both for Gecko and Intl API, but select only tables that we have right now?

If that would be possible, my naive understanding is that we'd gain:

 - better algorithms to the ones we have to maintain
 - lower maintenance cost for us with ability to remove our own algos
 - unified API for the web and Gecko

Costs:

 - One of our APIs is probably more fine-tuned for the Web than what ICU is doing (line-breaking)
 - Potentially, even the subset of tables we'd want to have to match our current data set would cost in size (?) 

Is that a good summary? Would it be something potentially realistic?

Flags: needinfo?(jfkthame)

Vee Satayamas

Comment 7

•

7 years ago

I'm not sure what "table selection problem" is.

(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #6)
> Thanks for the explanation Jonathan!
> 
> I think it all makes sense, however unfortunately it puts us at odds with
> the Intl.Segmenter proposal.
> 
> Is there any chance it would make sense for us to adopt the ICU algorithms
> and turn this problem into a table selection problem?
> 
> In other words, direct us into a unified
> hypenation/segmentation/line-breaking backed by ICU algorithms both for
> Gecko and Intl API, but select only tables that we have right now?
> 
> If that would be possible, my naive understanding is that we'd gain:
> 
>  - better algorithms to the ones we have to maintain
>  - lower maintenance cost for us with ability to remove our own algos
>  - unified API for the web and Gecko
> 
> Costs:
> 
>  - One of our APIs is probably more fine-tuned for the Web than what ICU is
> doing (line-breaking)
>  - Potentially, even the subset of tables we'd want to have to match our
> current data set would cost in size (?) 
> 
> Is that a good summary? Would it be something potentially realistic?

Jonathan Kew [:jfkthame]

Comment 8

•

7 years ago

(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #6)
> Thanks for the explanation Jonathan!
> 
> I think it all makes sense, however unfortunately it puts us at odds with
> the Intl.Segmenter proposal.
> 
> Is there any chance it would make sense for us to adopt the ICU algorithms
> and turn this problem into a table selection problem?
> 
> In other words, direct us into a unified
> hypenation/segmentation/line-breaking backed by ICU algorithms both for
> Gecko and Intl API, but select only tables that we have right now?

(Don't include "hyphenation" in discussion here, AFAIK there's no hyphenation support in ICU. We implement that entirely separately, using libhyphen.)

I don't know how much of the 3.5MB we could save by excluding some of the data; we could exclude the sentence-break rules, as we don't currently have any such functionality, and we could trim any localized versions of line- and word-breaking, as we don't currently implement language-sensitive behavior (that's bug 203016!). But my guess (without having actually tried) is that the savings would be relatively small. Most of the bulk probably comes from properties for the entire Unicode repertoire, which are needed even if we only support the root locale, plus the dictionaries that are needed to support Thai (etc) word-break recognition.

Maybe André can try creating a version of the data patch with trimmed content and see how much different it turns out? But I suspect it won't make a dramatic difference, and if we want this functionality at all we're going to have to accept a several-MB size increase.

(Remember that currently we're relying on data provided by the OS for substantial parts of this, which means we don't save any data size within Gecko by dropping the old implementations.)

Flags: needinfo?(jfkthame)

André Bargull [:anba]

Assignee

Comment 9

•

7 years ago

(In reply to Jonathan Kew (:jfkthame) from comment #8)
> Maybe André can try creating a version of the data patch with trimmed
> content and see how much different it turns out? But I suspect it won't make
> a dramatic difference, and if we want this functionality at all we're going
> to have to accept a several-MB size increase.

The total increase in bytes is 3.745.072: 2.882.016 bytes for dictionaries (burmese, chinese-japanese, khmer lao, thai) and 863.056 bytes (non-dictionary data).

Jonathan Kew [:jfkthame]

Updated

•

7 years ago

Updated

•

5 years ago

Blocks: es-proposals-stage-2

Alexandre Folle de Menezes

Updated

•

5 years ago

Blocks: es-intl

Comment 10

•

5 years ago

Any idea when Intl.Segmenter will be added?

Flags: needinfo?(jwalden)

Yulia Startsev [:yulia]

Updated

•

4 years ago

Blocks: es-proposals-stage-3
No longer blocks: es-proposals-stage-2

André Bargull [:anba]

Assignee

Comment 11

•

4 years ago

Attached file Bug 1423593 - Part 1: Enable break iteration data in ICU. r=jwalden! (obsolete) — Details

Includes all break iteration data, including line segmentation even though it's
unused, because ICU data tailoring doesn't support to exclude line segmentation
data.

Filtered break iteration and the accompanying "exceptions" data tree are
disabled, because the Intl.Segmenter proposal doesn't support sentence break
suppressions through the "ss" Unicode extension key.

André Bargull [:anba]

Assignee

Comment 12

•

4 years ago

Attached file Bug 1423593 - Part 2: Implement standard Intl constructor code for Intl.Segmenter. r=jwalden! (obsolete) — Details

Implement the standard Intl object code, including the constructor and the
resolvedOptions() method. Intl.Segmenter.prototype.segment() will be added
in the next patch.

Depends on D88432

André Bargull [:anba]

Assignee

Comment 13

•

4 years ago

Attached file Bug 1423593 - Part 3: Implement Segmenter.prototype.segment and the SegmentsObject. r=jwalden! (obsolete) — Details

Intl.Segmenter.prototype.segment() returns a SegmentsObject tied to a specific
string. The SegmentsObject supports finding segmentation boundaries through its
containing() method.

Each SegmentsObject holds a separate UBreakIterator instead of using a
shared UBreakIterator in SegmenterObject to allow ICU to precompute and cache
segmentation boundaries for a single string.

The "Segment Data" object is created in self-hosted code to ensure we get stable
shapes.

Depends on D88433

André Bargull [:anba]

Assignee

Comment 14

•

4 years ago

Attached file Bug 1423593 - Part 4: Implement Segment Iterator objects. r=jwalden! (obsolete) — Details

In addition to the containing() method, Segments objects also support
iteration. Segment Iterator objects always start the segment iteration at the
start of the string and are independent of the current position of the Segments
object's UBreakIterator. That means each Segment Iterator needs to have yet
another copy of the UBreakIterator.

Depends on D88434

André Bargull [:anba]

Assignee

Comment 15

•

4 years ago

Attached file Bug 1423593 - Part 5: Add tests for Intl.Segmenter. r=jwalden! (obsolete) — Details

Test262 tests cover the various error conditions, so the local tests are mostly
restricted to functional tests and SpiderMonkey specifics like cross-compartment
wrappers.

Depends on D88435

André Bargull [:anba]

Assignee

Updated

•

4 years ago

Attachment #8934993 - Attachment is obsolete: true

André Bargull [:anba]

Assignee

Updated

•

4 years ago

Attachment #8934995 - Attachment is obsolete: true

André Bargull [:anba]

Assignee

Comment 16

•

4 years ago

Attached file Bug 1423593 - Part 6: Enable test262 tests for Intl.Segmenter. r=jwalden! (obsolete) — Details

Depends on D88436

André Bargull [:anba]

Assignee

Comment 17

•

4 years ago

The patches are ready, but I haven't yet assigned a reviewer, because I'm not sure who is currently available.

André Bargull [:anba]

Assignee

Updated

•

4 years ago

Assignee: nobody → andrebargull

Status: NEW → ASSIGNED

André Bargull [:anba]

Assignee

Comment 18

•

4 years ago

Attached file Bug 1423593 - Part 7: Add fast-path for word boundary detection on Latin-1 strings. r=jwalden! (obsolete) — Details

Add a fast-path for Latin-1 strings to avoid the Latin-1 to Two-Byte string
conversion before calling into ICU. This gives a 50x speed-up for the cases
when AutoStableStringChars needs to malloc memory to hold the converted
characters. It also provides a 25-40% speed-up compared to ICU (when comparing
without the malloc overhead).
Word break rules and properties for Latin-1 shouldn't observe any changes in
future UAX #29 resp. Unicode versions, so hard-coding the rules and properties
shouldn't lead to any issues.
All rules are annotated to allow comparing them with UAX #29. Rules which don't
apply to Latin-1 characters (e.g. Hebrew letters or Katagana) were omitted.

Depends on D88438

André Bargull [:anba]

Assignee

Comment 19

•

4 years ago

Attached file Bug 1423593 - Part 8: Add fast-path for sentence boundary detection on Latin-1 strings. r=jwalden! (obsolete) — Details

Similar to part 7, add a fast-path for Latin-1 strings. In addition to avoiding
the malloc overhead, the fast-path is 3-4x faster than ICU (when comparing
without the malloc overhead).

Depends on D89373

Jeff Walden [:Waldo]

Comment 20

•

4 years ago

I'm available to review stuff. At least for now I plan on continuing to do stuff on an as-I-feel-like-it basis, and while I didn't start doing intl stuff by deliberate and intended choice, over time I've come to find it reasonably interesting in its own right. So intl stuff is part of the SpiderMonkey things I will keep doing. Therefore, throw 'em at me if you want. :-)

Can't make guarantees about turnaround time, if I decide to spend any particular tomorrow biking all day instead of doing reviewing that's just how it'll be. But then I couldn't well offer guarantees before, either, could I, when other projects were my primary goals. So it's not really too much different. (And with recent bike bustage, before I could get to repairing it, I seem to have lost a bunch of fitness, so I gotta budget in rest days anyway.)

(In reply to me from comment #10)

Any idea when Intl.Segmenter will be added?

Not terribly precisely, no. Patches need reviewing, we need to decide just how to deal with the size increases (if we haven't already chosen to accept it), and I'm not 100% certain what exactly we're doing about either having unified segmentation support used throughout Gecko, or having different implementations for Gecko for layout and for JS for Intl. If you need segmentation now, or soon, you should plan on shipping and using your own implementation.

Flags: needinfo?(jwalden)

André Bargull [:anba]

Assignee

Comment 21

•

4 years ago

(In reply to Jeff Walden [:Waldo] from comment #20)

Therefore, throw 'em at me if you want. :-)

\o/

Phabricator Automation

Updated

•

4 years ago

Attachment #9172419 - Attachment description: Bug 1423593 - Part 1: Enable break iteration data in ICU. → Bug 1423593 - Part 1: Enable break iteration data in ICU. r=jwalden!

Phabricator Automation

Updated

•

4 years ago

Attachment #9172420 - Attachment description: Bug 1423593 - Part 2: Implement standard Intl constructor code for Intl.Segmenter. → Bug 1423593 - Part 2: Implement standard Intl constructor code for Intl.Segmenter. r=jwalden!

Phabricator Automation

Updated

•

4 years ago

Attachment #9172421 - Attachment description: Bug 1423593 - Part 3: Implement Segmenter.prototype.segment and the SegmentsObject. → Bug 1423593 - Part 3: Implement Segmenter.prototype.segment and the SegmentsObject. r=jwalden!

Phabricator Automation

Updated

•

4 years ago

Attachment #9172422 - Attachment description: Bug 1423593 - Part 4: Implement Segment Iterator objects. → Bug 1423593 - Part 4: Implement Segment Iterator objects. r=jwalden!

Phabricator Automation

Updated

•

4 years ago

Attachment #9172423 - Attachment description: Bug 1423593 - Part 5: Add tests for Intl.Segmenter. → Bug 1423593 - Part 5: Add tests for Intl.Segmenter. r=jwalden!

Phabricator Automation

Updated

•

4 years ago

Attachment #9172426 - Attachment description: Bug 1423593 - Part 6: Enable test262 tests for Intl.Segmenter. → Bug 1423593 - Part 6: Enable test262 tests for Intl.Segmenter. r=jwalden!

Phabricator Automation

Updated

•

4 years ago

Attachment #9174218 - Attachment description: Bug 1423593 - Part 7: Add fast-path for word boundary detection on Latin-1 strings. → Bug 1423593 - Part 7: Add fast-path for word boundary detection on Latin-1 strings. r=jwalden!

Phabricator Automation

Updated

•

4 years ago

Attachment #9174220 - Attachment description: Bug 1423593 - Part 8: Add fast-path for sentence boundary detection on Latin-1 strings. → Bug 1423593 - Part 8: Add fast-path for sentence boundary detection on Latin-1 strings. r=jwalden!

Yulia Startsev [:yulia]

Comment 22

•

4 years ago

I can be a secondary reviewer here.

André Bargull [:anba]

Assignee

Comment 23

•

4 years ago

That would be great. Thanks! :-)

Yulia Startsev [:yulia]

Comment 24

•

4 years ago

•

Edited

I spoke to a few people regarding using ICU (Zibi and Ted mostly) for this. What the patch does now is valid and would be web compatible, but would result in an increased binary size and (perhaps more importantly) inconsistent behaviour between Gecko and Spidermonkey. The solution that is being discussed would be a shared segmenter library, and there is a plan for this that will probably kick of in early next year.

My initial thought was, if we could simply replace ICU with this library, then we could go forward with the work you did here behind a build flag, however it looks like the shared library will likely incorporate some of the functionality of the the Intl.Segmenter proposal. So, for now, we should wait.

I have set this to "requested changes" on the first patch for the time being, since there is no way to communicate "we are waiting on changes elsewhere that will impact this patch" on phabricator in case anyone else is looking at ongoing work or wants to know what the status of Intl.Segmenter is. We might land a modified version of this once things become clearer in how Gecko and SM will share Intl.Segmenter functionality.

André Bargull [:anba]

Assignee

Comment 25

•

4 years ago

•

Edited

When we talked in Berlin about this topic, I thought we concluded that Gecko isn't affected by Intl.Segmenter, because Intl.Segmenter doesn't include line segmentation? And we got the okay to increase the binary size across team (desktop and mobile).

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 26

•

4 years ago

•

Edited

Andre - A month ago I sent you a proposal for the project we're kicking off to rework how segmentation will work to allow us to fit into this spec without paying the payload cost.
I also shared a couple more proposals related to unification of internationalization APIs between Gecko and SpiderMonkey.

I used your bugmail email. If there's a better way to share those with you, please, let me know.

Flags: needinfo?(andrebargull)

André Bargull [:anba]

Assignee

Comment 27

•

4 years ago

Thanks. I've read the proposals yesterday. It looks like the preparatory work for Intl.Segmenter will take a while, so I'll unassign myself for now, because I guess we shouldn't give the impression that Intl.Segmenter will land in the next few months.

Assignee: andrebargull → nobody

Status: ASSIGNED → NEW

Flags: needinfo?(andrebargull)

Phabricator Automation

Updated

•

4 years ago

Attachment #9172419 - Attachment is obsolete: true

Phabricator Automation

Updated

•

4 years ago

Attachment #9172420 - Attachment is obsolete: true

Phabricator Automation

Updated

•

4 years ago

Attachment #9172421 - Attachment is obsolete: true

Phabricator Automation

Updated

•

4 years ago

Attachment #9172422 - Attachment is obsolete: true

Phabricator Automation

Updated

•

4 years ago

Attachment #9172423 - Attachment is obsolete: true

Phabricator Automation

Updated

•

4 years ago

Attachment #9172426 - Attachment is obsolete: true

Phabricator Automation

Updated

•

4 years ago

Attachment #9174218 - Attachment is obsolete: true

Phabricator Automation

Updated

•

4 years ago

Attachment #9174220 - Attachment is obsolete: true

Zibi Braniecki [:zbraniecki][:gandalf]

Updated

•

4 years ago

Blocks: segmenter

Julien G

Comment 28

•

3 years ago

Is it possible to update this one to block https://bugzilla.mozilla.org/show_bug.cgi?id=1435795 now that Stage 4 has been reached (https://github.com/tc39/notes/blob/master/meetings/2021-10/oct-25.md#conclusionresolution-1)?

Yulia Startsev [:yulia]

Updated

•

3 years ago

Blocks: es-proposals-stage-4
No longer blocks: es-proposals-stage-3

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

Greg Tatum [:gregtatum] (not responding to NI until Jun 17)

Updated

•

1 year ago

Blocks: 1820618

Julian Descottes [:jdescottes]

Updated

•

1 year ago

Blocks: 1826808

Outvi V

Comment 29

•

1 year ago

Hi. I would like to add a note about the potential increase of usage (and also, potential web-compat breakages) of this proposal, which may worth a reconsideration of the discussion priority:

An npm package, string-width [1], started to use this Intl.Segmenter API since its 6.0.0 version released 2 weeks ago [2]. This package currently has ~101M weekly downloads, and might be eligible as a widely-used package. Therefore, subsequent dependency upgrades of various sites might start to cause (a lot of) problems on browsers that don't have this implemented.

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

1 year ago

Depends on: 1719535

Dan Minor [:dminor]

Comment 30

•

1 year ago

This is planned for later this year, but we're not able to work on this until the new ICU4x backed segmenter lands in Bug 1719535.

Frank Griffith Jr

Comment 31

•

10 months ago

Dan, with Bug 1719535 landing, is this API effort back on the radar screen here in the near term?

Flags: needinfo?(dminor)

Dan Minor [:dminor]

Comment 32

•

10 months ago

We're planning this for Q4, once some of the current standards related work is completed. It's great to see Bug 1719535 land :)

Flags: needinfo?(dminor)

André Bargull [:anba]

Assignee

Comment 33

•

10 months ago

Attached file Bug 1423593 - Part 1: Implement standard Intl constructor code for Intl.Segmenter. r=dminor! — Details

Implement the standard Intl object code, including the constructor and the
resolvedOptions() method. Intl.Segmenter.prototype.segment() will be added
in the next patch.

André Bargull [:anba]

Assignee

Comment 34

•

10 months ago

Attached file Bug 1423593 - Part 2: Implement Segmenter.prototype.segment and the SegmentsObject. r=dminor! — Details

Intl.Segmenter.prototype.segment() returns a SegmentsObject tied to a specific
string. The SegmentsObject supports finding segmentation boundaries through its
containing() method.

The "Segment Data" object is created in self-hosted code to ensure we get stable
shapes.

Depends on D186425

André Bargull [:anba]

Assignee

Comment 35

•

10 months ago

Attached file Bug 1423593 - Part 3: Implement Segment Iterator objects. r=dminor! — Details

In addition to the containing() method, Segments objects also support
iteration. Segment Iterator objects always start the segment iteration at the
start of the string.

Depends on D186426

André Bargull [:anba]

Assignee

Comment 36

•

10 months ago

Attached file Bug 1423593 - Part 4: Add tests for Intl.Segmenter. r=dminor! — Details

Test262 tests cover the various error conditions, so the local tests are mostly
restricted to functional tests and SpiderMonkey specifics like cross-compartment
wrappers.

Depends on D186427

André Bargull [:anba]

Assignee

Comment 37

•

10 months ago

Attached file Bug 1423593 - Part 5: Enable test262 tests. r=dminor! — Details

Depends on D186428

André Bargull [:anba]

Assignee

Comment 38

•

10 months ago

Attached file Bug 1423593 - Part 6: Reimport test262. r=dminor! — Details

Depends on D186429

André Bargull [:anba]

Assignee

Comment 39

•

10 months ago

Running the tests uncovered some issues which look like bugs in the ICU4X implementation. (ICU4C computes the expected results). Should I file this at https://github.com/unicode-org/icu4x?

WordBreakIterator::is_word_like() seems to return a wrong result when the input ends with single full stop.

For example "World" isn't marked as word-like in this example:

js> [...new Intl.Segmenter("en", {granularity:"word"}).segment("Hello World.")]
[
  {segment:"Hello", index:0, input:"Hello World.", isWordLike:true},
  {segment:" ", index:5, input:"Hello World.", isWordLike:false},
  {segment:"World", index:6, input:"Hello World.", isWordLike:false},
  {segment:".", index:11, input:"Hello World.", isWordLike:false}
]

Replacing "." with "!" leads to computing the correct word-like result:

js> [...new Intl.Segmenter("en", {granularity:"word"}).segment("Hello World!")]
[
  {segment:"Hello", index:0, input:"Hello World!", isWordLike:true},
  {segment:" ", index:5, input:"Hello World!", isWordLike:false},
  {segment:"World", index:6, input:"Hello World!", isWordLike:true},
  {segment:"!", index:11, input:"Hello World!", isWordLike:false}
]

SentenceBreakIterator seems to have issues with emojis.

The string "Hello World. " is correctly classified as a single sentence.

js> [...new Intl.Segmenter(undefined, {granularity: "sentence"}).segment("Hello World. ")].map(s => s.segment)
["Hello World. "]

But when followed by 🤥 (U+1F925), the string is suddenly split into three segments. (The expected result is ["Hello World. ", "\uD83E\uDD25"].)

js> [...new Intl.Segmenter(undefined, {granularity: "sentence"}).segment("Hello World. " + "\u{1F925}")].map(s => s.segment)
["Hello World", ".", " \uD83E\uDD25"]

When replacing the full stop with an exclamation mark, only the trailing whitespace is incorrectly attributed to the next segment.

js> [...new Intl.Segmenter(undefined, {granularity: "sentence"}).segment("Hello World! " + "\u{1F925}")].map(s => s.segment)
["Hello World!", " \uD83E\uDD25"]

Flags: needinfo?(m_kato)

Makoto Kato [:m_kato]

Comment 40

•

10 months ago

WordBreakIterator::is_word_like() seems to return a wrong result when the input ends with single full stop.

I will look it. I guess ICU4X bug. Where is is_word_like test? I would like to import/convert it for ICU4X. https://github.com/tc39/test262/blob/main/test/intl402/Segmenter/prototype/segment/containing/iswordlike.js?

SentenceBreakIterator seems to have issues with emojis.

Ah, although this is SB11 rules, this seems a bug of segmenter rule. I will fix it in ICU4X.

Flags: needinfo?(m_kato)

André Bargull [:anba]

Assignee

Comment 41

•

10 months ago

(In reply to Makoto Kato [:m_kato] from comment #40)

WordBreakIterator::is_word_like() seems to return a wrong result when the input ends with single full stop.

I will look it. I guess ICU4X bug. Where is is_word_like test? I would like to import/convert it for ICU4X. https://github.com/tc39/test262/blob/main/test/intl402/Segmenter/prototype/segment/containing/iswordlike.js?

The tests are in https://phabricator.services.mozilla.com/D186428. The word-segmenter test is "js/src/tests/non262/Intl/Segmenter/word.js", it has this commented-out assertion which succeeds in ICU4C, but fails in ICU4X.

// The non-word parts in the samples are either punctuators or space separators.
// assertEq(obj.isWordLike, !/^(\p{gc=P}|\p{gc=Zs})+$/u.test(obj.segment), obj.segment);

SentenceBreakIterator seems to have issues with emojis.

Ah, although this is SB11 rules, this seems a bug of segmenter rule. I will fix it in ICU4X.

Thanks!

Makoto Kato [:m_kato]

Comment 42

•

10 months ago

comment #39 will be fixed by ICU4X 1.3

Henri Sivonen (:hsivonen) (on leave)

Comment 43

•

8 months ago

Do I read the patches correctly that supportedLocalesOf responds based on what locales are explicitly known to other Intl objects?

(I don't have a good suggestion of how supportedLocalesOf for Intl.Segmenter should behave when the back end ignores the locale and in that sense "supported locale" isn't a good conceptual match for the back end. Chrome seems to have particularly involved way of coming up with supported/not-supported claims for its Intl.Segmenter, which while locale-sensitive, seems to be only minimally so. (Chrome seems to report sv-FI as supported but fi-SE as unsupported, which makes no sense considering what the ICU4C segmenter data contains, as it contains a common exception for fi and sv together ignoring the region for both, but instead seems to be based on knowledge that sv-FI has more status as a well-known thing in non-segmentation manner than fi-SE.) It seems a bit sad if for Web compat supportedLocalesOf becomes or, considering what Chrome does, has already become a particularly complex thing that has little to do with how the back end behaves.)

Flags: needinfo?(andrebargull)

André Bargull [:anba]

Assignee

Comment 44

•

8 months ago

Yes, the supportedLocalesOf implementation uses the set of locales supported by ICU4C (as determined by uloc_getAvailable). This matches the implementation in both V8 and JSC. This also explains why Intl.Segmenter.supportedLocalesOf(["sv-FI", "fi-SE"]) returns ["sv-FI", "fi-SE"] in both engines:

sv-FI is supported, because ICU4C contains (general) localisation support for sv-FI (https://github.com/unicode-org/icu/blob/main/icu4c/source/data/locales/sv_FI.txt), and therefore new Intl.Segmenter("sv-FI").resolvedOptions().locale === "sv-FI".
fi-SE is supported through the fallback to the parent locale fi (https://github.com/unicode-org/icu/blob/main/icu4c/source/data/locales/fi.txt), and therefore new Intl.Segmenter("fi-SE").resolvedOptions().locale === "fi".

Flags: needinfo?(andrebargull)

Dan Minor [:dminor]

Comment 45

•

7 months ago

Hi Anba, I was wondering if we are still blocked on this. ICU4X 1.3 has been released (https://github.com/unicode-org/icu4x/tree/release/1.3) although we haven't updated our in-tree version. Are we waiting on anything else to continue this work?

I can take a look at doing the ICU4X 1.3 update if that would help. Or do the update and also rebase the patches here if you're busy with other things :)

Flags: needinfo?(andrebargull)

Dan Minor [:dminor]

Updated

•

7 months ago

Depends on: 1867586

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Comment 46

•

7 months ago

Makoto is updating ICU4X segmenter to latest 1.4 in bug 1854031.

Depends on: 1854031
No longer depends on: 1867586

André Bargull [:anba]

Assignee

Comment 47

•

6 months ago

(In reply to Dan Minor [:dminor] from comment #45)

Are we waiting on anything else to continue this work?

No, we're just waiting for the ICU4X segmenter update to land.

Flags: needinfo?(andrebargull)

Phabricator Automation

Updated

•

6 months ago

Assignee: nobody → andrebargull

Attachment #9349349 - Attachment description: WIP: Bug 1423593 - Part 1: Implement standard Intl constructor code for Intl.Segmenter. → Bug 1423593 - Part 1: Implement standard Intl constructor code for Intl.Segmenter. r=dminor!

Status: NEW → ASSIGNED

Phabricator Automation

Updated

•

6 months ago

Attachment #9349350 - Attachment description: WIP: Bug 1423593 - Part 2: Implement Segmenter.prototype.segment and the SegmentsObject. → Bug 1423593 - Part 2: Implement Segmenter.prototype.segment and the SegmentsObject. r=dminor!

Phabricator Automation

Updated

•

6 months ago

Attachment #9349351 - Attachment description: WIP: Bug 1423593 - Part 3: Implement Segment Iterator objects. → Bug 1423593 - Part 3: Implement Segment Iterator objects. r=dminor!

Phabricator Automation

Updated

•

6 months ago

Attachment #9349352 - Attachment description: WIP: Bug 1423593 - Part 4: Add tests for Intl.Segmenter. → Bug 1423593 - Part 4: Add tests for Intl.Segmenter. r=dminor!

Phabricator Automation

Updated

•

6 months ago

Attachment #9349353 - Attachment description: WIP: Bug 1423593 - Part 5: Enable test262 tests. → Bug 1423593 - Part 5: Enable test262 tests. r=dminor!

Phabricator Automation

Updated

•

6 months ago

Attachment #9349354 - Attachment description: WIP: Bug 1423593 - Part 6: Reimport test262. → Bug 1423593 - Part 6: Reimport test262. r=dminor!

André Bargull [:anba]

Assignee

Comment 48

•

6 months ago

Attached file Bug 1423593 - Part 7: Cache internal segmenters and break iterators. r=dminor! — Details

Recreating the ICU4X segmenters and break iterators is too slow when compared
to other engines, so we have to cache the ICU4X object for better performance.

Depends on D186430

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 49

•

6 months ago

@anba - do you have any rough estimates as to how much caching (part 7) saves? How does pre/post compares to other engines?

Flags: needinfo?(andrebargull)

André Bargull [:anba]

Assignee

Comment 50

•

6 months ago

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #49)

@anba - do you have any rough estimates as to how much caching (part 7) saves? How does pre/post compares to other engines?

Grapheme segmentation over the string "The quick brown fox jumps over the lazy dog.".

Engine	Factor
SM (base)	1
SM (cache)	~5x faster
JSC	~2.5 faster
V8	~1.5 slower (!)

Word segmentation over the string "The quick brown fox jumps over the lazy dog.".

Engine	Factor
SM (base)	1
SM (cache)	~7x faster
JSC	~3.5 faster
V8	1 (!)

There's an even larger improvement for larger strings, because we're basically comparing a linear in string length against a quadratic in string length solution. (Quadratic because we always restart from index zero.)

I don't know why V8 is so slow when compared to JSC, maybe they're doing something wrong when calling into ICU?

var seg = new Intl.Segmenter("en", {granularity:"grapheme"})
var str = "The quick brown fox jumps over the lazy dog.";

var r = 0;
var t = performance.now();
for (var i = 0; i < 100_000; ++i) {
  r += [...seg.segment(str)].length;
}
console.log(performance.now() - t, r);

Flags: needinfo?(andrebargull)

Pulsebot

Comment 51

•

6 months ago

Pushed by andre.bargull@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/5fb85ed28b12
Part 1: Implement standard Intl constructor code for Intl.Segmenter. r=dminor
https://hg.mozilla.org/integration/autoland/rev/ea86c2b391f7
Part 2: Implement Segmenter.prototype.segment and the SegmentsObject. r=dminor
https://hg.mozilla.org/integration/autoland/rev/ef3254d46ab7
Part 3: Implement Segment Iterator objects. r=dminor
https://hg.mozilla.org/integration/autoland/rev/e06dfac91b17
Part 4: Add tests for Intl.Segmenter. r=dminor
https://hg.mozilla.org/integration/autoland/rev/ec5883dc144d
Part 5: Enable test262 tests. r=dminor
https://hg.mozilla.org/integration/autoland/rev/3308537ba48a
Part 6: Reimport test262. r=dminor
https://hg.mozilla.org/integration/autoland/rev/4b401e7231ca
Part 7: Cache internal segmenters and break iterators. r=dminor

Cosmin Sabou [:CosminS]

Comment 52

•

6 months ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/5fb85ed28b12
https://hg.mozilla.org/mozilla-central/rev/ea86c2b391f7
https://hg.mozilla.org/mozilla-central/rev/ef3254d46ab7
https://hg.mozilla.org/mozilla-central/rev/e06dfac91b17
https://hg.mozilla.org/mozilla-central/rev/ec5883dc144d
https://hg.mozilla.org/mozilla-central/rev/3308537ba48a
https://hg.mozilla.org/mozilla-central/rev/4b401e7231ca

Status: ASSIGNED → RESOLVED

Closed: 6 months ago

status-firefox122: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 122 Branch

Serban Stanca [:SerbanS]

Updated

•

6 months ago

Regressions: 1869755

Donal Meehan [:dmeehan]

Comment 53

•

6 months ago

:anba could you consider nominating this for a release note? (Process info)

status-firefox59: affected → wontfix

Flags: needinfo?(andrebargull)

André Bargull [:anba]

Assignee

Comment 54

•

6 months ago

Release Note Request (optional, but appreciated)
[Why is this notable]: A new web platform feature
[Affects Firefox for Android]: Yes
[Suggested wording]: Add support for Unicode text segmentation to JavaScript. Currently only enabled for Nightly builds.
[Links (documentation, blog post, etc)]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter

relnote-firefox: --- → ?

Flags: needinfo?(andrebargull)

Donal Meehan [:dmeehan]

Comment 55

•

6 months ago

Thanks, added to the Fx122 nightly release notes, please allow 30 minutes for the site to update.

relnote-firefox: ? → nightly+

Hamish Willee

Comment 56

•

6 months ago

FF122 docs work for this can be tracked in https://github.com/mdn/content/issues/31098. The API is already documented, so this was just an MDN release note and update to browser compatibility data.

Keywords: dev-doc-needed → dev-doc-complete

Vadim Makeev

Comment 57

•

5 months ago

Q: Based on the intl.icu4x.segmenter.enabled set to true in StaticPrefList.yaml, does it mean that Intl.Segmenter API will be shipped in 122 enabled by default?

André Bargull [:anba]

Assignee

Comment 58

•

5 months ago

(In reply to Vadim Makeev from comment #57)

Q: Based on the intl.icu4x.segmenter.enabled set to true in StaticPrefList.yaml, does it mean that Intl.Segmenter API will be shipped in 122 enabled by default?

No, Intl.Segmenter is still restricted to Nightly due to this ifdef.

Henri Sivonen (:hsivonen) (on leave)

Comment 59

•

5 months ago

(In reply to André Bargull [:anba] from comment #58)

(In reply to Vadim Makeev from comment #57)

Q: Based on the intl.icu4x.segmenter.enabled set to true in StaticPrefList.yaml, does it mean that Intl.Segmenter API will be shipped in 122 enabled by default?

No, Intl.Segmenter is still restricted to Nightly due to this ifdef.

What's blocking the removal of the ifdef and letting Intl.Segmenter ride the trains?

Flags: needinfo?(andrebargull)

André Bargull [:anba]

Assignee

Comment 60

•

5 months ago

I didn't want to let it ride the trains right away in case there are any major differences between ICU4C segmentation (which is used by JSC/V8) and ICU4X segmentation which can lead to web-compat issues. Minor differences which are unlikely to cause issues are for example:

We didn't yet receive any bug reports, so we could consider enabling Intl.Segmenter by default for either Firefox 124 or 125, I guess.

Flags: needinfo?(andrebargull)

Tarek Ziadé (:tarek)

Comment 61

•

5 months ago

Have we considered improving the implementation of the segmenter?

it will break on abbrevations when splitting, where most NLP libs don't.

Example with NTLK:

>>> import nltk
>>> nltk.sent_tokenize("""E. T. A. Hoffmann (1776–1822) wrote a short story in 1816 titled Der Sandmann, which showed how sinister such a character could be made. According to the protagonist's nurse, he threw sand in the eyes of children who would not sleep, with the result of those eyes falling out and being collected by the Sandman, who then takes the eyes to his iron nest on the Moon and uses them to feed his children. The protagonist of the story grows to associate this nightmarish creature with the genuinely sinister figure of his father's associate, Coppelius. In Romanian folklore, there is a similar character, Moș Ene (Ene the Elder). Hoffmann's version of the sandman is also similar to the French Canadian character known as the Bonhomme Sept Heures (Goodmaan Seven O’Clock), who, in some versions, throws sand in children's eyes to blind them so that he may capture them. Contrarily to the sandman, his bag is the place where he traps children who do not go to bed.""")

['E. T. A. Hoffmann (1776–1822) wrote a short story in 1816 titled Der Sandmann, which showed how sinister such a character could be made.', "According to the protagonist's nurse, he threw sand in the eyes of children who would not sleep, with the result of those eyes falling out and being collected by the Sandman, who then takes the eyes to his iron nest on the Moon and uses them to feed his children.", "The protagonist of the story grows to associate this nightmarish creature with the genuinely sinister figure of his father's associate, Coppelius.", 'In Romanian folklore, there is a similar character, Moș Ene (Ene the Elder).', "Hoffmann's version of the sandman is also similar to the French Canadian character known as the Bonhomme Sept Heures (Goodmaan Seven O’Clock), who, in some versions, throws sand in children's eyes to blind them so that he may capture them.", 'Contrarily to the sandman, his bag is the place where he traps children who do not go to bed.']

Intl will split at E. T. A.

Flags: needinfo?(andrebargull)

Henri Sivonen (:hsivonen) (on leave)

Comment 62

•

5 months ago

That's an ICU4X-level question. ICU4C suppresses sentence breaks for well-known abbreviations in the best-known languages, but only those. https://github.com/unicode-org/icu4x/issues/3284 is about whether ICU4X should do something like that.

At present ICU4C as used in Chrome treats "E.", "T.", and "A." as sentences, since they aren't well-known abbreviations even though English is one of the best-known languages.

Tarek Ziadé (:tarek)

Comment 63

•

5 months ago

Thanks Henri, will dig in that direction. It breaks for all points as far as I can tell. e.g. "Jr." in English.

André Bargull [:anba]

Assignee

Comment 64

•

5 months ago

In addition to the ICU4X issue linked in comment #62, the following ICU4X and ECMA-402 issues are also related:

https://github.com/unicode-org/icu4x/issues/3927: Add support for sentence break suppressions.
https://github.com/tc39/ecma402/pull/783: Allow sentence break suppressions for Intl.Segmenter.
https://github.com/tc39/ecma402/issues/656: How URLs, email addresses, and acronyms should be handled by Intl.Segmenter.

Flags: needinfo?(andrebargull)

Ryan VanderMeulen [:RyanVM]

Comment 65

•

4 months ago

This has been enabled on Nightly for 3 cycles now. Per our policy, removing this from the Nightly release notes. Feel free to nominate again for train-specific release notes when this is ready to ride the trains to Release.

relnote-firefox: nightly+ → ---

You need to log in before you can comment on or make changes to this bug.