Closed Bug 1722484 Opened 1 year ago Closed 9 months ago

Unify lwbrk LineBreaker and WordBreaker

Categories

(Core :: Internationalization, task, P3)

task

Tracking

()

RESOLVED FIXED
96 Branch
Tracking Status
firefox96 --- fixed

People

(Reporter: dminor, Assigned: TYLin)

References

(Blocks 1 open bug)

Details

(Whiteboard: [i18n-unification])

Attachments

(2 files, 1 obsolete file)

To support experimentation with ICU4X backed segmentation, we'll need to move the LineBreaker and WordBreaker implementations into intl/components to make them available to standalone SpiderMonkey builds.

Note that there's a function NS_GetComplexLineBreaks [1] that is currently implemented differently for each platform. I don't think we want to unify that. I'm hoping we can use a forward declaration and let the linker find it at link time. We'd then define a no-op implementation for SpiderMonkey builds. The result of this would be that the messy platform specific bits still live in lwbrk, but we've unified the interfaces into intl/components.

We might also want to clean up the API a bit while doing this, e.g. renaming GetJISx4051Breaks to something more generic, and making sure the WordBreaker API is a good match for Intl.Segmenter, but this could also be done through follow ups.

For reference, here is Makoto's work integrating ICU4X as a Line Breaker: https://treeherder.mozilla.org/jobs?repo=try&revision=34ea1b318db5109891dac88f8adf479e144ec9dd

[1] https://searchfox.org/mozilla-central/search?q=symbol:_Z23NS_GetComplexLineBreaksPKDsjPh&redirect=false

Flags: needinfo?(aethanyc)

I take a look at lwbrk LineBreaker and WordBreaker's public APIs. Here my summary for their purpose, and my initial thought for them.

LineBreaker APIs:

  1. Next(): Take a string, and find the next line break opportunity from aPos.
  2. Prev(): Take a string, and find the previous line break opportunity from aPos.
  3. GetJISx4051Breaks: Take a string, and fill line break opportunities into aBreakBefore array where 1 is break and 0 is not break.

LineBreaker APIs match my expectation, and it should be straightforward to use ICU4X's line breaker to implement Prev and GetJISx4051Breaks, so we should keep them for now. One possible rename is probably changing GetJISx4051Breaks to GetLineBreaks.

WordBreak APIs:

  1. BreakInBetween: Take two strings, and see if there is a word break between two strings.
  2. FindWord: Take a string and a position, and search forwards and backwards to find the the word boundary. Return a pair of position in WordRange.
  3. NextWord: Take a string, and find the next word break opportunity from aPos.
  4. GetClass: Get WordBreakClass for a char.

We should makes some changes to WordBreaker to align the APIs with line breaker

  1. See if we can simplify the usage of BreakInBetween because the only usage is in nsFind to test the word break opportunity between two chars.
  2. We should rename NextWord to Next.
  3. We should add Prev to word breaker and use it to implement FindWord. Also, investigate if the callers can just call Next() and Prev() themselves.
  4. There are no calls to GetClass other than word breaker itself, so it is an implementation detail that doesn't need to be public.

After cleaning up the word breaker, we can create a unified segmenter class in intl/component that can delegate its API implementation to lwbrk for now.

Let me know if the above sounds like a good plan.

Whiteboard: [i18n-unification]

LineBreaker APIs:

Prev(): Take a string, and find the previous line break opportunity from aPos.

LineBreaker API uses text wrapping (text length is 80 character for mail).

Prev() is used by serializer (xml and text) and quote wrap. But I guess that we can avoid Prev() by rewriting serializer, Or we provide simple text character width wrap API.

Depends on: 1728708
Blocks: segmenter
Depends on: 1730084
Assignee: nobody → aethanyc
Status: NEW → ASSIGNED
Depends on: 1733009
Depends on: 1736938

I start working on this bug. Here are some thoughts of what I'd like to do in bug:

  1. Design a factory class Segmenter. Segmenter::Segment(text) should create an iterator of given granularity for JS to implement
    https://github.com/tc39/proposal-intl-segmenter

  2. The iterator base class should be similar to Rust iterator with only a Next() method. https://doc.rust-lang.org/std/iter/trait.Iterator.html

For layout and DOM usages, they can create iterators directly without going through Segmenter factory. Each iterator is backed by lwbrk breaker's methods, but we'll eventually switch to ICU4X break iterators.

The only two APIs lwbrk provided other than Next() are LineBreaker::GetJISx4051Breaks() and WordBreak::FindWord(). The new iterator should continue to support these two APIs to avoid rewriting the caller. It shouldn't be hard to rewrite these two API using the new ICU4X breaker's Next(). We can also rewrite the two APIs using lwbrk Next() to help the future ICU4X integration if we need to.

  1. Implement break iterators for UTF-16 and Latin1 text. I assume JS Intl::Segmenter needs both encodings per this old patch integrating ICU4C breaker.

Optional (can be done in this bug or follow-ups):

  1. Switch existing caller of LineBreak or WordBreak to new iterator interfaces.
Flags: needinfo?(aethanyc)

Comment on attachment 9247151 [details]
Bug 1722484 Part 1 - Introduce mozilla::intl::Segmenter and break iterators.

Before I dive too deep, I'd like your feedback on comment 5 and the WIP patch. Feel free to cancel the NI if they look reasonable.

Here is the question for the WIP where I seek feedback:

  1. Does the classes hierarchy and relationship of Segmenter and each iterator looks reasonable? (UTF16 and Latin1 iterator may look duplicate, but I avoid C++ template to hide the implementation in cpp)
  2. Did I miss any convention that are used by other Intl components? I've noticed TryCreate for construction a component. Anything else? (Dan, this question is for you.)
Attachment #9247151 - Flags: feedback?(m_kato)
Attachment #9247151 - Flags: feedback?(jfkthame)
Attachment #9247151 - Flags: feedback?(dminor)

Comment on attachment 9247151 [details]
Bug 1722484 Part 1 - Introduce mozilla::intl::Segmenter and break iterators.

This looks good so far to me!

Attachment #9247151 - Flags: feedback?(dminor) → feedback+

Comment on attachment 9247151 [details]
Bug 1722484 Part 1 - Introduce mozilla::intl::Segmenter and break iterators.

No objection for this change. I already comment another issue in bug 1736938.

Attachment #9247151 - Flags: feedback?(m_kato) → feedback+
Severity: -- → S3

Comment on attachment 9247151 [details]
Bug 1722484 Part 1 - Introduce mozilla::intl::Segmenter and break iterators.

Thank you for the feedback :)

Attachment #9247151 - Flags: feedback?(jfkthame)
Attachment #9247151 - Attachment description: WIP: Bug 1722484 - Introduce mozilla::intl::Segmenter and break iterators. → Bug 1722484 Part 1 - Introduce mozilla::intl::Segmenter and break iterators.

With the rewrite, we reduce the dependency of lwbrk LineBreaker::Next() and
WordBreaker::Next(), and eliminate some int32_t and uint32_t conversions.

Depends on D129193

Dan, I failed to find a way to add the new segmenter interface under intl/component as you envisioned in comment 0 because segmenter is still backed by lwbrk that uses xpcom APIs.

Here is the try showing the build error of segmenter added under intl/component https://treeherder.mozilla.org/jobs?repo=try&revision=ce8e1001315a59675f07f0ad7e816fe90a5372fd. Specifically, the our CI builds with ac_add_options --enable-js-shell, and js shell is statically linked to js component, which utilizes intlcomponent. I've tried to exclude segmenter source from standalone js build, but it's not helpful when doing a full Firefox build with ac_add_options --enable-js-shell. In this case we include segmenter source in intlcomponet, but lwbrk is not available when linking the js shell.

Pushed by aethanyc@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/e676fa1a0cb7
Part 1 - Introduce mozilla::intl::Segmenter and break iterators. r=m_kato,dminor
https://hg.mozilla.org/integration/autoland/rev/bef547b588ff
Part 2 - Replace LineBreaker::Next() and WordBreaker::Next() with the new iterators. r=m_kato

Backed out for causing multiple build bustages.

Push with failures

Failure log for android build bustages
Affected android builds: android-4-1-x86 opt, android-5-0-x86 opt, android-5-0-armv7 opt, android-4-1-x86 debug, android-5-0-armv7 debug, Android 4.1 opt, Android 4.1 debug.
Failure log for Linux x64 debug build bustage

Backout link

Flags: needinfo?(aethanyc)
Flags: needinfo?(aethanyc)

The latest revision should fix the build errors.

Pushed by aethanyc@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/ca73c9a05e98
Part 1 - Introduce mozilla::intl::Segmenter and break iterators. r=m_kato,dminor
https://hg.mozilla.org/integration/autoland/rev/cd8829982ddf
Part 2 - Replace LineBreaker::Next() and WordBreaker::Next() with the new iterators. r=m_kato
Status: ASSIGNED → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → 96 Branch
Blocks: 1750316
You need to log in before you can comment on or make changes to this bug.