Unify lwbrk LineBreaker and WordBreaker
Categories
(Core :: Internationalization, task, P3)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox96 | --- | fixed |
People
(Reporter: dminor, Assigned: TYLin)
References
(Blocks 1 open bug)
Details
(Whiteboard: [i18n-unification])
Attachments
(2 files, 1 obsolete file)
To support experimentation with ICU4X backed segmentation, we'll need to move the LineBreaker and WordBreaker implementations into intl/components to make them available to standalone SpiderMonkey builds.
Note that there's a function NS_GetComplexLineBreaks [1] that is currently implemented differently for each platform. I don't think we want to unify that. I'm hoping we can use a forward declaration and let the linker find it at link time. We'd then define a no-op implementation for SpiderMonkey builds. The result of this would be that the messy platform specific bits still live in lwbrk, but we've unified the interfaces into intl/components.
We might also want to clean up the API a bit while doing this, e.g. renaming GetJISx4051Breaks to something more generic, and making sure the WordBreaker API is a good match for Intl.Segmenter, but this could also be done through follow ups.
For reference, here is Makoto's work integrating ICU4X as a Line Breaker: https://treeherder.mozilla.org/jobs?repo=try&revision=34ea1b318db5109891dac88f8adf479e144ec9dd
| Assignee | ||
Updated•4 years ago
|
| Assignee | ||
Comment 1•4 years ago
|
||
I take a look at lwbrk LineBreaker and WordBreaker's public APIs. Here my summary for their purpose, and my initial thought for them.
Next(): Take a string, and find the next line break opportunity fromaPos.Prev(): Take a string, and find the previous line break opportunity fromaPos.GetJISx4051Breaks: Take a string, and fill line break opportunities intoaBreakBeforearray where1is break and0is not break.
LineBreaker APIs match my expectation, and it should be straightforward to use ICU4X's line breaker to implement Prev and GetJISx4051Breaks, so we should keep them for now. One possible rename is probably changing GetJISx4051Breaks to GetLineBreaks.
BreakInBetween: Take two strings, and see if there is a word break between two strings.FindWord: Take a string and a position, and search forwards and backwards to find the the word boundary. Return a pair of position inWordRange.NextWord: Take a string, and find the next word break opportunity fromaPos.GetClass: GetWordBreakClassfor a char.
We should makes some changes to WordBreaker to align the APIs with line breaker
- See if we can simplify the usage of
BreakInBetweenbecause the only usage is innsFindto test the word break opportunity between two chars. - We should rename
NextWordtoNext. - We should add
Prevto word breaker and use it to implementFindWord. Also, investigate if the callers can just callNext()andPrev()themselves. - There are no calls to
GetClassother than word breaker itself, so it is an implementation detail that doesn't need to be public.
After cleaning up the word breaker, we can create a unified segmenter class in intl/component that can delegate its API implementation to lwbrk for now.
Let me know if the above sounds like a good plan.
Updated•4 years ago
|
Comment 2•4 years ago
|
||
LineBreaker APIs:
Prev(): Take a string, and find the previous line break opportunity from aPos.
LineBreaker API uses text wrapping (text length is 80 character for mail).
Prev() is used by serializer (xml and text) and quote wrap. But I guess that we can avoid Prev() by rewriting serializer, Or we provide simple text character width wrap API.
| Comment hidden (obsolete) |
Updated•4 years ago
|
| Comment hidden (obsolete) |
| Assignee | ||
Comment 5•4 years ago
|
||
I start working on this bug. Here are some thoughts of what I'd like to do in bug:
-
Design a factory class
Segmenter.Segmenter::Segment(text)should create an iterator of given granularity for JS to implement
https://github.com/tc39/proposal-intl-segmenter -
The iterator base class should be similar to Rust iterator with only a
Next()method. https://doc.rust-lang.org/std/iter/trait.Iterator.html
For layout and DOM usages, they can create iterators directly without going through Segmenter factory. Each iterator is backed by lwbrk breaker's methods, but we'll eventually switch to ICU4X break iterators.
The only two APIs lwbrk provided other than Next() are LineBreaker::GetJISx4051Breaks() and WordBreak::FindWord(). The new iterator should continue to support these two APIs to avoid rewriting the caller. It shouldn't be hard to rewrite these two API using the new ICU4X breaker's Next(). We can also rewrite the two APIs using lwbrk Next() to help the future ICU4X integration if we need to.
- Implement break iterators for UTF-16 and Latin1 text. I assume JS Intl::Segmenter needs both encodings per this old patch integrating ICU4C breaker.
Optional (can be done in this bug or follow-ups):
- Switch existing caller of
LineBreakorWordBreakto new iterator interfaces.
| Assignee | ||
Comment 6•4 years ago
|
||
| Assignee | ||
Comment 7•4 years ago
|
||
Comment on attachment 9247151 [details]
Bug 1722484 Part 1 - Introduce mozilla::intl::Segmenter and break iterators.
Before I dive too deep, I'd like your feedback on comment 5 and the WIP patch. Feel free to cancel the NI if they look reasonable.
Here is the question for the WIP where I seek feedback:
- Does the classes hierarchy and relationship of
Segmenterand each iterator looks reasonable? (UTF16 and Latin1 iterator may look duplicate, but I avoid C++ template to hide the implementation in cpp) - Did I miss any convention that are used by other Intl components? I've noticed
TryCreatefor construction a component. Anything else? (Dan, this question is for you.)
| Reporter | ||
Comment 8•4 years ago
|
||
Comment on attachment 9247151 [details]
Bug 1722484 Part 1 - Introduce mozilla::intl::Segmenter and break iterators.
This looks good so far to me!
Comment 9•4 years ago
|
||
Comment on attachment 9247151 [details]
Bug 1722484 Part 1 - Introduce mozilla::intl::Segmenter and break iterators.
No objection for this change. I already comment another issue in bug 1736938.
Updated•4 years ago
|
| Assignee | ||
Comment 10•4 years ago
|
||
Comment on attachment 9247151 [details]
Bug 1722484 Part 1 - Introduce mozilla::intl::Segmenter and break iterators.
Thank you for the feedback :)
Updated•4 years ago
|
| Assignee | ||
Comment 11•4 years ago
|
||
With the rewrite, we reduce the dependency of lwbrk LineBreaker::Next() and
WordBreaker::Next(), and eliminate some int32_t and uint32_t conversions.
Depends on D129193
| Assignee | ||
Comment 12•4 years ago
|
||
Dan, I failed to find a way to add the new segmenter interface under intl/component as you envisioned in comment 0 because segmenter is still backed by lwbrk that uses xpcom APIs.
Here is the try showing the build error of segmenter added under intl/component https://treeherder.mozilla.org/jobs?repo=try&revision=ce8e1001315a59675f07f0ad7e816fe90a5372fd. Specifically, the our CI builds with ac_add_options --enable-js-shell, and js shell is statically linked to js component, which utilizes intlcomponent. I've tried to exclude segmenter source from standalone js build, but it's not helpful when doing a full Firefox build with ac_add_options --enable-js-shell. In this case we include segmenter source in intlcomponet, but lwbrk is not available when linking the js shell.
Comment 13•4 years ago
|
||
Comment 14•4 years ago
|
||
Backed out for causing multiple build bustages.
Failure log for android build bustages
Affected android builds: android-4-1-x86 opt, android-5-0-x86 opt, android-5-0-armv7 opt, android-4-1-x86 debug, android-5-0-armv7 debug, Android 4.1 opt, Android 4.1 debug.
Failure log for Linux x64 debug build bustage
| Assignee | ||
Updated•4 years ago
|
| Assignee | ||
Comment 15•4 years ago
|
||
The latest revision should fix the build errors.
Comment 16•4 years ago
|
||
Comment 17•4 years ago
|
||
| bugherder | ||
https://hg.mozilla.org/mozilla-central/rev/ca73c9a05e98
https://hg.mozilla.org/mozilla-central/rev/cd8829982ddf
Description
•