Unify lwbrk LineBreaker and WordBreaker
Categories
(Core :: Internationalization, task, P3)
Tracking
()
Tracking | Status | |
---|---|---|
firefox96 | --- | fixed |
People
(Reporter: dminor, Assigned: TYLin)
References
(Blocks 1 open bug)
Details
(Whiteboard: [i18n-unification])
Attachments
(2 files, 1 obsolete file)
To support experimentation with ICU4X backed segmentation, we'll need to move the LineBreaker and WordBreaker implementations into intl/components to make them available to standalone SpiderMonkey builds.
Note that there's a function NS_GetComplexLineBreaks [1] that is currently implemented differently for each platform. I don't think we want to unify that. I'm hoping we can use a forward declaration and let the linker find it at link time. We'd then define a no-op
implementation for SpiderMonkey builds. The result of this would be that the messy platform specific bits still live in lwbrk, but we've unified the interfaces into intl/components.
We might also want to clean up the API a bit while doing this, e.g. renaming GetJISx4051Breaks to something more generic, and making sure the WordBreaker API is a good match for Intl.Segmenter, but this could also be done through follow ups.
For reference, here is Makoto's work integrating ICU4X as a Line Breaker: https://treeherder.mozilla.org/jobs?repo=try&revision=34ea1b318db5109891dac88f8adf479e144ec9dd
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 1•3 years ago
|
||
I take a look at lwbrk LineBreaker and WordBreaker's public APIs. Here my summary for their purpose, and my initial thought for them.
Next()
: Take a string, and find the next line break opportunity fromaPos
.Prev()
: Take a string, and find the previous line break opportunity fromaPos
.GetJISx4051Breaks
: Take a string, and fill line break opportunities intoaBreakBefore
array where1
is break and0
is not break.
LineBreaker
APIs match my expectation, and it should be straightforward to use ICU4X's line breaker to implement Prev
and GetJISx4051Breaks
, so we should keep them for now. One possible rename is probably changing GetJISx4051Breaks
to GetLineBreaks
.
BreakInBetween
: Take two strings, and see if there is a word break between two strings.FindWord
: Take a string and a position, and search forwards and backwards to find the the word boundary. Return a pair of position inWordRange
.NextWord
: Take a string, and find the next word break opportunity fromaPos
.GetClass
: GetWordBreakClass
for a char.
We should makes some changes to WordBreaker
to align the APIs with line breaker
- See if we can simplify the usage of
BreakInBetween
because the only usage is innsFind
to test the word break opportunity between two chars. - We should rename
NextWord
toNext
. - We should add
Prev
to word breaker and use it to implementFindWord
. Also, investigate if the callers can just callNext()
andPrev()
themselves. - There are no calls to
GetClass
other than word breaker itself, so it is an implementation detail that doesn't need to be public.
After cleaning up the word breaker, we can create a unified segmenter
class in intl/component
that can delegate its API implementation to lwbrk for now.
Let me know if the above sounds like a good plan.
Updated•3 years ago
|
Comment 2•3 years ago
|
||
LineBreaker APIs:
Prev(): Take a string, and find the previous line break opportunity from aPos.
LineBreaker API uses text wrapping (text length is 80 character for mail).
Prev()
is used by serializer (xml and text) and quote wrap. But I guess that we can avoid Prev()
by rewriting serializer, Or we provide simple text character width wrap API.
Comment hidden (obsolete) |
Updated•3 years ago
|
Comment hidden (obsolete) |
Assignee | ||
Comment 5•3 years ago
|
||
I start working on this bug. Here are some thoughts of what I'd like to do in bug:
-
Design a factory class
Segmenter
.Segmenter::Segment(text)
should create an iterator of given granularity for JS to implement
https://github.com/tc39/proposal-intl-segmenter -
The iterator base class should be similar to Rust iterator with only a
Next()
method. https://doc.rust-lang.org/std/iter/trait.Iterator.html
For layout and DOM usages, they can create iterators directly without going through Segmenter
factory. Each iterator is backed by lwbrk breaker's methods, but we'll eventually switch to ICU4X break iterators.
The only two APIs lwbrk provided other than Next()
are LineBreaker::GetJISx4051Breaks()
and WordBreak::FindWord()
. The new iterator should continue to support these two APIs to avoid rewriting the caller. It shouldn't be hard to rewrite these two API using the new ICU4X breaker's Next()
. We can also rewrite the two APIs using lwbrk Next()
to help the future ICU4X integration if we need to.
- Implement break iterators for UTF-16 and Latin1 text. I assume JS Intl::Segmenter needs both encodings per this old patch integrating ICU4C breaker.
Optional (can be done in this bug or follow-ups):
- Switch existing caller of
LineBreak
orWordBreak
to new iterator interfaces.
Assignee | ||
Comment 6•3 years ago
|
||
Assignee | ||
Comment 7•3 years ago
|
||
Comment on attachment 9247151 [details]
Bug 1722484 Part 1 - Introduce mozilla::intl::Segmenter and break iterators.
Before I dive too deep, I'd like your feedback on comment 5 and the WIP patch. Feel free to cancel the NI if they look reasonable.
Here is the question for the WIP where I seek feedback:
- Does the classes hierarchy and relationship of
Segmenter
and each iterator looks reasonable? (UTF16 and Latin1 iterator may look duplicate, but I avoid C++ template to hide the implementation in cpp) - Did I miss any convention that are used by other Intl components? I've noticed
TryCreate
for construction a component. Anything else? (Dan, this question is for you.)
Reporter | ||
Comment 8•3 years ago
|
||
Comment on attachment 9247151 [details]
Bug 1722484 Part 1 - Introduce mozilla::intl::Segmenter and break iterators.
This looks good so far to me!
Comment 9•3 years ago
|
||
Comment on attachment 9247151 [details]
Bug 1722484 Part 1 - Introduce mozilla::intl::Segmenter and break iterators.
No objection for this change. I already comment another issue in bug 1736938.
Updated•3 years ago
|
Assignee | ||
Comment 10•3 years ago
|
||
Comment on attachment 9247151 [details]
Bug 1722484 Part 1 - Introduce mozilla::intl::Segmenter and break iterators.
Thank you for the feedback :)
Updated•3 years ago
|
Assignee | ||
Comment 11•3 years ago
|
||
With the rewrite, we reduce the dependency of lwbrk LineBreaker::Next() and
WordBreaker::Next(), and eliminate some int32_t and uint32_t conversions.
Depends on D129193
Assignee | ||
Comment 12•3 years ago
|
||
Dan, I failed to find a way to add the new segmenter interface under intl/component
as you envisioned in comment 0 because segmenter is still backed by lwbrk that uses xpcom APIs.
Here is the try showing the build error of segmenter added under intl/component
https://treeherder.mozilla.org/jobs?repo=try&revision=ce8e1001315a59675f07f0ad7e816fe90a5372fd. Specifically, the our CI builds with ac_add_options --enable-js-shell
, and js shell is statically linked to js component, which utilizes intlcomponent. I've tried to exclude segmenter source from standalone js build, but it's not helpful when doing a full Firefox build with ac_add_options --enable-js-shell
. In this case we include segmenter source in intlcomponet
, but lwbrk is not available when linking the js shell.
Comment 13•3 years ago
|
||
Comment 14•3 years ago
|
||
Backed out for causing multiple build bustages.
Failure log for android build bustages
Affected android builds: android-4-1-x86 opt, android-5-0-x86 opt, android-5-0-armv7 opt, android-4-1-x86 debug, android-5-0-armv7 debug, Android 4.1 opt, Android 4.1 debug.
Failure log for Linux x64 debug build bustage
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 15•3 years ago
|
||
The latest revision should fix the build errors.
Comment 16•3 years ago
|
||
Comment 17•3 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/ca73c9a05e98
https://hg.mozilla.org/mozilla-central/rev/cd8829982ddf
Description
•