Open Bug 1722484 Opened 3 months ago Updated 19 days ago

Unify lwbrk LineBreaker and WordBreaker

Categories

(Core :: Internationalization, task, P3)

Tracking

()

ASSIGNED

People

(Reporter: dminor, Assigned: TYLin, NeedInfo)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [i18n-unification])

Attachments

(1 obsolete file)

To support experimentation with ICU4X backed segmentation, we'll need to move the LineBreaker and WordBreaker implementations into intl/components to make them available to standalone SpiderMonkey builds.

Note that there's a function NS_GetComplexLineBreaks [1] that is currently implemented differently for each platform. I don't think we want to unify that. I'm hoping we can use a forward declaration and let the linker resolve it at link time. We'd then define a no-op implementation for SpiderMonkey builds. The result would be that the messy platform-specific bits still live in lwbrk, but the interfaces are unified in intl/components.
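The forward-declaration idea could look roughly like the sketch below. The parameter types are read off the mangled symbol in [1]; the void return type, the parameter names, and the helper ComplexBreaksFor are assumptions for illustration only. In a real tree the no-op definition would live in its own file and be compiled only for standalone SpiderMonkey builds.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Forward declaration only: no lwbrk header is included, so the linker
// picks whichever definition the build provides.
// Parameter types follow the mangled symbol in [1]; the return type and
// parameter names are assumptions.
void NS_GetComplexLineBreaks(const char16_t* aText, uint32_t aLength,
                             uint8_t* aBreakBefore);

// Hypothetical caller that would live in intl/components.
std::vector<uint8_t> ComplexBreaksFor(const std::u16string& aText) {
  // Zero-initialized, so a no-op implementation reports "no breaks".
  std::vector<uint8_t> breakBefore(aText.size(), 0);
  NS_GetComplexLineBreaks(aText.data(),
                          static_cast<uint32_t>(aText.size()),
                          breakBefore.data());
  return breakBefore;
}

// No-op definition for standalone builds. Gecko builds would link the
// platform-specific lwbrk definition instead of this one.
void NS_GetComplexLineBreaks(const char16_t*, uint32_t, uint8_t*) {}
```

With the no-op linked in, callers simply see no complex line break opportunities, which keeps the interface in intl/components identical across both kinds of build.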

We might also want to clean up the API a bit while doing this, e.g. renaming GetJISx4051Breaks to something more generic and making sure the WordBreaker API is a good match for Intl.Segmenter, but this could also be done in follow-ups.

For reference, here is Makoto's work integrating ICU4X as a Line Breaker: https://treeherder.mozilla.org/jobs?repo=try&revision=34ea1b318db5109891dac88f8adf479e144ec9dd

[1] https://searchfox.org/mozilla-central/search?q=symbol:_Z23NS_GetComplexLineBreaksPKDsjPh&redirect=false

Flags: needinfo?(aethanyc)

I took a look at the public APIs of lwbrk's LineBreaker and WordBreaker. Here is my summary of their purpose, along with my initial thoughts on them.

LineBreaker APIs:

  1. Next(): Takes a string and finds the next line break opportunity from aPos.
  2. Prev(): Takes a string and finds the previous line break opportunity from aPos.
  3. GetJISx4051Breaks: Takes a string and fills the aBreakBefore array with line break opportunities, where 1 marks a break and 0 marks no break.
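To illustrate how the three entry points relate, here is a minimal sketch in which Next and Prev are derived from a break-before array. Everything in it is hypothetical: the Toy* names, the kNoBreak sentinel, and the breaking rule (a break is allowed before a non-space character that follows an ASCII space) are stand-ins, not the JIS X 4051 logic in lwbrk.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sentinel for "no break opportunity found".
constexpr int32_t kNoBreak = -1;

// Toy stand-in for GetJISx4051Breaks: entry i is 1 if a line break is
// allowed before aText[i], and 0 otherwise.
std::vector<uint8_t> ToyGetLineBreaks(const std::u16string& aText) {
  std::vector<uint8_t> breakBefore(aText.size(), 0);
  for (size_t i = 1; i < aText.size(); ++i) {
    if (aText[i - 1] == u' ' && aText[i] != u' ') {
      breakBefore[i] = 1;
    }
  }
  return breakBefore;
}

// First break opportunity strictly after aPos, or kNoBreak.
int32_t ToyNext(const std::u16string& aText, uint32_t aPos) {
  const std::vector<uint8_t> breaks = ToyGetLineBreaks(aText);
  for (size_t i = static_cast<size_t>(aPos) + 1; i < breaks.size(); ++i) {
    if (breaks[i]) return static_cast<int32_t>(i);
  }
  return kNoBreak;
}

// Last break opportunity strictly before aPos, or kNoBreak.
int32_t ToyPrev(const std::u16string& aText, uint32_t aPos) {
  const std::vector<uint8_t> breaks = ToyGetLineBreaks(aText);
  for (size_t i = std::min<size_t>(aPos, breaks.size()); i-- > 0;) {
    if (breaks[i]) return static_cast<int32_t>(i);
  }
  return kNoBreak;
}
```

A real implementation would not recompute the whole array on every call; the sketch only shows that the array form carries enough information to answer both directional queries.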

The LineBreaker APIs match my expectations, and it should be straightforward to use ICU4X's line breaker to implement Prev and GetJISx4051Breaks, so we should keep them for now. One possible rename is changing GetJISx4051Breaks to GetLineBreaks.

WordBreaker APIs:

  1. BreakInBetween: Takes two strings and checks whether there is a word break between them.
  2. FindWord: Takes a string and a position, and searches forwards and backwards to find the word boundaries. Returns the pair of positions in a WordRange.
  3. NextWord: Takes a string and finds the next word break opportunity from aPos.
  4. GetClass: Gets the WordBreakClass for a char.

We should make some changes to WordBreaker to align its APIs with the line breaker:

  1. See if we can simplify the usage of BreakInBetween because the only usage is in nsFind to test the word break opportunity between two chars.
  2. We should rename NextWord to Next.
  3. We should add Prev to word breaker and use it to implement FindWord. Also, investigate if the callers can just call Next() and Prev() themselves.
  4. There are no calls to GetClass other than word breaker itself, so it is an implementation detail that doesn't need to be public.
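As a rough illustration of item 3, FindWord can be expressed in terms of Next and Prev. The sketch below is hypothetical throughout: the Toy* names, the WordRange field names, and the boundary rule (a boundary wherever ASCII space and non-space characters meet) are assumptions and do not reflect the WordBreakClass logic in lwbrk.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Field names are assumptions for this sketch.
struct WordRange {
  uint32_t mBegin;
  uint32_t mEnd;
};

static bool IsSpace(char16_t aChar) { return aChar == u' '; }

// Toy rule: the text edges are boundaries, and so is any position where
// a space meets a non-space character.
static bool IsBoundary(const std::u16string& aText, size_t aPos) {
  if (aPos == 0 || aPos >= aText.size()) return true;
  return IsSpace(aText[aPos - 1]) != IsSpace(aText[aPos]);
}

// First boundary strictly after aPos, clamped to the text length.
uint32_t ToyNext(const std::u16string& aText, uint32_t aPos) {
  size_t i = static_cast<size_t>(aPos) + 1;
  while (i < aText.size() && !IsBoundary(aText, i)) ++i;
  return static_cast<uint32_t>(std::min(i, aText.size()));
}

// Last boundary strictly before aPos, or 0 if there is none.
uint32_t ToyPrev(const std::u16string& aText, uint32_t aPos) {
  size_t i = std::min<size_t>(aPos, aText.size());
  while (i > 0) {
    --i;
    if (IsBoundary(aText, i)) return static_cast<uint32_t>(i);
  }
  return 0;
}

// FindWord built only from Next and Prev.
WordRange ToyFindWord(const std::u16string& aText, uint32_t aPos) {
  aPos = std::min<uint32_t>(aPos, static_cast<uint32_t>(aText.size()));
  const uint32_t begin =
      IsBoundary(aText, aPos) ? aPos : ToyPrev(aText, aPos);
  return WordRange{begin, ToyNext(aText, aPos)};
}
```

If callers only ever need one side of the range, this also suggests they could call Next() or Prev() directly instead of going through FindWord.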

After cleaning up the word breaker, we can create a unified segmenter class in intl/components that delegates its API implementation to lwbrk for now.

Let me know if the above sounds like a good plan.

Whiteboard: [i18n-unification]

LineBreaker APIs:

Prev(): Take a string, and find the previous line break opportunity from aPos.

The LineBreaker API is used for text wrapping (the line length is 80 characters for mail).

Prev() is used by the serializers (XML and plain text) and by quote wrapping. I think we can avoid Prev() by rewriting the serializers, or by providing a simple API that wraps text at a given character width.
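A width-based wrap can be done with a single forward pass that remembers the most recent break opportunity, so no backward search is needed. The sketch below is only an illustration under stated assumptions: WrapForward and IsBreakBefore are hypothetical names, the break rule is a toy one (a break is allowed before a non-space character that follows an ASCII space), width is counted in UTF-16 code units, and a word longer than the width is left to overflow its line.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Toy break rule, standing in for a real line breaker.
static bool IsBreakBefore(const std::u16string& aText, size_t aPos) {
  return aPos > 0 && aPos < aText.size() && aText[aPos - 1] == u' ' &&
         aText[aPos] != u' ';
}

// Copies [aStart, aEnd) without trailing spaces.
static std::u16string TrimmedLine(const std::u16string& aText, size_t aStart,
                                  size_t aEnd) {
  while (aEnd > aStart && aText[aEnd - 1] == u' ') --aEnd;
  return aText.substr(aStart, aEnd - aStart);
}

// Greedy wrap using only a forward scan over break opportunities.
std::vector<std::u16string> WrapForward(const std::u16string& aText,
                                        size_t aWidth) {
  std::vector<std::u16string> lines;
  const size_t kNone = std::u16string::npos;
  size_t lineStart = 0;
  size_t lastBreak = kNone;  // Most recent opportunity on this line.
  for (size_t i = 1; i < aText.size(); ++i) {
    if (IsBreakBefore(aText, i)) lastBreak = i;
    // A non-space character at offset >= aWidth overflows the line;
    // trailing spaces are allowed to hang past the width.
    if (aText[i] != u' ' && i - lineStart >= aWidth && lastBreak != kNone) {
      lines.push_back(TrimmedLine(aText, lineStart, lastBreak));
      lineStart = lastBreak;
      lastBreak = kNone;
    }
  }
  std::u16string rest = TrimmedLine(aText, lineStart, aText.size());
  if (!rest.empty()) lines.push_back(rest);
  return lines;
}
```

Whether this is enough for the serializers and quote wrapping depends on how they currently use Prev(), which this sketch does not attempt to model.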

Depends on: 1728708
Blocks: 1684927
Depends on: 1730084
Assignee: nobody → aethanyc
Status: NEW → ASSIGNED
Depends on: 1733009