1684927 - (segmenter) [meta] Unified Segmenter 2021

Reporter

Description

•

4 years ago

We are kicking off a new project to implement a unified segmentation model for the layout engine, based on UAX#14/UAX#29, and to offer it to SpiderMonkey to back ECMA-402 Intl.Segmenter.

The effort is going to be part of the ICU4X project, and will initially live as a branch of icu4x at https://github.com/aethanyc/icu4x

Once we're ready to loop it into Gecko, we'll file specific integration bugs and mark them as blocking this one.

Until then, this meta bug will collect bugs we hope to address in the rewrite.

Zibi Braniecki [:zbraniecki][:gandalf]

Reporter

Updated

•

4 years ago

Blocks: line-breaking

Depends on: 1423593, 56652, 345823, 1553725, 820261

BugBot [:suhaib / :marco/ :calixte]

Updated

•

4 years ago

Keywords: meta

Makoto Kato [:m_kato]

Updated

•

4 years ago

Severity: -- → S3

Priority: -- → P3

Jim Mathies [:jimm]

Updated

•

4 years ago

Blocks: 359179

Zibi Braniecki [:zbraniecki][:gandalf]

Reporter

Updated

•

4 years ago

Blocks: 1267120

Zibi Braniecki [:zbraniecki][:gandalf]

Reporter

Updated

•

4 years ago

Blocks: 1569566

Zibi Braniecki [:zbraniecki][:gandalf]

Reporter

Updated

•

4 years ago

Blocks: 774965

Anne (:annevk)

Comment 1

•

4 years ago

This came up in https://github.com/w3c/editing/issues/278. Is there a standardization effort backing this? It's biting web developers that, even on a single platform, browsers behave differently from each other. (It's expected that there are platform differences.)

Zibi Braniecki [:zbraniecki][:gandalf]

Reporter

Comment 2

•

4 years ago

I don't think there is. We're going to follow the Unicode UAX#14 and UAX#29 standards, which should help, but the biggest goal for us is to end up with a single set of logic and data powering both layout segmentation and JavaScript segmentation.

By following UAX#14/UAX#29, we're likely to substantially close the gap with what engines powered by ICU4C are doing.

Bob Owen (:bobowen)

Updated

•

4 years ago

Blocks: win32k-lockdown

robert.rcampbell

Comment 3

•

3 years ago

I'm just curious, how will this impact complex text layout languages? Will it ship the data needed for Thai, Burmese, Khmer, and Lao?

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Comment 4

•

3 years ago

> I'm just curious, how will this impact complex text layout languages? Will it ship the data needed for Thai, Burmese, Khmer, and Lao?

Yes, we are planning to support these languages. The data is either a model trained via machine learning or a dictionary, as in ICU.
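
For readers unfamiliar with why these languages need extra data: Thai, Burmese, Khmer, and Lao do not put spaces between words, so a segmenter has to look words up or predict boundaries. The snippet below is only a minimal sketch of the dictionary idea, using greedy longest-match over a toy two-word Thai dictionary. The function name `segment_longest_match` and the dictionary contents are made up for illustration; this is not how ICU or ICU4X actually implement dictionary segmentation, and real engines use much larger dictionaries and more careful matching.

```python
def segment_longest_match(text, dictionary):
    """Greedy longest-match word segmentation over a toy dictionary.

    Illustrative only: a hypothetical helper, not the ICU/ICU4X algorithm.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            # No dictionary word starts here: fall back to one code point.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


# Toy dictionary (assumption for the example): "language" and "Thai".
toy_dictionary = {"ภาษา", "ไทย"}
print(segment_longest_match("ภาษาไทย", toy_dictionary))
```

A trained model would replace the dictionary lookup with a per-position boundary prediction, which is where the potential data size savings come from.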

robert.rcampbell

Comment 5

•

3 years ago

That's awesome!

It would be neat to learn more about the process of building the machine learning models, as that could be a huge data size benefit for a number of these languages.

I'm helping maintain the Lao data in ICU and Hunspell Lao, and have personal contacts who have worked with the other regional languages (Thai, Khmer, and Burmese), so I'm sure it would be possible to gather lots of decent training data to develop some really good basic models.

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

3 years ago

See Also: → https://github.com/unicode-org/icu4x/issues/109

Dan Minor [:dminor]

Updated

•

3 years ago

Depends on: 1719535

Dan Minor [:dminor]

Updated

•

3 years ago

Depends on: 1719537

Emilio Cobos Álvarez (:emilio)

Updated

•

3 years ago

Blocks: 1722848

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

3 years ago

Depends on: 1722484

Bob Owen (:bobowen)

Comment 6

•

3 years ago

No longer blocks bug 1381019, because of work-around landed in bug 1713973.

No longer blocks: win32k-lockdown

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

3 years ago

Blocks: 1125644

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

3 years ago

Alias: segmenter

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

3 years ago

Blocks: 1611578

Zibi Braniecki [:zbraniecki][:gandalf]

Reporter

Updated

•

2 years ago

Blocks: 1781989

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

1 year ago

Blocks: 1837706

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

1 year ago

Depends on: 1847807

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

1 year ago

Depends on: 1817386

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

1 year ago

Depends on: 1854031

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

1 year ago

Depends on: 1854032

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

11 months ago

Depends on: 1856267, 1858068, 1848282

Makoto Kato [:m_kato]

Updated

•

9 months ago

Depends on: 1871754

Makoto Kato [:m_kato]

Updated

•

9 months ago

No longer depends on: 1871754

Regressions: 1871754

Makoto Kato [:m_kato]

Updated

•

9 months ago

No longer regressions: 1871754

Makoto Kato [:m_kato]

Updated

•

9 months ago

Blocks: 1871754

Alice0775 White

Updated

•

8 months ago

Depends on: 1876874

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

8 months ago

Depends on: 1848049

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

7 months ago

Depends on: 1879221

Daniel Holbert [:dholbert]

Updated

•

7 months ago

Depends on: 1880362

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

7 months ago

Depends on: 1869732

Tom S [:evilpie]

Updated

•

6 months ago

Depends on: 1883914

Ting-Yu Lin [:TYLin] (PDT, UTC-7)

Updated

•

4 months ago

Depends on: 1899411