Get rid of legacy line/word segmenter
Categories
(Core :: Internationalization, task, P3)
Tracking
()
People
(Reporter: m_kato, Unassigned)
References
(Blocks 1 open bug)
Details
Once ICU4X's segmenter is the default on the release channel, we can remove the legacy segmenter along with its prefs.
Reporter
Comment 1•2 years ago
Also, we need to adjust or remove old segmenter tests such as the following.
layout/reftests/line-breaking/reftest.list
pref(intl.icu4x.segmenter.enabled,false) == chemical-1.html chemical-1-ref.html
pref(intl.icu4x.segmenter.enabled,false) == conservative-range-1.html conservative-range-1-ref.html
pref(intl.icu4x.segmenter.enabled,false) == conservative-range-2.html conservative-range-2-ref.html
pref(intl.icu4x.segmenter.enabled,false) == currency-1.html currency-1-ref.html
pref(intl.icu4x.segmenter.enabled,false) == datetime-1.html datetime-1-ref.html
pref(gfx.font_rendering.fallback.async,false) pref(intl.icu4x.segmenter.enabled,false) == emoji-2.html emoji-2-ref.html
pref(intl.icu4x.segmenter.enabled,false) == hyphens-1.html hyphens-1-ref.html
pref(intl.icu4x.segmenter.enabled,false) == hyphens-2.html hyphens-2-ref.html
pref(intl.icu4x.segmenter.enabled,false) == leaders-1.html leaders-1-ref.html
pref(intl.icu4x.segmenter.enabled,false) == markup-src-1.html markup-src-1-ref.html
pref(intl.icu4x.segmenter.enabled,false) == parentheses-1.html parentheses-1-ref.html
pref(intl.icu4x.segmenter.enabled,false) == quotationmarks-1.html quotationmarks-1-ref.html
pref(intl.icu4x.segmenter.enabled,false) skip-if(gtkWidget) == quotationmarks-cjk-1.html quotationmarks-cjk-1-ref.html
pref(intl.icu4x.segmenter.enabled,false) == smileys-1.html smileys-1-ref.html
pref(intl.icu4x.segmenter.enabled,false) == smileys-2.html smileys-2-ref.html
pref(intl.icu4x.segmenter.enabled,false) == surrogates-2.html surrogates-2-ref.html
pref(intl.icu4x.segmenter.enabled,false) == surrogates-4.html surrogates-4-ref.html
pref(intl.icu4x.segmenter.enabled,false) == url-1.html url-1-ref.html
pref(intl.icu4x.segmenter.enabled,false) == url-2.html url-2-ref.html
pref(intl.icu4x.segmenter.enabled,false) == url-3.html url-3-ref.html
pref(intl.icu4x.segmenter.enabled,false) == winpath-1.html winpath-1-ref.html
layout/reftests/text/reftest.list
pref(intl.icu4x.segmenter.enabled,false) == wordbreak-1.html wordbreak-1-ref.html
pref(intl.icu4x.segmenter.enabled,false) == 1507661-spurious-hyphenation-after-explicit.html 1507661-spurious-hyphenation-after-explicit-ref.html
pref(intl.icu4x.segmenter.enabled,false) == ethiopic-wordspace.html ethiopic-wordspace-ref.html
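To get a full inventory rather than the sample above, a small script can scan a manifest for entries pinned to the legacy segmenter. This is a minimal sketch: the helper names `legacy_segmenter_entries` and `drop_legacy_pref` are hypothetical, and it assumes annotations are whitespace-separated tokens and that `#` starts a comment line. Whether a given test should lose its annotation or be removed outright still has to be decided per test.

```python
# Annotation that pins a reftest to the legacy segmenter (taken from the
# manifest entries listed in this comment).
LEGACY_PREF = "pref(intl.icu4x.segmenter.enabled,false)"


def legacy_segmenter_entries(manifest_text):
    """Return (line_number, line) pairs for entries carrying LEGACY_PREF."""
    hits = []
    for lineno, line in enumerate(manifest_text.splitlines(), start=1):
        stripped = line.strip()
        # Assumption: lines starting with '#' are comments and can be ignored.
        if stripped.startswith("#"):
            continue
        if LEGACY_PREF in stripped.split():
            hits.append((lineno, stripped))
    return hits


def drop_legacy_pref(line):
    """Remove only the legacy-segmenter annotation, keeping all other tokens."""
    return " ".join(tok for tok in line.split() if tok != LEGACY_PREF)
```

Running `legacy_segmenter_entries` over the text of `layout/reftests/line-breaking/reftest.list` and `layout/reftests/text/reftest.list` would list the candidates; `drop_legacy_pref` shows one possible adjustment for tests that should keep running against the ICU4X segmenter.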
Comment 2•2 years ago
Given that we are receiving complaints about Japanese word selection with the ICU4X segmenter, such as bug 1871754 and https://support.mozilla.org/en-US/forums/contributors/716759?last=86984#post-86974, we might want to keep the pref for a while so that people can switch back to the old behavior.
Reporter
Comment 3•2 years ago
(In reply to Ting-Yu Lin [:TYLin] (UTC-8) (Away Feb 15 - Mar 2) from comment #2)
> Given that we are receiving complaints about Japanese word selection with the ICU4X segmenter, such as bug 1871754 and https://support.mozilla.org/en-US/forums/contributors/716759?last=86984#post-86974, we might want to keep the pref for a while so that people can switch back to the old behavior.
I guess the legacy line segmenter can be removed, since it is not affected by the issue in bug 1871754. Also, the word segmenter depends on the legacy complex breaker for East Asian languages, so we can replace it with ICU4X's LSTM segmenter. On Windows, the legacy complex breaker runs only in the parent process due to win32k lockdown.
Reporter
Comment 4•2 years ago
Should I file a new bug to remove only the legacy line segmenter?
Comment 5•2 years ago
Re comment 3:
> Also, the word segmenter depends on the legacy complex breaker for East Asian languages, so we can replace it with ICU4X's LSTM segmenter. On Windows, the legacy complex breaker runs only in the parent process due to win32k lockdown.
For the word breaker, if some Japanese users want the legacy word-breaking behavior, would it help to mimic that behavior using the ICU4X segmenter with no dictionaries?
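For readers unfamiliar with dictionary-free segmentation, the sketch below illustrates the general idea. It is not Gecko's or ICU4X's code, and this thread does not spell out the legacy behavior: the sketch assumes it roughly amounts to grouping consecutive characters of the same class (kanji, hiragana, katakana, and so on). The `char_class` and `class_runs` helpers and the simplified code point ranges are hypothetical.

```python
def char_class(ch):
    """Classify a character into a coarse class (simplified, assumed ranges)."""
    cp = ord(ch)
    if ch.isspace():
        return "space"
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    if ch.isalnum():
        return "alnum"
    return "punct"


def class_runs(text):
    """Split text wherever the character class changes; no dictionary needed."""
    runs = []
    prev_class = None
    for ch in text:
        cls = char_class(ch)
        if runs and cls == prev_class:
            runs[-1] += ch  # same class: extend the current run
        else:
            runs.append(ch)  # class changed: start a new run
        prev_class = cls
    return runs


print(class_runs("東京タワーに行く"))
# → ['東京', 'タワー', 'に', '行', 'く']
```

A dictionary or LSTM model could instead produce linguistically motivated boundaries, for example keeping 行く together as one word, which is the kind of difference that affects double-click word selection.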
Re comment 4:
> Should I file a new bug to remove only the legacy line segmenter?
It is OK to remove the legacy line segmenter and the word segmenter separately. However, we have some line-breaking compat issues, such as bug 1848049 and bug 1876874. It is not clear to me whether people are already setting intl.icu4x.segmenter.enabled=false to opt in to the legacy behavior. If so, we might want to fix these bugs before removing the legacy line segmenter.