Open Bug 1871754 Opened 11 months ago Updated 8 months ago

[ja] New Japanese word segmentation is annoying for me

Categories

(Core :: Internationalization, defect, P5)

Firefox 122
defect

Tracking Status
firefox-esr115 --- unaffected
firefox121 --- unaffected
firefox122 --- wontfix
firefox123 --- wontfix

People

(Reporter: alice0775, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: jp-critical, nightly-community, regression)

The new Japanese word segmentation (used for double-click selection and web search) is annoying.

Ex.
文芸評論(ぶんげいひょうろん、英語: literary criticism)とは、文学を評論すること。
文芸批評、または文学研究とも言うが、
小説家や作品に限らず文学とその周辺全般が扱われ、学際的な性格を持つ。
研究対象の性格によっては
夏目漱石、山田太郎

NEW word Segmentation:
/文芸/評論/(/ぶん/げ/い/ひょう/ろ/ん/、/英語/: /literary /criticism/)/と/は/、/
/文芸/批評/、/または/文学/研究/とも/言う/が/、/
/小説/家/や/作品/に/限/ら/ず文/学/とそ/の/周辺/全般/が/扱/われ/、/学際/的/な/性格/を/持つ/。/
/研究/対象/の/性格/によって/は/
/夏目/漱石/、/山田/太郎/

OLD word Segmentation:
/文芸評論/(/ぶんげいひょうろん/、/英語/: /literary /criticism/)/とは/、
/文芸批評/、/または/文学研究/とも/言/うが/、/
/小説家/や/作品/に/限/らず/文学/とその/周辺全般/が/扱われ/、/学際的/な/性格/を/持/つ/。/
/研究対象/の/性格/によっては/
/夏目漱石/、/山田太郎/

Expected:
文芸評論, 文芸批評, 文学研究, 小説家, 文学, 研究対象, 夏目漱石, 山田太郎 should each be treated as one word.
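
To make the comparison above easier to check, here is a minimal sketch in Python that parses the "/"-delimited lines from this report and lists which of the expected words survive as a single segment. It does not call Firefox, ICU, or ICU4X; the data is copied from the report, and the helper names (segments, kept_whole) are made up for this illustration.

```python
# Segmentations copied from the report; "/" marks a word boundary.
NEW = [
    "/文芸/評論/(/ぶん/げ/い/ひょう/ろ/ん/、/英語/: /literary /criticism/)/と/は/、/",
    "/文芸/批評/、/または/文学/研究/とも/言う/が/、/",
    "/小説/家/や/作品/に/限/ら/ず文/学/とそ/の/周辺/全般/が/扱/われ/、/学際/的/な/性格/を/持つ/。/",
    "/研究/対象/の/性格/によって/は/",
    "/夏目/漱石/、/山田/太郎/",
]
OLD = [
    "/文芸評論/(/ぶんげいひょうろん/、/英語/: /literary /criticism/)/とは/、",
    "/文芸批評/、/または/文学研究/とも/言/うが/、/",
    "/小説家/や/作品/に/限/らず/文学/とその/周辺全般/が/扱われ/、/学際的/な/性格/を/持/つ/。/",
    "/研究対象/の/性格/によっては/",
    "/夏目漱石/、/山田太郎/",
]
# Words the reporter expects to be selected as one unit.
EXPECTED = ["文芸評論", "文芸批評", "文学研究", "小説家", "文学",
            "研究対象", "夏目漱石", "山田太郎"]


def segments(line):
    """Split one "/"-delimited line into its non-empty segments."""
    return [s for s in line.split("/") if s]


def kept_whole(lines, words):
    """Return the words that appear as a single segment in at least one line."""
    found = {seg for line in lines for seg in segments(line)}
    return [w for w in words if w in found]


if __name__ == "__main__":
    print("NEW keeps:", kept_whole(NEW, EXPECTED))
    print("OLD keeps:", kept_whole(OLD, EXPECTED))
```

With the data above, the OLD segmentation keeps all eight expected words, while the NEW one keeps only 文学. The check is per word rather than per occurrence: 文学 counts as kept under NEW because of the second line, even though the third line splits it as ず文/学.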

:m_kato, since you are the author of the regressor, bug 1854032, could you take a look? Also, could you set the severity field?

Flags: needinfo?(m_kato)

In the long term, we will use a machine-learning-based segmenter for CJ. The current behavior depends on ICU's dictionary for CJ, so there is no way to fix it now.

Severity: -- → S3
Type: enhancement → defect
Flags: needinfo?(m_kato)
Priority: -- → P5
Type: enhancement → defect
Severity: S3 → S4
Type: enhancement → defect
No longer blocks: 1869732
Blocks: segmenter
No longer blocks: segmenter
Regressed by: segmenter
No longer regressed by: 1854032

FWIW, the new Firefox word segmentation behavior is the same as in Google Chrome and Safari.

No longer regressed by: segmenter
Depends on: segmenter

I discussed this issue with Ting-Yu. We may be able to add an option to ICU4X to ignore the dictionary, or add an option to WordBreakIteratorUtf16 to ignore it for Han script. I vote for a new option on WordBreakIteratorUtf16, since I don't want more dependencies in the ICU4X segmenter.

Then we can switch the behavior with a pref in intl.properties, so that localizers can choose it.