Closed Bug 1630920 Opened 5 years ago Closed 2 years ago

Remove the gb2312han and big5han collations

Categories

(Core :: JavaScript: Internationalization API, task, P3)


Tracking


RESOLVED FIXED
108 Branch
Tracking Status
firefox108 --- fixed

People

(Reporter: hsivonen, Assigned: hsivonen)

References

Details

(Keywords: parity-chrome)

Attachments

(1 file)

CLDR provides non-default Chinese collations that are based on legacy encodings: gb2312han and big5han. https://searchfox.org/mozilla-central/rev/567b68b8ff4b6d607ba34a6f1926873d21a7b4d7/js/src/tests/test262/intl402/Collator/prototype/resolvedOptions/basic.js#19 suggests that we expose these to the Web.
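For reference, a minimal sketch (not taken from the bug; locale and collation names chosen for illustration) of how a page can probe whether these collations are exposed through the Intl API. Note that the BCP 47 keyword for the gb2312han collation is "gb2312", and that an unsupported collation falls back to "default" in resolvedOptions():

// Probe which Chinese collations the engine exposes; unsupported ones
// fall back to the "default" collation in resolvedOptions().
for (const collation of ["big5han", "gb2312", "pinyin", "stroke", "zhuyin"]) {
  const resolved = new Intl.Collator("zh", { collation }).resolvedOptions();
  console.log(collation, "->", resolved.collation);
}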

It appears that these may exist for compatibility with legacy software that used the GB2312 and Big5 encodings, respectively, and sorted strings in those encodings lexicographically.

With the caveat that I don't actually read Chinese, these collation orders seem questionable in utility: instead of resulting in a coherent sort order, such as first by radical and then by stroke count for GB2312 or first by stroke count and then by radical for Big5, the repertoire of these encodings is split into common and rare characters according to the definitions of the local education systems around 1980. While the common set and the rare set each have a coherent internal order, the split means that, for example, big5han sorts a character with few strokes from the rare set after a character with a large number of strokes from the common set.
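To make this observable, a rough sketch that sorts the same sample under big5han and under the stroke collation and prints both orders. The sample characters are arbitrary placeholders, not a claim about which characters fall into the common or rare sets, and the actual output depends on the engine's CLDR data:

// Sort the same sample under big5han and under the stroke collation
// and compare the two resulting orders.
const sample = ["一", "中", "犇", "龍", "龜"];
for (const collation of ["big5han", "stroke"]) {
  const compare = new Intl.Collator("zh-Hant", { collation }).compare;
  console.log(collation + ":", [...sample].sort(compare).join(" "));
}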

If these aren't actually used on the Web, we should not pay for them in binary size.

In the interest of figuring out whether keeping these is worthwhile from the binary size perspective (and, if they are worth keeping, whether we should try to optimize binary size between ICU4X and encoding_rs), let's add use counters for all Chinese collations. Counting all of them makes it possible to compare the level of usage of gb2312han and big5han with the others.

With the caveat that I don't actually read Chinese

I emphasize this caveat, considering that the Japanese collation order in CLDR exhibits the characteristic of sorting rare after common (and having different ordering criteria for the common set and the rare set).

In bug 1036383 we had a user who wanted to sort ASCII before native script, which is currently only possible through gb2312han or big5han (see also https://unicode-org.atlassian.net/browse/CLDR-9944).
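A hedged sketch of that use case (the strings are illustrative; the comparison assumes, per the above, that only gb2312han and big5han currently order ASCII before the native script, and the exact output depends on the engine's collation data):

// Compare how a mixed Latin/Han list sorts under the default Chinese
// collation versus gb2312han (BCP 47 keyword "gb2312").
const entries = ["apple", "中文", "banana", "文件"];
const defaultOrder = [...entries].sort(new Intl.Collator("zh").compare);
const gb2312Order = [...entries].sort(
  new Intl.Collator("zh", { collation: "gb2312" }).compare
);
console.log("default:   " + defaultOrder.join(", "));
console.log("gb2312han: " + gb2312Order.join(", "));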

Web compatibility-wise we could just remove gb2312han and big5han, because both collations are already not available in Chrome (https://chromium.googlesource.com/chromium/deps/icu/+/refs/heads/master/filters/common.json).

(In reply to André Bargull [:anba] from comment #2)

Web compatibility-wise we could just remove gb2312han and big5han, because both collations are already not available in Chrome

Whoa. In that case, it would indeed seem prudent to remove these.

This could help us narrow down the ICU binary size (bug 1612578).

See Also: → 1612578
Severity: -- → S4
Priority: -- → P3

Let's morph this into removal for consistency with Chrome.

Keywords: parity-chrome
Summary: Add use counters for Chinese collations → Remove the gb2312han and big5han collations

I tried to edit data_filter.json to exclude big5han and gb2312han, but the resulting build still showed them as supported on https://hsivonen.com/test/moz/zh-collations.html .

anba, any ideas why data_filter.json doesn't appear to take effect for these?

Flags: needinfo?(andrebargull)

(In reply to Henri Sivonen (:hsivonen) from comment #7)

anba, any ideas why data_filter.json doesn't appear to take effect for these?

The ICU data file needs to be manually rebuilt after editing data_filter.json. If you only want to rebuild the data file and don't care about reapplying the current tzdata version, executing icu_sources_data.py should work:

cd intl/
PYTHONPATH=../python/mozbuild/ python ./icu_sources_data.py ../

The tzdata version can be reapplied by executing icupkg, but it's necessary to use icupkg from the current in-tree ICU version (i.e. ICU 71):

cd intl/
icupkg --add ./tzdata/files.txt --sourcedir ./tzdata/source/le/ ../config/external/icu/data/icudt71l.dat
Flags: needinfo?(andrebargull)

Thanks. I was able to get https://hsivonen.com/test/moz/zh-collations.html to show the absence of these collations (after manually removing ICU from the obj dir before rebuilding). This makes the data file 12 KB larger, though. (I tested that running the steps without changes to data_filter.json does not make the data file larger, so this shouldn't be a matter of the scripts not matching the checked-in data file.)

Any ideas why excluding something makes the data file larger?

https://treeherder.mozilla.org/jobs?repo=try&revision=bc21f4df01c6d95a21ed5bec6e611ecc3ed047f4

Flags: needinfo?(andrebargull)

(In reply to Henri Sivonen (:hsivonen) from comment #10)

Any ideas why excluding something makes the data file larger?

Responded in Phabricator. It looks like "collations" needs to be used instead of "*" in the data filter.

Flags: needinfo?(andrebargull)
Attachment #9300299 - Attachment description: WIP: Bug 1630920 - Remove the gb2312han and big5han collations. → Bug 1630920 - Remove the gb2312han and big5han collations.
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Pushed by hsivonen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2a987fb9383b Remove the gb2312han and big5han collations. r=anba
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 108 Branch