Closed Bug 1630920 Opened 5 years ago Closed 2 years ago

Remove the gb2312han and big5han collations

Categories

(Core :: JavaScript: Internationalization API, task, P3)


Tracking


RESOLVED FIXED
108 Branch
Tracking Status
firefox108 --- fixed

People

(Reporter: hsivonen, Assigned: hsivonen)

References

Details

(Keywords: parity-chrome)

Attachments

(1 file)

CLDR provides non-default Chinese collations that are based on legacy encodings: gb2312han and big5han. https://searchfox.org/mozilla-central/rev/567b68b8ff4b6d607ba34a6f1926873d21a7b4d7/js/src/tests/test262/intl402/Collator/prototype/resolvedOptions/basic.js#19 suggests that we expose these to the Web.
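For reference, a minimal sketch (not taken from the bug; locale and collation names chosen for illustration) of how a page can probe whether these collations are exposed through the Intl API. Note that the BCP 47 keyword for the gb2312han collation is "gb2312", and that an unsupported collation falls back to "default" in resolvedOptions():

// Probe which Chinese collations the engine exposes; unsupported ones
// fall back to the "default" collation in resolvedOptions().
for (const collation of ["big5han", "gb2312", "pinyin", "stroke", "zhuyin"]) {
  const resolved = new Intl.Collator("zh", { collation }).resolvedOptions();
  console.log(collation, "->", resolved.collation);
}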

It appears that these may exist for compatibility with legacy software that used the GB2312 and Big5 encodings, respectively, and sorted strings in those encodings lexicographically.

With the caveat that I don't actually read Chinese, these collation orders seem questionable in utility: instead of resulting in a coherent sort order, such as first by radical and then by stroke count for GB2312 or first by stroke count and then by radical for Big5, the repertoire of these encodings is split into common and rare characters according to the definitions of the local education systems around 1980. While the common set and the rare set each have a coherent internal order, the split means that, for example, big5han sorts a character with few strokes from the rare set after a character with a large number of strokes from the common set.
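To make this observable, a rough sketch that sorts the same sample under big5han and under the stroke collation and prints both orders. The sample characters are arbitrary placeholders, not a claim about which characters fall into the common or rare sets, and the actual output depends on the engine's CLDR data:

// Sort the same sample under big5han and under the stroke collation
// and compare the two resulting orders.
const sample = ["一", "中", "犇", "龍", "龜"];
for (const collation of ["big5han", "stroke"]) {
  const compare = new Intl.Collator("zh-Hant", { collation }).compare;
  console.log(collation + ":", [...sample].sort(compare).join(" "));
}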

If these aren't actually used on the Web, we should not pay for them in binary size.

In the interest of figuring out whether keeping these is worthwhile from the binary size perspective (and, if they are worth keeping, whether we should try to optimize binary size between ICU4X and encoding_rs), let's add use counters for all Chinese collations. Counting all of them makes it possible to compare the level of usage of gb2312han and big5han with the others.

With the caveat that I don't actually read Chinese

I emphasize this caveat, considering that the Japanese collation order in CLDR exhibits the characteristic of sorting rare after common (and having different ordering criteria for the common set and the rare set).

In bug 1036383 we had a user who wanted to sort ASCII before native script, which is currently only possible through gb2312han or big5han (see also https://unicode-org.atlassian.net/browse/CLDR-9944).
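A hedged sketch of that use case (the strings are illustrative; the comparison assumes, per the above, that only gb2312han and big5han currently order ASCII before the native script, and the exact output depends on the engine's collation data):

// Compare how a mixed Latin/Han list sorts under the default Chinese
// collation versus gb2312han (BCP 47 keyword "gb2312").
const entries = ["apple", "中文", "banana", "文件"];
const defaultOrder = [...entries].sort(new Intl.Collator("zh").compare);
const gb2312Order = [...entries].sort(
  new Intl.Collator("zh", { collation: "gb2312" }).compare
);
console.log("default:   " + defaultOrder.join(", "));
console.log("gb2312han: " + gb2312Order.join(", "));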

Web compatibility-wise we could just remove gb2312han and big5han, because both collations are already not available in Chrome (https://chromium.googlesource.com/chromium/deps/icu/+/refs/heads/master/filters/common.json).

(In reply to André Bargull [:anba] from comment #2)

Web compatibility-wise we could just remove gb2312han and big5han, because both collations are already not available in Chrome

Whoa. In that case, it would indeed seem prudent to remove these.

This could help us narrow down the ICU binary size (bug 1612578).

See Also: → 1612578
Severity: -- → S4
Priority: -- → P3

Let's morph this into removal for consistency with Chrome.

Keywords: parity-chrome
Summary: Add use counters for Chinese collations → Remove the gb2312han and big5han collations

I tried to edit data_filter.json to exclude big5han and gb2312han, but the resulting build still showed them as supported on https://hsivonen.com/test/moz/zh-collations.html .

anba, any ideas why data_filter.json doesn't appear to take effect for these?

Flags: needinfo?(andrebargull)

(In reply to Henri Sivonen (:hsivonen) from comment #7)

anba, any ideas why data_filter.json doesn't appear to take effect for these?

The ICU data file needs to be manually rebuilt after editing data_filter.json. If you only want to rebuild the data file and don't care about reapplying the current tzdata version, executing icu_sources_data.py should work:

cd intl/
PYTHONPATH=../python/mozbuild/ python ./icu_sources_data.py ../

The tzdata version can be reapplied by executing icupkg, but it's necessary to use icupkg from the current in-tree ICU version (i.e. ICU 71):

cd intl/
icupkg --add ./tzdata/files.txt --sourcedir ./tzdata/source/le/ ../config/external/icu/data/icudt71l.dat
Flags: needinfo?(andrebargull)

Thanks. I was able to get https://hsivonen.com/test/moz/zh-collations.html to show the absence of these collations (after manually removing ICU from the obj dir before rebuilding). This makes the data file 12 KB larger, though. (I tested that running the steps without changes to data_filter.json does not make the data file larger, so this shouldn't be a matter of the scripts not matching the checked-in data file.)

Any ideas why excluding something makes the data file larger?

https://treeherder.mozilla.org/jobs?repo=try&revision=bc21f4df01c6d95a21ed5bec6e611ecc3ed047f4

Flags: needinfo?(andrebargull)

(In reply to Henri Sivonen (:hsivonen) from comment #10)

Any ideas why excluding something makes the data file larger?

Responded in Phabricator. It looks like "collations" needs to be used instead of "*" in the data filter.

Flags: needinfo?(andrebargull)
Attachment #9300299 - Attachment description: WIP: Bug 1630920 - Remove the gb2312han and big5han collations. → Bug 1630920 - Remove the gb2312han and big5han collations.
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Pushed by hsivonen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2a987fb9383b Remove the gb2312han and big5han collations. r=anba
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 108 Branch