Remove the gb2312han and big5han collations
Categories
(Core :: JavaScript: Internationalization API, task, P3)
Tracking
firefox108: fixed
People
(Reporter: hsivonen, Assigned: hsivonen)
References
Details
(Keywords: parity-chrome)
Attachments
(1 file)
CLDR provides non-default Chinese collations that are based on legacy encodings: gb2312han and big5han. https://searchfox.org/mozilla-central/rev/567b68b8ff4b6d607ba34a6f1926873d21a7b4d7/js/src/tests/test262/intl402/Collator/prototype/resolvedOptions/basic.js#19 suggests that we expose these to the Web.
It appears that these may exist for compatibility with legacy software that used the GB2312 and Big5 encodings, respectively, and sorted strings in those encodings lexicographically.
With the caveat that I don't actually read Chinese, these collation orders seem to be of questionable utility: instead of producing a coherent sort order, such as first by radical and then by stroke count for GB2312, or first by stroke count and then by radical for Big5, the repertoire of each encoding is split into common and rare characters according to the definitions of the local education systems around 1980. The common set and the rare set each have a coherent internal order, but the rare set sorts entirely after the common set: e.g. big5han sorts a character with few strokes from the rare set after a character with a large number of strokes from the common set.
If these aren't actually used on the Web, we should not pay for them in binary size.
In the interest of figuring out whether it's worthwhile to have these from the binary size perspective (and, if it is worthwhile to have them, whether we should try to optimize binary size between ICU4X and encoding_rs), let's add use counters for all Chinese collations. (For all of them, in order to be able to compare the level of usage of gb2312han and big5han with the others.)
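As a sketch of how a page can observe whether these collations are exposed (the helper name `supportsCollation` is illustrative, not from this bug; note that gb2312han's BCP 47 keyword form is `gb2312`, since extension subtags are limited to 8 characters):

```javascript
// Sketch: feature-detect a Chinese collation via the BCP 47 "-u-co-" extension.
// If the requested collation is unsupported, resolvedOptions().collation
// reports the fallback (e.g. "default") instead of echoing the request.
function supportsCollation(locale, collation) {
  const resolved = new Intl.Collator(`${locale}-u-co-${collation}`).resolvedOptions();
  return resolved.collation === collation;
}

for (const co of ["pinyin", "stroke", "zhuyin", "gb2312", "big5han"]) {
  console.log(co, supportsCollation("zh", co));
}
```

The result for `gb2312` and `big5han` depends on whether the engine's ICU data still carries these tailorings.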
Updated•5 years ago
Comment 1•5 years ago (Assignee)
> With the caveat that I don't actually read Chinese
I emphasize this caveat, considering that the Japanese collation order in CLDR exhibits the characteristic of sorting rare after common (and having different ordering criteria for the common set and the rare set).
Comment 2•5 years ago
In bug 1036383 we had a user who wanted to sort ASCII before native script, which is currently only possible through gb2312han or big5han (see also https://unicode-org.atlassian.net/browse/CLDR-9944).
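A minimal illustration of that use case (the collation keywords are the standard Intl ones; the exact ordering depends on the ICU build, and in engines that have removed big5han the request silently falls back to the default collation):

```javascript
// Sketch: compare how different zh collations order a mix of Latin and Han
// strings. In the legacy-encoding-based collations (gb2312han/big5han),
// the ASCII range sorts as a block before Han; with pinyin, Han characters
// interleave with Latin according to their romanization.
const words = ["中", "爱", "zebra", "apple"];
for (const co of ["pinyin", "stroke", "big5han"]) {
  const sorted = [...words].sort(new Intl.Collator(`zh-u-co-${co}`).compare);
  console.log(co, sorted.join(" "));
}
```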
Web compatibility-wise we could just remove gb2312han and big5han, because both collations are already unavailable in Chrome (https://chromium.googlesource.com/chromium/deps/icu/+/refs/heads/master/filters/common.json).
Comment 3•5 years ago (Assignee)
(In reply to André Bargull [:anba] from comment #2)
> Web compatibility-wise we could just remove gb2312han and big5han, because both collations are already unavailable in Chrome
Whoa. In that case, it would indeed seem prudent to remove these.
Comment 4•5 years ago
This could help us narrow down the ICU binary size (bug 1612578).
Updated•4 years ago
Comment 5•4 years ago (Assignee)
Let's morph this into removal for consistency with Chrome.
Comment 6•2 years ago (Assignee)
Issue to remove these from CLDR itself: https://unicode-org.atlassian.net/browse/CLDR-16062
Comment 7•2 years ago (Assignee)
I tried to edit data_filter.json to exclude big5han and gb2312han, but the resulting build still showed them as supported on https://hsivonen.com/test/moz/zh-collations.html .
anba, any ideas why data_filter.json doesn't appear to take effect for these?
Comment 8•2 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #7)
> anba, any ideas why data_filter.json doesn't appear to take effect for these?
The ICU data file needs to be manually rebuilt after editing data_filter.json. If you only want to rebuild the data file and don't care about reapplying the current tzdata version, executing icu_sources_data.py should work:
cd intl/
PYTHONPATH=../python/mozbuild/ python ./icu_sources_data.py ../
The tzdata version can be reapplied by executing icupkg, but it's necessary to use the icupkg from the current in-tree ICU version (i.e. ICU 71):
cd intl/
icupkg --add ./tzdata/files.txt --sourcedir ./tzdata/source/le/ ../config/external/icu/data/icudt71l.dat
Comment 9•2 years ago (Assignee)
Comment 10•2 years ago (Assignee)
Thanks. I was able to get https://hsivonen.com/test/moz/zh-collations.html to show the absence of these collations (after manually removing ICU from the obj dir before rebuilding). This makes the data file 12 KB larger, though. (I tested that running the steps without changes to data_filter.json does not make the data file larger, so this shouldn't be a matter of the scripts not matching the checked-in data file.)
Any ideas why excluding something makes the data file larger?
https://treeherder.mozilla.org/jobs?repo=try&revision=bc21f4df01c6d95a21ed5bec6e611ecc3ed047f4
Comment 11•2 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #10)
> Any ideas why excluding something makes the data file larger?
Responded in Phabricator. It looks like collations needs to be used instead of * in the data filter.
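Not taken from this bug's patch: a sketch of what such a per-collation exclusion can look like, based on the ICU buildtool's documented resource-filter syntax (the actual rules in mozilla-central's data_filter.json may differ):

```json
{
  "resourceFilters": [
    {
      "categories": ["coll_tree"],
      "rules": [
        "-/collations/big5han",
        "-/collations/gb2312han"
      ]
    }
  ]
}
```

A leading `-` excludes the named resource path; a rule of `-/*` under `coll_tree` would instead drop the whole collation tree, which is why the filter has to name `collations` explicitly.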
Updated•2 years ago
Updated•2 years ago
Comment 12•2 years ago
Comment 13•2 years ago
bugherder