Tolerate unmapped Big5 byte sequences in chardetng
Categories
(Core :: Internationalization, defect, P3)
Tracking
()
Version | Tracking | Status |
---|---|---|
firefox89 | --- | fixed |
People
(Reporter: hsivonen, Assigned: hsivonen)
Details
Attachments
(1 file)
Telemetry suggests that users in Taiwan and Hong Kong experience misdetection more often than the global average, which in turn indicates that chardetng is rejecting Big5-ish input.
Comment 1•4 years ago
The telemetry results in question are shown at: https://hsivonen.fi/encoding-telemetry/
Comment 3•4 years ago
This patch addresses the problem that legacy CJK encodings have various
extended variants where the core of the encoding is compatible but the edges
are incompatible. Without this patch, we reject e.g. Big5 if the input contains
a single character from the UAO extension or a single Windows end-user-defined
character. The same applies to the other legacy CJK encodings.
This patch tolerates:
- All Big5 extensions (the motivating part of this patch).
- Windows EUDC for EUC-KR.
- Classic Mac OS extensions to Shift_JIS, EUC-KR, GBK, and Big5 to the
  extent practical considering conflicting definitions of what constitutes
  a lead byte in the Encoding Standard but a single-byte extension in
  Classic Mac OS.
- JIS X 0213 / 2004 extensions to Shift_JIS and EUC-JP. (It's unclear if
  these have actual deployment.)
Tolerating means that the occurrence of an extension character doesn't
disqualify a candidate but only applies a penalty to the pending score.
If there is enough other convincing content, it should be able to overcome
the penalty.
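The difference between disqualifying and penalizing can be illustrated with
a minimal sketch. This is not chardetng's actual code: the `Unit` and
`Candidate` types, the `score` helper, the `tolerate` flag, and the
`EXTENSION_PENALTY` value are all hypothetical names and numbers chosen for
illustration only.

```rust
/// What a candidate sees for one decoded unit (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Unit {
    /// Character mapped by the core of the encoding, with its score contribution.
    Mapped(i64),
    /// Byte sequence that is only meaningful in some extension (e.g. UAO, EUDC).
    Extension,
    /// Byte sequence that is not valid in any variant.
    Invalid,
}

/// Assumed penalty value; the real magnitude is a tuning decision.
const EXTENSION_PENALTY: i64 = -50;

/// Hypothetical per-encoding candidate with a pending score.
#[derive(Debug, Default)]
struct Candidate {
    score: i64,
    disqualified: bool,
}

impl Candidate {
    /// Feeds one unit. With `tolerate == false`, an extension character
    /// disqualifies the candidate; with `tolerate == true`, it only
    /// subtracts from the pending score.
    fn feed(&mut self, unit: Unit, tolerate: bool) {
        if self.disqualified {
            return;
        }
        match unit {
            Unit::Mapped(points) => self.score += points,
            Unit::Extension if tolerate => self.score += EXTENSION_PENALTY,
            Unit::Extension | Unit::Invalid => self.disqualified = true,
        }
    }

    /// Returns `None` for a disqualified candidate, else the pending score.
    fn final_score(&self) -> Option<i64> {
        if self.disqualified {
            None
        } else {
            Some(self.score)
        }
    }
}

/// Scores a whole input for one candidate.
fn score(units: &[Unit], tolerate: bool) -> Option<i64> {
    let mut candidate = Candidate::default();
    for &unit in units {
        candidate.feed(unit, tolerate);
    }
    candidate.final_score()
}
```

Under these assumptions, an input of twenty mapped characters worth 10 points
each followed by one extension character is rejected outright when
`tolerate` is false, but keeps a positive score when `tolerate` is true,
so the candidate can still win if the rest of the content is convincing.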