Closed Bug 1702246 Opened 4 years ago Closed 4 years ago

Tolerate unmapped Big5 byte sequences in chardetng

Categories

(Core :: Internationalization, defect, P3)

RESOLVED FIXED
89 Branch
Tracking Status
firefox89 --- fixed

People

(Reporter: hsivonen, Assigned: hsivonen)

Details

Attachments

(1 file)

Telemetry indicates that users in Taiwan and Hong Kong experience misdetection more often than the global average, which suggests that chardetng is rejecting Big5-ish input.

Severity: -- → S3
Priority: -- → P3

This patch tries to address the issue that legacy CJK encodings have various
extended variants where the core of the encoding is compatible but the edges
are incompatible. Without this patch, we reject e.g. Big5 if it has a single
character from the UAO extension or a single Windows end-user-defined character.

Likewise for the other legacy CJK encodings.

This patch tolerates:

  • All Big5 extensions (the motivating part of this patch).
  • Windows EUDC for EUC-KR.
  • Classic Mac OS extensions to Shift_JIS, EUC-KR, GBK, and Big5 to the
    extent practical considering conflicting definitions of what constitutes
    a lead byte in the Encoding Standard but a single-byte extension in
    Classic Mac OS.
  • JIS X 0213 / 2004 extensions to Shift_JIS and EUC-JP. (It's unclear if
    these have actual deployment.)

Tolerating means that the occurrence of an extension character doesn't
disqualify a candidate but only applies a penalty to the pending score.
If there is enough other convincing content, it should be able to overcome
the penalty.
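
To illustrate the difference between disqualifying and penalizing, here is a minimal sketch. It is not chardetng's actual code: the `Unit` and `Candidate` types, the `score_units` helper, and the numeric values of `CORE_SCORE` and `EXTENSION_PENALTY` are all hypothetical and chosen only to make the arithmetic easy to follow.

```rust
/// Hypothetical classification of one decoded unit of input.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Unit {
    /// Maps to a character in the compatible core of the encoding.
    Core,
    /// Unmapped in the core, but plausible as an extension (e.g. UAO, EUDC).
    Extension,
    /// Structurally impossible for this encoding.
    Invalid,
}

/// Hypothetical per-candidate state; not chardetng's real data structure.
#[derive(Debug)]
struct Candidate {
    score: i64,
    disqualified: bool,
}

// Assumed values for illustration only.
const CORE_SCORE: i64 = 10;
const EXTENSION_PENALTY: i64 = -50;

impl Candidate {
    fn new() -> Self {
        Candidate { score: 0, disqualified: false }
    }

    /// Update the candidate for one unit of input.
    fn feed(&mut self, unit: Unit) {
        if self.disqualified {
            return;
        }
        match unit {
            Unit::Core => self.score += CORE_SCORE,
            // Tolerated: the candidate stays alive but loses score.
            Unit::Extension => self.score += EXTENSION_PENALTY,
            // Still fatal: the candidate is removed from consideration.
            Unit::Invalid => self.disqualified = true,
        }
    }
}

/// Returns the final score, or `None` if the candidate was disqualified.
fn score_units(units: &[Unit]) -> Option<i64> {
    let mut candidate = Candidate::new();
    for &unit in units {
        candidate.feed(unit);
    }
    if candidate.disqualified {
        None
    } else {
        Some(candidate.score)
    }
}
```

With these assumed numbers, twenty core units followed by one extension unit still leave the candidate with a positive score, whereas a single structurally invalid unit removes it regardless of the other content.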

Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8c8448138787
Make the encoding detector tolerate extensions to legacy CJK encodings. r=emk
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 89 Branch