Open Bug 1702246 Opened 11 days ago Updated 2 days ago

Tolerate unmapped Big5 byte sequences in chardetng

Categories

(Core :: Internationalization, defect, P3)

ASSIGNED

People

(Reporter: hsivonen, Assigned: hsivonen)

Details

Attachments

(1 file)

Telemetry suggests that users in Taiwan and Hong Kong experience misdetection more often than the global average, which in turn indicates that chardetng is rejecting Big5-ish input.

The telemetry results are shown at: https://hsivonen.fi/encoding-telemetry/

Severity: -- → S3
Priority: -- → P3

This patch tries to address the issue that legacy CJK encodings have various
extended variants where the core of the encoding is compatible but the edges
are incompatible. Without this patch, we reject e.g. Big5 if the input contains
a single character from the UAO extension or a single Windows end-user-defined
character.

Likewise for the other legacy CJK encodings.

This patch tolerates:

  • All Big5 extensions (the motivating part of this patch).
  • Windows EUDC for EUC-KR.
  • Classic Mac OS extensions to Shift_JIS, EUC-KR, GBK, and Big5 to the
    extent practical considering conflicting definitions of what constitutes
    a lead byte in the Encoding Standard but a single-byte extension in
    Classic Mac OS.
  • JIS X 0213 / 2004 extensions to Shift_JIS and EUC-JP. (It's unclear if
    these have actual deployment.)

Tolerating means that the occurrence of an extension character doesn't
disqualify a candidate but only applies a penalty to the pending score.
If there is enough other convincing content, it should be able to overcome
the penalty.
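
The difference between disqualifying and penalizing can be sketched as follows. This is only an illustration, not chardetng's actual code: the `Unit` enum, the two scoring functions, and the numeric constants are all hypothetical, and it is an assumption of the sketch that a byte sequence that fits no variant of the encoding still disqualifies the candidate.

```rust
/// Hypothetical classification of one decoded unit for a single candidate encoding.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Unit {
    /// Mapped by the core of the encoding.
    Mapped,
    /// Structurally plausible, but only defined by an extension (e.g. UAO or EUDC).
    Extension,
    /// Cannot occur in any variant of the encoding (assumed to still disqualify).
    Malformed,
}

// Illustrative values only; the real weights are not stated in this bug.
const MAPPED_SCORE: i64 = 10;
const EXTENSION_PENALTY: i64 = -50;

/// Behavior without the patch: a single extension character rejects the candidate.
/// `None` means the candidate is disqualified.
fn score_strict(units: &[Unit]) -> Option<i64> {
    let mut score = 0;
    for unit in units {
        match unit {
            Unit::Mapped => score += MAPPED_SCORE,
            Unit::Extension | Unit::Malformed => return None,
        }
    }
    Some(score)
}

/// Behavior with the patch: an extension character only lowers the score,
/// so enough other convincing content can outweigh it.
fn score_tolerant(units: &[Unit]) -> Option<i64> {
    let mut score = 0;
    for unit in units {
        match unit {
            Unit::Mapped => score += MAPPED_SCORE,
            Unit::Extension => score += EXTENSION_PENALTY,
            Unit::Malformed => return None,
        }
    }
    Some(score)
}
```

With these made-up weights, twenty core characters followed by one extension character are rejected by `score_strict` but still score positively under `score_tolerant`, while an input consisting of a lone extension character ends up with a negative score.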

Assignee: nobody → hsivonen
Status: NEW → ASSIGNED