Closed Bug 1613861 Opened 6 years ago Closed 6 years ago

Upgrade chardetng to 0.1.4

Categories

(Core :: Internationalization, enhancement, P3)

RESOLVED DUPLICATE of bug 1615836

People

(Reporter: hsivonen, Assigned: hsivonen)

References

Details

chardetng 0.1.4 moves Estonian from the Baltic language model to the Western language model. This improves detection accuracy for title-length Estonian text by about 15 percentage points, from around 75% to around 90% (considering only titles in the Estonian Wikipedia that have at least one non-ASCII character).

All legacy encodings used for Estonian have the Estonian non-ASCII vowels in the same byte positions. Two non-ASCII consonants, š and ž, are officially part of the orthography but are used only in recent loan words and transliterations (recent meaning from the last 100 years or so), and their byte positions differ between legacy encodings. This change accepts breaking the relatively rare instances of those consonants (on the order of zero or one occurrence per page) in non-windows-1252 cases in exchange for better overall accuracy for Estonian text. (Apart from those two consonants, Estonian has no non-ASCII commonality with Lithuanian and Latvian. Estonian non-ASCII vowels do have commonality with German, Finnish, Swedish, and Portuguese.)
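The byte-position claim above can be checked with a short Python sketch. It is an illustration only and not part of chardetng: the list of candidate encodings (windows-1252, windows-1257, ISO-8859-13, ISO-8859-4, written as Python codec names) is an assumption, since the bug does not enumerate them, and the helper names `byte_positions` and `is_stable` are hypothetical.

```python
# Illustration only; not part of chardetng.
# ASSUMPTION: this list of legacy encodings is chosen for the example;
# the bug report does not name the encodings it has in mind.
ENCODINGS = ["cp1252", "cp1257", "iso8859_13", "iso8859_4"]

VOWELS = "äöüõÄÖÜÕ"      # Estonian non-ASCII vowels
CONSONANTS = "šžŠŽ"      # loan-word consonants discussed in the bug


def byte_positions(ch):
    """Map each encoding name to the single byte value that encodes ch."""
    return {enc: ch.encode(enc)[0] for enc in ENCODINGS}


def is_stable(ch):
    """True if ch is encoded as the same byte in every listed encoding."""
    return len(set(byte_positions(ch).values())) == 1


if __name__ == "__main__":
    for ch in VOWELS + CONSONANTS:
        row = ", ".join(
            f"{enc}=0x{b:02X}" for enc, b in byte_positions(ch).items()
        )
        print(f"{ch}: {row} stable={is_stable(ch)}")
```

Under these assumptions, every vowel reports a single shared byte value, while š and ž land on different bytes depending on the encoding. That is why misdetecting the encoding as windows-1252 leaves the vowels intact and only affects the two consonants.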

Priority: -- → P3
Blocks: 1615836

Duplicating against a bug that explains symptoms.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE