Upgrade chardetng to 0.1.4
Categories
(Core :: Internationalization, enhancement, P3)
Tracking
()
People
(Reporter: hsivonen, Assigned: hsivonen)
References
Details
chardetng 0.1.4 moves Estonian from the Baltic language model to the Western language model. This improves title-length Estonian detection accuracy by about 15 percentage points, from around 75% to around 90% (measured only on Estonian Wikipedia titles that contain at least one non-ASCII character).
All legacy encodings used for Estonian have the Estonian non-ASCII vowels in the same byte positions. Two non-ASCII consonants, š and ž, are officially part of the orthography but are used only in recent loan words and transliterations (recent as in from the last 100 years or so), and their byte positions differ between legacy encodings. This change accepts breaking the relatively rare instances of those consonants (on the order of zero or one occurrence per page) in non-windows-1252 cases in exchange for better overall accuracy for Estonian text. (Apart from those two consonants, Estonian has no non-ASCII commonality with Lithuanian and Latvian. Estonian non-ASCII vowels do have commonality with German, Finnish, Swedish, and Portuguese.)
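The byte-position argument can be checked with a short Python sketch. The list of encodings is an assumption chosen for illustration (the report does not enumerate them), and `byte_positions`, `is_stable`, and `misread` are hypothetical helper names. The sketch uses Python's built-in codecs, not chardetng itself.

```python
# Encodings picked for illustration only (an assumption, not taken from the report):
# two Western ones and three Baltic ones.
ENCODINGS = ["cp1252", "iso8859_15", "cp1257", "iso8859_13", "iso8859_4"]

ESTONIAN_VOWELS = "äöõüÄÖÕÜ"
LOAN_CONSONANTS = "šžŠŽ"


def byte_positions(ch):
    """Map each encoding name to the single byte that encodes `ch`."""
    return {enc: ch.encode(enc)[0] for enc in ENCODINGS}


def is_stable(ch):
    """True if `ch` sits at the same byte position in every listed encoding."""
    return len(set(byte_positions(ch).values())) == 1


def misread(text, actual="cp1257", guessed="cp1252"):
    """Show what a reader sees when `actual` bytes are decoded as `guessed`."""
    return text.encode(actual).decode(guessed)


# Print one row per character with its byte in each encoding.
for ch in ESTONIAN_VOWELS + LOAN_CONSONANTS:
    row = "  ".join(f"{enc}=0x{b:02X}" for enc, b in byte_positions(ch).items())
    print(ch, "stable " if is_stable(ch) else "differs", row)
```

Under these assumptions, `misread` illustrates the trade-off: vowels survive a windows-1257 page being decoded as windows-1252, while š and ž come out as other letters.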
Updated•6 years ago
Comment 1•6 years ago
Duplicating against a bug that explains symptoms.