Closed Bug 715967 Opened 13 years ago Closed 9 years ago

Universal charset detection wrong on several single byte encodings

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking


RESOLVED WORKSFORME

People

(Reporter: jf, Assigned: smontagu)

Details

Attachments

(1 file)

Single-byte encodings like iso-8859-2 (e.g. Hungarian, Czech) or iso-8859-9 (Turkish) are detected as windows-1252/iso-8859-1. For an example, see the attached Hungarian file. The lack of development of single-byte encoding identification is actually stated in the "Future work" section of the original article describing the identification method: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

The reasons the current code does not work well are of two kinds:

* A number of small bugs in the code (especially an off-by-one error while processing the 2-char distribution tables).
* An insufficient number of language/encoding models, probably due to lack of time when the module was initially developed, but also to the poor identification results caused by the above bug. For example, the Hungarian model is present in the code but commented out; if it is uncommented, it tends to "capture" most western European language/encodings.

I can propose a few very small patches to fix the bugs, which are almost all of the "typo" kind, and some bigger patches that add language models to improve the precision of the results. I've set up a (hopefully temporary) bitbucket repository to hold the patches. The main, small fix is here: https://bitbucket.org/medoc/uchardet-enhanced/changeset/531748aa5a49

If someone is interested in integrating/testing this, please contact me for a commented list of the commits relevant to the Mozilla code (and also the surrounding test/table-generation code if needed).
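To picture the class of bug described above, here is a hedged toy sketch of a bigram-frequency single-byte prober. The names (`score`, `SAMPLE_SIZE`), the frequency threshold, and the table layout are assumptions for illustration only, not Mozilla's actual prober code:

```python
# Hypothetical sketch of a bigram-frequency single-byte prober. The key
# point: the 2-char distribution table is a flattened SAMPLE_SIZE x
# SAMPLE_SIZE array, so an off-by-one in the index computation usually
# stays inside the array -- an upper-bound index check never fires, and
# the lookups are just silently shifted.

SAMPLE_SIZE = 64  # number of "frequent character" classes a model tracks

def score(text, char_to_order, bigram_freq):
    """Return the fraction of observed bigrams the model rates as frequent.

    char_to_order maps characters to frequency ranks (< SAMPLE_SIZE for
    frequent characters); bigram_freq is a flattened SAMPLE_SIZE x
    SAMPLE_SIZE table of bigram frequency classes.
    """
    seen = likely = 0
    prev_order = SAMPLE_SIZE  # sentinel: no previous frequent character yet
    for ch in text:
        order = char_to_order.get(ch, SAMPLE_SIZE)
        if prev_order < SAMPLE_SIZE and order < SAMPLE_SIZE:
            seen += 1
            # Correct flattened index into the 2-char distribution table.
            # A variant such as `prev_order * SAMPLE_SIZE + order - 1`
            # would still be in range for most inputs, matching the
            # "no array overflow" behaviour noted in the report.
            if bigram_freq[prev_order * SAMPLE_SIZE + order] >= 3:
                likely += 1
        prev_order = order
    return likely / seen if seen else 0.0
```

Because the shifted index remains valid, the bug degrades scores for every single-byte model rather than crashing, which is consistent with the detector quietly preferring windows-1252.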
Are these changes needed to improve Web compatibility? If not, we're better off not taking them, since if our detection gets better, fewer pages will have their encoding labeled, and then we (and other browsers) will need the more complicated behavior more often.
Firefox currently includes automatic character-set detection code (prominently reachable from menu entries). This code is buggy (IMHO) and does not do what it claims to do. My general view is that the code should be either fixed or removed. I am no judge of the Mozilla situation though, and I can marginally understand the argument for keeping remedial code defective in the hope that this will force people to publish better pages. There is a lot of old data on the web, though.

As an "amusing" example: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html This page is encoded in windows-1252 (correctly declared by a "content-type" inside the HTML), but for some reason the Mozilla server sends it as utf-8, so it is displayed improperly. I think the autodetection doesn't even get a chance to run in this case, so it's an improper example, but you get the general idea: the web is not going to become clean any day soon.
> Are these changes needed to improve Web compatibility?

For our default settings, probably not, since only zh-TW turns on the Universal detector by default. (Most likely zh-TW shouldn't be doing that: it should probably default to Traditional Chinese, or we should introduce a detector that handles both Simplified and Traditional Chinese and make zh-TW default to that.)

Maybe as a first step, we should rename the UI option from "Universal" to "All of the Above", since "Universal" should be chosen by people who read e.g. both Korean and Japanese sites, but shouldn't be chosen by users who read Hungarian or Czech sites.

> I've set up a (hopefully temporary) bitbucket repository to hold the patches. The
> main and small fix is here:
> https://bitbucket.org/medoc/uchardet-enhanced/changeset/531748aa5a49

I get an access denied error on the bitbucket URL.
Sorry about the bitbucket problem, the repository was private. The following link should work now: https://bitbucket.org/medoc/uchardet-enhanced/overview
I realized that the change log was difficult to use because most commits deal with the test data/scripts and training data. Here follows a list of all the changes that affect Mozilla code:

https://bitbucket.org/medoc/uchardet-enhanced/changeset/531748aa5a49
This fixes the main issue for single-byte encodings: an off-by-one error while processing the statistics tables (which did not result in an array overflow, because of the high-index checking).

https://bitbucket.org/medoc/uchardet-enhanced/changeset/c9cdedf5d1c2
This adds statistics tables for more language/charset pairs, modifies the tables for some of the existing ones, and adds a language-name field to all tables for diagnostic purposes.

https://bitbucket.org/medoc/uchardet-enhanced/changeset/697a65a826e3
This integrates the "language name" change and the new tables with the rest of the code, and implements an option to filter out (or not) ASCII characters from the input data. The option was present in the original code but not fully implemented.

https://bitbucket.org/medoc/uchardet-enhanced/changeset/b912dacb3716
This changes the utf-8 detector to declare an error (i.e. stop being a candidate) on the first decoding error. Previously, it declared success after decoding several multibyte sequences without taking any errors into account, and it was quite possible to obtain false positives. Incidentally, the same change was independently implemented by the author of the "chardet" Python version of the detector.

https://bitbucket.org/medoc/uchardet-enhanced/changeset/913ea06b2583
This fixes a typo/bug in the top driver: missing parentheses between an & and an && expression.
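The utf-8 detector change amounts to failing fast: one invalid byte sequence disqualifies utf-8, instead of being ignored while "successful" multibyte sequences accumulate toward a false positive. Below is a hedged Python sketch of that idea; the function and state names are assumptions, not the actual Mozilla or chardet code, and for brevity this simplified check does not reject overlong forms or surrogates, which full UTF-8 validation must.

```python
# Hypothetical fail-fast UTF-8 prober: return NOT_ME on the first
# decoding error, instead of declaring success after "enough" valid
# multibyte sequences regardless of errors seen along the way.

PROBING, FOUND_IT, NOT_ME = "probing", "found-it", "not-me"

def probe_utf8(data: bytes, needed_multibyte: int = 6) -> str:
    multibyte = 0
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:               # plain ASCII: no evidence either way
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:      # lead byte of a 2-byte sequence
            n = 1
        elif 0xE0 <= lead <= 0xEF:    # lead byte of a 3-byte sequence
            n = 2
        elif 0xF0 <= lead <= 0xF4:    # lead byte of a 4-byte sequence
            n = 3
        else:
            return NOT_ME             # invalid lead byte: fail immediately
        cont = data[i + 1:i + 1 + n]
        if len(cont) < n or any(not 0x80 <= c <= 0xBF for c in cont):
            return NOT_ME             # first decoding error: drop out
        multibyte += 1
        i += 1 + n
    return FOUND_IT if multibyte >= needed_multibyte else PROBING
```

With this behaviour, a stray latin-1 byte such as 0xE9 followed by ASCII immediately disqualifies utf-8, rather than being outvoted by valid sequences elsewhere in the buffer.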
Resolved by removing the "universal" detector.
Status: UNCONFIRMED → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
