Closed
Bug 715967
Opened 13 years ago
Closed 9 years ago
Universal charset detection wrong on several single byte encodings
Categories
(Core :: Internationalization, defect)
Tracking
()
RESOLVED
WORKSFORME
People
(Reporter: jf, Assigned: smontagu)
Details
Attachments
(1 file)
15.45 KB,
text/plain
Single-byte encodings such as iso-8859-2 (e.g. Hungarian, Czech) or iso-8859-9 (Turkish) are detected as windows-1252/iso-8859-1. For an example, see the attached Hungarian file.
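To make the symptom concrete, here is a minimal Python sketch of what a wrong detection result does to the text. The sample word is an assumption chosen for illustration; it is not taken from the attached file.

```python
# Hungarian sample word chosen for illustration only (not from the attachment).
text = "erdő"  # contains U+0151, which windows-1252 cannot represent

raw = text.encode("iso-8859-2")         # bytes as an iso-8859-2 page would send them
as_latin2 = raw.decode("iso-8859-2")    # correct interpretation
as_cp1252 = raw.decode("windows-1252")  # interpretation after a wrong detection

print(as_latin2)
print(as_cp1252)
```

The same byte (0xF5) is a valid letter in both encodings, so nothing fails outright; only the language statistics can tell the two interpretations apart.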
The lack of development of single byte encoding identification is actually stated in the "Future work" section of the original article describing the identification method:
http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
The reasons for which the current code is not working well are of two kinds:
* A number of small bugs in the code (especially an off-by-one error while processing the 2-char distribution tables).
* An insufficient number of language/encoding models, probably due to lack of time when the module was initially developed, but also to the lack of good identification results due to the above bug. For example, the Hungarian model is present in the code but commented out. If it is uncommented, it tends to "capture" most Western European language/encodings.
I can propose a few very small patches to fix the bugs, which are all almost of the "typo" kind, and some bigger patches that add language models to enhance the precision of the results.
I've set up a (hopefully temporary) bitbucket repository to hold the patches. The main and small fix is here:
https://bitbucket.org/medoc/uchardet-enhanced/changeset/531748aa5a49
If someone is interested in integrating / testing this please contact me for a commented list of the commits relevant to the mozilla code (and also the surrounding test/table generation code if needed).
Are these changes needed to improve Web compatibility? If not, we're better off not taking them, since if our detection gets better, fewer pages will have their encoding labeled, and then we (and other browsers) will need the more complicated behavior more often.
Comment 2 • 13 years ago (Reporter)
Firefox currently includes automatic character set detection code (prominently reachable from menu entries). This code is buggy (IMHO), and does not do what it claims to do. My general view is that the code should be either fixed or removed.
I am no judge of the Mozilla situation though, and I can marginally understand the argument for keeping remedial code defective in the hope that it will force people to publish better pages.
There is a lot of old data on the web though. As an "amusing" example:
http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
This page is encoded in windows-1252 (correctly described by a "content-type" inside the HTML), but for some reason the mozilla server sends it as utf-8, so it is displayed improperly. I think the autodetection doesn't even get a chance to run in this case, so it's an imperfect example, but you get the general idea: the web is not going to become clean any time soon.
Comment 3 • 13 years ago
> Are these changes needed to improve Web compatibility?
For our default settings, probably not, since only zh-TW turns on the Universal detector by default. (Most likely zh-TW shouldn't be doing that. It should probably default to Traditional Chinese or we should introduce a detector that handles both Simplified and Traditional Chinese and make zh-TW default to that.)
Maybe as a first step, we should rename the UI option from "Universal" to "All of the Above", since "Universal" should be chosen by people who read e.g. both Korean and Japanese sites but shouldn't be chosen by users who read Hungarian or Czech sites.
> I've set up a (hopefully temporary) bitbucket repository to hold the patches. The
> main and small fix is here:
> https://bitbucket.org/medoc/uchardet-enhanced/changeset/531748aa5a49
I get an access denied error on the bitbucket URL.
Comment 4 • 13 years ago (Reporter)
Sorry about the bitbucket problem, the repository was private. The following link should work now:
https://bitbucket.org/medoc/uchardet-enhanced/overview
Comment 5 • 13 years ago (Reporter)
I realized that the change log was difficult to use because most commits deal
with the test data/scripts and training data. Here follows a list of all
the changes that affect mozilla code:
https://bitbucket.org/medoc/uchardet-enhanced/changeset/531748aa5a49
This fixes the main issue for single-byte encodings: off by one error while
processing the statistics tables (which did not result in an array
overflow because of high index checking).
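As a rough illustration of how such a bug can stay hidden, here is a hypothetical Python sketch (not the Mozilla code; the table size, table values, and function names are all invented). A flattened pair table is indexed one cell too far, and a bounds check keeps the access in range, so nothing crashes and the statistics are silently wrong.

```python
SAMPLE_SIZE = 4  # hypothetical; real models use a much larger character set

# Flattened SAMPLE_SIZE x SAMPLE_SIZE table giving a likelihood class (0..3)
# for each (previous, current) pair of 0-based frequency ranks. Made-up values.
PRECEDENCE = [
    3, 0, 1, 2,
    0, 3, 2, 1,
    1, 2, 3, 0,
    2, 1, 0, 3,
]

def lookup_correct(prev, cur):
    return PRECEDENCE[prev * SAMPLE_SIZE + cur]

def lookup_off_by_one(prev, cur):
    # Hypothetical bug: the index is shifted by one cell. The bounds check
    # prevents an overflow, so the wrong cell is read without any error.
    index = prev * SAMPLE_SIZE + cur + 1
    if index >= len(PRECEDENCE):
        return None  # the pair is silently dropped
    return PRECEDENCE[index]

def class_counts(ranks, lookup):
    # Count how many adjacent pairs fall into each likelihood class.
    counts = [0, 0, 0, 0]
    for prev, cur in zip(ranks, ranks[1:]):
        category = lookup(prev, cur)
        if category is not None:
            counts[category] += 1
    return counts

ranks = [0, 0, 1, 1, 2, 2, 3, 3]  # invented input, already mapped to ranks
print(class_counts(ranks, lookup_correct))
print(class_counts(ranks, lookup_off_by_one))
```

With the shifted index, pairs that the model rates as very likely are counted in other classes, which is enough to make a correct language model lose against a wrong one.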
https://bitbucket.org/medoc/uchardet-enhanced/changeset/c9cdedf5d1c2
This adds statistics tables for more language/charset pairs, modifies the
tables for some of the existing, and adds a language name field to all
tables for diagnostic purposes.
https://bitbucket.org/medoc/uchardet-enhanced/changeset/697a65a826e3
This integrates the "language name" change to the tables and the new
tables with the rest of the code and implements an option to filter out
(or not) ascii characters from the input data. The option was present in
the original code but not fully implemented.
https://bitbucket.org/medoc/uchardet-enhanced/changeset/b912dacb3716
This changes the utf-8 detector to declare an error (not be a candidate any
more) on the first decoding error. Previously, it declared success after
decoding several multibyte sequences without taking any error into account,
and it was quite possible to obtain false positives.
By the way, the same change was independently implemented by the author of
the "chardet" Python version of the detector.
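The behaviour described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the patch itself: the class name `Utf8Candidate` and its `feed` method are hypothetical, and the strict validation is delegated to the standard library's incremental UTF-8 decoder.

```python
import codecs

class Utf8Candidate:
    """Hypothetical prober: remains a candidate only while input decodes strictly."""

    def __init__(self):
        # The incremental decoder buffers a multibyte sequence split across chunks.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
        self.alive = True

    def feed(self, chunk: bytes) -> bool:
        if self.alive:
            try:
                self._decoder.decode(chunk)
            except UnicodeDecodeError:
                # First decoding error: stop being a candidate for good.
                self.alive = False
        return self.alive

prober = Utf8Candidate()
split_ok = prober.feed(b"\xc5") and prober.feed(b"\x91")  # one character in two chunks
after_bad = prober.feed(b"erd\xf5 ")                      # iso-8859-2 bytes, invalid UTF-8
after_more = prober.feed(b"plain ascii")                  # later valid input does not revive it
```

Rejecting on the first error is strict by design: a single-byte page containing accented letters almost never forms valid UTF-8 sequences throughout, so this removes the false positives mentioned above.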
https://bitbucket.org/medoc/uchardet-enhanced/changeset/913ea06b2583
This fixes a typo/bug in the top driver: missing parentheses in an
expression that mixes & and &&.
Comment 6 • 9 years ago
Resolved by removing the "universal" detector.
Status: UNCONFIRMED → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME