Closed Bug 844115 Opened 7 years ago Closed 6 years ago

Remove the "universal" encoding detector (chardet)

Categories

(Core :: Internationalization, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla35

People

(Reporter: hsivonen, Assigned: hsivonen)

References

Details

Attachments

(1 file, 3 obsolete files)

The "universal" encoding detector isn't really universal. See bug 265030 and bug 715967. Therefore, it is mis-advertised. Moreover, if the Traditional Chinese locale switches away from the "universal" detector, it will no longer be enabled anywhere by default. We still have the Japanese, Russian etc. detectors that address common use cases.

I think it's not worth the engineering effort to actually make the "universal" detector universal and the fact that we have long-standing unfixed bugs indicate that others don't see the work having high enough reward to do the work. Moreover, I think we shouldn't put more engineering effort into the "universal" detector, because we should encourage authoring practices that involve using labeled UTF-8 instead of reliance on heuristics. While it's unlikely that we would achieve interoperability between heuristic detectors across browsers, if we do put some engineering effort in this area, we should instead put it into figuring out what exactly WebKit detects, which I believe is less than what Gecko detects, and make Gecko detects no more.

Since it doesn't make sense to make the "universal" detector actually universal and as it currently exists it probably has more potential for confusion than for solving problems, I think we should remove the "universal" detector.
Depends on: 848842
Depends on: 849113
Attached patch Part 1: Remove constructors (obsolete) — Splinter Review
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
It seems to me that the only thing removed here that *might* be of value is the visual vs. logical Hebrew detector. If there is a true need to keep that, we should have a Hebrew detector labeled as a Hebrew detector instead of a Hebrew detector oversold as "universal".

(We have a separate Russian detector anyway, and bug reports suggest that the Hungarian detector does not actually work. There isn't much point in Thai detection, since Thai has only one legacy encoding.)
smontagu, do you have data or anecdotes about the need to keep the Hebrew prober around?
Bug 939311 suggests that Visual Hebrew isn't that common anymore.
With these short UTF-8 text files in Russian and in Spanish, the universal detector is the only helpful one (unless you choose to detect Unicode yourself).


A .txt file created with a UTF-8 locale containing *either*
Russian text from https://mozilla-russia.org/:
----
Mozilla Firefox — браузер нового поколения от Mozilla Foundation. Простой и лаконичный интерфейс позволяет освоить программу за несколько минут. Безопасность, высокая скорость работы, гибкость и расширяемость — основные качества, присущие Mozilla Firefox.
----

or Spanish text from http://www.mozilla-hispano.org/ (http://www.mozilla-hispano.org/aplicaciones-para-la-copa-del-mundo-seleccion-de-los-editores/):
----
El mundial de fútbol empezó esta semana y acá podrás encontrar aplicaciones que te ayudarán a disfrutar más de esta apasionante competencia deportiva.
----


2013-01-01-03-08-38-mozilla-central-firefox-20.0a1.en-US.linux-x86_64
by default, chooses Western (ISO-8859-1)
Auto-Detect Russian — Cyrillic (ISO-8859-5)
Auto-Detect Universal — UTF-8.

2014-06-26-03-02-01-mozilla-central-firefox-33.0a1.ru.linux-x86_64
by default (Auto-Detect Russian), Cyrillic (ISO)
Auto-Detect Off — Cyrillic (Windows)
Auto-Detect Japanese — UTF-8.

2014-06-26-03-02-01-mozilla-central-firefox-33.0a1.en-US.linux-x86_64
by default (Auto-Detect Off), Western
the rest is same as with the Russian build.

Firefox 30.0 behaves same or similar to the nightly.
(In reply to [:Aleksej] from comment #6)
> in Russian and in Spanish, the universal detector is the only helpful
> one (unless you choose to detect Unicode yourself).

(or choose Japanese, which is neither Russian, nor Spanish)
The Russian and Ukrainian detectors only try to detect legacy single-byte encodings. It's quite possible that they do more harm than good. Evaluating that is bug 845791.
https://tbpl.mozilla.org/?tree=Try&rev=ae1abb5fe7af

This patch removes the detectors that are no longer offered in the UI.

I haven't seen any complaints about the removal of Chinese, Korean and CJK-combo detectors from the UI. I have seen one complaint about the UI removal of Universal in the sense of general opposition to removal of features and I've seen two complaints that really boiled down to requesting UTF-8 detection for file: URLs.

Thunderbird has also removed UI for the detectors that no longer have UI in Firefox.

At this point, I think we should remove the detectors that are no longer offered in the UI and, therefore, are dead code and debate whether we should add UTF-8 detection for file: URLs (using a different mechanism) in another bug.

This patch basically turns the extensions/universalchardet/ into a Japanese detector only without renaming the directory. If renaming the directory is considered important, I think it makes sense to do that in another patch in order to avoid mixing file changes and moves in one patch.
Attachment #722802 - Attachment is obsolete: true
Attachment #722803 - Attachment is obsolete: true
Now with adjusted tests:
https://tbpl.mozilla.org/?tree=Try&rev=3c2107a75498
Attachment #8493640 - Attachment is obsolete: true
Attachment #8493696 - Flags: review?(VYV03354)
Attachment #8493696 - Flags: review?(VYV03354) → review+
https://hg.mozilla.org/mozilla-central/rev/2f0c68d4d7ca
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
(In reply to Henri Sivonen (:hsivonen) from comment #9)
> At this point, I think we should remove the detectors that are no longer
> offered in the UI and, therefore, are dead code and debate whether we should
> add UTF-8 detection for file: URLs (using a different mechanism) in another
> bug.

The detector you deleted was used in Thunderbird in a pretty major place: <http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.cpp#2311>.

Now, I'll buy that this is probably not the right thing to be using, but you did break at least 2 tests in comm-central as a result, which did rely on us sniffing Shift-JIS versus not-Shift-JIS via the detector.
(In reply to Joshua Cranmer [:jcranmer] from comment #13)
> (In reply to Henri Sivonen (:hsivonen) from comment #9)
> > At this point, I think we should remove the detectors that are no longer
> > offered in the UI and, therefore, are dead code and debate whether we should
> > add UTF-8 detection for file: URLs (using a different mechanism) in another
> > bug.
> 
> The detector you deleted was used in Thunderbird in a pretty major place:
> <http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.
> cpp#2311>.
> 
> Now, I'll buy that this is probably not the right thing to be using, but you
> did break at least 2 tests in comm-central as a result, which did rely on us
> sniffing Shift-JIS versus not-Shift-JIS via the detector.

I concede that I only looked at the TB detector UI having changed to match Firefox's and failed to search c-c for other uses. Sorry about that. Still, I believe the removal was in compliance with the m-c tree rules and should stick, even though the tone of the above comment suggests that it was not OK to break c-c tests.

Is the use case coming up with a charset parameter for attached text/plain files? If so, using the "universal" detector like that is a very bad idea. Since the "universal" detector wasn't actually universal, it would only work for some locales but not others. This is not cool if you are user whose locale isn't covered by "universal". That developers go "oh, let's use the universal detector" without realizing that "universal" wasn't actually universal is a major reason why I wanted to remove it. On the m-c side, this happened with the File API (in violation of the File API spec). 

Even if one considers guessing a good idea in general (I don't), generating explicit metadata from such a guess that's known not to cover the full guessing spectrum is harmful, because then the recipient is denied the opportunity of making a better guess.

So I think using the detector like this was a bad idea to begin with, but c-c is, of course, free to import it into c-c. (Potentially duplicating the Japanese part, I suppose.)

(However, in the specific case of UTF-8, it may be a good idea to check if the file is valid UTF-8 as a whole and then label it charset=utf-8 if it is.)
Blocks: 1074585
No longer blocks: 1074585
Depends on: 1074585
Depends on: 1081693
Duplicate of this bug: 608609
You need to log in before you can comment on or make changes to this bug.