Closed Bug 844115 Opened 12 years ago Closed 10 years ago

Remove the "universal" encoding detector (chardet)

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla35

People

(Reporter: hsivonen, Assigned: hsivonen)

References

Details

Attachments

(1 file, 3 obsolete files)

Part 1: Remove constructors 12 years ago Henri Sivonen (:hsivonen) 7.27 KB, patch		Details \| Diff \| Splinter Review
Part 2: Remove the non-CJK probers 12 years ago Henri Sivonen (:hsivonen) 105.30 KB, patch		Details \| Diff \| Splinter Review
Remove detectors that are no longer offered in the UI 10 years ago Henri Sivonen (:hsivonen) 366.83 KB, patch		Details \| Diff \| Splinter Review
Remove detectors that are no longer offered in the UI, v2 10 years ago Henri Sivonen (:hsivonen) 390.10 KB, patch	emk : review+	Details \| Diff \| Splinter Review

Henri Sivonen (:hsivonen)

Assignee

Description

•

12 years ago

The "universal" encoding detector isn't really universal. See bug 265030 and bug 715967. Therefore, it is mis-advertised. Moreover, if the Traditional Chinese locale switches away from the "universal" detector, it will no longer be enabled anywhere by default. We still have the Japanese, Russian etc. detectors that address common use cases. I think it's not worth the engineering effort to actually make the "universal" detector universal and the fact that we have long-standing unfixed bugs indicate that others don't see the work having high enough reward to do the work. Moreover, I think we shouldn't put more engineering effort into the "universal" detector, because we should encourage authoring practices that involve using labeled UTF-8 instead of reliance on heuristics. While it's unlikely that we would achieve interoperability between heuristic detectors across browsers, if we do put some engineering effort in this area, we should instead put it into figuring out what exactly WebKit detects, which I believe is less than what Gecko detects, and make Gecko detects no more. Since it doesn't make sense to make the "universal" detector actually universal and as it currently exists it probably has more potential for confusion than for solving problems, I think we should remove the "universal" detector.

Henri Sivonen (:hsivonen)

Assignee

Updated

•

12 years ago

Depends on: 848842

Henri Sivonen (:hsivonen)

Assignee

Updated

•

12 years ago

Depends on: 849113

Henri Sivonen (:hsivonen)

Assignee

Comment 1

•

12 years ago

Attached patch Part 1: Remove constructors (obsolete) — Details — Splinter Review

Assignee: nobody → hsivonen

Status: NEW → ASSIGNED

Henri Sivonen (:hsivonen)

Assignee

Comment 2

•

12 years ago

Attached patch Part 2: Remove the non-CJK probers (obsolete) — Details — Splinter Review

Henri Sivonen (:hsivonen)

Assignee

Comment 3

•

12 years ago

It seems to me that the only thing removed here that *might* be of value is the visual vs. logical Hebrew detector. If there is a true need to keep that, we should have a Hebrew detector labeled as a Hebrew detector instead of a Hebrew detector oversold as "universal". (We have a separate Russian detector anyway, and bug reports suggest that the Hungarian detector does not actually work. There isn't much point in Thai detection, since Thai has only one legacy encoding.)

Henri Sivonen (:hsivonen)

Assignee

Comment 4

•

12 years ago

smontagu, do you have data or anecdotes about the need to keep the Hebrew prober around?

Henri Sivonen (:hsivonen)

Assignee

Comment 5

•

11 years ago

Bug 939311 suggests that Visual Hebrew isn't that common anymore.

[:Aleksej]

Comment 6

•

10 years ago

With these short UTF-8 text files in Russian and in Spanish, the universal detector is the only helpful one (unless you choose to detect Unicode yourself). A .txt file created with a UTF-8 locale containing *either* Russian text from https://mozilla-russia.org/: ---- Mozilla Firefox — браузер нового поколения от Mozilla Foundation. Простой и лаконичный интерфейс позволяет освоить программу за несколько минут. Безопасность, высокая скорость работы, гибкость и расширяемость — основные качества, присущие Mozilla Firefox. ---- or Spanish text from http://www.mozilla-hispano.org/ (http://www.mozilla-hispano.org/aplicaciones-para-la-copa-del-mundo-seleccion-de-los-editores/): ---- El mundial de fútbol empezó esta semana y acá podrás encontrar aplicaciones que te ayudarán a disfrutar más de esta apasionante competencia deportiva. ---- 2013-01-01-03-08-38-mozilla-central-firefox-20.0a1.en-US.linux-x86_64 by default, chooses Western (ISO-8859-1) Auto-Detect Russian — Cyrillic (ISO-8859-5) Auto-Detect Universal — UTF-8. 2014-06-26-03-02-01-mozilla-central-firefox-33.0a1.ru.linux-x86_64 by default (Auto-Detect Russian), Cyrillic (ISO) Auto-Detect Off — Cyrillic (Windows) Auto-Detect Japanese — UTF-8. 2014-06-26-03-02-01-mozilla-central-firefox-33.0a1.en-US.linux-x86_64 by default (Auto-Detect Off), Western the rest is same as with the Russian build. Firefox 30.0 behaves same or similar to the nightly.

[:Aleksej]

Comment 7

•

10 years ago

(In reply to [:Aleksej] from comment #6) > in Russian and in Spanish, the universal detector is the only helpful > one (unless you choose to detect Unicode yourself). (or choose Japanese, which is neither Russian, nor Spanish)

Henri Sivonen (:hsivonen)

Assignee

Comment 8

•

10 years ago

The Russian and Ukrainian detectors only try to detect legacy single-byte encodings. It's quite possible that they do more harm than good. Evaluating that is bug 845791.

Henri Sivonen (:hsivonen)

Assignee

Comment 9

•

10 years ago

Attached patch Remove detectors that are no longer offered in the UI (obsolete) — Details — Splinter Review

https://tbpl.mozilla.org/?tree=Try&rev=ae1abb5fe7af This patch removes the detectors that are no longer offered in the UI. I haven't seen any complaints about the removal of Chinese, Korean and CJK-combo detectors from the UI. I have seen one complaint about the UI removal of Universal in the sense of general opposition to removal of features and I've seen two complaints that really boiled down to requesting UTF-8 detection for file: URLs. Thunderbird has also removed UI for the detectors that no longer have UI in Firefox. At this point, I think we should remove the detectors that are no longer offered in the UI and, therefore, are dead code and debate whether we should add UTF-8 detection for file: URLs (using a different mechanism) in another bug. This patch basically turns the extensions/universalchardet/ into a Japanese detector only without renaming the directory. If renaming the directory is considered important, I think it makes sense to do that in another patch in order to avoid mixing file changes and moves in one patch.

Attachment #722802 - Attachment is obsolete: true

Attachment #722803 - Attachment is obsolete: true

Henri Sivonen (:hsivonen)

Assignee

Comment 10

•

10 years ago

Attached patch Remove detectors that are no longer offered in the UI, v2 — Details — Splinter Review

Now with adjusted tests: https://tbpl.mozilla.org/?tree=Try&rev=3c2107a75498

Attachment #8493640 - Attachment is obsolete: true

Attachment #8493696 - Flags: review?(VYV03354)

Masatoshi Kimura [:emk]

Updated

•

10 years ago

Attachment #8493696 - Flags: review?(VYV03354) → review+

Henri Sivonen (:hsivonen)

Assignee

Comment 11

•

10 years ago

Thanks. Landed: https://hg.mozilla.org/integration/mozilla-inbound/rev/2f0c68d4d7ca

Ryan VanderMeulen [:RyanVM]

Comment 12

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/2f0c68d4d7ca

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla35

Joshua Cranmer [:jcranmer]

Comment 13

•

10 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #9) > At this point, I think we should remove the detectors that are no longer > offered in the UI and, therefore, are dead code and debate whether we should > add UTF-8 detection for file: URLs (using a different mechanism) in another > bug. The detector you deleted was used in Thunderbird in a pretty major place: <http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.cpp#2311>. Now, I'll buy that this is probably not the right thing to be using, but you did break at least 2 tests in comm-central as a result, which did rely on us sniffing Shift-JIS versus not-Shift-JIS via the detector.

Henri Sivonen (:hsivonen)

Assignee

Comment 14

•

10 years ago

(In reply to Joshua Cranmer [:jcranmer] from comment #13) > (In reply to Henri Sivonen (:hsivonen) from comment #9) > > At this point, I think we should remove the detectors that are no longer > > offered in the UI and, therefore, are dead code and debate whether we should > > add UTF-8 detection for file: URLs (using a different mechanism) in another > > bug. > > The detector you deleted was used in Thunderbird in a pretty major place: > <http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils. > cpp#2311>. > > Now, I'll buy that this is probably not the right thing to be using, but you > did break at least 2 tests in comm-central as a result, which did rely on us > sniffing Shift-JIS versus not-Shift-JIS via the detector. I concede that I only looked at the TB detector UI having changed to match Firefox's and failed to search c-c for other uses. Sorry about that. Still, I believe the removal was in compliance with the m-c tree rules and should stick, even though the tone of the above comment suggests that it was not OK to break c-c tests. Is the use case coming up with a charset parameter for attached text/plain files? If so, using the "universal" detector like that is a very bad idea. Since the "universal" detector wasn't actually universal, it would only work for some locales but not others. This is not cool if you are user whose locale isn't covered by "universal". That developers go "oh, let's use the universal detector" without realizing that "universal" wasn't actually universal is a major reason why I wanted to remove it. On the m-c side, this happened with the File API (in violation of the File API spec). Even if one considers guessing a good idea in general (I don't), generating explicit metadata from such a guess that's known not to cover the full guessing spectrum is harmful, because then the recipient is denied the opportunity of making a better guess. So I think using the detector like this was a bad idea to begin with, but c-c is, of course, free to import it into c-c. (Potentially duplicating the Japanese part, I suppose.) (However, in the specific case of UTF-8, it may be a good idea to check if the file is valid UTF-8 as a whole and then label it charset=utf-8 if it is.)

Henri Sivonen (:hsivonen)

Assignee

Updated

•

10 years ago

Blocks: 1074585

Magnus Melin [:mkmelin]

Updated

•

10 years ago

No longer blocks: 1074585

Depends on: 1074585

Hiroyuki Ikezoe (:hiro)

Updated

•

10 years ago

Depends on: 1081693

You need to log in before you can comment on or make changes to this bug.