845791 - Gather telemetry about the necessity of the Russian and Ukrainian encoding detectors

Henri Sivonen (:hsivonen) (away from Bugzilla until 2025-10-27)

Reporter

Description

•

12 years ago

While there is still anecdotal evidence about heuristic encoding detection being necessary for the Japanese locale, the stream of similar anecdotes for the Russian and Ukrainian cases has dried up over the years. We should add telemetry to see how often the Russian and Ukrainian detectors detect a Russian or Ukrainian encoding other than the fallback encoding to see if we could make HTML parsing less magical for the Russian and Ukrainian locales.

Bablu Kumar

Comment 1

•

12 years ago

I want to work on this bug. I am well acquainted with C/C++. I think this first bug would be a great opportunity for me.

Henri Sivonen (:hsivonen) (away from Bugzilla until 2025-10-27)

Reporter

Comment 2

•

12 years ago

Chardet is initialized at https://mxr.mozilla.org/mozilla-central/source/parser/html/nsHtml5StreamParser.cpp#184 if it is initialized. Over there, you should store an enum on nsHtml5StreamParser to mark if the situation is interesting. The situation is interesting if:

1) The detector is set to the Russian detector *and* the fallback encoding (intl.charset.default) is set to the default fallback encoding for the Russian locale *and* we're not getting an HTTP charset or parent frame charset passed in.

OR

2) The detector is set to the Ukrainian detector *and* the fallback encoding (intl.charset.default) is set to the default fallback encoding for the Ukrainian locale *and* we're not getting an HTTP charset or parent frame charset passed in.

Otherwise, mark the case as not interesting.

Then, at the end of the parse, if the case was marked as being interesting and the charset source is kCharsetFromAutoDetection or lower, record telemetry about which of the following we are seeing:

a) Interest case 1 (Russian) and charset source is kCharsetFromAutoDetection and the charset is not the Russian default.

b) Interest case 1 otherwise.

c) Interest case 2 (Ukrainian) and charset source is kCharsetFromAutoDetection and the charset is not the Ukrainian default.

d) Interest case 2 otherwise.
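For anyone picking this up, the decision logic described in this comment could be sketched roughly as below. This is only an illustration under stated assumptions: every type and function name here (Detector, Interest, Source, Bucket, ClassifyAtStart, BucketAtEnd) is hypothetical and not an existing Gecko identifier, the Source enum is a simplified stand-in for the real charset source constants, and the actual telemetry recording call is left out.

```cpp
#include <optional>
#include <string_view>

// All names below are hypothetical; they are not Gecko identifiers.
enum class Detector { Off, Russian, Ukrainian, Other };
enum class Interest { NotInteresting, Russian, Ukrainian };

// Simplified stand-in for the charset source ordering. Only the position
// of AutoDetection relative to the other values matters in this sketch.
enum class Source { Fallback = 0, AutoDetection = 1, Higher = 2 };

enum class Bucket {
  RussianDetectedNonDefault,    // case a
  RussianOther,                 // case b
  UkrainianDetectedNonDefault,  // case c
  UkrainianOther                // case d
};

// Decide, when the detector is initialized, whether this parse is worth
// measuring. localeDefault is the default fallback encoding for the
// detector's locale; labels are compared exactly (an assumption).
Interest ClassifyAtStart(Detector detector, std::string_view fallback,
                         std::string_view localeDefault,
                         bool hasHttpOrParentCharset) {
  if (hasHttpOrParentCharset || fallback != localeDefault) {
    return Interest::NotInteresting;
  }
  switch (detector) {
    case Detector::Russian:
      return Interest::Russian;
    case Detector::Ukrainian:
      return Interest::Ukrainian;
    default:
      return Interest::NotInteresting;
  }
}

// Pick the telemetry bucket at the end of the parse, or nothing if the
// case was not interesting or the charset source is above auto-detection.
std::optional<Bucket> BucketAtEnd(Interest interest, Source source,
                                  std::string_view finalCharset,
                                  std::string_view localeDefault) {
  if (interest == Interest::NotInteresting || source > Source::AutoDetection) {
    return std::nullopt;
  }
  const bool detectedNonDefault =
      source == Source::AutoDetection && finalCharset != localeDefault;
  if (interest == Interest::Russian) {
    return detectedNonDefault ? Bucket::RussianDetectedNonDefault
                              : Bucket::RussianOther;
  }
  return detectedNonDefault ? Bucket::UkrainianDetectedNonDefault
                            : Bucket::UkrainianOther;
}
```

The interest value would be stored on the stream parser at initialization and the bucket computed once when the parse ends, so each interesting page load contributes exactly one sample.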

Rahul Gandhi

Comment 3

•

12 years ago

Sir, I want to handle this bug; it would be a great opportunity for me to work on it. I am well equipped with knowledge of C/C++, Java, JavaScript, and HTML/CSS. I am new to development, so can you please explain it a little more?

Henri Sivonen (:hsivonen) (away from Bugzilla until 2025-10-27)

Reporter

Comment 4

•

12 years ago

Do you have a specific question?

Bruno Schneider [:Bruss]

Comment 5

•

12 years ago

Hello guys, I'm new to C++, but I'm going to work on this bug.

David Teller [:Yoric] - still alive but not very active

Updated

•

12 years ago

Assignee: nobody → brunoschneider17

Henri Sivonen (:hsivonen) (away from Bugzilla until 2025-10-27)

Reporter

Updated

•

11 years ago

Summary: Gather telemetry about the necessity of the Russian and Ukranian encoding detectors → Gather telemetry about the necessity of the Russian and Ukrainian encoding detectors

Nobody; OK to take it and work on it

Assignee

Updated

•

11 years ago

Mentor: hsivonen

Whiteboard: [mentor=hsivonen][lang=C++] → [lang=C++]

[:Aleksej]

Comment 6

•

11 years ago

I tried opening file:// UTF-8 text files in Spanish and in Russian with en-US and ru builds, and got wrong encodings with Nightly 2014-06-26 except when choosing Unicode or Auto-Detect Japanese. Nightly 2013-01-01 had another working option: Auto-Detect Universal. Details in bug 844115 comment 6.

Henri Sivonen (:hsivonen) (away from Bugzilla until 2025-10-27)

Reporter

Comment 7

•

11 years ago

Accommodating file: URLs is not a goal. The detectors are for dealing with legacy Web sites. See bug 1034960 comment 1 about UTF-8 in the file: URL case.

Henri Sivonen (:hsivonen) (away from Bugzilla until 2025-10-27)

Reporter

Comment 8

•

6 years ago

No word from the assignee for 5 years. Returning to the "OK to work on it" pool.

Assignee: brunoschneider17 → nobody

Henri Sivonen (:hsivonen) (away from Bugzilla until 2025-10-27)

Reporter

Comment 9

•

6 years ago

https://searchfox.org/mozilla-central/rev/d33d470140ce3f9426af523eaa8ecfa83476c806/intl/chardet/nsCyrillicDetector.cpp#36 indicates that the Cyrillic detectors have been giving up after the first call to feed them for 20 years!

Henri Sivonen (:hsivonen) (away from Bugzilla until 2025-10-27)

Reporter

Comment 10

•

6 years ago

Considering that we've shipped with these detectors enabled by default for the Russian and Ukrainian locales for years, that Chromium, including the new Edge, has an always-on detector, and that IE has an opt-in detector, I'm getting cold feet about unshipping these even though Safari gets away with not having these. (The WebKit codebase has a detector, but AFAIK, Safari has no UI for enabling it.)

After bug 1543077, it would be pretty easy to give similar treatment to Cyrillic encodings as Japanese encodings are getting in that bug.

(Chromium and IE detect IBM866, ISO-8859-5, KOI8 (didn't investigate -R vs. -U), and windows-1251, but don't detect x-mac-cyrillic. x-mac-cyrillic gets detected as windows-1251.

SIMD-accelerated detection between the Cyrillic encodings should be possible by looking at the most-significant half of each byte instead of looking at character-specific frequencies.)

Henri Sivonen (:hsivonen) (away from Bugzilla until 2025-10-27)

Reporter

Comment 11

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #10)

> SIMD-accelerated detection between the Cyrillic encodings should be possible by looking at the most-significant half of each byte instead of looking at character-specific frequencies.)

I experimented with this approach (that is, looking at the most significant half of each byte; I didn't implement SIMD acceleration for mere experimentation), but unfortunately it doesn't work well, because ISO-8859-5 is off by one row compared to windows-1251, which leads to short windows-1251 inputs getting mis-detected as ISO-8859-5 sometimes. (I synthesized short test inputs from Wikipedia article titles.)
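To illustrate the "off by one row" problem, here is a minimal sketch, not the code used in the experiment. The function names are hypothetical. It counts bytes per row (the most significant half of each byte) and relies on the fact that the basic letters А..я occupy 0xC0..0xFF in windows-1251 and 0xB0..0xEF in ISO-8859-5, so the same text produces the same histogram shape shifted down by one row.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count how many bytes fall into each of the 16 rows, where the row is
// the most significant half (high nibble) of the byte.
std::array<std::size_t, 16> HighNibbleHistogram(
    const std::vector<std::uint8_t>& bytes) {
  std::array<std::size_t, 16> rows{};
  for (std::uint8_t b : bytes) {
    rows[b >> 4] += 1;
  }
  return rows;
}

// Re-encode the basic Cyrillic letters (0xC0..0xFF in windows-1251) as
// ISO-8859-5 by moving them up one row in the table (subtracting 0x10).
// Illustration only: other bytes are passed through unchanged, which is
// correct for ASCII but not a general-purpose conversion.
std::vector<std::uint8_t> Windows1251LettersToIso88595(
    const std::vector<std::uint8_t>& bytes) {
  std::vector<std::uint8_t> out;
  out.reserve(bytes.size());
  for (std::uint8_t b : bytes) {
    out.push_back(b >= 0xC0 ? static_cast<std::uint8_t>(b - 0x10) : b);
  }
  return out;
}
```

Rows 0xC, 0xD, and 0xE hold letters in both encodings, so a short input whose bytes all land in those rows gives a row-only classifier very little to go on. For instance, a windows-1251 string made only of the letters а..п (0xE0..0xEF) is also a valid run of ISO-8859-5 letters р..я.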

The current state of our Cyrillic detection is clearly bogus, but I don't know whether we should just let it be bogus, put effort into pursuing a Chromium-like approach, or put effort into pursuing a more WebKit-like approach.

Nour Saffour

Comment 12

•

5 years ago

Hi Henri,
How do you record telemetry information?
I would like to work on this bug.

Henri Sivonen (:hsivonen) (away from Bugzilla until 2025-10-27)

Reporter

Comment 13

•

5 years ago

Sorry about missing your question earlier. This bug has become irrelevant due to bug 1551276.

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → WONTFIX


Categories

(Core :: DOM: HTML Parser, defect)

Tracking

()

People

(Reporter: hsivonen, Unassigned, Mentored)

References

Details

(Whiteboard: [lang=C++])

Crash Data

Security

(public)
