Closed Bug 845791 Opened 11 years ago Closed 4 years ago

Gather telemetry about the necessity of the Russian and Ukrainian encoding detectors

Categories

(Core :: DOM: HTML Parser, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: hsivonen, Unassigned, Mentored)

Details

(Whiteboard: [lang=C++])

While there is still anecdotal evidence about heuristic encoding detection being necessary for the Japanese locale, the stream of similar anecdotes for the Russian and Ukrainian cases has dried up over the years. We should add telemetry to see how often the Russian and Ukrainian detectors detect a Russian or Ukrainian encoding other than the fallback encoding to see if we could make HTML parsing less magical for the Russian and Ukrainian locales.
I want to work on this bug. I am well acquainted with C/C++, and I think this first bug would be a great opportunity for me.
Chardet is initialized at https://mxr.mozilla.org/mozilla-central/source/parser/html/nsHtml5StreamParser.cpp#184 if it is initialized. Over there, you should store an enum on nsHtml5StreamParser to mark if the situation is interesting. The situation is interesting if
 1) The detector is set to the Russian detector *and* the fallback encoding (intl.charset.default) is set to the default fallback encoding for the Russian locale *and* we're not getting an HTTP charset or parent frame charset passed in.
OR
 2) The detector is set to the Ukrainian detector *and* the fallback encoding (intl.charset.default) is set to the default fallback encoding for the Ukrainian locale *and* we're not getting an HTTP charset or parent frame charset passed in.

Otherwise, mark the case as not interesting.

Then at the end of the parse, if the case was marked as being interesting and the charset source is kCharsetFromAutoDetection or lower, record telemetry about whether we are seeing:
 a) Interest case 1 (Russian) and charset source is kCharsetFromAutoDetection and the charset is not the Russian default.
 b) Interest case 1 otherwise.
 c) Interest case 2 (Ukrainian) and charset source is kCharsetFromAutoDetection and the charset is not the Ukrainian default.
 d) Interest case 2 otherwise.
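
The two-step classification described above can be sketched roughly as follows. This is only an illustration, not Gecko code: apart from kCharsetFromAutoDetection, every name here (the enums, LocaleDefaults, ClassifyAtStart, ClassifyAtEnd, and the two placeholder charset sources) is hypothetical, the numeric source values are made up, and the locale default encodings are passed in rather than assumed.

```cpp
#include <string>

// Only kCharsetFromAutoDetection is named in this bug. The other two sources
// and all numeric values are placeholders invented for this sketch.
enum CharsetSource : int {
  kHypotheticalLowerSource = 1,   // assumed: weaker than auto-detection
  kCharsetFromAutoDetection = 2,
  kHypotheticalHigherSource = 3   // assumed: stronger than auto-detection
};

enum class Detector { None, Russian, Ukrainian };
enum class Interest { NotInteresting, Russian, Ukrainian };
enum class Bucket {
  NotRecorded,
  RussianDetectedNonDefault,    // case a
  RussianOther,                 // case b
  UkrainianDetectedNonDefault,  // case c
  UkrainianOther                // case d
};

// Default fallback encodings per locale, supplied by the caller.
struct LocaleDefaults {
  std::string russian;
  std::string ukrainian;
};

// Step 1: decide, where the detector is set up, whether the parse is interesting.
Interest ClassifyAtStart(Detector detector, const std::string& fallback,
                         bool hasHttpOrParentCharset,
                         const LocaleDefaults& defaults) {
  if (hasHttpOrParentCharset) {
    return Interest::NotInteresting;
  }
  if (detector == Detector::Russian && fallback == defaults.russian) {
    return Interest::Russian;
  }
  if (detector == Detector::Ukrainian && fallback == defaults.ukrainian) {
    return Interest::Ukrainian;
  }
  return Interest::NotInteresting;
}

// Step 2: at the end of the parse, pick the telemetry bucket (if any).
Bucket ClassifyAtEnd(Interest interest, int charsetSource,
                     const std::string& charset,
                     const LocaleDefaults& defaults) {
  if (interest == Interest::NotInteresting ||
      charsetSource > kCharsetFromAutoDetection) {
    return Bucket::NotRecorded;
  }
  const bool detected = charsetSource == kCharsetFromAutoDetection;
  if (interest == Interest::Russian) {
    return (detected && charset != defaults.russian)
               ? Bucket::RussianDetectedNonDefault
               : Bucket::RussianOther;
  }
  return (detected && charset != defaults.ukrainian)
             ? Bucket::UkrainianDetectedNonDefault
             : Bucket::UkrainianOther;
}
```

In a real patch the Interest value would presumably live as a member on nsHtml5StreamParser, and the Bucket would be reported through whatever telemetry mechanism the tree provides; neither is shown here.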
Sir,
I want to handle this bug; it would be a great opportunity for me to work on it. I am well equipped with knowledge of C/C++, Java, JavaScript, and HTML/CSS.
I am new to development, so can you please explain it a little more?
Do you have a specific question?
Hello guys, I'm new to C++, but I'm going to work on this bug.
Assignee: nobody → brunoschneider17
Summary: Gather telemetry about the necessity of the Russian and Ukranian encoding detectors → Gather telemetry about the necessity of the Russian and Ukrainian encoding detectors
Mentor: hsivonen
Whiteboard: [mentor=hsivonen][lang=C++] → [lang=C++]
I tried opening file:// UTF-8 text files in Spanish and in Russian with en-US and ru builds, and got wrong encodings with Nightly 2014-06-26 except when choosing Unicode or Auto-Detect Japanese. Nightly 2013-01-01 had another working option: Auto-Detect Universal. Details in bug 844115 comment 6.
Accommodating file: URLs is not a goal. The detectors are for dealing with legacy Web sites.

See bug 1034960 comment 1 about UTF-8 in the file: URL case.
No word from the assignee for 5 years. Returning to the "OK to work on it" pool.
Assignee: brunoschneider17 → nobody

https://searchfox.org/mozilla-central/rev/d33d470140ce3f9426af523eaa8ecfa83476c806/intl/chardet/nsCyrillicDetector.cpp#36 indicates that the Cyrillic detectors have been giving up after the first call to feed them for 20 years!

Considering that we've shipped with these detectors enabled by default for the Russian and Ukrainian locales for years, that Chromium, including the new Edge, has an always-on detector, and that IE has an opt-in detector, I'm getting cold feet about unshipping these even though Safari gets away with not having these. (The WebKit codebase has a detector, but AFAIK, Safari has no UI for enabling it.)

After bug 1543077, it would be pretty easy to give similar treatment to Cyrillic encodings as Japanese encodings are getting in that bug.

(Chromium and IE detect IBM866, ISO-8859-5, KOI8 (didn't investigate -R vs. -U), and windows-1251, but don't detect x-mac-cyrillic. x-mac-cyrillic gets detected as windows-1251.

SIMD-accelerated detection between the Cyrillic encodings should be possible by looking at the most-significant half of each byte instead of looking at character-specific frequencies.)

(In reply to Henri Sivonen (:hsivonen) from comment #10)

> SIMD-accelerated detection between the Cyrillic encodings should be possible by looking at the most-significant half of each byte instead of looking at character-specific frequencies.)

I experimented with this approach (that is, looking at the most significant half of each byte; I didn't implement SIMD acceleration for mere experimentation), but unfortunately it doesn't work well, because ISO-8859-5 is off by one row compared to windows-1251, which leads to short windows-1251 inputs getting mis-detected as ISO-8859-5 sometimes. (I synthesized short test inputs from Wikipedia article titles.)
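
To illustrate the off-by-one-row problem, here is a minimal sketch. It is not the experiment described above, and all function names are hypothetical. It counts non-ASCII bytes by their most significant half and checks whether they all fall into the rows that hold the basic Cyrillic alphabet (А to я) in each encoding: 0xC to 0xF in windows-1251 and 0xB to 0xE in ISO-8859-5. It deliberately ignores letters outside that basic range, such as Ё.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count of non-ASCII bytes per high nibble ("row" of the code page).
using NibbleHistogram = std::array<std::size_t, 16>;

NibbleHistogram CountHighNibbles(const std::vector<std::uint8_t>& bytes) {
  NibbleHistogram hist{};
  for (std::uint8_t b : bytes) {
    if (b >= 0x80) {
      ++hist[b >> 4];
    }
  }
  return hist;
}

// True if every counted byte lies in rows firstRow..lastRow (inclusive).
bool AllInRows(const NibbleHistogram& hist, unsigned firstRow,
               unsigned lastRow) {
  for (unsigned row = 8; row < 16; ++row) {
    if (hist[row] != 0 && (row < firstRow || row > lastRow)) {
      return false;
    }
  }
  return true;
}

// Basic alphabet rows in windows-1251: 0xC0..0xFF.
bool PlausibleWindows1251(const NibbleHistogram& hist) {
  return AllInRows(hist, 0xC, 0xF);
}

// Basic alphabet rows in ISO-8859-5: 0xB0..0xEF, one row lower.
bool PlausibleIso88595(const NibbleHistogram& hist) {
  return AllInRows(hist, 0xB, 0xE);
}
```

For example, "Дом" in windows-1251 is the bytes C4 EE EC, which pass both checks, so rows alone cannot tell the two encodings apart. "Мир" (CC E8 F0) contains a byte in row 0xF and therefore fails the ISO-8859-5 check. Short inputs often lack such a distinguishing byte.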

The current state of our Cyrillic detection is clearly bogus, but I don't know whether we should just let it be bogus, put effort into pursuing a Chromium-like approach, or put effort into pursuing a more WebKit-like approach.

Hi Henri,
How do you record telemetry information?
I would like to work on this bug.

Sorry about missing your question earlier. This bug has become irrelevant due to bug 1551276.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX