Gather telemetry about the necessity of the Russian and Ukrainian encoding detectors
Categories
(Core :: DOM: HTML Parser, defect)
Tracking
()
People
(Reporter: hsivonen, Unassigned, Mentored)
Details
(Whiteboard: [lang=C++])
While there is still anecdotal evidence about heuristic encoding detection being necessary for the Japanese locale, the stream of similar anecdotes for the Russian and Ukrainian cases has dried up over the years. We should add telemetry to see how often the Russian and Ukrainian detectors detect a Russian or Ukrainian encoding other than the fallback encoding to see if we could make HTML parsing less magical for the Russian and Ukrainian locales.
Comment 1•11 years ago
I want to work on this bug. I am well acquainted with C/C++. I think this first bug would be a great opportunity for me.
Reporter
Comment 2•11 years ago
Chardet is initialized at https://mxr.mozilla.org/mozilla-central/source/parser/html/nsHtml5StreamParser.cpp#184 if it is initialized. Over there, you should store an enum on nsHtml5StreamParser to mark whether the situation is interesting. The situation is interesting if:
1) The detector is set to the Russian detector *and* the fallback encoding (intl.charset.default) is set to the default fallback encoding for the Russian locale *and* we're not getting an HTTP charset or parent frame charset passed in.
OR
2) The detector is set to the Ukrainian detector *and* the fallback encoding (intl.charset.default) is set to the default fallback encoding for the Ukrainian locale *and* we're not getting an HTTP charset or parent frame charset passed in.
Otherwise, mark the case as not interesting.
Then, at the end of the parse, if the case was marked as interesting and the charset source is kCharsetFromAutoDetection or lower, record telemetry on which of these we are seeing:
a) Interest case 1 (Russian) and charset source is kCharsetFromAutoDetection and the charset is not the Russian default.
b) Interest case 1 otherwise.
c) Interest case 2 (Ukrainian) and charset source is kCharsetFromAutoDetection and the charset is not the Ukrainian default.
d) Interest case 2 otherwise.
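The classification described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Gecko code: the enum names, function names, and parameters are all hypothetical, the charset source is modeled as a plain int instead of the real charset source constants, and the locale default encodings are passed in by the caller because this comment does not name them.

```cpp
#include <string>

// Hypothetical stand-ins; none of these names exist in Gecko.
enum class DetectorKind { None, Russian, Ukrainian, Other };
enum class Interest { NotInteresting, Russian, Ukrainian };
enum class Bucket {
  None,                 // nothing to record
  RussianNonDefault,    // case a)
  RussianOther,         // case b)
  UkrainianNonDefault,  // case c)
  UkrainianOther        // case d)
};

// Decide, at detector initialization time, whether the parse is interesting.
Interest ClassifyInterest(DetectorKind aDetector,
                          const std::string& aFallback,
                          const std::string& aRussianDefault,
                          const std::string& aUkrainianDefault,
                          bool aHasHttpOrParentCharset) {
  if (aHasHttpOrParentCharset) {
    return Interest::NotInteresting;
  }
  if (aDetector == DetectorKind::Russian && aFallback == aRussianDefault) {
    return Interest::Russian;
  }
  if (aDetector == DetectorKind::Ukrainian && aFallback == aUkrainianDefault) {
    return Interest::Ukrainian;
  }
  return Interest::NotInteresting;
}

// Decide, at the end of the parse, which telemetry bucket to record.
// aAutoDetectSource is a stand-in for the numeric value of
// kCharsetFromAutoDetection; "or lower" is modeled as <=.
Bucket ClassifyBucket(Interest aInterest,
                      int aCharsetSource,
                      int aAutoDetectSource,
                      const std::string& aCharset,
                      const std::string& aLocaleDefault) {
  if (aInterest == Interest::NotInteresting ||
      aCharsetSource > aAutoDetectSource) {
    return Bucket::None;
  }
  const bool detectedNonDefault =
      aCharsetSource == aAutoDetectSource && aCharset != aLocaleDefault;
  if (aInterest == Interest::Russian) {
    return detectedNonDefault ? Bucket::RussianNonDefault
                              : Bucket::RussianOther;
  }
  return detectedNonDefault ? Bucket::UkrainianNonDefault
                            : Bucket::UkrainianOther;
}
```

In a real patch, the Interest value would presumably live as a member on nsHtml5StreamParser, and the resulting bucket would be reported through whatever telemetry probe is defined for this bug.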
Comment 3•11 years ago
Sir, I want to handle this bug; it would be a great opportunity for me to work on it. I am well equipped with knowledge of C/C++, Java, JavaScript, and HTML/CSS. I am new to development, so can you please explain it a little more?
Reporter
Comment 4•11 years ago
Do you have a specific question?
Comment 5•11 years ago
Hello guys, I'm new to C++, but I'm going to work on this bug.
Comment 6•10 years ago
I tried opening file:// UTF-8 text files in Spanish and in Russian with en-US and ru builds, and got wrong encodings with Nightly 2014-06-26 except when choosing Unicode or Auto-Detect Japanese. Nightly 2013-01-01 had another working option: Auto-Detect Universal. Details in bug 844115 comment 6.
Reporter
Comment 7•10 years ago
Accommodating file: URLs is not a goal. The detectors are for dealing with legacy Web sites. See bug 1034960 comment 1 about UTF-8 in the file: URL case.
Reporter
Comment 8•6 years ago
No word from the assignee for 5 years. Returning to the "OK to work on it" pool.
Reporter
Comment 9•5 years ago
https://searchfox.org/mozilla-central/rev/d33d470140ce3f9426af523eaa8ecfa83476c806/intl/chardet/nsCyrillicDetector.cpp#36 indicates that the Cyrillic detectors have been giving up after the first call to feed them for 20 years!
Reporter
Comment 10•5 years ago
Considering that we've shipped with these detectors enabled by default for the Russian and Ukrainian locales for years, that Chromium, including the new Edge, has an always-on detector, and that IE has an opt-in detector, I'm getting cold feet about unshipping these even though Safari gets away with not having these. (The WebKit codebase has a detector, but AFAIK, Safari has no UI for enabling it.)
After bug 1543077, it would be pretty easy to give similar treatment to Cyrillic encodings as Japanese encodings are getting in that bug.
(Chromium and IE detect IBM866, ISO-8859-5, KOI8 (didn't investigate -R vs. -U), and windows-1251, but don't detect x-mac-cyrillic. x-mac-cyrillic gets detected as windows-1251.
SIMD-accelerated detection between the Cyrillic encodings should be possible by looking at the most-significant half of each byte instead of looking at character-specific frequencies.)
Reporter
Comment 11•5 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #10)
> SIMD-accelerated detection between the Cyrillic encodings should be possible by looking at the most-significant half of each byte instead of looking at character-specific frequencies.)
I experimented with this approach (that is, looking at the most significant half of each byte; I didn't implement SIMD acceleration for mere experimentation), but unfortunately it doesn't work well, because ISO-8859-5 is off by one row compared to windows-1251, which leads to short windows-1251 inputs sometimes getting mis-detected as ISO-8859-5. (I synthesized short test inputs from Wikipedia article titles.)
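To illustrate the off-by-one-row problem, here is a minimal scalar sketch. It is not the code used in the experiment, and the function names are made up for this example. It relies on the layout of the basic Russian letters А-Я and а-я, which occupy 0xC0-0xFF (rows C through F) in windows-1251 and 0xB0-0xEF (rows B through E) in ISO-8859-5, so rows C through E are shared by both.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count how many bytes fall into each of the 16 rows selected by the
// most significant half (high nibble) of the byte.
std::array<std::size_t, 16> HighNibbleHistogram(
    const std::vector<std::uint8_t>& aBytes) {
  std::array<std::size_t, 16> rows{};
  for (std::uint8_t b : aBytes) {
    ++rows[b >> 4];
  }
  return rows;
}

// Return true if every non-ASCII byte lies in rows aFirst..aLast inclusive.
// For the basic Russian letters: rows 0xC..0xF for windows-1251 and
// rows 0xB..0xE for ISO-8859-5.
bool AllNonAsciiInRows(const std::vector<std::uint8_t>& aBytes,
                       unsigned aFirst, unsigned aLast) {
  for (std::uint8_t b : aBytes) {
    if (b < 0x80) {
      continue;  // ASCII says nothing about which Cyrillic encoding is in use.
    }
    const unsigned row = b >> 4;
    if (row < aFirst || row > aLast) {
      return false;
    }
  }
  return true;
}
```

For example, the windows-1251 bytes of "Анна" are C0 ED ED E0, which use only rows C and E, so a row-based check accepts the input as both windows-1251 and ISO-8859-5. The windows-1251 bytes of "привет" (EF F0 E8 E2 E5 F2) include row F and can be told apart from ISO-8859-5, but short inputs often contain no such byte.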
The current state of our Cyrillic detection is clearly bogus, but I don't know whether we should just let it be bogus, put effort into pursuing a Chromium-like approach, or put effort into pursuing a more WebKit-like approach.
Comment 12•5 years ago
Hi Henri,
How do you record telemetry information?
I would like to work on this bug.
Reporter
Comment 13•5 years ago
Sorry about missing your question earlier. This bug has become irrelevant due to bug 1551276.