Closed Bug 845791 Opened 7 years ago Closed 1 month ago
Gather telemetry about the necessity of the Russian and Ukrainian encoding detectors
While there is still anecdotal evidence about heuristic encoding detection being necessary for the Japanese locale, the stream of similar anecdotes for the Russian and Ukrainian cases has dried up over the years. We should add telemetry to see how often the Russian and Ukrainian detectors detect a Russian or Ukrainian encoding other than the fallback encoding to see if we could make HTML parsing less magical for the Russian and Ukrainian locales.
I want to work on this bug. I am well acquainted with C/C++. I think this first bug would be a great opportunity for me.
Chardet is initialized, if it is initialized at all, at https://mxr.mozilla.org/mozilla-central/source/parser/html/nsHtml5StreamParser.cpp#184. There, you should store an enum on nsHtml5StreamParser to mark whether the situation is interesting. The situation is interesting if:
1) The detector is set to the Russian detector *and* the fallback encoding (intl.charset.default) is set to the default fallback encoding for the Russian locale *and* we're not getting an HTTP charset or parent frame charset passed in, OR
2) The detector is set to the Ukrainian detector *and* the fallback encoding (intl.charset.default) is set to the default fallback encoding for the Ukrainian locale *and* we're not getting an HTTP charset or parent frame charset passed in.
Otherwise, mark the case as not interesting.
Then, at the end of the parse, if the case was marked as interesting and the charset source is kCharsetFromAutoDetection or lower, record telemetry on which of the following we are seeing:
a) Interest case 1 (Russian) and the charset source is kCharsetFromAutoDetection and the charset is not the Russian default.
b) Interest case 1 otherwise.
c) Interest case 2 (Ukrainian) and the charset source is kCharsetFromAutoDetection and the charset is not the Ukrainian default.
d) Interest case 2 otherwise.
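To make the two steps above concrete, here is a rough standalone sketch of the classification logic. It is not the actual nsHtml5StreamParser code: every type and function name below (Detector, Interest, Bucket, CharsetSource, LocaleDefaults, ClassifyAtInit, BucketAtEndOfParse) is a hypothetical stand-in, the charset-source ordering is simplified to three values, and encoding names are assumed to be already normalized so that plain string comparison works. The locale default encodings are passed in rather than hardcoded.

```cpp
#include <string>

// All names below are hypothetical stand-ins, not the real Gecko types.
enum class Detector { Off, Russian, Ukrainian, Other };
enum class Interest { NotInteresting, Russian, Ukrainian };
enum class Bucket {
  NotRecorded,
  RussianDetectedNonDefault,    // case a
  RussianOther,                 // case b
  UkrainianDetectedNonDefault,  // case c
  UkrainianOther                // case d
};

// Simplified source ordering (assumption): only the relative order matters.
// SourceAutoDetection stands in for kCharsetFromAutoDetection.
enum CharsetSource { SourceFallback = 0, SourceAutoDetection = 1, SourceDeclared = 2 };

// Default fallback encodings per locale, supplied by the caller.
struct LocaleDefaults {
  std::string russian;
  std::string ukrainian;
};

// Step 1: decide at detector initialization whether the case is interesting.
Interest ClassifyAtInit(Detector detector, const std::string& fallback,
                        bool hasHttpOrParentCharset,
                        const LocaleDefaults& defaults) {
  if (hasHttpOrParentCharset) {
    return Interest::NotInteresting;
  }
  if (detector == Detector::Russian && fallback == defaults.russian) {
    return Interest::Russian;
  }
  if (detector == Detector::Ukrainian && fallback == defaults.ukrainian) {
    return Interest::Ukrainian;
  }
  return Interest::NotInteresting;
}

// Step 2: pick the telemetry bucket at the end of the parse.
Bucket BucketAtEndOfParse(Interest interest, CharsetSource source,
                          const std::string& charset,
                          const LocaleDefaults& defaults) {
  // Record nothing for uninteresting cases or sources above auto-detection.
  if (interest == Interest::NotInteresting || source > SourceAutoDetection) {
    return Bucket::NotRecorded;
  }
  const bool detected = (source == SourceAutoDetection);
  if (interest == Interest::Russian) {
    return (detected && charset != defaults.russian)
               ? Bucket::RussianDetectedNonDefault
               : Bucket::RussianOther;
  }
  return (detected && charset != defaults.ukrainian)
             ? Bucket::UkrainianDetectedNonDefault
             : Bucket::UkrainianOther;
}
```

In a real patch the Interest value would presumably live as a member on nsHtml5StreamParser, and the Bucket would be reported through a telemetry histogram instead of being returned.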
Do you have a specific question?
Hello guys, I'm new to C++, but I'm going to work on this bug.
Assignee: nobody → brunoschneider17
Summary: Gather telemetry about the necessity of the Russian and Ukranian encoding detectors → Gather telemetry about the necessity of the Russian and Ukrainian encoding detectors
Whiteboard: [mentor=hsivonen][lang=C++] → [lang=C++]
I tried opening file:// UTF-8 text files in Spanish and in Russian with en-US and ru builds, and got wrong encodings with Nightly 2014-06-26 unless I chose Unicode or Auto-Detect Japanese. Nightly 2013-01-01 had another working option: Auto-Detect Universal. Details in bug 844115 comment 6.
Accommodating file: URLs is not a goal. The detectors are for dealing with legacy Web sites. See bug 1034960 comment 1 about UTF-8 in the file: URL case.
No word from the assignee for 5 years. Returning to the "OK to work on it" pool.
Assignee: brunoschneider17 → nobody
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → WONTFIX