Gather telemetry about the necessity of the Russian and Ukrainian encoding detectors
Categories
(Core :: DOM: HTML Parser, defect)
Tracking
()
People
(Reporter: hsivonen, Unassigned, Mentored)
Details
(Whiteboard: [lang=C++])
Comment 1•12 years ago
Reporter
Comment 2•12 years ago
Comment 3•12 years ago
Reporter
Comment 4•12 years ago
Comment 5•12 years ago
Updated•12 years ago
Reporter
Updated•11 years ago
Assignee
Updated•11 years ago
Comment 6•11 years ago
Reporter
Comment 7•11 years ago
Reporter
Comment 8•6 years ago
Reporter
Comment 9•6 years ago
https://searchfox.org/mozilla-central/rev/d33d470140ce3f9426af523eaa8ecfa83476c806/intl/chardet/nsCyrillicDetector.cpp#36 indicates that, for 20 years, the Cyrillic detectors have been giving up after the first call that feeds them data!
Reporter
Comment 10•6 years ago
Considering that we've shipped with these detectors enabled by default for the Russian and Ukrainian locales for years, that Chromium, including the new Edge, has an always-on detector, and that IE has an opt-in detector, I'm getting cold feet about unshipping these even though Safari gets away with not having these. (The WebKit codebase has a detector, but AFAIK, Safari has no UI for enabling it.)
After bug 1543077, it would be pretty easy to give similar treatment to Cyrillic encodings as Japanese encodings are getting in that bug.
(Chromium and IE detect IBM866, ISO-8859-5, KOI8 (didn't investigate -R vs. -U), and windows-1251, but don't detect x-mac-cyrillic. x-mac-cyrillic gets detected as windows-1251.
SIMD-accelerated detection between the Cyrillic encodings should be possible by looking at the most-significant half of each byte instead of looking at character-specific frequencies.)
Reporter
Comment 11•6 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #10)
SIMD-accelerated detection between the Cyrillic encodings should be possible by looking at the most-significant half of each byte instead of looking at character-specific frequencies.)
I experimented with this approach (that is, looking at the most significant half of each byte; I didn't implement SIMD acceleration for mere experimentation), but unfortunately it doesn't work well, because ISO-8859-5 is off by one row compared to windows-1251, which leads to short windows-1251 inputs sometimes getting misdetected as ISO-8859-5. (I synthesized short test inputs from Wikipedia article titles.)
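To illustrate the off-by-one-row problem, here is a minimal sketch of the high-nibble idea. It is not the experimental code referred to above; `HighNibbleHistogram` and the two sample byte sequences are hypothetical names and inputs chosen for illustration. windows-1251 places lowercase а-п in row 0xE0-0xEF, while ISO-8859-5 places lowercase р-я in that same row, so short words can produce identical histograms in both encodings.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: counts how often each most-significant nibble
// occurs among the non-ASCII bytes of the input.
std::array<std::size_t, 16> HighNibbleHistogram(
    const std::vector<std::uint8_t>& bytes) {
  std::array<std::size_t, 16> counts{};
  for (std::uint8_t b : bytes) {
    if (b >= 0x80) {
      ++counts[b >> 4];
    }
  }
  return counts;
}

// "мама" encoded as windows-1251 (а = 0xE0, м = 0xEC).
const std::vector<std::uint8_t> kMamaWindows1251 = {0xEC, 0xE0, 0xEC, 0xE0};

// "суть" encoded as ISO-8859-5 (с = 0xE1, у = 0xE3, т = 0xE2, ь = 0xEC).
const std::vector<std::uint8_t> kSutIso88595 = {0xE1, 0xE3, 0xE2, 0xEC};
```

Both sample inputs fall entirely into row 0xE, so a classifier that sees only the nibble histogram has nothing to tell them apart. Longer inputs spread across more rows, which is presumably why the problem shows up mainly with short inputs such as titles.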
The current state of our Cyrillic detection is clearly bogus, but I don't know whether we should just let it be bogus, put effort into pursuing a Chromium-like approach, or put effort into pursuing a more WebKit-like approach.
Comment 12•5 years ago
Hi Henri,
How do you record telemetry information?
I would like to work on this bug.
Reporter
Comment 13•5 years ago
Sorry about missing your question earlier. This bug has become irrelevant due to bug 1551276.