Closed Bug 248304 Opened 21 years ago Closed 20 years ago

Poor text-encoding guessing for sites with no specified encoding (e.g. pound signs on UK sites)

Categories

(Camino Graveyard :: General, defect, P2)

PowerPC
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED
Camino0.9

People

(Reporter: stuart.morgan+bugzilla, Assigned: sfraser_bugs)

References

()

Details

Attachments

(2 files)

There have been several forum complaints about sites with garbled text, especially numbers, on sites that don't have an encoding set (see the "More Buying Choices" box on the right of the test URL, before and after setting text encoding to ISO Latin 1). Changing the encoding manually works, but is a pain if it's a site they use often or navigate around very much. According to the reports, the other Moz family browsers don't have problems with the test URL or many other sites, so we should look into doing what they are doing to guess text encoding.
*** Bug 249196 has been marked as a duplicate of this bug. ***
*** Bug 257383 has been marked as a duplicate of this bug. ***
*** Bug 263704 has been marked as a duplicate of this bug. ***
Add note about pound signs to summary as this is one of the most noticeable effects of this bug.
Summary: Poor text-encoding guessing for sites with no specified encoding → Poor text-encoding guessing for sites with no specified encoding (e.g. pound signs on UK sites)
*** Bug 280172 has been marked as a duplicate of this bug. ***
This is a common complaint from the feedback.
Priority: -- → P2
Target Milestone: --- → Camino1.0
Depends on: 168526
There are many duplicates marked directly against this bug, but it would be better to check whether the page is wrongly detected as SJIS, in which case it's a duplicate of bug 168526, or wrongly detected as GB18030, in which case it's a duplicate of bug 181344. And if it's neither of those, it shows there are more problems than those two to solve to make the universal detector more reliable, which is interesting info that must not be lost because the duplicate marking was too quick. From a very superficial check, all the already reported cases seem to be duplicates of bug 181344 (the pound sign wrongly leads the detector to interpret the page as GB18030). If so, I think it would be better to tag them as such, so that it's easier to check whether a fix for bug 181344 fixes them all.
Depends on: 181344
> According to the reports, the other Moz family browsers don't have problems
> with the test URL or many other sites
This is Camino specific and not one of those bugs.
This probably only happens in Camino because it uses the Universal Charset Detector.
This needs to be fixed for 0.9; we get a lot of feedback about this from non-US users.
Status: NEW → ASSIGNED
Target Milestone: Camino1.0 → Camino0.9
*** Bug 290317 has been marked as a duplicate of this bug. ***
*** Bug 292320 has been marked as a duplicate of this bug. ***
This is apparently happening on MacFixIt now/today, guessing a Shift JIS encoding.
Attached file Testcase
Testcase. Interestingly, a single pound sign displays OK, but with more than one, the encoding is guessed incorrectly.
The testcase is incorrectly detected as GB18030, so that's bug 181344.
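To illustrate why pound signs are ambiguous at the byte level, here is a minimal Python sketch. It does not reproduce the universal detector's algorithm; it only checks which candidate encodings can decode the bytes without error. The helper name `plausible_encodings` and the sample bytes are assumptions made for illustration, not taken from the attached testcase.

```python
def plausible_encodings(data: bytes, candidates: list[str]) -> list[str]:
    """Return the candidate encodings under which `data` decodes without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# Two pound signs as an ISO-8859-1 page would send them: bytes 0xA3 0xA3.
sample = "££".encode("iso-8859-1")

for name in plausible_encodings(sample, ["iso-8859-1", "gb18030", "shift_jis"]):
    # repr() keeps the output readable even when the glyphs are unfamiliar.
    print(name, repr(sample.decode(name)))
```

Because the same bytes are well-formed in more than one encoding, a detector with no declared charset to go on has to guess, and a page that is mostly ASCII gives it very little evidence either way.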
OK, here's the deal. We hardcode the "intl.charset.detector" pref to "universal_charset_detector" in [PreferencesManager syncMozillaPrefs], which means that every Camino user will have this in their prefs.js file. If I take that out (and nuke the pref by hand), the testcase works. So do we want to expose toggling "Auto Detect" on and off via the Text Encoding menu?
Taking.
Assignee: pinkerton → sfraser_bugs
Status: ASSIGNED → NEW
Attached patch Patch
This patch does several things:
1. Flips the "universal_charset_detector" off for people running a build with this change for the first time (using a new pref version key), and removes the hardcoding of this pref.
2. Adds an "Automatically Detect Page Encoding" item to the bottom of the text encodings menu, which toggles the "universal_charset_detector" on and off (reloading the page when toggled).
3. Makes the Text Encodings menu not auto-update; we update it on display (this removes code that assumed that any menu item with a tag > 10 was an encoding item).
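The first two items can be sketched roughly as follows. This is a Python model of the logic only, not the Objective-C in the patch. The version key name `camino.prefs_version`, the version number, and the use of an empty string to mean "detector off" are assumptions for illustration; only the pref name and the "universal_charset_detector" value come from this bug.

```python
DETECTOR_PREF = "intl.charset.detector"      # pref name from this bug
UNIVERSAL = "universal_charset_detector"     # pref value from this bug
VERSION_PREF = "camino.prefs_version"        # hypothetical key name
CURRENT_PREFS_VERSION = 1                    # hypothetical version number


def migrate_prefs(prefs: dict) -> dict:
    """One-time migration: switch the detector off if the stored version is older."""
    if prefs.get(VERSION_PREF, 0) < CURRENT_PREFS_VERSION:
        prefs[DETECTOR_PREF] = ""            # assumption: empty string means "off"
        prefs[VERSION_PREF] = CURRENT_PREFS_VERSION
    return prefs


def toggle_auto_detect(prefs: dict) -> bool:
    """Flip the detector and return the new state; the caller would reload the page."""
    enabled = prefs.get(DETECTOR_PREF, "") != UNIVERSAL
    prefs[DETECTOR_PREF] = UNIVERSAL if enabled else ""
    return enabled
```

Keying the change on a stored version means the detector is switched off only once, so a user who later turns it back on from the menu keeps that choice on subsequent launches.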
Attachment #187903 - Flags: review?(pinkerton)
Status: NEW → ASSIGNED
wait wait wait. we had to put that pref in otherwise a number of sites wouldn't render correctly (or at all) because the encoding would be wrong. I think it was many japanese/chinese/korean/russian sites, I don't remember the details, it's all hazy. Maybe look back into bugzilla to see what turning this stuff on was fixing. i know i turned that stuff on for a reason.
(In reply to comment #19)
> wait wait wait. we had to put that pref in otherwise a number of sites wouldn't
> render correctly (or at all) because the encoding would be wrong. I think it was
> many japanese/chinese/korean/russian sites, I don't remember the details, it's
Well, not quite true unless you're truly multilingual and have to/can read many different languages (C, J, K, and R). For most people, just using a language-specific detector, which works better for their language than the universal detector, should suffice. Actually, Japanese and Russian users need the JA/RU detectors, but Chinese (both SC and TC) and Korean users don't need a detector most of the time, because for the latter group there is a single dominant encoding which can be set as the default.
I thought we put the pref in early on because we didn't want Camino to have a Text Encodings menu. But now we have one.
i recall putting it in close to 0.7 shipping, after we had an encoding menu. again, that's just a recollection.
That's probably because we used to have it in all-camino.js, but then they changed some of the i18n prefs to use this funky locale thing, so we had to move it into code. I don't see why we need to be any different than Firefox, and this patch makes us similar.
All those urls look fine with the charset detector off.
The crash logs pasted into bug 281679 seem to have fooled the charset detector, too; I don't know what it's guessing, because there's *no* check in the text encoding menu.
It's probably guessing gb18030, which is a superset of gb2312 I believe.
Comment on attachment 187903 Patch
r=pink. good idea on the prefs version as well. we'll need that going forward.
Attachment #187903 - Flags: review?(pinkerton) → review+
Checked in. Note that the first time you run a build with this change (which will be 20050705), the universal charset detector will be turned off. (If you go and run an older build, it will get turned back on.) You can toggle the detector on and off via the bottom item of the Text Encodings submenu.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Pageload went from ~870ms to ~800ms with this change.
(In reply to comment #30)
> Pageload went from ~870ms to ~800ms with this change.
Smaller is better? Does that mean bug 234683 got fixed (found it by accident; perhaps what Mike was remembering in comment 19 or 22)?
*** Bug 305775 has been marked as a duplicate of this bug. ***