Closed Bug 248304 Opened 20 years ago Closed 19 years ago

Poor text-encoding guessing for sites with no specified encoding (e.g. pound signs on UK sites)

Categories

(Camino Graveyard :: General, defect, P2)

PowerPC
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED
Camino0.9

People

(Reporter: stuart.morgan+bugzilla, Assigned: sfraser_bugs)

Attachments

(2 files)

There have been several forum complaints about sites with garbled text,
especially numbers, on sites that don't have an encoding set (see the "More
Buying Choices" box on the right of the test URL, before and after setting text
encoding to ISO Latin 1). Changing the encoding manually works, but is a pain if
it's a site they use often or navigate around very much.

According to the reports, the other Moz family browsers don't have problems with
the test URL or many other sites, so we should look into doing what they are
doing to guess text encoding.
*** Bug 249196 has been marked as a duplicate of this bug. ***
*** Bug 257383 has been marked as a duplicate of this bug. ***
*** Bug 263704 has been marked as a duplicate of this bug. ***
Add note about pound signs to summary as this is one of the most noticeable
effects of this bug.
Summary: Poor text-encoding guessing for sites with no specified encoding → Poor text-encoding guessing for sites with no specified encoding (e.g. pound signs on UK sites)
*** Bug 280172 has been marked as a duplicate of this bug. ***
This is a common complaint from the feedback.
Priority: -- → P2
Target Milestone: --- → Camino1.0
Depends on: 168526
There are many bugs duped directly to this one, but it would be better to check
whether each page is wrongly detected as SJIS, in which case it's a duplicate of
bug 168526, or wrongly detected as GB18030, in which case it's a duplicate of
bug 181344.

And if it's neither of those, it shows there are more problems than those two to
solve to make the universal detector more reliable, which is interesting info
that must not be lost because the duplicate marking was too quick.

From a very superficial check, all the already reported cases seem to be
duplicates of bug 181344 (the pound sign wrongly leads the detector to interpret
the page as GB18030). If so, I think it would be better to tag them as such, so
that it's easier to check whether a fix for bug 181344 fixes them all.
Depends on: 181344
> According to the reports, the other Moz family browsers don't have problems
> with the test URL or many other sites

This is Camino specific and not one of those bugs.
This probably only happens in Camino because it uses the Universal Charset Detector.
This needs to be fixed for 0.9; we get a lot of feedback about this from non-US
users.
Status: NEW → ASSIGNED
Target Milestone: Camino1.0 → Camino0.9
*** Bug 290317 has been marked as a duplicate of this bug. ***
*** Bug 292320 has been marked as a duplicate of this bug. ***
This is apparently happening on MacFixIt now/today, guessing a Shift JIS encoding.
Attached file Testcase
Testcase. Interestingly, a single pound sign displays OK, but with more than
one, the encoding is guessed incorrectly.
The testcase is incorrectly detected as GB18030, so that's bug 181344.
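To make the ambiguity concrete, here is a minimal Python sketch (not the detector's actual algorithm, which is statistical) showing that two adjacent ISO-8859-1 pound-sign bytes also form a well-formed GB18030 two-byte sequence, while a lone 0xA3 byte does not. The helper name `is_valid` is made up for this illustration.

```python
# Bytes a page authored in ISO-8859-1 (ISO Latin 1) would send for two pound signs.
raw = "££".encode("latin-1")          # b'\xa3\xa3'

def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under `encoding` without errors."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

as_latin1 = raw.decode("latin-1")     # what the page author intended
as_gb18030 = raw.decode("gb18030")    # also decodes, but to one unrelated character

print(repr(as_latin1), repr(as_gb18030))
print("pair valid as GB18030:", is_valid(raw, "gb18030"))
print("lone byte valid as GB18030:", is_valid(b"\xa3", "gb18030"))
```

Byte validity alone does not explain the detector's choice, but it does show why a page with no declared encoding and several 0xA3 bytes gives a multi-byte detector something to latch onto.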
OK, here's the deal.

We hardcode the "intl.charset.detector" pref to "universal_charset_detector" in
[PreferencesManager syncMozillaPrefs], which means that every Camino user will
have this in their prefs.js file.

If I take that out (and nuke the pref by hand), the testcase works.
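For anyone wanting to check their own profile, a rough Python sketch of looking for that pref in prefs.js text follows. It assumes the usual one-`user_pref(...)`-per-line layout; the `read_prefs` helper and the sample contents are invented for illustration, and only the pref name and value come from this bug.

```python
import re

# Matches lines of the form: user_pref("name", value);
PREF_LINE = re.compile(r'user_pref\("([^"]+)",\s*(.+?)\);')

def read_prefs(text: str) -> dict:
    """Return {pref name: raw value string} for each user_pref line in `text`."""
    prefs = {}
    for line in text.splitlines():
        match = PREF_LINE.match(line.strip())
        if match:
            # Drop surrounding quotes from string values; leave others as written.
            prefs[match.group(1)] = match.group(2).strip('"')
    return prefs

# Made-up sample standing in for a real prefs.js file.
sample = '''
user_pref("browser.startup.homepage", "about:blank");
user_pref("intl.charset.detector", "universal_charset_detector");
'''

detector = read_prefs(sample).get("intl.charset.detector", "")
print("detector pref:", detector or "(not set)")
```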

So do we want to expose toggling "Auto Detect" on and off via the Text Encoding
menu?
Taking.
Assignee: pinkerton → sfraser_bugs
Status: ASSIGNED → NEW
Attached patch Patch
This patch does several things:
1. Flips the "universal_charset_detector" off for people running a build with
   this change for the first time (using a new pref version key), and removes
   the hardcoding of this pref.

2. Adds an "Automatically Detect Page Encoding" item to the bottom of the text
   encodings menu, which toggles the "universal_charset_detector" on and off
   (reloading the page when toggled)

3. Makes the Text encodings menu not auto-update; we update it on display
   (this removes code that assumed that any menu item with a tag > 10 was an
   encoding item)
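The first two items can be sketched roughly as below. This is a Python illustration of the logic only, not the patch itself (which is Camino code): the version key name `camino.prefs_version`, the version number, and the use of an empty string for "detector off" are assumptions made for the example.

```python
DETECTOR_PREF = "intl.charset.detector"
UNIVERSAL = "universal_charset_detector"
VERSION_PREF = "camino.prefs_version"   # hypothetical key name
CURRENT_VERSION = 1                      # hypothetical version number

def migrate(prefs: dict) -> dict:
    """One-time step: switch the detector off for profiles from older builds."""
    if prefs.get(VERSION_PREF, 0) < CURRENT_VERSION:
        prefs[DETECTOR_PREF] = ""        # assumed representation of "off"
        prefs[VERSION_PREF] = CURRENT_VERSION
    return prefs

def toggle_detector(prefs: dict) -> bool:
    """Flip auto-detection and return True if it is now on.

    The caller is expected to reload the current page afterwards.
    """
    now_on = prefs.get(DETECTOR_PREF, "") != UNIVERSAL
    prefs[DETECTOR_PREF] = UNIVERSAL if now_on else ""
    return now_on

profile = migrate({DETECTOR_PREF: UNIVERSAL})   # profile written by an older build
print("after first run:", repr(profile[DETECTOR_PREF]))
print("after toggle, detector on:", toggle_detector(profile))
```

Because the migration is gated on the version key, a user who later turns the detector back on through the menu keeps that choice on subsequent launches.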
Attachment #187903 - Flags: review?(pinkerton)
Status: NEW → ASSIGNED
Wait wait wait. We had to put that pref in, otherwise a number of sites wouldn't
render correctly (or at all) because the encoding would be wrong. I think it was
many Japanese/Chinese/Korean/Russian sites; I don't remember the details, it's
all hazy.

Maybe look back into Bugzilla to see what turning this stuff on was fixing. I
know I turned that stuff on for a reason.
(In reply to comment #19)
> Wait wait wait. We had to put that pref in, otherwise a number of sites wouldn't
> render correctly (or at all) because the encoding would be wrong. I think it was
> many Japanese/Chinese/Korean/Russian sites; I don't remember the details, it's

Well, not quite true unless you're truly multilingual and have to/can read many
different languages (C, J, K, and R). For most people, just using a
language-specific detector, which works better for that language than the
universal detector, should suffice. Actually, Japanese and Russian users need
the JA/RU detectors, but Chinese (both SC and TC) and Korean users don't need a
detector most of the time because, for the latter group, there is a single
dominant encoding which can be set as the default.
I thought we put the pref in early on because we didn't want Camino to have a
Text Encodings menu. But now we have one.
I recall putting it in close to 0.7 shipping, after we had an encoding menu.
Again, that's just a recollection.
That's probably because we used to have it in all-camino.js, but then they
changed some of the i18n prefs to use this funky locale thing, so we had to
move it into code.

I don't see why we need to be any different from Firefox, and this patch makes
us similar.
All those urls look fine with the charset detector off.
The crash logs pasted into bug 281679 seem to have fooled the charset detector,
too; I don't know what it's guessing, because there's *no* check in the text
encoding menu.
It's probably guessing gb18030, which is a superset of gb2312 I believe.
Comment on attachment 187903
Patch

r=pink. good idea on the prefs version as well. we'll need that going forward.
Attachment #187903 - Flags: review?(pinkerton) → review+
Checked in. Note that the first time you run a build with this change (which
will be 20050705), the universal charset detector will be turned off. (If you go
and run an older build, it will get turned back on.) You can toggle the detector
on and off via the bottom item of the Text Encodings submenu.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Pageload went from ~870ms to ~800ms with this change.
(In reply to comment #30)
> Pageload went from ~870ms to ~800ms with this change.

Smaller is better?  Does that mean bug 234683 got fixed (found it by accident;
perhaps what Mike was remembering in comment 19 or 22)?
*** Bug 305775 has been marked as a duplicate of this bug. ***