Poor text-encoding guessing for sites with no specified encoding (e.g. pound signs on UK sites)

RESOLVED FIXED in Camino0.9

Status

P2
normal
RESOLVED FIXED
15 years ago
14 years ago

People

(Reporter: stuart.morgan+bugzilla, Assigned: sfraser_bugs)

Tracking

unspecified
Camino0.9
PowerPC
macOS

Details

(URL)

Attachments

(2 attachments)

(Reporter)

Description

15 years ago
There have been several forum complaints about sites with garbled text,
especially numbers, on sites that don't have an encoding set (see the "More
Buying Choices" box on the right of the test URL, before and after setting text
encoding to ISO Latin 1). Changing the encoding manually works, but is a pain
for users if it's a site they use often or navigate around very much.

According to the reports, the other Moz family browsers don't have problems with
the test URL or many other sites, so we should look into doing what they are
doing to guess text encoding.
(Reporter)

Comment 1

15 years ago
*** Bug 249196 has been marked as a duplicate of this bug. ***

Comment 2

15 years ago
*** Bug 257383 has been marked as a duplicate of this bug. ***

Comment 3

15 years ago
*** Bug 263704 has been marked as a duplicate of this bug. ***

Comment 4

15 years ago
Add note about pound signs to summary as this is one of the most noticeable
effects of this bug.
Summary: Poor text-encoding guessing for sites with no specified encoding → Poor text-encoding guessing for sites with no specified encoding (e.g. pound signs on UK sites)

Comment 5

14 years ago
*** Bug 280172 has been marked as a duplicate of this bug. ***
(Assignee)

Comment 6

14 years ago
This is a common complaint from the feedback.
Priority: -- → P2
Target Milestone: --- → Camino1.0
(Assignee)

Updated

14 years ago
Depends on: 168526
There are many duplicates marked directly against this bug, but it would be
better to check whether the page is wrongly detected as SJIS, in which case it's
a duplicate of bug 168526, or wrongly detected as GB18030, in which case it's a
duplicate of bug 181344.

And if it's neither of those, it shows there are more problems than those two to
solve to make the universal detector more reliable, which is interesting info
that must not be lost because the duplicate marking was too quick.

From a very superficial check, all the already reported cases seem to be
duplicates of bug 181344 (the pound sign wrongly leads the detector to interpret
the page as GB18030). If so, I think it would be better to tag them as such, so
that it's easier to check if a fix for bug 181344 fixes them all.
Depends on: 181344

Comment 8

14 years ago
> According to the reports, the other Moz family browsers don't have problems
> with the test URL or many other sites

This is Camino specific and not one of those bugs.
(Assignee)

Comment 9

14 years ago
This probably only happens in Camino because it uses the Universal Charset Detector.
This needs to be fixed for 0.9; we get a lot of feedback about this from non-US
users.
Status: NEW → ASSIGNED
Target Milestone: Camino1.0 → Camino0.9
*** Bug 290317 has been marked as a duplicate of this bug. ***
*** Bug 292320 has been marked as a duplicate of this bug. ***
This is apparently happening on MacFixIt now/today, guessing a Shift JIS encoding.
(Assignee)

Comment 14

14 years ago
Created attachment 187749 [details]
Testcase

Testcase. Interestingly, a single pound sign displays OK, but with more than
one, the encoding is guessed incorrectly.
(Assignee)

Comment 15

14 years ago
The testcase is incorrectly detected as GB18030, so that's bug 181344.
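
To illustrate why a Latin-1 pound sign can look like GB18030 to a detector, here is a minimal Python sketch. It only checks the byte ranges GB18030 allows for two-byte and four-byte sequences; it is not the universal detector's actual logic, and the sample byte strings are hypothetical rather than taken from the attached testcase.

```python
def is_gb18030_four_byte(chunk: bytes) -> bool:
    """True if chunk has the byte pattern of one GB18030 four-byte sequence."""
    if len(chunk) != 4:
        return False
    b1, b2, b3, b4 = chunk
    return (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39)


def is_gb18030_two_byte(chunk: bytes) -> bool:
    """True if chunk has the byte pattern of one GB18030 two-byte sequence."""
    if len(chunk) != 2:
        return False
    lead, trail = chunk
    return 0x81 <= lead <= 0xFE and (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE)


# Hypothetical page bytes with no declared charset.
# In ISO Latin 1, 0xA3 is the pound sign and 0x30-0x39 are the digits 0-9.
two_prices = bytes([0xA3, 0x31, 0xA3, 0x32])  # pound 1, pound 2 in Latin-1
one_price = bytes([0xA3, 0x35])               # pound 5 in Latin-1

# Pound, digit, pound, digit fits the GB18030 four-byte pattern...
print(is_gb18030_four_byte(two_prices))
# ...while a lone pound sign followed by a digit is not a valid two-byte sequence.
print(is_gb18030_two_byte(one_price))
# Decoded as Latin-1, the same four bytes are just two prices.
print(two_prices.decode("latin-1"))
```

Under these assumptions, two pound signs each followed by a digit form a structurally valid GB18030 sequence, while a single one does not. That may be related to the testcase behaving differently with one pound sign than with several, but this has not been confirmed against the detector code.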
(Assignee)

Comment 16

14 years ago
OK, here's the deal.

We hardcode the "intl.charset.detector" pref to "universal_charset_detector" in
[PreferencesManager syncMozillaPrefs], which means that every Camino user will
have this in their prefs.js file.

If I take that out (and nuke the pref by hand), the testcase works.

So do we want to expose toggling "Auto Detect" on and off via the Text Encoding
menu?
(Assignee)

Comment 17

14 years ago
Taking.
Assignee: pinkerton → sfraser_bugs
Status: ASSIGNED → NEW
(Assignee)

Comment 18

14 years ago
Created attachment 187903 [details] [diff] [review]
Patch

This patch does several things:
1. Flips the "universal_charset_detector" off for people running a build with
   this change for the first time (using a new pref version key), and removes
   the hardcoding of this pref.

2. Adds an "Automatically Detect Page Encoding" item to the bottom of the text
   encodings menu, which toggles the "universal_charset_detector" on and off
   (reloading the page when toggled)

3. Makes the Text encodings menu not auto-update; we update it on display
   (this removes code that assumed that any menu item with a tag > 10 was an
   encoding item)
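
The one-time flip in item 1 and the toggle in item 2 can be sketched roughly as follows. This is Python pseudocode for the idea only, not the Objective-C in the patch. The version key name, the version number, and the use of an empty string to mean "detector off" are all assumptions made for illustration.

```python
DETECTOR_PREF = "intl.charset.detector"
UNIVERSAL = "universal_charset_detector"
PREFS_VERSION_KEY = "camino.prefs_version"  # hypothetical key name
CURRENT_PREFS_VERSION = 1                   # hypothetical version number


def migrate_prefs(prefs: dict) -> dict:
    """Turn the detector off once, the first time a new build sees old prefs."""
    migrated = dict(prefs)
    if migrated.get(PREFS_VERSION_KEY, 0) < CURRENT_PREFS_VERSION:
        migrated[DETECTOR_PREF] = ""  # assumption: empty string means "off"
        migrated[PREFS_VERSION_KEY] = CURRENT_PREFS_VERSION
    return migrated


def toggle_auto_detect(prefs: dict) -> dict:
    """Flip the detector pref, as the new menu item would."""
    toggled = dict(prefs)
    is_on = toggled.get(DETECTOR_PREF) == UNIVERSAL
    toggled[DETECTOR_PREF] = "" if is_on else UNIVERSAL
    return toggled


# Prefs as an older build would have left them (detector hardcoded on).
old_prefs = {DETECTOR_PREF: UNIVERSAL}
first_run = migrate_prefs(old_prefs)       # detector switched off once
user_choice = toggle_auto_detect(first_run)  # user turns it back on
second_run = migrate_prefs(user_choice)    # later launches leave it alone
```

Because the version key is written along with the flip, the user's later choice from the menu is not overridden on subsequent launches.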
Attachment #187903 - Flags: review?(pinkerton)
(Assignee)

Updated

14 years ago
Status: NEW → ASSIGNED
wait wait wait. we had to put that pref in otherwise a number of sites wouldn't
render correctly (or at all) because the encoding would be wrong. I think it was
many japanese/chinese/korean/russian sites, I don't remember the details, it's
all hazy. 

Maybe look back into Bugzilla to see what turning this stuff on was fixing. I
know I turned that stuff on for a reason.

Comment 20

14 years ago
(In reply to comment #19)
> wait wait wait. we had to put that pref in otherwise a number of sites wouldn't
> render correctly (or at all) because the encoding would be wrong. I think it was
> many japanese/chinese/korean/russian sites, I don't remember the details, it's

Well, not quite true unless you're truly multilingual and have to/can read many
different languages (C, J, K, and R). For most people, just using a
language-specific detector, which works better for them than the universal
detector, should suffice. Actually, Japanese and Russian users need JA/RU
detectors, but Chinese (both SC and TC) and Korean users don't need a detector
most of the time, because for the latter group there is a single dominant
encoding which can be set as the default.
(Assignee)

Comment 21

14 years ago
I thought we put the pref in early on because we didn't want Camino to have a
Text Encodings menu. But now we have one.
I recall putting it in close to 0.7 shipping, after we had an encoding menu.
Again, that's just a recollection.
(Assignee)

Comment 23

14 years ago
That's probably because we used to have it in all-camino.js, but then they
changed some of the i18n prefs to use this funky locale thing, so we had to
move it into code.

I don't see why we need to be any different than Firefox, and this patch makes
us similar.
(Assignee)

Comment 25

14 years ago
All those urls look fine with the charset detector off.
The crash logs pasted into bug 281679 seem to have fooled the charset detector,
too; I don't know what it's guessing, because there's *no* check in the text
encoding menu.
(Assignee)

Comment 27

14 years ago
It's probably guessing gb18030, which is a superset of gb2312 I believe.
Comment on attachment 187903 [details] [diff] [review]
Patch

r=pink. good idea on the prefs version as well. we'll need that going forward.
Attachment #187903 - Flags: review?(pinkerton) → review+
(Assignee)

Comment 29

14 years ago
Checked in. Note that the first time you run a build with this change (which
will be 20050705), the universal charset detector will be turned off. (If you go
and run an older build, it will get turned back on.) You can toggle the detector
on and off via the bottom item of the Text Encodings submenu.
Status: ASSIGNED → RESOLVED
Last Resolved: 14 years ago
Resolution: --- → FIXED
(Assignee)

Comment 30

14 years ago
Pageload went from ~870ms to ~800ms with this change.
(In reply to comment #30)
> Pageload went from ~870ms to ~800ms with this change.

Smaller is better?  Does that mean bug 234683 got fixed (found it by accident;
perhaps what Mike was remembering in comment 19 or 22)?

Comment 32

14 years ago
*** Bug 305775 has been marked as a duplicate of this bug. ***