248304 - Poor text-encoding guessing for sites with no specified encoding (e.g. pound signs on UK sites)

Reporter

Description

•

20 years ago

There have been several forum complaints about garbled text, especially
numbers, on sites that don't have an encoding set (see the "More
Buying Choices" box on the right of the test URL, before and after setting text
encoding to ISO Latin 1). Changing the encoding manually works, but is a pain for
users if it's a site they use often or navigate around a lot.

According to the reports, the other Moz family browsers don't have problems with
the test URL or many other sites, so we should look into doing what they are
doing to guess text encoding.

Stuart Morgan

Reporter

Comment 1

•

20 years ago

*** Bug 249196 has been marked as a duplicate of this bug. ***

Bruce Davidson

Comment 2

•

20 years ago

*** Bug 257383 has been marked as a duplicate of this bug. ***

Bruce Davidson

Comment 3

•

20 years ago

*** Bug 263704 has been marked as a duplicate of this bug. ***

Bruce Davidson

Comment 4

•

20 years ago

Add note about pound signs to summary as this is one of the most noticeable
effects of this bug.

Summary: Poor text-encoding guessing for sites with no specified encoding → Poor text-encoding guessing for sites with no specified encoding (e.g. pound signs on UK sites)

Jasper

Comment 5

•

20 years ago

*** Bug 280172 has been marked as a duplicate of this bug. ***

Simon Fraser [no longer active]

Assignee

Comment 6

•

19 years ago

This is a common complaint from the feedback.

Priority: -- → P2

Target Milestone: --- → Camino1.0

Simon Fraser [no longer active]

Assignee

Updated

•

19 years ago

Depends on: 168526

Jean-Marc Desperrier

Comment 7

•

19 years ago

There are many duplicates filed directly against this bug, but it would be better to check
whether each page is wrongly detected as SJIS, in which case it's a duplicate of bug
168526, or wrongly detected as GB18030, in which case it's a duplicate of bug
181344.

And if it's neither of those, it shows there are more problems than those two to
solve to make the universal detector more reliable, which is useful information
that must not be lost because the duplicate marking was too quick.

From a very superficial check, all the already reported cases seem to be
duplicates of bug 181344 (the pound sign wrongly leads the detector to interpret
the page as GB18030). If so, I think it would be better to tag them as such, so
that it's easier to check whether a fix for bug 181344 fixes them all.

Depends on: 181344

L. H.

Comment 8

•

19 years ago

> According to the reports, the other Moz family browsers don't have problems
> with the test URL or many other sites

This is Camino specific and not one of those bugs.

Simon Fraser [no longer active]

Assignee

Comment 9

•

19 years ago

This probably only happens in Camino because it uses the Universal Charset Detector.

Mike Pinkerton (not reading bugmail)

Comment 10

•

19 years ago

This needs to be fixed for 0.9; we get a lot of feedback about this from non-US
users.

Status: NEW → ASSIGNED

Target Milestone: Camino1.0 → Camino0.9

Smokey Ardisson (offline for a while; not following bugs - do not email)

Comment 11

•

19 years ago

*** Bug 290317 has been marked as a duplicate of this bug. ***

Smokey Ardisson (offline for a while; not following bugs - do not email)

Comment 12

•

19 years ago

*** Bug 292320 has been marked as a duplicate of this bug. ***

Smokey Ardisson (offline for a while; not following bugs - do not email)

Comment 13

•

19 years ago

This is apparently happening on MacFixIt now/today, guessing a Shift JIS encoding.

Simon Fraser [no longer active]

Assignee

Comment 14

•

19 years ago

Attached file: Testcase

Testcase. Interestingly, a single pound sign displays OK, but with more than one, the
encoding is guessed incorrectly.

Simon Fraser [no longer active]

Assignee

Comment 15

•

19 years ago

The testcase is incorrectly detected as GB18030, so that's bug 181344.
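
For readers unfamiliar with why a pound sign can look like Chinese text, here is a minimal Python sketch. It only checks which byte sequences are valid in each encoding; it does not reproduce the detector's statistical logic, and it assumes (this is not confirmed by the testcase) that the page contains the ISO-8859-1 pound byte 0xA3 more than once in a row. The helper name is_valid is made up for illustration.

```python
# Illustration only: the real Universal Charset Detector is statistical.
# This just shows which byte sequences are *valid* in each encoding.

def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes without error in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

one_pound = b"\xa3"       # a single pound sign in ISO-8859-1
two_pounds = b"\xa3\xa3"  # two adjacent pound signs (an assumed input)

# In ISO-8859-1 every byte is a character, so both inputs are valid.
latin1_text = two_pounds.decode("iso-8859-1")

# In GB18030, 0xA3 is a lead byte: alone it is an incomplete sequence,
# but two of them form one well-formed double-byte character.
gb_text = two_pounds.decode("gb18030")

print(is_valid(one_pound, "gb18030"))   # lone lead byte
print(is_valid(two_pounds, "gb18030"))  # complete double-byte pair
print(repr(latin1_text), repr(gb_text))
```

Because the pair is legal in both encodings, a detector has to guess between them, which is one plausible reason the misdetection only shows up with more than one pound sign.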

Simon Fraser [no longer active]

Assignee

Comment 16

•

19 years ago

OK, here's the deal.

We hardcode the "intl.charset.detector" pref to "universal_charset_detector" in
[PreferencesManager syncMozillaPrefs], which means that every Camino user will
have this in their prefs.js file.

If I take that out (and nuke the pref by hand), the testcase works.

So do we want to expose toggling "Auto Detect" on and off via the Text Encoding
menu?

Simon Fraser [no longer active]

Assignee

Comment 17

•

19 years ago

Taking.

Assignee: pinkerton → sfraser_bugs

Status: ASSIGNED → NEW

Simon Fraser [no longer active]

Assignee

Comment 18

•

19 years ago

Attached patch: Patch

This patch does several things:
1. Flips the "universal_charset_detector" off for people running a build with
   this change for the first time (using a new pref version key), and removes
   the hardcoding of this pref.

2. Adds an "Automatically Detect Page Encoding" item to the bottom of the text
   encodings menu, which toggles the "universal_charset_detector" on and off
   (reloading the page when toggled)

3. Makes the Text encodings menu not auto-update; we update it on display
   (this removes code that assumed that any menu item with a tag > 10 was an
   encoding item)
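
The first two items can be sketched roughly as follows. This is a Python illustration of the idea, not the patch itself (the real change is Objective-C in PreferencesManager). The pref name and value come from comment 16; the version key name, the version number, and the use of an empty string to mean "detector off" are assumptions made for the example.

```python
# Hypothetical sketch of a one-time pref migration plus a menu toggle.
DETECTOR_PREF = "intl.charset.detector"    # pref name from comment 16
UNIVERSAL = "universal_charset_detector"   # pref value from comment 16
VERSION_PREF = "example.prefs_version"     # hypothetical version key
CURRENT_VERSION = 1                        # hypothetical version number

def migrate_prefs(prefs: dict) -> dict:
    """Turn the detector off once, the first time a new build runs."""
    if prefs.get(VERSION_PREF, 0) < CURRENT_VERSION:
        prefs[DETECTOR_PREF] = ""          # assumption: empty means "off"
        prefs[VERSION_PREF] = CURRENT_VERSION
    return prefs

def toggle_auto_detect(prefs: dict) -> bool:
    """Flip the detector pref, as the menu item would; return new state."""
    enabled = prefs.get(DETECTOR_PREF) != UNIVERSAL
    prefs[DETECTOR_PREF] = UNIVERSAL if enabled else ""
    return enabled

prefs = {DETECTOR_PREF: UNIVERSAL}  # state left behind by an older build
migrate_prefs(prefs)                # first run: detector switched off
toggle_auto_detect(prefs)           # user turns it back on via the menu
migrate_prefs(prefs)                # later runs leave the user's choice alone
```

The version key is what lets the migration run exactly once, so a user who re-enables the detector from the menu is not overridden on the next launch.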

Attachment #187903 - Flags: review?(pinkerton)

Simon Fraser [no longer active]

Assignee

Updated

•

19 years ago

Status: NEW → ASSIGNED

Mike Pinkerton (not reading bugmail)

Comment 19

•

19 years ago

wait wait wait. we had to put that pref in otherwise a number of sites wouldn't
render correctly (or at all) because the encoding would be wrong. I think it was
many japanese/chinese/korean/russian sites, I don't remember the details, it's
all hazy. 

Maybe look back into bugzilla to see what turning this stuff on was fixing. i
know i turned that stuff on for a reason.

Jungshik Shin

Comment 20

•

19 years ago

(In reply to comment #19)
> wait wait wait. we had to put that pref in otherwise a number of sites wouldn't
> render correctly (or at all) because the encoding would be wrong. I think it was
> many japanese/chinese/korean/russian sites, I don't remember the details, it's

Well, that's not quite true unless you're truly multilingual and have to (or can) read many
different languages (C, J, K, and R). For most people, just using a language-specific
detector, which works better for that language than the universal detector, should
suffice. Actually, Japanese and Russian users need the JA/RU detectors, but Chinese (both
SC and TC) and Korean users don't need a detector most of the time, because for the latter
group there is a single dominant encoding which can be set as the default.

Simon Fraser [no longer active]

Assignee

Comment 21

•

19 years ago

I thought we put the pref in early on because we didn't want Camino to have a
Text Encodings menu. But now we have one.

Mike Pinkerton (not reading bugmail)

Comment 22

•

19 years ago

i recall putting it in close to 0.7 shipping, after we had an encoding menu.
again, that's just a recollection.

Simon Fraser [no longer active]

Assignee

Comment 23

•

19 years ago

That's probably because we used to have it in all-camino.js, but then they
changed some of the i18n prefs to use this funky locale thing, so we had to
move it into code.

I don't see why we need to be any different from Firefox, and this patch makes
us similar.

Simon Fraser [no longer active]

Assignee

Comment 24

•

19 years ago

Bugs to look at when testing for regressions:
bug 180703 (duped to bug 153150)

Test pages:
http://www.rest.co.il/yoezer/ (hebrew)
http://aoshimak.tripod.co.jp/
http://www.geocities.co.jp/
http://forums.maccentral.com/wwwthreads/showthreaded.php?Cat=&Board=Lounge&Number=270015&Search=true&Forum=Lounge&Words=BiggerFoot&Match=Username&Searchpage=0&Limit=25&Old=1week&Main=270015
http://www.tvland.com/shows/
http://slashdot.org/article.pl?sid=02/10/20/156247&mode=thread&tid=141
https://bugzilla.mozilla.org/show_bug.cgi?id=168526 (yes, the whole bug)

Simon Fraser [no longer active]

Assignee

Comment 25

•

19 years ago

All those urls look fine with the charset detector off.

Smokey Ardisson (offline for a while; not following bugs - do not email)

Comment 26

•

19 years ago

The crash logs pasted into bug 281679 seem to have fooled the charset detector,
too; I don't know what it's guessing, because there's *no* check in the text
encoding menu.

Simon Fraser [no longer active]

Assignee

Comment 27

•

19 years ago

It's probably guessing gb18030, which is a superset of gb2312 I believe.
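
The superset relationship can be checked quickly in Python, assuming the standard library's gb2312 and gb18030 codecs. The sample strings are arbitrary choices for illustration.

```python
# Text encoded as GB2312 should decode unchanged under GB18030.
sample = "中文编码"
gb2312_bytes = sample.encode("gb2312")
round_trip = gb2312_bytes.decode("gb18030")

# GB18030 also covers characters that GB2312 cannot represent at all.
emoji = "\U0001F600"
try:
    emoji.encode("gb2312")
    gb2312_has_emoji = True
except UnicodeEncodeError:
    gb2312_has_emoji = False
gb18030_emoji = emoji.encode("gb18030")

print(round_trip == sample, gb2312_has_emoji, len(gb18030_emoji))
```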

Mike Pinkerton (not reading bugmail)

Comment 28

•

19 years ago

Comment on attachment 187903
Patch

r=pink. good idea on the prefs version as well. we'll need that going forward.

Attachment #187903 - Flags: review?(pinkerton) → review+

Simon Fraser [no longer active]

Assignee

Comment 29

•

19 years ago

Checked in. Note that the first time you run a build with this change (which
will be 20050705), the universal charset detector will be turned off. (If you go
and run an older build, it will get turned back on.) You can toggle the detector
on and off via the bottom item of the Text Encodings submenu.

Status: ASSIGNED → RESOLVED

Closed: 19 years ago

Resolution: --- → FIXED

Simon Fraser [no longer active]

Assignee

Comment 30

•

19 years ago

Pageload went from ~870ms to ~800ms with this change.

Smokey Ardisson (offline for a while; not following bugs - do not email)

Comment 31

•

19 years ago

(In reply to comment #30)
> Pageload went from ~870ms to ~800ms with this change.

Smaller is better?  Does that mean bug 234683 got fixed (found it by accident;
perhaps what Mike was remembering in comment 19 or 22)?

Torben

Comment 32

•

19 years ago

*** Bug 305775 has been marked as a duplicate of this bug. ***

Testcase, 19 years ago, Simon Fraser [no longer active], 257 bytes, text/html
Patch, 19 years ago, Simon Fraser [no longer active], 10.62 KB, patch, mikepinkerton: review+