Closed Bug 168526 Opened 22 years ago Closed 18 years ago

Windows-1252 detected as Shift_JIS by auto-detect universal

Categories

(Core :: Internationalization, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


VERIFIED FIXED

People

(Reporter: mikeslvr, Assigned: jmdesp)

References


Details

(Keywords: intl)

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.0.1) Gecko/20020909
Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.0.1) Gecko/20020909

When browsing the linked forums I have noticed some strange characters being
added to the layout. On two occasions, a Japanese character has been rendered in
place of the contractions "I'm" and "We'll".

 The other issue involves some odd characters being generated to the left of
icon images; these increase in width as the thread becomes further justified to
the right.

 Two images will be attached for reference.

Reproducible: Sometimes

Steps to Reproduce:
1.Visit this link:
http://forums.maccentral.com/wwwthreads/showthreaded.php?Cat=&Board=Lounge&Number=270015&Search=true&Forum=Lounge&Words=BiggerFoot&Match=Username&Searchpage=0&Limit=25&Old=1week&Main=270015

Actual Results:  
 Strange behavior is evident when visiting the link.

Expected Results:  
 Text should render without the Japanese substitutions, and the odd "icon
companions" shouldn't be present.

The following settings have been added via the user.js file:

user_pref("image.animation_mode", "none");
user_pref("network.cookie.lifetime.enabled", true);
user_pref("network.http.pipelining", true);
user_pref("font.minimum-size.x-western", 10);
Whoever typed the "I知 back!" text used a curly apostrophe. The content of the
page is script-generated, so is this an encoding issue?

It does work in Mozilla. I wonder if this will start working when bug 160317 is
resolved; that is, when bug 111728's fix is migrated to Chimera.
Severity: trivial → normal
Keywords: intl
I was able to get the same effect in Mozilla by changing the encoding to
Japanese. Perhaps this will be cleared up with Bug 153150.
*** Bug 175195 has been marked as a duplicate of this bug. ***
Confirming in 2002111304 on 10.2.2. Updating summary to be more specific.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Strange characters being generated in web pages → Kanji characters being substituted for apostrophes
*** Bug 175754 has been marked as a duplicate of this bug. ***
This is an issue with Universal Auto-Detect. Turn off Auto-Detect, go to the
given URL and see the page render with your default encoding (assuming
ISO-8859-1). Turn on Universal Auto-Detect, and it will re-render, now with the
bytes that Windows-1252 uses for characters outside ISO-8859-1 (miscellaneous
punctuation symbols such as the right single quote, high up in Unicode)
reinterpreted as Shift_JIS sequences.

The auto-detection algorithm seems to assume that whenever ASCII is mixed with
characters around U+2000 it must be Japanese, even though CJK, Hiragana,
Bopomofo and what have you are nowhere near this block.
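
A minimal sketch (Python, added for illustration; not part of the original
report) of why these bytes look like valid Shift_JIS: the Windows-1252 curly
apostrophe is byte 0x92, a legal Shift_JIS lead byte, and the ASCII letter
after it is a legal trail byte, so the pair decodes as a single kanji.

# "I'm" typed with a Windows-1252 curly apostrophe (0x92) followed by
# ASCII "m" (0x6D) forms the valid Shift_JIS pair 0x926D, the kanji U+77E5.
raw = "I\u2019m back!".encode("windows-1252")  # b'I\x92m back!'
print(raw.decode("shift_jis"))                 # prints: I知 back!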
Unless I'm missing something, there's no option to change character encoding
handling in Chimera (0.6). Also, Mozilla (1.2b), both with and without
autodetect, seems to render just fine.
There may be no GUI to change it, but it is there, and it is on by default.

The given URI does render in Shift_JIS with Universal Auto-Detect on in Mozilla
1.2b 2002103110. So does _this_ page.
Aha, you're right, autodetect universal is broken in 1.2b. I didn't notice that
option before.

Doh, Bugzilla is very angry with me. Someone "more empowered" should change the
product to Browser, the component to Internationalization, and the summary to
include autodetect universal.

Oh, and the "Chimera doesn't allow selection of character encodings" problem is
being tracked as bug 153150.
This seems to be fixed in Chimera 2002111604. If someone else can confirm this,
I'll mark it fixed. Anybody know what was changed?
I guess they just turned off auto-detection as the default setting in Chimera.
I've confirmed that the Kanji for apostrophes behavior does occur in Mozilla
1.3a when encoding is set to auto-detect universal. Changing product and summary
to reflect this.
Component: Page Layout → Internationalization
Product: Chimera → Browser
Summary: Kanji characters being substituted for apostrophes → Kanji being substituted for apostrophes under auto-detect universal
Version: unspecified → Trunk
Why does bryner have this bug?
-> component default owner
Assignee: bryner → smontagu
QA Contact: winnie → ylong
Bob, who should take autodetection bugs?
Summary: Kanji being substituted for apostrophes under auto-detect universal → Windows-1252 detected as Shift_JIS by auto-detect universal
Blocks: 180461
Who's the owner for this bug?
Attachment 107923 (from bug 182976) exhibits this symptom.
  Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8a3) Gecko/20040726

Updating platform/os.
OS: MacOS X → All
Hardware: Macintosh → All
*** Bug 182976 has been marked as a duplicate of this bug. ***
This bug is still an issue, as evidenced by the fact that this bug shows up as
EUC-JP, thanks to comment 3.  Instead of just checking for the presence of
certain characters, perhaps an algorithm could be worked out that looks at
relative numbers of characters?  This page, for example, has only one Japanese
character (if it even is Japanese) out of thousands.
Blocks: 264871
I'm currently trying to see what can be done to solve this bug.

I intend to work on both the SJIS and the gb18030 problems. I will open a thread
on netscape.public.mozilla.i18n to discuss my findings about the workings of the
detector, and share ideas about exactly how it should be enhanced.

This case is more delicate than it might sound. The detector filters out all
pure ASCII characters that do not immediately follow a high-bit character.
The sample pages shown here in most cases only use the curly apostrophe, and
this character forms valid SJIS code points with many of the ASCII characters
that can follow it. So after filtering, what we get is a string consisting only
of valid SJIS sequences, to which the detector quite reasonably gives a high
probability of being SJIS.
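
A hedged model (hypothetical Python, not the actual Mozilla detector code) of
the filtering step just described: drop every pure-ASCII byte except the one
immediately following a high-bit byte.

# Hypothetical reconstruction of the filtering described above:
def filter_for_detector(data: bytes) -> bytes:
    out = bytearray()
    keep_next = False
    for b in data:
        if b >= 0x80:            # high-bit byte: always kept
            out.append(b)
            keep_next = True
        elif keep_next:          # first ASCII byte after it: kept too
            out.append(b)
            keep_next = False
    return bytes(out)

sample = "I\u2019m here. We\u2019ll see.".encode("windows-1252")
print(filter_for_detector(sample))  # b'\x92m\x92l' -- only valid SJIS pairs

On a mostly-ASCII Windows-1252 page, everything the SJIS prober gets to see is
a valid SJIS lead/trail pair, which is exactly the failure mode described.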

We should find a way to adequately weight the fact that the curly apostrophe is
also often found in windows-1252, and probably try to use the fact that those
pages have a much higher proportion of ASCII characters. But I don't want this
to break, for example, an English page teaching Japanese using SJIS (using a
lot of English and only a few Japanese words).

According to shanjian, the latin1 detector is not very precise, so he had to
halve the confidence score it gives relative to the Asian detectors.
But in this situation the SJIS detector gets fairly confident (and it has to!
We cannot lower the score of valid SJIS data too much), and latin1 cannot
compete once its confidence is halved. But globally raising the confidence of
latin1 would break some pages that needed the divide-by-two.
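
To make the arithmetic concrete (these numbers are illustrative, not Mozilla's
actual scores):

# Illustrative values only: why halving latin1 lets SJIS win here.
sjis_conf   = 0.99       # filtered input is all valid SJIS pairs
latin1_conf = 1.0 / 2    # raw latin1 score, divided by two as described
print("Shift_JIS" if sjis_conf > latin1_conf else "ISO-8859-1")  # Shift_JIS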

I know that French pages almost never have this problem, because they use many
high-bit characters other than the curly apostrophe, so very quickly the text
doesn't look like SJIS at all, and the latin1 detector wins.

Therefore I am strongly interested in seeing pages that get wrongly identified
and use high-bit characters other than the curly apostrophe, to see whether a
tailor-made solution for the curly apostrophe is enough, or whether other
characters must be considered too.
Status: NEW → ASSIGNED
Please see bug 220555 comment 14 for an idea of mine based on a similar
analysis. Before proposing a patch for review I would want to find a pile of
auto-detection edge-case test cases to see what effect it would have.
I wonder if this is related to the problem that Camino has when displaying pages
with the British pound (£) symbol?
Simon, I saw bug 220555 comment 14, but in my testing the SJIS detector
sometimes rises to 0.99 confidence because of the curly apostrophe, so the
change you suggested will in any case not be enough on its own. And that change
would affect many cases other than this one, so I'm a bit wary of that solution
unless full testing proves it's OK.

In bug 171813 comment 23 (that bug has very interesting comments by shanjian
about the workings of the universal detector), Frank Tang said there was a set
of tests available for testing auto-detection. Maybe we should ask him where
that is, and whether the Foundation would be able to access it.
Assignee: smontagu → jmdesp
Status: ASSIGNED → NEW
(In reply to comment #22)
> The detector filters out all pure ASCII characters that do not immediately
> follow a high-bit character.

To me (and I admit I know nothing about how the detectors work), this seems
like a problem - maybe even the core problem here.
A page really encoded in, say, Shift_JIS, is unlikely to be 99% pure ASCII.
Therefore, instead of ignoring these characters, they should serve to increase
the probability of Latin-1 (actually, all the Latin-x encodings), Windows-1252,
and UTF-8, and to decrease the probability of non-Latin encodings.

Does this make any sense?
absolutely
Well, I had prepared a nice explanation that I was sure I had committed, but...

There are some legitimate SJIS pages that have very few SJIS characters; one
example: http://users.skynet.be/mangaguide/au1948.html
Those are quite special cases, but so are the Latin pages that hit this problem.
Most French or German pages have enough accented characters that they never hit
it; this problem and its variations mostly affect US/UK users, whose pages are
more likely to have only a few non-US-ASCII characters.

In the reference case, SJIS rises to 99%, so we would have to raise ISO-8859-1
to 100% with this mechanism for it to win. Not very subtle, and I wouldn't feel
very happy with a solution based on such an a priori assumption about the pages.
Also, I wouldn't like this solution to add more computation and slow down the
apparently already slow auto-detection.

Part of the problem is that the latin1 detector doesn't seem to be smart at all.
It apparently rises to 100% as soon as there is a single non-US-ASCII character,
hence the divide-by-two that turns it into a baseline used when none of the
other detectors gets over 50% confidence. We could try to make it smarter and
drop the divide-by-two, but the impact would be large and would require much
testing. And it wouldn't be easy anyway, as ISO-8859-1 is used by so many
different languages.

The SJIS detector itself doesn't seem so smart either, as the curly apostrophe
in combination with what follows doesn't yield particularly common characters.
So another option could be to lower the SJIS confidence on sequences like that
one, so that ISO-8859-1 can win more easily.

Anyway, taking all of that into account, finding the best solution requires
more experimenting and testing than I have time for at the moment :-(
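
One way to picture the "lower the SJIS confidence on such sequences" option
(a hypothetical penalty in Python, not a proposed patch):

# Hypothetical: shrink the SJIS confidence for each 0x92-plus-ASCII pair,
# exactly the byte pattern a Windows-1252 curly apostrophe produces.
def sjis_apostrophe_penalty(data: bytes, confidence: float) -> float:
    pairs = sum(1 for a, b in zip(data, data[1:])
                if a == 0x92 and 0x40 <= b <= 0x7E)
    return confidence * (0.9 ** pairs)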
Even if I don't have time to actually work on it, I still think about this bug.

I now think it might be worth trying to both raise the Latin1 probability and
lower the multi-byte converters' probability when the ratio of 8-bit characters
is too low. We'd probably then misidentify pages similar to the mangaguide one,
but maybe they're rare enough?
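
A rough sketch of that ratio idea (hypothetical scaling and thresholds, not a
patch):

# Hypothetical: damp multi-byte prober confidence and favour latin1 when
# the page has almost no high-bit bytes.
def rescale(mb_conf: float, latin1_conf: float, data: bytes):
    ratio = sum(b >= 0x80 for b in data) / max(len(data), 1)
    if ratio < 0.01:                          # arbitrary cut-off: <1% high-bit
        mb_conf *= ratio / 0.01               # scale CJK confidence down
        latin1_conf = max(latin1_conf, 0.75)  # boost the latin1 baseline
    return mb_conf, latin1_conf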

But I'm still bothered that short SJIS pages whose visible content is all SJIS
might have so much non-displayed pure ASCII content (headers, JS, CSS) that it
would lower the ratio to the point where they would be misidentified too. So it
doesn't look ideal.

Unless we could filter out all content that will not be displayed and run the
identification only on the "visible" content.

Maybe the best way out is to check how much pure ASCII there is /around/ the
8-bit characters, rather than taking the whole page into account.
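
A sketch of that last idea (hypothetical, with an arbitrary window size): feed
the detectors only the context around each high-bit byte, so off-screen ASCII
boilerplate stops dominating the statistics.

# Hypothetical windowing pass, not Mozilla code: collect a few bytes of
# context around each high-bit byte and run detection on that instead.
def context_windows(data: bytes, radius: int = 8) -> bytes:
    out = bytearray()
    for i, b in enumerate(data):
        if b >= 0x80:
            out += data[max(0, i - radius): i + radius + 1]
    return bytes(out)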
The original test case link appears to no longer trigger this bug.  Could someone attach a test case?

The Latin1 prober appears to be as dumb as it gets.  No matter what you feed it, it appears to always return a confidence of 0.5.
(In reply to comment #30)
> The original test case link appears to no longer trigger this bug.  Could
> someone attach a test case?

Yeah, it appears that URL is now being served with UTF-8.  For a test case, see comment 18 -- I have that email message in my folders, and just checked: symptom still there.
The message in comment 18 has a single non-ASCII character (in either charset).
There's really not enough for the detector to get its hands on. I doubt that
case is fixable.
Well, if one character is not enough to assume a particular charset, it doesn't
make sense to default to the foreign charset; better to fall back on the
default.
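
That policy as a sketch (the threshold and the detect callback are
hypothetical):

# Hypothetical: with too few non-ASCII bytes to be meaningful, skip
# detection entirely and keep the configured default charset.
def detect_or_default(data: bytes, detect, default: str = "ISO-8859-1") -> str:
    if sum(1 for b in data if b >= 0x80) < 2:  # arbitrary threshold
        return default
    return detect(data)  # stand-in for the real prober chain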
Fixed for me by the checkin for bug 306272
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
It sure seems to be; using the test case cited in comment 18, TB 3a1-0806 exhibits the problem and -0808 does not.  Thanks, Simon!
Status: RESOLVED → VERIFIED