Closed Bug 168526 Opened 22 years ago Closed 18 years ago

Windows-1252 detected as Shift_JIS by auto-detect universal

Categories

(Core :: Internationalization, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


VERIFIED FIXED

People

(Reporter: mikeslvr, Assigned: jmdesp)

References


Details

(Keywords: intl)

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.0.1) Gecko/20020909
Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.0.1) Gecko/20020909

When browsing the linked forums I have noticed some strange characters being
added to the layout. On two occasions, a Japanese character has been rendered in
place of the contractions "I'm" and "We'll".

 The other issue involves some odd characters being generated to the left of
icon images; these increase in width as the thread becomes further justified to
the right.

 Two images will be attached for reference.

Reproducible: Sometimes

Steps to Reproduce:
1.Visit this link:
http://forums.maccentral.com/wwwthreads/showthreaded.php?Cat=&Board=Lounge&Number=270015&Search=true&Forum=Lounge&Words=BiggerFoot&Match=Username&Searchpage=0&Limit=25&Old=1week&Main=270015

Actual Results:  
 Strange behavior is evident when visiting the link.

Expected Results:  
 Text should render without the Japanese substitutions, and the odd "icon
companions" shouldn't be present.

The following settings have been added via the user.js file:

user_pref("image.animation_mode", "none");
user_pref("network.cookie.lifetime.enabled", true);
user_pref("network.http.pipelining", true);
user_pref("font.minimum-size.x-western", 10);
Whoever typed the "I知 back!" text used a curly apostrophe. The content of the
page is script-generated, so is this an encoding issue?

It does work in Mozilla. I wonder if this will start working when bug 160317 is
resolved; that is, when bug 111728's fix is migrated to Chimera.
Severity: trivial → normal
Keywords: intl
I was able to get the same effect in Mozilla by changing the encoding to
Japanese. Perhaps this will be cleared up with Bug 153150.
*** Bug 175195 has been marked as a duplicate of this bug. ***
Confirming in 2002111304 on 10.2.2. Updating summary to be more specific.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Strange characters being generated in web pages → Kanji characters being substituted for apostrophes
*** Bug 175754 has been marked as a duplicate of this bug. ***
This is an issue with Universal Auto-Detect. Turn off Auto-Detect, go to the
given URL and see the page render with your default encoding (assuming
ISO-8859-1). Turn on Universal Auto-Detect, and it will re-render, now with the
bytes that Windows-1252 uses for characters outside ISO-8859-1 (miscellaneous
punctuation symbols such as the right single quote, high up in Unicode)
reinterpreted as Shift_JIS sequences.

The auto-detection algorithm seems to assume that whenever ASCII is mixed with
characters around U+2000 it must be Japanese, even though CJK, Hiragana,
Bopomofo and what have you are nowhere near this block.
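
A minimal sketch (Python, added for illustration; not part of the original
report) of why these bytes look like valid Shift_JIS: the Windows-1252 curly
apostrophe is byte 0x92, a legal Shift_JIS lead byte, and the ASCII letter
after it is a legal trail byte, so the pair decodes as a single kanji.

# "I'm" typed with a Windows-1252 curly apostrophe (0x92) followed by
# ASCII "m" (0x6D) forms the valid Shift_JIS pair 0x926D, the kanji U+77E5.
raw = "I\u2019m back!".encode("windows-1252")  # b'I\x92m back!'
print(raw.decode("shift_jis"))                 # prints: I知 back!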
Unless I'm missing something, there's no option to change character encoding
handling in Chimera (0.6). Also, Mozilla (1.2b), both with and without
autodetect, seems to render just fine.
There may be no GUI to change it, but it is there, and it is on by default.

The given URI does render in Shift_JIS with Universal Auto-Detect on in Mozilla
1.2b 2002103110. So does _this_ page.
Aha, you're right, autodetect universal is broken in 1.2b. I didn't notice that
option before.

Doh, Bugzilla is very angry with me. Someone "more empowered" should change the
product to Browser, the component to Internationalization, and the summary to
include autodetect universal.

Oh, and the "Chimera doesn't allow selection of character encodings" problem is
being tracked as bug 153150.
This seems to be fixed in Chimera 2002111604. If someone else can confirm this,
I'll mark it fixed. Anybody know what was changed?
I guess they just turned off auto-detection as the default setting in Chimera.
I've confirmed that the Kanji for apostrophes behavior does occur in Mozilla
1.3a when encoding is set to auto-detect universal. Changing product and summary
to reflect this.
Component: Page Layout → Internationalization
Product: Chimera → Browser
Summary: Kanji characters being substituted for apostrophes → Kanji being substituted for apostrophes under auto-detect universal
Version: unspecified → Trunk
Why does bryner have this bug?
-> component default owner
Assignee: bryner → smontagu
QA Contact: winnie → ylong
Bob, who should take autodetection bugs?
Summary: Kanji being substituted for apostrophes under auto-detect universal → Windows-1252 detected as Shift_JIS by auto-detect universal
Blocks: 180461
Who's the owner for this bug?
Attachment 107923 (from bug 182976) exhibits this symptom.
  Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8a3) Gecko/20040726

Updating platform/os.
OS: MacOS X → All
Hardware: Macintosh → All
*** Bug 182976 has been marked as a duplicate of this bug. ***
This bug is still an issue, as evidenced by the fact that this bug shows up as
EUC-JP, thanks to comment 3.  Instead of just checking for the presence of
certain characters, perhaps an algorithm could be worked out that looks at
relative numbers of characters?  This page, for example, has only one Japanese
character (if it even is Japanese) out of thousands.
Blocks: 264871
I'm currently trying to see what can be done to solve this bug.

I intend to work on both the SJIS and the gb18030 problems. I will open a thread
on netscape.public.mozilla.i18n to discuss my findings about the workings of the
detector, and share ideas about exactly how it should be enhanced.

This case is more delicate than it might sound. The detector filters out all
pure ASCII characters that do not immediately follow a high-bit character.
The sample pages shown here in most cases only use the curly apostrophe, and
this character forms valid SJIS code points with many of the ASCII characters
that can follow it. So after filtering, what we get is a string consisting only
of valid SJIS sequences, to which the detector quite reasonably gives a high
probability of being SJIS.
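
A hedged model (hypothetical Python, not the actual Mozilla detector code) of
the filtering step just described: drop every pure-ASCII byte except the one
immediately following a high-bit byte.

# Hypothetical reconstruction of the filtering described above:
def filter_for_detector(data: bytes) -> bytes:
    out = bytearray()
    keep_next = False
    for b in data:
        if b >= 0x80:            # high-bit byte: always kept
            out.append(b)
            keep_next = True
        elif keep_next:          # first ASCII byte after it: kept too
            out.append(b)
            keep_next = False
    return bytes(out)

sample = "I\u2019m here. We\u2019ll see.".encode("windows-1252")
print(filter_for_detector(sample))  # b'\x92m\x92l' -- only valid SJIS pairs

On a mostly-ASCII Windows-1252 page, everything the SJIS prober gets to see is
a valid SJIS lead/trail pair, which is exactly the failure mode described.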

We should find a way to adequately weight the fact that the curly apostrophe is
also often found in windows-1252, and probably try to use the fact that those
pages have a much higher proportion of ASCII characters. But I don't want this
to break, for example, an English page teaching Japanese using SJIS (using a
lot of English and only a few Japanese words).

According to shanjian, the latin1 detector is not very precise, so he had to
halve the confidence score it gives relative to the Asian detectors.
But in this situation the SJIS detector gets fairly confident (and it has to!
We cannot lower the score of valid SJIS data too much), and latin1 cannot
compete once its confidence is halved. But globally raising the confidence of
latin1 would break some pages that needed the divide-by-two.
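
To make the arithmetic concrete (these numbers are illustrative, not Mozilla's
actual scores):

# Illustrative values only: why halving latin1 lets SJIS win here.
sjis_conf   = 0.99       # filtered input is all valid SJIS pairs
latin1_conf = 1.0 / 2    # raw latin1 score, divided by two as described
print("Shift_JIS" if sjis_conf > latin1_conf else "ISO-8859-1")  # Shift_JIS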

I know that French pages almost never have this problem, because they use many
high-bit characters other than the curly apostrophe, so very quickly the text
doesn't look like SJIS at all, and the latin1 detector wins.

Therefore I am strongly interested in seeing pages that get wrongly identified
and use high-bit characters other than the curly apostrophe, to see whether a
tailor-made solution for the curly apostrophe is enough, or whether other
characters must be considered too.
Status: NEW → ASSIGNED
Please see bug 220555 comment 14 for an idea of mine based on a similar
analysis. Before proposing a patch for review I would want to find a pile of
auto-detection edge-case test cases to see what effect it would have.
I wonder if this is related to the problem that Camino has when displaying pages
with the British pound (£) symbol?
Simon, I saw bug 220555 comment 14, but in my testing the SJIS detector
sometimes rises to 0.99 confidence because of the curly apostrophe, so the
change you suggested will in any case not be enough on its own. And that change
would affect many cases other than this one, so I'm a bit wary of that solution
unless full testing proves it's OK.

In bug 171813 comment 23 (that bug has very interesting comments by shanjian
about the workings of the universal detector), Frank Tang said there was a set
of tests available for testing auto-detection. Maybe we should ask him where
that is, and whether the Foundation would be able to access it.
Assignee: smontagu → jmdesp
Status: ASSIGNED → NEW
(In reply to comment #22)
> The detector filters out all pure ASCII characters that do not immediately
> follow a high-bit character.

To me (and I admit I know nothing about how the detectors work), this seems
like a problem - maybe even the core problem here.
A page really encoded in, say, Shift_JIS, is unlikely to be 99% pure ASCII.
Therefore, instead of ignoring these characters, they should serve to increase
the probability of Latin-1 (actually, all the Latin-x encodings), Windows-1252,
and UTF-8, and to decrease the probability of non-Latin encodings.

Does this make any sense?
absolutely
Well, I had prepared a nice explanation that I was sure I had committed, but...

There are some legitimate SJIS pages that have very few SJIS characters; one
example: http://users.skynet.be/mangaguide/au1948.html
Those are quite special cases, but so are the Latin pages that hit this problem.
Most French or German pages have enough accented characters that they never hit
it; this problem and its variations mostly affect US/UK users, whose pages are
more likely to have only a few non-US-ASCII characters.

In the reference case, SJIS rises to 99%, so we would have to raise ISO-8859-1
to 100% with this mechanism for it to win. Not very subtle, and I wouldn't feel
very happy with a solution based on such an a priori assumption about the pages.
Also, I wouldn't like this solution to add more computation and slow down the
apparently already slow auto-detection.

Part of the problem is that the latin1 detector doesn't seem to be smart at all.
It apparently rises to 100% as soon as there is a single non-US-ASCII character,
hence the divide-by-two that turns it into a baseline used when none of the
other detectors gets over 50% confidence. We could try to make it smarter and
drop the divide-by-two, but the impact would be large and would require much
testing. And it wouldn't be easy anyway, as ISO-8859-1 is used by so many
different languages.

The SJIS detector itself doesn't seem so smart either, as the curly apostrophe
in combination with what follows doesn't yield particularly common characters.
So another option could be to lower the SJIS confidence on sequences like that
one, so that ISO-8859-1 can win more easily.

Anyway, taking all of that into account, finding the best solution requires
more experimenting and testing than I have time for at the moment :-(
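
One way to picture the "lower the SJIS confidence on such sequences" option
(a hypothetical penalty in Python, not a proposed patch):

# Hypothetical: shrink the SJIS confidence for each 0x92-plus-ASCII pair,
# exactly the byte pattern a Windows-1252 curly apostrophe produces.
def sjis_apostrophe_penalty(data: bytes, confidence: float) -> float:
    pairs = sum(1 for a, b in zip(data, data[1:])
                if a == 0x92 and 0x40 <= b <= 0x7E)
    return confidence * (0.9 ** pairs)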
Even if I don't have time to actually work on it, I still think about this bug.

I now think it might be worth trying to both raise the Latin1 probability and
lower the multi-byte converters' probability when the ratio of 8-bit characters
is too low. We'd probably then misidentify pages similar to the mangaguide one,
but maybe they're rare enough?
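
A rough sketch of that ratio idea (hypothetical scaling and thresholds, not a
patch):

# Hypothetical: damp multi-byte prober confidence and favour latin1 when
# the page has almost no high-bit bytes.
def rescale(mb_conf: float, latin1_conf: float, data: bytes):
    ratio = sum(b >= 0x80 for b in data) / max(len(data), 1)
    if ratio < 0.01:                          # arbitrary cut-off: <1% high-bit
        mb_conf *= ratio / 0.01               # scale CJK confidence down
        latin1_conf = max(latin1_conf, 0.75)  # boost the latin1 baseline
    return mb_conf, latin1_conf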

But I'm still bothered that short SJIS pages whose visible content is all SJIS
might have so much non-displayed pure ASCII content (headers, JS, CSS) that it
would lower the ratio to the point where they would be misidentified too. So it
doesn't look ideal.

Unless we could filter out all content that will not be displayed and run the
identification only on the "visible" content.

Maybe the best way out is to check how much pure ASCII there is /around/ the
8-bit characters, rather than taking the whole page into account.
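
A sketch of that last idea (hypothetical, with an arbitrary window size): feed
the detectors only the context around each high-bit byte, so off-screen ASCII
boilerplate stops dominating the statistics.

# Hypothetical windowing pass, not Mozilla code: collect a few bytes of
# context around each high-bit byte and run detection on that instead.
def context_windows(data: bytes, radius: int = 8) -> bytes:
    out = bytearray()
    for i, b in enumerate(data):
        if b >= 0x80:
            out += data[max(0, i - radius): i + radius + 1]
    return bytes(out)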
The original test case link appears to no longer trigger this bug.  Could someone attach a test case?

The Latin1 prober appears to be as dumb as it gets.  No matter what you feed it, it appears to always return a confidence of 0.5.
(In reply to comment #30)
> The original test case link appears to no longer trigger this bug.  Could
> someone attach a test case?

Yeah, it appears that URL is now being served with UTF-8.  For a test case, see comment 18 -- I have that email message in my folders, and just checked: symptom still there.
The message in comment 18 has a single non-ASCII character (in either charset).
There's really not enough for the detector to get its hands on. I doubt that
case is fixable.
Well, if one character is not enough to assume a particular charset, it doesn't
make sense to default to the foreign charset; better to fall back on the
default.
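
That policy as a sketch (the threshold and the detect callback are
hypothetical):

# Hypothetical: with too few non-ASCII bytes to be meaningful, skip
# detection entirely and keep the configured default charset.
def detect_or_default(data: bytes, detect, default: str = "ISO-8859-1") -> str:
    if sum(1 for b in data if b >= 0x80) < 2:  # arbitrary threshold
        return default
    return detect(data)  # stand-in for the real prober chain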
Fixed for me by the checkin for bug 306272
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
It sure seems to be; using the test case cited in comment 18, TB 3a1-0806 exhibits the problem and -0808 does not.  Thanks, Simon!
Status: RESOLVED → VERIFIED