Closed
Bug 168526
Opened 22 years ago
Closed 18 years ago
Windows-1252 detected as Shift_JIS by auto-detect universal
Categories
(Core :: Internationalization, defect)
VERIFIED
FIXED
People
(Reporter: mikeslvr, Assigned: jmdesp)
Details
(Keywords: intl)
Attachments
(2 files)
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.0.1) Gecko/20020909
Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.0.1) Gecko/20020909

When browsing the linked forums I have noticed some strange characters being added to the layout. On two occasions, a Japanese character has been rendered in place of a contraction ("I'm" and "We'll"). The other issue involves some odd characters being generated to the left of icon images, which increase in width as the thread becomes further justified to the right. Two images will be attached for reference.

Reproducible: Sometimes

Steps to Reproduce:
1. Visit this link: http://forums.maccentral.com/wwwthreads/showthreaded.php?Cat=&Board=Lounge&Number=270015&Search=true&Forum=Lounge&Words=BiggerFoot&Match=Username&Searchpage=0&Limit=25&Old=1week&Main=270015

Actual Results: Strange behavior is evident when visiting the link.

Expected Results: Text should render without the Japanese substitution, and the odd "icon companions" shouldn't be present.

The following settings have been added via the user.js file:
user_pref("image.animation_mode", "none");
user_pref("network.cookie.lifetime.enabled", true);
user_pref("network.http.pipelining", true);
user_pref("font.minimum-size.x-western", 10);
Whoever typed the "I知 back!" text used a curly apostrophe. The content of the page is script-generated, so is this an encoding issue? It does work in Mozilla. I wonder if this will start working when bug 160317 is resolved; that is, bug 111728's fix is migrated to Chimera.
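The substitution mechanism can be sketched in a few lines of Python (an editorial illustration, not part of the original report): in Windows-1252 the curly apostrophe is byte 0x92, which is a valid Shift_JIS lead byte, so together with the following ASCII letter it decodes as a single kanji.

```python
# Illustration: "I'm" typed with a curly apostrophe, encoded as
# Windows-1252, then (mis)decoded as Shift_JIS.
raw = "I\u2019m back!".encode("windows-1252")  # curly apostrophe -> byte 0x92
print(raw.hex())                 # 49926d206261636b21
# 0x92 is a Shift_JIS lead byte and 0x6d ('m') a valid trail byte,
# so the pair decodes as one kanji:
print(raw.decode("shift_jis"))   # I知 back!
```

This is exactly the "I知 back!" rendering described above.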
Severity: trivial → normal
Keywords: intl
Comment 4•22 years ago
I was able to get the same effect in Mozilla by changing the encoding to Japanese. Perhaps this will be cleared up with Bug 153150.
*** Bug 175195 has been marked as a duplicate of this bug. ***
Confirming in 2002111304 on 10.2.2. Updating summary to be more specific.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Strange characters being generated in web pages → Kanji characters being substituted for apostrophes
*** Bug 175754 has been marked as a duplicate of this bug. ***
Comment 8•22 years ago
This is an issue with Universal Auto-Detect. Turn off Auto-Detect, go to the given URL, and the page renders with your default encoding (assuming ISO-8859-1). Turn on Universal Auto-Detect and it re-renders, now with the characters not in ISO-8859-1 (miscellaneous punctuation symbols such as the right single quote, high up in Unicode) substituted with their Shift_JIS equivalents. The auto-detection algorithm seems to assume that whenever ASCII is mixed with characters around U+2000 it must be Japanese, even though CJK, Hiragana, Bopomofo and the like are nowhere near this block.
Comment 9•22 years ago
unless i'm missing something, there's no option to change character encoding handling in chimera (0.6). also, mozilla (1.2b) both with and without autodetect seems to render just fine.
Comment 10•22 years ago
There may be no GUI to change it, but it is there, and it is on by default. The given URI does render in Shift_JIS with Universal Auto-Detect on in Mozilla 1.2b 2002103110. So does _this_ page.
Comment 11•22 years ago
aha, you're right, autodetect universal is broken in 1.2b. didn't notice that option before. doh, bugzilla is very angry with me. someone "more empowered" should change product to Browser, component to Internationalization, and the summary to include autodetect universal. uhh, and the "Chimera doesn't allow selection of character encodings" problem is being tracked as bug 153150.
Comment 12•22 years ago
This seems to be fixed in Chimera 2002111604. If someone else can confirm this, I'll mark it fixed. Anybody know what was changed?
Comment 13•22 years ago
I guess they just turned off auto-detection as the default setting in Chimera. I've confirmed that the Kanji for apostrophes behavior does occur in Mozilla 1.3a when encoding is set to auto-detect universal. Changing product and summary to reflect this.
Component: Page Layout → Internationalization
Product: Chimera → Browser
Summary: Kanji characters being substituted for apostrophes → Kanji being substituted for apostrophes under auto-detect universal
Version: unspecified → Trunk
Comment 14•21 years ago
Why does bryner have this bug?
Comment 15•21 years ago
-> component default owner
Assignee: bryner → smontagu
QA Contact: winnie → ylong
Comment 16•21 years ago
Bob, who should take autodetection bugs?
Summary: Kanji being substituted for apostrophes under auto-detect universal → Windows-1252 detected as Shift_JIS by auto-detect universal
Comment 17•21 years ago
Who's the owner for this bug?
Comment 18•20 years ago
attachment 107923 [details] (from bug 182976) exhibits this symptom. Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8a3) Gecko/20040726. Updating platform/os.
OS: MacOS X → All
Hardware: Macintosh → All
Comment 19•20 years ago
*** Bug 182976 has been marked as a duplicate of this bug. ***
Comment 20•20 years ago
This bug is still an issue, as evidenced by the fact that this bug shows up as EUC-JP, thanks to comment 3. Instead of just checking for the presence of certain characters, perhaps an algorithm could be worked out that looks at relative numbers of characters? This page, for example, has only one Japanese character (if it even is Japanese) out of thousands.
Assignee
Comment 22•19 years ago
I'm currently trying to see what can be done to solve this bug. I intend to work on both the SJIS and the gb18030 problem. I will open a thread on netscape.public.mozilla.i18n to discuss my findings about the workings of the detector, and to share ideas about exactly how it should be enhanced.

This case is more delicate than it might sound. The detector filters out all pure ASCII characters that are not immediately following a high-bit character. The sample pages shown here in most cases use only the curly apostrophe, and this character forms valid SJIS code points with many of the ASCII characters that follow it. So after filtering, what we get is a string consisting only of valid SJIS sequences, to which the detector quite reasonably gives a high probability of being SJIS.

We should find a way to adequately weight the fact that the curly apostrophe is also often found in windows-1252, and probably try to use the fact that those pages have a much higher proportion of ASCII characters. But I don't want this to break, for example, an English page teaching Japanese using SJIS (using a lot of English and only a few Japanese words).

According to shanjian, the latin1 detector is not very precise, so he had to divide by two the confidence score it gives relative to the Asian detectors. But in this situation the SJIS detector gets fairly confident (and has to! We cannot lower the score of valid SJIS data too much), and latin1 cannot compete once its confidence is divided by two. Yet globally raising the confidence of latin1 would break some pages that needed the divide by two. I know that French pages almost never have this problem, because they use many other high-bit characters besides the curly apostrophe, so very quickly the text doesn't look like SJIS at all and the latin1 detector wins.

Therefore I am strongly interested in seeing pages that get wrongly identified and use high-bit characters other than the curly apostrophe, to see whether a tailor-made solution for the curly apostrophe is enough, or whether other characters must be considered too.
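The filtering step described in this comment can be sketched as follows (an editorial reimplementation for illustration only; the real detector is C++ inside Mozilla, and `filter_for_detection` is an invented name):

```python
def filter_for_detection(data: bytes) -> bytes:
    """Keep high-bit bytes, plus any ASCII byte that immediately
    follows one; drop all other pure-ASCII bytes (sketch of the
    filtering step described above)."""
    out = bytearray()
    prev_high = False
    for b in data:
        if b >= 0x80:
            out.append(b)
            prev_high = True
        elif prev_high:
            out.append(b)
            prev_high = False
    return bytes(out)

# A mostly-ASCII Windows-1252 page reduces to nothing but valid
# SJIS lead/trail pairs, which is why the SJIS prober gets so confident:
print(filter_for_detection(b"I\x92m back! Plenty of plain ASCII here."))
# b'\x92m'
```

After filtering, all evidence of the page being overwhelmingly ASCII is gone, so the overall ASCII proportion mentioned above never reaches the probers.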
Status: NEW → ASSIGNED
Comment 23•19 years ago
Please see bug 220555 comment 14 for an idea of mine based on a similar analysis. Before proposing a patch for review I would have wanted to find a pile of autodetection edge-case test cases to see what effect it had.
Comment 24•19 years ago
I wonder if this is related to the problem that Camino has when displaying pages with the British pound (£) symbol?
Assignee
Comment 25•19 years ago
Simon, I saw bug 220555 comment 14, but in my testing the SJIS detector sometimes rises to 0.99 confidence because of the curly apostrophe, so the change you suggested will not in any case be enough on its own. And that change would affect many cases other than this one, so I'm a bit wary of that solution unless full testing proves it's OK. In bug 171813 comment 23 (that bug has very interesting comments by shanjian about the workings of the universal detector), Franck Tang said there was a set of tests available for testing autodetection. Maybe we should ask him where that is, and whether the Foundation would be able to access it.
Assignee: smontagu → jmdesp
Status: ASSIGNED → NEW
Comment 26•19 years ago
(In reply to comment #22)
> The detector filters out all pure ASCII characters that are not immediately following a high-bit character.

To me (and I admit I know nothing about how the detectors work), this seems like a problem, maybe even the core problem here. A page really encoded in, say, Shift_JIS is unlikely to consist of 99% pure ASCII characters. Therefore, instead of ignoring these characters, they should serve to increase the probability of Latin-1 (actually, all the Latin-x encodings), Windows-1252, and UTF-8, and to decrease the probability of non-Latin encodings. Does this make any sense?
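The suggestion could look something like this (an editorial sketch: the function names, the 0.95 cutoff, and the scaling factors are all invented here, not taken from the actual prober):

```python
def ascii_ratio(data: bytes) -> float:
    """Fraction of plain-ASCII bytes in the input."""
    if not data:
        return 0.0
    return sum(1 for b in data if b < 0x80) / len(data)

def adjust_confidence(sjis_conf: float, latin_conf: float, data: bytes):
    """Bias toward the Latin family when the page is overwhelmingly
    ASCII, as proposed above. Cutoff and factors are illustrative."""
    if ascii_ratio(data) > 0.95:
        return sjis_conf * 0.5, min(1.0, latin_conf * 1.5)
    return sjis_conf, latin_conf
```

On a page like the one in the report (thousands of ASCII bytes, one curly apostrophe) this would pull SJIS below the Latin baseline instead of above it.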
Comment 27•19 years ago
absolutely
Assignee
Comment 28•19 years ago
Well, I had prepared a nice explanation that I was sure I had committed, but...

There are some legitimate SJIS pages that have very few SJIS characters; one example: http://users.skynet.be/mangaguide/au1948.html

Those are quite special cases, but so are the Latin pages that hit this problem. Most French or German pages have enough accented characters that they never hit it; this problem and its variations mostly affect US/UK users, whose pages are more likely to have only a few non-US-ASCII characters. In the reference case SJIS rises to 99%, so we would have to raise ISO-8859-1 to 100% with this mechanism for it to win. Not very subtle, and I wouldn't feel very happy with a solution based on such an a priori assumption about the pages. Also, I wouldn't want this solution to add more calculation and slow down the apparently already slow auto-detection.

Part of the problem is that the latin1 detector doesn't seem to be smart at all. It apparently rises to 100% as soon as there are non-US-ASCII characters, hence the divide by two, which makes it a baseline converter used only when none of the others gets over 50% confidence. We could try to make it smarter and remove the divide by two, but the impact would be large and would require much testing. And it wouldn't be easy anyway, as ISO-8859-1 is used by so many different languages.

The SJIS detector itself doesn't seem so smart either, as the curly apostrophe in combination with what follows doesn't yield very common characters. So another option would be to lower the SJIS confidence on characters similar to that one, so that ISO-8859-1 can win more easily. Anyway, taking all of that into account, finding the best solution requires more experimenting and testing than I have time for at the moment :-(
Assignee
Comment 29•19 years ago
Even if I don't have time to actually work on it, I still think about this bug. I now think it might be worth trying both to raise the Latin1 probability and to lower the multi-byte converters' probability when the ratio of 8-bit characters is too low. We would probably then misidentify pages similar to the mangaguide one, but they're rare enough. But I'm still bothered that short SJIS pages whose visible content is all SJIS might have so much non-displayed pure-ASCII content (headers, JS, CSS) that it lowers the ratio to the point where they get misidentified too. So it doesn't look ideal, unless we could filter out all content that will not be displayed and run identification only on the "visible" content. Maybe the best way out is to check how much pure ASCII there is /around/ the 8-bit characters, rather than taking the whole page into account.
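The "look only around the 8-bit characters" idea at the end of this comment could be prototyped like this (an editorial sketch; the window size of 8 bytes is an arbitrary guess):

```python
def local_ascii_ratio(data: bytes, window: int = 8) -> float:
    """ASCII ratio measured only in a small window around each
    high-bit byte, so uncounted headers/JS/CSS elsewhere in the
    page don't dilute the signal."""
    high = [i for i, b in enumerate(data) if b >= 0x80]
    if not high:
        return 1.0
    near = set()
    for i in high:
        near.update(range(max(0, i - window), min(len(data), i + window + 1)))
    return sum(1 for i in near if data[i] < 0x80) / len(near)
```

A genuine SJIS page scores low here even when buried in ASCII markup, while a Windows-1252 page with a lone curly apostrophe still scores high, which is the distinction the comment is after.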
Comment 30•19 years ago
The original test case link appears to no longer trigger this bug. Could someone attach a test case? The Latin1 prober appears to be as dumb as it gets. No matter what you feed it, it appears to always return a confidence of 0.5.
Comment 31•19 years ago
(In reply to comment #30)
> The original test case link appears to no longer trigger this bug. Could someone attach a test case?

Yeah, it appears that URL is now being served as UTF-8. For a test case, see comment 18: I have that email message in my folders, and just checked; the symptom is still there.
Comment 32•19 years ago
The message in comment 18 has a single non-ASCII character (in either charset). There's really not enough for the detector to get its hands on. I doubt that case is fixable.
Comment 33•19 years ago
Well, if one char is not enough to assume a particular state, it doesn't make sense to default to the foreign charset; better to fall back on the default.
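The fallback proposed here is simple to state in code (an editorial sketch; the function name and the threshold of three high-bit bytes are invented for illustration):

```python
def detect_or_default(data: bytes, detect, default: str = "windows-1252",
                      min_high: int = 3) -> str:
    """Only trust the detector when there are enough high-bit bytes
    to support a guess; otherwise keep the configured default."""
    if sum(1 for b in data if b >= 0x80) < min_high:
        return default
    return detect(data)
```

With this policy, the single-apostrophe message from comment 18 would keep the user's default encoding regardless of how confident the SJIS prober claimed to be.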
Comment 34•18 years ago
Fixed for me by the checkin for bug 306272
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Comment 35•18 years ago
It sure seems to be; using the test case cited in comment 18, TB 3a1-0806 exhibits the problem and -0808 does not. Thanks, Simon!
Status: RESOLVED → VERIFIED