Closed
Bug 306224
Opened 19 years ago
Closed 17 years ago
Windows-1252 pages are mistaken to be in Windows-1255 by univeral charset detector
Categories
(Core :: Internationalization, defect)
Tracking
()
RESOLVED
WORKSFORME
People
(Reporter: ria.klaassen, Assigned: shooshx)
References
()
Details
(Keywords: intl)
Attachments
(2 files)
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20050827 Firefox/1.6a1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20050827 Firefox/1.6a1
This regressed very recently, both in trunk and branch. Between these two builds:
1.9a1_2005082611 and 1.9a1_2005082612
The character encoding resets itself to hebrew and the displayed font-family and
font-size are not the ones in the preferences.
It happens on some sites.
Reproducible: Sometimes
Steps to Reproduce:
Sometimes I could reproduce it with only these prefs, but not consistently:
user_pref("browser.display.use_document_fonts", 0);
user_pref("browser.search.selectedEngine", "Google");
user_pref("browser.shell.checkDefaultBrowser", false);
user_pref("browser.startup.homepage", "http://www.haarshop.nl/default.asp?part=product_details&product_id=31&page=1");
user_pref("browser.startup.homepage_override.mstone", "rv:1.9a1");
user_pref("extensions.lastAppVersion", "1.6a1");
user_pref("font.default", "sans-serif");
user_pref("font.default.x-western", "sans-serif");
user_pref("font.minimum-size.x-western", 24);
user_pref("font.name.sans-serif.x-western", "Verdana");
user_pref("font.name.serif.x-western", "Verdana");
user_pref("font.size.fixed.x-western", 14);
user_pref("intl.accept_languages", "nl");
user_pref("intl.charset.detector", "universal_charset_detector");
user_pref("intl.charsetmenu.browser.cache", "ISO-8859-1, windows-1255, us-ascii, windows-1252, ISO-8859-2");
user_pref("network.cookie.prefsMigrated", true);
user_pref("pref.browser.language.disable_button.remove", true);
Actual Results:
When I changed the intl.accept_languages to default, it didn't happen.
| Reporter | ||
Comment 1•19 years ago
|
||
Under some conditions Firefox prefers the (wrong) character encoding AutoDetect settings, not the personal settings.
| Reporter | ||
Comment 2•19 years ago
|
||
I can reproduce it on another XP computer on an empty profile with only the same prefs. The only way to get normal letters is setting intl.accept_languages to en-us OR not choosing my own fonts. Both solutions have huge disadvantages.
| Reporter | ||
Comment 3•19 years ago
|
||
I found a way to get around these problems, in case anyone runs into the same thing (it previously worked well, so it is clearly a regression). You need to accept more languages, not only your own. If this header is sent to the server: Accept-Language: nl,en-us;q=0.7,en;q=0.3 there is no problem, even if I choose my own font.
| Reporter | ||
Updated•19 years ago
|
Component: General → Internationalization
Product: Firefox → Core
Version: unspecified → 1.8 Branch
| Reporter | ||
Comment 4•19 years ago
|
||
OK, I can reproduce this now 100% consistently. Forget the previous post: what polluted this test was that Auto-Detect sometimes changed from on to off.
1. Create a new profile, set Character Encoding > Auto-Detect to Universal and go to http://www.haarshop.nl/default.asp?part=product_details&product_id=31&page=1
2. Then change Advanced > General > Edit Languages so that Dutch is the first language. Restart and verify that Character Encoding > Auto-Detect is still set to Universal. Go back to the URL and observe that Character Encoding wrongly points to Hebrew (Windows-1255). This happens only in builds since 2005-08-26 12, probably due to some bugfix.
3. Then set Options > Content > Default Font to Verdana and UNcheck "Allow pages to choose their own fonts". The page then switches to a very small font size. This happens on more pages, where it sometimes changes to font size 7 or 8, but I remember only this one.
| Reporter | ||
Updated•19 years ago
|
Flags: blocking1.8b4?
Comment 6•19 years ago
|
||
(In reply to comment #5)
> Ria, can you determine when this broke?
From comment 0: This regressed very recently, both in trunk and branch. Between these two builds: 1.9a1_2005082611 and 1.9a1_2005082612
http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=SeaMonkeyAll&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2005-08-26+10%3A00&maxdate=2005-08-26+13%3A00&cvsroot=%2Fcvsroot
2005-08-26 12:09 mozilla.mano%sent.com mozilla/extensions/universalchardet/src/LangHebrewModel.cpp 1.2 17/8 Bug 304951 - error in chardet's Hebrew language model. patch from Shy Shalom <shoosh20012001@hotmail.com>, r=smontagu, sr=roc.
Comment 7•19 years ago
|
||
shoosh, can you look into this regression? If it can't be easily fixed, we should consider backing out that other change.
Assignee: nobody → shoosh20012001
Status: UNCONFIRMED → NEW
Ever confirmed: true
| Assignee | ||
Comment 8•19 years ago
|
||
OK, I've spent some time investigating this and have some initial findings.
First of all, as of this moment I am unable to reproduce this bug, which is strange because an hour ago I was able to reproduce it fully. I can't point to anything I did locally that would have changed anything. However, I can still see that something is wrong by enabling the debug output of the universal prober, and so I could produce some reduced testcases. If anyone can still reproduce it as described in comment #4, please notify me.
The strange behaviour that seems to depend on the received languages is probably due to the fact that for some languages and not others, the page is sent with the 'euro' char (0x80 in cp1252) incorporated in it.
The bug itself is the misrecognition as cp1255, which is caused by a sequence of two 0xe9 bytes (the letter Yod in cp1255, or an accented 'e' in cp1252) in the comment by "Melissa". Since Yod is the most common letter in Hebrew, this gives a high score to the Hebrew detector. The patch to 304951 seems to have caused this regression since it makes the Hebrew language model more precise. Backing out that change would fix this page specifically (and unfix the 304951 page) but would not fix the essence of this bug, which is that the SingleByte detector is allowed to be very confident ( > 0.5) on a very, very short input, in the range of 2-4 bytes. The reduced test cases I've produced demonstrate that this is not a Hebrew-prober-only behaviour but a general Single-Byte prober problem, since it is reproduced for cp1251 (Russian) as well.
The second and totally unrelated problem described in comment #4 has to do with fonts. Again, since I currently can't reproduce the bug, I have no means to see what this looks like, but in any case it has no relation to the universal charset detector, and if it exists at all, it deserves a different bug. This bug and any other 86999-related bugs end when the charset of the page is decided upon.
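To illustrate the ambiguity described in this comment, here is a minimal Python sketch that decodes the same bytes under the three code pages mentioned. The sample string and the helper name decode_all are made up for illustration; this is not the detector's code, only a view of what each candidate charset makes of the bytes.

```python
# Hypothetical input: mostly ASCII text followed by two 0xE9 bytes,
# similar in shape to the short sequence described in the comment.
SAMPLE = b"caf\xe9\xe9"


def decode_all(data, encodings=("cp1252", "cp1255", "cp1251")):
    """Return {encoding: decoded text} for each candidate code page."""
    return {enc: data.decode(enc) for enc in encodings}


if __name__ == "__main__":
    for enc, text in decode_all(SAMPLE).items():
        # Show the code points of the two high bytes only (ASCII-safe output).
        print(enc, [f"U+{ord(ch):04X}" for ch in text[-2:]])
```

All three decodings succeed, so byte validity alone cannot tell the charsets apart. The detector has to rely on letter statistics, and two bytes are very little to go on.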
| Assignee | ||
Comment 9•19 years ago
|
||
This demonstrates how the Hebrew prober recognizes a page with mostly English and two letters of Hebrew as windows-1255.
| Assignee | ||
Comment 10•19 years ago
|
||
This demonstrates how the windows-1251 prober recognizes a page with mostly English and two high-ASCII letters (0xee) as windows-1251.
Comment 11•19 years ago
|
||
smontague and jshin, can you comment here? Which is worse, the bug that caused this regression or this regression? any suggestions for a low risk fix?
Comment 12•19 years ago
|
||
(In reply to comment #4)
> 3. Then set Options > Content > Default Font to verdana and UNcheck "Allow pages
> to choose their own fonts". What then happens is that the page changes into a
> very small font-size. This happens on more pages, sometimes it changes to
> fontsize 7 or 8, but I remember only this one.
What you observed is a by-product of our universal autodetector mistaking a Windows-1252-encoded page for a Windows-1255-encoded one. The new font preference dialog in trunk builds adds to the confusion. (I pointed this out in another bug whose number I can't recall at the moment.) The 'default font' is NOT used as the default font for all pages (when 'allow web pages to choose their own fonts' is NOT checked). It is only the default for the last language group for which you configured your font settings (by pressing the 'Advanced' button in the font pref. UI). In your case, that is probably Western (European). However, the page was detected as Windows-1255, so the default font for the Hebrew language group, rather than Verdana (the default font for the 'Western' language group), is used to render the page.
Keywords: intl
Summary: Firefox doesn't respect personal font settings anymore on some sites → Windows-1252 pages are mistaken to be in Windows-1255 by univeral charset detector
Comment 13•19 years ago
|
||
Simon or Jshin, can you please renominate this if you think the current situation is worse than the bug we fixed.
Flags: blocking1.8b5? → blocking1.8b5-
| Assignee | ||
Comment 14•19 years ago
|
||
A possible solution to this and related bugs is to introduce a static limit of, say, 10 characters, under which the SingleByte probers always return confidence 0. Another possibility is to introduce a minimum ratio of (checked characters)/(total page size, including all English and tags), which should be above something like 0.2. Yet another possibility is that the SingleByteGroupProber does both with-English and without-English filtering on the original text and then calculates the ratio from those numbers, which more accurately summarise the content of the text. This change, however, has a performance cost: another filtering pass over the page.
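The first two ideas above can be sketched roughly as follows, in Python for brevity. The function name gated_confidence and the choice to apply both checks together are assumptions made for illustration; the comment proposes them as alternatives, and nothing here reflects the actual prober code. The thresholds are the ones suggested in the comment.

```python
MIN_CHARS = 10   # static limit suggested in the comment ("say 10 characters")
MIN_RATIO = 0.2  # suggested minimum of checked characters / total page size


def gated_confidence(raw_confidence, checked_chars, total_chars):
    """Return 0.0 unless the prober saw enough relevant characters.

    raw_confidence: confidence reported by a single-byte prober
    checked_chars:  number of characters the prober actually scored
    total_chars:    total page size, including English text and tags
    """
    if total_chars <= 0 or checked_chars < MIN_CHARS:
        return 0.0
    if checked_chars / total_chars < MIN_RATIO:
        return 0.0
    return raw_confidence
```

With these thresholds, a page where only 2-4 high bytes were scored would yield confidence 0, whatever the language model reported for them.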
Comment 15•19 years ago
|
||
Gentlemen. Here's another page for you - http://www.cubio.fi/99551.php Recognized as 1251 even although meta says 1252. You can see Cyrillic chars at the bottom of the screen, which should be displaying 'ä' instead. Better yet, Auto-detect is set to OFF and it doesn't matter. FF stubbornly chooses the WRONG codepage. I have this problem with some bulletin boards as well. Really, really tiresome browsing pages because FF selects the wrong codepage every time I change a thread or page of a thread. Can't find an example right now, but..
Comment 16•19 years ago
|
||
(In reply to comment #15)
> Gentlemen. Here's another page for you - http://www.cubio.fi/99551.php
> Recognized as 1251 even although meta says 1252. You can see cyrillic chars at
> the bottom of the screen, which should be displaying 'ä' instead.
That's quite a different issue: the server is sending Content-Type: text/html; charset=windows-1251 in headers, which takes priority over the meta.
Comment 17•19 years ago
|
||
OK, but turning off auto detection should force FF to use the codepage I've selected, no?
| Assignee | ||
Comment 18•19 years ago
|
||
No. Autodetection means trying to choose a charset by relying on the bare text itself for clues. Charset names that come from HTTP (the channel) or from the HTML are always used without any regard to the charset probers. There are plentiful examples of the situation you're describing in the test cases page I submitted in bug 86999.
Comment 19•19 years ago
|
||
That's a closed bug. And not really concerned with FORCING a codepage on a misbehaving site. AFAIK, there isn't a way to do so, codepage will revert even on a reload. I looked for an extension to do so and came up short. In this case, their web server is mis-configured and the html author is doing the right thing?
Comment 20•19 years ago
|
||
(In reply to comment #19)
> That's a closed bug. And not really concerned with FORCING a codepage on a
> misbehaving site. AFAIK, there isn't a way to do so, codepage will revert even
> on a reload. I looked for an extension to do so and came up short. In this
> case, their web server is mis-configured and the html author is doing the
> right thing?
It might be. But the HTTP/1.1 specification (RFC 2616) clearly disallows any guessing in section 3.4.1 if the server specifies the charset:
HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document.
Note the "MUST" - this item is not negotiable. It should also be noted that per spec, a document without a charset label SHOULD be assumed to have iso-8859-1 encoding. Contact the server administrator and tell them that they are breaking the specification. If they don't know the real charset, then they should be sending nothing instead of some incorrect charset.
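The precedence described in the last few comments can be summarised in a short Python sketch. The function choose_charset and its parameters are hypothetical and greatly simplified: a real browser also has to handle unsupported charsets, user overrides and other inputs. It only shows the ordering discussed here: HTTP header first, then the meta declaration, then the autodetector's guess, then the iso-8859-1 default.

```python
def choose_charset(http_charset=None, meta_charset=None,
                   detected_charset=None, fallback="iso-8859-1"):
    """Pick a charset label using the priority order discussed above.

    A label from the HTTP Content-Type header wins over a <meta> label,
    and both win over anything the autodetector guessed from the bytes.
    """
    for candidate in (http_charset, meta_charset, detected_charset):
        if candidate:
            return candidate.strip().lower()
    return fallback
```

For the page from comment 15, the header says windows-1251 and the meta says windows-1252, so this ordering yields windows-1251 whether or not auto-detection is enabled.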
Updated•19 years ago
|
QA Contact: general → amyy
| Reporter | ||
Comment 21•17 years ago
|
||
I can't reproduce this anymore on the latest trunk and branch builds.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → WORKSFORME