Closed Bug 306224 Opened 19 years ago Closed 17 years ago

Windows-1252 pages are mistaken to be in Windows-1255 by universal charset detector

Categories

(Core :: Internationalization, defect)

1.8 Branch
x86
Windows XP
defect
Not set
major

RESOLVED WORKSFORME

People

(Reporter: ria.klaassen, Assigned: shooshx)

Details

(Keywords: intl)

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20050827 Firefox/1.6a1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20050827 Firefox/1.6a1

This regressed very recently, both in trunk and branch. Between these two builds:
1.9a1_2005082611 and 1.9a1_2005082612
The character encoding resets itself to hebrew and the displayed font-family and
font-size are not the ones in the preferences.
It happens on some sites.




Reproducible: Sometimes

Steps to Reproduce:
Sometimes I could reproduce it with only these prefs, but not consistently:

user_pref("browser.display.use_document_fonts", 0);
user_pref("browser.search.selectedEngine", "Google");
user_pref("browser.shell.checkDefaultBrowser", false);
user_pref("browser.startup.homepage", "http://www.haarshop.nl/default.asp?part=product_details&product_id=31&page=1");
user_pref("browser.startup.homepage_override.mstone", "rv:1.9a1");
user_pref("extensions.lastAppVersion", "1.6a1");
user_pref("font.default", "sans-serif");
user_pref("font.default.x-western", "sans-serif");
user_pref("font.minimum-size.x-western", 24);
user_pref("font.name.sans-serif.x-western", "Verdana");
user_pref("font.name.serif.x-western", "Verdana");
user_pref("font.size.fixed.x-western", 14);
user_pref("intl.accept_languages", "nl");
user_pref("intl.charset.detector", "universal_charset_detector");
user_pref("intl.charsetmenu.browser.cache", "ISO-8859-1, windows-1255, us-ascii, windows-1252, ISO-8859-2");
user_pref("network.cookie.prefsMigrated", true);
user_pref("pref.browser.language.disable_button.remove", true);
Actual Results:  
When I changed the intl.accept_languages to default, it didn't happen.
Under some conditions Firefox prefers the (wrong) character encoding AutoDetect
settings, not the personal settings.
I can reproduce it on another XP computer on an empty profile with only the same
prefs.
The only way to get normal letters is setting intl.accept_languages to en-us
OR not choosing my own fonts. Both solutions have huge disadvantages.
I found a way to get around these problems, in case anyone runs into the same
problem (this worked well previously, so it is clearly a regression).
You need to accept more languages, not only your own.
If this header is sent to the server: Accept-Language: nl,en-us;q=0.7,en;q=0.3
there is no problem, even if I choose my own font.
Component: General → Internationalization
Product: Firefox → Core
Version: unspecified → 1.8 Branch
OK, I can reproduce this now 100% consistently. Forget the previous post: what
polluted this test was that Auto-Detect sometimes changed from on to off.

1. Create a new profile, set Character Encoding > Auto-Detect to Universal and
go to http://www.haarshop.nl/default.asp?part=product_details&product_id=31&page=1
2. Then change Advanced > General > Edit Languages to Dutch as the first
language. Restart and verify if Character Encoding > Auto-Detect is still set to
Universal. Go back to the URL and observe that Character Encoding points wrongly
to Hebrew (Windows-1255). This happens only in builds since 2005-08-26 12
probably due to some bugfix.
3. Then set Options > Content > Default Font to verdana and UNcheck "Allow pages
to choose their own fonts". What then happens is that the page changes into a
very small font-size. This happens on more pages, sometimes it changes to
fontsize 7 or 8, but I remember only this one. 
Flags: blocking1.8b4?
Ria, can you determine when this broke? 
(In reply to comment #5)
> Ria, can you determine when this broke? 

from comment 0:
This regressed very recently, both in trunk and branch. Between these two builds:
1.9a1_2005082611 and 1.9a1_2005082612

http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=SeaMonkeyAll&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2005-08-26+10%3A00&maxdate=2005-08-26+13%3A00&cvsroot=%2Fcvsroot

2005-08-26 12:09	mozilla.mano%sent.com 	mozilla/ extensions/ universalchardet/
src/ LangHebrewModel.cpp 	1.2 	17/8  	
Bug 304951 - error in chardet's Hebrew language model. 
patch from Shy Shalom <shoosh20012001@hotmail.com>, r=smontagu, sr=roc.
shoosh, can you look into this regression? If it can't be easily fixed, we
should consider backing out that other change.
Assignee: nobody → shoosh20012001
Status: UNCONFIRMED → NEW
Ever confirmed: true
OK, I've spent some time investigating this and have some initial findings.

First of all, as of this moment I am unable to reproduce this bug, which is
strange because an hour ago I was able to reproduce it fully. I really can't
point to anything I did locally that would have changed anything. However, I
can still see that something is wrong by enabling the debug output of the
universal prober, and so I can produce some reduced testcases. If anyone can
still reproduce it as described in comment #4, please notify me.

The strange behaviour that seems to depend on the accepted languages is probably
due to the fact that for some languages and not others, the page is sent with
the 'euro' char (0x80 in cp1252) in it.

The bug itself is the misrecognition as cp1255, which is caused by a sequence
of two 0xe9 bytes in the comment by "Melissa" (0xe9 is the letter Yod in cp1255
or an accented 'e' in cp1252). Since Yod is the most common letter in Hebrew,
this gives a high score to the Hebrew detector.
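To make the ambiguity concrete, here is a small Python sketch (not part of the detector) that decodes the same two 0xe9 bytes under both code pages, plus the 0x80 euro byte mentioned above. It relies only on Python's built-in `cp1252` and `cp1255` codecs; the variable names are just for illustration.

```python
# The same two bytes, read under the two code pages discussed above.
raw = bytes([0xE9, 0xE9])

as_cp1252 = raw.decode("cp1252")  # Western European reading
as_cp1255 = raw.decode("cp1255")  # Hebrew reading

print(as_cp1252)  # two U+00E9 characters (e with acute accent)
print(as_cp1255)  # two U+05D9 characters (Hebrew letter Yod)

# 0x80 is the euro sign in cp1252.
euro = bytes([0x80]).decode("cp1252")
print(euro)       # U+20AC
```

Nothing in the bytes themselves says which reading is intended, which is why the detector has to fall back on statistics.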

The patch for 304951 seems to have caused this regression, since it makes the
Hebrew language model more precise. Backing out that change would fix this page
specifically (and unfix the 304951 page) but would not fix the essence of this
bug. The essence of this bug is that the SingleByte detector is allowed to be
very confident (> 0.5) on a very short input, in the range of 2-4 bytes.
The reduced test cases I've produced demonstrate that this is not specific to
the Hebrew prober but is a general Single-Byte prober problem, since it is
reproduced for cp1251 (Russian) as well.
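As a rough illustration of why a ratio-based score can saturate on tiny input, here is a toy model in Python. It is not the universalchardet code: `toy_confidence`, `likely_pairs` and `hebrew_like` are hypothetical names, and treating Yod-Yod as a "likely" pair is an assumption made only for this example.

```python
def toy_confidence(data, likely_pairs):
    """Fraction of adjacent high-byte pairs that the model considers likely.

    Hypothetical simplification: only bytes >= 0x80 are scored, as if ASCII
    letters and markup had already been filtered out.
    """
    high = [b for b in data if b >= 0x80]
    pairs = list(zip(high, high[1:]))
    if not pairs:
        return 0.0
    hits = sum(1 for p in pairs if p in likely_pairs)
    return hits / len(pairs)


# Assumed for the example: the pair (Yod, Yod) counts as likely in Hebrew.
hebrew_like = {(0xE9, 0xE9)}

page = b"<p>Mostly English text from Melissa: " + bytes([0xE9, 0xE9]) + b"</p>"
score = toy_confidence(page, hebrew_like)
print(score)
```

With a single scored pair, one hit is enough to reach the maximum score, even though the two bytes are a tiny fraction of the page.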

The second and totally unrelated problem described in comment #4 has to do with
fonts. Again, since I currently can't reproduce the bug, I have no means to see
what this looks like, but in any case it has no relation to the universal
charset detector and, if it exists at all, it deserves a different bug. This bug
and any other 86999-related bugs end when the charset of the page is decided upon.
This demonstrates how the Hebrew prober recognizes a page with mostly English
and two letters of Hebrew as windows-1255.
This demonstrates how the windows-1251 prober recognizes a page with mostly
English and two letters of high-ASCII (0xee) as windows-1251.
smontague and jshin, can you comment here? Which is worse, the bug that caused
this regression or this regression? any suggestions for a low risk fix?
(In reply to comment #4)

> 3. Then set Options > Content > Default Font to verdana and UNcheck "Allow pages
> to choose their own fonts". What then happens is that the page changes into a
> very small font-size. This happens on more pages, sometimes it changes to
> fontsize 7 or 8, but I remember only this one. 

What you observed is a by-product of our universal autodetector mistaking a
Windows-1252-encoded page for a Windows-1255-encoded one. The new font
preference dialog in trunk builds adds to your confusion. (I pointed this out in
another bug whose number I can't recall at the moment.) The 'default font' is
NOT used as the default font for all pages (when 'allow web pages to choose
their own fonts' is NOT checked). It's only the default for the last language
group for which you configured your font settings (by pressing the 'Advanced'
button in the font pref. UI). In your case, it's probably Western (European).
However, the page was detected as Windows-1255, so the default font for the
Hebrew lang. group rather than Verdana (the default font for the 'Western'
lang. group) is used to render the page.

 
Keywords: intl
Summary: Firefox doesn't respect personal font settings anymore on some sites → Windows-1252 pages are mistaken to be in Windows-1255 by universal charset detector
Simon or Jshin, can you please renominate this if you think the current
situation is worse than the bug we fixed.
Flags: blocking1.8b5? → blocking1.8b5-
A possible solution to this and related bugs is to introduce a static limit of,
say, 10 characters, under which the SingleByte probers always return confidence 0.
Another possibility is to introduce a minimum ratio of (checked
characters)/(total page size, including all English and tags), which should be
above something like 0.2.
Yet another possibility is for the SingleByteGroupProber to do both
with-English and without-English filtering on the original text and then
calculate the ratio from those numbers, which summarise the content of the text
more accurately. This change, however, has a performance cost: one more
filtering pass over the page.
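The first two proposals could be sketched roughly like this in Python. This is only an illustration of the idea, not a patch: `gated_confidence` and its parameters are hypothetical names, the defaults 10 and 0.2 are the example values from the proposals above, and the two checks are combined in one function here although they were suggested as alternatives.

```python
def gated_confidence(raw_confidence, checked_chars, total_bytes,
                     min_chars=10, min_ratio=0.2):
    """Suppress a single-byte prober's confidence when the evidence is thin.

    raw_confidence: score the prober computed on its own
    checked_chars:  number of characters the prober actually scored
    total_bytes:    total page size, including English text and tags
    """
    # Static limit: too few scored characters means no opinion at all.
    if checked_chars < min_chars:
        return 0.0
    # Minimum ratio: scored characters must be a real share of the page.
    if total_bytes == 0 or checked_chars / total_bytes < min_ratio:
        return 0.0
    return raw_confidence


print(gated_confidence(0.9, 2, 500))     # two scored bytes: suppressed
print(gated_confidence(0.9, 50, 1000))   # ratio 0.05: suppressed
print(gated_confidence(0.9, 300, 1000))  # enough evidence: score kept
```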
Gentlemen. Here's another page for you - http://www.cubio.fi/99551.php
Recognized as 1251 even though meta says 1252. You can see cyrillic chars at the bottom of the screen, which should be displaying 'ä' instead. 

Better yet, Auto-detect is set to OFF and it doesn't matter. FF stubbornly chooses the WRONG codepage. I have this problem with some bulletin boards as well. Really really tiresome browsing pages because FF selects wrong codepage every time I change a thread or page of a thread. Can't find an example right now, but.. 
(In reply to comment #15)
> Gentlemen. Here's another page for you - http://www.cubio.fi/99551.php
> Recognized as 1251 even though meta says 1252. You can see cyrillic chars at
> the bottom of the screen, which should be displaying 'ä' instead. 

That's quite a different issue: the server is sending
 Content-Type: text/html; charset=windows-1251
in headers, which takes priority over the meta.
OK, but turning off auto detection should force FF to use codepage I've selected, no? 
No. Autodetection means trying to choose a charset by relying on the bare text itself for clues about the charset. Charset names that come from the HTTP headers (channel) or from the HTML are always used without any regard to the charset probers.
There are plenty of examples of the situation you're describing in the test cases page I submitted in bug 86999.
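The precedence described in the last few comments can be summarised in a short Python sketch. It is deliberately simplified and not how the browser code is written: `choose_charset` and its parameters are hypothetical names, and real browsers take further inputs into account that are left out here.

```python
def choose_charset(http_charset, meta_charset, detect, body):
    """Pick a charset in the order discussed in this bug.

    http_charset: charset from the Content-Type HTTP header, or None
    meta_charset: charset declared in the HTML itself, or None
    detect:       callable that guesses a charset from the raw bytes
    body:         raw page bytes, used only if nothing was declared
    """
    if http_charset:   # the server's label wins over everything else
        return http_charset
    if meta_charset:   # next, the label inside the document
        return meta_charset
    return detect(body)  # the probers run only as a last resort


# The cubio.fi case from above: header says 1251, meta says 1252.
result = choose_charset("windows-1251", "windows-1252",
                        lambda body: "windows-1255", b"")
print(result)
```

In that case the header value is used, so the setting of the auto-detector makes no difference.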
That's a closed bug. And not really concerned with FORCING a codepage on a misbehaving site. AFAIK, there isn't a way to do so, codepage will revert even on a reload. I looked for an extension to do so and came up short. In this case, their web server is mis-configured and the html author is doing the right thing? 
Blocks: 264871
(In reply to comment #19)
> That's a closed bug. And not really concerned with FORCING a codepage on a
> misbehaving site. AFAIK, there isn't a way to do so, codepage will revert even
> on a reload. I looked for an extension to do so and came up short. In this
> case, their web server is mis-configured and the html author is doing the 
> right thing? 

It might be. But the HTTP/1.1 specification (RFC 2616) clearly disallows any guessing in the section 3.4.1 if the server specifies the charset:

   HTTP/1.1 recipients MUST respect the charset label provided
   by the sender; and those user agents that have a provision
   to "guess" a charset MUST use the charset from the
   content-type field if they support that charset, rather
   than the recipient's preference, when initially displaying
   a document.

Note the "MUST" - this item is not negotiable. It should also be noted that per spec, a document without a charset label SHOULD be assumed to have iso-8859-1 encoding.

Contact the server administrator and tell them that they are breaking the specification. If they don't know the real charset, then they should be sending nothing instead of some incorrect charset.
QA Contact: general → amyy
I can't reproduce this anymore on the latest trunk and branch builds.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → WORKSFORME