Last Comment Bug 672081 - Heuristic encoding detector (chardet) interferes with late <meta> charset (after 1024 bytes) handling
: Heuristic encoding detector (chardet) interferes with late <meta> charset (af...
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: HTML: Parser (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla8
Assigned To: Henri Sivonen (:hsivonen)
:
Mentors:
http://students.mimuw.edu.pl/~as29253...
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-07-16 15:09 PDT by bluefish6@wp.pl
Modified: 2011-07-28 08:25 PDT (History)
5 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Fix (2.70 KB, patch)
2011-07-26 04:25 PDT, Henri Sivonen (:hsivonen)
bzbarsky: review+
Details | Diff | Review
Test that doesn't fail without the fix (10.30 KB, patch)
2011-07-26 04:28 PDT, Henri Sivonen (:hsivonen)
no flags Details | Diff | Review

Description bluefish6@wp.pl 2011-07-16 15:09:32 PDT
Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0

Firefox don't detect correctly encoding of pages, even though the encoding is specified in META. The problem occurs when both conditions are met:
1) automatic encoding detection is turned on (universal)
2) there are many (at least 900 I think) spaces or html comments before the actual <HTML> tag.
The problem disappears, when some of the spaces (or comments) are deleted. In the attached example page, the encoding forced in html is iso-8859-2. Firefox sets Windows-1252 instead. But, what's interesting, it happens not always, but *every other* time the page is loaded (or refreshed).
Maybe this bug is memory-size related, so I created another test page with about 6kb spaces in the beggining for those that cannot reproduce the bug with the first url:
http://students.mimuw.edu.pl/~as292532/bug2.html
Comment 1 Boris Zbarsky [:bz] (Out June 25-July 6) 2011-07-16 20:51:44 PDT
Per HTML5 spec, <meta> charset declarations are only looked for in the first 512 bytes of the file....
Comment 2 bluefish6@wp.pl 2011-07-16 21:56:36 PDT
Boris, thank you for a quich anwser.

The facts are:
1) The HTML5 specification [1] does say "The element containing the character encoding declaration must be serialized completely within the first 1024 bytes of the document.".
2) IE 8, Opera 11 and Chrome 12 does not have such strict understanding of this point. (Why FF have to follow these still draft rules so strictly, while it supports invalid tags like <marquee>, that are said to be "bad and evil" by the standard freaks?)
3) If we delete this meta charset declaration [2], FF will not detect the desired encoding correctly. Instead of ISO-8859-2, it will set Windows-1252.

If I understand correctly, "auto-detection" should:
a) try to find a charset declaration in the document or in the header sent from server if it is sent,
b) try to guess the correct charset if charset declaration is absent.

Point "b" is obviously not working. (In fact, the default charset on my Windows isn't even Win-1252, but Win-1250. So where does Win-1252 come from?)

But point "a" is much more interesting. Why is it so, that with the meta charset declaration after 1024 bytes, the character encoding is detected once correctly, and once wrong (try to refresh the page a few times)? Why *invalid* (further than 1024b from the beginning) charset encoding is not ignored then? Or why is it used to detect the charset in 50% cases? You have to decide, if it is invalid or valid! You can't just toss a coin to decide, whether to use this declaration or not! And that's exactly what "auto-detection" is currently doing...


[1] http://dev.w3.org/html5/spec/Overview.html#character-encoding-declaration
[2] http://students.mimuw.edu.pl/~as292532/bug3.html
Comment 3 Boris Zbarsky [:bz] (Out June 25-July 6) 2011-07-16 22:19:07 PDT
Yes, the "once correctly once not" behavior is why the bug is still open... ;)
Comment 4 Henri Sivonen (:hsivonen) 2011-07-17 06:29:06 PDT
If there's a bug here, it's in the HTML parser--not in the i18n libs. I'll analyze this later.
Comment 5 Rodze 2011-07-17 07:29:33 PDT
http://hk.image.search.yahoo.com/images

Is this why that page has the wrong encoding?
Comment 6 Henri Sivonen (:hsivonen) 2011-07-25 07:41:08 PDT
(In reply to comment #3)
> Yes, the "once correctly once not" behavior is why the bug is still open...
> ;)

Indeed.

(In reply to comment #2)
> (In fact, the default charset on my
> Windows isn't even Win-1252, but Win-1250. So where does Win-1252 come from?)

The default encoding of Windows doesn't matter. The default encoding chosen on the Firefox level matters in some cases. However, preliminary investigation suggests that the "Universal" heuristic detector detects your test case as Windows-1252.

FWIW, it seems that of ISO-8859-2 languages, the "Universal" heuristic detector has been trained with Hungarian but I see no indication of it having been trained with Polish (or Czech).

(In reply to comment #5)
> http://hk.image.search.yahoo.com/images
> 
> Is this why that page has the wrong encoding?

Very unlikely.
Comment 7 Henri Sivonen (:hsivonen) 2011-07-26 03:27:30 PDT
Here's what happens:

First, the page starts loading with kCharsetFromUserDefault using the encoding that's the current user default (the default differs by localization of Firefox--the underlying flavor of Windows doesn't matter).

Non-prescan late <meta> handling sees the <meta> and requests a reload with ISO-8859-2. The heuristic detector continues to examine the stream and and requests change to Windows-1252. Apparently, the second change request gets ignored by the docshell.

The reparse with ISO-8859-2 happens with kCharsetFromMetaTag.

When then pressing the reload button in the UI, the following happens:
The page starts loading as ISO-8859-2 with kCharsetFromCache. The late <meta> handling sees ISO-8859-2, marks the encoding as confident but fails to stop feeding the heuristic detector. The heuristic detector requests a switch to Windows-1252.

Since there isn't already another pending reparse request, the request goes through and the page is reloaded as Windows-1252 with kCharsetFromAutoDetection.

When then pressing the reload button in the UI again, the following happens:
The page starts loading as Windows-1252 with kCharsetFromCache. Non-prescan late <meta> handling sees the <meta> and requests a reload with ISO-8859-2. The heuristic detector keeps examining the data, detects Windows-1252 and becomes confident without requesting a reload, since the page is already being loaded as Windows-1252.

The reparse with ISO-8859-2 happens with kCharsetFromMetaTag.
Comment 8 Henri Sivonen (:hsivonen) 2011-07-26 04:25:39 PDT
Created attachment 548411 [details] [diff] [review]
Fix
Comment 9 Henri Sivonen (:hsivonen) 2011-07-26 04:28:33 PDT
Created attachment 548412 [details] [diff] [review]
Test that doesn't fail without the fix

I was unable to write a test case for this. Here's an attempt using the data from the reporter (to make sure my own simpler data wasn't at fault; ideally, the test data should be so obviously windows-1252 that the test case remains valid even if ISO-8859-2 detection improves), but it passes even without the fix, so it's no good as a test. I'm very tempted to land the fix without a test...
Comment 10 Boris Zbarsky [:bz] (Out June 25-July 6) 2011-07-26 06:16:31 PDT
Comment on attachment 548411 [details] [diff] [review]
Fix

r=me

I'm ok with a followup for the test.
Comment 11 Henri Sivonen (:hsivonen) 2011-07-28 03:50:15 PDT
(In reply to comment #10)
> Comment on attachment 548411 [details] [diff] [review] [review]
> Fix
> 
> r=me

Thanks. Pushed to inbound:
http://hg.mozilla.org/integration/mozilla-inbound/rev/979276ce6056

> I'm ok with a followup for the test.

smontagu, any ideas how to tweak attachment 548412 [details] [diff] [review] to make it fail without the fix? AFAICT, the problem is that the test would need to test charset source behavior in the top-level browsing context but the test harness uses an iframe. Should be deemed just too hard to test?
Comment 12 :Ehsan Akhgari (out sick) 2011-07-28 08:25:32 PDT
http://hg.mozilla.org/mozilla-central/rev/979276ce6056

Note You need to log in before you can comment on or make changes to this bug.