Closed Bug 672081 Opened 14 years ago Closed 14 years ago

Heuristic encoding detector (chardet) interferes with late <meta> charset (after 1024 bytes) handling

Status:

RESOLVED FIXED

Milestone:

mozilla8

People

(Reporter: bluefish6, Assigned: hsivonen)

Attachments

(2 files)

Fix (patch, 2.70 KB), Henri Sivonen (:hsivonen), 14 years ago, bzbarsky: review+
Test that doesn't fail without the fix (patch, 10.30 KB), Henri Sivonen (:hsivonen), 14 years ago

bluefish6@wp.pl

Reporter

Description

•

14 years ago

User Agent: Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0

Firefox doesn't correctly detect the encoding of pages, even though the encoding is specified in META. The problem occurs when both conditions are met:

1) automatic encoding detection is turned on (universal)
2) there are many (at least 900, I think) spaces or HTML comments before the actual <HTML> tag.

The problem disappears when some of the spaces (or comments) are deleted.

In the attached example page, the encoding forced in the HTML is iso-8859-2. Firefox sets Windows-1252 instead. But, what's interesting, it doesn't happen always, but *every other* time the page is loaded (or refreshed).

Maybe this bug is memory-size related, so I created another test page with about 6 KB of spaces at the beginning for those who cannot reproduce the bug with the first URL: http://students.mimuw.edu.pl/~as292532/bug2.html
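For anyone who wants to build a similar page locally, here is a minimal sketch of a generator for such a test file. The function name, the padding size, the exact form of the <meta> element, and the Polish sample text are all assumptions for illustration; the reporter's actual test pages are not reproduced here.

```python
# Hypothetical helper: builds a page whose <meta> charset declaration
# only appears after a long run of leading spaces.
def build_test_page(padding_bytes: int = 1100, charset: str = "iso-8859-2") -> bytes:
    body_text = "Zażółć gęślą jaźń"  # assumed sample text using ISO-8859-2 letters
    html = (
        " " * padding_bytes
        + "<html><head>"
        + f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
        + "<title>late meta test</title></head>"
        + f"<body><p>{body_text}</p></body></html>"
    )
    return html.encode(charset)


page = build_test_page()
meta_offset = page.find(b"<meta")  # byte offset of the declaration

# The same bytes read differently depending on the encoding applied,
# which is what makes a wrong guess visible on screen.
as_declared = page.decode("iso-8859-2")
as_guessed = page.decode("cp1252")
```

Writing `page` to a file and serving it without a charset in the HTTP header should give a page comparable to the one described above, though whether it triggers the same detector result depends on the text used.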

Boris Zbarsky [:bzbarsky]

Comment 1

•

14 years ago

Per HTML5 spec, <meta> charset declarations are only looked for in the first 1024 bytes of the file....

bluefish6@wp.pl

Reporter

Comment 2

•

14 years ago

Boris, thank you for the quick answer. The facts are:

1) The HTML5 specification [1] does say "The element containing the character encoding declaration must be serialized completely within the first 1024 bytes of the document.".
2) IE 8, Opera 11 and Chrome 12 do not interpret this point so strictly. (Why does FF have to follow these still-draft rules so strictly, while it supports invalid tags like <marquee>, which are said to be "bad and evil" by the standards freaks?)
3) If we delete this meta charset declaration [2], FF will not detect the desired encoding correctly. Instead of ISO-8859-2, it will set Windows-1252.

If I understand correctly, "auto-detection" should:

a) try to find a charset declaration in the document, or in the header sent from the server if one is sent,
b) try to guess the correct charset if a charset declaration is absent.

Point "b" is obviously not working. (In fact, the default charset on my Windows isn't even Win-1252, but Win-1250. So where does Win-1252 come from?)

But point "a" is much more interesting. Why is it that, with the meta charset declaration after 1024 bytes, the character encoding is detected correctly one time and wrongly the next (try to refresh the page a few times)? Why is an *invalid* charset declaration (further than 1024 bytes from the beginning) not ignored then? Or why is it used to detect the charset in 50% of cases? You have to decide whether it is invalid or valid! You can't just toss a coin to decide whether to use this declaration or not! And that's exactly what "auto-detection" is currently doing...

[1] http://dev.w3.org/html5/spec/Overview.html#character-encoding-declaration
[2] http://students.mimuw.edu.pl/~as292532/bug3.html
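To illustrate the 1024-byte rule quoted above, here is a deliberately simplified sketch of a byte-limited <meta> scan. It is not the prescan algorithm from the HTML5 specification and not Gecko's code; the regular expression, the function name `prescan_charset`, and the sample inputs are assumptions made for illustration only.

```python
import re

PRESCAN_LIMIT = 1024  # bytes, matching the spec sentence quoted in this bug

# Simplified pattern (an assumption, far looser than the real algorithm):
# a <meta ...> tag that contains charset=<label>.
_META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE)


def prescan_charset(data: bytes, limit: int = PRESCAN_LIMIT):
    """Return the declared label only if the whole <meta> lies in the first `limit` bytes."""
    head = data[:limit]
    for match in _META_CHARSET.finditer(head):
        # Require the closing '>' inside the window, so that a declaration
        # cut off by the limit does not count.
        if head.find(b">", match.end()) != -1:
            return match.group(1).decode("ascii").lower()
    return None


meta = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">'
early = b"<html><head>" + meta           # declaration near the start
late = b" " * 1100 + early               # declaration pushed past the limit
straddling = b" " * (1024 - 20) + meta   # declaration cut off by the limit

print(prescan_charset(early))       # iso-8859-2
print(prescan_charset(late))        # None
print(prescan_charset(straddling))  # None
```

Under this model the late declaration is simply not seen by the early scan, which is why the behavior then depends on whatever later handling and heuristics do with the rest of the stream.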

Boris Zbarsky [:bzbarsky]

Comment 3

•

14 years ago

Yes, the "once correctly once not" behavior is why the bug is still open... ;)

Henri Sivonen (:hsivonen)

Assignee

Comment 4

•

14 years ago

If there's a bug here, it's in the HTML parser--not in the i18n libs. I'll analyze this later.

Assignee: smontagu → nobody

Component: Internationalization → HTML: Parser

QA Contact: i18n → parser

rdz_bug

Comment 5

•

14 years ago

http://hk.image.search.yahoo.com/images Is this why that page has the wrong encoding?

Henri Sivonen (:hsivonen)

Assignee

Comment 6

•

14 years ago

(In reply to comment #3)
> Yes, the "once correctly once not" behavior is why the bug is still open...
> ;)

Indeed.

(In reply to comment #2)
> (In fact, the default charset on my
> Windows isn't even Win-1252, but Win-1250. So where does Win-1252 come from?)

The default encoding of Windows doesn't matter. The default encoding chosen on the Firefox level matters in some cases. However, preliminary investigation suggests that the "Universal" heuristic detector detects your test case as Windows-1252.

FWIW, it seems that of ISO-8859-2 languages, the "Universal" heuristic detector has been trained with Hungarian but I see no indication of it having been trained with Polish (or Czech).

(In reply to comment #5)
> http://hk.image.search.yahoo.com/images
>
> Is this why that page has the wrong encoding?

Very unlikely.

Assignee: nobody → hsivonen

Status: UNCONFIRMED → ASSIGNED

Ever confirmed: true

Henri Sivonen (:hsivonen)

Assignee

Comment 7

•

14 years ago

Here's what happens:

First, the page starts loading with kCharsetFromUserDefault using the encoding that's the current user default (the default differs by localization of Firefox; the underlying flavor of Windows doesn't matter). Non-prescan late <meta> handling sees the <meta> and requests a reload with ISO-8859-2. The heuristic detector continues to examine the stream and requests a change to Windows-1252. Apparently, the second change request gets ignored by the docshell. The reparse with ISO-8859-2 happens with kCharsetFromMetaTag.

When then pressing the reload button in the UI, the following happens: The page starts loading as ISO-8859-2 with kCharsetFromCache. The late <meta> handling sees ISO-8859-2, marks the encoding as confident but fails to stop feeding the heuristic detector. The heuristic detector requests a switch to Windows-1252. Since there isn't already another pending reparse request, the request goes through and the page is reloaded as Windows-1252 with kCharsetFromAutoDetection.

When then pressing the reload button in the UI again, the following happens: The page starts loading as Windows-1252 with kCharsetFromCache. Non-prescan late <meta> handling sees the <meta> and requests a reload with ISO-8859-2. The heuristic detector keeps examining the data, detects Windows-1252 and becomes confident without requesting a reload, since the page is already being loaded as Windows-1252. The reparse with ISO-8859-2 happens with kCharsetFromMetaTag.
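The alternation described above can be reproduced with a small toy model. This is only a sketch of the sequence of events as narrated in this comment, not Gecko code: the function names, the `fixed` flag, and the assumed starting default of windows-1250 are all hypothetical, and the `fixed` branch models the general idea of no longer feeding the detector once the late <meta> has been handled rather than the actual patch.

```python
from typing import List, Optional

META = "ISO-8859-2"     # encoding declared by the late <meta>
GUESS = "windows-1252"  # what the heuristic detector reports for this page


def load_once(start: str, fixed: bool = False) -> str:
    """Return the encoding the page ends up in after one load (toy model)."""
    pending: Optional[str] = None  # at most one reparse request is honored

    # Late <meta> handling: request a reparse if the encoding differs.
    if start != META:
        pending = META

    # Hypothetical fix: stop feeding the detector once the <meta> was seen.
    feed_detector = not fixed

    # Heuristic detector: asks for a switch unless the page already uses
    # its guess; the request is dropped if another one is already pending.
    if feed_detector and start != GUESS and pending is None:
        pending = GUESS

    return pending if pending is not None else start


def reload_sequence(first: str, loads: int, fixed: bool = False) -> List[str]:
    """Encodings shown after each of `loads` consecutive loads."""
    shown, current = [], first
    for _ in range(loads):
        current = load_once(current, fixed)  # the result is cached for the next load
        shown.append(current)
    return shown


print(reload_sequence("windows-1250", 4))
# ['ISO-8859-2', 'windows-1252', 'ISO-8859-2', 'windows-1252']
print(reload_sequence("windows-1250", 4, fixed=True))
# ['ISO-8859-2', 'ISO-8859-2', 'ISO-8859-2', 'ISO-8859-2']
```

The model makes the "every other time" symptom easy to see: the cached result of one load is the starting encoding of the next, so the two competing requests take turns winning.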

Henri Sivonen (:hsivonen)

Assignee

Updated

•

14 years ago

Summary: Wrong encoding auto-detection on pages with spaces before <HTML> → Heuristic encoding detector (chardet) interferes with late <meta> charset (after 1024 bytes) handling

Henri Sivonen (:hsivonen)

Assignee

Comment 8

•

14 years ago

Attached patch: Fix

Attachment #548411 - Flags: review?(bzbarsky)

Henri Sivonen (:hsivonen)

Assignee

Comment 9

•

14 years ago

Attached patch: Test that doesn't fail without the fix

I was unable to write a test case for this. Here's an attempt using the data from the reporter (to make sure my own simpler data wasn't at fault; ideally, the test data should be so obviously windows-1252 that the test case remains valid even if ISO-8859-2 detection improves), but it passes even without the fix, so it's no good as a test. I'm very tempted to land the fix without a test...

Boris Zbarsky [:bzbarsky]

Comment 10

•

14 years ago

Comment on attachment 548411
Fix

r=me

I'm ok with a followup for the test.

Attachment #548411 - Flags: review?(bzbarsky) → review+

Henri Sivonen (:hsivonen)

Assignee

Comment 11

•

14 years ago

(In reply to comment #10)
> Comment on attachment 548411
> Fix
>
> r=me

Thanks. Pushed to inbound:
http://hg.mozilla.org/integration/mozilla-inbound/rev/979276ce6056

> I'm ok with a followup for the test.

smontagu, any ideas how to tweak attachment 548412 to make it fail without the fix? AFAICT, the problem is that the test would need to test charset source behavior in the top-level browsing context but the test harness uses an iframe. Should this be deemed just too hard to test?

OS: Windows XP → All

Hardware: x86 → All

Whiteboard: [inbound]

Target Milestone: --- → mozilla8

Version: 5 Branch → Trunk

(no longer active)

Comment 12

•

14 years ago

http://hg.mozilla.org/mozilla-central/rev/979276ce6056

Status: ASSIGNED → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

Whiteboard: [inbound]


Categories

(Core :: DOM: HTML Parser, defect)
