Page starting with lots of NUL chars is incorrectly sniffed as UTF-16BE (with HTML5 parser enabled)

RESOLVED FIXED

Status

()

defect
P1
major
RESOLVED FIXED
8 years ago
8 years ago

People

(Reporter: streetwolf, Assigned: hsivonen)

Tracking

({regression})

Trunk
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(blocking2.0 .x+)

Details

(Whiteboard: [has patch][needs approval], URL)

Attachments

(3 attachments, 2 obsolete attachments)

(Reporter)

Description

8 years ago
User-Agent:       Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b13pre) Gecko/20110302 Firefox/4.0b13pre
Build Identifier: 20110302123843

With the HTML5 Parser enabled the page renders in what looks like Asian characters.  With the Parser disabled it renders correctly.

Reproducible: Always

Steps to Reproduce:
1.Go to http://www.rinconrojo.net/podium/index.php?showforum=4
2.
3.
Actual Results:  
All Asian characters appear.

Expected Results:  
A normal looking page.

Here's another page with the same problem: http://www.creators.com/featurepages/11_comics_speed-bump.html?name=bmp
(Reporter)

Updated

8 years ago
Version: unspecified → Trunk
(Reporter)

Updated

8 years ago
blocking2.0: --- → ?
Something is causing the parser to trip the wrong encoding. Using the encoding menu, setting the option back to iso-8859-1 fixes the issue.

Comment 2

8 years ago
Thank you Dexter for filing the bug. It's been happening since quite some time.
All I thought was a problem with the page. Until today I received email feeds
from the reported forum and I opened it with IE but still couldn't render well
in FF4.
(Reporter)

Comment 3

8 years ago
(In reply to comment #1)
> Something is causing the parser to trip the wrong encoding. Using the encoding
> menu, setting the option back to iso-8859-1 fixes the issue.

It is set on iso-8859-1 with the Parser enabled.  Maybe I am not understanding your comment.
The page starts with a bunch of null bytes (12280 of them, in fact) before any content.  So the <meta> tag is ignored, since it's too late in the document.

So yes, there is very much a problem with the page!
Status: UNCONFIRMED → NEW
Ever confirmed: true
(Reporter)

Comment 5

8 years ago
Without the Parser enabled it look's like the encoding used is UTF-16BE.

With the Parser enabled it's using iso-8859-1.
Henri, SniffBOMlessUTF16BasicLatin looks like it'll treat all null bytes as leaving both elements of byteNonZero as false.  But then the code at the bottom of the function assumes that one of them must be true....
Blocks: 631751
> With the Parser enabled it's using iso-8859-1.

No.  Without the parser enabled (or with the bug fixed), this page will use the user's default encoding, which is locale-dependent.  So it'll still be broken for many users; it really needs to be fixed in its own right.
Changing platform to x86; since WOW64 in user agent indicates 32bit build of Firefox (platform refers to browser build, not OS).
Hardware: x86_64 → x86
(Reporter)

Comment 9

8 years ago
(In reply to comment #8)
> Changing platform to x86; since WOW64 in user agent indicates 32bit build of
> Firefox (platform refers to browser build, not OS).

I missed this.  Thanks for correcting.
Posted patch Fix (obsolete) — Splinter Review
I'm not quote sure how to write a test for this, since the detected charset will be locale-dependent...  Can I just assume that it'll be ISO-8859-1 anywhere where we run tests?
Attachment #516491 - Flags: review?(hsivonen)
Assignee: nobody → bzbarsky
OS: Windows 7 → All
Priority: -- → P1
Hardware: x86 → All
Whiteboard: [need review]

Updated

8 years ago
Summary: Page renders incorrectly with HTML5 Parser enabled. → Page starting with lots of NUL chars is incorrectly sniffed as UTF-16BE (with HTML5 parser enabled)

Comment 11

8 years ago
when will this land so it doesnt become a hardblocker? (hopefully it doesnt become one)
It needs review first.
(Assignee)

Comment 13

8 years ago
Comment on attachment 516491 [details] [diff] [review]
Fix

r=me.
Attachment #516491 - Flags: review?(hsivonen) → review+
(Assignee)

Updated

8 years ago
Whiteboard: [need review] → [has patch][needs approval]
(Assignee)

Updated

8 years ago
Attachment #516491 - Flags: approval2.0?
(Assignee)

Comment 14

8 years ago
Hmm. Actually, won't this patch still sniff as UTF-16 if there's, say 500 zero bytes? A safer follow-up patch coming up.
(Assignee)

Comment 15

8 years ago
Comment on attachment 516491 [details] [diff] [review]
Fix

Clearing review. Will post another patch after testing it.
Attachment #516491 - Flags: review-
Attachment #516491 - Flags: review+
Attachment #516491 - Flags: approval2.0?
(Assignee)

Updated

8 years ago
Whiteboard: [has patch][needs approval]
(Assignee)

Comment 16

8 years ago
Posted patch Safer fixSplinter Review
Mochitests coming up in another patch separately.
Assignee: bzbarsky → hsivonen
Attachment #516491 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #516517 - Flags: review?(bzbarsky)
(Assignee)

Updated

8 years ago
Keywords: regression
Whiteboard: [has patch][needs review]
(Assignee)

Comment 17

8 years ago
(In reply to comment #14)
> Hmm. Actually, won't this patch still sniff as UTF-16 if there's, say 500 zero
> bytes? A safer follow-up patch coming up.

Rather, the problem I'm trying to fix by making this "safer" is where there are runs of zeros and non-zeros are all even or odd. It seems safer to trigger the UTF-16 stuff only if we the an alternation of zero and non-zero all the way.
(Assignee)

Comment 18

8 years ago
Posted patch MochitestSplinter Review
The file in the binary patch part has a long run of zeros at the start but one non-zero byte among them within the first 1024 bytes. Then after the 1024-byte limit, there's ASCII text.
Attachment #516520 - Flags: review?(bzbarsky)
(Assignee)

Comment 19

8 years ago
About blockingness: I think this bug is worse from the Web compat point of view than bug 631751 (whose fix caused this regression). If this fix doesn't make it to Firefox 4.0, I think we should take the fix for 4.0.1 or, failing that, back out bug 631751 for 4.0.1.
(Assignee)

Comment 20

8 years ago
Here's a patch that reverses the fix for bug 631751 in case we want to do that instead.
Comment on attachment 516517 [details] [diff] [review]
Safer fix

OK, yeah.  This would be the other way to deal.  r=me
Attachment #516517 - Flags: review?(bzbarsky) → review+
Comment on attachment 516520 [details] [diff] [review]
Mochitest

r=me.

Please request approval for this?

For drivers: I think this is very safe, in the sense that it turns off some sniffing that we added recently in cases when there are lots of null bytes at the beginning of the file, because the sniffing is not reliable in that situation.
Attachment #516520 - Flags: review?(bzbarsky) → review+
(Assignee)

Comment 23

8 years ago
Comment on attachment 516517 [details] [diff] [review]
Safer fix

Setting the approval request per bz's comment.
Attachment #516517 - Flags: approval2.0?
(Assignee)

Updated

8 years ago
Whiteboard: [has patch][needs review] → [has patch][needs approval]
Comment on attachment 516517 [details] [diff] [review]
Safer fix

a=beltzner
Attachment #516517 - Flags: approval2.0? → approval2.0+
Can someone land this nowrightnow?
blocking2.0: ? → .x+
(Assignee)

Comment 26

8 years ago
I'll land.
Landed:

http://hg.mozilla.org/mozilla-central/rev/ed9c526f93b5
Status: ASSIGNED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
(Assignee)

Comment 28

8 years ago
Test landed:
http://hg.mozilla.org/mozilla-central/rev/770e2709cff5
Flags: in-testsuite+
I think the first fix was better. Or maybe the second fix needs to distinguish between 0x0000 characters before and after having found a non-zero byte.

The attached test case renders differently in current trunk than in Fx 4.0 b12 and IE 9.
(Assignee)

Comment 31

8 years ago
(In reply to comment #29)
> Created attachment 516644 [details]
> Test case, UTF 16 BE with leading zeros

Has this case been seen in the wild?
Not that I know of. New bug, target Fx 5?
(Assignee)

Comment 33

8 years ago
(In reply to comment #32)
> Not that I know of. New bug, target Fx 5?

Let's not go further down the slippery slope here.

Supporting imaginable bogosity is not a goal. The point is supporting the real-world bogosity seen in bug 631751 to the extent it works in both IE9 and Firefox 3.6--and only to that extent. Note that Chrome and Safari don't support that bogosity anyway.
> The attached test case renders differently

That was rather the point of Henri's approach, yes.

Updated

8 years ago
Attachment #516644 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.