Closed Bug 638318 Opened 9 years ago Closed 9 years ago

Page starting with lots of NUL chars is incorrectly sniffed as UTF-16BE (with HTML5 parser enabled)

Categories

(Core :: DOM: HTML Parser, defect, P1)


Tracking


RESOLVED FIXED
Tracking Status
blocking2.0 --- .x+

People

(Reporter: streetwolf52, Assigned: hsivonen)

References


Details

(Keywords: regression, Whiteboard: [has patch][needs approval])

Attachments

(3 files, 2 obsolete files)

User-Agent:       Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b13pre) Gecko/20110302 Firefox/4.0b13pre
Build Identifier: 20110302123843

With the HTML5 Parser enabled, the page renders as what look like Asian characters.  With the Parser disabled, it renders correctly.

Reproducible: Always

Steps to Reproduce:
1. Go to http://www.rinconrojo.net/podium/index.php?showforum=4
Actual Results:  
The page appears entirely as Asian characters.

Expected Results:  
A normal looking page.

Here's another page with the same problem: http://www.creators.com/featurepages/11_comics_speed-bump.html?name=bmp
Version: unspecified → Trunk
blocking2.0: --- → ?
Something is causing the parser to trip the wrong encoding. Using the encoding menu, setting the option back to iso-8859-1 fixes the issue.
Thank you, Dexter, for filing the bug. This has been happening for quite some time.
I thought it was just a problem with the page until today, when I received email
feeds from the reported forum and opened it with IE, but it still wouldn't render
well in FF4.
(In reply to comment #1)
> Something is causing the parser to trip the wrong encoding. Using the encoding
> menu, setting the option back to iso-8859-1 fixes the issue.

It is set on iso-8859-1 with the Parser enabled.  Maybe I am not understanding your comment.
The page starts with a bunch of null bytes (12280 of them, in fact) before any content.  So the <meta> tag is ignored: it falls past the 1024-byte prescan limit and is too late in the document.

So yes, there is very much a problem with the page!
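For context, the parser only examines a limited prefix of the stream when looking for a <meta> charset declaration. Here is a minimal Python sketch of that idea (an illustration of the 1024-byte prescan limit mentioned later in this bug, not Gecko's actual prescan code; the helper name is made up):

```python
import re

PRESCAN_LIMIT = 1024  # only this many leading bytes are examined

def meta_charset_in_prescan(data):
    # Hypothetical helper: look for a <meta> charset declaration, but
    # only within the first PRESCAN_LIMIT bytes of the stream.
    head = data[:PRESCAN_LIMIT]
    m = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', head)
    return m.group(1).decode("ascii") if m else None
```

A page that opens with 12280 NUL bytes pushes its <meta> tag far past that window, so the declaration is never seen and sniffing takes over.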
Status: UNCONFIRMED → NEW
Ever confirmed: true
Without the Parser enabled it looks like the encoding used is UTF-16BE.

With the Parser enabled it's using iso-8859-1.
Henri, SniffBOMlessUTF16BasicLatin looks like it'll treat all null bytes as leaving both elements of byteNonZero as false.  But then the code at the bottom of the function assumes that one of them must be true....
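For illustration, here is a minimal Python sketch of the heuristic as described above (a hypothetical reconstruction, not the actual Gecko C++ source; the names only mirror the ones mentioned in the comment):

```python
def sniff_bomless_utf16_basic_latin(data):
    # Track whether a non-zero byte has been seen at even offsets and
    # at odd offsets.
    byte_non_zero = [False, False]
    for i, b in enumerate(data):
        if b != 0:
            byte_non_zero[i % 2] = True
            if byte_non_zero[0] and byte_non_zero[1]:
                return None  # non-zero bytes in both lanes: not UTF-16
    # Bug: for all-NUL input neither flag was set, yet the code below
    # assumes exactly one of them must be True.
    if not byte_non_zero[0]:
        return "UTF-16BE"  # all-NUL input falls through to here
    return "UTF-16LE"
```

An all-NUL prefix like the reported page's 12280 null bytes sets neither flag, so the final branch misreports UTF-16BE.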
Blocks: 631751
> With the Parser enabled it's using iso-8859-1.

No.  Without the parser enabled (or with the bug fixed), this page will use the user's default encoding, which is locale-dependent.  So it'll still be broken for many users; it really needs to be fixed in its own right.
Changing platform to x86, since WOW64 in the user agent indicates a 32-bit build of Firefox (the platform field refers to the browser build, not the OS).
Hardware: x86_64 → x86
(In reply to comment #8)
> Changing platform to x86, since WOW64 in the user agent indicates a 32-bit
> build of Firefox (the platform field refers to the browser build, not the OS).

I missed this.  Thanks for correcting.
Attached patch Fix (obsolete) — Splinter Review
I'm not quite sure how to write a test for this, since the detected charset will be locale-dependent...  Can I just assume that it'll be ISO-8859-1 anywhere we run tests?
Attachment #516491 - Flags: review?(hsivonen)
Assignee: nobody → bzbarsky
OS: Windows 7 → All
Priority: -- → P1
Hardware: x86 → All
Whiteboard: [need review]
Summary: Page renders incorrectly with HTML5 Parser enabled. → Page starting with lots of NUL chars is incorrectly sniffed as UTF-16BE (with HTML5 parser enabled)
When will this land, so it doesn't become a hardblocker? (Hopefully it doesn't become one.)
It needs review first.
Comment on attachment 516491 [details] [diff] [review]
Fix

r=me.
Attachment #516491 - Flags: review?(hsivonen) → review+
Whiteboard: [need review] → [has patch][needs approval]
Hmm. Actually, won't this patch still sniff as UTF-16 if there are, say, 500 zero bytes? A safer follow-up patch coming up.
Comment on attachment 516491 [details] [diff] [review]
Fix

Clearing review. Will post another patch after testing it.
Attachment #516491 - Flags: review-
Attachment #516491 - Flags: review+
Attachment #516491 - Flags: approval2.0?
Whiteboard: [has patch][needs approval]
Attached patch Safer fixSplinter Review
Mochitests coming up in another patch separately.
Assignee: bzbarsky → hsivonen
Attachment #516491 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #516517 - Flags: review?(bzbarsky)
Keywords: regression
Whiteboard: [has patch][needs review]
(In reply to comment #14)
> Hmm. Actually, won't this patch still sniff as UTF-16 if there's, say 500 zero
> bytes? A safer follow-up patch coming up.

Rather, the problem I'm trying to fix by making this "safer" is the case where there are runs of zeros and the non-zero bytes are all at even or all at odd offsets. It seems safer to trigger the UTF-16 sniffing only if we see an alternation of zero and non-zero bytes all the way through.
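As an illustration of that stricter rule, here is a hedged Python sketch (not the shipped C++ patch; the function name is made up): sniff UTF-16 only when zero and non-zero bytes strictly alternate across the whole buffer.

```python
def sniff_utf16_strict(data):
    if len(data) < 2:
        return None
    # UTF-16BE Basic Latin: zero high byte at every even offset and a
    # non-zero low byte at every odd offset -- the pattern must hold
    # throughout, so long NUL runs no longer qualify.
    if all(b == 0 for b in data[0::2]) and all(b != 0 for b in data[1::2]):
        return "UTF-16BE"
    # UTF-16LE Basic Latin: the mirror image.
    if all(b != 0 for b in data[0::2]) and all(b == 0 for b in data[1::2]):
        return "UTF-16LE"
    return None
```

Under this rule, a page opening with thousands of NUL bytes is simply not sniffed as UTF-16, while genuine BOMless UTF-16 Basic Latin still is.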
Attached patch MochitestSplinter Review
The file in the binary patch part has a long run of zeros at the start but one non-zero byte among them within the first 1024 bytes. Then after the 1024-byte limit, there's ASCII text.
Attachment #516520 - Flags: review?(bzbarsky)
About blockingness: I think this bug is worse from the Web compat point of view than bug 631751 (whose fix caused this regression). If this fix doesn't make it to Firefox 4.0, I think we should take the fix for 4.0.1 or, failing that, back out bug 631751 for 4.0.1.
Here's a patch that reverses the fix for bug 631751 in case we want to do that instead.
Comment on attachment 516517 [details] [diff] [review]
Safer fix

OK, yeah.  This would be the other way to deal.  r=me
Attachment #516517 - Flags: review?(bzbarsky) → review+
Comment on attachment 516520 [details] [diff] [review]
Mochitest

r=me.

Please request approval for this?

For drivers: I think this is very safe, in the sense that it turns off some sniffing that we added recently in cases when there are lots of null bytes at the beginning of the file, because the sniffing is not reliable in that situation.
Attachment #516520 - Flags: review?(bzbarsky) → review+
Comment on attachment 516517 [details] [diff] [review]
Safer fix

Setting the approval request per bz's comment.
Attachment #516517 - Flags: approval2.0?
Whiteboard: [has patch][needs review] → [has patch][needs approval]
Comment on attachment 516517 [details] [diff] [review]
Safer fix

a=beltzner
Attachment #516517 - Flags: approval2.0? → approval2.0+
Can someone land this now, right now?
blocking2.0: ? → .x+
Landed:

http://hg.mozilla.org/mozilla-central/rev/ed9c526f93b5
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
I think the first fix was better. Or maybe the second fix needs to distinguish between 0x0000 characters before and after having found a non-zero byte.

The attached test case renders differently in current trunk than in Fx 4.0 b12 and IE 9.
(In reply to comment #29)
> Created attachment 516644 [details]
> Test case, UTF 16 BE with leading zeros

Has this case been seen in the wild?
Not that I know of. New bug, target Fx 5?
(In reply to comment #32)
> Not that I know of. New bug, target Fx 5?

Let's not go further down the slippery slope here.

Supporting imaginable bogosity is not a goal. The point is supporting the real-world bogosity seen in bug 631751 to the extent it works in both IE9 and Firefox 3.6--and only to that extent. Note that Chrome and Safari don't support that bogosity anyway.
> The attached test case renders differently

That was rather the point of Henri's approach, yes.
Attachment #516644 - Attachment is obsolete: true