A saved html page is rendered improperly

RESOLVED FIXED

People

(Reporter: bobbyalexphilip, Assigned: hsivonen)

Tracking

(Keywords: regression)

Version: unspecified
Points: ---

Firefox Tracking Flags

(blocking2.0 final+)

Details

(Whiteboard: [softblocker], URL)

Attachments

(4 attachments, 2 obsolete attachments)

(Reporter)

Description

8 years ago
User-Agent:       Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b10) Gecko/20100101 Firefox/4.0b10
Build Identifier: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b10) Gecko/20100101 Firefox/4.0b10

I get a statement from my broker as an HTML page. In Firefox 4 it is rendered improperly: the alignment is completely broken. I checked the same page in Firefox 3.6 as well as in IE8, and it is shown properly there.

Reproducible: Always

Steps to Reproduce:
Just use the attached HTML page
Actual Results:
See screenshot 1 (the improper rendering).

Expected Results:
See screenshot 2 (the desired rendering).
(Reporter)

Comment 1

8 years ago
Created attachment 509978 [details]
Always reproducible file
(Reporter)

Comment 2

8 years ago
Created attachment 509979 [details]
improper rendering
(Reporter)

Comment 3

8 years ago
Created attachment 509980 [details]
Desired rendering
(Assignee)

Comment 4

8 years ago
This is very odd. I'll look into this.
Assignee: nobody → hsivonen
Status: UNCONFIRMED → NEW
Ever confirmed: true
(Assignee)

Updated

8 years ago
OS: Windows 7 → All
Hardware: x86 → All
Comment 5

8 years ago

The file is UTF-16 encoded, with no BOM. It does have a <meta> saying so, but of course the <meta> prescan won't find that.

Henri, it looks like the HTML5 parser treats the page as UTF-8 whereas the old parser treated it as UTF-16LE...
blocking2.0: --- → ?
Keywords: regression
Comment 6

8 years ago

Looks like the HTML5 spec's charset detection at http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding is just broken for BOM-less UTF-16 without a declared charset. We should use the weasel room from step 7 to do something sane here.

The old parser had a bunch of code dealing with null bytes at file start, including looking for UCS-4 BOMs, looking for some combination of '<' and '\0' in the first 4 bytes, looking for UTF-16-encoded "<!" and "<?" and "<H" and "<h" in the first 4 bytes, etc.

I think we need something like that here too; certainly to handle this page correctly.
Comment 7

8 years ago

Shouldn't be too hard to craft wording for the prescan algorithm that looks for sequences like 00 3C 00 21 00 2D 00 2D (UTF-16BE for "<!--"), or the same in the opposite endianness... http://tools.ietf.org/html/draft-abarth-mime-sniff would need updating too.
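A prescan check along the lines comment 7 sketches might look like this. This is a minimal illustrative helper, not the wording that went into the spec or the code that landed; the function name and return convention are invented for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical prescan sketch: look for a UTF-16-encoded '<' followed by
// '!', '?', '/' or a letter in the first four bytes, in either endianness.
// Returns "UTF-16BE", "UTF-16LE", or "" when nothing matches.
std::string SniffUtf16Markup(const unsigned char* buf, size_t len) {
  if (len < 4) {
    return "";
  }
  auto isTagStart = [](unsigned char c) {
    return c == '!' || c == '?' || c == '/' ||
           (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
  };
  // Big-endian: 00 3C 00 xx, where xx starts a tag ("<!", "<?", "<h", ...).
  if (buf[0] == 0x00 && buf[1] == 0x3C && buf[2] == 0x00 && isTagStart(buf[3])) {
    return "UTF-16BE";
  }
  // Little-endian: 3C 00 xx 00, the same characters in the opposite byte order.
  if (buf[0] == 0x3C && buf[1] == 0x00 && buf[3] == 0x00 && isTagStart(buf[2])) {
    return "UTF-16LE";
  }
  return "";
}
```

The BE pattern 00 3C 00 21 corresponds to "<!", and the LE pattern 3C 00 21 00 is the same two characters with each code unit's bytes swapped.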
(Assignee)

Comment 8

8 years ago
(In reply to comment #5)
> Henri, it looks like the HTML5 parser treats the page as UTF-8 whereas the old
> parser treated it as UTF-16LE...

IE9 PP7 treats this as UTF-8, too, but it seems to drop zero bytes before tokenizing, which isn't exactly safe, so let's not go there.

It's interesting to test what happens with BOMless UTF-16 that contains characters from outside the Basic Latin range: the expectation is that such content "breaks" if the test case here "works" only because zeros are dropped.
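As a toy illustration of the zero-dropping behavior described above (this is not IE's actual code; the helper is invented for the example):

```cpp
#include <string>

// Toy illustration of why dropping zero bytes makes BOMless UTF-16LE
// Basic Latin pages appear to "work" when decoded as a byte encoding:
// every other byte is 0x00, and stripping those leaves the ASCII text.
std::string DropZeroBytes(const std::string& input) {
  std::string out;
  for (char c : input) {
    if (c != '\0') {
      out += c;
    }
  }
  return out;
}
```

For Basic Latin text in UTF-16LE every second byte is 0x00, so stripping zeros recovers the ASCII text; for Cyrillic and other non-Latin characters both bytes of each code unit are non-zero, which is why such pages "break" under this trick.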

It turns out that neither IE6 nor IE9 PP7 (didn't test 7 and 8) sniffs UTF-16 in these cases (spot-checked loading from local disk in IE6, too):
http://hsivonen.iki.fi/test/moz/utf-16-russian-meta-quirks.htm
http://hsivonen.iki.fi/test/moz/utf-16-russian-meta.htm
http://hsivonen.iki.fi/test/moz/utf-16-russian-no-meta-quirks.htm
http://hsivonen.iki.fi/test/moz/utf-16-russian-no-meta.htm

Given that these pages don't "work" in IE, Chrome or Safari, I think this bug should be WONTFIXed. bz, is that OK given this evidence?
Comment 9

8 years ago

I think that gives a really sucky user experience, though. And the pages do work with Opera and Firefox 3.x. They also "work" in IE for Western users....

I really think that UAs should make more of an effort to have this work.  There are some basic heuristics we could use (e.g. the one you proposed on irc) that should have a very low false-positive rate and work well on at least the corpus of documents that IE works on.
(Assignee)

Comment 10

8 years ago
Per IRC discussion with bz, I'll fix this for the case that seems to "work" in IE by checking if every second byte is zero in the first 1024 bytes.
Status: NEW → ASSIGNED
(Assignee)

Comment 11

8 years ago
Created attachment 511429 [details] [diff] [review]
Patch with broken test

The test doesn't work. Framing is probably interfering or something.
Comment 12

8 years ago

Almost certainly, yes.  :(
Comment 13

8 years ago

Blocking; let's get this test fixed and the patch landed.
blocking2.0: ? → final+
Whiteboard: [softblocker]
Comment 14

8 years ago

Fwiw, a chrome mochitest might work for the test...
(Assignee)

Comment 15

8 years ago
Created attachment 512139 [details] [diff] [review]
Patch using the chardet harness
Attachment #511429 - Attachment is obsolete: true
Attachment #512139 - Flags: review?(bzbarsky)
Comment 16

8 years ago

Comment on attachment 512139 [details] [diff] [review]
Patch using the chardet harness

>+        if (i & 1) {
>+          if (evenByteNonZero) {
>+            return;
>+          }
>+          oddByteNonZero = PR_TRUE;
>+        } else {
>+          if (oddByteNonZero) {
>+            return;
>+          }
>+          evenByteNonZero = PR_TRUE;
>+        }

This (duplicated) code might be simpler if you declared the booleans as:

  // even-numbered bytes tracked at 0, odd-numbered bytes tracked at 1
  PRBool byteNonZero[2] = { PR_FALSE, PR_FALSE };

and then you can do

  if (byteNonZero[1 - (i % 2)])
    return;
  byteNonZero[i % 2] = PR_TRUE;

in the loop, and similar for the i+j loop.

I don't see why you need the last check for |oddByteNonZero| before deciding you're LE.  How can that be nonzero?

Also, wouldn't it make sense to just do the mCharset set conditionally at the end there, and share the other three lines of code?

> +     * Sniff for if every other byte in the sniffing buffer is zero.

"Check whether every other byte...."

r=me with those changes.
Attachment #512139 - Flags: review?(bzbarsky) → review+
(Assignee)

Comment 17

8 years ago
Oops, can't land this: it misbehaves on files that are only one byte long.
(Assignee)

Comment 18

8 years ago
Created attachment 512532 [details] [diff] [review]
Address review comments, add magic limit for minimum bytes

The patch requires at least 30 bytes of data (the length of "<title></title>" as UTF-16) in order to sniff as UTF-16. This avoids false positives with very short files.
Attachment #512139 - Attachment is obsolete: true
Attachment #512532 - Flags: review?(bzbarsky)
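The heuristic from comments 10 and 18 can be sketched as follows. This is an approximation under stated assumptions: the function and constant names are invented, and the real patch operates inside Gecko's stream parser rather than as a free function.

```cpp
#include <cstddef>
#include <string>

// Sketch of the landed heuristic: within the first 1024 bytes, if every
// byte at one parity (even/odd offset) is zero and at least one byte at
// the other parity is non-zero, guess BOMless UTF-16 in the corresponding
// endianness. Require at least 30 bytes -- the length of "<title></title>"
// as UTF-16 -- to avoid false positives on very short files.
std::string SniffBomlessUtf16(const unsigned char* buf, size_t len) {
  const size_t kMinBytes = 30;
  const size_t kMaxBytes = 1024;
  if (len < kMinBytes) {
    return "";
  }
  size_t limit = len < kMaxBytes ? len : kMaxBytes;
  // byteNonZero[0] tracks even offsets, byteNonZero[1] tracks odd offsets.
  bool byteNonZero[2] = { false, false };
  for (size_t i = 0; i < limit; ++i) {
    if (buf[i]) {
      if (byteNonZero[1 - (i % 2)]) {
        return "";  // Non-zero bytes at both parities: not BOMless UTF-16.
      }
      byteNonZero[i % 2] = true;
    }
  }
  if (byteNonZero[0]) {
    return "UTF-16LE";  // ASCII-range code units put their payload at even offsets.
  }
  if (byteNonZero[1]) {
    return "UTF-16BE";  // Payload at odd offsets means big-endian.
  }
  return "";  // All zeros: no basis for a guess.
}
```

Note that the parity-tracking array mirrors the reviewer's suggestion in comment 16, and the 30-byte floor is the "magic limit" added in attachment 512532.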
Comment 19

8 years ago

Comment on attachment 512532 [details] [diff] [review]
Address review comments, add magic limit for minimum bytes

r=me
Attachment #512532 - Flags: review?(bzbarsky) → review+
(Assignee)

Comment 20

8 years ago
Comment on attachment 512532 [details] [diff] [review]
Address review comments, add magic limit for minimum bytes

Sounds like softblockers need explicit approval now.
Attachment #512532 - Flags: approval2.0?
Comment 21

8 years ago

Comment on attachment 512532 [details] [diff] [review]
Address review comments, add magic limit for minimum bytes

a=me
Attachment #512532 - Flags: approval2.0? → approval2.0+
(Assignee)

Comment 22

8 years ago
Thanks.
http://hg.mozilla.org/mozilla-central/rev/8eb1b3531dd9
Status: ASSIGNED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED