Closed Bug 566280 Opened 14 years ago Closed 14 years ago

[HTML5] Plain text prefixed by U+0000 displays only U+FFFD

Categories

(Core :: DOM: HTML Parser, defect, P2)

Tracking

RESOLVED FIXED
Tracking Status
blocking2.0 --- final+

People

(Reporter: hsivonen, Assigned: hsivonen)

References

Details

(Keywords: regression)

Attachments

(2 files, 3 obsolete files)

Attached file Test case
Steps to reproduce:
 1) Load the attachment.

Expected results:
�hello world

Actual results:
�
blocking2.0: --- → ?
Keywords: regression
Blocking.
blocking2.0: ? → final+
Assignee: nobody → hsivonen
Attached patch Fix bad copypasta (obsolete)
zwol, HTML5 invalidates http://mxr-test.konigsberg.mozilla.org/mozilla-central/source/layout/reftests/bugs/228856-2.html?force=1 since the HTML5 parsing algorithm turns U+0000 into U+FFFD before it reaches the CSS parser. The test has accidentally passed due to this bug.

What should be done to 228856-2.html when landing this fix?
Is there a specification that explicitly calls for U+0000 to be replaced by U+FFFD? That seems odd to me; if anything, I'd have expected to see a hexbox rather than a Unicode REPLACEMENT CHARACTER. U+FFFD would normally indicate an encoding error (e.g. an invalid UTF-8 sequence or unpaired UTF-16 surrogate, or an invalid code in a legacy codepage that cannot be transcoded to Unicode), not merely a correctly-encoded character that we can't display.
The test is really about what U+0000 does to the CSS parser, so you should definitely pull the contents of the <style> tag out to a separate sheet.

I'm not sure what to do with the divs, though.  Does <div something="..&#0;.."> still generate an attribute with a literal NUL in its value?  If so, we could probably just delete the subtests with literal NULs in the input, and rely on the &#0;s.  If not, we need to convert this to a mochitest that uses JS to examine the parsed style sheet, which is a thing I can do if you don't know how.
(In reply to comment #6)
> The test is really about what U+0000 does to the CSS parser, so you should
> definitely pull the contents of the <style> tag out to a separate sheet.

OK.

> I'm not sure what to do with the divs, though.  Does <div something="..&#0;..">
> still generate an attribute with a literal NUL in its value?

&#0; generates U+FFFD per HTML5.

> If so, we could
> probably just delete the subtests with literal NULs in the input, and rely on
> the &#0;s.  If not, we need to convert this to a mochitest that uses JS to
> examine the parsed style sheet, which is a thing I can do if you don't know
> how.

I don't, so it would be nice if you'd do it to make sure the test still tests what you intended.
(In reply to comment #5)
> Is there a specification that explicitly calls for U+0000 to be replaced by
> U+FFFD?

Yes, the HTML5 spec.

The zero byte:
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#preprocessing-the-input-stream

The numeric reference:
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references
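
As a rough illustration of the two rules linked above, the following Python sketch mimics them outside of Gecko. The helper name `preprocess_input_stream` is hypothetical, and the literal-NUL replacement reflects the behavior described in this bug rather than the code of any particular parser. `html.unescape` is the Python standard library's character reference decoder, which follows the HTML5 rules for numeric references.

```python
import html

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def preprocess_input_stream(text: str) -> str:
    """Hypothetical helper: map each literal U+0000 to U+FFFD,
    as this bug describes for the HTML5 input stream."""
    return text.replace("\x00", REPLACEMENT)


# Literal zero byte at the start of the text, as in the test case.
literal = preprocess_input_stream("\x00hello world")

# Numeric character reference &#0;, decoded with HTML5 rules.
reference = html.unescape("&#0;hello world")
```

Under these assumptions both inputs end up as U+FFFD followed by "hello world", which matches the expected results in the original report.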
(In reply to comment #7)
> > we need to convert this to a mochitest that uses JS to
> > examine the parsed style sheet, which is a thing I can do if you don't know
> > how.
> 
> I don't, so it would be nice if you'd do it to make sure the test still tests
> what you intended.

I will try to find time for this next week.  Note that Monday is a holiday in the USA.

(In reply to comment #8)
> (In reply to comment #5)
> > Is there a specification that explicitly calls for U+0000 to be replaced by
> > U+FFFD?
> 
> Yes, the HTML5 spec.

CSS presently doesn't define the behavior of U+0000 either as a literal character or as a \-escape.  It is tempting to propose that CSS change to match HTML5 - it's not like there's any cost to doing so, and we'd gain predictability.  dbaron, fantasai, what do you think?
Splitting out the first <style> into a <link rel=stylesheet> was enough to make 228856-2.html not fail.

The binary patch that adds a reftest for this bug is like the reference for the test except there's a zero byte where the reference has &#xFFFD;.
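
The relationship between the reftest and its reference can be sketched as follows. This is only an illustration: the markup in `REFERENCE` and the helper name `make_test_from_reference` are assumptions, not the contents of the actual patch. The only property taken from the comment above is that the test has a zero byte where the reference has `&#xFFFD;`.

```python
# Assumed reference markup; the real reftest file may look different.
REFERENCE = (
    b"<!DOCTYPE html>\n"
    b"<title>Reference</title>\n"
    b"<p>&#xFFFD;hello world</p>\n"
)


def make_test_from_reference(reference: bytes) -> bytes:
    """Hypothetical helper: swap the escaped U+FFFD for a raw zero byte."""
    return reference.replace(b"&#xFFFD;", b"\x00")


TEST = make_test_from_reference(REFERENCE)
```

Because the resulting file contains a raw zero byte, diff tools treat it as binary, which is consistent with the patch being described as a binary patch.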
Attachment #447960 - Attachment is obsolete: true
Attachment #448355 - Flags: review?(jonas)
Blocks: 568228
Forgot to update a copyright year.
Attachment #448355 - Attachment is obsolete: true
Attachment #448371 - Flags: review?(jonas)
Attachment #448355 - Flags: review?(jonas)
Comment on attachment 448355
Fix bad copypasta, make the reftest reference work on the tinderbox, make an older reftest not fail

(In reply to comment #11)
> Forgot to update a copyright year.

Sorry. Wrong bug.
Attachment #448355 - Attachment is obsolete: false
Attachment #448355 - Flags: review?(jonas)
Attachment #448371 - Attachment is obsolete: true
Attachment #448371 - Flags: review?(jonas)
Henri, when exactly were these rules for U+0000 and &#0; added to HTML5?  If there was public discussion of this change, a pointer to that would also be useful.
(In reply to comment #13)
> Henri, when exactly were these rules for U+0000 and &#0; added to HTML5?

http://html5.org/tools/web-apps-tracker?from=13&to=14

> If
> there was public discussion of this change, a pointer to that would also be
> useful.

I can't find a public discussion of this change. I can find some emails where I whined about U+0000 getting dropped without a parse error, but I don't see email from me or Hixie about mapping it to U+FFFD.
http://hg.mozilla.org/mozilla-central/rev/14bb99ed59c8
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED