Bug 566280 - [HTML5] Plain text prefixed by U+0000 displays only U+FFFD
Status: RESOLVED FIXED
Keywords: regression
Product: Core
Classification: Components
Component: HTML: Parser
Version: Trunk
Platform: All All
Importance: P2 normal
Target Milestone: ---
Assigned To: Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
Mentors:
Duplicates: 552137 568228
Depends on:
Blocks: 568228
Reported: 2010-05-17 00:58 PDT by Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
Modified: 2010-06-09 01:47 PDT
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
final+


Attachments
Test case (13 bytes, text/html)
2010-05-17 00:58 PDT, Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
no flags
Fix bad copypasta (2.39 KB, patch)
2010-05-27 01:12 PDT, Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
no flags
Fix bad copypasta, make the reftest reference work on the tinderbox (2.39 KB, patch)
2010-05-28 03:50 PDT, Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
no flags
Fix bad copypasta, make the reftest reference work on the tinderbox, make an older reftest not fail (5.15 KB, patch)
2010-05-31 03:45 PDT, Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
jonas: review+
Fix bad copypasta, make the reftest reference work on the tinderbox, make an older reftest not fail, fix WHATWG copyright year (5.53 KB, patch)
2010-05-31 07:20 PDT, Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
no flags

Description Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2010-05-17 00:58:14 PDT
Created attachment 445655
Test case

Steps to reproduce:
 1) Load the attachment.

Expected results:
�hello world

Actual results:
�
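
A similar test case can be generated with a few lines of Python. This is only a sketch: the attachment's exact bytes are not shown in this report, so the content below (one NUL byte, "hello world", and a trailing newline, which adds up to 13 bytes) and the file name nul-prefix.html are assumptions.

```python
from pathlib import Path

# Assumed bytes of the test case: one NUL, the visible text, and a newline.
TEST_CASE = b"\x00hello world\n"

# Text the report expects to see once the NUL has become U+FFFD.
EXPECTED_TEXT = "\ufffdhello world"


def write_test_case(path: str = "nul-prefix.html") -> int:
    """Write the assumed test case to `path` and return the byte count."""
    return Path(path).write_bytes(TEST_CASE)
```

Calling write_test_case() and loading the resulting file in the browser should then show whether the text after the replacement character survives.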
Comment 1 Johnny Stenback (:jst, jst@mozilla.com) 2010-05-26 16:21:59 PDT
Blocking.
Comment 2 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2010-05-27 01:12:47 PDT
Created attachment 447722
Fix bad copypasta
Comment 3 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2010-05-28 03:50:52 PDT
Created attachment 447960
Fix bad copypasta, make the reftest reference work on the tinderbox
Comment 4 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2010-05-28 03:55:40 PDT
zwol, HTML5 invalidates http://mxr-test.konigsberg.mozilla.org/mozilla-central/source/layout/reftests/bugs/228856-2.html?force=1 since the HTML5 parsing algorithm turns U+0000 into U+FFFD before it reaches the CSS parser. The test has accidentally passed due to this bug.

What should be done to 228856-2.html when landing this fix?
Comment 5 Jonathan Kew (:jfkthame) 2010-05-28 09:00:59 PDT
Is there a specification that explicitly calls for U+0000 to be replaced by U+FFFD? That seems odd to me; if anything, I'd have expected to see a hexbox rather than a Unicode REPLACEMENT CHARACTER. U+FFFD would normally indicate an encoding error (e.g. an invalid UTF-8 sequence or unpaired UTF-16 surrogate, or an invalid code in a legacy codepage that cannot be transcoded to Unicode), not merely a correctly-encoded character that we can't display.
Comment 6 Zack Weinberg (:zwol) 2010-05-28 09:08:11 PDT
The test is really about what U+0000 does to the CSS parser, so you should definitely pull the contents of the <style> tag out to a separate sheet.

I'm not sure what to do with the divs, though.  Does <div something="..&#0;.."> still generate an attribute with a literal NUL in its value?  If so, we could probably just delete the subtests with literal NULs in the input, and rely on the &#0;s.  If not, we need to convert this to a mochitest that uses JS to examine the parsed style sheet, which is a thing I can do if you don't know how.
Comment 7 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2010-05-29 08:52:10 PDT
(In reply to comment #6)
> The test is really about what U+0000 does to the CSS parser, so you should
> definitely pull the contents of the <style> tag out to a separate sheet.

OK.

> I'm not sure what to do with the divs, though.  Does <div something="..&#0;..">
> still generate an attribute with a literal NUL in its value?

&#0; generates U+FFFD per HTML5.

> If so, we could
> probably just delete the subtests with literal NULs in the input, and rely on
> the &#0;s.  If not, we need to convert this to a mochitest that uses JS to
> examine the parsed style sheet, which is a thing I can do if you don't know
> how.

I don't, so it would be nice if you'd do it to make sure the test still tests what you intended.
Comment 8 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2010-05-29 08:56:36 PDT
(In reply to comment #5)
> Is there a specification that explicitly calls for U+0000 to be replaced by
> U+FFFD?

Yes, the HTML5 spec.

The zero byte:
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#preprocessing-the-input-stream

The numeric reference:
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references
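
The two rules can be sketched in Python. This is an illustration, not the parser's code: replace_nul is a hypothetical helper that models the input-stream rule as described in this bug, while html.unescape is the standard library function, which is documented to follow the HTML5 rules for character references.

```python
import html

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replace_nul(text: str) -> str:
    """Hypothetical helper: map each literal U+0000 to U+FFFD."""
    return text.replace("\x00", REPLACEMENT)


def decode_reference(text: str) -> str:
    """Resolve character references using the stdlib's HTML5 rules."""
    return html.unescape(text)
```

With these definitions, replace_nul("\x00hello world") keeps the text after the replacement character, and decode_reference("&#0;") yields U+FFFD rather than a literal NUL.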
Comment 9 Zack Weinberg (:zwol) 2010-05-29 09:26:12 PDT
(In reply to comment #7)
> > we need to convert this to a mochitest that uses JS to
> > examine the parsed style sheet, which is a thing I can do if you don't know
> > how.
> 
> I don't, so it would be nice if you'd do it to make sure the test still tests
> what you intended.

I will try to find time for this next week.  Note that Monday is a holiday in the USA.

(In reply to comment #8)
> (In reply to comment #5)
> > Is there a specification that explicitly calls for U+0000 to be replaced by
> > U+FFFD?
> 
> Yes, the HTML5 spec.

CSS presently doesn't define the behavior of U+0000 either as a literal character or as a \-escape.  It is tempting to propose that CSS change to match HTML5 - it's not like there's any cost to doing so, and we'd gain predictability.  dbaron, fantasai, what do you think?
Comment 10 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2010-05-31 03:45:54 PDT
Created attachment 448355
Fix bad copypasta, make the reftest reference work on the tinderbox, make an older reftest not fail

Splitting out the first <style> into a <link rel=stylesheet> was enough to make 228856-2.html not fail.

The binary patch that adds a reftest for this bug is like the reference for the test except there's a zero byte where the reference has &#xFFFD;.
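
The relationship between the test and its reference can be modelled roughly as follows. The markup strings and the rendered_text helper are hypothetical (the actual reftest files are not shown in this report); the point is only that a zero byte in the test and &#xFFFD; in the reference should end up as the same text.

```python
import html


def rendered_text(source: str) -> str:
    """Rough model: replace literal NULs, then resolve character references."""
    return html.unescape(source.replace("\x00", "\ufffd"))


# Hypothetical stand-ins for the test and reference contents.
TEST_SOURCE = "\x00hello world"
REFERENCE_SOURCE = "&#xFFFD;hello world"
```

If the two inputs did not produce the same text under this model, the reftest comparison would be expected to fail.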
Comment 11 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2010-05-31 07:20:01 PDT
Created attachment 448371
Fix bad copypasta, make the reftest reference work on the tinderbox, make an older reftest not fail, fix WHATWG copyright year

Forgot to update a copyright year.
Comment 12 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2010-05-31 23:31:30 PDT
Comment on attachment 448355
Fix bad copypasta, make the reftest reference work on the tinderbox, make an older reftest not fail

(In reply to comment #11)
> Forgot to update a copyright year.

Sorry. Wrong bug.
Comment 13 Zack Weinberg (:zwol) 2010-06-01 08:54:38 PDT
Henri, when exactly were these rules for U+0000 and &#0; added to HTML5?  If there was public discussion of this change, a pointer to that would also be useful.
Comment 14 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2010-06-03 06:51:27 PDT
(In reply to comment #13)
> Henri, when exactly were these rules for U+0000 and &#0; added to HTML5?

http://html5.org/tools/web-apps-tracker?from=13&to=14

> If
> there was public discussion of this change, a pointer to that would also be
> useful.

I can't find a public discussion of this change. I can find some emails where I whined about U+0000 getting dropped without a parse error, but I don't see email from me or Hixie about mapping it to U+FFFD.
Comment 15 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2010-06-04 01:54:24 PDT
*** Bug 568228 has been marked as a duplicate of this bug. ***
Comment 16 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2010-06-04 01:55:49 PDT
*** Bug 552137 has been marked as a duplicate of this bug. ***
Comment 17 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2010-06-09 01:47:59 PDT
http://hg.mozilla.org/mozilla-central/rev/14bb99ed59c8
