Closed Bug 572215 Opened 15 years ago Closed 15 years ago

[HTML5] ASCII unprintable characters (0x00-0x1F) rendered as questionmark-in-diamond instead of hexbox

Categories

(Core :: DOM: HTML Parser, defect)

x86
Windows 7
defect
Not set
major

Tracking

RESOLVED INVALID

People

(Reporter: netrolller.3d, Unassigned)

References

Details

(Keywords: regression)

With the HTML5 parser enabled, ASCII unprintable characters no longer render as hexboxes, but rather as questionmark-in-diamond glyphs (the Missing Glyph symbol used before hexboxes were implemented). With HTML5 disabled, hexboxes are correctly rendered. Compare the following URL: data:text/html,This should be a hexbox: � (which is correct with the old parser but not with HTML5) with this: data:text/html,This should be a hexbox: 𐀀 (correct with both parsers). (However, the following URL: data:text/html,This should be a hexbox:  is also misrendered with HTML5.)
The HTML5 spec requires this behavior, at least for U+0000.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → INVALID
(In reply to comment #1)
> The HTML5 spec requires this behavior, at least for U+0000.
Where does the spec say "When the user agent comes across an unprintable ASCII character, it must not reveal any information about exactly what character it is"? Does HTML5 really specify the exact glyph to be used for unprintable characters? Doesn't make much sense to me...
Also, how do you explain the discrepancy between: data:text/html,This should be a hexbox:  and data:text/html,This should be a hexbox: 𐀀 ?
> Where does the spec say "When the user agent comes across an unprintable
> ASCII character, it must not reveal any information about exactly what
> character it is"?
Several different places, but the one relevant for � is http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#consume-a-character-reference and specifically the part that says:
  "If that number is one of the numbers in the first column of the following table, then this is a parse error. Find the row with that number in the first column, and return a character token for the Unicode character given in the second column of that row."
0x00 is in the first row of the table, and the corresponding character is U+FFFD.
Your other example () falls into the list of things that are considered a parse error but should make it through to the DOM intact, and it does for me with both the HTML5 parser and the old one (neither shows the hexbox for me). Similarly, both parsers show a hexbox for .
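The rule quoted in the comment above can be sketched roughly as follows. This is only an illustration, not Gecko's parser code: the function name, the one-row table, and the simplified set of "parse error but passed through" control characters are assumptions made for the example, and the spec's actual table has more rows than the single 0x00 row discussed in this bug.

```python
from typing import Tuple

REPLACEMENT_CHARACTER = "\uFFFD"

# Only the row discussed in this bug; the spec's table has further rows.
REMAPPED = {0x00: REPLACEMENT_CHARACTER}

# Assumption: a simplified set of C0 controls that are flagged as parse
# errors but still passed through to the DOM unchanged.
PASSED_THROUGH_CONTROLS = set(range(0x01, 0x09)) | {0x0B} | set(range(0x0E, 0x20))


def resolve_numeric_reference(code_point: int) -> Tuple[str, bool]:
    """Return (character, is_parse_error) for a numeric character reference."""
    if code_point in REMAPPED:
        # Table lookup: parse error, and the character is replaced.
        return REMAPPED[code_point], True
    if 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
        # Surrogates and out-of-range values also become U+FFFD.
        return REPLACEMENT_CHARACTER, True
    if code_point in PASSED_THROUGH_CONTROLS:
        # Parse error, but the character reaches the DOM intact.
        return chr(code_point), True
    return chr(code_point), False
```

Under these assumptions, a reference to 0x00 yields U+FFFD (hence the question-mark-in-diamond glyph), while a reference to 0x01 or to U+10000 yields the character itself, which the renderer can then draw as a hexbox.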
> Also, how do you explain the discrepancy between:
I don't see such a discrepancy here...
And just so we're clear... which exact build are you using?
It is the same in the latest trunk and in 3.6.3.  displays a hexbox here with HTML4 but a U+FFFD with HTML5. Same for  (and ). OTOH, for 𐀀 (𐀀), both parsers display a hexbox.
BTW, the relevant part of the spec seems to deal only with character references. However, the difference I reported in comment 0 also exists for literal ASCII NULs in the HTML source, which doesn't seem to be covered.
>  displays a hexbox here with HTML4 but a U+FFFD with HTML5.
It does? Does the character in the DOM end up as U+FFFD? Because it sure doesn't here, on latest trunk on Mac (and I see a hexbox). I'll spin up a Windows build.
> However, the difference I reported in comment 0 also exists for
> literal ASCII NULs in the HTML source
Sure. See http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#preprocessing-the-input-stream
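As a rough illustration of the point about literal NULs: the sketch below assumes, as the reply implies, that the linked preprocessing step replaces each literal U+0000 in the input with U+FFFD before tokenization. The function name is hypothetical and the spec section covers more than this single substitution.

```python
def preprocess_input_stream(text: str) -> str:
    """Replace literal NUL characters with U+FFFD (assumed spec behavior)."""
    # Other control characters are left untouched in this simplified sketch.
    return text.replace("\x00", "\uFFFD")
```

With this assumption, a literal NUL in the source would already be U+FFFD by the time it reaches the DOM, which would explain the same glyph appearing for both the literal character and the numeric reference.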
>  displays a hexbox here with HTML4 but a U+FFFD with HTML5.
This shows up as a hexbox for me with the HTML5 parser on Mac OS 10.5, Linux (F12), and Windows 7.