Closed Bug 562590 Opened 10 years ago Closed 2 years ago

Bad byte right before EOF is not turned into the REPLACEMENT CHARACTER

Categories

(Core :: DOM: HTML Parser, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox56 --- fixed

People

(Reporter: hsivonen, Assigned: hsivonen)

References

(Blocks 1 open bug, )

Details

(Whiteboard: [good first bug])

Attachments

(1 file)

Steps to reproduce (both the old parser and the HTML5 parser have this bug):
 1) Load data:text/html;charset=utf-8,x%c2
 2) Compare with data:text/html;charset=utf-8,x%c2x

Actual results:
A bad byte is turned into the REPLACEMENT CHARACTER only if there's another byte after it.

Expected results:
Expected a bad byte to turn into the REPLACEMENT CHARACTER even when followed immediately by the EOF.
Blocks: encoding
Henri, Anne - any updated thoughts here?

Over on Chromium we're trying to decide how to tackle this. bug: http://crbug.com277035 patch: http://crrev.com/23532016

Chromium flushes the decoder at the end of fetches (page load, XHR) *but* the UTF-16 decoder is buggy and the flush ends up as a no-op. So... for UTF-16, Chromium and Gecko happen to be bug-compatible here. :( 

If there's no intent to fix this issue, Chromium could disable flushing at the end of fetches to become compatible with Gecko; that would merit an addition to the Encoding and/or Fetch specs. I'd prefer that we correctly flush in all cases, though, but that's a bigger change for Gecko.

(There's no impact on the Encoding script API here, so far as I can tell; the behavior there can be independent of what's done for fetching.)
Sorry, typo in the bug URL: http://crbug.com/277035
Joshua, I think we want to do the right thing here. Emitting U+FFFD if there's an incomplete sequence when you hit EOF. I believe this is what the combination of Encoding, Fetch, and HTML already defines.

I take it that by flushing you mean that Chromium already does the correct thing?
(In reply to Anne (:annevk) from comment #3)
> I take it that by flushing you mean that Chromium already does the correct
> thing?

Yes (mostly). By flushing I meant ensuring the EOF propagates from "fetch" through "encoding", emitting U+FFFD. From discussion (email?) I was led to believe that Gecko is missing the plumbing there. Chromium does propagate the EOF, it just has a bug in the UTF-16 decoder that causes the it to be ignored, leading to "bug compatible' behavior with Gecko.
I recommend fixing the bug in Chromium and shaming Gecko (or better, provide us with a patch :-)).
The HTML parser should remember if the last call to nsIUnicodeDecoder::Convert returned NS_OK_UDEC_MOREINPUT or NS_OK. At end of input from network, if the remembered result is NS_OK_UDEC_MOREINPUT, append one REPLACEMENT CHARACTER to output.

Bug 912470 will update the API doc to actually say this.
Whiteboard: [good first bug]
I intend to actually fix this now that bug 1261841 put in place nicer infra.
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Attachment #8880350 - Flags: review?(VYV03354)
Comment on attachment 8880350 [details]
Bug 562590 - Make incomplete byte sequences near HTML EOF emit a REPLACEMENT CHARACTER.

https://reviewboard.mozilla.org/r/151716/#review159298
Attachment #8880350 - Flags: review?(VYV03354) → review+
We're sorry, Autoland could not rebase your commits for you automatically. Please manually rebase your commits and try again.

hg error in cmd: hg rebase -s 678da39b4708 -d 1bc50cd63349: rebasing 405139:678da39b4708 "Bug 562590 - Make incomplete byte sequences near HTML EOF emit a REPLACEMENT CHARACTER. r=emk" (tip)
merging testing/web-platform/meta/MANIFEST.json
warning: conflicts while merging testing/web-platform/meta/MANIFEST.json! (edit, then use 'hg resolve --mark')
unresolved conflicts (see hg resolve, then hg rebase --continue)
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f0a1809a7ee1
Make incomplete byte sequences near HTML EOF emit a REPLACEMENT CHARACTER. r=emk
https://hg.mozilla.org/mozilla-central/rev/f0a1809a7ee1
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
You need to log in before you can comment on or make changes to this bug.