Closed Bug 562590 Opened 11 years ago Closed 4 years ago
Bad byte right before EOF is not turned into the REPLACEMENT CHARACTER
59 bytes, text/x-review-board-request
Steps to reproduce (both the old parser and the HTML5 parser have this bug): 1) Load data:text/html;charset=utf-8,x%c2 2) Compare with data:text/html;charset=utf-8,x%c2x Actual results: A bad byte is turned into the REPLACEMENT CHARACTER only if there's another byte after it. Expected results: Expected a bad byte to turn into the REPLACEMENT CHARACTER even when followed immediately by the EOF.
Henri, Anne - any updated thoughts here? Over on Chromium we're trying to decide how to tackle this. bug: http://crbug.com277035 patch: http://crrev.com/23532016 Chromium flushes the decoder at the end of fetches (page load, XHR) *but* the UTF-16 decoder is buggy and the flush ends up as a no-op. So... for UTF-16, Chromium and Gecko happen to be bug-compatible here. :( If there's no intent to fix this issue, Chromium could disable flushing at the end of fetches to become compatible with Gecko; that would merit an addition to the Encoding and/or Fetch specs. I'd prefer that we correctly flush in all cases, though, but that's a bigger change for Gecko. (There's no impact on the Encoding script API here, so far as I can tell; the behavior there can be independent of what's done for fetching.)
Sorry, typo in the bug URL: http://crbug.com/277035
Joshua, I think we want to do the right thing here. Emitting U+FFFD if there's an incomplete sequence when you hit EOF. I believe this is what the combination of Encoding, Fetch, and HTML already defines. I take it that by flushing you mean that Chromium already does the correct thing?
(In reply to Anne (:annevk) from comment #3) > I take it that by flushing you mean that Chromium already does the correct > thing? Yes (mostly). By flushing I meant ensuring the EOF propagates from "fetch" through "encoding", emitting U+FFFD. From discussion (email?) I was led to believe that Gecko is missing the plumbing there. Chromium does propagate the EOF, it just has a bug in the UTF-16 decoder that causes the it to be ignored, leading to "bug compatible' behavior with Gecko.
I recommend fixing the bug in Chromium and shaming Gecko (or better, provide us with a patch :-)).
The HTML parser should remember if the last call to nsIUnicodeDecoder::Convert returned NS_OK_UDEC_MOREINPUT or NS_OK. At end of input from network, if the remembered result is NS_OK_UDEC_MOREINPUT, append one REPLACEMENT CHARACTER to output. Bug 912470 will update the API doc to actually say this.
I intend to actually fix this now that bug 1261841 put in place nicer infra.
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Attachment #8880350 - Flags: review?(VYV03354)
Comment on attachment 8880350 [details] Bug 562590 - Make incomplete byte sequences near HTML EOF emit a REPLACEMENT CHARACTER. https://reviewboard.mozilla.org/r/151716/#review159298
Attachment #8880350 - Flags: review?(VYV03354) → review+
We're sorry, Autoland could not rebase your commits for you automatically. Please manually rebase your commits and try again. hg error in cmd: hg rebase -s 678da39b4708 -d 1bc50cd63349: rebasing 405139:678da39b4708 "Bug 562590 - Make incomplete byte sequences near HTML EOF emit a REPLACEMENT CHARACTER. r=emk" (tip) merging testing/web-platform/meta/MANIFEST.json warning: conflicts while merging testing/web-platform/meta/MANIFEST.json! (edit, then use 'hg resolve --mark') unresolved conflicts (see hg resolve, then hg rebase --continue)
Pushed by email@example.com: https://hg.mozilla.org/integration/autoland/rev/f0a1809a7ee1 Make incomplete byte sequences near HTML EOF emit a REPLACEMENT CHARACTER. r=emk
You need to log in before you can comment on or make changes to this bug.