Closed Bug 848842 Opened 7 years ago Closed 6 years ago

Don't do heuristic encoding detection in the File API

Category: Core :: DOM: Core & HTML, defect
Status: RESOLVED FIXED
Target Milestone: mozilla29
People: (Reporter: hsivonen, Assigned: hsivonen)
Attachments: (1 file, 4 obsolete files)
It seems that currently, the File API implementations do character encoding guessing using the "universal" encoding detector. This is bad, because the "universal" detector isn't really universal and because heuristic detection isn't standardized, and wrong, because the spec says to default to UTF-8: http://www.w3.org/TR/FileAPI/#enctype

We should default to UTF-8 instead of guessing.
Henri: what I intend to do is to switch all the encoding stuff to rely on the encoding spec. and NOT the legacy "default to UTF-8 stuff" which is honestly a holdover from XHR2.

The encoding spec. is: http://encoding.spec.whatwg.org/

If I do this in the spec., will this satisfy your nit about the heuristic detection mechanism that is currently spec'd?  I agree that it is wrong.
(In reply to Arun K. Ranganathan from comment #1)
> Henri: what I intend to do is to switch all the encoding stuff to rely on
> the encoding spec. and NOT the legacy "default to UTF-8 stuff" which is
> honestly a holdover from XHR2.

I think "default to UTF-8" is fine in the spec and generally the Right Thing. My problem is that Gecko doesn't obey the spec and tries to guess the encoding from file content and the guessing method Gecko uses has been fundamentally broken for years without any sign of it ever becoming not fundamentally broken. (The "universal" detector is not actually universal. The range of its detection capabilities is rather arbitrary. It has a Hungarian frequency model but it doesn't actually work right. It doesn't detect Arabic, Czech or Turkish. Etc., etc.)

> The encoding spec. is: http://encoding.spec.whatwg.org/

I think it would be great to replace references to IANA labels with references to the Encoding Standard.

> If I do this in the spec., will this satisfy your nit about the heuristic
> detection mechanism that is currently spec'd?  I agree that it is wrong.

But the spec doesn't do heuristic detection right now, AFAICT. (As in: Looking at the byte frequencies and patterns [other than the BOM] and making a guess.) I think the spec is fine except for the reference to IANA instead of the Encoding Standard.
Oh yeah. The spec also needs to make the BOM take precedence over everything else to be consistent with the rest of the platform.

Quite sad to support UTF-16 in an API as new as this one, but oh well.
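The BOM precedence discussed here can be sketched as follows (illustrative C++, not the Gecko implementation; SniffBOM is a hypothetical helper):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns the encoding implied by a byte-order mark,
// or the empty string when no BOM is present. bomLength receives the number
// of bytes the BOM occupies, so the caller can skip past it.
std::string SniffBOM(const uint8_t* data, size_t len, size_t* bomLength) {
  *bomLength = 0;
  if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
    *bomLength = 3;
    return "UTF-8";
  }
  if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
    *bomLength = 2;
    return "UTF-16LE";
  }
  if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
    *bomLength = 2;
    return "UTF-16BE";
  }
  return "";  // No BOM: fall through to label / type attribute / UTF-8.
}
```

When a BOM is found, it takes precedence over any label or charset parameter, consistent with the rest of the platform.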
Actually, the spec has a couple of more problems:

If the encoding argument was not an encoding label, should we throw or fall back to UTF-8?

It's undefined how exactly the charset parameter is to be extracted from the MIME type.
This patch is a work in progress and implements what Anne said in http://lists.w3.org/Archives/Public/public-webapps/2013JanMar/0707.html
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Comment on attachment 722782 [details] [diff] [review]
Stop asking the magic 8-ball when the spec doesn't require it

>diff --git a/content/base/src/nsDOMFileReader.cpp b/content/base/src/nsDOMFileReader.cpp
>--- a/content/base/src/nsDOMFileReader.cpp
>+++ b/content/base/src/nsDOMFileReader.cpp
>+  if (!nsContentUtils::CheckForBOM(

Anne's proposal says nothing about BOM-sniffing.

>-  rv = ConvertStream(aFileData, aDataLen, charset.get(), aResult);
>+  ConvertStream(aFileData, aDataLen, charset.get(), aResult);
> 
>   return NS_OK;
> }

Either respect the rv or make ConvertStream's return type void.

>diff --git a/dom/workers/FileReaderSync.cpp b/dom/workers/FileReaderSync.cpp
>--- a/dom/workers/FileReaderSync.cpp
>+++ b/dom/workers/FileReaderSync.cpp
>@@ -158,50 +150,82 @@ FileReaderSync::ReadAsText(JSObject* aBl
>+      aRv.Throw(NS_ERROR_DOM_ENCODING_NOT_SUPPORTED_ERR);

Use aRv.ThrowTypeError(). NS_ERROR_DOM_ENCODING_NOT_SUPPORTED_ERR is kept only for pre-WebIDL-binding objects.
(I wish the WebIDL spec had [CaseInsensitive] and [Trim] extended attributes for enum types...)

>+  // Seek to 0 to undo the BOM sniffing advance. UTF-8 and UTF-16
>+  // decoders will swallow the BOM. UTF-16 decoder will re-sniff for endianness
>+  // first.

The sniffing UTF-16 decoder will never be used anymore.

>+  rv = seekable->Seek(nsISeekableStream::NS_SEEK_SET, 0);
>+  if (NS_FAILED(rv)) {
>+    aRv.Throw(rv);

The stream may not be seekable. I think you should pass corresponding BOM bytes to the decoder instead of trying to rewind.
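The suggestion can be sketched like this (hypothetical helper, not Gecko code): instead of seeking the stream back to 0, prepend the bytes consumed during sniffing back onto the data handed to the decoder.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// When the stream cannot be rewound after BOM sniffing, reassemble the input
// by concatenating the already-consumed bytes with the remaining data,
// rather than calling Seek().
std::vector<uint8_t> ReassembleAfterSniff(const std::vector<uint8_t>& sniffed,
                                          const std::vector<uint8_t>& rest) {
  std::vector<uint8_t> full(sniffed);
  full.insert(full.end(), rest.begin(), rest.end());
  // The UTF-8 and UTF-16 decoders swallow a leading BOM themselves, so the
  // reassembled buffer can be fed to the decoder unmodified.
  return full;
}
```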
(In reply to Masatoshi Kimura [:emk] from comment #6)
> Anne's proposal says nothing about BOM-sniffing.

It's hidden in the "decode" step that references the Encoding Standard.

> >-  rv = ConvertStream(aFileData, aDataLen, charset.get(), aResult);
> >+  ConvertStream(aFileData, aDataLen, charset.get(), aResult);
> > 
> >   return NS_OK;
> > }
> 
> Either respect the rv or make ConvertStream's return type void.

This came from the existing code, but okay.

> >diff --git a/dom/workers/FileReaderSync.cpp b/dom/workers/FileReaderSync.cpp
> >--- a/dom/workers/FileReaderSync.cpp
> >+++ b/dom/workers/FileReaderSync.cpp
> >@@ -158,50 +150,82 @@ FileReaderSync::ReadAsText(JSObject* aBl
> >+      aRv.Throw(NS_ERROR_DOM_ENCODING_NOT_SUPPORTED_ERR);
> 
> Use aRv.ThrowTypeError(). NS_ERROR_DOM_ENCODING_NOT_SUPPORTED_ERR is kept
> only for pre-WebIDL-binding objects.

OK.

> The sniffing UTF-16 decoder will never be used anymore.

Why? Starting when? nsScriptLoader and mozilla::css::Loader both rely on the decoder instantiated for the label "UTF-16" performing BOM sniffing for endianness.

> >+  rv = seekable->Seek(nsISeekableStream::NS_SEEK_SET, 0);
> >+  if (NS_FAILED(rv)) {
> >+    aRv.Throw(rv);
> 
> The stream may not be seekable. I think you should pass corresponding BOM
> bytes to the decoder instead of trying to rewind.

Seeking seemed iffy to me, but the existing code used seeking.
(In reply to Henri Sivonen (:hsivonen) from comment #7)
> > The sniffing UTF-16 decoder will never be used anymore.
> 
> Why? Starting when? nsScriptLoader and mozilla::css::Loader both rely on the
> decoder instantiated for the label "UTF-16" performing BOM sniffing for
> endianness.

EncodingUtils::FindEncodingForLabel will map "UTF-16" to "utf-16le" as the Encoding Standard says. The script loader and the CSS loader do not use EncodingUtils::FindEncodingForLabel yet.
(In reply to Masatoshi Kimura [:emk] from comment #8)
> EncodingUtils::FindEncodingForLabel will map "UTF-16" to "utf-16le" as the
> Encoding Standard says. The script loader and the CSS loader do not use
> EncodingUtils::FindEncodingForLabel yet.

They do, but not for the output of CheckForBOM.

This patch does not pass the output of CheckForBOM to FindEncodingForLabel, either.
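The label lookup under discussion can be approximated as follows (a stand-in for EncodingUtils::FindEncodingForLabel covering only a tiny subset of the Encoding Standard's label table):

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Per the Encoding Standard, label matching is ASCII-case-insensitive after
// trimming whitespace, and the label "UTF-16" maps to the utf-16le encoding.
std::string FindEncodingForLabel(std::string label) {
  // Trim ASCII whitespace.
  const char* ws = " \t\n\f\r";
  size_t b = label.find_first_not_of(ws);
  if (b == std::string::npos) return "";
  size_t e = label.find_last_not_of(ws);
  label = label.substr(b, e - b + 1);
  std::transform(label.begin(), label.end(), label.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  if (label == "utf-8" || label == "utf8" || label == "unicode-1-1-utf-8")
    return "utf-8";
  if (label == "utf-16" || label == "utf-16le") return "utf-16le";
  if (label == "utf-16be") return "utf-16be";
  return "";  // Unknown label: the caller falls back per the File API steps.
}
```

Note that "UTF-16" does not name a distinct sniffing decoder here; it resolves to little-endian, which is why the endianness-sniffing behavior matters only for callers that bypass this lookup.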
Blocks: 943272
Attached patch Fix bitrot, address comments (obsolete) — Splinter Review
(In reply to Masatoshi Kimura [:emk] from comment #6)
> >-  rv = ConvertStream(aFileData, aDataLen, charset.get(), aResult);
> >+  ConvertStream(aFileData, aDataLen, charset.get(), aResult);
> > 
> >   return NS_OK;
> > }
> 
> Either respect the rv or make ConvertStream's return type void.

Propagated rv.

> >diff --git a/dom/workers/FileReaderSync.cpp b/dom/workers/FileReaderSync.cpp
> >--- a/dom/workers/FileReaderSync.cpp
> >+++ b/dom/workers/FileReaderSync.cpp
> >@@ -158,50 +150,82 @@ FileReaderSync::ReadAsText(JSObject* aBl
> >+      aRv.Throw(NS_ERROR_DOM_ENCODING_NOT_SUPPORTED_ERR);
> 
> Use aRv.ThrowTypeError().

Done. (Didn't change the old code, though.)

> >+  // Seek to 0 to undo the BOM sniffing advance. UTF-8 and UTF-16
> >+  // decoders will swallow the BOM. UTF-16 decoder will re-sniff for endianness
> >+  // first.
> 
> The sniffing UTF-16 decoder will never be used anymore.

Adjusted comment for the current nsContentUtils::CheckForBOM() reality.

> >+  rv = seekable->Seek(nsISeekableStream::NS_SEEK_SET, 0);
> >+  if (NS_FAILED(rv)) {
> >+    aRv.Throw(rv);
> 
> The stream may not be seekable. I think you should pass corresponding BOM
> bytes to the decoder instead of trying to rewind.

The old code used seeking, too, and the subsequent code doesn't make it easy to pass the BOM bytes to the decoder without rewinding.
Attachment #722782 - Attachment is obsolete: true
Attachment #8342994 - Flags: review?(bzbarsky)
Comment on attachment 8342994 [details] [diff] [review]
Patch that actually compiles

So this doesn't seem to implement what the spec says for FileReader.  What the spec says is:

1)  If label argument is present, call "getting an encoding" and if that doesn't
    fail use that.
2)  If no label, or if "getting an encoding" failed, use the type attr's
    parameter (this probably just needs a followup).
3)  If all that failed, use "utf-8".

whereas the patch seems to bail out if "getting an encoding" fails.

Also, there is no mention of BOM in the spec that I can see; is that a spec bug you've raised?

Furthermore, it's not clear to me whether the behavior for "replacement" here would match the spec; it seems like it would throw immediately, but the replacement encoding decoder would ... not sure what.

>+      nsAutoCString type;
>+      CopyUTF16toUTF8(type16, type);

NS_ConvertUTF16toUTF8 type(type16);

>+          specifiedCharset = Substring(specifiedCharset, 1,
>+          specifiedCharset.Length() - 2);

Weird indent.

I stopped at the end of the FileReader but, since I suspect the answers to the above might affect the rest of the patch....
Attachment #8342994 - Flags: review?(bzbarsky) → review-
 
> Also, there is no mention of BOM in the spec that I can see; is that a spec
> bug you've raised?


Boris: note the File API defers to the encoding spec at WHATWG, which in turn does the BOM heavy lifting for us.

In particular, Step 7 (http://dev.w3.org/2006/webapi/FileAPI/#enctype) refers to "Decode" which is: http://encoding.spec.whatwg.org/#decode-and-encode

Please correct me if you continue to believe that insufficient attention is paid to the BOM.
(In reply to Henri Sivonen (:hsivonen) from comment #4)
> Actually, the spec has a couple of more problems:
> 
> If the encoding argument was not an encoding label, should we throw or fall
> back to UTF-8?

I hope this clarifies: http://dev.w3.org/2006/webapi/FileAPI/#enctype

We don't throw; we set it to null, try the charset parameter, and if that returns failure, set it to UTF-8, then decode (which handles the BOM) per the Encoding spec.

> 
> It's undefined how exactly the charset parameter is to be extracted from the
> MIME type.

This should be clarified in the spec.  We defer to MIMESNIFF (http://dev.w3.org/2006/webapi/FileAPI/#model refers to http://mimesniff.spec.whatwg.org/#parsing-a-mime-type).  I think we can toggle on ";" with your approval.
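A rough sketch of such charset-parameter extraction (illustrative only; a real implementation would follow mimesniff's full "parse a MIME type" algorithm, including its quoting and multi-parameter rules):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Extract the charset parameter from a MIME type string, e.g.
// "text/plain;charset=utf-8" -> "utf-8". Returns the empty string when no
// charset parameter is present. The value is returned lowercased, since
// charset values are matched case-insensitively.
std::string ExtractCharset(std::string mimeType) {
  for (char& c : mimeType) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  size_t pos = mimeType.find("charset=");
  // The parameter must follow a ';' separator.
  if (pos == std::string::npos || mimeType.rfind(';', pos) == std::string::npos)
    return "";
  std::string value = mimeType.substr(pos + 8);
  size_t end = value.find(';');
  if (end != std::string::npos) value = value.substr(0, end);
  // Strip surrounding double quotes; single quotes are the MIME violation
  // discussed in this bug and are deliberately not accepted.
  if (value.size() >= 2 && value.front() == '"' && value.back() == '"')
    value = value.substr(1, value.size() - 2);
  return value;
}
```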
> note the File API defers to the encoding spec at WHATWG, which in turn does the BOM heavy
> lifting for us.

Ah, because you're using "encoding" as a fallback encoding only, in cases when the BOM is not present?  Sounds good.

That's not what the attached patch implements, however.
(In reply to Boris Zbarsky [:bz] from comment #13)
> So this doesn't seem to implement what the spec says for FileReader.  What
> the spec says is:
> 
> 1)  If label argument is present, call "getting an encoding" and if that
> doesn't
>     fail use that.

Oh right. I should have re-read the spec. Silently ignoring an argument like that is a bit odd API design, but OK.

> Also, there is no mention of BOM in the spec that I can see; is that a spec
> bug you've raised?

File API references "decode" from the Encoding Standard and BOM-sniffing is baked into "decode". That is, the BOM overrides everything else. So the sequence should be:

 1) Encoding from BOM if there is a BOM for UTF-8, UTF-16LE or UTF-16BE
 2) Encoding from the label argument if getting an encoding doesn't fail
 3) Encoding from the blob's type attribute if getting an encoding doesn't fail
 4) UTF-8
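The four-step order above can be sketched as follows (hypothetical helper; the label-validity check is a stand-in for the Encoding Standard's "get an encoding", here reduced to a tiny hard-coded set):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Determine the decoding encoding for readAsText per the order above:
// 1. BOM  2. label argument  3. blob type's charset  4. UTF-8 default.
std::string PickEncoding(const std::vector<uint8_t>& bytes,
                         const std::string& labelArg,
                         const std::string& typeCharset) {
  // 1. A BOM overrides everything else.
  if (bytes.size() >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
    return "utf-8";
  if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) return "utf-16le";
  if (bytes.size() >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) return "utf-16be";
  // 2./3. First label that "get an encoding" accepts wins; a failed lookup
  // silently falls through instead of throwing.
  for (const std::string& label : {labelArg, typeCharset}) {
    if (label == "utf-8" || label == "utf-16le" || label == "utf-16be" ||
        label == "windows-1252")
      return label;
  }
  // 4. Default.
  return "utf-8";
}
```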

> Furthermore, it's not clear to me whether the behavior for "replacement"
> here would match the spec;

I wrote the patch thinking (dunno why; I should have re-read the spec) that File API referenced TextDecoder behavior, but since it references "get an encoding" plus "decode", the behavior in the patch doesn't match the spec.

I'll adjust the patch. Sorry about asking for review without re-reading the spec.
Attached patch Patch that agrees with the spec (obsolete) — Splinter Review
https://tbpl.mozilla.org/?tree=Try&rev=4cf8f6e4b3a1

Code in and around nsContentUtils::ConvertStringFromCharset was confused about whether it's dealing with labels or encodings and about whether UTF-8 errors should be handled or not. I changed the name of the method to clarify that it's dealing with encodings and made it always handle UTF-8 errors, since callers seem to expect that. (Not going to fix bug 919924 today. It would prevent this kind of confusion, but the fix would be massive. Probably never worth the effort.)

I checked blame on how the MIME violation of accepting single quotes for the charset parameter got introduced to XHR and removed the cargo cult copy of that violation from the patch.

The incidental change to ArchiveReader is according to what baku said on IRC; there's no spec.
Attachment #8342994 - Attachment is obsolete: true
(In reply to Henri Sivonen (:hsivonen) from comment #18)
> Code in an around nsContentUtils::ConvertStringFromCharset was confused
> about whether it's dealing with labels or encodings

For clarity: After the introduction of the replacement encoding, the strings we use to represent *encodings* are not guaranteed to be valid *labels*, because "replacement" as a label doesn't map to the "replacement" as an encoding.
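For illustration (stand-in table, not EncodingUtils): per the Encoding Standard, legacy labels such as "iso-2022-kr" map to the replacement encoding, while the encoding name "replacement" is itself not a valid label, so encoding names do not round-trip through label lookup.

```cpp
#include <cassert>
#include <string>

// Tiny stand-in for the Encoding Standard's label table, showing the
// label/encoding asymmetry around the replacement encoding.
std::string LookupLabel(const std::string& label) {
  if (label == "iso-2022-kr" || label == "csiso2022kr") return "replacement";
  if (label == "utf-8") return "utf-8";
  return "";  // Not a known label -- including the string "replacement".
}
```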
(In reply to Henri Sivonen (:hsivonen) from comment #17)
> (In reply to Boris Zbarsky [:bz] from comment #13)
> > So this doesn't seem to implement what the spec says for FileReader.  What
> > the spec says is:
> > 
> > 1)  If label argument is present, call "getting an encoding" and if that
> > doesn't
> >     fail use that.
> 
> Oh right. I should have re-read the spec. Silently ignoring an argument like
> that is a bit odd API design, but OK.


We traded off throwing for this, believing that throwing was more pain than gain for web developers.
Can we still change that? The label list is essentially fixed. If you supply one that is incorrect you basically have not read the specification.
(In reply to Anne (:annevk) from comment #21)
> Can we still change that? The label list is essentially fixed. If you supply
> one that is incorrect you basically have not read the specification.

This is an old discussion, mostly back and forth on IRC, and you should open a fresh "spec bug" if you feel strongly, but:

1. It seems that what the API does now is to fall back on the most common encoding on the Web.  For instance, this is the top link that emerges when you try a naive survey of encodings: http://w3techs.com/technologies/overview/character_encoding/all (a separate debate on these guys' methodology can be had, but it does confirm my suspicion/bias in favor of UTF-8).

2. We do BOM checks.

Between 1. and 2., I think we get it mostly right in the 80-20 sense.  The worst that can happen is some data loss with replacement chars, right?
Attachment #8345900 - Flags: review?(bzbarsky)
I'm sorry this is taking so long; I'll get to this on Monday.
Comment on attachment 8345900 [details] [diff] [review]
Hopefully with fewer test failures

>+++ b/content/base/public/nsContentUtils.h
>+   * @param aEncoding the Gecko-canocial name of the encoding or the empty

"canonical"

>+++ b/content/base/src/nsDOMFileReader.cpp
>+    if (!EncodingUtils::FindEncodingForLabel(aCharset,
>+        encoding)) {

The indent is weird: please line up the arguments.

>+  nsDependentCString data(aFileData, aDataLen);

I think if this were an nsDependentCSubstring you wouldn't need the null-termination bit.

>+++ b/content/base/src/nsReferencedElement.cpp
>+    // XXX Eww. If fallible malloc failed, using a conversion method that
>+    // assumes UTF-8 and doesn't handle UTF-8 errors.

If fallible malloc failed, don't we want to just bail out?

Could you please file a followup bug on fixing whatever the weirdness here is, with a clear description of the weirdness?

>+++ b/dom/encoding/TextDecoder.h
>+   * If no encoding is provided then mEncoding is default initialised to

Does that mean if aLabel is empty, or something else?  Would be good to clarify that, since this API looks like aLabel always has to be provided.

>+++ b/dom/workers/FileReaderSync.cpp
>+        !EncodingUtils::FindEncodingForLabel(aEncoding.Value(),
>+        encoding)) {

Again, weird indent.

>+      // API argument failed. Try the type property of the blob.

I wish we had some way to avoid copy/pasting this code.  :(

r=me with the nits fixed.
Attachment #8345900 - Flags: review?(bzbarsky) → review+
(In reply to Boris Zbarsky [:bz] from comment #25)
> "canonical"

Fixed.

> The indent is weird: please line up the arguments.

Fixed.

> >+  nsDependentCString data(aFileData, aDataLen);
> 
> I think if this were an nsDependentCSubstring you wouldn't need the
> null-termination bit.

Undid the null-termination and used nsDependentCSubstring. (I guess I still have a lot to learn about our string classes.)

> >+++ b/content/base/src/nsReferencedElement.cpp
> >+    // XXX Eww. If fallible malloc failed, using a conversion method that
> >+    // assumes UTF-8 and doesn't handle UTF-8 errors.
> 
> If fallible malloc failed, don't we want to just bail out?
> 
> Could you please file a followup bug on fixing whatever the weirdness here
> is, with a clear description of the weirdness?

Filed bug 951082.

> >+++ b/dom/encoding/TextDecoder.h
> >+   * If no encoding is provided then mEncoding is default initialised to
> 
> Does that mean if aLable is empty, or something else?  Would be good to
> clarify that, since this API looks like aLabel always has to be provided.

The documentation indeed didn't match what the API did before or after this patch. I adjusted the documentation. emk, what's the story behind the documentation talking about TextDecoder defaulting to UTF-8 but in fact throwing for bogus labels, including the empty string?

> Again, weird indent.

Fixed.

> r=me with the nits fixed.

Thanks. Landed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5b57ad1e78fe
Flags: needinfo?(VYV03354)
(In reply to Henri Sivonen (:hsivonen) from comment #26)
> Undid the null-termination and used nsDependentCSubstring. (I guess I still
> have a lot to learn about our string classes.)

Filed bug 951110.
(In reply to Henri Sivonen (:hsivonen) from comment #26)
> The documentation indeed didn't match what the API did before or after this
> patch. I adjusted the documentation. emk, what's the story behind the
> documentation talking about TextDecoder defaulting to UTF-8 but in fact
> throwing for bogus labels, including the empty string?

That comment had the DOM API in mind. In an earlier iteration of the patch for bug 764234, the default-to-UTF-8 code was in the C++ implementation. Later the code was removed in favor of the WebIDL optional argument, but I forgot to fix the comment.
Flags: needinfo?(VYV03354)
https://hg.mozilla.org/mozilla-central/rev/5b57ad1e78fe
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Blocks: 951555
Component: DOM → DOM: Core & HTML