Closed Bug 848842 Opened 7 years ago Closed 6 years ago

Don't do heuristic encoding detection in the File API

Category: Core :: DOM: Core & HTML, defect
Status: RESOLVED FIXED
Target Milestone: mozilla29
People: (Reporter: hsivonen, Assigned: hsivonen)
Attachments: (1 file, 4 obsolete files)
It seems that currently, the File API implementations do character encoding guessing using the "universal" encoding detector. This is bad, because the "universal" detector isn't really universal and because heuristic detection isn't standardized, and wrong, because the spec says to default to UTF-8: http://www.w3.org/TR/FileAPI/#enctype

We should default to UTF-8 instead of guessing.
Henri: what I intend to do is to switch all the encoding stuff to rely on the encoding spec. and NOT the legacy "default to UTF-8 stuff" which is honestly a holdover from XHR2.

The encoding spec. is: http://encoding.spec.whatwg.org/

If I do this in the spec., will this satisfy your nit about the heuristic detection mechanism that is currently spec'd?  I agree that it is wrong.
(In reply to Arun K. Ranganathan from comment #1)
> Henri: what I intend to do is to switch all the encoding stuff to rely on
> the encoding spec. and NOT the legacy "default to UTF-8 stuff" which is
> honestly a holdover from XHR2.

I think "default to UTF-8" is fine in the spec and generally the Right Thing. My problem is that Gecko doesn't obey the spec and tries to guess the encoding from file content and the guessing method Gecko uses has been fundamentally broken for years without any sign of it ever becoming not fundamentally broken. (The "universal" detector is not actually universal. The range of its detection capabilities is rather arbitrary. It has a Hungarian frequency model but it doesn't actually work right. It doesn't detect Arabic, Czech or Turkish. Etc., etc.)

> The encoding spec. is: http://encoding.spec.whatwg.org/

I think it would be great to replace references to IANA labels with references to the Encoding Standard.

> If I do this in the spec., will this satisfy your nit about the heuristic
> detection mechanism that is currently spec'd?  I agree that it is wrong.

But the spec doesn't do heuristic detection right now, AFAICT. (As in: Looking at the byte frequencies and patterns [other than the BOM] and making a guess.) I think the spec is fine except for the reference to IANA instead of the Encoding Standard.
Oh yeah. The spec also needs to make the BOM take precedence over everything else to be consistent with the rest of the platform.

Quite sad to support UTF-16 in an API as new as this one, but oh well.
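The BOM precedence discussed here can be sketched as follows (illustrative C++, not the Gecko implementation; SniffBOM is a hypothetical helper):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns the encoding implied by a byte-order mark,
// or the empty string when no BOM is present. bomLength receives the number
// of bytes the BOM occupies, so the caller can skip past it.
std::string SniffBOM(const uint8_t* data, size_t len, size_t* bomLength) {
  *bomLength = 0;
  if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
    *bomLength = 3;
    return "UTF-8";
  }
  if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
    *bomLength = 2;
    return "UTF-16LE";
  }
  if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
    *bomLength = 2;
    return "UTF-16BE";
  }
  return "";  // No BOM: fall through to label / type attribute / UTF-8.
}
```

When a BOM is found, it takes precedence over any label or charset parameter, consistent with the rest of the platform.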
Actually, the spec has a couple of more problems:

If the encoding argument was not an encoding label, should we throw or fall back to UTF-8?

It's undefined how exactly the charset parameter is to be extracted from the MIME type.
This patch is a work in progress and implements what Anne said in http://lists.w3.org/Archives/Public/public-webapps/2013JanMar/0707.html
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Comment on attachment 722782 [details] [diff] [review]
Stop asking the magic 8-ball when the spec doesn't require it

>diff --git a/content/base/src/nsDOMFileReader.cpp b/content/base/src/nsDOMFileReader.cpp
>--- a/content/base/src/nsDOMFileReader.cpp
>+++ b/content/base/src/nsDOMFileReader.cpp
>+  if (!nsContentUtils::CheckForBOM(

Anne's proposal says nothing about BOM-sniffing.

>-  rv = ConvertStream(aFileData, aDataLen, charset.get(), aResult);
>+  ConvertStream(aFileData, aDataLen, charset.get(), aResult);
> 
>   return NS_OK;
> }

Either respect the rv or make ConvertStream's return type void.

>diff --git a/dom/workers/FileReaderSync.cpp b/dom/workers/FileReaderSync.cpp
>--- a/dom/workers/FileReaderSync.cpp
>+++ b/dom/workers/FileReaderSync.cpp
>@@ -158,50 +150,82 @@ FileReaderSync::ReadAsText(JSObject* aBl
>+      aRv.Throw(NS_ERROR_DOM_ENCODING_NOT_SUPPORTED_ERR);

Use aRv.ThrowTypeError(). NS_ERROR_DOM_ENCODING_NOT_SUPPORTED_ERR is kept only for pre-WebIDL-binding objects.
(I wish the WebIDL spec had [CaseInsensitive] and [Trim] extended attributes for enum types...)

>+  // Seek to 0 to undo the BOM sniffing advance. UTF-8 and UTF-16
>+  // decoders will swallow the BOM. UTF-16 decoder will re-sniff for endianness
>+  // first.

The sniffing UTF-16 decoder will never be used anymore.

>+  rv = seekable->Seek(nsISeekableStream::NS_SEEK_SET, 0);
>+  if (NS_FAILED(rv)) {
>+    aRv.Throw(rv);

The stream may not be seekable. I think you should pass corresponding BOM bytes to the decoder instead of trying to rewind.
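The suggestion can be sketched like this (hypothetical helper, not Gecko code): instead of seeking the stream back to 0, prepend the bytes consumed during sniffing back onto the data handed to the decoder.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// When the stream cannot be rewound after BOM sniffing, reassemble the input
// by concatenating the already-consumed bytes with the remaining data,
// rather than calling Seek().
std::vector<uint8_t> ReassembleAfterSniff(const std::vector<uint8_t>& sniffed,
                                          const std::vector<uint8_t>& rest) {
  std::vector<uint8_t> full(sniffed);
  full.insert(full.end(), rest.begin(), rest.end());
  // The UTF-8 and UTF-16 decoders swallow a leading BOM themselves, so the
  // reassembled buffer can be fed to the decoder unmodified.
  return full;
}
```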
(In reply to Masatoshi Kimura [:emk] from comment #6)
> Anne's proposal says nothing about BOM-sniffing.

It's hidden in the "decode" step that references the Encoding Standard.

> >-  rv = ConvertStream(aFileData, aDataLen, charset.get(), aResult);
> >+  ConvertStream(aFileData, aDataLen, charset.get(), aResult);
> > 
> >   return NS_OK;
> > }
> 
> Either respect the rv or make ConvertStream's return type void.

This came from the existing code, but okay.

> >diff --git a/dom/workers/FileReaderSync.cpp b/dom/workers/FileReaderSync.cpp
> >--- a/dom/workers/FileReaderSync.cpp
> >+++ b/dom/workers/FileReaderSync.cpp
> >@@ -158,50 +150,82 @@ FileReaderSync::ReadAsText(JSObject* aBl
> >+      aRv.Throw(NS_ERROR_DOM_ENCODING_NOT_SUPPORTED_ERR);
> 
> Use aRv.ThrowTypeError(). NS_ERROR_DOM_ENCODING_NOT_SUPPORTED_ERR is kept
> only for pre-WebIDL-binding objects.

OK.

> The sniffing UTF-16 decoder will never be used anymore.

Why? Starting when? nsScriptLoader and mozilla::css::Loader both rely on the decoder instantiated for the label "UTF-16" performing BOM sniffing for endianness.

> >+  rv = seekable->Seek(nsISeekableStream::NS_SEEK_SET, 0);
> >+  if (NS_FAILED(rv)) {
> >+    aRv.Throw(rv);
> 
> The stream may not be seekable. I think you should pass corresponding BOM
> bytes to the decoder instead of trying to rewind.

Seeking seemed iffy to me, but the existing code used seeking.
(In reply to Henri Sivonen (:hsivonen) from comment #7)
> > The sniffing UTF-16 decoder will never be used anymore.
> 
> Why? Starting when? nsScriptLoader and mozilla::css::Loader both rely on the
> decoder instantiated for the label "UTF-16" performing BOM sniffing for
> endianness.

EncodingUtils::FindEncodingForLabel will map "UTF-16" to "utf-16le" as the Encoding Standard says. The script loader and the CSS loader do not use EncodingUtils::FindEncodingForLabel yet.
(In reply to Masatoshi Kimura [:emk] from comment #8)
> EncodingUtils::FindEncodingForLabel will map "UTF-16" to "utf-16le" as the
> Encoding Standard says. The script loader and the CSS loader do not use
> EncodingUtils::FindEncodingForLabel yet.

They do, but not for the output of CheckForBOM.

This patch does not pass the output of CheckForBOM to FindEncodingForLabel, either.
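The label lookup under discussion can be approximated as follows (a stand-in for EncodingUtils::FindEncodingForLabel covering only a tiny subset of the Encoding Standard's label table):

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Per the Encoding Standard, label matching is ASCII-case-insensitive after
// trimming whitespace, and the label "UTF-16" maps to the utf-16le encoding.
std::string FindEncodingForLabel(std::string label) {
  // Trim ASCII whitespace.
  const char* ws = " \t\n\f\r";
  size_t b = label.find_first_not_of(ws);
  if (b == std::string::npos) return "";
  size_t e = label.find_last_not_of(ws);
  label = label.substr(b, e - b + 1);
  std::transform(label.begin(), label.end(), label.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  if (label == "utf-8" || label == "utf8" || label == "unicode-1-1-utf-8")
    return "utf-8";
  if (label == "utf-16" || label == "utf-16le") return "utf-16le";
  if (label == "utf-16be") return "utf-16be";
  return "";  // Unknown label: the caller falls back per the File API steps.
}
```

Note that "UTF-16" does not name a distinct sniffing decoder here; it resolves to little-endian, which is why the endianness-sniffing behavior matters only for callers that bypass this lookup.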
Blocks: 943272
Attached patch Fix bitrot, address comments (obsolete) — Splinter Review
(In reply to Masatoshi Kimura [:emk] from comment #6)
> >-  rv = ConvertStream(aFileData, aDataLen, charset.get(), aResult);
> >+  ConvertStream(aFileData, aDataLen, charset.get(), aResult);
> > 
> >   return NS_OK;
> > }
> 
> Either respect the rv or make ConvertStream's return type void.

Propagated rv.

> >diff --git a/dom/workers/FileReaderSync.cpp b/dom/workers/FileReaderSync.cpp
> >--- a/dom/workers/FileReaderSync.cpp
> >+++ b/dom/workers/FileReaderSync.cpp
> >@@ -158,50 +150,82 @@ FileReaderSync::ReadAsText(JSObject* aBl
> >+      aRv.Throw(NS_ERROR_DOM_ENCODING_NOT_SUPPORTED_ERR);
> 
> Use aRv.ThrowTypeError().

Done. (Didn't change the old code, though.)

> >+  // Seek to 0 to undo the BOM sniffing advance. UTF-8 and UTF-16
> >+  // decoders will swallow the BOM. UTF-16 decoder will re-sniff for endianness
> >+  // first.
> 
> The sniffing UTF-16 decoder will never be used anymore.

Adjusted comment for the current nsContentUtils::CheckForBOM() reality.

> >+  rv = seekable->Seek(nsISeekableStream::NS_SEEK_SET, 0);
> >+  if (NS_FAILED(rv)) {
> >+    aRv.Throw(rv);
> 
> The stream may not be seekable. I think you should pass corresponding BOM
> bytes to the decoder instead of trying to rewind.

The old code used seeking, too, and the subsequent code doesn't make it easy to pass the BOM bytes to the decoder without rewinding.
Attachment #722782 - Attachment is obsolete: true
Attachment #8342994 - Flags: review?(bzbarsky)
Comment on attachment 8342994 [details] [diff] [review]
Patch that actually compiles

So this doesn't seem to implement what the spec says for FileReader.  What the spec says is:

1)  If label argument is present, call "getting an encoding" and if that doesn't
    fail use that.
2)  If no label, or if "getting an encoding" failed, use the type attr's
    parameter (this probably just needs a followup).
3)  If all that failed, use "utf-8".

whereas the patch seems to bail out if "getting an encoding" fails.

Also, there is no mention of BOM in the spec that I can see; is that a spec bug you've raised?

Furthermore, it's not clear to me whether the behavior for "replacement" here would match the spec; it seems like it would throw immediately, but the replacement encoding decoder would ... not sure what.

>+      nsAutoCString type;
>+      CopyUTF16toUTF8(type16, type);

NS_ConvertUTF16toUTF8 type(type16);

>+          specifiedCharset = Substring(specifiedCharset, 1,
>+          specifiedCharset.Length() - 2);

Weird indent.

I stopped at the end of the FileReader but, since I suspect the answers to the above might affect the rest of the patch....
Attachment #8342994 - Flags: review?(bzbarsky) → review-
 
> Also, there is no mention of BOM in the spec that I can see; is that a spec
> bug you've raised?


Boris: note the File API defers to the encoding spec at WHATWG, which in turn does the BOM heavy lifting for us.

In particular, Step 7 (http://dev.w3.org/2006/webapi/FileAPI/#enctype) refers to "Decode" which is: http://encoding.spec.whatwg.org/#decode-and-encode

Please correct me if you continue to believe that insufficient attention is paid to the BOM.
(In reply to Henri Sivonen (:hsivonen) from comment #4)
> Actually, the spec has a couple of more problems:
> 
> If the encoding argument was not an encoding label, should we throw or fall
> back to UTF-8?

I hope this clarifies: http://dev.w3.org/2006/webapi/FileAPI/#enctype

We don't throw; we set it to null, try the charset parameter, and if that returns failure, set it to UTF-8, then decode (which handles the BOM) per the Encoding spec.

> 
> It's undefined how exactly the charset parameter is to be extracted from the
> MIME type.

This should be clarified in the spec.  We defer to MIMESNIFF (http://dev.w3.org/2006/webapi/FileAPI/#model refers to http://mimesniff.spec.whatwg.org/#parsing-a-mime-type).  I think we can toggle on ";" with your approval.
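A rough sketch of such charset-parameter extraction (illustrative only; a real implementation would follow mimesniff's full "parse a MIME type" algorithm, including its quoting and multi-parameter rules):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Extract the charset parameter from a MIME type string, e.g.
// "text/plain;charset=utf-8" -> "utf-8". Returns the empty string when no
// charset parameter is present. The value is returned lowercased, since
// charset values are matched case-insensitively.
std::string ExtractCharset(std::string mimeType) {
  for (char& c : mimeType) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  size_t pos = mimeType.find("charset=");
  // The parameter must follow a ';' separator.
  if (pos == std::string::npos || mimeType.rfind(';', pos) == std::string::npos)
    return "";
  std::string value = mimeType.substr(pos + 8);
  size_t end = value.find(';');
  if (end != std::string::npos) value = value.substr(0, end);
  // Strip surrounding double quotes; single quotes are the MIME violation
  // discussed in this bug and are deliberately not accepted.
  if (value.size() >= 2 && value.front() == '"' && value.back() == '"')
    value = value.substr(1, value.size() - 2);
  return value;
}
```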
> note the File API defers to the encoding spec at WHATWG, which in turn does the BOM heavy
> lifting for us.

Ah, because you're using "encoding" as a fallback encoding only, in cases when the BOM is not present?  Sounds good.

That's not what the attached patch implements, however.
(In reply to Boris Zbarsky [:bz] from comment #13)
> So this doesn't seem to implement what the spec says for FileReader.  What
> the spec says is:
> 
> 1)  If label argument is present, call "getting an encoding" and if that
> doesn't
>     fail use that.

Oh right. I should have re-read the spec. Silently ignoring an argument like that is a bit odd API design, but OK.

> Also, there is no mention of BOM in the spec that I can see; is that a spec
> bug you've raised?

File API references "decode" from the Encoding Standard and BOM-sniffing is baked into "decode". That is, the BOM overrides everything else. So the sequence should be:

 1) Encoding from BOM if there is a BOM for UTF-8, UTF-16LE or UTF-16BE
 2) Encoding from the label argument if getting an encoding doesn't fail
 3) Encoding from the blob's type attribute if getting an encoding doesn't fail
 4) UTF-8
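The four-step order above can be sketched as follows (hypothetical helper; the label-validity check is a stand-in for the Encoding Standard's "get an encoding", here reduced to a tiny hard-coded set):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Determine the decoding encoding for readAsText per the order above:
// 1. BOM  2. label argument  3. blob type's charset  4. UTF-8 default.
std::string PickEncoding(const std::vector<uint8_t>& bytes,
                         const std::string& labelArg,
                         const std::string& typeCharset) {
  // 1. A BOM overrides everything else.
  if (bytes.size() >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
    return "utf-8";
  if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) return "utf-16le";
  if (bytes.size() >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) return "utf-16be";
  // 2./3. First label that "get an encoding" accepts wins; a failed lookup
  // silently falls through instead of throwing.
  for (const std::string& label : {labelArg, typeCharset}) {
    if (label == "utf-8" || label == "utf-16le" || label == "utf-16be" ||
        label == "windows-1252")
      return label;
  }
  // 4. Default.
  return "utf-8";
}
```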

> Furthermore, it's not clear to me whether the behavior for "replacement"
> here would match the spec;

I wrote the patch thinking (dunno why; I should have re-read the spec) that File API referenced TextDecoder behavior, but since it references "get an encoding" plus "decode", the behavior in the patch doesn't match the spec.

I'll adjust the patch. Sorry about asking for review without re-reading the spec.
Attached patch Patch that agrees with the spec (obsolete) — Splinter Review
https://tbpl.mozilla.org/?tree=Try&rev=4cf8f6e4b3a1

Code in and around nsContentUtils::ConvertStringFromCharset was confused about whether it's dealing with labels or encodings and about whether UTF-8 errors should be handled or not. I changed the name of the method to clarify that it's dealing with encodings and made it always handle UTF-8 errors, since callers seem to expect that. (Not going to fix bug 919924 today. It would prevent this kind of confusion, but the fix would be massive. Probably never worth the effort.)

I checked blame on how the MIME violation of accepting single quotes for the charset parameter got introduced to XHR and removed the cargo cult copy of that violation from the patch.

The incidental change to ArchiveReader is according to what baku said on IRC; there's no spec.
Attachment #8342994 - Attachment is obsolete: true
(In reply to Henri Sivonen (:hsivonen) from comment #18)
> Code in an around nsContentUtils::ConvertStringFromCharset was confused
> about whether it's dealing with labels or encodings

For clarity: After the introduction of the replacement encoding, the strings we use to represent *encodings* are not guaranteed to be valid *labels*, because "replacement" as a label doesn't map to the "replacement" as an encoding.
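For illustration (stand-in table, not EncodingUtils): per the Encoding Standard, legacy labels such as "iso-2022-kr" map to the replacement encoding, while the encoding name "replacement" is itself not a valid label, so encoding names do not round-trip through label lookup.

```cpp
#include <cassert>
#include <string>

// Tiny stand-in for the Encoding Standard's label table, showing the
// label/encoding asymmetry around the replacement encoding.
std::string LookupLabel(const std::string& label) {
  if (label == "iso-2022-kr" || label == "csiso2022kr") return "replacement";
  if (label == "utf-8") return "utf-8";
  return "";  // Not a known label -- including the string "replacement".
}
```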
(In reply to Henri Sivonen (:hsivonen) from comment #17)
> (In reply to Boris Zbarsky [:bz] from comment #13)
> > So this doesn't seem to implement what the spec says for FileReader.  What
> > the spec says is:
> > 
> > 1)  If label argument is present, call "getting an encoding" and if that
> > doesn't
> >     fail use that.
> 
> Oh right. I should have re-read the spec. Silently ignoring an argument like
> that is a bit odd API design, but OK.


We traded off throwing for this, believing that throwing was more pain than gain for web developers.
Can we still change that? The label list is essentially fixed. If you supply one that is incorrect you basically have not read the specification.
(In reply to Anne (:annevk) from comment #21)
> Can we still change that? The label list is essentially fixed. If you supply
> one that is incorrect you basically have not read the specification.

This is an old discussion, mostly back and forth on IRC, and you should open a fresh "spec bug" if you feel strongly, but:

1. It seems that what the API does now is to fall back on the most common encoding on the Web.  For instance, this is the top link that emerges when you try a naive survey of encodings: http://w3techs.com/technologies/overview/character_encoding/all (a separate debate on these guys' methodology can be had, but it does confirm my suspicion/bias in favor of UTF-8).

2. We do BOM checks.

Between 1. and 2., I think we get it mostly right in the 80-20 sense.  The worst that can happen is some data loss with replacement chars, right?
Attachment #8345900 - Flags: review?(bzbarsky)
I'm sorry this is taking so long; I'll get to this on Monday.
Comment on attachment 8345900 [details] [diff] [review]
Hopefully with fewer test failures

>+++ b/content/base/public/nsContentUtils.h
>+   * @param aEncoding the Gecko-canocial name of the encoding or the empty

"canonical"

>+++ b/content/base/src/nsDOMFileReader.cpp
>+    if (!EncodingUtils::FindEncodingForLabel(aCharset,
>+        encoding)) {

The indent is weird: please line up the arguments.

>+  nsDependentCString data(aFileData, aDataLen);

I think if this were an nsDependentCSubstring you wouldn't need the null-termination bit.

>+++ b/content/base/src/nsReferencedElement.cpp
>+    // XXX Eww. If fallible malloc failed, using a conversion method that
>+    // assumes UTF-8 and doesn't handle UTF-8 errors.

If fallible malloc failed, don't we want to just bail out?

Could you please file a followup bug on fixing whatever the weirdness here is, with a clear description of the weirdness?

>+++ b/dom/encoding/TextDecoder.h
>+   * If no encoding is provided then mEncoding is default initialised to

Does that mean if aLabel is empty, or something else?  Would be good to clarify that, since this API looks like aLabel always has to be provided.

>+++ b/dom/workers/FileReaderSync.cpp
>+        !EncodingUtils::FindEncodingForLabel(aEncoding.Value(),
>+        encoding)) {

Again, weird indent.

>+      // API argument failed. Try the type property of the blob.

I wish we had some way to avoid copy/pasting this code.  :(

r=me with the nits fixed.
Attachment #8345900 - Flags: review?(bzbarsky) → review+
(In reply to Boris Zbarsky [:bz] from comment #25)
> "canonical"

Fixed.

> The indent is weird: please line up the arguments.

Fixed.

> >+  nsDependentCString data(aFileData, aDataLen);
> 
> I think if this were an nsDependentCSubstring you wouldn't need the
> null-termination bit.

Undid the null-termination and used nsDependentCSubstring. (I guess I still have a lot to learn about our string classes.)

> >+++ b/content/base/src/nsReferencedElement.cpp
> >+    // XXX Eww. If fallible malloc failed, using a conversion method that
> >+    // assumes UTF-8 and doesn't handle UTF-8 errors.
> 
> If fallible malloc failed, don't we want to just bail out?
> 
> Could you please file a followup bug on fixing whatever the weirdness here
> is, with a clear description of the weirdness?

Filed bug 951082.

> >+++ b/dom/encoding/TextDecoder.h
> >+   * If no encoding is provided then mEncoding is default initialised to
> 
> Does that mean if aLable is empty, or something else?  Would be good to
> clarify that, since this API looks like aLabel always has to be provided.

The documentation indeed didn't match what the API did before or after this patch. I adjusted the documentation. emk, what's the story behind the documentation talking about TextDecoder defaulting to UTF-8 but in fact throwing for bogus labels, including the empty string?

> Again, weird indent.

Fixed.

> r=me with the nits fixed.

Thanks. Landed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5b57ad1e78fe
Flags: needinfo?(VYV03354)
(In reply to Henri Sivonen (:hsivonen) from comment #26)
> Undid the null-termination and used nsDependentCSubstring. (I guess I still
> have a lot to learn about our string classes.)

Filed bug 951110.
(In reply to Henri Sivonen (:hsivonen) from comment #26)
> The documentation indeed didn't match what the API did before or after this
> patch. I adjusted the documentation. emk, what's the story behind the
> documentation talking about TextDecoder defaulting to UTF-8 but in fact
> throwing for bogus labels, including the empty string?

That comment had the DOM API in mind. In an earlier iteration of the patch for bug 764234, the default-to-UTF-8 code was in the C++ implementation. Later the code was removed in favor of the WebIDL optional argument, but I forgot to fix the comment.
Flags: needinfo?(VYV03354)
https://hg.mozilla.org/mozilla-central/rev/5b57ad1e78fe
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Blocks: 951555
Component: DOM → DOM: Core & HTML