Last Comment Bug 772191 - RESTRequest should default to UTF-8 if no encoding is specified
: RESTRequest should default to UTF-8 if no encoding is specified
Status: RESOLVED FIXED
[blocking-aitc+], [qa+]
:
Product: Cloud Services
Classification: Client Software
Component: Firefox: Common (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla16
Assigned To: Gregory Szorc [:gps]
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-07-09 13:05 PDT by Anant Narayanan [:anant]
Modified: 2012-08-02 15:49 PDT (History)
7 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
Make RESTRequest take a charset hint (5.25 KB, patch)
2012-07-09 16:38 PDT, Anant Narayanan [:anant]
no flags Details | Diff | Review
Make RESTRequest take a charset hint (8.47 KB, patch)
2012-07-12 10:52 PDT, Anant Narayanan [:anant]
no flags Details | Diff | Review
Make RESTRequest take a charset hint (added test for UTF-8 translate) (9.31 KB, patch)
2012-07-12 16:58 PDT, Anant Narayanan [:anant]
no flags Details | Diff | Review
Make RESTRequest handle different charsets (added test for UTF-8 translate) (9.31 KB, patch)
2012-07-13 10:19 PDT, Anant Narayanan [:anant]
no flags Details | Diff | Review
Make RESTRequest take a charset hint; default to UTF-8 (10.90 KB, patch)
2012-07-13 16:09 PDT, Gregory Szorc [:gps]
no flags Details | Diff | Review
Make RESTRequest take a charset hint; default to UTF-8, v2 (11.23 KB, patch)
2012-07-13 16:40 PDT, Gregory Szorc [:gps]
philipp: review+
bzbarsky: feedback+
Details | Diff | Review

Description Anant Narayanan [:anant] 2012-07-09 13:05:21 PDT
RFC 2616 specifies that if a charset is not explicitly specified in the headers, it must be treated as ISO-8859-1, but this is only for Content-Type "text" (and it's subtypes).

The spec does not stipulate a default encoding for any other Content-Types, and I propose that we default to UTF-8 for all of them except "text/*". Many web servers already fallback to UTF-8 (for POST requests at-least) if an encoding is not explicitly provided.

What are the risks of doing this?
Comment 1 Gregory Szorc [:gps] 2012-07-09 13:48:40 PDT
The main risk is that some existing server or system relying on the current behavior breaks as a result of this change.

I also don't know what necko is doing under the hood with regards to setting nsIChannel.contentCharset. I would like confirmation from the necko people that inferring UTF-8 (when it makes sense) if contentCharset is not set is a sane thing to do.

Finally, the patch from bug 761877 made the UTF-8 change unconditionally (I think). This *may* preclude RESTRequest from being used with binary data. I /think/ I would prefer the default charset and/or stream conversions be exposed as properties/APIs of RESTRequest. i.e. UTF-8 may be a reasonable default for text streams. But, we should make it easy for clients to change.

If we do make this change, we need to ensure servers being connected to by RESTRequest are serving proper content type headers and we don't inadvertently change behavior.
Comment 2 Anant Narayanan [:anant] 2012-07-09 14:27:06 PDT
Makes sense. A better fix might be to only infer UTF-8 only for Content-Type: application/x-web-app-manifest+json. Perhaps make the default encoding a parameter at the time a RESTRequest is instantiated, so the caller can decide? I suspect different consumers of RESTRequest will want different things and the caller is in the best position to predict what the responses from the server are going to be.
Comment 3 Gregory Szorc [:gps] 2012-07-09 14:33:35 PDT
(In reply to Anant Narayanan [:anant] from comment #2)
> Makes sense. A better fix might be to only infer UTF-8 only for
> Content-Type: application/x-web-app-manifest+json. Perhaps make the default
> encoding a parameter at the time a RESTRequest is instantiated, so the
> caller can decide? I suspect different consumers of RESTRequest will want
> different things and the caller is in the best position to predict what the
> responses from the server are going to be.

Precisely.

FWIW, JSON's default encoding is UTF-8 per RFC 4627 Sec 3. Perhaps if we register application/x-web-app-manifest+json somewhere, necko will automagically apply UTF-8 to the channel contentCharset in the absence of an explicit "charset" field in the C-T response header?
Comment 4 Boris Zbarsky [:bz] (Out June 25-July 6) 2012-07-09 14:49:42 PDT
Necko doesn't do any charset automagic at the moment.  It'll return empty string if no charset was specified in Content-Type....

I suppose we could consider changing that, though.
Comment 5 Gregory Szorc [:gps] 2012-07-09 15:06:29 PDT
OK. I think we should have a property on RESTRequest that controls the default encoding. By default it is UTF-8. If the response channel has contentCharset defined, we always use that. As long as the request's response charset is updated before the request is dispatched, it will be honored.

Anant: do you want to adapt your previous patch or should I code something up?
Comment 6 Boris Zbarsky [:bz] (Out June 25-July 6) 2012-07-09 15:10:22 PDT
> If the response channel has contentCharset defined, we always use that.

You can actually make necko do this for you.  Just set contentCharset on the channel _before_ calling asyncOpen.  See documentation for contentType: contentCharset behaves the same way.
Comment 7 Gregory Szorc [:gps] 2012-07-09 15:13:23 PDT
(In reply to Boris Zbarsky (:bz) from comment #6)
> > If the response channel has contentCharset defined, we always use that.
> 
> You can actually make necko do this for you.  Just set contentCharset on the
> channel _before_ calling asyncOpen.  See documentation for contentType:
> contentCharset behaves the same way.

Most excellent (assuming it behaves as a hint like contentType does). Perhaps https://developer.mozilla.org/en/nsIChannel should be updated to mention the behavior of contentCharset...
Comment 8 Boris Zbarsky [:bz] (Out June 25-July 6) 2012-07-09 15:22:55 PDT
> assuming it behaves as a hint like contentType does

Yes, exactly.  r=me on updating the docs.  ;)
Comment 9 Anant Narayanan [:anant] 2012-07-09 16:38:18 PDT
Created attachment 640421 [details] [diff] [review]
Make RESTRequest take a charset hint

Add a "charset" property to RESTRequest and pass it along to nsIChannel.
Comment 10 Gregory Szorc [:gps] 2012-07-10 09:57:40 PDT
Comment on attachment 640421 [details] [diff] [review]
Make RESTRequest take a charset hint

Review of attachment 640421 [details] [diff] [review]:
-----------------------------------------------------------------

::: services/common/rest.js
@@ +319,5 @@
>      channel.requestMethod = method;
>  
> +    // Before opening the channel, set the charset that serves as a hint
> +    // as to what the response might be encoded as.
> +    channel.contentCharset = this.charset ? this.charset : "utf-8";

Just set charset on the prototype to "utf-8" and always use this.charset.

::: services/common/tests/unit/test_restrequest.js
@@ +201,5 @@
> +      do_check_eq(request2.response.body, response);
> +      do_check_eq(request2.response.headers["content-type"], contentType);
> +
> +      server.stop(run_next_test);
> +    });

We need a bit more test coverage here. We need to ensure:

1) Default utf-8 charset is used if C-T header doesn't specify a charset
2) C-T header overrides client default/hint
3) Local hint is used if C-T doesn't specify a charset

We don't do any of these now since utf-8 is being used everywhere. i.e. we have no way of knowing if this works when we are using something not UTF-8.

An easy way to test this would be to use "us-ascii" as the charset and ensure all bytes higher than 127 are stripped from the response body.
Comment 11 Anant Narayanan [:anant] 2012-07-12 10:52:25 PDT
Created attachment 641532 [details] [diff] [review]
Make RESTRequest take a charset hint
Comment 12 Gregory Szorc [:gps] 2012-07-12 12:00:35 PDT
Comment on attachment 641532 [details] [diff] [review]
Make RESTRequest take a charset hint

Review of attachment 641532 [details] [diff] [review]:
-----------------------------------------------------------------

This is /really/ close. You just need a test that ensures that a hint with a server without a charset actually does the right thing. I'm not convinced that a simple ASCII string tests things adequately. Put a unicode character in that test string and ensure that us-ascii conversion on the client strips it out.
Comment 13 Gregory Szorc [:gps] 2012-07-12 14:55:57 PDT
Comment on attachment 641532 [details] [diff] [review]
Make RESTRequest take a charset hint

Review of attachment 641532 [details] [diff] [review]:
-----------------------------------------------------------------

Forgot to wipe flag.
Comment 14 Anant Narayanan [:anant] 2012-07-12 16:58:31 PDT
Created attachment 641658 [details] [diff] [review]
Make RESTRequest take a charset hint (added test for UTF-8 translate)

So, nsIConverterInputStream isn't doing what you suggest. While interpreting a string of UTF-8 byes as ISO-8859-1, it doesn't change anything about a character it doesn't recognize (although the documentation at https://developer.mozilla.org/en/Reading_textual_data implies that it should be replaced by the REPLACEMENT_CHARACTER - we use the default in rest.js).

The character code for ç is 231, which is why the test in this patch passes. But I suspect we won't need this kind of test. Trying to interpret UTF-8 content as ASCII is always going to fail, I don't think there is any expectation from any side that the string will somehow be downgraded gracefully.

The bottom line is that if the caller of a RESTRequest doesn't know what encoding the return value is in, and if the server doesn't provide a charset in the C-T header; all bets are off. I think I'm ok with that.

Leaving the previous patch as not obsolete, as I think we want that version to be checked-in.
Comment 15 Gregory Szorc [:gps] 2012-07-12 23:50:25 PDT
(In reply to Anant Narayanan [:anant] from comment #14)
> Created attachment 641658 [details] [diff] [review]
> Make RESTRequest take a charset hint (added test for UTF-8 translate)
> 
> So, nsIConverterInputStream isn't doing what you suggest. While interpreting
> a string of UTF-8 byes as ISO-8859-1, it doesn't change anything about a
> character it doesn't recognize (although the documentation at
> https://developer.mozilla.org/en/Reading_textual_data implies that it should
> be replaced by the REPLACEMENT_CHARACTER - we use the default in rest.js).
>
> The character code for ç is 231, which is why the test in this patch passes.

The internets says that c-cedilla is 231 in both the iso-8859-1 and utf-8 code pages. So, I would not expect this character to be replaced. I bet if you changed the hinted charset to "us-ascii" this character would disappear. Or you could use a character that isn't identical in both iso-8859-1 and utf-8...

> But I suspect we won't need this kind of test. Trying to interpret UTF-8
> content as ASCII is always going to fail, I don't think there is any
> expectation from any side that the string will somehow be downgraded
> gracefully.
> 
> The bottom line is that if the caller of a RESTRequest doesn't know what
> encoding the return value is in, and if the server doesn't provide a charset
> in the C-T header; all bets are off. I think I'm ok with that.

Yes and no. What we're testing here is that the code behaves as it says it will. If we change the behavior of the client's interpretation of character data, we want that to be an explicit change. i.e. we want to require someone to think long and hard about changing a test if they suddenly find a regression. This test does just that by detecting if there is a change to how the client interprets the data in the absence of a server override.

As you said, the conversion behavior may not be "right." That's totally fine. The responsibility of encoding is with the client and server of a particular service/server. If they manage to screw it up, there's nothing we (RESTRequest) can do.

If there's any takeaway from this bug, it should be "always explicitly define a character set."
Comment 16 Gregory Szorc [:gps] 2012-07-12 23:52:18 PDT
Comment on attachment 641658 [details] [diff] [review]
Make RESTRequest take a charset hint (added test for UTF-8 translate)

Review of attachment 641658 [details] [diff] [review]:
-----------------------------------------------------------------

We will move forward with this patch because it has the additional test. See my previous comment for what needs changed.
Comment 17 Anant Narayanan [:anant] 2012-07-13 10:19:39 PDT
Created attachment 641944 [details] [diff] [review]
Make RESTRequest handle different charsets (added test for UTF-8 translate)

(In reply to Gregory Szorc [:gps] from comment #15)
> (In reply to Anant Narayanan [:anant] from comment #14)
> > Created attachment 641658 [details] [diff] [review]
> > Make RESTRequest take a charset hint (added test for UTF-8 translate)
> > 
> > So, nsIConverterInputStream isn't doing what you suggest. While interpreting
> > a string of UTF-8 byes as ISO-8859-1, it doesn't change anything about a
> > character it doesn't recognize (although the documentation at
> > https://developer.mozilla.org/en/Reading_textual_data implies that it should
> > be replaced by the REPLACEMENT_CHARACTER - we use the default in rest.js).
> >
> > The character code for ç is 231, which is why the test in this patch passes.
> 
> The internets says that c-cedilla is 231 in both the iso-8859-1 and utf-8
> code pages. So, I would not expect this character to be replaced. I bet if
> you changed the hinted charset to "us-ascii" this character would disappear.
> Or you could use a character that isn't identical in both iso-8859-1 and
> utf-8...

I tried US-ASCII first, and it had the same effect (the character wasn't replaced, just retained the 231 code). In this patch, I've used a character that is neither in ASCII nor ISO-8859-1, but the character code is retained in both cases. Therefore, I don't really see the point of this test.
Comment 18 Gregory Szorc [:gps] 2012-07-13 11:46:24 PDT
After going down the rabbit hole, the current implementation is teh suck. This is no fault of Anant's. Instead of forcing Anant to dig through IDL docs, I'm going to fix this myself.
Comment 19 Gregory Szorc [:gps] 2012-07-13 16:09:29 PDT
Created attachment 642124 [details] [diff] [review]
Make RESTRequest take a charset hint; default to UTF-8

Ugh.

Apparently there are times when a client-specified contentCharset hint is reset to "" even when the server doesn't send an explicit charset as part of Content-Type. This was triggered by the test_proxy_auth_redirect test, which sends C-T: application/x-ns-proxy-autoconfig.

So, I had keep the nsIScriptableInputStream around to handle the case where contentCharset is not set. Previous versions of the patch eliminated it, assuming we always had contentCharset defined on the channel.

Also, upon consultation with others (although notably not :bz), it appears nsIConverterInputStream doesn't apply character set validation when doing the conversion. I thought that the "replacement character" support would replace byte sequences corresponding to characters not present in the destination character set. Apparently this is not how things work. Apparently it only checks for invalid byte sequences and replaces those.

So, the test I insisted on where the client does a silly conversion isn't a valid test, apparently. So, it has been removed.
Comment 20 Gregory Szorc [:gps] 2012-07-13 16:40:34 PDT
Created attachment 642136 [details] [diff] [review]
Make RESTRequest take a charset hint; default to UTF-8, v2

Assign response.charset in onDataAvailable; add tests for this.

Add comment on why we QI in onDataAvailable.
Comment 21 Gregory Szorc [:gps] 2012-07-13 16:53:52 PDT
https://hg.mozilla.org/services/services-central/rev/f7acb805fe4f

Anant gets blame :P
Comment 22 Gregory Szorc [:gps] 2012-07-13 16:55:10 PDT
Comment on attachment 642136 [details] [diff] [review]
Make RESTRequest take a charset hint; default to UTF-8, v2

Boris: I'd appreciate a post-commit glance if you have the time (everyone said you were the expert on this type of thing).
Comment 23 Boris Zbarsky [:bz] (Out June 25-July 6) 2012-07-13 17:01:51 PDT
Comment on attachment 642136 [details] [diff] [review]
Make RESTRequest take a charset hint; default to UTF-8, v2

Looks reasonable.
Comment 24 Tracy Walker [:tracy] 2012-07-14 13:22:58 PDT
sync setup. paring and smoketests fine. with latest s-c build of 1342246581.
Comment 25 Gregory Szorc [:gps] 2012-07-14 13:50:38 PDT
https://hg.mozilla.org/mozilla-central/rev/f7acb805fe4f
Comment 26 Jason Smith [:jsmith] 2012-07-14 14:32:56 PDT
Tracy btw - I'll need to do verification here as well for AITC, as this bug significantly affects app installations with the AITC server.
Comment 27 Jason Smith [:jsmith] 2012-07-16 19:33:17 PDT
Verification of this fix is blocked by bug 774555.
Comment 28 Jason Smith [:jsmith] 2012-07-16 19:38:05 PDT
(In reply to Jason Smith [:jsmith] from comment #27)
> Verification of this fix is blocked by bug 774555.

Apparently the staging server is undergoing maintenance right now, so that bug isn't valid.

Note You need to log in before you can comment on or make changes to this bug.