The default bug view has changed. See this FAQ.

RESTRequest should default to UTF-8 if no encoding is specified

RESOLVED FIXED in mozilla16

Status

Cloud Services
Firefox: Common
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: anant, Assigned: gps)

Tracking

unspecified
mozilla16
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [blocking-aitc+], [qa+])

Attachments

(1 attachment, 5 obsolete attachments)

(Reporter)

Description

5 years ago
RFC 2616 specifies that if a charset is not explicitly specified in the headers, it must be treated as ISO-8859-1, but this is only for Content-Type "text" (and it's subtypes).

The spec does not stipulate a default encoding for any other Content-Types, and I propose that we default to UTF-8 for all of them except "text/*". Many web servers already fallback to UTF-8 (for POST requests at-least) if an encoding is not explicitly provided.

What are the risks of doing this?
(Assignee)

Comment 1

5 years ago
The main risk is that some existing server or system relying on the current behavior breaks as a result of this change.

I also don't know what necko is doing under the hood with regards to setting nsIChannel.contentCharset. I would like confirmation from the necko people that inferring UTF-8 (when it makes sense) if contentCharset is not set is a sane thing to do.

Finally, the patch from bug 761877 made the UTF-8 change unconditionally (I think). This *may* preclude RESTRequest from being used with binary data. I /think/ I would prefer the default charset and/or stream conversions be exposed as properties/APIs of RESTRequest. i.e. UTF-8 may be a reasonable default for text streams. But, we should make it easy for clients to change.

If we do make this change, we need to ensure servers being connected to by RESTRequest are serving proper content type headers and we don't inadvertently change behavior.
(Reporter)

Comment 2

5 years ago
Makes sense. A better fix might be to only infer UTF-8 only for Content-Type: application/x-web-app-manifest+json. Perhaps make the default encoding a parameter at the time a RESTRequest is instantiated, so the caller can decide? I suspect different consumers of RESTRequest will want different things and the caller is in the best position to predict what the responses from the server are going to be.
(Assignee)

Comment 3

5 years ago
(In reply to Anant Narayanan [:anant] from comment #2)
> Makes sense. A better fix might be to only infer UTF-8 only for
> Content-Type: application/x-web-app-manifest+json. Perhaps make the default
> encoding a parameter at the time a RESTRequest is instantiated, so the
> caller can decide? I suspect different consumers of RESTRequest will want
> different things and the caller is in the best position to predict what the
> responses from the server are going to be.

Precisely.

FWIW, JSON's default encoding is UTF-8 per RFC 4627 Sec 3. Perhaps if we register application/x-web-app-manifest+json somewhere, necko will automagically apply UTF-8 to the channel contentCharset in the absence of an explicit "charset" field in the C-T response header?
Necko doesn't do any charset automagic at the moment.  It'll return empty string if no charset was specified in Content-Type....

I suppose we could consider changing that, though.
(Assignee)

Comment 5

5 years ago
OK. I think we should have a property on RESTRequest that controls the default encoding. By default it is UTF-8. If the response channel has contentCharset defined, we always use that. As long as the request's response charset is updated before the request is dispatched, it will be honored.

Anant: do you want to adapt your previous patch or should I code something up?
> If the response channel has contentCharset defined, we always use that.

You can actually make necko do this for you.  Just set contentCharset on the channel _before_ calling asyncOpen.  See documentation for contentType: contentCharset behaves the same way.
(Assignee)

Comment 7

5 years ago
(In reply to Boris Zbarsky (:bz) from comment #6)
> > If the response channel has contentCharset defined, we always use that.
> 
> You can actually make necko do this for you.  Just set contentCharset on the
> channel _before_ calling asyncOpen.  See documentation for contentType:
> contentCharset behaves the same way.

Most excellent (assuming it behaves as a hint like contentType does). Perhaps https://developer.mozilla.org/en/nsIChannel should be updated to mention the behavior of contentCharset...
> assuming it behaves as a hint like contentType does

Yes, exactly.  r=me on updating the docs.  ;)
(Reporter)

Comment 9

5 years ago
Created attachment 640421 [details] [diff] [review]
Make RESTRequest take a charset hint

Add a "charset" property to RESTRequest and pass it along to nsIChannel.
Assignee: nobody → anant
Status: NEW → ASSIGNED
Attachment #640421 - Flags: review?(gps)
(Assignee)

Comment 10

5 years ago
Comment on attachment 640421 [details] [diff] [review]
Make RESTRequest take a charset hint

Review of attachment 640421 [details] [diff] [review]:
-----------------------------------------------------------------

::: services/common/rest.js
@@ +319,5 @@
>      channel.requestMethod = method;
>  
> +    // Before opening the channel, set the charset that serves as a hint
> +    // as to what the response might be encoded as.
> +    channel.contentCharset = this.charset ? this.charset : "utf-8";

Just set charset on the prototype to "utf-8" and always use this.charset.

::: services/common/tests/unit/test_restrequest.js
@@ +201,5 @@
> +      do_check_eq(request2.response.body, response);
> +      do_check_eq(request2.response.headers["content-type"], contentType);
> +
> +      server.stop(run_next_test);
> +    });

We need a bit more test coverage here. We need to ensure:

1) Default utf-8 charset is used if C-T header doesn't specify a charset
2) C-T header overrides client default/hint
3) Local hint is used if C-T doesn't specify a charset

We don't do any of these now since utf-8 is being used everywhere. i.e. we have no way of knowing if this works when we are using something not UTF-8.

An easy way to test this would be to use "us-ascii" as the charset and ensure all bytes higher than 127 are stripped from the response body.
Attachment #640421 - Flags: review?(gps)

Updated

5 years ago
Whiteboard: [blocking-aitc+]

Updated

5 years ago
QA Contact: jsmith
(Reporter)

Comment 11

5 years ago
Created attachment 641532 [details] [diff] [review]
Make RESTRequest take a charset hint
Attachment #640421 - Attachment is obsolete: true
Attachment #641532 - Flags: review?(gps)
(Assignee)

Updated

5 years ago
Attachment #641532 - Attachment is patch: true
(Assignee)

Comment 12

5 years ago
Comment on attachment 641532 [details] [diff] [review]
Make RESTRequest take a charset hint

Review of attachment 641532 [details] [diff] [review]:
-----------------------------------------------------------------

This is /really/ close. You just need a test that ensures that a hint with a server without a charset actually does the right thing. I'm not convinced that a simple ASCII string tests things adequately. Put a unicode character in that test string and ensure that us-ascii conversion on the client strips it out.
(Assignee)

Comment 13

5 years ago
Comment on attachment 641532 [details] [diff] [review]
Make RESTRequest take a charset hint

Review of attachment 641532 [details] [diff] [review]:
-----------------------------------------------------------------

Forgot to wipe flag.
Attachment #641532 - Flags: review?(gps)
(Reporter)

Comment 14

5 years ago
Created attachment 641658 [details] [diff] [review]
Make RESTRequest take a charset hint (added test for UTF-8 translate)

So, nsIConverterInputStream isn't doing what you suggest. While interpreting a string of UTF-8 byes as ISO-8859-1, it doesn't change anything about a character it doesn't recognize (although the documentation at https://developer.mozilla.org/en/Reading_textual_data implies that it should be replaced by the REPLACEMENT_CHARACTER - we use the default in rest.js).

The character code for ç is 231, which is why the test in this patch passes. But I suspect we won't need this kind of test. Trying to interpret UTF-8 content as ASCII is always going to fail, I don't think there is any expectation from any side that the string will somehow be downgraded gracefully.

The bottom line is that if the caller of a RESTRequest doesn't know what encoding the return value is in, and if the server doesn't provide a charset in the C-T header; all bets are off. I think I'm ok with that.

Leaving the previous patch as not obsolete, as I think we want that version to be checked-in.
Attachment #641658 - Flags: review?(gps)
(Reporter)

Updated

5 years ago
Attachment #641658 - Attachment description: Make RESTRequest take a charset hint → Make RESTRequest take a charset hint (added test for UTF-8 translate)
(Assignee)

Comment 15

5 years ago
(In reply to Anant Narayanan [:anant] from comment #14)
> Created attachment 641658 [details] [diff] [review]
> Make RESTRequest take a charset hint (added test for UTF-8 translate)
> 
> So, nsIConverterInputStream isn't doing what you suggest. While interpreting
> a string of UTF-8 byes as ISO-8859-1, it doesn't change anything about a
> character it doesn't recognize (although the documentation at
> https://developer.mozilla.org/en/Reading_textual_data implies that it should
> be replaced by the REPLACEMENT_CHARACTER - we use the default in rest.js).
>
> The character code for ç is 231, which is why the test in this patch passes.

The internets says that c-cedilla is 231 in both the iso-8859-1 and utf-8 code pages. So, I would not expect this character to be replaced. I bet if you changed the hinted charset to "us-ascii" this character would disappear. Or you could use a character that isn't identical in both iso-8859-1 and utf-8...

> But I suspect we won't need this kind of test. Trying to interpret UTF-8
> content as ASCII is always going to fail, I don't think there is any
> expectation from any side that the string will somehow be downgraded
> gracefully.
> 
> The bottom line is that if the caller of a RESTRequest doesn't know what
> encoding the return value is in, and if the server doesn't provide a charset
> in the C-T header; all bets are off. I think I'm ok with that.

Yes and no. What we're testing here is that the code behaves as it says it will. If we change the behavior of the client's interpretation of character data, we want that to be an explicit change. i.e. we want to require someone to think long and hard about changing a test if they suddenly find a regression. This test does just that by detecting if there is a change to how the client interprets the data in the absence of a server override.

As you said, the conversion behavior may not be "right." That's totally fine. The responsibility of encoding is with the client and server of a particular service/server. If they manage to screw it up, there's nothing we (RESTRequest) can do.

If there's any takeaway from this bug, it should be "always explicitly define a character set."
(Assignee)

Comment 16

5 years ago
Comment on attachment 641658 [details] [diff] [review]
Make RESTRequest take a charset hint (added test for UTF-8 translate)

Review of attachment 641658 [details] [diff] [review]:
-----------------------------------------------------------------

We will move forward with this patch because it has the additional test. See my previous comment for what needs changed.
Attachment #641658 - Flags: review?(gps)
(Reporter)

Comment 17

5 years ago
Created attachment 641944 [details] [diff] [review]
Make RESTRequest handle different charsets (added test for UTF-8 translate)

(In reply to Gregory Szorc [:gps] from comment #15)
> (In reply to Anant Narayanan [:anant] from comment #14)
> > Created attachment 641658 [details] [diff] [review]
> > Make RESTRequest take a charset hint (added test for UTF-8 translate)
> > 
> > So, nsIConverterInputStream isn't doing what you suggest. While interpreting
> > a string of UTF-8 byes as ISO-8859-1, it doesn't change anything about a
> > character it doesn't recognize (although the documentation at
> > https://developer.mozilla.org/en/Reading_textual_data implies that it should
> > be replaced by the REPLACEMENT_CHARACTER - we use the default in rest.js).
> >
> > The character code for ç is 231, which is why the test in this patch passes.
> 
> The internets says that c-cedilla is 231 in both the iso-8859-1 and utf-8
> code pages. So, I would not expect this character to be replaced. I bet if
> you changed the hinted charset to "us-ascii" this character would disappear.
> Or you could use a character that isn't identical in both iso-8859-1 and
> utf-8...

I tried US-ASCII first, and it had the same effect (the character wasn't replaced, just retained the 231 code). In this patch, I've used a character that is neither in ASCII nor ISO-8859-1, but the character code is retained in both cases. Therefore, I don't really see the point of this test.
Attachment #641658 - Attachment is obsolete: true
Attachment #641944 - Flags: review?(gps)
(Reporter)

Updated

5 years ago
Summary: RESTRequest should default to UTF-8 is no encoding is specified → RESTRequest should default to UTF-8 if no encoding is specified
(Assignee)

Comment 18

5 years ago
After going down the rabbit hole, the current implementation is teh suck. This is no fault of Anant's. Instead of forcing Anant to dig through IDL docs, I'm going to fix this myself.
Assignee: anant → gps
(Assignee)

Comment 19

5 years ago
Created attachment 642124 [details] [diff] [review]
Make RESTRequest take a charset hint; default to UTF-8

Ugh.

Apparently there are times when a client-specified contentCharset hint is reset to "" even when the server doesn't send an explicit charset as part of Content-Type. This was triggered by the test_proxy_auth_redirect test, which sends C-T: application/x-ns-proxy-autoconfig.

So, I had keep the nsIScriptableInputStream around to handle the case where contentCharset is not set. Previous versions of the patch eliminated it, assuming we always had contentCharset defined on the channel.

Also, upon consultation with others (although notably not :bz), it appears nsIConverterInputStream doesn't apply character set validation when doing the conversion. I thought that the "replacement character" support would replace byte sequences corresponding to characters not present in the destination character set. Apparently this is not how things work. Apparently it only checks for invalid byte sequences and replaces those.

So, the test I insisted on where the client does a silly conversion isn't a valid test, apparently. So, it has been removed.
Attachment #641532 - Attachment is obsolete: true
Attachment #641944 - Attachment is obsolete: true
Attachment #641944 - Flags: review?(gps)
Attachment #642124 - Flags: review?(philipp)
(Assignee)

Comment 20

5 years ago
Created attachment 642136 [details] [diff] [review]
Make RESTRequest take a charset hint; default to UTF-8, v2

Assign response.charset in onDataAvailable; add tests for this.

Add comment on why we QI in onDataAvailable.
Attachment #642124 - Attachment is obsolete: true
Attachment #642124 - Flags: review?(philipp)
Attachment #642136 - Flags: review?(philipp)
Attachment #642136 - Flags: review?(philipp) → review+
(Assignee)

Comment 21

5 years ago
https://hg.mozilla.org/services/services-central/rev/f7acb805fe4f

Anant gets blame :P
Whiteboard: [blocking-aitc+] → [blocking-aitc+][fixed in services]
(Assignee)

Comment 22

5 years ago
Comment on attachment 642136 [details] [diff] [review]
Make RESTRequest take a charset hint; default to UTF-8, v2

Boris: I'd appreciate a post-commit glance if you have the time (everyone said you were the expert on this type of thing).
Attachment #642136 - Flags: feedback?(bzbarsky)

Updated

5 years ago
Whiteboard: [blocking-aitc+][fixed in services] → [blocking-aitc+], [fixed in services], [qa+]
Comment on attachment 642136 [details] [diff] [review]
Make RESTRequest take a charset hint; default to UTF-8, v2

Looks reasonable.
Attachment #642136 - Flags: feedback?(bzbarsky) → feedback+
sync setup. paring and smoketests fine. with latest s-c build of 1342246581.
Whiteboard: [blocking-aitc+], [fixed in services], [qa+] → [blocking-aitc+], [verified in services], [qa+]
(Assignee)

Comment 25

5 years ago
https://hg.mozilla.org/mozilla-central/rev/f7acb805fe4f
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Whiteboard: [blocking-aitc+], [verified in services], [qa+] → [blocking-aitc+][qa+]
Target Milestone: --- → mozilla16
Tracy btw - I'll need to do verification here as well for AITC, as this bug significantly affects app installations with the AITC server.
Verification of this fix is blocked by bug 774555.
Depends on: 774555
Whiteboard: [blocking-aitc+][qa+] → [blocking-aitc+][qa verification blocked]
(In reply to Jason Smith [:jsmith] from comment #27)
> Verification of this fix is blocked by bug 774555.

Apparently the staging server is undergoing maintenance right now, so that bug isn't valid.
No longer depends on: 774555
Whiteboard: [blocking-aitc+][qa verification blocked] → [blocking-aitc+], [qa+]

Updated

5 years ago
QA Contact: jsmith
You need to log in before you can comment on or make changes to this bug.