Closed Bug 942074 Opened 11 years ago Closed 11 years ago

JavaScript method XMLHttpRequest.setRequestHeader(...) no longer performs URL encoding

Categories

(Core :: DOM: Core & HTML, defect)

25 Branch
defect
Not set
normal

RESOLVED INVALID

People

(Reporter: artur.bystrzycki, Unassigned)

References

(Blocks 1 open bug)

Details

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0 (Beta/Release)
Build ID: 20131112160018

Steps to reproduce:

Executed the following JavaScript code from Scratchpad:

var req = new XMLHttpRequest();
req.open('GET', 'http://инфо.об-образовании.рф', true);
req.setRequestHeader("Referer", "http://инфо.об-образовании.рф");
req.send(null);



Actual results:

Exception is thrown:

/*
Exception: Cannot convert string to ByteString because the character at index 7 has value 1080 which is greater than 255.
@Scratchpad/1:12
WCA_evalWithDebugger@resource://gre/modules/devtools/dbg-server.jsm -> resource://gre/modules/devtools/server/actors/webconsole.js:890
WCA_onEvaluateJS@resource://gre/modules/devtools/dbg-server.jsm -> resource://gre/modules/devtools/server/actors/webconsole.js:547
DSC_onPacket@resource://gre/modules/devtools/dbg-server.jsm -> resource://gre/modules/devtools/server/main.js:923
@resource://gre/modules/devtools/dbg-server.jsm -> resource://gre/modules/devtools/server/transport.js:242
@resource://gre/modules/devtools/dbg-server.jsm -> resource://gre/modules/devtools/DevToolsUtils.js:61
*/
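
For illustration, the check behind this exception can be approximated in plain JavaScript. This is a minimal sketch, not Gecko's actual code: `firstNonByteStringIndex` is a hypothetical helper that reports the first UTF-16 code unit above 255, which is the condition the ByteString conversion rejects.

```javascript
// Hypothetical helper (an assumption for illustration, not Gecko's code):
// returns the index of the first code unit that does not fit in a
// ByteString (value > 255), or -1 if every code unit fits.
function firstNonByteStringIndex(value) {
  for (let i = 0; i < value.length; i++) {
    if (value.charCodeAt(i) > 255) {
      return i;
    }
  }
  return -1;
}

const refererValue = "http://инфо.об-образовании.рф";
const badIndex = firstNonByteStringIndex(refererValue);

// "http://" occupies indices 0-6, so the first Cyrillic letter is at index 7.
console.log(badIndex, refererValue.charCodeAt(badIndex)); // 7 1080
```

The index and value match the ones in the exception message: index 7 holds "и" (U+0438, decimal 1080).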


Expected results:

Nothing in this case, but the problem makes any Google Web Toolkit application deployed on a domain containing non-ASCII characters unusable. I would like to know whether you removed URL encoding by design or whether this is a bug. Based on the answer, I can decide whether to report a bug to GWT.
Thanks.
Blocks: 942095
OS: Windows 7 → All
Hardware: x86_64 → All
> I would like to know whether you removed url encoding by design

There was no url encoding that I know of.  The old code used to just convert to UTF-8 and dump those bytes on the wire.

The change from that was purposeful, to align with the XMLHttpRequest specification.  But maybe this is a problem with the spec....  Let me check what other UAs do here.
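
To make the difference concrete, here is a rough sketch of the two behaviours for the value "\u0444". It assumes a runtime with the standard `TextEncoder` global; the variable names are made up for illustration and nothing here is taken from Gecko's source.

```javascript
// Old behaviour as described above: convert the header value to UTF-8
// and send those bytes as-is.
const punkValue = "\u0444"; // Cyrillic "ф"
const utf8Bytes = Array.from(new TextEncoder().encode(punkValue));
console.log(utf8Bytes.map((b) => b.toString(16))); // [ 'd1', '84' ]

// New behaviour: the value must be a ByteString, meaning every code unit
// must be <= 255. U+0444 is 1092, so the value is rejected before any
// bytes are produced.
const fitsInByteString = Array.from(punkValue).every(
  (ch) => ch.charCodeAt(0) <= 255
);
console.log(fitsInByteString); // false
```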
Component: Untriaged → DOM
Product: Firefox → Core
So I tried this testcase:

  var req = new XMLHttpRequest();
  req.open('GET', 'test.html', true);
  req.setRequestHeader("Punk", "\u0444");
  req.send(null);

(because "Referer" is security-restricted in some browsers).

In Chrome dev, that throws an exception:

  Uncaught SyntaxError: Failed to execute 'setRequestHeader' on 'XMLHttpRequest':
    'ф' is not a valid HTTP header field value. 

In Safari, there is no exception, but it seems to put an empty string on the wire (?).

Not sure what other UAs do in terms of on-the-wire bits, since I can't run IE locally.  :(
Thanks for the quick reply.

You're right, so I'm closing this ticket and will report the problem to GWT.

Unfortunately, the stricter validation in XMLHttpRequest.setRequestHeader, combined with not encoding the address bar entry (which I think is a useful feature, not a bug), breaks GWT applications in Firefox when they are deployed under a domain that contains non-ASCII characters.
Status: UNCONFIRMED → RESOLVED
Closed: 11 years ago
Resolution: --- → INVALID
Hmm.  So in Chrome gwt works because Chrome punycodes window.location.href?

Anne, it sounds like the XHR changes may not be web-compatible with a Unicode location.href...
Flags: needinfo?(annevk)
location.href is supposed to use Punycode per the URL Standard.

The old code we had in place for XMLHttpRequest did forbid certain bytes, correct?
Flags: needinfo?(annevk)
> location.href is supposed to use Punycode per the URL Standard.

It does?  Why, if I might ask?

> The old code we had in place for XMLHttpRequest did forbid certain bytes, correct?

Gecko's old implementation simply converted to UTF-8, so any valid Unicode string would work.  I'm not sure what the spec used to say....
Flags: needinfo?(annevk)
If we did not forbid newlines and such there would be a security issue...

URLs use Punycode because the data model for URLs appears to be bytes (so sad). And because toUnicode can fail, but toASCII cannot (and is required anyway). We could of course attempt a toUnicode whenever we return location.href, but given that it might not match what the user sees anyway (that's a UI thing) I'm not sure that's the best way. I could maybe see value in exposing URL.toUIString(URL url) or some such.
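
As a rough illustration of the ToASCII behaviour described in this comment, the sketch below parses the reporter's URL with the `URL` constructor. It assumes a runtime whose `URL` implementation follows the URL Standard (for example a recent Node.js); the variable names are invented for this example.

```javascript
// Parsing runs the host through ToASCII, so the serialized URL contains
// only ASCII characters even though the input host is Cyrillic.
const parsedUrl = new URL("http://инфо.об-образовании.рф/");
const asciiHost = parsedUrl.hostname;
const asciiHref = parsedUrl.href;

// Each non-ASCII label is replaced by its Punycode form, prefixed "xn--".
console.log(asciiHost.split(".").every((label) => label.startsWith("xn--"))); // true

// Every code unit is now below U+0080, so the string also fits in a ByteString.
const isAscii = (s) => /^[\x00-\x7F]*$/.test(s);
console.log(isAscii(asciiHref)); // true
```

A string serialized this way would pass the ByteString check in setRequestHeader, which is consistent with the observation above that the failing code path does not trigger when location.href is already Punycode.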
Flags: needinfo?(annevk)
> If we did not forbid newlines and such there would be a security issue...

That was presumably handled by the underlying HTTP code in Gecko.  The XHR code had no checks like that (and afaict still does not).

> because the data model for URLs appears to be bytes 

How so?  I mean, there's %-escaping, but apart from that you start with strings, not bytes.

> And because toUnicode can fail

Sure.  So Gecko's internal representation of hostnames is generally the toUnicode one, unless that fails, in which case it's the toASCII one.

> We could of course attempt a toUnicode whenever we return location.href

Why then as opposed to the "parse the host" phase?

I don't recall seeing any discussion on the mailing lists about making location.href punycode and I'm once again really worried about compat fallout...
Even if a URL were all strings their code points would be less than U+0080. For HTTP it pretty clearly is just bytes.

Discussion: http://lists.w3.org/Archives/Public/public-whatwg-archive/2013Sep/0124.html

The "host parser" is required to use ToASCII. It could then follow that with ToUnicode, but I don't see the point. ToUnicode for domain, path, and such, is all about UI and not at all about the data model.
As for compatibility, I guess the same goes for the other browsers :/ It is a major problem for all the "edge" cases in widely deployed features.
Well, what goes on the wire is clearly the ToASCII version.

The question is why anyone else at all would want that version.

> ToUnicode for domain, path, and such, is all about UI and not at all about the data model.

Even if we grant that, why are we completely discounting "UI" (including what web developers see in their debugger!)?
Because UI here is not a simple function of the ASCII variant. E.g. Chrome will only do ToUnicode based on whether they think the user will understand the result.
For their URL bar, sure.  But they can implement that as a custom thing, obviously, as would Firefox for whatever it does for their URL bar.

But that leaves the issue of web developers having to deal with these objects and the unreadable strings they produce.
As I said, we could introduce toUIString() or some such, that would use the same logic (at fingerprinting cost). Users would end up confused if the URL bar and location.href are different, no? And it's not just the URL bar, also link tooltips and such.
> Users would end up confused if the URL bar and location.href are different, no?

Er... the URL bar is definitely not always ASCII, so they're already different, right?  I'm not sure I follow...

Or do you mean that toUIString() would need to use the same heuristics as the url bar?
Yes, I meant that either way you end up with mismatches. And yes, toUIString() would have to use the same heuristics to be useful. Note that the API already exposes methods for converting just the domain portion of the URL: http://url.spec.whatwg.org/#api

We should probably offer APIs for converting the path as well.
Component: DOM → DOM: Core & HTML