Closed Bug 1007125 Opened 10 years ago Closed 6 years ago

URL parsing doesn't match the URL spec for non-Unicode domains

Categories

(Core :: Networking, defect, P5)

x86_64
Linux
defect

Tracking

()

RESOLVED INVALID

People

(Reporter: jcranmer, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-would-take])

Host-parsing of URLs is way too lenient compared to the URL standard:
<http://url.spec.whatwg.org/#concept-host-parser>

I haven't tested steps 1-3, but we clearly fail at step 4 (and therefore never reach the failure return in step 5):
4. Let asciiDomain be the result of running domain to ASCII on domain.
5. If asciiDomain is failure, return failure. 

Also not working properly is that URL.domain should be returning the ToASCII of the domain name, not ToUnicode.

Per UTS#46, something like U+ff01 should fail in all modes of IDNA processing, but:
new URL("http://\uff01/") -> valid URL.
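
A minimal sketch for reproducing this in any engine that exposes the URL API. tryParseHost is a hypothetical helper name, not part of any API; what it returns for the fullwidth input depends on the implementation under test, so no expected value is given for that call.

```javascript
// Hypothetical helper: returns the parsed hostname, or null if the
// URL constructor throws (i.e. the URL parser returned failure).
function tryParseHost(input) {
  try {
    return new URL(input).hostname;
  } catch (e) {
    return null;
  }
}

// The case from this report: U+FF01 FULLWIDTH EXCLAMATION MARK as the host.
// A strict implementation would print null; a lenient one prints a hostname.
console.log(tryParseHost("http://\uff01/"));
```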
Component: DOM → Networking
wdyt?
Flags: needinfo?(valentin.gosu)
Whiteboard: [necko-would-take]
According to http://www.unicode.org/Public/idna/latest/IdnaMappingTable.txt:
FF01          ; disallowed_STD3_mapped ; 0021          # 1.1  FULLWIDTH EXCLAMATION MARK

https://url.spec.whatwg.org/#concept-domain-to-ascii says that UseSTD3ASCIIRules should be set to false when processing, meaning that FF01 gets mapped to !
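
To make the effect of that flag concrete, here is a rough sketch that parses the single table line quoted above and applies it. parseMappingLine and applyEntry are hypothetical names, and the sketch assumes a single-code-point entry and handles only the disallowed_STD3_mapped status; it is not a full UTS#46 implementation.

```javascript
// Parse one single-code-point line of IdnaMappingTable.txt, e.g.
//   "FF01 ; disallowed_STD3_mapped ; 0021 # 1.1 FULLWIDTH EXCLAMATION MARK"
// (Assumption: code point ranges like "0041..005A" are not handled here.)
function parseMappingLine(line) {
  const [data] = line.split("#"); // drop the trailing comment
  const fields = data.split(";").map((f) => f.trim());
  return {
    codePoint: parseInt(fields[0], 16),
    status: fields[1],
    // The mapping is a space-separated list of hex code points.
    mapping: fields[2]
      ? fields[2]
          .split(/\s+/)
          .map((h) => String.fromCodePoint(parseInt(h, 16)))
          .join("")
      : "",
  };
}

// Returns the mapped string, or null when the code point is disallowed.
function applyEntry(entry, useSTD3ASCIIRules) {
  if (entry.status === "disallowed_STD3_mapped") {
    return useSTD3ASCIIRules ? null : entry.mapping;
  }
  throw new Error("status not handled in this sketch: " + entry.status);
}

const entry = parseMappingLine(
  "FF01          ; disallowed_STD3_mapped ; 0021          # 1.1  FULLWIDTH EXCLAMATION MARK"
);
console.log(applyEntry(entry, false)); // mapped, since UseSTD3ASCIIRules is false
console.log(applyEntry(entry, true)); // disallowed under STD3 rules
```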

When talking about web-compat, it gets a bit trickier:
Chrome turns new URL("http://\uff01/") into http://%21/
Safari throws for both "http://\uff01/" and "http://!/", while clicking a link to "http://!/" loads about:blank.
Edge doesn't have a URL API, so I tested using a link and location. Both load http://!/, but loading http://\uff01/ fails, because the FULLWIDTH EXCLAMATION MARK isn't mapped to the regular exclamation mark.

Anne, how do you think we should proceed here?
Flags: needinfo?(valentin.gosu) → needinfo?(annevk)
One way to do this is to add "!" to the blocklist of ASCII code points in https://url.spec.whatwg.org/#concept-host-parser. That would make the host parser return failure, in turn the URL parser would return failure, new URL() would throw, and clicking the link would either not work or lead to an error page, depending on what kind of handling we have for things that are not URLs.
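
Roughly, as a sketch only: the set below is an illustrative placeholder and is not copied from the URL Standard, and parseHostSketch is a hypothetical name covering just the blocklist check, not the whole host parser.

```javascript
// Illustrative blocklist of ASCII code points (assumption: NOT the
// actual list from the URL Standard).
const ILLUSTRATIVE_FORBIDDEN = new Set(["#", "/", ":", "?", "@", "[", "\\", "]"]);

// Returns the domain unchanged, or null (failure) if it contains a
// blocked code point. The caller (the URL parser) would then fail too.
function parseHostSketch(asciiDomain, forbidden) {
  for (const ch of asciiDomain) {
    if (forbidden.has(ch)) return null;
  }
  return asciiDomain;
}

// The proposal: extend the blocklist with "!".
const withBang = new Set([...ILLUSTRATIVE_FORBIDDEN, "!"]);

console.log(parseHostSketch("!", ILLUSTRATIVE_FORBIDDEN)); // accepted without the change
console.log(parseHostSketch("!", withBang)); // failure once "!" is blocked
```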

(I'm assuming we already map U+FF01 to "!", if we don't I wonder what kind of IDNA backend we have. Speaking of IDNA, has there been any kind of fallout from us no longer using Transitional processing? Should we try to get other browsers on board? Off-topic, but interesting...)
Flags: needinfo?(annevk)
(In reply to Anne (:annevk) from comment #3)
> One way to do this is to add "!" to the blocklist of ASCII code points in
> https://url.spec.whatwg.org/#concept-host-parser. That would make the host
> parser return failure, in turn the URL parser would return failure, new
> URL() would throw, and clicking the link would either not work or lead to an
> error page, depending on what kind of handling we have for things that are
> not URLs.
> 
> (I'm assuming we already map U+FF01 to "!",
Yes. That is what we're doing.

> Speaking of IDNA, has there been any kind of
> fallout from us no longer using Transitional processing? Should we try to
> get other browsers on board? Off-topic, but interesting...)

No fallout that I know of. We really should reach out to other browsers about this.
(In reply to Anne (:annevk) from comment #3)
> One way to do this is to add "!" to the blocklist of ASCII code points in
> https://url.spec.whatwg.org/#concept-host-parser. That would make the host
> parser return failure, in turn the URL parser would return failure, new
> URL() would throw, and clicking the link would either not work or lead to an
> error page, depending on what kind of handling we have for things that are
> not URLs.

This sounds like a good plan. Can't think of a reason to allow ! in the hostname.

(In reply to Joshua Cranmer [:jcranmer] from comment #0)
> Also not working properly is that URL.domain should be returning the ToASCII
> of the domain name, not ToUnicode.

Indeed, it also seems we're not doing the correct thing here.
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P5
Our behavior for U+FF01 is now correct per the URL Standard. "!" is not banned so you end up with that as the domain. Safari does the same and Chrome is close.

Bug 1365893 tracks remaining IDNA issues.

> Also not working properly is that URL.domain should be returning the ToASCII of the domain name, not ToUnicode.

This was also fixed somewhat recently.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID