Closed
Bug 1178095
Opened 10 years ago
Closed 1 year ago
Punycode doesn't handle interpuncts
Categories
(Core :: Networking: DNS, defect, P2)
RESOLVED
FIXED
125 Branch
| | Tracking | Status |
|---|---|---|
| firefox125 | --- | fixed |
People
(Reporter: manishearth, Assigned: valentin, Mentored)
References
Details
(Whiteboard: [necko-priority-queue][necko-triaged])
Attachments
(2 files)
E.g. http://www.mail·google.com/
Chrome nicely converts this into `http://www.xn--mailgoogle-ora.com/`.
I see that we're using the original punycode file from the RFC [1], while Chrome uses a modified version [2].
[1]: https://dxr.mozilla.org/mozilla-central/source/netwerk/dns/punycode.c
[2]: https://code.google.com/p/chromium/codesearch#chromium/src/third_party/icu/source/common/punycode.cpp&q=punycode&sq=package:chromium&l=1
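For illustration, the conversion Chrome performs here can be reproduced with Python's standard `punycode` codec. This is a minimal sketch, not the code in either browser: the helper names `to_ascii_label` and `to_ascii_host` are made up for this example, and it skips the IDNA mapping and validation steps that a real implementation runs before encoding.

```python
def to_ascii_label(label: str) -> str:
    """Return the "xn--" form of a label, or the label unchanged if it is pure ASCII."""
    if label.isascii():
        return label
    # The punycode codec yields the part after the "xn--" prefix.
    return "xn--" + label.encode("punycode").decode("ascii")


def to_ascii_host(host: str) -> str:
    """Encode each dot-separated label of a hostname independently."""
    return ".".join(to_ascii_label(part) for part in host.split("."))


print(to_ascii_host("www.mail·google.com"))  # prints www.xn--mailgoogle-ora.com
```

Only the label that contains U+00B7 changes; `www` and `com` are left alone.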
Comment 1•10 years ago (Assignee)
Actually, punycode is handled correctly, but the URL bar doesn't display the punycode domain.
You can verify that http://www.mail·google.com/ resolves to the same URL by opening the devtools-network tab, and looking at the Host header in the request. For me it shows up as http://www.xn--mailgoogle-ora.com/
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Comment 2•10 years ago (Reporter)
(In reply to Valentin Gosu [:valentin] from comment #1)
> Actually, punycode is handled correctly, but the URL bar doesn't display the
> punycode domain.
> You can verify that http://www.mail·google.com/ resolves to the same URL by
> opening the devtools-network tab, and looking at the Host header in the
> request. For me it shows up as http://www.xn--mailgoogle-ora.com/
So, isn't that the point of punycode? That tricky URLs should be *displayed to the user* in a non-tricky form?
The bug may not be in punycode.c, but the bug still exists -- don't see how this can be made WFM.
Comment 3•10 years ago (Assignee)
Oh, I understand. Even though I don't personally find the interpunct so similar to a dot (it may be in some fonts), we should consider adding it to the blacklist.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Comment 4•10 years ago (Reporter)
This patch works for me.
http://www.fileformat.info/info/unicode/char/b7/index.htm
The other two interpunct-ish symbols (U+2219 ∙ "bullet operator" and U+22C5 ⋅ "dot operator") are both already converted to punycode (probably because they're not basic)
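As a quick reference, the three look-alike code points can be inspected with Python's `unicodedata` module. This sketch only shows that they are distinct characters with different general categories; it does not model how Firefox decides which of them to display, and the `describe` helper is a name chosen for this example.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in ("\u00b7", "\u2219", "\u22c5"):
    print(*describe(ch))
```

U+00B7 is classified as punctuation, while the other two are math symbols.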
Assignee: nobody → manishearth
Attachment #8627104 - Flags: review?(valentin.gosu)
Comment 5•10 years ago (Assignee)
Comment on attachment 8627104 [details] [diff] [review]
Add U+00B7 (MIDDLE DOT) to punycode blacklists
Looks good to me.
However, I'm not a peer. I see bz reviewed the last patch touching the blacklist.
Attachment #8627104 - Flags: review?(valentin.gosu)
Attachment #8627104 - Flags: review?(bzbarsky)
Attachment #8627104 - Flags: feedback+
Comment 6•10 years ago
Comment on attachment 8627104 [details] [diff] [review]
Add U+00B7 (MIDDLE DOT) to punycode blacklists
Simon is a much better reviewer for the contents of this blacklist.
Attachment #8627104 - Flags: review?(bzbarsky) → review?(smontagu)
Comment 7•10 years ago
We haven't blacklisted U+00B7 in the past because it's a permitted character in the .cat domain (and I believe also .es) -- see bug 331099 comment 0.
That said, I haven't been able to discover any sites that actually use the character, and if Chrome does blacklist it there's something to be said for doing the same. Gerv, would you be OK with adding it?
Comment 8•10 years ago (Reporter)
Perhaps we could add a whitelist of TLD-codepoint pairs: once we've determined that we want to punycode something, we check the TLD against this list and skip the conversion if the only special characters are ones whitelisted for that TLD?
Or just blacklist it directly; it doesn't seem to be used.
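A rough sketch of the whitelist idea, for illustration only. The tables `BLACKLISTED` and `TLD_EXEMPTIONS` and the function `show_as_punycode` are hypothetical; the real blacklist is much longer and none of these names come from the Firefox source.

```python
# Hypothetical data: characters that force punycode display, and per-TLD exemptions.
BLACKLISTED = {"\u00b7"}
TLD_EXEMPTIONS = {"cat": {"\u00b7"}}


def show_as_punycode(label: str, tld: str) -> bool:
    """True if the label has a blacklisted character that this TLD does not exempt."""
    offending = {ch for ch in label if ch in BLACKLISTED}
    exempt = TLD_EXEMPTIONS.get(tld.lower(), set())
    return bool(offending - exempt)


print(show_as_punycode("mail·google", "com"))  # True
print(show_as_punycode("col·legi", "cat"))     # False
```

The exemption applies only when every offending character is whitelisted for the TLD, which matches the "only special characters are those from the whitelist" condition.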
Flags: needinfo?(gerv)
Comment 9•10 years ago
I'm in favour of doing something like comment 8, possibly via ulocdata_getExemplarSet like Chrome does (bug 1172802 comment 1).
Comment 10•10 years ago
Simon: how hard would comment 8 be? It seems at first glance to be overkill.
If Chrome disallows the middle dot, then probably no domain will ever use it, but we should be polite and consult the .cat registry first.
I sent them a message via http://fundacio.cat/en/contact .
Gerv
Flags: needinfo?(gerv)
Comment 11•10 years ago
(In reply to Gervase Markham [:gerv] from comment #10)
> Simon: how hard would comment 8 be? It seems at first glance to be overkill.
ulocdata_getExemplarSet is an ICU API, so we could use it today on platforms where that is available. We would need to add some kind of mapping from TLDs to languages, but that wouldn't be insuperable (I don't think it would need to be an exhaustive list from the word go).
Comment 12•10 years ago
I'm Pep Masoliver, Technical Manager of the .cat registry.
The middle dot (·) is currently used in some .cat domains, such as www.l·l.cat, www.col·legi-lluis-vives.cat or www.fal·leragironina.cat. As you can see, it's the central element of the genuinely Catalan construction "ela geminada", which is exactly the "l·l" construction.
According to the Unicode recommendations, this construction is composed of the characters U+006C (l), U+00B7 (·) and U+006C (l). The middle dot cannot be used outside this construction in .cat domains, nor in the Catalan language.
Comment 13•10 years ago (Reporter)
(In reply to Pep Masoliver from comment #12)
> I'm Pep Masoliver, Technical Manager of the .cat registry.
>
> Middle dot (·) is currently used on some .cat domains, such like
> www.l·l.cat, www.col·legi-lluis-vives.cat or www.fal·leragironina.cat. As
> you see, it's the central item of the Catalan's genuine construction "ela
> geminada", which exactly defines the "l·l" construction.
>
> According to the Unicode recommendations, this construction is composed by
> U+006C (l), U+00B7 (·) and U+006C (l) characters. The middle dot cannot be
> used outside this construction on .cat domains, nor in the Catalan language.
Would there be a major impact if domains containing such characters had their representations in the URL bar mangled via punycode? Chrome already does this (http://www.xn--ll-0ea.cat/ for www.l·l.cat). This doesn't change how the URLs work, just that they get displayed differently in the URL bar.
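To illustrate that the two spellings name the same label, the punycode form can be decoded back with Python's standard codec. This is a sketch under the assumption that no IDNA mapping is involved; `to_unicode_label` is a helper name invented for this example.

```python
def to_unicode_label(label: str) -> str:
    """Decode an "xn--" label to its Unicode form; other labels pass through unchanged."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


print(to_unicode_label("xn--ll-0ea"))  # prints l·l
```

Both forms refer to the same domain on the wire; only the URL bar rendering differs.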
Comment 14•10 years ago
Chrome shows the URL correctly on my Fedora 20 laptop (43.0.2357) and also on a colleague's Mac (OS X Yosemite 10.10.4, Chrome 43.0.2357.134), but we've seen that it behaves erratically on Windows: it mostly shows the correct IDN URL, but sometimes it shows the punycode equivalent. We haven't yet been able to determine whether there is a pattern.
This isn't a technical argument, but in our opinion it's not comfortable for end users to see a URL with strange characters. It also makes it hard for them to tell that the domain is trustworthy, and harder still that it's working properly.
Comment 15•10 years ago
I think that if a character is in use, clearly part of an established orthography and generally working everywhere, we shouldn't break it. If we want to add a restriction that it may only occur between two Ls, we can, but that's our problem to work out how to do that.
I also don't want to get into the game of trying to work out what "language" a particular domain name is in. Domain names don't come with language information.
So I think we should do either:
a) nothing; or
b) add a custom restriction to allow MIDDLE DOT only between two "l" characters.
Gerv
Comment 16•10 years ago
b) would be a two- or three-line patch to nsIDNService::IsLabelSafe, and I think it's probably worth doing.
It only goes so far to solve the originally reported issue, of course: http://www.mail·google.com/ would be displayed in punycode, but http://www.mail·lycos.com/ would have to be displayed as-is.
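A minimal sketch of option b), written in Python rather than the C++ of nsIDNService. The function name `middle_dots_between_ls` is hypothetical, and the sketch assumes the label has already been lowercased.

```python
MIDDLE_DOT = "\u00b7"


def middle_dots_between_ls(label: str) -> bool:
    """True if every U+00B7 in the label has an 'l' directly before and after it."""
    for i, ch in enumerate(label):
        if ch != MIDDLE_DOT:
            continue
        # A dot at either end of the label cannot be between two l's.
        if i == 0 or i == len(label) - 1:
            return False
        if label[i - 1] != "l" or label[i + 1] != "l":
            return False
    return True


print(middle_dots_between_ls("mail·google"))  # False: would be shown in punycode
print(middle_dots_between_ls("mail·lycos"))   # True: would be shown as-is
```

The two calls reproduce the mail·google and mail·lycos cases described above.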
Mentor: smontagu
Comment 17•10 years ago
Comment on attachment 8627104 [details] [diff] [review]
Add U+00B7 (MIDDLE DOT) to punycode blacklists
Review of attachment 8627104 [details] [diff] [review]:
-----------------------------------------------------------------
So we won't take this approach to fixing the bug.
Attachment #8627104 - Flags: review?(smontagu) → review-
Comment 18•10 years ago
I'm not a fan of rules for specific characters, but I think this slope isn't very slippery. It's a character which looks like a protocol character (and there aren't too many of those) and it is used in one specific limited context, which we can easily check for. So I think we can do this, but notionally reserving the right to not implement a similar solution for other characters if the rules are more complex or the slope-to-madness turns out to be slippery after all.
Gerv
Updated•9 years ago
Whiteboard: [necko-backlog]
Comment 19•8 years ago
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Comment 20•8 years ago
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3
Updated•3 years ago
Assignee: manishearth → nobody
Updated•3 years ago
Severity: normal → S3
Comment 21•2 years ago (Assignee)
I think we want to follow Chrome's lead here.
We've imported their test cases in bug 1790163 - currently disabled here
The fix would probably go here - if we want to exclude .cat from the rule, we need to pass the TLD to nsIDNService::isLabelSafe
Comment 22•2 years ago
Bumping into priority-review (mcmanus put it in the backlog about 7 years ago)
Whiteboard: [necko-backlog] → [necko-priority-review][necko-triaged]
Updated•1 year ago (Assignee)
Whiteboard: [necko-priority-review][necko-triaged] → [necko-priority-next][necko-triaged]
Updated•1 year ago (Assignee)
Assignee: nobody → valentin.gosu
Whiteboard: [necko-priority-next][necko-triaged] → [necko-priority-queue][necko-triaged]
Comment 23•1 year ago (Assignee)
Since some characters are only allowed in certain TLDs, this patch
adds a TLD argument to isLabelSafe.
It also adds a check to nsIDNService to make sure that the interpunct
(U+00B7) is only allowed on Catalan domains, between two l's.
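The combined rule might look roughly like the following Python sketch. It is not the landed patch: `is_label_safe` is a simplified stand-in for isLabelSafe that models only the middle-dot rule, and reading "Catalan domains" as the `.cat` TLD is an assumption.

```python
MIDDLE_DOT = "\u00b7"


def is_label_safe(label: str, tld: str) -> bool:
    """Simplified stand-in: accept U+00B7 only under .cat and only between two l's."""
    for i, ch in enumerate(label):
        if ch != MIDDLE_DOT:
            continue
        # Assumption: "Catalan domains" means the .cat TLD.
        if tld.lower() != "cat":
            return False
        between_ls = 0 < i < len(label) - 1 and label[i - 1] == "l" and label[i + 1] == "l"
        if not between_ls:
            return False
    return True


print(is_label_safe("col·legi", "cat"))    # True
print(is_label_safe("mail·lycos", "com"))  # False
```

Under this stricter rule a label such as mail·lycos under .com is no longer displayed as-is, since the TLD check fails before the neighbouring characters are considered.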
Comment 24•1 year ago
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/19e827ffaa4b
Punycode doesn't handle interpuncts r=hsivonen,necko-reviewers,kershaw
Comment 25•1 year ago (bugherder)
Status: NEW → RESOLVED
Closed: 10 years ago → 1 year ago
status-firefox125:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → 125 Branch