Closed Bug 1178095 Opened 10 years ago Closed 1 year ago

Punycode doesn't handle interpuncts

Categories

(Core :: Networking: DNS, defect, P2)

RESOLVED FIXED
125 Branch
Tracking Status
firefox125 --- fixed

People

(Reporter: manishearth, Assigned: valentin, Mentored)

Details

(Whiteboard: [necko-priority-queue][necko-triaged])

Attachments

(2 files)

Actually, punycode is handled correctly, but the URL bar doesn't display the punycode domain. You can verify that http://www.mail·google.com/ resolves to the same URL by opening the devtools-network tab, and looking at the Host header in the request. For me it shows up as http://www.xn--mailgoogle-ora.com/
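For readers who want to reproduce the mapping described in the comment above, here is a minimal sketch using Python's built-in `punycode` codec. The helper names `to_ascii_label` and `to_ascii_host` are made up for illustration, and the sketch skips the mapping and validation steps that real IDNA processing performs; it only shows the raw label encoding.

```python
MIDDLE_DOT = "\u00b7"  # U+00B7 MIDDLE DOT (interpunct)


def to_ascii_label(label: str) -> str:
    """Encode one DNS label; pure-ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    # Non-ASCII labels get the "xn--" ACE prefix plus the punycode payload.
    return "xn--" + label.encode("punycode").decode("ascii")


def to_ascii_host(host: str) -> str:
    """Encode every label of a hostname (hypothetical helper, no validation)."""
    return ".".join(to_ascii_label(part) for part in host.split("."))


print(to_ascii_host(f"www.mail{MIDDLE_DOT}google.com"))
# prints "www.xn--mailgoogle-ora.com"
```

The printed value matches the Host header quoted in the comment.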
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
(In reply to Valentin Gosu [:valentin] from comment #1)
> Actually, punycode is handled correctly, but the URL bar doesn't display the
> punycode domain.
> You can verify that http://www.mail·google.com/ resolves to the same URL by
> opening the devtools-network tab, and looking at the Host header in the
> request. For me it shows up as http://www.xn--mailgoogle-ora.com/

So, isn't that the point of punycode? That tricky URLs should be *displayed to the user* in a non-tricky form? The bug may not be in punycode.c, but the bug still exists -- don't see how this can be made WFM.
Oh, I understand. Even though I don't personally find the interpunct so similar to a dot (it may be in some fonts), we should consider adding it to the blacklist.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
This patch works for me. http://www.fileformat.info/info/unicode/char/b7/index.htm The other two interpunct-ish symbols (U+2219 ∙ "bullet operator" and U+22C5 ⋅ "dot operator") are both already converted to punycode (probably because they're not basic)
Assignee: nobody → manishearth
Attachment #8627104 - Flags: review?(valentin.gosu)
Comment on attachment 8627104
Add U+00B7 (MIDDLE DOT) to punycode blacklists

Looks good to me. However, I'm not a peer. I see bz reviewed the last patch touching the blacklist.
Attachment #8627104 - Flags: review?(valentin.gosu)
Attachment #8627104 - Flags: review?(bzbarsky)
Attachment #8627104 - Flags: feedback+
Comment on attachment 8627104
Add U+00B7 (MIDDLE DOT) to punycode blacklists

Simon is a much better reviewer for the contents of this blacklist.
Attachment #8627104 - Flags: review?(bzbarsky) → review?(smontagu)
We haven't blacklisted U+00B7 in the past because it's a permitted character in the .cat domain (and I believe also .es) -- see bug 331099 comment 0. That said, I haven't been able to discover any sites that actually use the character, and if Chrome does blacklist it there's something to be said for doing the same. Gerv, would you be OK with adding it?
Perhaps we could add a whitelist for tld-codepoint pairs where if we've determined that we want to punycode something, we can check the tld against this list and not convert it if the only special characters are those from the whitelist? Or just directly blacklist it, doesn't seem to be used.
Flags: needinfo?(gerv)
I'm in favour of doing something like comment 8, possibly via ulocdata_getExemplarSet like Chrome does (bug 1172802 comment 1).
Simon: how hard would comment 8 be? It seems at first glance to be overkill. If Chrome disallows the middle dot, then probably no domain will ever use it, but we should be polite and consult the .cat registry first. I sent them a message via http://fundacio.cat/en/contact . Gerv
Flags: needinfo?(gerv)
(In reply to Gervase Markham [:gerv] from comment #10) > Simon: how hard would comment 8 be? It seems at first glance to be overkill. ulocdata_getExemplarSet is an ICU API, so we could use it today on platforms where that is available. We would need to add some kind of mapping from TLDs to languages, but that wouldn't be insuperable (I don't think it would need to be an exhaustive list from the word go).
I'm Pep Masoliver, Technical Manager of the .cat registry. Middle dot (·) is currently used on some .cat domains, such as www.l·l.cat, www.col·legi-lluis-vives.cat or www.fal·leragironina.cat. As you see, it's the central item of the Catalan's genuine construction "ela geminada", which exactly defines the "l·l" construction. According to the Unicode recommendations, this construction is composed of U+006C (l), U+00B7 (·) and U+006C (l) characters. The middle dot cannot be used outside this construction on .cat domains, nor in the Catalan language.
(In reply to Pep Masoliver from comment #12)
> I'm Pep Masoliver, Technical Manager of the .cat registry.
>
> Middle dot (·) is currently used on some .cat domains, such as
> www.l·l.cat, www.col·legi-lluis-vives.cat or www.fal·leragironina.cat. As
> you see, it's the central item of the Catalan's genuine construction "ela
> geminada", which exactly defines the "l·l" construction.
>
> According to the Unicode recommendations, this construction is composed of
> U+006C (l), U+00B7 (·) and U+006C (l) characters. The middle dot cannot be
> used outside this construction on .cat domains, nor in the Catalan language.

Would there be a major impact if domains containing such characters had their representations in the URL bar mangled via punycode? Chrome already does this (http://www.xn--ll-0ea.cat/ for www.l·l.cat). This doesn't change how the URLs work, just that they get displayed differently in the URL bar.
Chrome shows the URL correctly on my Fedora 20 laptop (43.0.2357) and also on a colleague's Mac (OS X Yosemite 10.10.4, Chrome 43.0.2357.134), but we've seen that it behaves erratically on Windows: it mostly shows the correct IDN URL, but sometimes it shows the punycode equivalent. We haven't been able to determine whether there is a pattern yet. This is not a technical argument, but in our opinion it's not comfortable for end users to see a URL with strange characters. It's also hard for them to understand that the domain in question is reliable, and harder still that it's working properly.
I think that if a character is in use, clearly part of an established orthography and generally working everywhere, we shouldn't break it. If we want to add a restriction that it may only occur between two Ls, we can, but that's our problem to work out how to do that. I also don't want to get into the game of trying to work out what "language" a particular domain name is in. Domain names don't come with language information. So I think we should do either: a) nothing; or b) add a custom restriction to allow MIDDLE DOT only between two "l" characters. Gerv
b) would be a two- or three-line patch to nsIDNService::IsLabelSafe, and I think it's probably worth doing. It only goes so far to solve the originally reported issue, of course: http://www.mail·google.com/ would be displayed in punycode, but http://www.mail·lycos.com/ would have to be displayed as-is.
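Option b) can be illustrated with a short sketch. This is not the nsIDNService code; `middle_dot_between_ls` is a hypothetical helper, and it assumes the label has already been lowercased. It only checks that every U+00B7 has an "l" on both sides.

```python
MIDDLE_DOT = "\u00b7"  # U+00B7 MIDDLE DOT


def middle_dot_between_ls(label: str) -> bool:
    """Return True if every middle dot in the label sits between two 'l's.

    Hypothetical helper; assumes the label is already lowercased.
    """
    for i, ch in enumerate(label):
        if ch != MIDDLE_DOT:
            continue
        # A dot at either end of the label has no neighbour on one side.
        if i == 0 or i == len(label) - 1:
            return False
        if label[i - 1] != "l" or label[i + 1] != "l":
            return False
    return True


for label in ("mail\u00b7google", "mail\u00b7lycos", "col\u00b7legi-lluis-vives"):
    print(label, middle_dot_between_ls(label))
```

Under this rule mail·google fails the check while mail·lycos passes, which is the limitation pointed out in the comment above.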
Mentor: smontagu
Comment on attachment 8627104
Add U+00B7 (MIDDLE DOT) to punycode blacklists

Review of attachment 8627104:
-----------------------------------------------------------------

So we won't take this approach to fixing the bug.
Attachment #8627104 - Flags: review?(smontagu) → review-
I'm not a fan of rules for specific characters, but I think this slope isn't very slippery. It's a character which looks like a protocol character (and there aren't too many of those) and it is used in one specific limited context, which we can easily check for. So I think we can do this, but notionally reserving the right to not implement a similar solution for other characters if the rules are more complex or the slope-to-madness turns out to be slippery after all. Gerv
Whiteboard: [necko-backlog]
Priority: -- → P1
Priority: P1 → P3
Assignee: manishearth → nobody
Severity: normal → S3

I think we want to follow Chrome's lead here.
We've imported their test cases in bug 1790163 - currently disabled here

The fix would probably go here - if we want to exclude .cat from the rule, we need to pass the TLD to nsIDNService::isLabelSafe

Blocks: 1800628
Status: REOPENED → NEW
Priority: P3 → P2
See Also: → 1790163

Bumping into priority-review (mcmanus put it in the backlog like 7 years ago)

Whiteboard: [necko-backlog] → [necko-priority-review][necko-triaged]
Whiteboard: [necko-priority-review][necko-triaged] → [necko-priority-next][necko-triaged]
Assignee: nobody → valentin.gosu
Whiteboard: [necko-priority-next][necko-triaged] → [necko-priority-queue][necko-triaged]

Since some characters are only allowed in certain TLDs this patch
adds a TLD argument to isLabelSafe.

It also adds a check to nsIDNService to make sure that the interpunct
(U+00B7) is only allowed on Catalan domains between two l's.
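The behaviour this patch description outlines can be sketched roughly as follows. This is an illustration and not the landed C++ code: `is_label_safe` and `display_host` are hypothetical Python names, the TLD is assumed to be simply the last label, and only the middle-dot rule is modelled (the real label checks cover much more).

```python
MIDDLE_DOT = "\u00b7"  # U+00B7 MIDDLE DOT


def is_label_safe(label: str, tld: str) -> bool:
    """Middle-dot rule only: allowed on .cat, and only between two 'l's."""
    for i, ch in enumerate(label):
        if ch != MIDDLE_DOT:
            continue
        if tld != "cat":
            return False
        inside = 0 < i < len(label) - 1
        if not (inside and label[i - 1] == "l" and label[i + 1] == "l"):
            return False
    return True


def display_host(host: str) -> str:
    """Show each label as Unicode if safe, otherwise as punycode."""
    labels = host.lower().split(".")
    tld = labels[-1]  # assumption: the TLD is the last label
    shown = [
        label if is_label_safe(label, tld)
        else "xn--" + label.encode("punycode").decode("ascii")
        for label in labels
    ]
    return ".".join(shown)


print(display_host("www.col\u00b7legi-lluis-vives.cat"))  # stays in Unicode
print(display_host("www.mail\u00b7google.com"))  # www.xn--mailgoogle-ora.com
```

Passing the TLD alongside the label is what lets the same label be shown as Unicode under .cat and as punycode elsewhere.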

Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/19e827ffaa4b Punycode doesn't handle interpuncts r=hsivonen,necko-reviewers,kershaw
Status: NEW → RESOLVED
Closed: 10 years ago → 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 125 Branch
Blocks: 1885096
Regressions: 1887815
See Also: → 1891760