Created attachment 8785552
PoC.html

User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36

Steps to reproduce:

Open "PoC.html" and click the string 'Click Me'. A new tab opens whose address bar indicates that the host is '127.0.0.1', while the true host is '%D8%B9%D8%B1%D8%A8%D9%8A.%D8%A7%D9%85%D8%A7%D8%B1%D8%A7%D8%AA'.

PoC.html
-----
<a href='http://%D8%B9%D8%B1%D8%A8%D9%8A.%D8%A7%D9%85%D8%A7%D8%B1%D8%A7%D8%AA/?%DA%86/127.0.0.1/'>Click Me</a>

Actual results:

The address bar incorrectly shows location.host as '127.0.0.1', whereas the real location.host is '%D8%B9%D8%B1%D8%A8%D9%8A.%D8%A7%D9%85%D8%A7%D8%B1%D8%A7%D8%AA'.

Expected results:

The address bar should show the real host.
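To make the report easier to follow, here is a minimal Python sketch that splits the PoC URL into its components and percent-decodes them. It only illustrates which part of the URL is the host and which part carries "127.0.0.1"; the helper name `split_for_display` is made up for this example and has nothing to do with how any browser parses or renders the URL.

```python
from urllib.parse import urlsplit, unquote

# The href from PoC.html, copied verbatim.
POC_URL = (
    "http://%D8%B9%D8%B1%D8%A8%D9%8A.%D8%A7%D9%85%D8%A7%D8%B1%D8%A7%D8%AA"
    "/?%DA%86/127.0.0.1/"
)


def split_for_display(url):
    """Hypothetical helper: return the percent-decoded URL components."""
    parts = urlsplit(url)
    return {
        "host": unquote(parts.netloc),   # the real host (Arabic-script domain)
        "path": unquote(parts.path),     # just "/"
        "query": unquote(parts.query),   # this is where "127.0.0.1" lives
    }


if __name__ == "__main__":
    for name, value in split_for_display(POC_URL).items():
        # repr() shows the logical (memory) order, independent of bidi display.
        print(name, ascii(value))
```

In logical order the numbers come last, inside the query string; only the visual ordering in the address bar makes them look like the host.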
Jonathan, I've tried looking into this but I'm not really sure what's happening here and would appreciate help from someone who knows more about fonts and displaying text. The highlighting of the URI is correct - we just don't display it correctly. Ideas about what's going on? :-)
AFAICS, this is the expected result for the given address, because the entire string of hostname+path resolves to a right-to-left run of text, with the dot-separated numbers being an embedded left-to-right segment within that. The path-separating slashes and the question-mark have neutral directionality, so they adopt RTL from their environment, and the numbers (which are logically the last path element) end up at the left-hand (i.e. trailing) end of the string "عربي.امارات/?چ/127.0.0.1/". I see a similar result in Safari and Chrome, FWIW.

This seems to be a natural consequence of the ambiguity of bidi text: the reader doesn't know, when seeing a URL as quoted above, whether to read it as overall-LTR, with the dot-separated numbers as the domain and the Arabic text as path components, or as overall-RTL, with an Arabic-script domain and the numbers as the final path element.

Maybe we should do something to try and avoid this ambiguity, but I'm not sure what the best solution would be. Any attempt to address this should probably be coordinated with other browsers, too. Obviously, the same issue could arise with Hebrew domain names.

:smontagu, any thoughts here?
If you stick a letter in the middle of the numeric part you see even odder results, e.g.

http://%D8%B9%D8%B1%D8%A8%D9%8A.%D8%A7%D9%85%D8%A7%D8%B1%D8%A7%D8%AA/?%DA%86/127foo1/

There must be some special-casing of numbers, it's not just preserving a LTR label.

If this is how all browsers behave and the behavior is standardized, and it only allows spoofing numeric IP addresses, perhaps this bug doesn't need to be hidden.
Huh, that's interesting -- in this case, I do see a significant difference between Firefox and Chrome's display. It looks like ours is simply the result of running the Unicode bidi algorithm on the whole URL, while Chrome has separated the labels in some way, so that the "127foo1" sequence stays together (whereas the raw UBA behavior treats the initial "127" as belonging with the preceding RTL text, and only the final "1" moves along with "foo" in the final LTR run). I'm not sure offhand if there's a spec that describes in detail exactly how this is expected to work (maybe the W3C i18n group has something?), but it does seem like Chrome's result is at least somewhat less confusing to a user.
(In reply to Jonathan Kew (:jfkthame) from comment #5)
> Huh, that's interesting -- in this case, I do see a significant difference
> between Firefox and Chrome's display. It looks like ours is simply the
> result of running the Unicode bidi algorithm on the whole URL, while Chrome
> has separated the labels in some way, so that the "127foo1" sequence stays
> together (whereas the raw UBA behavior treats the initial "127" as belonging
> with the preceding RTL text, and only the final "1" moves along with "foo"
> in the final LTR run).
>
> I'm not sure offhand if there's a spec that describes in detail exactly how
> this is expected to work (maybe the W3C i18n group has something?), but it
> does seem like Chrome's result is at least somewhat less confusing to a user.

Well... http://xn--ngbrx4e.xn--mgbaam7a8h/?%DA%86/google.com displays as "google.com/" followed by Arabic characters in Chrome. Not so in Firefox.

I haven't experimented with this sufficiently to figure out if there's a way to spoof a domain in Firefox as well, but if so I believe that probably makes the bug more severe.

Reporter, did you report the same issue to Chrome/Chromium, and if so can you link to their report?
This is not a new issue: see e.g. bug 525831, http://www.macchiato.com/bidi-url, https://www.w3.org/International/wiki/IRIStatus#Unicode_Bidirectional_Algorithm_Failure, and http://unicode.org/review/pri185/pri185-rev-UBA-for-URL-IRI.html

(In reply to Daniel Veditz [:dveditz] from comment #4)
> There must be some special-casing of numbers, it's not just preserving a LTR
> label.

This special-casing is built into the Bidi Algorithm itself -- although the numbers 0-9 themselves are always LTR, if a sequence of numbers appears after RTL characters it will be reordered to the left, even in an LTR paragraph, and even with neutral characters like "/" and "." in between, and vice versa. So in your example "127" is reordered to the left of "چ" and "1" to the right of "foo".

A practical consequence of this is that "Always display URLs LTR" is not sufficient to solve all the problems with RTL characters in URLs. This needs to be handled on a spec level and in a coordinated way by browser vendors, especially if Chrome is already doing some proactive spec bending to alleviate the issue.

That said, I don't myself see any difference between Firefox nightly and Chrome (version 52.0.2743.116 on Ubuntu 16.04).
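The number special-casing described above can be illustrated with a small Python sketch. It looks up each character's bidi class with the standard `unicodedata` module and then applies only rule W2 of the Unicode Bidirectional Algorithm (a European number becomes an Arabic number when the nearest preceding strong character is an Arabic letter). This is a deliberately partial sketch, not an implementation of the UBA: it ignores explicit embeddings and isolates, all other W/N/I rules, and the reordering step, and the function name `classes_after_w2` is invented for this example.

```python
import unicodedata

STRONG = {"L", "R", "AL"}  # strong bidi classes consulted by rule W2


def classes_after_w2(text):
    """Return (char, bidi_class) pairs after applying only UBA rule W2.

    Simplification: the start of the text is treated as having no
    preceding strong character, so leading digits stay EN.
    """
    result = []
    last_strong = None
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in STRONG:
            last_strong = cls
        elif cls == "EN" and last_strong == "AL":
            # W2: digits that follow an Arabic letter are retyped as AN,
            # which is why they get pulled into the RTL run.
            cls = "AN"
        result.append((ch, cls))
    return result


if __name__ == "__main__":
    # Query part of the "127foo1" example, in logical order.
    for ch, cls in classes_after_w2("\u0686/127foo1/"):
        print(ascii(ch), cls)
```

For the "127foo1" query, the leading "127" is retyped as AN because the nearest strong character before it is "چ", while the final "1" keeps class EN because the nearest strong character before it is the Latin "o". That matches the split between "127" and "foo1" described in the comment.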
Yep, I did report this issue to Chrome, and the report is here: https://bugs.chromium.org/p/chromium/issues/detail?id=638818
Hm. I considered highlighting the text that currently appears before the domain with a custom selection range and making its foreground color match the bg color (we use a similar trick to highlight the domain already), but unfortunately, it seems like URLs like this: http://عربي.امارات/?%DA%86.........................................................................................................................................................................................................................................................................................................0.0.1/ would then just display nothing visibly at all. Even now, those are kind of... interesting. What's the best way to move forward here and get momentum towards a consensus that's implemented by all browsers, given that that hasn't happened since bug 525831 was filed? Jonathan? Independently, given that the chrome bug is clearly open and that 525831 is open, should we open this up?
I don't see any strong reason to keep this closed, given the existing public issues/discussion of the same kind of thing. It's not currently clear to me what a good way forward here might even look like... there's a fundamental tension between allowing people to use bidi scripts, and avoiding potentially ambiguous/confusing rendering. I think this is primarily a browser UI design problem, and as such I'm not sure whether we need "a consensus that's implemented by all browsers" if there are significant differences between how they approach UI design. Maybe each browser needs to look at its own UI and decide what will best serve its users. For example, Safari partially avoids this problem because it displays _just_ the domain name (not the path) in its URL field, except when actually focused for editing. (Personally, I don't like Safari's solution of hiding the rest of the URL. But there may be other approaches whereby the display could be made clearer/less ambiguous, at least when the field does not have editing focus. I'm not sure where to find the code that currently makes the TLD text black, and the rest of the URL gray, but maybe that could be extended to more clearly distinguish the domain from the path...)
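As one purely hypothetical illustration of "more clearly distinguishing the domain from the path", the sketch below builds a display string in which the decoded host is wrapped in Unicode directional isolates (U+2068 FIRST STRONG ISOLATE and U+2069 POP DIRECTIONAL ISOLATE). Isolates are intended to keep the wrapped text from forming a single bidi run with its surroundings, so the host should stay next to the scheme. This is an assumption-laden sketch: it is not what Firefox, Chrome, or Safari do, the function name `display_with_isolated_host` is invented here, and it does nothing about confusing ordering inside the path or query.

```python
from urllib.parse import urlsplit, unquote

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def display_with_isolated_host(url):
    """Hypothetical helper: decoded URL with the host in a bidi isolate."""
    parts = urlsplit(url)
    host = unquote(parts.netloc)
    rest = unquote(parts.path)
    if parts.query:
        rest += "?" + unquote(parts.query)
    # Only the host is isolated; scheme and the remainder are left as-is.
    return f"{parts.scheme}://{FSI}{host}{PDI}{rest}"


if __name__ == "__main__":
    poc = (
        "http://%D8%B9%D8%B1%D8%A8%D9%8A.%D8%A7%D9%85%D8%A7%D8%B1%D8%A7%D8%AA"
        "/?%DA%86/127.0.0.1/"
    )
    # ascii() prints the logical order with the isolate characters visible.
    print(ascii(display_with_isolated_host(poc)))
```

Whether this reads better in practice would need testing in an actual text field; copy and paste behavior and editing focus would also have to strip the invisible characters.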