Closed Bug 329202 Opened 18 years ago Closed 17 years ago

The URL bar encodes non-Latin URLs as $LANG/hex instead of UTF8/hex

Categories

(Firefox :: Address Bar, defect)

2.0 Branch
x86
Linux
defect
Not set
normal

RESOLVED INVALID

People

(Reporter: ilatypov, Unassigned)

Details

(Whiteboard: CLOSEME 07/24)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060209 Debian/1.5.dfsg+1.5.0.1-2 Firefox/1.5.0.1
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060209 Debian/1.5.dfsg+1.5.0.1-2 Firefox/1.5.0.1


This issue concerns how Firefox converts Unicode characters that the user types into the address bar as part of the URL path or query.  The HTML standard recommends that browsers represent the user input as a sequence of Unicode characters, encode it as UTF-8, and then percent-encode (HEX) the resulting bytes before sending the HTTP GET request.  [The reason for this recommendation is that the browser does not know which encoding the server prefers or the page uses.]

Pasting the Unicode Cyrillic (U+0x400..U+0x4ff) link generated below into Firefox's address bar and hitting Enter unexpectedly converts it to the 8-bit (0..0xff) Cyrillic character set KOI8-R before encoding it with UTF-8/HEX.  My environment variable LANG is set to "ru_RU.KOI8-R".

The second link uses Latin (U+0..U+0xff) characters only.  It is encoded correctly.  Perhaps the unexpected transformation of Unicode characters to the 8-bit character set specified in $LANG (KOI8-R) happens to be an identity mapping for these characters.

Curiously, clicking a Unicode Cyrillic link inside a UTF-8 web page works as expected: the Unicode characters are encoded directly to UTF-8/HEX.


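# Decode a %XX-escaped UTF-8 string, build a Wikipedia URL from it, and
# open the result in gedit so that it can be copied into the address bar.
decode_utf_hex() {
  local utfhex="$1"
  local decoded url
  # Replace every %XX escape with the raw byte it represents.
  decoded=$(printf '%s' "$utfhex" \
    | perl -pe 's/%([a-fA-F0-9]{2})/chr(hex($1))/eg;')
  url="http://en.wikipedia.org/wiki/$decoded"
  printf '%s\n' "$url" > /tmp/f
  gedit --encoding=utf-8 /tmp/f
}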


# "kirillitsa" in U+0x400..U+0x4ff
decode_utf_hex "%D0%BA%D0%B8%D1%80%D0%B8%D0%BB%D0%BB%D0%B8%D1%86%D0%B0"

# "Re'sume'" in U+0..U+0xff
decode_utf_hex "R%C3%A9sum%C3%A9"

----------------- 

Note 1.  Using Wikipedia here is just a convenience.  

Note 2.  There isn't any assumed conversion/redirection of the link on the server side.

Note 3.  I am intentionally avoiding any Unicode symbols in the text of this bug because this Bugzilla's server character set isn't UTF-8.


Reproducible: Always

Steps to Reproduce:
The URL expected in the address bar:

http://en.wikipedia.org/wiki/%D0%9A%D0%B8%D1%80%D0%B8%D0%BB%D0%BB%D0%B8%D1%86%D0%B0

The URL actually observed:

http://en.wikipedia.org/wiki/%C3%8B%C3%89%C3%92%C3%89%C3%8C%C3%8C%C3%89%C3%83%C3%81

Note.  The bug occurs only on Linux.  On Windows, the Unicode URLs are encoded correctly, i.e. directly to UTF-8/HEX.
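
The relationship between the expected and the observed URL can be sketched in Python.  This is only a model of the transformation described in this report, not Firefox code: the function names expected_encoding and observed_encoding are hypothetical, and the assumption is that the text is first converted to the $LANG charset (KOI8-R), that each resulting byte is then misread as a Latin-1 code point, and that the result is finally encoded as UTF-8/HEX.

```python
from urllib.parse import quote

# "kirillitsa" in Cyrillic, written with escapes to keep this text ASCII-only.
WORD = "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"


def expected_encoding(text: str) -> str:
    """Percent-encode the UTF-8 bytes of the text (the expected behaviour)."""
    return quote(text, encoding="utf-8")


def observed_encoding(text: str, locale_charset: str = "koi8-r") -> str:
    """Hypothetical model of the bug: convert to the locale charset, misread
    each resulting byte as a Latin-1 code point, then encode as UTF-8/hex."""
    legacy_bytes = text.encode(locale_charset)
    misread = legacy_bytes.decode("latin-1")
    return quote(misread, encoding="utf-8")


print(expected_encoding(WORD))
print(observed_encoding(WORD))
```

Under these assumptions the first line printed is the lower-case value passed to decode_utf_hex, and the second is the path component of the observed URL.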
(In reply to comment #1)
> The URL expected in the address bar:
> 
> http://en.wikipedia.org/wiki/%D0%9A%D0%B8%D1%80%D0%B8%D0%BB%D0%BB%D0%B8%D1%86%D0%B0

Here Wikipedia changes the first letter to upper case via an HTTP redirect, which contaminates the experiment.  Firefox's expected role here is only to encode the Unicode URL to UTF-8/HEX, producing the value that was passed to decode_utf_hex:

http://en.wikipedia.org/wiki/%D0%BA%D0%B8%D1%80%D0%B8%D0%BB%D0%BB%D0%B8%D1%86%D0%B0
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9a1) Gecko/20060303 Firefox/1.6a1 ID:0000000000
Works fine for me, but my environment LANG variable is en_GB.UTF-8.
Ilguiz, are you able to reproduce this with a current 2.x, 3.x, or trunk build?
Thanks.
Whiteboard: CLOSEME 2007-07-24
Yes, the bug remains.  I am using Iceweasel 2.0.0.4 with Gecko/20070508 from Debian unstable.  I did not test Firefox 3, though.

Simply clicking the correct non-Latin link above takes me to the expected Wikipedia page.  When I copy the name of the page from its contents to the clipboard, paste it over the UTF-8/hex-encoded part of the URL, and hit Enter, I am taken to a different page with a garbled link.

I don't think this is because of Wikipedia's server-side rewriting of URLs.  See, for example, a Usemod wiki page,

http://ei.homeip.net/wiki?кириллица

Copy the name of the page, paste it into the URL in place of the UTF-8/hex-encoded part, and hit Enter to reproduce the issue.
To sum up, Firefox 2.0 sends the $LANG hex encoding of a Unicode URL instead of the UTF-8 hex encoding.
Summary: Unexpected Unicode-to-$LANG transformation when typing in a non-Latin path or query. → The URL bar encodes non-Latin URLs as $LANG/hex instead of UTF8/hex
http://localhost/тест.txt

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.5pre) Gecko/20070703 BonEcho/2.0.0.5pre
en_GB.UTF-8:  http://localhost/%D1%82%D0%B5%D1%81%D1%82.txt
ru_RU.KOI8-R: http://localhost/%D4%C5%D3%D4.txt (and the file cannot be loaded)

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a7pre) Gecko/20070703 Minefield/3.0a7pre
en_GB.UTF-8:  http://localhost/%D1%82%D0%B5%D1%81%D1%82.txt
ru_RU.KOI8-R: http://localhost/%D1%82%D0%B5%D1%81%D1%82.txt

That's because the default for network.standard-url.encode-utf8 is "false" on the former, and "true" on the latter.
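
The four results above can be reproduced with a small Python model.  This is a sketch of the reported behaviour only, not of how Gecko implements the pref: the helper encode_path and its parameters are hypothetical, and it assumes that with the pref off the path is percent-encoded in the locale charset, and with the pref on it is always percent-encoded as UTF-8.

```python
from urllib.parse import quote

# "/test.txt" with a Cyrillic base name, escaped to keep this text ASCII-only.
PATH = "/\u0442\u0435\u0441\u0442.txt"


def encode_path(path: str, encode_utf8: bool, locale_charset: str) -> str:
    """Hypothetical model of network.standard-url.encode-utf8.

    When the pref is true the path is percent-encoded as UTF-8; otherwise
    it is percent-encoded in the charset taken from the locale ($LANG).
    """
    charset = "utf-8" if encode_utf8 else locale_charset
    return quote(path, safe="/", encoding=charset)


# Print one line per pref value and locale charset combination.
for pref in (False, True):
    for charset in ("utf-8", "koi8-r"):
        print(pref, charset, encode_path(PATH, pref, charset))
```

Only the combination of a false pref and a KOI8-R locale yields /%D4%C5%D3%D4.txt; the other three combinations yield the UTF-8 form.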

This bug is either INVALID, or DUPLICATE of a bug that caused the default change.
Version: unspecified → 2.0 Branch
Status: UNCONFIRMED → RESOLVED
Closed: 17 years ago
Resolution: --- → DUPLICATE
No, it's not a duplicate of bug 105909.
1. That bug doesn't cause Error 404.
2. See comment #37 there about the pref.
Status: RESOLVED → UNCONFIRMED
Resolution: DUPLICATE → ---
I would think that the reason the file could not be loaded was the default value of the option you pointed me to, 

> network.standard-url.encode-utf8=false

I understand that changing the default value of the above option to true fixes my problem.  (I don't even see a reason to keep this option configurable.)

Correct me if I am wrong.
Status: UNCONFIRMED → RESOLVED
Closed: 17 years ago
Resolution: --- → INVALID
Whiteboard: CLOSEME 2007-07-24 → CLOSEME 07/24