Handling of non ASCII characters in URLs

Status: NEW
Product: Core
Component: Internationalization
Opened: 15 years ago
Updated: 7 months ago
Reporter: Christian Roslawski
Assigned to: Jungshik Shin
Blocks: 1 bug
Keywords: intl
Version: Trunk
Attachments: 1 attachment

(Reporter)

Description

15 years ago
If a mailto-link contains url-encoded characters above %7f
(e.g. %A7 for "§" or %B4 for "´"), then the parameter parsing
appears to be confused. Url-encoded characters like "&" (%26)
and "=" (%3D) are suddenly interpreted just like they aren't
url-encoded. Furthermore the status bar doesn't show the link-
url if the mouse moves over the link.

The following link should open the mail client with a recipient
"test@test.com" and a subject "test§test&body=test". Netscape
4.78 actually does that. Mozilla sets the subject to "test§test"
and sets the body to "test":

  mailto:test@test.com?subject=test%A7test%26body%3Dtest
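
A minimal sketch of the parsing order the report expects: split the query on the literal "&" and "=" first, and only then percent-decode each piece, so %26 and %3D stay inside the subject. This uses Python's urllib.parse as a stand-in, not Mozilla's parser; parse_mailto is a hypothetical helper, and decoding the bytes as ISO-8859-1 is an assumption taken from the reporter's document encoding.

```python
from urllib.parse import parse_qs


def parse_mailto(url, charset="iso-8859-1"):
    """Split a mailto: URL into (recipient, headers).

    Hypothetical helper for illustration only. The charset default is an
    assumption based on the reporter's ISO-8859-1 document encoding.
    """
    scheme, _, rest = url.partition(":")
    if scheme.lower() != "mailto":
        raise ValueError("not a mailto: URL")
    recipient, _, query = rest.partition("?")
    # parse_qs splits on the literal '&' and '=' before it percent-decodes,
    # so an escaped %26 or %3D can never start a new parameter.
    fields = parse_qs(query, encoding=charset)
    return recipient, {name: values[0] for name, values in fields.items()}


recipient, headers = parse_mailto(
    "mailto:test@test.com?subject=test%A7test%26body%3Dtest"
)
print(recipient)
print(headers)
```

With this order of operations the whole string "test§test&body=test" ends up in the subject and no body field is produced.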


The following link should open the mail client with the subject
"test´&test". Mozilla sets the subject to "test´" and drops the
rest:

  mailto:test@test.com?subject=test%B4%26test


Everything seems to work fine in Mozilla when the above-%7f
character isn't encoded, e.g.:

  mailto:test@test.com?subject=test§test%26body%3Dtest


I didn't have much time to look into this, sorry. Maybe it's just
me who is confused or who didn't understand url-encoding. Maybe
there's something in an RFC about url-encoding for mailto which
I don't know yet. Sorry then. I think the behaviour is strange
enough to report it.

- Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.2a) Gecko/20020910
- Character Encoding Western (ISO 8859-1), auto-detect off
- Mozilla Mail is default mailer (IMAP-SSL account)
- Windows 2000 SP2 (german)

I only found this statement from July 31, 2001 on
http://mozilla-evangelism.bclary.com/letter/news.html, which might
point at the same problem:

  "Also, when preparing the string containing the letter for use
   in the mailto: url assignment, it appears that ISO-8859-2
   strings are truncated when using the JS function escape() to
   prepare them for use in the URL. I am not sure if I am causing
   this problem or if it is a problem in Mozilla."

Comment 1

15 years ago
Confirming on Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2a) Gecko/2002091318.

I recently had a very similar bug which proved to be invalid for a reason which
does not apply here, see bug 168793. CCing Niklas who helped to solve it.

pi
Assignee: asa → harishd
Status: UNCONFIRMED → NEW
Component: Browser-General → Parser
Ever confirmed: true
OS: Windows 2000 → All
QA Contact: asa → moied
Hardware: PC → All

Comment 2

15 years ago
Created attachment 99657 [details]
Testcase with reporter's information
->Networking. Parser is HTML only, not appropriate for the URL parser.
Assignee: harishd → new-network-bugs
Component: Parser → Networking
QA Contact: moied → benc

Comment 4

15 years ago
It actually is related to bug 168793, namely because everything above %7F is
supposed to be encoded with two or more bytes. Why is that? Because everything
not ASCII (> %7F) _should_ be encoded with UTF-8, which uses one byte for the
first 128 chars and two for the next range (and then three and four bytes for
characters in still higher positions in Unicode). Excerpt from the HTML spec:

****
B.2.1 Non-ASCII characters in URI attribute values

Although URIs do not contain non-ASCII values (see [URI], section 2.1) authors
sometimes specify them in attribute values expecting URIs (i.e., defined with
%URI; in the DTD). For instance, the following href value is illegal:

<A href="http://foo.org/Håkon">...</A>

We recommend that user agents adopt the following convention for handling
non-ASCII characters in such cases:

   1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
   2. Escape these bytes with the URI escaping mechanism (i.e., by converting
each byte to %HH, where HH is the hexadecimal notation of the byte value).

This procedure results in a syntactically legal URI (as defined in [RFC1738],
section 2.2 or [RFC2141], section 2) that is independent of the character
encoding to which the HTML document carrying the URI may have been transcoded.
***

π-winger's problem was that he tried to escape a two-byte value (in UTF-8) with
one byte, which failed. Same problem here. Find the relevant two-byte
combination, and it will probably work. The section sign is encoded as %C2%A7 in
a UTF-8 URI hex scheme.
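
The two steps from the quoted spec text (encode as UTF-8, then escape each byte as %HH) can be sketched with Python's urllib.parse.quote, used here purely as an illustration. The ISO-8859-1 call is included only to show where the single-byte %A7 in the report comes from.

```python
from urllib.parse import quote

# Step 1 + 2 of the convention: UTF-8 bytes, each escaped as %HH.
section_utf8 = quote("\u00a7")                            # section sign
# Same character escaped from its single ISO-8859-1 byte instead.
section_latin1 = quote("\u00a7", encoding="iso-8859-1")

# The spec's own example; ':' and '/' are left alone so the URL stays intact.
hakon = quote("http://foo.org/H\u00e5kon", safe=":/")

print(section_utf8)    # %C2%A7
print(section_latin1)  # %A7
print(hakon)           # http://foo.org/H%C3%A5kon
```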

Comment 5

15 years ago
And, may I add, it is safest NOT to encode anything. Let the browser do the
encoding. This will also make your code more legible.

Unicode to the world.
(Reporter)

Comment 6

15 years ago
The (default) character encoding of the document is ISO-8859-1, not UTF-8
like http://piology.org/ (at least Mozilla "view page info" tells me the
document there uses UTF-8 encoding). URI-encoded characters above %7f in
attribute values of http-links seem to work fine in ISO-8859-1 encoded
pages. The problem only happens in mailto-links.

I encoded the parameters of the mailto-link in UTF-8, but didn't change
the document encoding. It works fine with Mozilla. But IE 6.0 and Netscape
4.78 translate it into two characters, e.g. %C2%A7 yields "Â§" instead
of "§". And when I use %C2%A7 to encode an attribute value of a http-
link, Mozilla yields "Â§" as well (on ISO-8859-1 encoded document).
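
To illustrate the two-character result: the same escaped bytes decode to one character or to two depending on which charset the consumer assumes. This is a sketch with Python's urllib.parse.unquote, not the code of any of the browsers mentioned.

```python
from urllib.parse import unquote

escaped = "%C2%A7"

# Interpreted as UTF-8, the two bytes form a single section sign.
as_utf8 = unquote(escaped, encoding="utf-8")
# Interpreted as ISO-8859-1, each byte becomes its own character.
as_latin1 = unquote(escaped, encoding="iso-8859-1")

print(len(as_utf8), as_utf8)      # 1 §
print(len(as_latin1), as_latin1)  # 2 Â§
```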

Btw, I had a quick run at the link with the UTF-8 encoded anchor name on
http://piology.org/. IE, Netscape, Opera, and Lynx fail to jump to the
given anchor point.

I don't say the other browsers are right, it's just that I'm seriously
confused by now. :-) And I don't know about HTML-specifications on URIs,
but why should they handle the encoding of http-links and mailto-links
differently?

And no encoding of attribute values at all isn't always an option,
especially when you have to handle dynamic content based on textual
input of various users.

Comment 7

15 years ago
@Niklas Dougherty:

The part from the HTML spec you have posted is only a recommendation
for error correction when a URI contains non-ASCII characters. These
and only these should be encoded as UTF-8 and then as %nn.

They don't say that URIs can't contain (encoded) octets that are not
part of a valid UTF-8 sequence. (This is out of the scope of an HTML
specification -- or even the W3C -- anyway.)
And some URL schemes, such as "data" (RFC 2397), rely on that.

And no, it's not safest not to encode anything. This results in INVALID
URIs and amounts to relying on the error-correction scheme you mentioned.

------------------------

On the other hand, the URI shown above is still invalid. This is because
the "mailto" scheme does not define any representation for non-ASCII
characters (see RFC 2368):

| 8-bit characters in mailto URLs are forbidden. MIME encoded words (as
| defined in [RFC2047]) are permitted in header values, but not for any
| part of a "body" hname.
                                                  -- RFC 2368, section 2

But as it explicitly does allow MIME encoding (RFC 2047) in headers, the
URI written above should be written as:

  mailto:test@test.com?subject=%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3D26body%3D3Dtest%3F%3D

Note that the following would also be valid (but includes an extra
space):

  mailto:test@test.com?subject=%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3F%3D%20%26body%3Dtest

For the body, this does not work (you can't have encoded words in the
body and the spec explicitly disallows content-* headers).  
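
A rough sketch of how the encoded-word form of the subject can be built: Q-encode the ISO-8859-1 bytes in the style of RFC 2047, then percent-escape the result for use in the URL. q_encode_word is a hypothetical helper written for this illustration; it escapes more characters than RFC 2047 strictly requires and does not handle the length limit on encoded words.

```python
import string
from urllib.parse import quote

# Conservative choice: only letters and digits are left unescaped.
_Q_SAFE = set(string.ascii_letters + string.digits)


def q_encode_word(text, charset="ISO-8859-1"):
    """Build an RFC 2047 style Q-encoded word (hypothetical helper)."""
    encoded = []
    for byte in text.encode(charset):
        char = chr(byte)
        if char in _Q_SAFE:
            encoded.append(char)
        elif char == " ":
            encoded.append("_")  # Q encoding writes a space as '_'
        else:
            encoded.append(f"={byte:02X}")
    return f"=?{charset}?Q?{''.join(encoded)}?="


word = q_encode_word("test\u00a7test&body=test")
# safe="" so that '=' and '?' inside the encoded word are escaped too.
url = "mailto:test@test.com?subject=" + quote(word, safe="")

print(word)
print(url)
```

The resulting URL contains only ASCII characters, and every "=", "?" and "&" that belongs to the subject is escaped, so it cannot be mistaken for a parameter separator.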

Now, what should Mozilla do when it encounters such an invalid URI:

1. It should be able to display it.
2. It should be able to parse it correctly and not confuse parts of the
   (intended) header with other (intended) headers or the (intended)
   body.
3. If the data for a header or for the body (incorrectly) contains
   non-ASCII characters, it should try to interpret it as UTF-8 and,
   failing that, as ISO-8859-1 (or any other 8bit charset determined by
   other means...)
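
Point 3 could be sketched as follows: undo the percent-escaping to raw bytes, try UTF-8 first, and fall back to an 8-bit charset only if the bytes are not valid UTF-8. The function name decode_header_value and the ISO-8859-1 default are assumptions for illustration; this is not Mozilla's implementation.

```python
from urllib.parse import unquote_to_bytes


def decode_header_value(value, fallback="iso-8859-1"):
    """Percent-decode, preferring UTF-8 and falling back to an 8-bit charset.

    Hypothetical helper; the fallback charset would in practice be
    "determined by other means", e.g. the document encoding.
    """
    raw = unquote_to_bytes(value)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


print(decode_header_value("test%C2%A7test"))  # valid UTF-8
print(decode_header_value("test%A7test"))     # not UTF-8, falls back
```

Both calls yield "test§test". The fallback is a heuristic: a few ISO-8859-1 byte pairs are also valid UTF-8 and would be read as UTF-8 under this scheme.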

Comment 8

15 years ago
OK, so let's make this bug broader.

A problem in another bug (which I'll dupe in a moment) was:
http://www.duden.de/schreibung/regelwerk/zeichen_11.html#%A797
Mozilla did not display the link in the status bar.

pi
Summary: url-encoded characters above %7f trouble parameter parsing in mailto:-links → Handling of non ASCII characters in URLs

Comment 9

15 years ago
*** Bug 168793 has been marked as a duplicate of this bug. ***

Comment 10

15 years ago
If I enter the URI references:
  http://www.duden.de/schreibung/regelwerk/zeichen_11.html#%A797 and
  http://www.duden.de/schreibung/regelwerk/zeichen_11.html#%C2%A797
manually into Mozilla's (Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;
rv:1.1) Gecko/20020826) address bar, both of them actually work. Excellent.

Also, all of these URIs do work, too:
  mailto:test@test.com?subject=test§test%26body%3Dtest  
  mailto:test@test.com?subject=test%c2%a7test%26body%3Dtest  
  mailto:test@test.com?subject=%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3D26body%3D3Dtest%3F%3D

The following URI does not:
  mailto:test@test.com?subject=test%a7test%26body%3Dtest
but that's not a big problem because it's neither a valid URI nor can it be
expected that it would be in the future (the trend goes towards UTF-8). ((But
maybe Mozilla should only produce a single <?> character (U+FFFD) for the
non-UTF-8 "%A7" and not for the rest of the string.))
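
For comparison, this is what "a single U+FFFD for the bad byte only" looks like with a lenient UTF-8 decoder. Python's urllib.parse.unquote with errors="replace" is used as a stand-in here; it says nothing about how Mozilla decodes the string internally.

```python
from urllib.parse import unquote

decoded = unquote(
    "test%a7test%26body%3Dtest", encoding="utf-8", errors="replace"
)

# Only the lone 0xA7 byte is replaced; the rest of the subject survives.
print(decoded.count("\ufffd"))
print(decoded.replace("\ufffd", "<?>"))
```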

So the bug seems not to be related to Mozilla's URI handling in general but
only to extracting links from HTML documents.

Comment 11

15 years ago
darin
Assignee: new-network-bugs → darin

Comment 12

15 years ago
-> intl
Assignee: darin → yokoyama
Component: Networking → Internationalization
QA Contact: benc → ruixu

Updated

15 years ago
Keywords: intl
QA Contact: ruixu → ylong

Comment 13

15 years ago
nhotta
Assignee: yokoyama → nhotta

Updated

15 years ago
Blocks: 157673

Updated

15 years ago
Blocks: 181117

Comment 14

13 years ago
*** Bug 285731 has been marked as a duplicate of this bug. ***

Comment 15

12 years ago
*** Bug 307694 has been marked as a duplicate of this bug. ***

Comment 16

12 years ago
*** Bug 296934 has been marked as a duplicate of this bug. ***

Comment 17

12 years ago
*** Bug 307940 has been marked as a duplicate of this bug. ***

Comment 18

12 years ago
*** Bug 192108 has been marked as a duplicate of this bug. ***
(Assignee)

Updated

12 years ago
Assignee: nhottanscp → jshin1987
Blocks: 42899

Comment 19

11 years ago
*** Bug 341532 has been marked as a duplicate of this bug. ***

Comment 20

11 years ago
*** Bug 354567 has been marked as a duplicate of this bug. ***

Updated

11 years ago
Duplicate of this bug: 368036
Duplicate of this bug: 392423
QA Contact: amyy → i18n

Comment 27

7 months ago
Why should people use non-ASCII characters in URLs? UTF-8 should be enough; in an example like http://www.s1228.net, UTF-8 is already enough to describe the URL.