Closed Bug 84032 Opened 20 years ago Closed 19 years ago

Add "uriCharsetEncodingHint" field to nsIURI

Categories

(Core :: Networking, defect, P4)

Tracking

RESOLVED FIXED
Future

People

(Reporter: nhottanscp, Assigned: neeti)

Details

This was proposed in mozilla netlib newsgroup.
news://news.mozilla.org:119/3AFC62DE.F5CDF5DA%40netscape.com

By adding "hint_charset", charset information will not be lost, so clients can
use that information to apply the appropriate charset conversion. Note that
"hint_charset" may not match the nsIURI internal charset (which is UTF-8).
How about calling it |uriCharsetEncodingHint|?  This would help make it clear
that the encoding in question is intended to apply to the URI itself, and not
the thing pointed to by the URI.
That sounds good, change the summary.
Summary: Add "hint_charset" field to nsIURI → Add "uriCharsetEncodingHint" field to nsIURI
Priority: -- → P4
Target Milestone: --- → mozilla1.0
Let me try to understand: this would most likely be acquired from the charset in
the HTML or HTTP response, or overridden from the View->Encoding menu?
What uses (other than IDN) do you reckon this would be good for?
Other cases would be path names, file names.
Possibly relevant to this is bug 84186.
Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1 
(you can query for this string to delete spam or retrieve the list of bugs I've 
moved)
Target Milestone: mozilla1.0 → mozilla1.0.1
this proposal:

news://news.mozilla.org:119/3AFC62DE.F5CDF5DA%40netscape.com

fails to address several problems:

1) HTTP nsIURI's can be instantiated through redirects in which no charset
information is available.  the server may generate a URL in response to a
redirect that contains URL-escaped non-ASCII characters.  we have no way of
converting these URLs to UTF-8.

2) also, servers may URL-escape characters that would interfere with parsing a
URL such as a '/' that is part of a path element and not a path element
delimiter... or a '@' in someone's password.  there is a set of reserved
characters that must be URL-escaped, otherwise the URL would fail to parse properly.

in summary, adding a charset attribute to nsIURI is insufficient.
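
A minimal Python sketch of the two problems above (not Mozilla code; the
escaped strings are made up for illustration). It shows that the same escaped
bytes decode to different text depending on the charset assumed, and that
unescaping a reserved character changes how the URL parses.

```python
from urllib.parse import unquote, unquote_to_bytes

# Hypothetical escaped path, as a server might send it in a redirect.
escaped = "%93%FA%96%7B"
raw = unquote_to_bytes(escaped)

# Problem 1: the same bytes are different text under different charsets.
as_shift_jis = raw.decode("shift_jis")
as_latin1 = raw.decode("latin-1")
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # Without charset information there is no safe conversion to UTF-8.
    valid_utf8 = False

# Problem 2: unescaping a reserved character changes how the URL parses.
path = "/dir/a%2Fb"
segments_escaped = path.split("/")             # "a%2Fb" stays one segment
segments_unescaped = unquote(path).split("/")  # "a" and "b" become two
```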
>adding a charset attribute to nsIURI is insufficient.

I agree, but we need the hint charset in order to support existing documents if
we switch to UTF-8 URI.
nhotta:

the problem is that we cannot switch to a UTF-8 URL in all cases.  in some cases
we have no way of converting the unescaped URL to UTF-8.  now, that doesn't stop
us from converting the escaped URL to UTF-8, which is of course a no-op.  so it
is possible for nsIURI to support UTF-8 w/o requiring that all unescaped URI's
be encoded using UTF-8.  URIs for some protocols should simply never be
unescaped.  HTTP is an example of one such protocol.

HTTP for example will most likely not use the charset attribute since there is
no way to know in general what charset sequences %80-%FF correspond to.  HTTP
URLs shouldn't be unescaped.

there are of course exceptions, and we want to make sure that, in the cases
where charset information does exist, we try to show the user the unescaped URI.
this really means that we should show the user the URI with escape sequences %80
and above unescaped.  other escape sequences should probably stay intact since
they could correspond to control characters and other reserved characters that
would either make the URL not display properly or make the URI mean something
entirely different.
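
The display rule described above could be sketched roughly as follows in
Python. This is an illustration only, not how nsIURI does it: the function
name unescape_high and the sample URIs are hypothetical, and the sketch
assumes a charset whose multi-byte sequences consist only of bytes 0x80 and
above (so it would not handle Shift_JIS trail bytes below 0x80).

```python
import re
from urllib.parse import unquote_to_bytes

# Runs of escapes whose byte value is %80 or above.
_HIGH_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")

def unescape_high(uri, charset):
    """Unescape only %80-%FF runs, decoding them with the hint charset.

    Escapes below %80 (control and reserved characters) stay intact.
    A run that does not decode in the given charset is left escaped.
    """
    def repl(match):
        raw = unquote_to_bytes(match.group(0))
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            return match.group(0)
    return _HIGH_RUN.sub(repl, uri)
```

For example, unescape_high("/caf%E9%20menu%2Fa", "latin-1") leaves %20 and
%2F escaped and only turns %E9 into a displayable character.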
As far as I know we store the URL escaped in nsStandardURL. This is a must! We
need to change the escaping to no longer escape chars > 127 by default. On
protocols that need to be in ASCII (could be stored on the protocol information)
we need a second special escaping run for all chars > 127. That can happen just
before sending the request to the server.
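
As a rough Python sketch of that second escaping run (the function name
escape_non_ascii is hypothetical, and encoding with UTF-8 by default is an
assumption, not something decided in this bug):

```python
def escape_non_ascii(uri, charset="utf-8"):
    """Percent-escape every char > 127; leave ASCII, including existing
    %XX escapes, untouched. Intended to run just before the request
    is sent on a protocol that must be ASCII."""
    out = []
    for ch in uri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # Escape each byte of the character in the chosen charset.
            out.extend("%%%02X" % b for b in ch.encode(charset))
    return "".join(out)
```

With this, "/café?q=1" would go out as "/caf%C3%A9?q=1" under UTF-8, or as
"/caf%E9?q=1" if the caller passes "latin-1".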

URLs as a whole should be unescaped for displaying purpose only.
>URLs as a whole should be unescaped for displaying purpose only.
I agree. The hint charset may be used to display if available.

I was talking about the unescaped case. An unescaped non-ASCII URI in a document
(e.g. HREF) is most likely in the charset of the document. I think we don't
currently convert those URIs to UTF-8, but I am not sure whether they are
escaped in nsIURI or left unescaped. In either case, the hint charset would help
to display those URIs.
Target Milestone: mozilla1.0.1 → Future
This is already available as originCharset in nsIURI.
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED