Closed Bug 229548 Opened 21 years ago Closed 10 years ago

IDN: URL in status bar is displayed as garbage if the path part has non-ASCII characters in non-UTF-8 encoding

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME
mozilla1.9alpha1

People

(Reporter: kazhik, Assigned: smontagu)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: intl)

Attachments

(4 obsolete files)

The URL in the status bar is displayed as garbage if non-ASCII characters are
used in both the domain name and the name attribute (the fragment).

http://<non-ASCII domain name>/index.html#<non-ASCII name attribute>

The second set of non-ASCII characters (the name attribute) is displayed
correctly, but the first (the domain name) isn't.

Original report in Bugzilla-jp
http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=3513
What should be displayed as
http://賢明.jp/ent_exam/ent_exam.html#メニュー

is displayed as

http://莖∽??.jp/ent_exam/ent_exam.html#メニュー

With a debug build, I got a few assertions in xpcconvert.cpp and nsUTF8Utils.h,
so there is a conversion problem somewhere. My guess is that the first four
bytes of 賢明 (in UTF-8) correspond to 莖∽ in EUC-JP, and the remaining two bytes
don't form a valid character in EUC-JP, so they're turned into question
marks. I'll check this out when I'm on Linux. (I can do it now, but it's a
little cumbersome.)

This happens probably because somewhere we have a URI with the host address in
UTF-8 and the path part in EUC-JP. Given this URI, we try to convert it to
Unicode (UTF8 or UTF-16) as if the whole URI is in EUC-JP (originCharset).
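
The garbling can be checked outside the browser. Below is a minimal sketch in Python (not Mozilla code), assuming the host name reaches the converter as UTF-8 bytes and is then decoded with the page's charset, EUC-JP. The variable names are illustrative only.

```python
# Illustrative reproduction of the garbled host name; not Mozilla code.
host = "賢明"

# The host name as UTF-8: six bytes, three per character.
utf8_bytes = host.encode("utf-8")
print(utf8_bytes.hex(" "))  # e8 b3 a2 e6 98 8e

# Decode those UTF-8 bytes as if they were EUC-JP (the originCharset).
# The first four bytes happen to form two valid EUC-JP characters;
# the last two do not, so they come out as replacement characters.
garbled = utf8_bytes.decode("euc_jp", errors="replace")
print(garbled)
```

The decoded string begins with 莖∽, which matches the garbage seen in the status bar.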




Keywords: intl
OS: Linux → All
Hardware: PC → All
> UTF-8 and the path part in EUC-JP. Given this URI, we try to convert it to
> Unicode (UTF8 or UTF-16) as if the whole URI is in EUC-JP (originCharset)
 
We only try this conversion when a given URI spec is NOT valid UTF-8.
With UTF-8 in the host address part and EUC-JP in the path part, the spec is not
valid UTF-8 as a whole, so we assume the whole URI spec is in EUC-JP.
Therefore, this problem doesn't occur if we just have the host part in UTF-8
followed by an ASCII-only path part. To see that, try
http://bugzilla.mozilla.gr.jp/attachment.cgi?id=1954 (quoted in bug 229546)
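
The heuristic described here can be sketched as follows. This is a simplified Python model, not the actual nsTextToSubURI code; `spec_to_unicode` is a hypothetical name, and the sketch assumes the only decision made is "valid UTF-8 as a whole, or else decode everything with the origin charset".

```python
def spec_to_unicode(spec: bytes, origin_charset: str) -> str:
    """Hypothetical model of the conversion: try UTF-8 first, then fall back."""
    try:
        # A spec that is valid UTF-8 as a whole is used as-is.
        return spec.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise the WHOLE spec is assumed to be in the origin charset.
        return spec.decode(origin_charset, errors="replace")


host = "http://賢明.jp".encode("utf-8")          # host part in UTF-8
ascii_path = b"/ent_exam/ent_exam.html"           # ASCII-only path
eucjp_fragment = "#メニュー".encode("euc_jp")     # fragment in EUC-JP

# UTF-8 host + ASCII path: valid UTF-8, so the host survives.
ok = spec_to_unicode(host + ascii_path, "euc_jp")

# UTF-8 host + EUC-JP fragment: not valid UTF-8 as a whole, so the
# fallback decodes the host as EUC-JP too and garbles it.
bad = spec_to_unicode(host + ascii_path + eucjp_fragment, "euc_jp")

print(ok)
print(bad)
```

In the second case the fragment is decoded correctly while the host is not, which is the behaviour reported in this bug.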

Darin, can I assume that the host part of _any_ URI is _always_ in UTF-8? Then,
I can fix this in nsISubTextURI (?). However, that wouldn't be pretty. 

Assignee: smontagu → jshin
Blocks: IDN
Summary: IDN: URL in status bar is displayed as garbage → IDN: URL in status bar is displayed as garbage if the path part has non-ASCII characters in non-UTF-8 encoding
I cannot reproduce this with 2004050304-trunk/WinXP.
WORKSFORME?
Sorry...

Reproduced with 2004050304-trunk/Win98, 20040503-trunk(Firefox)/Win98,
20040503-trunk(Firefox)/WinXP.
> Darin, can I assume that the host part of _any_ URI is _always_ in UTF-8? Then,
> I can fix this in nsISubTextURI (?). However, that wouldn't be pretty. 

nsIURI::host is always encoded using UTF-8.
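
Given that guarantee, one way to picture the proposed fix is to decode the host as UTF-8 and everything after it with the origin charset. The sketch below is in Python rather than Gecko C++, `decode_mixed_spec` is a hypothetical name, and the split on `://` and the first `/` is a deliberate simplification that ignores scheme-specific parsing, userinfo, and ports.

```python
def decode_mixed_spec(spec: bytes, origin_charset: str) -> str:
    """Hypothetical sketch: host decoded as UTF-8, the rest as origin_charset."""
    scheme, sep, rest = spec.partition(b"://")
    if not sep:
        # No authority section found; fall back to the origin charset.
        return spec.decode(origin_charset, errors="replace")

    # Everything up to the first "/" is treated as the host (naive assumption).
    host, slash, tail = rest.partition(b"/")
    return (
        scheme.decode("ascii", errors="replace")
        + "://"
        + host.decode("utf-8", errors="replace")
        + (slash + tail).decode(origin_charset, errors="replace")
    )


spec = (
    "http://賢明.jp".encode("utf-8")
    + "/ent_exam/ent_exam.html#メニュー".encode("euc_jp")
)
print(decode_mixed_spec(spec, "euc_jp"))
```

With the two parts decoded separately, both the host and the fragment come out intact.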
Attached patch patch (obsolete) — Splinter Review
This fixes bug 229546 as well and can also be used for fixing bug 200150.
I've got a little more robust patch. This should be fixed before 1.8beta.
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.8beta
Attached patch patch (obsolete) — Splinter Review
asking for review
Attachment #171172 - Attachment is obsolete: true
Attachment #171508 - Flags: superreview?(darin)
Attachment #171508 - Flags: review?(smontagu)
Comment on attachment 171508 [details] [diff] [review]
patch

This would have been easier to review with more context, by the way.
Attachment #171508 - Flags: review?(smontagu) → review+
Attached patch patch with more context (obsolete) — Splinter Review
thanks for r and sorry for too little context. I was just too lazy to get rid
of another patch nearby (for bug 244754) and took a short-cut by omitting '-u'
option.
Attachment #171508 - Attachment is obsolete: true
Attachment #171515 - Flags: superreview?(darin)
Attachment #171515 - Flags: review+
Attachment #171508 - Flags: superreview?(darin)
Comment on attachment 171515 [details] [diff] [review]
patch with more context

>Index: intl/uconv/src/nsTextToSubURI.cpp

>+  nsCOMPtr<nsIURLParser> urlParser;
>+  // should we just use net_GetStdURLParser()? 
>+  urlParser = do_GetService(NS_STDURLPARSER_CONTRACTID, &rv);
>+  NS_ENSURE_SUCCESS(rv, rv);

net_GetStdURLParser is an internal necko method. Since this code
is not part of the necko DLL, it cannot use it.

How do you know that this is the correct nsIURLParser instance for
the given URI? I don't think you can know that it is. What if the
given URI scheme does not support an authority section, but would
erroneously be parsed as having one by the STDURLPARSER?

I think you should use nsIIOService::newURI instead, to construct
a nsIURI.  Then, call GetHost, and check that instead.
Attachment #171515 - Flags: superreview?(darin) → superreview-
Attached patch patch that generates nsIURI (obsolete) — Splinter Review
With standard-url.encode-utf8 set to true, this patch is not necessary for most
URIs (except for file URLs). However, setting the pref to true (which is now
the default) doesn't fix bug 229546, and this patch does.
Attachment #171515 - Attachment is obsolete: true
Attachment #173054 - Flags: superreview?(darin)
Attachment #173054 - Flags: review?(smontagu)
Attachment #173054 - Flags: review?(smontagu) → review+
Comment on attachment 173054 [details] [diff] [review]
patch that generates nsIURI

Phew... This patch creates an infinite loop for 'javascript:.....' URLs (I
hadn't tested any page with such a URL) because nsJSProtocolHandler relies on
nsITextToSubURI to ensure the UTF8ness of a spec (EnsureUTF8Spec method of
nsJSProtocolHandler). There may be other protocol handlers with the same issue.
Attachment #173054 - Attachment is obsolete: true
Attachment #173054 - Flags: superreview?(darin)
Attachment #173054 - Flags: review+
(In reply to comment #13)
> Phew... This patch creates an infinite loop for 'javascript:.....' url because 
> nsJSProtocolHandler relies on nsITextToSubURI 

One way to break the infinite loop is to check if the scheme is 'javascript',
but that may not scale. Alternatively, we may add a parameter to the APIs of
nsITextToSubURI to indicate whether 'host' is present /needs to be checked for IDN.
(In reply to comment #14)

> One way to break the infinite loop is to check if the scheme is 'javascript',
> but that may not scale. Alternatively, we may add a parameter to the APIs of
> nsITextToSubURI to indicate whether 'host' is present /needs to be checked for
> IDN.

Well, the second approach is just shifting the 'responsibility' to callers so
that it has the same problem. If so, just checking if the scheme is 'javascript'
(for now) in nsITextToSubURI is better.
It seems to me that these functions should take a nsIURI as their parameter
instead of a raw character string.
Blocks: 316730
Jshin:

I want to fix this issue myself.
Are you working on this?
Can I take this?
Target Milestone: mozilla1.8beta1 → ---
Assignee: jshin1987 → masayuki
Status: ASSIGNED → NEW
Target Milestone: --- → mozilla1.9alpha
Status: NEW → ASSIGNED
Masayuki Nakano: 2½ years after you "accepted" this bug, no one has objected. Are you still willing to fix it? And are you still experiencing it? (I'm not sure what to check against what).
(In reply to comment #18)
> Masayuki Nakano: 2½ years after you "accepted" this bug, no one has objected.
> Are you still willing to fix it? And are you still experiencing it? (I'm not
> sure what to check against what).

No, I'm not sure. I'll clean up my bug list after all the Gecko 1.9 work is finished.
It works for me now with the latest trunk. Can you confirm?
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073100 SeaMonkey/2.0a1pre

After studying the bug again somewhat, I'd say it works for me but not decisively:

- I see the same garbage (plus www) in the Location Bar of the xul error page as in the input box of the bug's URL: http://www.è3¢æ¤ô¤3¤ò¤μ.jp/ent_exam/ent_exam.html

- In the top URL in comment #1, the blue underlined part stops just before the # sign. Clicking that link gives a xul error page for http://www.賢明.jp/ent_exam/ent_exam.html

(Apparently the DNS query gives a null result in both cases. Don't know if relevant.)

However, these Bugzilla pages are in UTF-8. A link to a non-Bugzilla non-Unicode page, with a link on it with non-ASCII in it, might be necessary for a "really" valid testcase nowadays.
I'm resetting bugs that are assigned to me but that I'm not working on and don't plan to fix in the near future.
Assignee: masayuki → smontagu
QA Contact: amyy → i18n
Status bar is not supported any longer.
So this bug should be closed.
(In reply to Hideo Oshima from comment #23)
> Status bar is not supported any longer.
> So this bug should be closed.

URL preview is still available, even without the status bar.

That being said, I can't reproduce this, so I'm inclined to WFM. Anne, what do you think?
Flags: needinfo?(annevk)
Agreed.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(annevk)
Resolution: --- → WORKSFORME