Closed Bug 130393 Opened 18 years ago Closed 18 years ago

Unable to access to the site whose URL having DBCS is typed at location bar.

Categories

(SeaMonkey :: General, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.0

People

(Reporter: hansoodev, Assigned: nhottanscp)

References

Details

(Keywords: intl, topembed)

Attachments

(1 file, 3 obsolete files)

Try to type any DBCS URL in the location bar and enter.

You will get "file not found" error.

It's b/c mozilla encode/escape URL typed in location bar to UTF-8 b/f sending 
it to server.

For IE, it sends in system charset when "always send URLs as UTF-8" is off,
and it sends in UTF-8 encoding when the option is on,
which increases the possibility to find the file in the server.

According to research done in bugs related to non-ASCII URLs, most modern 
servers are not ready to receive URLs in UTF-8 encoding, but expects them in 
their file system charset.

Therefore, sending URLs in system charset will increase the possibility of 
fidning the file wanted, even though it would not work when the server charset 
does not match with client machine's charset.

HK
There is a bug related to UTF-8 option in mozilla..
http://bugzilla.mozilla.org/show_bug.cgi?id=129726

Since this would not do much when the server charset 
does not match with client machine's charset and this option is off,
why don't we do ths way ?

We have three new properties.
1) network.standard-url.escape-utf8  --> Boolean
2) network.standard-url.embedded-charset --> String
3) network.standard-url.typed-charset --> String

When property (1) is on, we sends all URLs in UTF-8 format,
even embbeded URLs in documents.

When prperty (1) is off,
(2) controls charset used for encoding/escaping embedded URLs in documents.
(3) controls charset used for encoding/escaping URLs typed in location bar.

Valid values for (2) would be "doccharset", and all encoding available,
where default "doccharset" indicates mozilla should use charset of the document 
which has an embedded URLs.

Valid values for (3) would be "syscharset", and all encoding available,
where default "syscharset" indicates mozilla should use charset of system.


I am not sure what's involved for changes and the performance hits on these 
changes, but it will guarantee that mozilla is more flexible than IE
on this issue.
Plus, mozilla will always get files targeted if those options have right values.

Just a thought  =)
CCing   nhotta
Status: UNCONFIRMED → NEW
Ever confirmed: true
Copy keyword from bugscape 12324.
Assignee: asa → nhotta
Keywords: intl, topembed
Target Milestone: --- → mozilla1.0
Roy, could you reveiw?
Status: NEW → ASSIGNED
Comment on attachment 73927 [details] [diff] [review]
Use system charset for URL bar if the URI scheme is http, https, ftp or file.

some minor nits:

>Index: nsDefaultURIFixup.cpp

>+    PRBool bAsciiURI = nsCRT::IsAscii(uriString.get());

how about this instead:

      PRBool bAsciiURI = IsASCII(uriString);


>+  if (mFsCharset.IsEmpty())
>+  {
>+    nsAutoString charset(NS_LITERAL_STRING("ISO-8859-1"));  // set the fallback first.
>+    nsCOMPtr<nsIPlatformCharset> plat(do_GetService(NS_PLATFORMCHARSET_CONTRACTID));
>+    if (plat)
>+      (void) plat->GetCharset(kPlatformCharsetSel_FileName, charset);
>+    mFsCharset.Assign(NS_ConvertUCS2toUTF8(charset));

NS_LossyConvertUCS2toASCII would be better since you know that charsets are
really ASCII.

and instead of initializing |charset| with a NS_LITERAL_STRING (which calls
NS_ConvertASCIItoUCS2
under linux and other platforms), it'd be better to check if charset is empty
and if so assign
a NS_LITERAL_CSTRING to mFsCharset.
Attached patch Updated with darin's comment. (obsolete) — Splinter Review
Attachment #73927 - Attachment is obsolete: true
>+nsCAutoString nsDefaultURIFixup::mFsCharset;
I think it's better to use nsCString instead of nsCAuto.  nsCAutoString
is meant for short life span and mostly used in stack.  nsCString is
allocated in heap.

>+nsCOMPtr<nsIPlatformCharset>plat(do_GetService(NS_PLATFORMCHARSET_CONTRACTID));
>+if (plat)
Instead of checking for _plat_, can we do something like
nsresult rv;
nsCOMPtr<nsIPlatformCharset> plat(do_GetService(NS_PLATFORMCHARSET_CONTRACTID),
&rv);
and check for _rv_

> +      (void) plat->GetCharset(kPlatformCharsetSel_FileName, charset);
Do we need (void) casting?

> +const char * nsDefaultURIFixup::GetCharsetForUrlBar(const nsString& aStringURI)
Can we use nsAString or nsAFlatString for parameter?
I am not even close to knowing all the underlying string staff; 
but in this case I believe we should use abstract string classe for function
parameter instead of using the concrete classe. I believe we have
Ready-Only methods for nsAString (ie, FindChar(), Equals(),.. )
alecf and jag can comment here better. 

My question to alecf/jag: 
Do we have nsAString::EqualsIgnoreCase()?  Current implementation of nsString
has EqualsIgnoreCase() and it will call unicharutil for lower case conversion.

 






Comment on attachment 73952 [details] [diff] [review]
Updated with darin's comment.

>+nsCAutoString nsDefaultURIFixup::mFsCharset;

Well, in this case I think an autostring is fine, because charsets are usually
very short
(i.e. less than 64 bytes) so this will save us a 2nd heap allocation

>+    PRBool bAsciiURI = IsASCII(uriString);

what's "IsASCII()"? I don't see it defined anywhere... sure you don't mean
nsCRT::IsASCII?

I'm kind of baffled by the number of conversion back and forth between UCS2 and
ASCII for the charset..
(see my comments below, regarding this)


>     // Just try to create an URL out of it
>-    NS_NewURI(aURI, uriString, nsnull);
>+    NS_NewURI(aURI, uriString, bAsciiURI ? nsnull : GetCharsetForUrlBar(uriString));

Huh. Now I'm confused. 
How do we get a unicode string that's encoded in a certain charset? I kind of
dont' get that
because if we have a PRUnichar* buffer, shouldn't it already be in UCS2?

It kind of looks like we should be looking at the caller of CreateFixupURI to
make sure they are passing
in a true unicode string, rather than forcing us to do a wierd conversion... 

Assuming there's a reasonable explanation, I'll continue with the review:

> 
>-    nsresult rv = NS_NewURI(aURI, uriString, nsnull);
>+    nsresult rv = NS_NewURI(aURI, uriString, bAsciiURI ? nsnull : GetCharsetForUrlBar(uriString));
> 
> 
>+const char * nsDefaultURIFixup::GetFileSystemCharset()
>+{

since this always returns mFsCharset, we should make this return nsAString&
(due to the comments I explain below)

>+}
>+
>+const char * nsDefaultURIFixup::GetCharsetForUrlBar(const nsString& aStringURI)
>+{
>+  if (aStringURI.FindChar(':') == -1 ||

not -1, use kNotFound.

>+      aStringURI.EqualsIgnoreCase("http:", 5) ||
>+      aStringURI.EqualsIgnoreCase("https:", 6) ||
>+      aStringURI.EqualsIgnoreCase("ftp:", 4) ||
>+      aStringURI.EqualsIgnoreCase("file:", 5))
>+  {

Why only for these types? what about other protocols that might be installed?

>+    // for file url, we need to convert the nsString to the file system
>+    // charset before we pass to NS_NewURI
>+    nsAutoString fsCharset(NS_ConvertUTF8toUCS2(GetFileSystemCharset()));

If you change from const char* to nsAString&, You can avoid the
double-conversion (since GetFileSystemCharset itself does one conversion,
and this line converts it back.


>Index: nsDefaultURIFixup.h
>+    static nsCAutoString mFsCharset;

Bah! This should not be static. No static classes allowed (see the C++ coding
guidelines on www.mozilla.org)
Is this a service? If so, a simple member variable will suffice.

Anyway, the char/PRUnichar conversions in this patch are a bit of a mess..
Attachment #73952 - Flags: needs-work+
>what's "IsASCII()"? I don't see it defined anywhere... sure you don't mean
>nsCRT::IsASCII?
nsReadableUtils.h

>+    NS_NewURI(aURI, uriString, bAsciiURI ? nsnull : 
> GetCharsetForUrlBar(uriString));
uriString is always in Unicode,
charset here is not for the current charset encoding of uriString, it will be
used later to encode the URI.

>Why only for these types? what about other protocols that might be installed?
The default is to use UTF-8 to encode URI. I listed protocols which we prefer
system charset instead of UTF-8. E.g., UTF-8 is preffered for ldap, javascript
instead of system charset.

>+    nsAutoString fsCharset(NS_ConvertUTF8toUCS2(GetFileSystemCharset()));
>If you change from const char* to nsAString&, You can avoid the
>double-conversion (since GetFileSystemCharset itself does one conversion,
>and this line converts it back.
I use char* because NS_NewURI takes char* which is called more frequently than
this part of the code which is file URI specific.




character set names are defined using ASCII characters, so we shouldn't be using
unicode to represent charsets anywhere.  unfortunately, so much of our code
already uses unicode for charsets :-(  in this case, i'm not sure what the best
solution is... but, fwiw: you can get a unicode encoder/decoder from an ASCII
charset using nsICharsetConverterManager2::GetCharsetAtom2.
From comment #8,
> +const char * nsDefaultURIFixup::GetCharsetForUrlBar(const nsString& aStringURI)
>Can we use nsAString or nsAFlatString for parameter?

Need suggestion. Do we want to do that?
Comment on attachment 73952 [details] [diff] [review]
Updated with darin's comment.

oops, right. I forgot the use of the charset parameter in NS_NewURI (darin
explained it to me a while back.. whoops!)

I guess my concern with the whole character conversion is that in the case of
this line:
+    nsAutoString fsCharset(NS_ConvertUTF8toUCS2(GetFileSystemCharset()));

you actually are converting it twice - once inside of GetFileSystemCharset()
(from UCS2 to ASCII) and once with NS_ConvertUTF8toUCS2...(ASCII back to UCS2)
not to mention that if charsets are really ASCII then we should be using
NS_ConvertASCIItoUCS2

So I guess that's why I'm complaining - I know that charset names are always
ascii, but it sounds like it would be faster to store UCS2 for the existing
APIs... I guess going through GetCharsetAtom2 might be just as slow :)
oh, and yes, you want to use nsAString, not nsString, as your parameter, if
possible.
should necko have used UCS2 for charsets just to be consistent, or is there a
chance at all that i can coerce the rest of the tree into using ASCII for
charset strings??? :-)
>if charsets are really ASCII then we should be using NS_ConvertASCIItoUCS2
I will make the change.

About nsString ->nsAString change, I tried something and I found that I need to
link unicharutil in order to use nsCaseInsensitiveStringComparator (cannot
simply done by adding REQUIRES	= unicharutil). Alec, do you know anything about that?


no way, I think necko does the right thing.
I would so love to see the whole tree change.. maybe after 1.0? :)
Its just in this case, it seems like storing UCS2 makes sense. I dunno.
you need to #include nsIUnicharUtils.h to get that.
Ugh. what a mess :)
>you need to #include nsIUnicharUtils.h to get that.
But I am getting a link error. Do I need to do a static link to unicharutil for
docshell?
Attachment #73952 - Attachment is obsolete: true
Roy, could you review?
Comment on attachment 73997 [details] [diff] [review]
Changed to no argument for GetCharsetForUrlBar, removed static for mFsCharset.

+    if (charset.IsEmpty())
+      mFsCharset.Assign(NS_LITERAL_CSTRING("ISO-8859-1"));
Instead of hardcoding the charset, can we read from nsPref's
default charset "intl.charset.default" and assign it here?
However, if we need to static link nsPref, then forget it.
The charset here is for the system charset and has different meaning from
"intl.charset.default". Does it make sense to use it as a substitution?

Comment on attachment 73997 [details] [diff] [review]
Changed to no argument for GetCharsetForUrlBar, removed static for mFsCharset.

I thought the "intl.charset.default" Pref is used when 
every charset detections are failed. 
Granted that this setting may be used for Document encoding; but
the wordings in the Pref dlgbox says Default Character Encoding for
_Navigator_,
not for _Document_ alone.

My point is who decides the fallback encoding? Developers?
"ISO-8859-1" is just an another encoding. Why do we treat it any differently 
from other encodings?

Well, I don't want to fight over this. Either way is fine with me and give
/r=yokoyama
Attachment #73997 - Flags: review+
In this case, the hard coded string is used either nsIPlatformCharset is not
available or it returns an error. I think writing the fallback to use the pref
value everytime using nsIPlatformCharse is too match. It could be done in
nsIPlatformCharse but note that nsIPlatformCharset also has its own fallback to
use the hard coded string.

> My point is who decides the fallback encoding? Developers?
> "ISO-8859-1" is just an another encoding. Why do we treat it any differently 
> from other encodings?
I think that depends on the situation.
For the case like the service error, we can use ISO-8859-1 then we can expect
the consistent result (even if the result is not correct).
Darin, could you review?
Comment on attachment 73997 [details] [diff] [review]
Changed to no argument for GetCharsetForUrlBar, removed static for mFsCharset.

>Index: nsDefaultURIFixup.cpp

>+const char * nsDefaultURIFixup::GetCharsetForUrlBar()
>+{
>+  const char *charset = GetFileSystemCharset();
>+#ifdef XP_MAC
>+  if (charset[0] == 'x' && charset[2] == 'm')

what if GetFileSystemCharset returns an empty string?  charset[2] might result
in a crash.


>+    // for file url, we need to convert the nsString to the file system
>+    // charset before we pass to NS_NewURI
>+    nsAutoString fsCharset(NS_ConvertASCIItoUCS2(GetFileSystemCharset()));

NS_ConvertASCIItoUCS2 is actually a class, so write this instead:

      NS_ConvertASCIItoUCS2 fsCharset(GetFileSystemCharset());
Attachment #73997 - Flags: needs-work+
Attachment #73997 - Attachment is obsolete: true
Comment on attachment 74128 [details] [diff] [review]
Updated with darin's comment.

sr=darin.... i don't much like seeing nsCAutoString used as a member variable,
cuz it bloats the class by 64 bytes, but i guess you're doing that to eliminate
a heap alloc.  i'm not sure which is better... one larger than necessary heap
alloc or two small heap allocs.
Attachment #74128 - Flags: superreview+
Comment on attachment 74128 [details] [diff] [review]
Updated with darin's comment.

a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #74128 - Flags: approval+
checked in to the trunk
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Product: Browser → Seamonkey
Blocks: 393246
You need to log in before you can comment on or make changes to this bug.