Closed Bug 278161 Opened 15 years ago Closed 14 years ago

make file URLs always be in (escaped) UTF-8 regardless of the file system encoding (opening an link to a local non-ASCII file)

Categories

(Core :: Networking: File, defect)

defect
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: jshin1987, Assigned: jshin1987)

References

Details

(Keywords: fixed1.8.1, intl)

Attachments

(3 files, 6 obsolete files)

Follow-up to bug 261929  

When a local file is referred to in an html document in an encoding different
from the local file system encoding, it cannot be opened because the encoding of
the document (instead of the file system encoding) is used for the file URL (my
comments -  bug 261929 comment #19 and bug 263750 comment #17 - seem to have
been incorrect) regardless of the value of 'network.standard-url.encode-utf8'. 

With a patch I'm gonna upload, whatever the document encoding may be, file URLs
will be in escaped UTF-8 and the file system encoding will be used only when
interacting with the local file system (as long as net_GetXXX helpers are used)
jshin: I think it is important to support file:// URLs that are also encoded
using the filesystem charset for backwards compatibility.  e.g., suppose someone
has a file:// URL bookmarked, or saved somewhere.  We should try to parse the
unescaped file:// URL as UTF-8, and if that fails, we should fallback on the
filesystem charset.
Summary: make file URLs always be in (escaped) UTF-8 regardless of the file system encoding → make file URLs always be in (escaped) UTF-8 regardless of the file system encoding
*** Bug 278166 has been marked as a duplicate of this bug. ***
do note that the gnomevfs code assumes, iirc, that file:// uris use the
filesystem charset.
Attached patch first shot (obsolete) — Splinter Review
haven't tested much yet (haven't yet compiled on Windows)

darin, yes, I'm  doing that in net_GetFileFromURLSpec.
(In reply to comment #3)
> do note that the gnomevfs code assumes, iirc, that file:// uris use the
> filesystem charset.

 On modern Linux where UTF-8 is the default, that wouldn't matter but I heard
that one of the first things Japanese Linux users do with FC3, SuSE 9.x, etc is
to switch back to ja_JP.EUC-JP (don't ask me why...). 

Anyway, how do we interact with gnomevfs? Do we pass a (file) URL to it?
(execuse me for my ignorance)

Thanks for the reference:

146  if (NS_SUCCEEDED(fileURI->SchemeIs("file", &isFile)) && isFile) {
147       gnome_vfs_get_file_info(spec.get(), &fileInfo,
GNOME_VFS_FILE_INFO_DEFAULT);

http://developer.gnome.org/doc/API/2.0/gnome-vfs-2.0/gnome-vfs-20-gnome-vfs-file-info-ops.html
is silent about the first argument for gnome_vfs_get_file_info which is
text_url.  I doubt it expects the local file system encoding, however. It's more
likely to expect IRI (in UTF-8)



Status: NEW → ASSIGNED
jshin: be careful... this is performance critical code.  you might impact
startup time significantly by asking for the UTF-16 file path first (in the
XP_UNIX case at least).  maybe we should optimize for the case where UTF-8 is
the filesystem charset (except for Windows and OS/2).
jshin: see
http://mail.gnome.org/archives/gnome-vfs-list/2004-March/msg00049.html -
according to that URL, the URI passed to gnome-vfs does expect the filesystem
encoding.
> the URI passed to gnome-vfs does expect the filesystem encoding.

I don't like that ;-) However, we may just get away with using escaped UTF-8 and
hope gnome-vfs does what we do here (try to interpret it as UTF-8 and see if it
works. if it works, it's fine. if not, try file system charset.)

] Whenever we can detect the charset used for the URI type we try to
] convert it to/from utf8 automatically inside gnome-vfs.

(In reply to comment #7)
> jshin: be careful... this is performance critical code.  you might impact
> startup time significantly by asking for the UTF-16 file path first (in the

 Hmm... so I have to do that exercise once more ....  To special-case UTF-8 for
XP_UNIX, we have to make the codeset readily available by caching it rather
early (xpcom init?)

maybe nsNativeCharsetUtils.h could expose an API to expose the current codeset.
 or maybe it could have a "IsNativeUTF8"-type function.
How about this in
http://mail.gnome.org/archives/gnome-vfs-list/2004-March/msg00049.html ?
It seems like we can get away with passing UTF-8 expecting gnome-vfs to take
care of the encoding mismatch gracefully. 

> In the case of unknown encoding we try to use utf8 as much as possible,
> on the grounds that thats where we want to go in the future, but as a
> fallback we can also try latin1 or the local encoding.

a new reference (because nsIconChannel.cpp has changed since)
http://lxr.mozilla.org/seamonkey/source/modules/libpr0n/decoders/icon/gtk/nsIconChannel.cpp#240
well, that mail says:
> >>o) If I give gnome-vfs a file: URI, how should non-ascii characters be 
> >>encoded? Should I give it escaped UTF-8, or unescaped UTF-8, or should I 
> >>use the filesystem's charset?
> >>(the function I'm most interested in in this context is 
> >>gnome_vfs_get_file_info)
> > 
> > You should use the filesystems charset (which is undefined, filenames
> > are bytestrings). 

But sure... if UTF-8 works, use it :)
Attached patch second shot (obsolete) — Splinter Review
I added NS_IsNativeUTF8(), but it might be still slow due to Lock/Unlock. (I
may have to build an optimized build and measure the start-up time) Can I do
without them ?
Attachment #171094 - Attachment is obsolete: true
You can check gInitialized before locking.  That way you avoid entering the
locks in most cases.

Also, I think the NS_IsNativeUTF8 function should be available on all platforms.
(In reply to comment #14)
> You can check gInitialized before locking.  That way you avoid entering the
> locks in most cases.

I considered that, but I was afraid it's not  thread-safe, which might be all
right in this case...

> Also, I think the NS_IsNativeUTF8 function should be available on all platforms.

Is it for the sake of completeness/future-proofness or do you have any specific
use in mind for now? 
> I considered that, but I was afraid it's not  thread-safe, which might be all
> right in this case...

Right, the only concern is if gInitialized switches from true to false.  Then
you'd just be reading a stale value for gIsNativeUTF8.  But, that's ok since
that value never changes post-initialization anyways.


> Is it for the sake of completeness/future-proofness or do you have any 
> specific use in mind for now? 

I was just thinking that it would be good to have a cross-platform API even if
the implementation is not interesting on some platforms, and even if the API is
never called on other platforms.  Are you concerned about implementing it on
Windows?
(In reply to comment #16)

> Right, the only concern is if gInitialized switches from true to false.  Then
> you'd just be reading a stale value for gIsNativeUTF8.  

How about false to true? LazyInit() is called twice (iconv_open is called
twice), but it's closed only once...
 

> I was just thinking that it would be good to have a cross-platform API even if
> the implementation is not interesting on some platforms, and even if the API is
> never called on other platforms.  Are you concerned about implementing it on
> Windows?

No. I was just thinking that if it's not used, I wouldn't have to implement it.
If you like the other way, I have no objection to adding it on all platforms. At
least for now, that would return true on OSX and BeOS and false on Windows 

> How about false to true? LazyInit() is called twice (iconv_open is called
> twice), but it's closed only once...

Hrm.. so LazyInit doesn't set gInitialized to true until after it sets
gNativeIsUTF8.  In other words, you don't have to worry about false to true.

Like this:

{
  if (!gInitialized) {
    Lock();
    if (!gInitialized)
      LazyInit();
    Unlock();
  }
  return gNativeIsUTF8;
}


> No. I was just thinking that if it's not used, I wouldn't have to implement 
> it.  If you like the other way, I have no objection to adding it on all 
> platforms. At least for now, that would return true on OSX and BeOS and false 
> on Windows 

Aren't there some cases where the multibyte codepage is UTF-8 on Windows?
(In reply to comment #18)

> gNativeIsUTF8.  In other words, you don't have to worry about false to true.
> 
> Like this:
....
Aha.. thanks.

> > false on Windows 

> Aren't there some cases where the multibyte codepage is UTF-8 on Windows?

  For Indic scripts and other scripts for which there were no legacy Windows
code pages, it might indeed be UTF-8 on Win 2k/XP (or it may be just ISCII-xx at
least for Indic..[1] ). I have to check it out after changing my default locale
to one of them. Hmm... in that case, if GetACP() is expensive (although not
likely) and we use it somewhere,  we may need to cache it by adding some real
code. For now, I'll just use GetACP()

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/
unicode_81rn.asp
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/nls_21bk.asp
Attached patch patch (obsolete) — Splinter Review
NS_IsNativeUTF8 is now available on all platforms. I changed the locking as
suggested by Darin. I haven't yet tested with gnomevfs (somehow  'smb' and
'sftp' don't work on FC3)
Attachment #178799 - Attachment is obsolete: true
> I haven't yet tested with gnomevfs (somehow  'smb' and
> 'sftp' don't work on FC3)

I'm sure the GnomeVFS people changed the behavior of the API again.  It's been a
royal pain in the neck trying to use that library :-(
(In reply to comment #21)
> > I haven't yet tested with gnomevfs (somehow  'smb' and
> > 'sftp' don't work on FC3)
> 
> I'm sure the GnomeVFS people changed the behavior of the API again.  It's been 

What should I do, then? The patch is ready for  review except GnomeVFS issue. 
OS: Linux → All
try whitelisting "ssh" for gnomevfs
smb: and sftp: are by default on, aren't they? Anyway, I've just added
'network.gnomevfs.supported-protocols' in about:config and set it to
'smb:,ssh:,sftp:', but 'smb://server/share' still doesn't work. With konqueror,
it works fine. I may be doing something wrong. 
yeah, sftp is, but ssh isn't. iirc, FC3 has no sftp, only ssh.

Testing in konqueror is useless, it doesn't use gnomevfs. try nautilus.
I can't make ssh: work although smb works fine in mozilla after opening smb:
link in nautilus. ssh: doesn't work in nautilus. Anyway, how can I test the code at 

http://lxr.mozilla.org/seamonkey/source/modules/libpr0n/decoders/icon/gtk/nsIconChannel.cpp#240
? It's only about 'file://' url so that I don't have to wrestle with
'ssh:,sftp:,smb:'

moz-icon://file:///path/to/some/file
phew... Whether I applied the patch or not,
moz-icon://file:///full/path/KoreanDirectoryInEUCKR doesn't work (I get just a
dull grey icon) under ko_KR.EUC-KR locale. Regardless of the locale (UTF-8 or
legacy), moz-icon://file:///full/path/DirectoryNameInUTF8 gives me the directory
icon. 

Perhaps, in a separate bug, we have to modify nsIconChannel to check the return
status of gnome_vfs_file_info and try again with the file name converted to the
locale encoding.

http://developer.gnome.org/doc/API/2.0/gnome-vfs-2.0/gnome-vfs-20-gnome-vfs-file-info-ops.html#gnome-vfs-get-file-info
Comment on attachment 178951 [details] [diff] [review]
patch

Asking for review.
Attachment #178951 - Flags: superreview?(brendan)
Attachment #178951 - Flags: review?(darin)
As for Windows, I found that on Windows XP (and I think the same is true on Win
XP) I can only pick one of legacy code pages to use with 'A' APIs, which makes
sense because 'A' APIs only work with legacy code pages. Therefore, at the
moment, IsNativeUTF8 should return false on Windows.
Comment on attachment 178951 [details] [diff] [review]
patch

+	 rv = localFile->InitWithPath(NS_ConvertUTF8toUTF16(path));
+	 // In rare cases, a valid UTF-8 string can be valid as a native
encoding 
+	 // (e.g. 0xC5 0x83 is valid both as UTF-8 and Windows-125x )
+	 if (NS_FAILED(rv))
+	     rv = localFile->InitWithNativePath(path);


why would InitWithPath fail in such case?
Attachment #178951 - Flags: superreview?(brendan)
Attachment #178951 - Flags: review?(darin)
Summary: make file URLs always be in (escaped) UTF-8 regardless of the file system encoding → make file URLs always be in (escaped) UTF-8 regardless of the file system encoding (opening an link to a local non-ASCII file)
I added |OpenNSPRFileDesc(PR_RDONLY,...)|. This will slow things down.
Moreover, it's assumed that the file in question exists, which may not be
valid.
Any thought?
Attachment #178951 - Attachment is obsolete: true
I would say that assuming it's a UTF-8 URL unless the file exists is probably an
OK assumption... what do others think?

(Why not use Exists, btw?)
(In reply to comment #33)
> I would say that assuming it's a UTF-8 URL unless the file exists is probably an
> OK assumption... what do others think?
> 
> (Why not use Exists, btw?)

by a simple oversight (I knew it's there, but I couldn't find it. I was looking
for it in  nsILocalFile.idl instead of nsIFile.idl). If darin et al agree with
us, I'll use 

        // See if the file exists assuming its name is in UTF-8.
        // If not, fall back to the native file system encoding.
        PRBool exists;
        if (NS_FAILED(localFile->Exists(&exists)) || !exists)
            rv = localFile->InitWithNativePath(path);
Yes, the switch to UTF-8 file URLs is needed on Windows if we are ever to
successfully switch over to support Unicode file paths.  So, having a failover
mechanism makes sense.

However, this patch is wrong because there is no requirement for the file to
exist when net_GetFileFromURLSpec is called.  URLs and nsIFiles are pointers to
files that may or may not exist.  Therefore, the only time you can test for file
existence to determine if this failover trick should be applied is when you are
asked to open or create the file.  So, nsFileChannel might be a good place for
this since that is the code that opens a file URL.  Of course, what about people
who use nsIFileURL::GetFile.  That's a problem, and I'm not sure what to do
about that.
Blocks: 288154
*** Bug 281384 has been marked as a duplicate of this bug. ***
You can test this by loading a non-utf-8 page with an href to an iri:
http://www.w3.org/International/tests/sec-iri-3
*** Bug 202657 has been marked as a duplicate of this bug. ***
Perhaps, we should just throw away this part in attachment 182846 [details] [diff] [review]. It's extremely low that a 'meaningful' filename (as opposed to random gibberish) encoded in a legacy encoding is a valid UTF-8 (there's a thread on the Unicode mailing list about this and the conclusion was that the chance is very low, IIRC).


+        // In rare cases, a valid UTF-8 string can be valid as a native 
+        // encoding (e.g. 0xC5 0x83 is valid both as UTF-8 and Windows-125x ).
+        // Try to open it assuming it's UTF-8 first. If it fails, fall back to
+        // the native file system encoding.
+        PRFileDesc *file;
+        rv = localFile->OpenNSPRFileDesc(PR_RDONLY, 0, &file);
+        if (NS_FAILED(rv)) 
+            rv = localFile->InitWithNativePath(path);
+        else 
+            PR_Close(file);

Darin, what do you think? 
I realized that there are some temporary patches that had been added with the understanding that they'd be removed once this bug is fixed. I got rid of them as well as the wrong fallback routine in the previous patch. If we only need to worry about bookmarks for 'file urls', I guess this is good enough.
Attachment #182846 - Attachment is obsolete: true
> +        // Try to open it assuming it's UTF-8 first. If it fails, fall back 
...
> Darin, what do you think? 

This sounds like a good plan to me.
(In reply to comment #41)
> > +        // Try to open it assuming it's UTF-8 first. If it fails, fall back
 
> ...
> > Darin, what do you think? 
> 
> This sounds like a good plan to me.

Darin, I'm afraid I failed to make myself clear. What I meant in comment #41 is 
that it may not be so bad to get rid of the fallback code in attachment 182846 [details] [diff] [review] (
that is WRONG anyway : comment #35) and 'bite the bullet' (of causing a potentia
l incompatibility). My reasoning for that is that there's very low chance that a
 valid UTF-8 string is a valid and 'non-gibberish'  string in a legacy encoding.
 If the potential incompatibility is expected to only occur in bookmarks with fi
le urls, I'm pretty sure that the risk is very low (unless somebody intentionall
y makes a 'gibberish' file name that is both valid in UTF-8 and a legacy encodin
g and bookmarks it to expose the problem. Could that be a security risk somehow?
) 

here's just a random idea:
comment #35
> who use nsIFileURL::GetFile.  That's a problem, and I'm not sure what to do
> about that.
 Does the file corresponding to an nsIFileURL have to exist in this case? If so (which is not likely), can we modify net_GetFileFromURLSpec to have a 3rd optional argument for existence check (or rather legacy encoding fallback)? 
 
http://lxr.mozilla.org/seamonkey/ident?i=EnsureFile
Blocks: 228437
A nsIFile does not have to point to a file that exists.  The same goes for nsIFileURL.  I'm happy to leverage an IsUTF8 check to avoid having to code messy failover logic.
Asking for r/sr.

Thanks, Darin, for relieving me of that potentially messy stuff.
Attachment #209284 - Attachment is obsolete: true
Attachment #216317 - Flags: superreview?(darin)
Attachment #216317 - Flags: review?
Comment on attachment 216317 [details] [diff] [review]
patch updated to the trunk

>Index: netwerk/base/src/nsURLHelperUnix.cpp
...
>     // Escape the path with the directory mask
>-    if (NS_EscapeURL(ePath.get(), ePath.Length(), esc_Directory+esc_Forced, escPath))
>-        escPath.Insert(prefix, 0);
>-    else
>-        escPath.Assign(prefix + ePath);
>+    NS_EscapeURL(NS_ConvertUTF16toUTF8(ePath), 
>+                 esc_Directory+esc_Forced+esc_AlwaysCopy, escPath);
>+    escPath.Insert(prefix, 0);

While the old code path was more complex, it was also a bit more efficient
in the case where nothing needs to be escaped.  I think you should change
this to:

    nsAutoString path;
    rv = aFile->GetPath(path);
    ...
    NS_ConvertUTF16toUTF8 ePath(uPath);
    if (NS_EscapeURL(ePath.get(), ePath.Length(), esc_Directory+esc_Forced, escPath))
        escPath.Insert(prefix, 0);
    else
        escPath.Assign(prefix + ePath);

That way, you avoid one realloc in the case where no escaping was needed.

This same nit applies to all of the files.


>+    else 
>+    // if path is not in UTF-8, assuming it is encoded in the native charset
>+        rv = localFile->InitWithNativePath(path);

Please indent the comment

This same nit applies to all of the files.


>Index: netwerk/base/src/nsURLHelperWin.cpp

>+    ePath.ReplaceChar(PRUnichar(0x5cu), PRUnichar(0x2fu));

It might be a little clearer if you used uppercase 'C' and 'F' in the above.
Same for the OS2 file.


r+sr=darin w/ nits picked
Attachment #216317 - Flags: superreview?(darin)
Attachment #216317 - Flags: superreview+
Attachment #216317 - Flags: review?
Attachment #216317 - Flags: review+
Attached patch patch for the branch (obsolete) — Splinter Review
Darin, thanks for r/sr. patch with your concerns addressed got landed on the trunk. this patch is for 1.8branch and includes changes suggested in the review comment.
(In reply to comment #47)

> you don't need the esc_AlwaysCopy anymore, do you?

Thanks for pointing that out. I've just checked in the fix. 

> 
> nsURLHelperOSX.cpp needs no changes? are paths always UTF-8 there?

Yes, it's always in UTF-8 (except that it is in NFKD instead of NFC, which I took care of in nsLocalFileOSX.cpp a couple of years ago).  

> +            // XXX In rare cases, a valid UTF-8 string can be valid as a
> native 
> 
> not sure that this should be an XXX comment. we don't consider that a bug,
> right?

Well, it depends. There's a very low chance that somebody may hit that problem during the transition period.  

The USE_STDCONV case is missing a declaration for IsNativeUTF8().
Attachment #216670 - Flags: review?(jshin1987)
Comment on attachment 216670 [details] [diff] [review]
Fix USE_STDCONV bustage (checked in, trunk)

Sorry for the bustage and thanks for the fix
Attachment #216670 - Flags: review?(jshin1987) → review+
Comment on attachment 216670 [details] [diff] [review]
Fix USE_STDCONV bustage (checked in, trunk)

The bustage fix has been checked in on the trunk.
Attachment #216670 - Attachment description: Fix USE_STDCONV bustage → Fix USE_STDCONV bustage (checked in, trunk)
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Blocks: 193358
Blocks: 88292
To land this patch on 1.8 branch, we need to take care of bug 332108 (and other cases if there's any).
Attachment #216605 - Attachment is obsolete: true
Depends on: 333703
Blocks: 261929
Depends on: 332108
Comment on attachment 217263 [details] [diff] [review]
branch patch updated to include STDCONV fix

a=darin for 1.8.1
Attachment #217263 - Flags: approval-branch-1.8.1+
landed on the 1.8 branch along with the patch for bug 332108
Keywords: fixed1.8.1
*** Bug 345882 has been marked as a duplicate of this bug. ***
Unfortunately this amounts to a change in the Plugin API, without any way for plugins to detect it, except to **** around with the User Agent string.
Blocks: 211961
Depends on: 409796
You need to log in before you can comment on or make changes to this bug.