Closed Bug 57011 Opened 24 years ago Closed 22 years ago

Make LocalFile's PersistentDescriptor i18n friendly

Categories

(Core :: Networking: File, defect, P3)

x86
Windows NT
defect

Tracking


RESOLVED WONTFIX
mozilla1.0.1

People

(Reporter: racham, Assigned: ccarlen)

References

Details

Attachments

(1 file)

When I worked on fixing the Japanese migration problem (bug 44764), I discovered
that I couldn't use Get/SetPersistentDescriptor to migrate Japanese profiles
(profiles/profile folders with Japanese characters) properly. I had to use some
Unicode representation. I used the GetUnicodePath and InitWithUnicodePath APIs
to move the Unicode data smoothly across profile manager APIs. But that turned
out to be a problem on Mac, as it changed the way we stored profile paths in the
registry (we started storing paths instead of persistent strings, and Mac users
are affected by that). On Windows and Linux it didn't really matter, as the
paths are stored as paths and not as aliases (in persistent descriptor format)
in the registry.

We should allow Get/SetPersistentDescriptor on all platforms to take some kind
of persistent-descriptor-formatted data and then decode it properly inside the
function. It looks like on Windows and Linux the persistent data is just the
path. Maybe we need some kind of encoded format here which can be decoded as we
process the input string. It looks like the Mac data is already base64-encoded
in the persistent interfaces. We need a similar mechanism on Windows and Linux.
That would allow us to use the APIs directly without having to worry about
platform and different input string formats (see the sketch below).

Conrad, please update the bug with some of the ideas you mentioned.
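
To make "encoded format" concrete, here is a minimal standalone base64 encoder -
the same kind of encoding the Mac descriptors already use for their alias data.
The name Base64Encode and the framing are illustrative, not from the Mozilla
tree:

  #include <cstddef>
  #include <string>

  // Illustrative only: encode raw path bytes to base64 so the stored
  // descriptor is opaque and unambiguous regardless of platform charset.
  std::string Base64Encode(const unsigned char* data, size_t len)
  {
      static const char tbl[] =
          "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
          "0123456789+/";
      std::string out;
      for (size_t i = 0; i < len; i += 3) {
          unsigned v = data[i] << 16;
          if (i + 1 < len) v |= data[i + 1] << 8;
          if (i + 2 < len) v |= data[i + 2];
          out += tbl[(v >> 18) & 63];
          out += tbl[(v >> 12) & 63];
          out += (i + 1 < len) ? tbl[(v >> 6) & 63] : '=';  // pad short tail
          out += (i + 2 < len) ? tbl[v & 63] : '=';
      }
      return out;
  }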
Blocks: 57014
The problem with doing this is forwards/backwards compatibility. Persistent
descriptors which have been written into files up to now on Windows and Linux
just used ASCII paths. Those "descriptors", inadequate as they are, still need
to be understood by SetPersistentDescriptor so we can read old prefs and such.
The way we could do it is this: add a header onto the descriptor data which
cannot possibly be the beginning of a path, for instance "/\/\/\", followed by
Unicode path data. SetPersistentDescriptor can then look for this header and, if
found, decode the Unicode string which follows it. If it does not see the
header, it knows it has an old ASCII path. That's easy enough. The harder part
is making the data returned by GetPersistentDescriptor usable by old versions of
Mozilla. Do we need to worry about this? Since the current code expects this
string to be a null-terminated ASCII string, we could make the data begin with a
null-terminated ASCII path string. Old readers could still read this. Then,
following the null-terminated string, would come the "/\/\/\" header followed by
the Unicode string.
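
A standalone sketch of that layout (kMarker and DecodeDescriptor are made-up
names for illustration, and the 16-bit data is assumed to be in native byte
order):

  #include <cstring>
  #include <string>

  static const char kMarker[] = "/\\/\\/\\";  // the "/\/\/\" header

  // Old readers see only the leading NUL-terminated ASCII path; new
  // readers look past the NUL for the marker and the UTF-16 payload.
  std::u16string DecodeDescriptor(const char* data, size_t len,
                                  std::string* asciiFallback)
  {
      const void* nul = memchr(data, '\0', len);
      size_t asciiLen = nul ? static_cast<const char*>(nul) - data : len;
      asciiFallback->assign(data, asciiLen);

      size_t pos = asciiLen + 1;                // skip the NUL
      size_t markerLen = sizeof(kMarker) - 1;
      std::u16string wide;
      if (nul && pos + markerLen <= len &&
          memcmp(data + pos, kMarker, markerLen) == 0) {
          pos += markerLen;
          size_t n = (len - pos) / 2;
          wide.resize(n);
          if (n)  // memcpy avoids misaligned 16-bit reads
              memcpy(&wide[0], data + pos, n * 2);
      }
      return wide;  // empty => old-style descriptor, use the ASCII path
  }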
it won't break backwards compatibility, because:
1) UTF8-encoded ASCII strings are identical to the existing ASCII strings
2) any ASCII versions of i18n filenames are already pointing to non-existent
paths as it is.
so we shouldn't have to do any weird header tricks.
Conrad, do you want to take a look at this?
Assignee: dougt → ccarlen
Got a patch working using UTF8 strings - pretty simple. As Alec pointed out,
UTF8-encoded strings are identical to char* paths in the ASCII range. That's
true for chars in the range 0x00-0x7F, but what about 0x80-0xFF? If a user had
already made a profile called "olé", the cache folder was saved in the prefs as
an old-style char string, and we just treat this string as UTF8-encoded data,
what would happen?
Status: NEW → ASSIGNED
well, if it was a multi-character string, it might become 
ol<some odd character> 
- but that weird character would probably not map to a real directory anyway 
(i.e. would be broken in the first place)
When we tried to decode that from UTF8, we'd probably get a bad string:
ol<different odd character>

but my assertion is that the original string was busted in the first place, so 
having a different busted string doesn't worsen the situation.
After some testing with the string "olé", I found this: with the current code,
it works. The 'é' is put into the string as 0xE9, gets read back in as such, and
everything is fine. With my first patch, if I just treat the string as a UTF8
string, the 'é' character screws up the UTF8 code and I get a bad file path.
That would be a regression. What I tried was this: I made a routine that
determines if a string is a valid UTF8 string and, if so, uses it as UTF8. If
not, assume it's an old persistent desc string with a non-ASCII char. It works.
Does anybody know of a public way to determine if a string is UTF8? nsString has
IsASCII(); if only it had IsUTF8(). Any other ideas?
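
A standalone sketch of the kind of check being asked for (illustrative only -
not the code that was eventually attached). It accepts any structurally legal
UTF-8 sequence, so it answers "could be UTF8" rather than "is UTF8":

  #include <cstddef>
  #include <cstdint>

  // Returns true if the bytes form a structurally legal UTF-8 sequence.
  // This sketch does not reject overlong forms or surrogates, and an
  // arbitrary byte stream can still pass by accident.
  bool CouldBeUTF8(const uint8_t* s, size_t len)
  {
      size_t i = 0;
      while (i < len) {
          uint8_t c = s[i];
          size_t trail;
          if (c < 0x80) { ++i; continue; }          // plain ASCII
          else if ((c & 0xE0) == 0xC0) trail = 1;   // 2-byte sequence
          else if ((c & 0xF0) == 0xE0) trail = 2;   // 3-byte sequence
          else if ((c & 0xF8) == 0xF0) trail = 3;   // 4-byte sequence
          else return false;                        // stray trail/bad lead
          if (i + trail >= len) return false;       // truncated sequence
          for (size_t j = 1; j <= trail; ++j)
              if ((s[i + j] & 0xC0) != 0x80) return false;  // bad trail byte
          i += trail + 1;
      }
      return true;
  }

Note that "olé" stored as a Latin-1 char string fails this check: the lone 0xE9
looks like a 3-byte lead with no continuation bytes, so such old descriptors
fall through to the legacy path.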
that also probably works currently because the é is still a single-byte 
character, but it has the 8th bit set, so it must be confusing the UTF8 
converter.

Anyhow, I think the IsUTF8() is a good idea - not sure where it belongs in the 
nsString family though. Maybe scc would know
Scott, can you take a look at the last comments about how good it would be to
have a way to know if a string was a valid UTF8 string? It needs to be in a
common place because it is needed in nsLocalFile(Win/Unix/OS2).cpp.
Hmmm.  This is kind of hard.  First, of course, since this is an encoding issue,
you wouldn't make such a function a member function of strings.  You'd have a
non-member function, just like |IsASCII| in "nsReadableUtils.h".  |IsASCII| can
be fooled by any string that just doesn't use the high bit in any contained
byte.  So perhaps it would be better named, e.g., |CouldBeASCII|.

Now with UTF-8, you can detect the BOM if present (a check is sketched after
this comment), or an illegal sequence, but for the most part an arbitrary byte
stream could easily be legally but incorrectly interpreted as UTF-8.  The fact
that a particular string already is UTF-8 encoded is something you want to know
up front and remember during the life of the string.  If I give you a routine
that alleges (falsely, I hope you realize) to discriminate UTF-8 data ... that
might encourage people _not_ to remember.  See

  <http://www.unicode.org/unicode/faq/utf_bom.html>

for a better understanding of UTF-8 encoding.

On a distantly related note, some time in the future, I may add a string class
that offers the interface of a wide string, but stores its data UTF-8 encoded.
 Current string classes have no intrinsic knowledge of the encoding you are
using in them.  They are more byte sequence managers than character sequence
managers.
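
For reference, the BOM check mentioned above is trivial; a UTF-8 BOM is the byte
sequence EF BB BF at the start of the data (standalone sketch, hypothetical
name):

  #include <cstddef>

  // True if the buffer begins with the UTF-8 byte-order mark EF BB BF.
  // Finding one proves the writer's intent - which is exactly the point
  // about recording the encoding up front instead of guessing later.
  bool HasUTF8BOM(const unsigned char* s, size_t len)
  {
      return len >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF;
  }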
Scott, check the patch I posted last night. It does just what you suggest in the
place you suggest - nsReadableUtils. You are right though - it should be called
CouldBeUTF8.
I have two problems with this patch.  One is the repeated use of the pattern

  nsCRT::strdup(NS_ConvertUCS2toUTF8(uc2Path))

which does twice as much allocation as it should.  You were in
"nsReadableUtils.h"; you should have seen |ToNewUTF8String| :-) (the two forms
are contrasted in the sketch after this comment)

The second thing is that all your logic can be fooled by strings that aren't
actually UTF8 but look like it.  This is different than the |IsASCII| case,
because |IsASCII| really asks (contrary to my earlier comment) "could I throw
away the top byte of every character in this string without data loss?"  The
patch under consideration blurs encoding.  The policy it represents is "the
caller might have given me anything".  Here are some clearer possibilities:

  (a) caller promises me ASCII
  (b) caller promises me UTF8 (degenerates to ASCII)
  (c) caller promises me 16 bit Unicode
  (d) caller communicates the encoding, e.g., by selecting from
        |SetPersistentDescriptorFromASCII|,
        |SetPersistentDescriptorFromUTF8|, and
        |SetPersistentDescriptorFromUCS2|
        or else with an encoding argument

For us, either (b) or (d) is probably an acceptable answer; I just don't think we
can or should try to guess the encoding.  Somebody knew it when they originally
made the string.  Let's not throw away that knowledge.  If that means adding a
BOM when we put the string in the registry, fine ... let's do it and make that
part of the API.
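
To contrast the two allocation patterns described above - the names are from
this thread, but the exact signatures are recalled from the Mozilla tree of that
era, so treat them as assumptions:

  #include "nsCRT.h"            // nsCRT::strdup
  #include "nsReadableUtils.h"  // ToNewUTF8String
  #include "nsString.h"

  char* MakeDescriptor(const nsAString& uc2Path)
  {
      // Double allocation: NS_ConvertUCS2toUTF8 builds a temporary
      // UTF-8 buffer which nsCRT::strdup then copies a second time.
      //   return nsCRT::strdup(NS_ConvertUCS2toUTF8(uc2Path).get());

      // Single allocation: convert straight into a new char* owned by
      // the caller (freed with nsMemory::Free).
      return ToNewUTF8String(uc2Path);
  }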
I think having a header is the way to go. That was my original idea if you read
the bug from the beginning. (d) is no good. The persistent descriptor data is
supposed to be opaque. The caller should not know its encoding or even know it's
encoded. Look at the comments I added to nsILocalFile.idl.
I've made two different solutions to this, but now the question is: does it need
to be fixed in the first place? As it is right now, the persistent descriptor
does work with i18n. It's a char* string, and if that string contains non-ASCII
characters, that's fine. After all, the path in nsIFile is stored internally as
a char* string. So the only time it can be a problem is when we write out a desc
and then the user changes the file system charset. How often is that going to
happen? The two solutions I came up with have compatibility problems (solution
1's read path is sketched after this comment):
1) Write it out as a UTF-8 string and when reading it, check to see if the
string is valid UTF-8 and if so, convert to Unicode and then use
InitWithUnicodePath(). Because of the check for a valid UTF-8 string, we can
still read old descs which contain non-ASCII characters - they will fail the
IsUTF-8 test. However, if we write out a UTF-8 desc with a non-ASCII character,
it will choke an older program.
2) Use a header on the string so its encoding is not in question. The problem
with this is, because of the header, it will choke an older program - no matter
what the contents of the string.
Basically, I think that writing out a desc which will choke an old program is
not good. Also, since the only problem this could solve (the user switching file
system charsets) is, IMO, very obscure, does fixing this do more harm than not?
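
For clarity, solution 1's read side, reusing the CouldBeUTF8() sketch from
earlier (hypothetical names - this is not the patch itself):

  #include <cstdint>
  #include <string>

  bool CouldBeUTF8(const uint8_t* s, size_t len);  // sketch from earlier

  // Valid UTF-8 gets converted to Unicode for InitWithUnicodePath();
  // anything else (e.g. "ol\xE9" in Latin-1) is treated as an old
  // native-charset descriptor and goes through InitWithPath(). The
  // remaining problem is the write side: UTF-8 descs with non-ASCII
  // bytes still choke older builds.
  bool DescriptorIsUTF8(const std::string& desc)
  {
      return CouldBeUTF8(reinterpret_cast<const uint8_t*>(desc.data()),
                         desc.size());
  }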
so I guess I'm confused - the persistent descriptor string is in fact encoded in
some kind of i18n friendly manner? If I store a Japanese filepath in an nsIFile,
write it out to disk with the persistentDescriptor, then read it back in, it will
be initialized back to the same file?

This is good but I would still like to see it in UTF8 because at some point
we're going to be dealing with Location Independence, in which we share prefs
between platforms.. all the other i18n prefs (i.e. the non-path ones) are stored
in UTF8 just for this reason.

If we're i18n friendly right now, this bug isn't as urgent as it once was.. but
realize we'll have to revisit it for LI...
I think that if you take a Japanese filepath, write it out, then read it back
in, it will be initialized to the same file. Not having a Japanese system to try
it on, I can't say for sure, but given that the path in nsIFile is stored
internally as a char* path and SetPersistentDesc currently calls InitWithPath, I
think it would work. I was able to verify that it works with non-ASCII
single-byte chars. Can somebody with such a system test this?
I think that the desc should be UTF-8 but cannot think of how to do this with
perfect backward/forward compatibility. Ideas/opinions on the compatibility issue?
This turned out not to be the problem that it was once thought to be. Can't
quite close it as invalid though. What we really need, as far as persistent
descriptors are concerned, is XP (cross-platform) relative persistent descs.
They would be relative to a location that could be determined at runtime, and
they would use UTF-8 for the path portion.
Target Milestone: --- → mozilla1.0
Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1 
(you can query for this string to delete spam or retrieve the list of bugs I've 
moved)
Target Milestone: mozilla1.0 → mozilla1.0.1
Component: XP Utilities → Networking: File
The character set of the persistent desc can't really be changed at this point
because of backward compatibility.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → WONTFIX