Closed Bug 57011 Opened 23 years ago Closed 21 years ago
File's Persistent Descriptor i18n friendly
When I worked on fixing the Japanese migration problem (bug 44764), I discovered that I couldn't use Get/SetPersistentDescriptor to migrate Japanese profiles (profiles or profile folders with Japanese characters in their names) properly. I had to use some Unicode representation, so I used the GetUnicodePath and InitWithUnicodePath APIs to move the Unicode data smoothly across the profile manager APIs. But that turned out to be a problem on Mac, because it changed the way we store the profile paths in the registry: we started storing paths instead of persistent strings, and Mac users are affected by that. On Windows and Linux it didn't really matter, as the paths are stored in the registry as paths and not as aliases (in persistent desc format). We should allow Get/SetPersistentDescriptor on all platforms to take some kind of persistent-desc-formatted data and then decode it properly inside the function. It looks like on Windows and Linux the persistent data is just a path. Maybe we need some kind of encoded format here which can be decoded as we process the input string. It looks like the Mac data is already base64-encoded in the persistent interfaces. We need a similar mechanism on Windows and Linux. That would allow us to use the APIs directly without having to worry about the platform and different input string formats. Conrad, please update the bug with some of the ideas you mentioned.
The problem with doing this is forwards/backwards compatibility. Persistent descriptors which have been written into files up to now on Windows and Linux just used ASCII paths. Those "descriptors", inadequate as they are, still need to be understood by SetPersistentDescriptor so we can read old prefs and such. The way we could do it is this: add a header onto the descriptor data which cannot possibly be the beginning of a path, for instance "/\/\/\", followed by the Unicode path data. SetPersistentDescriptor can then look for this header and, if found, decode the Unicode string which follows it. If it does not see this header, it knows that it is an old ASCII path. That's easy enough. The harder part is making the data returned by GetPersistentDescriptor usable by old versions of Mozilla. Do we need to worry about this? Since the current code expects this string to be a null-terminated ASCII string, we could make the data begin with a null-terminated ASCII path string. Old readers could still read this. Then, following the null-terminated string, would be the "/\/\/\" header followed by the Unicode string.
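A minimal sketch of the header idea described in this comment, for illustration only. The function names are hypothetical, and two things are assumed rather than stated in the comment: that the "Unicode path data" is stored as UTF-8, and that Latin-1 stands in for whatever charset old descriptors were written in.

```python
# Marker proposed in the comment: the six characters /\/\/\
HEADER = b"/\\/\\/\\"


def make_descriptor(path: str) -> bytes:
    """Build a new-style descriptor: marker followed by the encoded path.

    Assumption: the Unicode path data is stored as UTF-8.
    """
    return HEADER + path.encode("utf-8")


def read_descriptor(data: bytes, legacy_charset: str = "latin-1") -> str:
    """Decode a descriptor, accepting both new-style and old-style data."""
    if data.startswith(HEADER):
        # New-style: strip the marker and decode the Unicode payload.
        return data[len(HEADER):].decode("utf-8")
    # Old-style: no marker, so the bytes are a plain path in the old charset.
    return data.decode(legacy_charset)
```

With this layout a new reader handles both kinds of data, but an old reader would see the marker as part of the path, which is the compatibility concern raised in the second half of the comment.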
It won't break backwards compatibility, because: 1) UTF8-encoded ASCII strings are identical to the existing ASCII strings, and 2) any ASCII versions of i18n filenames are already pointing to non-existent paths as it is. So we shouldn't have to do any weird header tricks.
Conrad, do you want to take a look at this?
Assignee: dougt → ccarlen
Got a patch working using UTF8 strings - pretty simple. As Alec pointed out UTF8-encoded strings are identical to char* paths in the ASCII range. Is this true for chars in the range 0-7F, but not from 80-FF? If a user had already made a profile called "olé" and the cache folder was saved in the prefs as an old-style char string and we just treat this string as UTF8-encoded data, what would happen?
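The question about the 0-7F and 80-FF ranges can be checked directly. The sketch below assumes Latin-1 as the single-byte charset of the old-style string; the variable names are illustrative.

```python
# Bytes in the range 0x00-0x7F: ASCII and UTF-8 produce identical output.
ascii_path = "/home/user/profile"
same_in_ascii_range = ascii_path.encode("utf-8") == ascii_path.encode("ascii")

# Bytes in the range 0x80-0xFF: the encodings diverge.
name = "olé"
legacy_bytes = name.encode("latin-1")  # 'é' is the single byte 0xE9
utf8_bytes = name.encode("utf-8")      # 'é' is the two bytes 0xC3 0xA9

print(same_in_ascii_range)
print(legacy_bytes.hex(" "))
print(utf8_bytes.hex(" "))
```

So the identity holds only for 0-7F. An old-style "olé" string carries a lone 0xE9 byte, which is not what a UTF-8 reader expects to find.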
Status: NEW → ASSIGNED
Well, if it was a multi-character string, it might become ol<some odd character>, but that weird character would probably not map to a real directory anyway (i.e. it would be broken in the first place). When we tried to decode that from UTF8, we'd probably get a bad string: ol<different odd character>. But my assertion is that the original string was busted in the first place, so having a different busted string doesn't worsen the situation.
After some testing with the string "olé", I found this: With the current code, it works. The 'é' is put into the string as 0xE9 and gets read back in as such and everything is fine. With my first patch, if I just treat the string as a UTF8 string, the 'é' character screws up the UTF8 code, and I get a bad file path. That would be a regression. What I tried was this: I made a routine that determines if a string is a valid UTF8 string and, if so, uses it as UTF8. If not, it assumes it's an old persistent desc string with a non-ASCII char. It works. Does anybody know of a public way to determine if a string is UTF8? nsString has IsASCII(). If only it had IsUTF8(). Any other ideas?
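The validate-then-fall-back approach described in this comment might look roughly like the following. This is not the patch itself: the function names are hypothetical, and Latin-1 is assumed as the charset of old descriptors.

```python
def could_be_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def read_path_guessing(data: bytes, legacy_charset: str = "latin-1") -> str:
    """Use the data as UTF-8 when it validates, else as an old-style string."""
    if could_be_utf8(data):
        return data.decode("utf-8")
    # Not well-formed UTF-8: assume an old descriptor with a non-ASCII char.
    return data.decode(legacy_charset)
```

For the "olé" case both forms come back intact: the old bytes 6F 6C E9 fail validation and take the fallback, while the UTF-8 bytes 6F 6C C3 A9 validate and are decoded as UTF-8.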
That also probably works currently because the é is still a single-byte character, but it has the 8th bit set, so it must be confusing the UTF8 converter. Anyhow, I think IsUTF8() is a good idea - not sure where it belongs in the nsString family though. Maybe scc would know.
Scott, can you take a look at the last comments about how good it would be to have a way to know if a string was a valid UTF8 string? It needs to be in a common place because it is needed in nsLocalFile(Win/Unix/OS2).cpp.
Hmmm. This is kind of hard. First, of course, since this is an encoding issue, you wouldn't make such a function a member function of strings. You'd have a non-member function, just like |IsASCII| in "nsReadableUtils.h". |IsASCII| can be fooled by any string that just doesn't use the high bit in any contained byte. So perhaps it would be better named, e.g., |CouldBeASCII|. Now with UTF-8, you can detect the BOM if present, or an illegal sequence, but for the most part, an arbitrary byte stream could easily be legally but incorrectly interpreted as UTF-8. The fact that a particular string already is UTF-8 encoded is something you want to know up front and remember during the life of the string. If I give you a routine that alleges (falsely, I hope you realize) to discriminate UTF-8 data ... that might encourage people _not_ to remember. See <http://www.unicode.org/unicode/faq/utf_bom.html> for a better understanding of UTF-8 encoding. On a distantly related note, some time in the future, I may add a string class that offers the interface of a wide string, but stores its data UTF-8 encoded. Current string classes have no intrinsic knowledge of the encoding you are using in them. They are more byte sequence managers than character sequence managers.
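A small illustration of the "legally but incorrectly interpreted" point, assuming Latin-1 as the original encoding. The folder name is contrived and the helper names are hypothetical.

```python
import codecs


def could_be_utf8(data: bytes) -> bool:
    """True if the bytes are well-formed UTF-8 (which proves nothing more)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def has_utf8_bom(data: bytes) -> bool:
    """A BOM, when present, is an explicit signal rather than a guess."""
    return data.startswith(codecs.BOM_UTF8)


# A folder really named "Ã©" on a Latin-1 system is stored as C3 A9 ...
legacy_name = "Ã©"
legacy_bytes = legacy_name.encode("latin-1")

# ... and those two bytes are also the well-formed UTF-8 encoding of "é".
misread = legacy_bytes.decode("utf-8")
```

Here `could_be_utf8(legacy_bytes)` is true even though the data was never UTF-8, so a validity check alone would silently turn "Ã©" into "é".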
Scott, check the patch I posted last night. It does just what you suggest in the place you suggest - nsReadableUtils. You are right though - it should be called CouldBeUTF8.
I have two problems with this patch. One is the repeated use of the pattern nsCRT::strdup(NS_ConvertUCS2toUTF8(uc2Path)) which does twice as much allocation as it should. You were in "nsReadableUtils.h", you should have seen |ToNewUTF8String| :-) The second thing is that all your logic can be fooled by strings that aren't actually UTF8 but look like it. This is different than the |IsASCII| case, because |IsASCII| really asks (contrary to my earlier comment) "could I throw away the top byte of every character in this string without data loss?" The patch under consideration blurs encoding. The policy it represents is "the caller might have given me anything". Here are some clearer possibilities: (a) caller promises me ASCII (b) caller promises me UTF8 (degenerates to ASCII) (c) caller promises me 16 bit Unicode (d) caller communicates the encoding, e.g., by selecting from |SetPersistentDescriptorFromASCII|, |SetPersistentDescriptorFromUTF8|, and |SetPersistentDescriptorFromUCS2| or else with an encoding argument For us, either (b) or (d) are probably acceptable answers; I just don't think we can or should try to guess the encoding. Somebody knew it when they originally made the string. Let's not throw away that knowledge. If that means adding a BOM when we put the string in the registry, fine ... let's do it and make that part of the API.
I think having a header is the way to go. That was my original idea if you read the bug from the beginning. (d) is no good. The persistent descriptor data is supposed to be opaque. The caller should not know its encoding or even know it's encoded. Look at the comments I added to nsILocalFile.idl.
I've made two different solutions to this but now the question is: Does it need to be fixed in the first place? As it is right now, the persistent descriptor does work with i18n. It's a char* string and if that string contains non-ASCII characters, it's fine. After all, the path in nsIFile is stored internally as a char* string. So, the only time it can be a problem is when we write out a desc, and then the user changes the file system char set. How often is that going to happen? The two solutions I came up with have compatibility problems: 1) Write it out as a UTF-8 string and when reading it, check to see if the string is valid UTF-8 and if so, convert to Unicode and then use InitWithUnicodePath(). Because of the check for a valid UTF-8 string, we can still read old descs which contain non-ASCII characters - they will fail the valid-UTF-8 test. However, if we write out a UTF-8 desc with a non-ASCII character, it will choke an older program. 2) Use a header on the string so its encoding is not in question. The problem with this is, because of the header, it will choke an older program - no matter what the contents of the string. Basically, I think that writing out a desc which will choke an old program is not good. Also, since the only problem this could solve (user switching file system char set) is, IMO, very obscure, does it do more harm to fix this than not?
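To make the two compatibility problems concrete, here is a rough simulation of an older program reading descriptors written by each solution. `old_reader` is a hypothetical stand-in for code that treats the descriptor as a plain path in the platform charset (Latin-1 is assumed), and the header bytes are the marker suggested earlier in this bug.

```python
HEADER = b"/\\/\\/\\"  # marker suggested earlier in the thread


def old_reader(data: bytes, charset: str = "latin-1") -> str:
    """Hypothetical old program: the descriptor bytes are simply the path."""
    return data.decode(charset)


# Solution 1: plain UTF-8. Pure-ASCII paths survive, non-ASCII ones do not.
utf8_ascii = "/profiles/default".encode("utf-8")
utf8_non_ascii = "/profiles/olé".encode("utf-8")

# Solution 2: header plus UTF-8. Even a pure-ASCII path is altered.
with_header = HEADER + "/profiles/default".encode("utf-8")

print(old_reader(utf8_ascii))
print(old_reader(utf8_non_ascii))
print(old_reader(with_header))
```

Under these assumptions the old reader gets the right path only in the first case. The second yields a mangled name and the third yields a path that begins with the marker.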
So I guess I'm confused - the persistent descriptor string is in fact encoded in some kind of i18n-friendly manner? If I store a Japanese filepath in an nsIFile, write it out to disk with the persistentDescriptor, then read it back in, it will be initialized back to the same file? This is good, but I would still like to see it in UTF8 because at some point we're going to be dealing with Location Independence, in which we share prefs between platforms. All the other i18n prefs (i.e. the non-path ones) are stored in UTF8 just for this reason. If we're i18n friendly right now, this bug isn't as urgent as it once was, but realize we'll have to revisit it for LI.
I think that if you take a Japanese filepath, write it out, then read it back in, it will be initialized to the same file. Not having a Japanese system to try it on, I can't say for sure, but given that the path in nsIFile is stored internally as a char* path and SetPersistentDesc currently calls InitWithPath, I think it would work. I was able to verify that it works with non-ASCII single-byte chars. Can somebody with such a system test this? I think that the desc should be UTF-8 but cannot think of how to do this with perfect backward/forward compatibility. Ideas/opinions on the compatibility issue?
This turned out not to be the problem that it was once thought to be. Can't quite close it as invalid though. What we really need as far as persistent descriptors are concerned, are XP relative persistent descs. They would be relative to a location that could be determined at runtime, and they would use UTF-8 for the path portion.
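One way to picture an XP relative persistent descriptor is sketched below. The "key:relative/path" layout, the function names, and the base keys are all invented for illustration; the comment only says that the descriptor would be relative to a location determined at runtime and would use UTF-8 for the path portion.

```python
from pathlib import PurePosixPath


def make_relative_descriptor(path: str, base_key: str, bases: dict) -> bytes:
    """Encode a path as '<base key>:<relative path>' in UTF-8."""
    relative = PurePosixPath(path).relative_to(bases[base_key])
    return f"{base_key}:{relative.as_posix()}".encode("utf-8")


def resolve_relative_descriptor(data: bytes, bases: dict) -> PurePosixPath:
    """Resolve a descriptor against the base locations known at runtime."""
    key, _, relative = data.decode("utf-8").partition(":")
    return PurePosixPath(bases[key]) / relative


# The machine that writes the descriptor and the one that reads it may map
# the same key to different absolute locations.
writer_bases = {"profile": "/home/user/.mozilla/プロファイル"}
reader_bases = {"profile": "/Users/user/Profiles/default"}

desc = make_relative_descriptor(
    "/home/user/.mozilla/プロファイル/Cache", "profile", writer_bases
)
resolved = resolve_relative_descriptor(desc, reader_bases)
```

Because only the portion below the base location is stored, the non-ASCII profile directory name never enters the descriptor in this example, and whatever non-ASCII text does remain is unambiguously UTF-8.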
Target Milestone: --- → mozilla1.0
Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1 (you can query for this string to delete spam or retrieve the list of bugs I've moved)
Target Milestone: mozilla1.0 → mozilla1.0.1
The character set of the persistent desc can't really be changed at this point because of backward compatibility.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → WONTFIX