Bug 1571672 Comment 41 Edit History

Well, trying to get this to work has not been fun. With no offline store, when I open a message with multiple attachments and the folder name contains one or more non-ASCII chars such as ä, the first URL is OK and the whole message is fetched and stored in the RAM cache. Then each attachment (part) is read from the cache and displayed inline. Then another URL is issued to fetch the whole message again (not sure why), but this time the URL does not contain the proper 2-byte UTF-8 sequence for ä. Instead it contains the "replacement character" sequence (0xef, 0xbf, 0xbd), which displays as a question mark inside a diamond. The second fetch with the bad URL often causes a hang since it can't get a protocol connection.
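
To illustrate where a sequence like 0xef, 0xbf, 0xbd can come from, here is a minimal sketch in plain standard C++ of a lenient decoder that swaps every byte that is not part of well-formed UTF-8 for U+FFFD. The names `Utf8SeqLen` and `ReplaceInvalidUtf8` are made up for this example and are not Mozilla APIs. Whether the URL code actually takes a path like this is an assumption; it only shows that a lone 0xe4 byte is enough to produce the replacement sequence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length (1..4) of the well-formed UTF-8 sequence that
// starts at s[i], or 0 if the bytes there are not valid UTF-8.
static std::size_t Utf8SeqLen(const std::string& s, std::size_t i) {
  // Byte at index k, or 0x100 (never a valid byte) when k is out of range.
  auto b = [&](std::size_t k) -> unsigned {
    return k < s.size() ? static_cast<unsigned char>(s[k]) : 0x100u;
  };
  // True if the byte at k is a continuation byte (10xxxxxx).
  auto cont = [&](std::size_t k) { return (b(k) & 0xC0u) == 0x80u; };

  unsigned c = b(i);
  if (c < 0x80) return 1;                                   // ASCII
  if (c >= 0xC2 && c <= 0xDF) return cont(i + 1) ? 2 : 0;   // 2-byte lead
  if (c >= 0xE0 && c <= 0xEF) {                             // 3-byte lead
    if (!cont(i + 1) || !cont(i + 2)) return 0;
    if (c == 0xE0 && b(i + 1) < 0xA0) return 0;  // overlong encoding
    if (c == 0xED && b(i + 1) > 0x9F) return 0;  // UTF-16 surrogate range
    return 3;
  }
  if (c >= 0xF0 && c <= 0xF4) {                             // 4-byte lead
    if (!cont(i + 1) || !cont(i + 2) || !cont(i + 3)) return 0;
    if (c == 0xF0 && b(i + 1) < 0x90) return 0;  // overlong encoding
    if (c == 0xF4 && b(i + 1) > 0x8F) return 0;  // above U+10FFFF
    return 4;
  }
  return 0;  // stray continuation byte or invalid lead byte
}

// Hypothetical helper: copy valid UTF-8 through unchanged and replace each
// invalid byte with U+FFFD, whose UTF-8 encoding is 0xEF 0xBF 0xBD.
std::string ReplaceInvalidUtf8(const std::string& in) {
  std::string out;
  for (std::size_t i = 0; i < in.size();) {
    std::size_t n = Utf8SeqLen(in, i);
    if (n == 0) {
      out += "\xEF\xBF\xBD";
      ++i;
    } else {
      out.append(in, i, n);
      i += n;
    }
  }
  return out;
}
```

With this sketch, a folder name carrying ä as the single byte 0xe4 comes out with 0xef, 0xbf, 0xbd in its place, while the proper 0xc3, 0xa4 form passes through untouched.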

I finally found a way to fix this. The bad UTF-8 was caused by the ä being encoded as the single byte 0xe4 instead of the UTF-8 sequence 0xc3, 0xa4. Converting from ASCII to UTF-16 (in effect treating each byte as one character) and then back to UTF-8 fixes it. However, it only works when the character in the folder name can be represented as a 2-byte "latin" UTF-8 sequence. If you want a "non-latin" char that needs 3 or more UTF-8 bytes in the folder name, like ﾤ (U+FFA4, or 0xef, 0xbe, 0xa4), it won't work.
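
The byte arithmetic behind that conversion can be sketched in standard C++ as follows. This is only an illustration, assuming every input byte is meant as an ISO-8859-1 code point; `Latin1ToUtf8` is a made-up name, not the call used in the patch.

```cpp
#include <string>

// Hypothetical helper (not a Mozilla API): treat every input byte as an
// ISO-8859-1 code point (U+0000..U+00FF) and emit its UTF-8 encoding.
std::string Latin1ToUtf8(const std::string& in) {
  std::string out;
  out.reserve(in.size() * 2);
  for (unsigned char c : in) {
    if (c < 0x80) {
      // ASCII is encoded identically in both.
      out.push_back(static_cast<char>(c));
    } else {
      // U+0080..U+00FF needs two bytes: 110000xx 10xxxxxx.
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return out;
}
```

For 0xe4 this gives 0xc0 | 0x03 = 0xc3 and 0x80 | 0x24 = 0xa4, which is the proper UTF-8 for ä. Since the input is one byte per character, nothing above U+00FF can ever come out of it, which matches the "latin only" limitation.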

With an offline store the same problem occurs. The messages are all stored OK in the mbox file with correctly encoded UTF-8 URLs. But when an individual message is accessed, the URL again has the invalid replacement characters. The same fix also works here, converting the URL from single-byte to multi-byte UTF-8, but again only for the "latin" chars.

The mystery is why the URL to access messages starts out OK, even with non-latin chars, but then gets "corrupted". (The corruption seems to be that the folder name starts out with proper UTF-8 encoding but then somehow gets changed to ISO-8859-1 style, a single byte per char. It's fixable for "latin" chars that can be represented as a single byte but not for non-latin chars that won't fit in a single byte.)
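
Why only the "latin" chars are recoverable can be shown with a small sketch of a one-byte-per-char narrowing. It assumes the corruption behaves like a plain ISO-8859-1 conversion, which is a guess, and `NarrowToLatin1` is a made-up name.

```cpp
#include <string>

// Hypothetical helper: store each code point in a single byte, as an
// ISO-8859-1 style string would. Returns false as soon as a code point
// above U+00FF is seen, since it cannot be held in one byte; `out` then
// contains only the characters narrowed so far.
bool NarrowToLatin1(const std::u32string& in, std::string& out) {
  out.clear();
  for (char32_t cp : in) {
    if (cp > 0xFF) return false;
    out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
  }
  return true;
}
```

Here ä (U+00E4) narrows to the single byte 0xe4 and can be widened back later, while U+FFA4 has no single-byte form at all, so whatever ends up in the URL for it cannot be converted back.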

Here's the diff fragment for the "fix" I'm referring to above:
```
diff --git a/mailnews/base/util/nsMsgDBFolder.cpp b/mailnews/base/util/nsMsgDBFolder.cpp
--- a/mailnews/base/util/nsMsgDBFolder.cpp
+++ b/mailnews/base/util/nsMsgDBFolder.cpp
@@ -2843,17 +2843,25 @@ nsresult nsMsgDBFolder::parseURI(bool ne
     nsAutoCString fileName;
     nsAutoCString escapedFileName;
     url->GetFileName(escapedFileName);
     if (!escapedFileName.IsEmpty()) {
       // XXX conversion to unicode here? is fileName in UTF8?
       // yes, let's say it is in utf8
       MsgUnescapeString(escapedFileName, 0, fileName);
       NS_ASSERTION(mozilla::IsUtf8(fileName), "fileName is not in UTF-8");
-      CopyUTF8toUTF16(fileName, mName);
+      if (!mozilla::IsUtf8(fileName))
+      {
+        static nsAutoString temp;
+        printf("gds: 1. fileName is not in UTF-8=%s\n", fileName.get()); 
+        CopyASCIItoUTF16(fileName, temp /*mName*/);
+        mName = temp;
+      }
+      else 
+        CopyUTF8toUTF16(fileName, mName);
     }
   }
 
   // grab the server by parsing the URI and looking it up
   // in the account manager...
   // But avoid this extra work by first asking the parent, if any
   nsCOMPtr<nsIMsgIncomingServer> server = do_QueryReferent(mServer, &rv);
   if (NS_FAILED(rv)) {
@@ -2888,16 +2896,23 @@ nsresult nsMsgDBFolder::parseURI(bool ne
   // now try to find the local path for this folder
   if (server) {
     nsAutoCString newPath;
     nsAutoCString escapedUrlPath;
     nsAutoCString urlPath;
     url->GetFilePath(escapedUrlPath);
     if (!escapedUrlPath.IsEmpty()) {
       MsgUnescapeString(escapedUrlPath, 0, urlPath);
+      if (!mozilla::IsUtf8(urlPath))
+      {
+        static nsAutoString temp;
+        printf("gds: 2. urlPath is not in UTF-8=%s\n", urlPath.get()); 
+        CopyASCIItoUTF16(urlPath, temp);
+        urlPath = NS_ConvertUTF16toUTF8(temp);
+      }
 
       // transform the filepath from the URI, such as
       // "/folder1/folder2/foldern"
       // to
       // "folder1.sbd/folder2.sbd/foldern"
       // (remove leading / and add .sbd to first n-1 folders)
       // to be appended onto the server's path
```