Closed Bug 14349 Opened 21 years ago Closed 20 years ago

[FEATURE] implement migration tool to convert 4.x prefs to UTF8

Categories

(Core :: Internationalization, defect, P1)

All
Windows NT
defect

Tracking

VERIFIED FIXED

People

(Reporter: neeti, Assigned: sspitzer)

References

Details

(Whiteboard: [PDT+] eta 2-17-00)

Need to implement a migration tool to convert all 4.x prefs to UTF-8.
Blocks: 5130
Assignee: ftang → neeti
neeti, who owns the migration tool?
I can provide the function calls needed to support the migration tool, but I don't
think I should own the tool.

Since the text in a pref could be in the native encoding or in UTF-8, we need the
following algorithm in the migration tool:
1. Read in one line of pref text as char*.
2. Call the i18ngrp support routine IsUTF8String.
3. If IsUTF8String returns true, the pref is already in UTF-8. Save that pref as
is to the new pref.
4. If IsUTF8String returns false, that line is in the native
encoding. Use the nsIPlatformCharset API to get the charset.
5. Then get an nsIUnicodeDecoder for that charset to decode the char* to PRUnichar*. Then you
can use nsString::ToNewUTF8String to generate the UTF-8 and save it into the new 5.0
pref.
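
The five steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the Mozilla implementation: LooksLikeUTF8 is a simplified stand-in for IsUTF8String, Latin1ToUTF8 stands in for the nsIPlatformCharset plus nsIUnicodeDecoder steps by assuming the platform charset is ISO-8859-1, and MigratePrefValue is a hypothetical name for the per-value decision.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, simplified stand-in for IsUTF8String: checks only that each
// lead byte is followed by the right number of 10xxxxxx trail bytes
// (1- to 4-byte sequences). It does not reject overlong forms or surrogates.
static bool LooksLikeUTF8(const std::string& s) {
  std::size_t i = 0;
  while (i < s.size()) {
    unsigned char c = static_cast<unsigned char>(s[i]);
    std::size_t len;
    if (c < 0x80) len = 1;
    else if ((c & 0xE0) == 0xC0) len = 2;
    else if ((c & 0xF0) == 0xE0) len = 3;
    else if ((c & 0xF8) == 0xF0) len = 4;
    else return false;                     // stray trail byte or invalid lead
    if (i + len > s.size()) return false;  // truncated sequence
    for (std::size_t j = 1; j < len; ++j) {
      if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80) return false;
    }
    i += len;
  }
  return true;
}

// Assumption: the platform charset is ISO-8859-1, where each byte value is
// the code point itself. Real code would query the platform charset instead.
static std::string Latin1ToUTF8(const std::string& s) {
  std::string out;
  for (char ch : s) {
    unsigned char c = static_cast<unsigned char>(ch);
    if (c < 0x80) {
      out += static_cast<char>(c);
    } else {
      out += static_cast<char>(0xC0 | (c >> 6));
      out += static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  return out;
}

// Steps 1-5: keep the value if it is already UTF-8, otherwise convert it.
std::string MigratePrefValue(const std::string& value) {
  assert(value.find('\0') == std::string::npos);  // pref values are C strings
  return LooksLikeUTF8(value) ? value : Latin1ToUTF8(value);
}
```

A value that is pure ASCII or already UTF-8 passes through unchanged, so running the migration twice on the same value should be harmless under these assumptions.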

Reassigning this back to Neeti since I do not own the tool. When do we need these
APIs?

All the needed functions are already in the code, except IsUTF8String, whose
code I will attach to this bug report later.
Here is the IsUTF8String function

#define kLeft1BitMask  0x80
#define kLeft2BitsMask 0xC0
#define kLeft3BitsMask 0xE0
#define kLeft4BitsMask 0xF0
#define kLeft5BitsMask 0xF8
#define kLeft6BitsMask 0xFC
#define kLeft7BitsMask 0xFE

#define k2BytesLeadByte kLeft2BitsMask
#define k3BytesLeadByte kLeft3BitsMask
#define k4BytesLeadByte kLeft4BitsMask
#define k5BytesLeadByte kLeft5BitsMask
#define k6BytesLeadByte kLeft6BitsMask
#define kTrialByte      kLeft1BitMask

#define UTF8_1Byte(c) ( 0 == ((c) & kLeft1BitMask))
#define UTF8_2Bytes(c) ( k2BytesLeadByte == ((c) & kLeft3BitsMask))
#define UTF8_3Bytes(c) ( k3BytesLeadByte == ((c) & kLeft4BitsMask))
#define UTF8_4Bytes(c) ( k4BytesLeadByte == ((c) & kLeft5BitsMask))
#define UTF8_5Bytes(c) ( k5BytesLeadByte == ((c) & kLeft6BitsMask))
#define UTF8_6Bytes(c) ( k6BytesLeadByte == ((c) & kLeft7BitsMask))
#define UTF8_ValidTrialByte(c) ( kTrialByte == ((c) & kLeft2BitsMask))


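/* Forward declaration: IsUTF8String calls IsUTF8Text before its definition. */
PRBool IsUTF8Text(const char* utf8, int32 len);

/* Returns TRUE if the NUL-terminated string is well-formed UTF-8.
   A NULL pointer is treated as valid. */
PRBool IsUTF8String(const char* utf8)
{
        if(NULL == utf8)
                return TRUE;
        return IsUTF8Text(utf8, strlen(utf8));
}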
PRBool IsUTF8Text(const char* utf8, int32 len)
{
   int32 i;
   int32 j;
   int32 clen;
   for(i =0; i < len; i += clen)
   {
      if(UTF8_1Byte(utf8[i]))
      {
        clen = 1;
      } else if(UTF8_2Bytes(utf8[i])) {
        clen = 2;
        /* Not enough trail bytes */
        if( (i + clen) > len)
          return FALSE;
        /* 0000 0000 - 0000 007F : should be encoded in fewer bytes */
        if(0 ==  (utf8[i] & 0x1E ))
          return FALSE;
      } else if(UTF8_3Bytes(utf8[i])) {
        clen = 3;
        /* Not enough trail bytes */
        if( (i + clen) > len)
          return FALSE;
        /* A single surrogate should not appear in 3-byte UTF-8; instead, the
           pair should be interpreted as one UCS-4 character and encoded as
           4-byte UTF-8. The cast keeps the comparison correct where plain
           char is signed; surrogates are ED A0..BF, hence the 0xE0 mask. */
        if((0xED == (unsigned char)utf8[i] ) && (0xA0 ==  (utf8[i+1] & 0xE0 ) ))
          return FALSE;
        /* 0000 0000 - 0000 07FF : should be encoded in fewer bytes */
        if((0 ==  (utf8[i] & 0x0F )) && (0 ==  (utf8[i+1] & 0x20 ) ))
          return FALSE;
      } else if(UTF8_4Bytes(utf8[i])) {
        clen = 4;
        /* Not enough trail bytes */
        if( (i + clen) > len)
          return FALSE;
        /* 0000 0000 - 0000 FFFF : should be encoded in fewer bytes */
        if((0 ==  (utf8[i] & 0x07 )) && (0 ==  (utf8[i+1] & 0x30 )) )
          return FALSE;
      } else if(UTF8_5Bytes(utf8[i])) {
        clen = 5;
        /* Not enough trail bytes */
        if( (i + clen) > len)
          return FALSE;
        /* 0000 0000 - 001F FFFF : should be encoded in fewer bytes */
        if((0 ==  (utf8[i] & 0x03 )) && (0 ==  (utf8[i+1] & 0x38 )) )
          return FALSE;
      } else if(UTF8_6Bytes(utf8[i])) {
        clen = 6;
        /* Not enough trail bytes */
        if( (i + clen) > len)
          return FALSE;
        /* 0000 0000 - 03FF FFFF : should be encoded in fewer bytes.
           Only the top four payload bits of the second byte (mask 0x3C)
           lie above 03FF FFFF. */
        if((0 ==  (utf8[i] & 0x01 )) && (0 ==  (utf8[i+1] & 0x3C )) )
          return FALSE;
      } else {
        return FALSE;
      }
      for(j = 1; j<clen ;j++)
      {
        if(! UTF8_ValidTrialByte(utf8[i+j])) /* Trail bytes invalid */
          return FALSE;
      }
   }
   return TRUE;
}
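
One portability detail matters whenever bytes from a const char* buffer are compared against constants such as 0xED: whether plain char is signed is implementation-defined, and where it is signed a byte above 0x7F is promoted to a negative int and never equals the constant. The sketch below shows the cast that sidesteps this. The helper names are hypothetical and the code is an illustration only, not part of the posted function.

```cpp
#include <cassert>

// Hypothetical helper: true if the byte at p is 0xED, the lead byte of the
// 3-byte sequences that would encode UTF-16 surrogates.
// The cast to unsigned char makes the comparison independent of whether
// plain char is signed on the target platform.
bool IsLeadByteED(const char* p) {
  assert(p != nullptr);
  return static_cast<unsigned char>(*p) == 0xED;
}

// True if the two bytes at p start an encoded surrogate (ED A0..BF ...).
// Assumes p points at a NUL-terminated string, so p[1] is readable
// whenever p[0] is 0xED.
bool StartsEncodedSurrogate(const char* p) {
  return IsLeadByteED(p) &&
         (static_cast<unsigned char>(p[1]) & 0xE0) == 0xA0;
}
```

The masked tests in the macros are not affected, because the bitwise AND with a small positive mask discards the sign-extended high bits; only direct equality comparisons need the cast.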
Frank, profile migration is being done by Don Bragg, and Steve Elmer's group
deals with profiles. I think the question of who is going to implement
the migration tool needs to be worked out between internationalization, Steve
Elmer and Don Bragg.
Frank, you need to write the migration portion of the tool and then it can hook
into Don's migrator.  None of us are signed up to create every migration tool
required, each team needs to fit their part into the framework we've created.
Assignee: neeti → ftang
Target Milestone: M12
Status: NEW → ASSIGNED
selmer: I understand that part; see my email for details. I need to know who is working on the migration tool in your team (so we can chat about engineering
details). Our team will do the text conversion of the pref file content to UTF-8, but it would be nice if we could understand how the code can fit into your
tool code. Let's chat over the phone.
Don, Seth, it sounds like Frank needs to hook in at your level.  Can one of you
help him with the details?
Assignee: ftang → bobj
Status: ASSIGNED → NEW
bobj: I have no bandwidth to handle this right now. Please find an owner for it.
Summary: implement migration tool to convert 4.x prefs to UTF8 → [BETA] implement migration tool to convert 4.x prefs to UTF8
Target Milestone: M12 → M13
Assignee: bobj → nhotta
Status: ASSIGNED → NEW
nhotta kindly volunteered to take this bug. Reassigning to him.
Status: NEW → ASSIGNED
Summary: [BETA] implement migration tool to convert 4.x prefs to UTF8 → [BETA][FEATURE] implement migration tool to convert 4.x prefs to UTF8
I have a couple of questions.
1) Where should IsUTF8String go? In mozilla/intl?
2) Where (in which directory) is the migration code?
I found that nsPrefMigration does the migration.
http://lxr.mozilla.org/seamonkey/source/profile/pref-migrator/src/nsPrefMigration.cpp
Here is what I can do. I can add member functions to
1) detect a UTF-8 string
2) get the platform charset
3) convert a string to UTF-8

Then I will reassign this to dougt or sspitzer (as they appear in CVS log).
Target Milestone: M13 → M14
I can probably check in my part next week, but it will take time before it is
actually used in the migration code. Moving to M14.
Assignee: nhotta → sspitzer
Status: ASSIGNED → NEW
Checked in the I18N functions; reassigning to sspitzer.
Here is the usage comment from the header file (nsPrefMigration.h).

      // I18N pref migration:
      //
      // 5.0 stores pref strings as UTF-8, while 4.x stores them in either the
      // platform charset or UTF-8, depending on the pref.
      // The functions here provide two possible ways to deal with the I18N
      // migration.
      //
      // 1) Use the knowledge of which 4.x pref strings are in the platform charset.
      // If PrefStringNeedsCharsetConversion() returns true, then the string is to
      // be converted to UTF-8.
      //
      // 2) Apply UTF-8 detection to all string prefs. Apply the conversion if
      // the string is not detected as UTF-8.
      //
      // The user of the functions needs to decide between 1) and 2).
      // Functions to get the platform charset and to convert a string to
      // UTF-8 are also provided.
      //
selmer, is this something your team can handle?

I don't think this is a mail-news specific bug, and I'm overloaded.
Seth, why isn't nhotta finishing the implementation?  It's certainly not mail
specific...
I do not have knowledge of pref migration. All I could do was provide the I18N
functions (Unicode conversion etc.); see my comment on 2000-01-05 15:05.
I assigned this to sspitzer because he and dougt are the owners of the source files
according to CVS.
Basically, the I18N group provides I18N functions and consultation, and it's each
group's responsibility to support I18N features. I agree this is not a mail/news
issue, so sspitzer is not the person to resolve this.
Seth, please reassign this to the person who is responsible for the pref migration
tool. We can help with implementing I18N support.
Status: NEW → ASSIGNED
accepting for now, but I may re-assign.
Priority: P3 → P2
Priority: P2 → P1
this is migration related, so marking p1.
Put beta1 as this is listed in the beta1 criteria.
Keywords: beta1
Summary: [BETA][FEATURE] implement migration tool to convert 4.x prefs to UTF8 → [FEATURE] implement migration tool to convert 4.x prefs to UTF8
Whiteboard: [PDT+]
it was suggested by nhotta that I call PrefStringNeedsCharsetConversion() for
each prefname, and if that returned true, check if the pref was in utf-8 (by
calling IsUTF8String()), and if not, I'd convert that pref to utf-8 by using
GetPlatformCharset() and ConvertStringToUTF8().

that seems like a waste.

looking at PrefStringNeedsCharsetConversion() it looks like he plans on listing
all the prefs that may have been stored in the platform charset.

I'm going to just do this:

if we migrate a profile, after ReadUserPrefs(), I'll call ConvertPrefsToUTF8()
which will go through a list of prefs (currently the list is in
PrefStringNeedsCharsetConversion()) and then migrate those prefs from the
platform charset to utf8, if necessary.

this way, we will only migrate the prefs we know we need to.

I'll get rid of PrefStringNeedsCharsetConversion()
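
The list-driven plan described above could look roughly like this. It is a sketch under stated assumptions, not the code that was checked in: prefs are modelled as a std::map instead of nsIPref, the converter is passed in as a function pointer standing in for the platform-charset-to-UTF-8 step, and ConvertListedPrefsToUTF8, PrefMap and Converter are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Assumption: prefs are modelled as a simple name -> value map.
using PrefMap = std::map<std::string, std::string>;

// Hypothetical converter type: takes a value in the platform charset and
// returns UTF-8, returning the input unchanged if it is already UTF-8.
using Converter = std::string (*)(const std::string&);

// Convert only the prefs named in `names`; every other pref is left alone.
// Returns how many pref values were actually rewritten.
int ConvertListedPrefsToUTF8(PrefMap& prefs,
                             const std::vector<std::string>& names,
                             Converter toUTF8) {
  assert(toUTF8 != nullptr);
  int converted = 0;
  for (const std::string& name : names) {
    PrefMap::iterator it = prefs.find(name);
    if (it == prefs.end()) continue;  // pref not set in this profile
    std::string utf8 = toUTF8(it->second);
    if (utf8 != it->second) {
      it->second = utf8;
      ++converted;
    }
  }
  return converted;
}
```

Keeping the list of names outside the function means new prefs can be added without touching the conversion loop.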

the only drawbacks:

1) we need to know all the prefs that might need conversion
2) if the user doesn't migrate from 4.x to 5.0 using the migrator, their prefs
won't get migrated.

those seem acceptable.

I'll work on doing what I propose, then getting nhotta to review my changes (and
help me test) and then it will be up to nhotta to provide the full list of prefs
that may need migration.

I should be able to have this ready tomorrow.
Whiteboard: [PDT+] → [PDT+] eta 2-17-00
if it turns out that there are prefs that we can't hard code ahead of time, we can
always extend ConvertPrefsToUTF8() [the thing I am writing now] to use
nsIPref::EnumerateChildren to look for prefs that begin with "foo.bar", for
example.
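
The prefix lookup mentioned above can be illustrated without nsIPref. The sketch below assumes prefs are held in a sorted std::map and uses a hypothetical helper, PrefsWithPrefix; it only mirrors the idea of enumerating the children of a branch such as "foo.bar" and makes no claim about how nsIPref::EnumerateChildren is implemented.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Collect the names of all prefs whose name begins with `prefix`.
// std::map keeps its keys sorted, so all matches form one contiguous range
// that starts at lower_bound(prefix).
std::vector<std::string> PrefsWithPrefix(
    const std::map<std::string, std::string>& prefs,
    const std::string& prefix) {
  assert(!prefix.empty());
  std::vector<std::string> names;
  for (auto it = prefs.lower_bound(prefix); it != prefs.end(); ++it) {
    // Stop at the first key that no longer starts with the prefix.
    if (it->first.compare(0, prefix.size(), prefix) != 0) break;
    names.push_back(it->first);
  }
  return names;
}
```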
I have a fix in hand.

I need nhotta to review and we need that list of prefs to be converted
You can attach your diff to the bug report. You can ask momoi for the list of
the prefs to be converted.
How about if I gave you a prefs.js file which contains
probably almost all possible non-ASCII pref items?
yes, please, do that.

and I can use it to test!
I walked through the change Seth made with him. I reviewed the basic concept of
the conversion. I didn't review the code in detail, so it needs another reviewer.
alecf has reviewed my code.

these are the prefs I'm trying to convert:

"mail.identity.username"
"mail.signature_file"
"mail.identity.organization"
"li.server.ldap.userbase"
"editor.image_editor"
"editor.html_editor"
"editor.author"
"custtoolbar.personal_toolbar_folder"
"browser.cache.directory"
"mail.directory"
"news.directory"
"mail.imap.root_dir"
"premigration.mail.directory"
"premigration.news.directory"
"browser.bookmark_file"
"browser.history_file"
"browser.sarcache.directory"
"browser.user_history_file"
"helpers.private_mailcap_file"
"helpers.private_mime_types_file"
"mail.default_fcc"
"mail.default_templates"
"news.default_fcc"

note, many of those are paths.  any pref that is a path could have characters in
the system charset.

I'm also converting  any pref that matches these patterns:

"ldap_2.server.*.description"
"intl.font*.fixed_font"
"intl.font*.prop_font"
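
As an illustration of how patterns like the three above could be matched against pref names, here is a minimal glob matcher. It assumes '*' means "any run of characters, including none", which is one plausible reading of the patterns and is not confirmed by this bug; MatchesPrefPattern is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Minimal glob matcher: '*' matches any run of characters (including none);
// every other character must match literally.
bool MatchesPrefPattern(const std::string& pattern, const std::string& name) {
  std::size_t p = 0, n = 0;
  std::size_t starP = std::string::npos, starN = 0;
  while (n < name.size()) {
    if (p < pattern.size() && pattern[p] == '*') {
      starP = p++;    // remember the star and first try matching zero chars
      starN = n;
    } else if (p < pattern.size() && pattern[p] == name[n]) {
      ++p;
      ++n;
    } else if (starP != std::string::npos) {
      p = starP + 1;  // backtrack: let the last star absorb one more char
      n = ++starN;
    } else {
      return false;
    }
  }
  assert(n == name.size());
  // Any stars left at the end of the pattern can match the empty string.
  while (p < pattern.size() && pattern[p] == '*') ++p;
  return p == pattern.size();
}
```

With this reading, "intl.font*.fixed_font" would match both "intl.font.fixed_font" and "intl.font2.fixed_font".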

a=phil, checking in soon.
The following could also be saved locally:

"mail.default_drafts"
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
fixed.

if we find more prefs to migrate, let me know, and I'll add them.
I just added it.
I verified this in the 2000022108 Win32 build.
Status: RESOLVED → VERIFIED