Closed Bug 102595 Opened 23 years ago Closed 23 years ago

nsUnicodeToUTF8 does not handle surrogate pair correctly

Categories

(Core :: Internationalization, defect)

x86
Windows NT
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla0.9.8

People

(Reporter: ftang, Assigned: shanjian)

Details

(Keywords: intl, Whiteboard: need r/sr)

Attachments

(1 file, 3 obsolete files)

we currently does not handle surrogate pair in nsUnicodeToUTF8 correctly. 
0xd800 0xdc00 should be generated as one 4 bytes sequence instead of two 3 bytes 
sequence
Status: NEW → ASSIGNED
Keywords: intl
QA Contact: andreasb → ylong
give to shanjian
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Target Milestone: --- → mozilla0.9.7
accepting. 
Status: NEW → ASSIGNED
Attached patch patch (obsolete) — Splinter Review
Whiteboard: need r/sr
Target Milestone: mozilla0.9.7 → mozilla0.9.8
@@ -94,14 +75,127 @@
                                               PRInt32 aSrcLength,
                                               PRInt32 * aDestLength)
 {
-  // in theory it should be 6, but since we do not handle 
-  // UCS4 and UTF-16 here. It is 3. We should change it to 6 when we
-  // support UCS4 or UTF-16
-  *aDestLength = 3*aSrcLength;
+  // aSrc is interpreted as UTF16, 3 is normaly enough, but part of 
+  //surrogate pair need 4, but we atmost have one  
+  *aDestLength = 3*aSrcLength +1;
   return NS_OK;
 }

I don't understand this part. Why we at most have one ?
I think it should still be *aDestLength = 3*aSrcLength;
If the unicode do not contain Surrogate 3 * srcLenght should be enough, if ALL 
the unicode are surrogate, then (2 * aSrcLength) should be enough since 
aSrcLength is the number of PRUnichar and one surrogate pair require two 
PRUnichar, a surrogate pair will use two PRUnichar and should be convert to 4 
bytes UTF-8. which is less than 3 * srcLength.
Attached patch update patch (obsolete) — Splinter Review
Attachment #59220 - Attachment is obsolete: true
frank, 

In fact, we need to add 3 instead of 1. That is for situation when previous buffer contains only part 
of a surrogate pair. If incoming buffer is in valid form, 1 is enough. 
Comment on attachment 61388 [details] [diff] [review]
update patch

r=ftang
Attachment #61388 - Flags: review+
cc to brendan

Brendan, could you sr?
Comment on attachment 61388 [details] [diff] [review]
update patch

>-//----------------------------------------------------------------------
>-// Class nsUnicodeToUTF8 [implementation]
>-
>-nsUnicodeToUTF8::nsUnicodeToUTF8() 
>-: nsTableEncoderSupport((uShiftTable*) &g_UTF8ShiftTable, 
>-                        (uMappingTable*) &g_UTF8MappingTable)
>+NS_IMETHODIMP NS_NewUnicodeToUTF8(nsISupports* aOuter, 
>+                                            const nsIID& aIID,
>+                                            void** aResult)

Don't use NS_IMETHODIMP for an extern function.  Use NS_COM nsresult, IIRC. 
See numerous cases in xpcom and elsewhere.

> {
>+  if (!aResult) {
>+    return NS_ERROR_NULL_POINTER;
>+  }
>+  if (aOuter) {
>+    *aResult = nsnull;
>+    return NS_ERROR_NO_AGGREGATION;
>+  }
>+  nsUnicodeToUTF8 * inst = new nsUnicodeToUTF8();
>+  if (!inst) {
>+    *aResult = nsnull;
>+    return NS_ERROR_OUT_OF_MEMORY;
>+  }
>+  nsresult res = inst->QueryInterface(aIID, aResult);
>+  if (NS_FAILED(res)) {
>+    *aResult = nsnull;
>+    delete inst;
>+  }
>+  return res;
> }

Why not set *aResult = nsnull; early, after the (pointless, I think) !aResult
test, and unconditionally, so the following error returns don't need to do
that?

>@@ -69,14 +75,130 @@
>                                               PRInt32 aSrcLength,
>                                               PRInt32 * aDestLength)
> {
>-  // in theory it should be 6, but since we do not handle 
>-  // UCS4 and UTF-16 here. It is 3. We should change it to 6 when we
>-  // support UCS4 or UTF-16
>-  *aDestLength = 3*aSrcLength;
>+  // aSrc is interpreted as UTF16, 3 is normaly enough.

"normally"

>+  // but

Capitalize "but", or make it continue the previous sentence (it could do that
without starting a new line in the comment; nit).

> when previous buffer only contains part of the surrogate pair, we 
>+  // need to complete it here. If the first word in following buffer is not
>+  // in valid surrogate rang, we need to convert the remaining of last buffer 
>+  // to 3 bytes.
>+  *aDestLength = 3*aSrcLength + 3;
>   return NS_OK;
> }

[. . .]

>+NS_IMETHODIMP nsUnicodeToUTF8::Finish(char * aDest, PRInt32 * aDestLength)
>+{
>+  char * dest = aDest;
>+
>+  if (mHighSurrogate) {
>+    if (*aDestLength >= 3) {
>+      *dest++ = (char)0xe0 | (mHighSurrogate >> 12);
>+      *dest++ = (char)0x80 | ((mHighSurrogate >> 6) & 0x003f);
>+      *dest++ = (char)0x80 | (mHighSurrogate & 0x003f);
>+      mHighSurrogate = 0;
>+      *aDestLength = 3;
>+      return NS_OK;
>+    } else {

else after return makes no sense.  Better to test if (*aDestLength < 3) and
return early as the else clause here does:

>+      *aDestLength = 0;
>+      return NS_OK_UENC_MOREOUTPUT;
>+    }

... then put the 'if (*aDestLength >= 3) {...}' code here, indented one fewer
levels.

>+  } 
>+
>+  *aDestLength  = 0;
>   return NS_OK;
> }

Is this the only UnicodeToUTF8 encoder in the tree, or are there others (in
xpcom, perhaps) that might need to track this change?

/be
Attached patch new patch (obsolete) — Splinter Review
Attachment #61388 - Attachment is obsolete: true
brendan,

NS_NewUnicodeToUTF8 does not seem necessary. I guess the following line
"NS_GENERIC_FACTORY_CONSTRUCTOR(nsUnicodeToUTF8);" in nsUConvModule.cpp
serve this purpose. All other issues were done as you suggested. 

>>Is this the only UnicodeToUTF8 encoder in the tree, or are there others (in
>>xpcom, perhaps) that might need to track this change?
We do have another encoder "NS_ConvertUCS2toUTF8" implemented in nsString.cpp.
That one supports surrogate now. That one allocate buffer itself, so is  
less complicated. 
shanjian: "NS_GENERIC_FACTORY_CONSTRUCTOR(nsUnicodeToUTF8);" has a buglet in it
-- there should be no semi-colon after the call to such a macro, as it expands
to a static function definition.  Some compilers do not like empty declarations
(semicolon at start of file or after another ; or } at top-level in a C or C++
file).

But to repeat what I just wrote with different emphasis, that macro just creates
a static function called back by generic factory code.  It does not make a
public, extern function with the right linkage for others to call from another
DLL or DSO, or from the main executable.  You want an extern NS_COM function
declaration and an NS_COM definition to match.

But if you're sure no one needs NS_NewUnicodeToUTF8, and that everyone who needs
to create an instance can go through do_CreateInstance (which I hope and believe
is the case), then we're good.  Occasionally it is useful to make an extern
"constructor" function that others can use instead of going through the
component manager, but usually that's a premature optimization -- or else the
object shouldn't be an XPCOM object in the first place (if component manager
overhead is too high).

/be
Attachment #64091 - Attachment is obsolete: true
brendan,
I removed those extra ';' in my updated patch. 

Yes, I believe "NS_NewUnicodeToUTF8" is not needed by anybody. I searched this 
string in lxr, and there is no referenced anywhere else. My locale build runs fine 
without any problem, and if I set a break point in the conversion function, program
did stop there in certain situation. Could you give sr now?
Comment on attachment 64922 [details] [diff] [review]
patch fixes that eliminate those extra ";'

sr=brendan@mozilla.org and noting r= from earlier version of patch, which I
believe carries over.

/be
Attachment #64922 - Flags: superreview+
Attachment #64922 - Flags: review+
fix checked in. 
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: