Closed
Bug 102595
Opened 23 years ago
Closed 23 years ago
nsUnicodeToUTF8 does not handle surrogate pair correctly
Categories
(Core :: Internationalization, defect)
Tracking
()
RESOLVED
FIXED
mozilla0.9.8
People
(Reporter: ftang, Assigned: shanjian)
Details
(Keywords: intl, Whiteboard: need r/sr)
Attachments
(1 file, 3 obsolete files)
9.45 KB,
patch
|
brendan
:
review+
brendan
:
superreview+
|
Details | Diff | Splinter Review |
we currently does not handle surrogate pair in nsUnicodeToUTF8 correctly.
0xd800 0xdc00 should be generated as one 4 bytes sequence instead of two 3 bytes
sequence
Reporter | ||
Updated•23 years ago
|
Status: NEW → ASSIGNED
Reporter | ||
Comment 1•23 years ago
|
||
give to shanjian
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Target Milestone: --- → mozilla0.9.7
Assignee | ||
Comment 3•23 years ago
|
||
Assignee | ||
Updated•23 years ago
|
Whiteboard: need r/sr
Assignee | ||
Updated•23 years ago
|
Target Milestone: mozilla0.9.7 → mozilla0.9.8
Reporter | ||
Comment 4•23 years ago
|
||
@@ -94,14 +75,127 @@
PRInt32 aSrcLength,
PRInt32 * aDestLength)
{
- // in theory it should be 6, but since we do not handle
- // UCS4 and UTF-16 here. It is 3. We should change it to 6 when we
- // support UCS4 or UTF-16
- *aDestLength = 3*aSrcLength;
+ // aSrc is interpreted as UTF16, 3 is normaly enough, but part of
+ //surrogate pair need 4, but we atmost have one
+ *aDestLength = 3*aSrcLength +1;
return NS_OK;
}
I don't understand this part. Why we at most have one ?
I think it should still be *aDestLength = 3*aSrcLength;
If the unicode do not contain Surrogate 3 * srcLenght should be enough, if ALL
the unicode are surrogate, then (2 * aSrcLength) should be enough since
aSrcLength is the number of PRUnichar and one surrogate pair require two
PRUnichar, a surrogate pair will use two PRUnichar and should be convert to 4
bytes UTF-8. which is less than 3 * srcLength.
Assignee | ||
Comment 5•23 years ago
|
||
Attachment #59220 -
Attachment is obsolete: true
Assignee | ||
Comment 6•23 years ago
|
||
frank,
In fact, we need to add 3 instead of 1. That is for situation when previous buffer contains only part
of a surrogate pair. If incoming buffer is in valid form, 1 is enough.
Reporter | ||
Comment 7•23 years ago
|
||
Comment on attachment 61388 [details] [diff] [review]
update patch
r=ftang
Attachment #61388 -
Flags: review+
Assignee | ||
Comment 8•23 years ago
|
||
cc to brendan
Brendan, could you sr?
Comment 9•23 years ago
|
||
Comment on attachment 61388 [details] [diff] [review]
update patch
>-//----------------------------------------------------------------------
>-// Class nsUnicodeToUTF8 [implementation]
>-
>-nsUnicodeToUTF8::nsUnicodeToUTF8()
>-: nsTableEncoderSupport((uShiftTable*) &g_UTF8ShiftTable,
>- (uMappingTable*) &g_UTF8MappingTable)
>+NS_IMETHODIMP NS_NewUnicodeToUTF8(nsISupports* aOuter,
>+ const nsIID& aIID,
>+ void** aResult)
Don't use NS_IMETHODIMP for an extern function. Use NS_COM nsresult, IIRC.
See numerous cases in xpcom and elsewhere.
> {
>+ if (!aResult) {
>+ return NS_ERROR_NULL_POINTER;
>+ }
>+ if (aOuter) {
>+ *aResult = nsnull;
>+ return NS_ERROR_NO_AGGREGATION;
>+ }
>+ nsUnicodeToUTF8 * inst = new nsUnicodeToUTF8();
>+ if (!inst) {
>+ *aResult = nsnull;
>+ return NS_ERROR_OUT_OF_MEMORY;
>+ }
>+ nsresult res = inst->QueryInterface(aIID, aResult);
>+ if (NS_FAILED(res)) {
>+ *aResult = nsnull;
>+ delete inst;
>+ }
>+ return res;
> }
Why not set *aResult = nsnull; early, after the (pointless, I think) !aResult
test, and unconditionally, so the following error returns don't need to do
that?
>@@ -69,14 +75,130 @@
> PRInt32 aSrcLength,
> PRInt32 * aDestLength)
> {
>- // in theory it should be 6, but since we do not handle
>- // UCS4 and UTF-16 here. It is 3. We should change it to 6 when we
>- // support UCS4 or UTF-16
>- *aDestLength = 3*aSrcLength;
>+ // aSrc is interpreted as UTF16, 3 is normaly enough.
"normally"
>+ // but
Capitalize "but", or make it continue the previous sentence (it could do that
without starting a new line in the comment; nit).
> when previous buffer only contains part of the surrogate pair, we
>+ // need to complete it here. If the first word in following buffer is not
>+ // in valid surrogate rang, we need to convert the remaining of last buffer
>+ // to 3 bytes.
>+ *aDestLength = 3*aSrcLength + 3;
> return NS_OK;
> }
[. . .]
>+NS_IMETHODIMP nsUnicodeToUTF8::Finish(char * aDest, PRInt32 * aDestLength)
>+{
>+ char * dest = aDest;
>+
>+ if (mHighSurrogate) {
>+ if (*aDestLength >= 3) {
>+ *dest++ = (char)0xe0 | (mHighSurrogate >> 12);
>+ *dest++ = (char)0x80 | ((mHighSurrogate >> 6) & 0x003f);
>+ *dest++ = (char)0x80 | (mHighSurrogate & 0x003f);
>+ mHighSurrogate = 0;
>+ *aDestLength = 3;
>+ return NS_OK;
>+ } else {
else after return makes no sense. Better to test if (*aDestLength < 3) and
return early as the else clause here does:
>+ *aDestLength = 0;
>+ return NS_OK_UENC_MOREOUTPUT;
>+ }
... then put the 'if (*aDestLength >= 3) {...}' code here, indented one fewer
levels.
>+ }
>+
>+ *aDestLength = 0;
> return NS_OK;
> }
Is this the only UnicodeToUTF8 encoder in the tree, or are there others (in
xpcom, perhaps) that might need to track this change?
/be
Assignee | ||
Comment 10•23 years ago
|
||
Attachment #61388 -
Attachment is obsolete: true
Assignee | ||
Comment 11•23 years ago
|
||
brendan,
NS_NewUnicodeToUTF8 does not seem necessary. I guess the following line
"NS_GENERIC_FACTORY_CONSTRUCTOR(nsUnicodeToUTF8);" in nsUConvModule.cpp
serve this purpose. All other issues were done as you suggested.
>>Is this the only UnicodeToUTF8 encoder in the tree, or are there others (in
>>xpcom, perhaps) that might need to track this change?
We do have another encoder "NS_ConvertUCS2toUTF8" implemented in nsString.cpp.
That one supports surrogate now. That one allocate buffer itself, so is
less complicated.
Comment 12•23 years ago
|
||
shanjian: "NS_GENERIC_FACTORY_CONSTRUCTOR(nsUnicodeToUTF8);" has a buglet in it
-- there should be no semi-colon after the call to such a macro, as it expands
to a static function definition. Some compilers do not like empty declarations
(semicolon at start of file or after another ; or } at top-level in a C or C++
file).
But to repeat what I just wrote with different emphasis, that macro just creates
a static function called back by generic factory code. It does not make a
public, extern function with the right linkage for others to call from another
DLL or DSO, or from the main executable. You want an extern NS_COM function
declaration and an NS_COM definition to match.
But if you're sure no one needs NS_NewUnicodeToUTF8, and that everyone who needs
to create an instance can go through do_CreateInstance (which I hope and believe
is the case), then we're good. Occasionally it is useful to make an extern
"constructor" function that others can use instead of going through the
component manager, but usually that's a premature optimization -- or else the
object shouldn't be an XPCOM object in the first place (if component manager
overhead is too high).
/be
Assignee | ||
Comment 13•23 years ago
|
||
Attachment #64091 -
Attachment is obsolete: true
Assignee | ||
Comment 14•23 years ago
|
||
brendan,
I removed those extra ';' in my updated patch.
Yes, I believe "NS_NewUnicodeToUTF8" is not needed by anybody. I searched this
string in lxr, and there is no referenced anywhere else. My locale build runs fine
without any problem, and if I set a break point in the conversion function, program
did stop there in certain situation. Could you give sr now?
Comment 15•23 years ago
|
||
Comment on attachment 64922 [details] [diff] [review]
patch fixes that eliminate those extra ";'
sr=brendan@mozilla.org and noting r= from earlier version of patch, which I
believe carries over.
/be
Attachment #64922 -
Flags: superreview+
Attachment #64922 -
Flags: review+
Assignee | ||
Comment 16•23 years ago
|
||
fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•