Closed
Bug 331748
Opened 19 years ago
Closed 14 years ago
Fails to load invalid UTF-8 "properly" when using native uconv module
Categories
(Core :: Internationalization, defect)
Tracking
()
RESOLVED
WONTFIX
People
(Reporter: glandium, Assigned: smontagu)
References
Details
Attachments
(2 files, 5 obsolete files)
19.46 KB,
text/plain
|
Details | |
20.38 KB,
patch
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.8.0.1) Gecko/20060313 Debian/1.5.dfsg+1.5.0.1-4 Firefox/1.5.0.1
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.8.0.1) Gecko/20060313 Debian/1.5.dfsg+1.5.0.1-4 Firefox/1.5.0.1
After enabling --enable-native-uconv, gecko is unable to succesfully load invalid UTF-8 (such as http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt), and fails on invalid UTF-8 javascripts, which has a nice side effect: breaking the "Website Certified by an Unknown Authority" dialog because newserver.js is not UTF-8, but I'm going to file another bug for that.
Somehow, it gets a lot of garbage in the internal representation of the data, and pango complains a lot about it:
(Gecko:1740): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()
Reproducible: Always
Reporter | ||
Comment 1•19 years ago
|
||
Seems to be related to the fact that mReplaceOnError seems to never be true in nsNativeUConvService.cpp.
Updated•19 years ago
|
Assignee: nobody → smontagu
Component: General → Internationalization
Product: Firefox → Core
QA Contact: general → amyy
Version: unspecified → Trunk
Reporter | ||
Comment 2•19 years ago
|
||
mmmm in fact, seen how nsUTF8ToUnicode is written, it seems more like a problem with return values...
Reporter | ||
Comment 3•19 years ago
|
||
Here is a basic patch that solves the issue. The problem is that now, it doesn't throw a NS_ERROR_UENC_NOMAPPING when it should, ie in cases where the non native uconv would.
Assignee | ||
Comment 4•19 years ago
|
||
I think your analysis is wrong, and the problem is only with specific characters which the UTF-8 decoder does not recognize as invalid, but pango does. This attachment removes such characters from the testcase and I see no Pango-WARNINGs with it. Do you get the same results? If so, this is a dupe of bug 172701.
Reporter | ||
Comment 5•19 years ago
|
||
The pango issue is a side effect, it only shows that the string that is passed to pango is somehow borked.
The problem here is that the "native" implementation (iconv) and the "non native" one differ in their behaviour, and the code that calls the conversion doesn't act the same way with these different behaviours, thus an overall different experience with charsets.
Bug #172701 is actually the opposite bug. It's the lack of bug #172701 with the native implementation that is annoying in my case.
Note that the code that is not aborting the conversion (for bug #172701) is not the uconv service (native or not), but rather the code that calls the convert function. The uconv service, on its end, doesn't return the same status depending on the error it got during conversion between the native and the non native implementations.
Assignee | ||
Comment 6•19 years ago
|
||
Sorry, I overlooked "After enabling --enable-native-uconv" in comment 0 and misunderstood what this bug report was about.
Assignee | ||
Updated•19 years ago
|
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee | ||
Comment 7•19 years ago
|
||
(In reply to comment #3)
> The problem is that now, it
> doesn't throw a NS_ERROR_UENC_NOMAPPING when it should, ie in cases where the
> non native uconv would.
How about doing something like:
if (aDestCharSize == 1)
return NS_ERROR_UENC_NOMAPPING;
else
return NS_ERROR_UNEXPECTED;
With the patch the results aren't quite the same as with the non-native implementation, because the input pointer doesn't get resynchronized, but I don't believe that there is any way to do that with iconv so this may be the best we can do.
Reporter | ||
Comment 8•19 years ago
|
||
are there any cases where aDestCharSize == 1 ? most of the time, the dest charset is UCS-2, isn't it ?
what do you mean about the input pointer resynchronisation ?
Assignee | ||
Comment 9•19 years ago
|
||
IConvAdaptor::ConvertInternal is used by the implementations of both nsIUnicodeDecoder and nsIUnicodeEncoder, i.e. for conversions both to and from Unicode. In the non-native implementation only nsIUnicodeEncoder returns NS_ERROR_UENC_NOMAPPING.
Input resynchronization should occur in cases like 3.3.2 in http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
The non-native decoder at http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUTF8ToUnicode.cpp#233 returns NS_ERROR_UNEXPECTED and will resume after the incomplete sequence, so that the whole sequence is decoded as a single replacement character. With your patch, the native implementation will resume at the second character of the incomplete sequence, which will also fail, and the whole sequence will be decoded as two replacement characters.
Reporter | ||
Comment 10•19 years ago
|
||
OTOH, NS_ERROR_UNEXPECTED is not returned in a lot of cases in the non native implementation.
Apart from nsUTF8ToUnicode, it is returned by nsUnicodeToJamoTTF::composeHangul, but it's ignored in nsUnicodeToJamoTTF::Convert (interesting that the return code is not checked). It is also returned in nsISO2022JPToUnicodeV2::Convert in some cases.
Most of the nsIUnicodeDecoders return NS_OK_UDEC_MOREOUTPUT or NS_OK_UDEC_MOREINPUT.
What we could do is set NS_ERROR_UNEXPECTED when running as a nsIUnicodeDecoder and NS_ERROR_UENC_NOMAPPING when running as a nsIUnicodeEncoder.
I'm not sure (aDestCharSize == 1) would be the adequate test for that. Testing if the source charset is UCS-2 at initialization could do the trick. What do you think ?
Assignee | ||
Comment 11•19 years ago
|
||
(In reply to comment #10)
> What we could do is set NS_ERROR_UNEXPECTED when running as a nsIUnicodeDecoder
> and NS_ERROR_UENC_NOMAPPING when running as a nsIUnicodeEncoder.
Yeah, that's what I was shooting for.
> I'm not sure (aDestCharSize == 1) would be the adequate test for that. Testing
> if the source charset is UCS-2 at initialization could do the trick. What do
> you think ?
That might be better, especially if we make "UCS-2" a #define and use it in nsCharsetConverterManager.cpp instead of the current hard-coded string.
Reporter | ||
Comment 12•19 years ago
|
||
> That might be better, especially if we make "UCS-2" a #define and use it in
> nsCharsetConverterManager.cpp instead of the current hard-coded string.
I guess you meant a const char * instead of a #define.
Assignee | ||
Comment 13•19 years ago
|
||
exactly so :)
Reporter | ||
Comment 14•19 years ago
|
||
Here is the patch implemented as we agreed. Maybe the const name should be changed, i don't know... I was thinking we could also avoid using a const if we used nsnull as the value to say we want UCS-2...
Anyways, is the patch okay as is, or should i make some changes ?
Attachment #216300 -
Attachment is obsolete: true
Attachment #216821 -
Flags: review?(smontagu)
Reporter | ||
Comment 15•18 years ago
|
||
ping
Assignee | ||
Updated•18 years ago
|
Attachment #216821 -
Flags: review?(smontagu) → review+
Reporter | ||
Comment 16•18 years ago
|
||
FWIW, the patch is not enough as is, there is a corner case when reaching end of buffer when the converter is called by the html stream parser, which sends the html 4096 bytes at a time, and a multibyte character is cut at the end of the buffer. I'm working on a new and better patch.
Reporter | ||
Comment 17•18 years ago
|
||
(Patch against 1.8.0 branch)
This patch is mostly a rewrite of the native uconv module for iconv. It splits the convert methods for conversion to unicode and from unicode, thus removing the ConvertInternal method, and tries to implement at best the nsIUnicodeDecoder and nsIUnicodeEncoder interfaces. The continuation part still has issues, but much less than the current implementation and the previous patch.
I also modified the nsINativeUConvService interface (which means the wince native uconv would also need modification) so that it is cleaner for the UCS_2 thing we introduced in the previous patch. I wonder if it is possible to drop the nsISupports wrapper kinda thing, but I don't know much about that stuff...
Please comment on these changes so that I know if it is suitable to mozilla before finalizing and fixing the last issues.
Assignee: smontagu → mh+mozilla
Attachment #216821 -
Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #241458 -
Flags: review?(smontagu)
Reporter | ||
Comment 18•18 years ago
|
||
Comment on attachment 241458 [details] [diff] [review]
New preliminary patch
I'm still waiting feedback on this patch. Note that in the meanwhile I found out one bug with the patch when mReplaceOnError is true, which is due to bad while condition (*aSrcLength should be replaced by inLeft).
Comment 19•18 years ago
|
||
(In reply to comment #14)
> used nsnull as the value to say we want UCS-2...
Why do you use UCS-2? Our internal character encoding (AString, PRUnichar *) is UTF-16. Am I missing anything?
BTW, glibc iconv adds a BOM unless the endianness is explicitly specified. So, it needs to be either 'UTF-16LE' or 'UTF-16BE' (see xpcom/io/nsNativeCharsetUtils.cpp)
Reporter | ||
Comment 20•18 years ago
|
||
(In reply to comment #19)
> (In reply to comment #14)
>
> > used nsnull as the value to say we want UCS-2...
>
> Why do you use UCS-2? Our internal character encoding (AString, PRUnichar *) is
> UTF-16. Am I missing anything?
I haven't checked that yet, but was considering to, since UCS-2 only maps the BMP. I only used UCS-2 because it is what was already used by the code... Anyways, switching to UTF16 will require some changes to the continuation trick...
Comment 21•18 years ago
|
||
Comment on attachment 241458 [details] [diff] [review]
New preliminary patch
you must change the IID when changing an interface.
+static const char * UCS_2 = "UCS-2";
and this should be an array instead of a pointer
Attachment #241458 -
Flags: review?(smontagu) → review-
Reporter | ||
Comment 22•18 years ago
|
||
other comments on the approach would be much appreciated ;)
Reporter | ||
Comment 23•18 years ago
|
||
This patch has been tested on little and big endians, and has been used in debian for epiphany/galeon/... for a few weeks.
Note it also implements a fix for bug #356654 (though I don't know if it would work nice with other iconv implementations than the glibc's) and a workaround to make is do the same as the non-native implementation for iso-8859-1/windows-1252 (i.e the contrary of solving bug #288904)
Attachment #241458 -
Attachment is obsolete: true
Attachment #254071 -
Flags: review?(cbiesinger)
Comment 24•18 years ago
|
||
The latest patch fails against the trunk.
[snip]
patching file intl/uconv/native/nsNativeUConvService.cpp
Hunk #10 FAILED at 477.
Hunk #11 succeeded at 501 (offset 3 lines).
Hunk #12 FAILED at 540.
2 out of 12 hunks FAILED -- saving rejects to file intl/uconv/native/nsNativeUConvService.cpp.rej
[/snip]
Also seems that some of the upstream changes have introduced a compile failure, so this bug is blocked by bug #372648 at the moment.
Reporter | ||
Comment 25•18 years ago
|
||
This patch is against trunk, and also fixes an issue when UTF16 input doesn't have a mapping in the output charset. Though there is still an issue with the replacement character when mReplaceOnError is set but I would like to be sure what mReplaceChar is supposed to be. Actually the whole concept of replace character on unicodeencoder sounds broken to me, the replacement character depeding on the destination character set.
Attachment #254071 -
Attachment is obsolete: true
Attachment #265513 -
Flags: review?(cbiesinger)
Attachment #254071 -
Flags: review?(cbiesinger)
Comment 26•17 years ago
|
||
Mike, can you make sure your patch is still valid on the trunk?
biesi, do you have time to review this sometime?
QA Contact: amyy → i18n
Reporter | ||
Comment 27•17 years ago
|
||
A one line change was necessary to fit changes on trunk since May. I also added myself to contributors in intl/uconv/native/nsNativeUConvService.cpp.
Attachment #265513 -
Attachment is obsolete: true
Attachment #284739 -
Flags: review?(cbiesinger)
Attachment #265513 -
Flags: review?(cbiesinger)
Reporter | ||
Comment 28•17 years ago
|
||
(In reply to comment #19)
> BTW, glibc iconv adds a BOM unless the endianness is explicitly specified. So,
> it needs to be either 'UTF-16LE' or 'UTF-16BE' (see
> xpcom/io/nsNativeCharsetUtils.cpp)
Unfortunately, it adds a BOM even when UTF-16LE or UTF-16BE is specified :-/
Reporter | ||
Comment 29•17 years ago
|
||
(In reply to comment #28)
> (In reply to comment #19)
> > BTW, glibc iconv adds a BOM unless the endianness is explicitly specified. So,
> > it needs to be either 'UTF-16LE' or 'UTF-16BE' (see
> > xpcom/io/nsNativeCharsetUtils.cpp)
>
> Unfortunately, it adds a BOM even when UTF-16LE or UTF-16BE is specified :-/
Actually, it doesn't *add* it, it just doesn't remove it in conversions from UTF-8 to UTF-16LE or UTF-16BE...
Comment 30•16 years ago
|
||
Comment on attachment 284739 [details] [diff] [review]
New patch against trunk
Simon, got a chance to look over this?
Attachment #284739 -
Flags: review?(smontagu)
Reporter | ||
Comment 31•16 years ago
|
||
Comment on attachment 284739 [details] [diff] [review]
New patch against trunk
This patch still has problems, and I stopped maintaining it...
Attachment #284739 -
Flags: review?(smontagu)
Attachment #284739 -
Flags: review?(cbiesinger)
Reporter | ||
Updated•15 years ago
|
Status: ASSIGNED → NEW
Reporter | ||
Updated•15 years ago
|
Assignee: mh+mozilla → smontagu
Assignee | ||
Comment 32•14 years ago
|
||
Native uconv is gone.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•