Closed Bug 331748 Opened 19 years ago Closed 14 years ago

Fails to load invalid UTF-8 "properly" when using native uconv module

Categories

(Core :: Internationalization, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: glandium, Assigned: smontagu)

References

Details

Attachments

(2 files, 5 obsolete files)

User-Agent: Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.8.0.1) Gecko/20060313 Debian/1.5.dfsg+1.5.0.1-4 Firefox/1.5.0.1 Build Identifier: Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.8.0.1) Gecko/20060313 Debian/1.5.dfsg+1.5.0.1-4 Firefox/1.5.0.1 After enabling --enable-native-uconv, gecko is unable to succesfully load invalid UTF-8 (such as http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt), and fails on invalid UTF-8 javascripts, which has a nice side effect: breaking the "Website Certified by an Unknown Authority" dialog because newserver.js is not UTF-8, but I'm going to file another bug for that. Somehow, it gets a lot of garbage in the internal representation of the data, and pango complains a lot about it: (Gecko:1740): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text() Reproducible: Always
Seems to be related to the fact that mReplaceOnError seems to never be true in nsNativeUConvService.cpp.
Assignee: nobody → smontagu
Component: General → Internationalization
Product: Firefox → Core
QA Contact: general → amyy
Version: unspecified → Trunk
mmmm in fact, seen how nsUTF8ToUnicode is written, it seems more like a problem with return values...
Attached patch Basic patch (obsolete) — Splinter Review
Here is a basic patch that solves the issue. The problem is that now, it doesn't throw a NS_ERROR_UENC_NOMAPPING when it should, ie in cases where the non native uconv would.
I think your analysis is wrong, and the problem is only with specific characters which the UTF-8 decoder does not recognize as invalid, but pango does. This attachment removes such characters from the testcase and I see no Pango-WARNINGs with it. Do you get the same results? If so, this is a dupe of bug 172701.
The pango issue is a side effect, it only shows that the string that is passed to pango is somehow borked. The problem here is that the "native" implementation (iconv) and the "non native" one differ in their behaviour, and the code that calls the conversion doesn't act the same way with these different behaviours, thus an overall different experience with charsets. Bug #172701 is actually the opposite bug. It's the lack of bug #172701 with the native implementation that is annoying in my case. Note that the code that is not aborting the conversion (for bug #172701) is not the uconv service (native or not), but rather the code that calls the convert function. The uconv service, on its end, doesn't return the same status depending on the error it got during conversion between the native and the non native implementations.
Sorry, I overlooked "After enabling --enable-native-uconv" in comment 0 and misunderstood what this bug report was about.
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to comment #3) > The problem is that now, it > doesn't throw a NS_ERROR_UENC_NOMAPPING when it should, ie in cases where the > non native uconv would. How about doing something like: if (aDestCharSize == 1) return NS_ERROR_UENC_NOMAPPING; else return NS_ERROR_UNEXPECTED; With the patch the results aren't quite the same as with the non-native implementation, because the input pointer doesn't get resynchronized, but I don't believe that there is any way to do that with iconv so this may be the best we can do.
are there any cases where aDestCharSize == 1 ? most of the time, the dest charset is UCS-2, isn't it ? what do you mean about the input pointer resynchronisation ?
IConvAdaptor::ConvertInternal is used by the implementations of both nsIUnicodeDecoder and nsIUnicodeEncoder, i.e. for conversions both to and from Unicode. In the non-native implementation only nsIUnicodeEncoder returns NS_ERROR_UENC_NOMAPPING. Input resynchronization should occur in cases like 3.3.2 in http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt The non-native decoder at http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUTF8ToUnicode.cpp#233 returns NS_ERROR_UNEXPECTED and will resume after the incomplete sequence, so that the whole sequence is decoded as a single replacement character. With your patch, the native implementation will resume at the second character of the incomplete sequence, which will also fail, and the whole sequence will be decoded as two replacement characters.
OTOH, NS_ERROR_UNEXPECTED is not returned in a lot of cases in the non native implementation. Apart from nsUTF8ToUnicode, it is returned by nsUnicodeToJamoTTF::composeHangul, but it's ignored in nsUnicodeToJamoTTF::Convert (interesting that the return code is not checked). It is also returned in nsISO2022JPToUnicodeV2::Convert in some cases. Most of the nsIUnicodeDecoders return NS_OK_UDEC_MOREOUTPUT or NS_OK_UDEC_MOREINPUT. What we could do is set NS_ERROR_UNEXPECTED when running as a nsIUnicodeDecoder and NS_ERROR_UENC_NOMAPPING when running as a nsIUnicodeEncoder. I'm not sure (aDestCharSize == 1) would be the adequate test for that. Testing if the source charset is UCS-2 at initialization could do the trick. What do you think ?
(In reply to comment #10) > What we could do is set NS_ERROR_UNEXPECTED when running as a nsIUnicodeDecoder > and NS_ERROR_UENC_NOMAPPING when running as a nsIUnicodeEncoder. Yeah, that's what I was shooting for. > I'm not sure (aDestCharSize == 1) would be the adequate test for that. Testing > if the source charset is UCS-2 at initialization could do the trick. What do > you think ? That might be better, especially if we make "UCS-2" a #define and use it in nsCharsetConverterManager.cpp instead of the current hard-coded string.
> That might be better, especially if we make "UCS-2" a #define and use it in > nsCharsetConverterManager.cpp instead of the current hard-coded string. I guess you meant a const char * instead of a #define.
exactly so :)
Attached patch New patch with proposed changes (obsolete) — Splinter Review
Here is the patch implemented as we agreed. Maybe the const name should be changed, i don't know... I was thinking we could also avoid using a const if we used nsnull as the value to say we want UCS-2... Anyways, is the patch okay as is, or should i make some changes ?
Attachment #216300 - Attachment is obsolete: true
Attachment #216821 - Flags: review?(smontagu)
ping
Attachment #216821 - Flags: review?(smontagu) → review+
FWIW, the patch is not enough as is, there is a corner case when reaching end of buffer when the converter is called by the html stream parser, which sends the html 4096 bytes at a time, and a multibyte character is cut at the end of the buffer. I'm working on a new and better patch.
Attached patch New preliminary patch (obsolete) — Splinter Review
(Patch against 1.8.0 branch) This patch is mostly a rewrite of the native uconv module for iconv. It splits the convert methods for conversion to unicode and from unicode, thus removing the ConvertInternal method, and tries to implement at best the nsIUnicodeDecoder and nsIUnicodeEncoder interfaces. The continuation part still has issues, but much less than the current implementation and the previous patch. I also modified the nsINativeUConvService interface (which means the wince native uconv would also need modification) so that it is cleaner for the UCS_2 thing we introduced in the previous patch. I wonder if it is possible to drop the nsISupports wrapper kinda thing, but I don't know much about that stuff... Please comment on these changes so that I know if it is suitable to mozilla before finalizing and fixing the last issues.
Assignee: smontagu → mh+mozilla
Attachment #216821 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #241458 - Flags: review?(smontagu)
Comment on attachment 241458 [details] [diff] [review] New preliminary patch I'm still waiting feedback on this patch. Note that in the meanwhile I found out one bug with the patch when mReplaceOnError is true, which is due to bad while condition (*aSrcLength should be replaced by inLeft).
Blocks: 356654
(In reply to comment #14) > used nsnull as the value to say we want UCS-2... Why do you use UCS-2? Our internal character encoding (AString, PRUnichar *) is UTF-16. Am I missing anything? BTW, glibc iconv adds a BOM unless the endianness is explicitly specified. So, it needs to be either 'UTF-16LE' or 'UTF-16BE' (see xpcom/io/nsNativeCharsetUtils.cpp)
(In reply to comment #19) > (In reply to comment #14) > > > used nsnull as the value to say we want UCS-2... > > Why do you use UCS-2? Our internal character encoding (AString, PRUnichar *) is > UTF-16. Am I missing anything? I haven't checked that yet, but was considering to, since UCS-2 only maps the BMP. I only used UCS-2 because it is what was already used by the code... Anyways, switching to UTF16 will require some changes to the continuation trick...
Comment on attachment 241458 [details] [diff] [review] New preliminary patch you must change the IID when changing an interface. +static const char * UCS_2 = "UCS-2"; and this should be an array instead of a pointer
Attachment #241458 - Flags: review?(smontagu) → review-
other comments on the approach would be much appreciated ;)
Attached patch New patch (obsolete) — Splinter Review
This patch has been tested on little and big endians, and has been used in debian for epiphany/galeon/... for a few weeks. Note it also implements a fix for bug #356654 (though I don't know if it would work nice with other iconv implementations than the glibc's) and a workaround to make is do the same as the non-native implementation for iso-8859-1/windows-1252 (i.e the contrary of solving bug #288904)
Attachment #241458 - Attachment is obsolete: true
Attachment #254071 - Flags: review?(cbiesinger)
The latest patch fails against the trunk. [snip] patching file intl/uconv/native/nsNativeUConvService.cpp Hunk #10 FAILED at 477. Hunk #11 succeeded at 501 (offset 3 lines). Hunk #12 FAILED at 540. 2 out of 12 hunks FAILED -- saving rejects to file intl/uconv/native/nsNativeUConvService.cpp.rej [/snip] Also seems that some of the upstream changes have introduced a compile failure, so this bug is blocked by bug #372648 at the moment.
Attached patch New patch against trunk (obsolete) — Splinter Review
This patch is against trunk, and also fixes an issue when UTF16 input doesn't have a mapping in the output charset. Though there is still an issue with the replacement character when mReplaceOnError is set but I would like to be sure what mReplaceChar is supposed to be. Actually the whole concept of replace character on unicodeencoder sounds broken to me, the replacement character depeding on the destination character set.
Attachment #254071 - Attachment is obsolete: true
Attachment #265513 - Flags: review?(cbiesinger)
Attachment #254071 - Flags: review?(cbiesinger)
Mike, can you make sure your patch is still valid on the trunk? biesi, do you have time to review this sometime?
QA Contact: amyy → i18n
A one line change was necessary to fit changes on trunk since May. I also added myself to contributors in intl/uconv/native/nsNativeUConvService.cpp.
Attachment #265513 - Attachment is obsolete: true
Attachment #284739 - Flags: review?(cbiesinger)
Attachment #265513 - Flags: review?(cbiesinger)
(In reply to comment #19) > BTW, glibc iconv adds a BOM unless the endianness is explicitly specified. So, > it needs to be either 'UTF-16LE' or 'UTF-16BE' (see > xpcom/io/nsNativeCharsetUtils.cpp) Unfortunately, it adds a BOM even when UTF-16LE or UTF-16BE is specified :-/
(In reply to comment #28) > (In reply to comment #19) > > BTW, glibc iconv adds a BOM unless the endianness is explicitly specified. So, > > it needs to be either 'UTF-16LE' or 'UTF-16BE' (see > > xpcom/io/nsNativeCharsetUtils.cpp) > > Unfortunately, it adds a BOM even when UTF-16LE or UTF-16BE is specified :-/ Actually, it doesn't *add* it, it just doesn't remove it in conversions from UTF-8 to UTF-16LE or UTF-16BE...
Comment on attachment 284739 [details] [diff] [review] New patch against trunk Simon, got a chance to look over this?
Attachment #284739 - Flags: review?(smontagu)
Comment on attachment 284739 [details] [diff] [review] New patch against trunk This patch still has problems, and I stopped maintaining it...
Attachment #284739 - Flags: review?(smontagu)
Attachment #284739 - Flags: review?(cbiesinger)
Status: ASSIGNED → NEW
Assignee: mh+mozilla → smontagu
Depends on: 644801
Native uconv is gone.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: