Closed Bug 331748 Opened 19 years ago Closed 14 years ago

Fails to load invalid UTF-8 "properly" when using native uconv module

Tracking

()

Status:

RESOLVED WONTFIX

People

(Reporter: glandium, Assigned: smontagu)

References

Details

Attachments

(2 files, 5 obsolete files)

Basic patch 19 years ago Mike Hommey [:glandium] 592 bytes, patch		Details \| Diff \| Splinter Review
edited version of http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt 19 years ago Simon Montagu :smontagu 19.46 KB, text/plain		Details
New patch with proposed changes 19 years ago Mike Hommey [:glandium] 2.18 KB, patch	smontagu : review+	Details \| Diff \| Splinter Review
New preliminary patch 18 years ago Mike Hommey [:glandium] 16.79 KB, patch	Biesinger : review-	Details \| Diff \| Splinter Review
New patch 18 years ago Mike Hommey [:glandium] 19.53 KB, patch		Details \| Diff \| Splinter Review
New patch against trunk 18 years ago Mike Hommey [:glandium] 22.68 KB, patch		Details \| Diff \| Splinter Review
New patch against trunk 17 years ago Mike Hommey [:glandium] 20.38 KB, patch		Details \| Diff \| Splinter Review

Mike Hommey [:glandium]

Reporter

Description

•

19 years ago

User-Agent: Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.8.0.1) Gecko/20060313 Debian/1.5.dfsg+1.5.0.1-4 Firefox/1.5.0.1 Build Identifier: Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.8.0.1) Gecko/20060313 Debian/1.5.dfsg+1.5.0.1-4 Firefox/1.5.0.1 After enabling --enable-native-uconv, gecko is unable to succesfully load invalid UTF-8 (such as http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt), and fails on invalid UTF-8 javascripts, which has a nice side effect: breaking the "Website Certified by an Unknown Authority" dialog because newserver.js is not UTF-8, but I'm going to file another bug for that. Somehow, it gets a lot of garbage in the internal representation of the data, and pango complains a lot about it: (Gecko:1740): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text() Reproducible: Always

Mike Hommey [:glandium]

Reporter

Comment 1

•

19 years ago

Seems to be related to the fact that mReplaceOnError seems to never be true in nsNativeUConvService.cpp.

:Gavin Sharp [email: gavin@gavinsharp.com]

Updated

•

19 years ago

Assignee: nobody → smontagu

Component: General → Internationalization

Product: Firefox → Core

QA Contact: general → amyy

Version: unspecified → Trunk

Mike Hommey [:glandium]

Reporter

Comment 2

•

19 years ago

mmmm in fact, seen how nsUTF8ToUnicode is written, it seems more like a problem with return values...

Mike Hommey [:glandium]

Reporter

Comment 3

•

19 years ago

Attached patch Basic patch (obsolete) — Details — Splinter Review

Here is a basic patch that solves the issue. The problem is that now, it doesn't throw a NS_ERROR_UENC_NOMAPPING when it should, ie in cases where the non native uconv would.

Simon Montagu :smontagu

Assignee

Comment 4

•

19 years ago

Attached file edited version of http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt — Details

I think your analysis is wrong, and the problem is only with specific characters which the UTF-8 decoder does not recognize as invalid, but pango does. This attachment removes such characters from the testcase and I see no Pango-WARNINGs with it. Do you get the same results? If so, this is a dupe of bug 172701.

Mike Hommey [:glandium]

Reporter

Comment 5

•

19 years ago

The pango issue is a side effect, it only shows that the string that is passed to pango is somehow borked. The problem here is that the "native" implementation (iconv) and the "non native" one differ in their behaviour, and the code that calls the conversion doesn't act the same way with these different behaviours, thus an overall different experience with charsets. Bug #172701 is actually the opposite bug. It's the lack of bug #172701 with the native implementation that is annoying in my case. Note that the code that is not aborting the conversion (for bug #172701) is not the uconv service (native or not), but rather the code that calls the convert function. The uconv service, on its end, doesn't return the same status depending on the error it got during conversion between the native and the non native implementations.

Simon Montagu :smontagu

Assignee

Comment 6

•

19 years ago

Sorry, I overlooked "After enabling --enable-native-uconv" in comment 0 and misunderstood what this bug report was about.

Simon Montagu :smontagu

Assignee

Updated

•

19 years ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

Simon Montagu :smontagu

Assignee

Comment 7

•

19 years ago

(In reply to comment #3) > The problem is that now, it > doesn't throw a NS_ERROR_UENC_NOMAPPING when it should, ie in cases where the > non native uconv would. How about doing something like: if (aDestCharSize == 1) return NS_ERROR_UENC_NOMAPPING; else return NS_ERROR_UNEXPECTED; With the patch the results aren't quite the same as with the non-native implementation, because the input pointer doesn't get resynchronized, but I don't believe that there is any way to do that with iconv so this may be the best we can do.

Mike Hommey [:glandium]

Reporter

Comment 8

•

19 years ago

are there any cases where aDestCharSize == 1 ? most of the time, the dest charset is UCS-2, isn't it ? what do you mean about the input pointer resynchronisation ?

Simon Montagu :smontagu

Assignee

Comment 9

•

19 years ago

IConvAdaptor::ConvertInternal is used by the implementations of both nsIUnicodeDecoder and nsIUnicodeEncoder, i.e. for conversions both to and from Unicode. In the non-native implementation only nsIUnicodeEncoder returns NS_ERROR_UENC_NOMAPPING. Input resynchronization should occur in cases like 3.3.2 in http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt The non-native decoder at http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUTF8ToUnicode.cpp#233 returns NS_ERROR_UNEXPECTED and will resume after the incomplete sequence, so that the whole sequence is decoded as a single replacement character. With your patch, the native implementation will resume at the second character of the incomplete sequence, which will also fail, and the whole sequence will be decoded as two replacement characters.

Mike Hommey [:glandium]

Reporter

Comment 10

•

19 years ago

OTOH, NS_ERROR_UNEXPECTED is not returned in a lot of cases in the non native implementation. Apart from nsUTF8ToUnicode, it is returned by nsUnicodeToJamoTTF::composeHangul, but it's ignored in nsUnicodeToJamoTTF::Convert (interesting that the return code is not checked). It is also returned in nsISO2022JPToUnicodeV2::Convert in some cases. Most of the nsIUnicodeDecoders return NS_OK_UDEC_MOREOUTPUT or NS_OK_UDEC_MOREINPUT. What we could do is set NS_ERROR_UNEXPECTED when running as a nsIUnicodeDecoder and NS_ERROR_UENC_NOMAPPING when running as a nsIUnicodeEncoder. I'm not sure (aDestCharSize == 1) would be the adequate test for that. Testing if the source charset is UCS-2 at initialization could do the trick. What do you think ?

Simon Montagu :smontagu

Assignee

Comment 11

•

19 years ago

(In reply to comment #10) > What we could do is set NS_ERROR_UNEXPECTED when running as a nsIUnicodeDecoder > and NS_ERROR_UENC_NOMAPPING when running as a nsIUnicodeEncoder. Yeah, that's what I was shooting for. > I'm not sure (aDestCharSize == 1) would be the adequate test for that. Testing > if the source charset is UCS-2 at initialization could do the trick. What do > you think ? That might be better, especially if we make "UCS-2" a #define and use it in nsCharsetConverterManager.cpp instead of the current hard-coded string.

Mike Hommey [:glandium]

Reporter

Comment 12

•

19 years ago

> That might be better, especially if we make "UCS-2" a #define and use it in > nsCharsetConverterManager.cpp instead of the current hard-coded string. I guess you meant a const char * instead of a #define.

Simon Montagu :smontagu

Assignee

Comment 13

•

19 years ago

exactly so :)

Mike Hommey [:glandium]

Reporter

Comment 14

•

19 years ago

Attached patch New patch with proposed changes (obsolete) — Details — Splinter Review

Here is the patch implemented as we agreed. Maybe the const name should be changed, i don't know... I was thinking we could also avoid using a const if we used nsnull as the value to say we want UCS-2... Anyways, is the patch okay as is, or should i make some changes ?

Attachment #216300 - Attachment is obsolete: true

Attachment #216821 - Flags: review?(smontagu)

Mike Hommey [:glandium]

Reporter

Comment 15

•

18 years ago

ping

Simon Montagu :smontagu

Assignee

Updated

•

18 years ago

Attachment #216821 - Flags: review?(smontagu) → review+

Mike Hommey [:glandium]

Reporter

Comment 16

•

18 years ago

FWIW, the patch is not enough as is, there is a corner case when reaching end of buffer when the converter is called by the html stream parser, which sends the html 4096 bytes at a time, and a multibyte character is cut at the end of the buffer. I'm working on a new and better patch.

Mike Hommey [:glandium]

Reporter

Comment 17

•

18 years ago

Attached patch New preliminary patch (obsolete) — Details — Splinter Review

(Patch against 1.8.0 branch) This patch is mostly a rewrite of the native uconv module for iconv. It splits the convert methods for conversion to unicode and from unicode, thus removing the ConvertInternal method, and tries to implement at best the nsIUnicodeDecoder and nsIUnicodeEncoder interfaces. The continuation part still has issues, but much less than the current implementation and the previous patch. I also modified the nsINativeUConvService interface (which means the wince native uconv would also need modification) so that it is cleaner for the UCS_2 thing we introduced in the previous patch. I wonder if it is possible to drop the nsISupports wrapper kinda thing, but I don't know much about that stuff... Please comment on these changes so that I know if it is suitable to mozilla before finalizing and fixing the last issues.

Assignee: smontagu → mh+mozilla

Attachment #216821 - Attachment is obsolete: true

Status: NEW → ASSIGNED

Attachment #241458 - Flags: review?(smontagu)

Mike Hommey [:glandium]

Reporter

Comment 18

•

18 years ago

Comment on attachment 241458 [details] [diff] [review] New preliminary patch I'm still waiting feedback on this patch. Note that in the meanwhile I found out one bug with the patch when mReplaceOnError is true, which is due to bad while condition (*aSrcLength should be replaced by inLeft).

Mike Hommey [:glandium]

Reporter

Updated

•

18 years ago

Blocks: 356654

Jungshik Shin

Comment 19

•

18 years ago

(In reply to comment #14) > used nsnull as the value to say we want UCS-2... Why do you use UCS-2? Our internal character encoding (AString, PRUnichar *) is UTF-16. Am I missing anything? BTW, glibc iconv adds a BOM unless the endianness is explicitly specified. So, it needs to be either 'UTF-16LE' or 'UTF-16BE' (see xpcom/io/nsNativeCharsetUtils.cpp)

Mike Hommey [:glandium]

Reporter

Comment 20

•

18 years ago

(In reply to comment #19) > (In reply to comment #14) > > > used nsnull as the value to say we want UCS-2... > > Why do you use UCS-2? Our internal character encoding (AString, PRUnichar *) is > UTF-16. Am I missing anything? I haven't checked that yet, but was considering to, since UCS-2 only maps the BMP. I only used UCS-2 because it is what was already used by the code... Anyways, switching to UTF16 will require some changes to the continuation trick...

Christian :Biesinger (don't email me, ping me on IRC)

Comment 21

•

18 years ago

Comment on attachment 241458 [details] [diff] [review] New preliminary patch you must change the IID when changing an interface. +static const char * UCS_2 = "UCS-2"; and this should be an array instead of a pointer

Attachment #241458 - Flags: review?(smontagu) → review-

Mike Hommey [:glandium]

Reporter

Comment 22

•

18 years ago

other comments on the approach would be much appreciated ;)

Mike Hommey [:glandium]

Reporter

Comment 23

•

18 years ago

Attached patch New patch (obsolete) — Details — Splinter Review

This patch has been tested on little and big endians, and has been used in debian for epiphany/galeon/... for a few weeks. Note it also implements a fix for bug #356654 (though I don't know if it would work nice with other iconv implementations than the glibc's) and a workaround to make is do the same as the non-native implementation for iso-8859-1/windows-1252 (i.e the contrary of solving bug #288904)

Attachment #241458 - Attachment is obsolete: true

Attachment #254071 - Flags: review?(cbiesinger)

Priit Laes

Comment 24

•

18 years ago

The latest patch fails against the trunk. [snip] patching file intl/uconv/native/nsNativeUConvService.cpp Hunk #10 FAILED at 477. Hunk #11 succeeded at 501 (offset 3 lines). Hunk #12 FAILED at 540. 2 out of 12 hunks FAILED -- saving rejects to file intl/uconv/native/nsNativeUConvService.cpp.rej [/snip] Also seems that some of the upstream changes have introduced a compile failure, so this bug is blocked by bug #372648 at the moment.

Mike Hommey [:glandium]

Reporter

Comment 25

•

18 years ago

Attached patch New patch against trunk (obsolete) — Details — Splinter Review

This patch is against trunk, and also fixes an issue when UTF16 input doesn't have a mapping in the output charset. Though there is still an issue with the replacement character when mReplaceOnError is set but I would like to be sure what mReplaceChar is supposed to be. Actually the whole concept of replace character on unicodeencoder sounds broken to me, the replacement character depeding on the destination character set.

Attachment #254071 - Attachment is obsolete: true

Attachment #265513 - Flags: review?(cbiesinger)

Attachment #254071 - Flags: review?(cbiesinger)

Reed Loden [:reed]

Comment 26

•

17 years ago

Mike, can you make sure your patch is still valid on the trunk? biesi, do you have time to review this sometime?

QA Contact: amyy → i18n

Mike Hommey [:glandium]

Reporter

Comment 27

•

17 years ago

Attached patch New patch against trunk — Details — Splinter Review

A one line change was necessary to fit changes on trunk since May. I also added myself to contributors in intl/uconv/native/nsNativeUConvService.cpp.

Attachment #265513 - Attachment is obsolete: true

Attachment #284739 - Flags: review?(cbiesinger)

Attachment #265513 - Flags: review?(cbiesinger)

Mike Hommey [:glandium]

Reporter

Comment 28

•

17 years ago

(In reply to comment #19) > BTW, glibc iconv adds a BOM unless the endianness is explicitly specified. So, > it needs to be either 'UTF-16LE' or 'UTF-16BE' (see > xpcom/io/nsNativeCharsetUtils.cpp) Unfortunately, it adds a BOM even when UTF-16LE or UTF-16BE is specified :-/

Mike Hommey [:glandium]

Reporter

Comment 29

•

17 years ago

(In reply to comment #28) > (In reply to comment #19) > > BTW, glibc iconv adds a BOM unless the endianness is explicitly specified. So, > > it needs to be either 'UTF-16LE' or 'UTF-16BE' (see > > xpcom/io/nsNativeCharsetUtils.cpp) > > Unfortunately, it adds a BOM even when UTF-16LE or UTF-16BE is specified :-/ Actually, it doesn't *add* it, it just doesn't remove it in conversions from UTF-8 to UTF-16LE or UTF-16BE...

Reed Loden [:reed]

Comment 30

•

16 years ago

Comment on attachment 284739 [details] [diff] [review] New patch against trunk Simon, got a chance to look over this?

Attachment #284739 - Flags: review?(smontagu)

Mike Hommey [:glandium]

Reporter

Comment 31

•

16 years ago

Comment on attachment 284739 [details] [diff] [review] New patch against trunk This patch still has problems, and I stopped maintaining it...

Attachment #284739 - Flags: review?(smontagu)

Attachment #284739 - Flags: review?(cbiesinger)

Mike Hommey [:glandium]

Reporter

Updated

•

15 years ago

Status: ASSIGNED → NEW

Mike Hommey [:glandium]

Reporter

Updated

•

15 years ago

Assignee: mh+mozilla → smontagu

Simon Montagu :smontagu

Assignee

Updated

•

14 years ago

Depends on: 644801

Simon Montagu :smontagu

Assignee

Comment 32

•

14 years ago

Native uconv is gone.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → WONTFIX

You need to log in before you can comment on or make changes to this bug.