Fails to load invalid UTF-8 "properly" when using native uconv module

RESOLVED WONTFIX

Status

()

Core
Internationalization
RESOLVED WONTFIX
12 years ago
7 years ago

People

(Reporter: glandium, Assigned: smontagu)

Tracking

Trunk
x86
Linux
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 5 obsolete attachments)

(Reporter)

Description

12 years ago
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.8.0.1) Gecko/20060313 Debian/1.5.dfsg+1.5.0.1-4 Firefox/1.5.0.1
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.8.0.1) Gecko/20060313 Debian/1.5.dfsg+1.5.0.1-4 Firefox/1.5.0.1

After enabling --enable-native-uconv, gecko is unable to succesfully load invalid UTF-8 (such as http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt), and fails on invalid UTF-8 javascripts, which has a nice side effect: breaking the "Website Certified by an Unknown Authority" dialog because newserver.js is not UTF-8, but I'm going to file another bug for that.

Somehow, it gets a lot of garbage in the internal representation of the data, and pango complains a lot about it:
(Gecko:1740): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()


Reproducible: Always
(Reporter)

Comment 1

12 years ago
Seems to be related to the fact that mReplaceOnError seems to never be true in nsNativeUConvService.cpp.
Assignee: nobody → smontagu
Component: General → Internationalization
Product: Firefox → Core
QA Contact: general → amyy
Version: unspecified → Trunk
(Reporter)

Comment 2

12 years ago
mmmm in fact, seen how nsUTF8ToUnicode is written, it seems more like a problem with return values...
(Reporter)

Comment 3

12 years ago
Created attachment 216300 [details] [diff] [review]
Basic patch

Here is a basic patch that solves the issue. The problem is that now, it doesn't throw a NS_ERROR_UENC_NOMAPPING when it should, ie in cases where the non native uconv would.
(Assignee)

Comment 4

12 years ago
Created attachment 216308 [details]
edited version of http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

I think your analysis is wrong, and the problem is only with specific characters which the UTF-8 decoder does not recognize as invalid, but pango does. This attachment removes such characters from the testcase and I see no Pango-WARNINGs with it. Do you get the same results? If so, this is a dupe of bug 172701.
(Reporter)

Comment 5

12 years ago
The pango issue is a side effect, it only shows that the string that is passed to pango is somehow borked.

The problem here is that the "native" implementation (iconv) and the "non native" one differ in their behaviour, and the code that calls the conversion doesn't act the same way with these different behaviours, thus an overall different experience with charsets.

Bug #172701 is actually the opposite bug. It's the lack of bug #172701 with the native implementation that is annoying in my case.

Note that the code that is not aborting the conversion (for bug #172701) is not the uconv service (native or not), but rather the code that calls the convert function. The uconv service, on its end, doesn't return the same status depending on the error it got during conversion between the native and the non native implementations.
(Assignee)

Comment 6

12 years ago
Sorry, I overlooked "After enabling --enable-native-uconv" in comment 0 and misunderstood what this bug report was about. 
(Assignee)

Updated

12 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
(Assignee)

Comment 7

12 years ago
(In reply to comment #3)
> The problem is that now, it
> doesn't throw a NS_ERROR_UENC_NOMAPPING when it should, ie in cases where the
> non native uconv would.

How about doing something like:

                if (aDestCharSize == 1)
                    return NS_ERROR_UENC_NOMAPPING;
                else
                    return NS_ERROR_UNEXPECTED;

With the patch the results aren't quite the same as with the non-native implementation, because the input pointer doesn't get resynchronized, but I don't believe that there is any way to do that with iconv so this may be the best we can do.
(Reporter)

Comment 8

12 years ago
are there any cases where aDestCharSize == 1 ? most of the time, the dest charset is UCS-2, isn't it ?

what do you mean about the input pointer resynchronisation ?
(Assignee)

Comment 9

12 years ago
IConvAdaptor::ConvertInternal is used by the implementations of both  nsIUnicodeDecoder and nsIUnicodeEncoder, i.e. for conversions both to and from Unicode. In the non-native implementation only nsIUnicodeEncoder returns NS_ERROR_UENC_NOMAPPING.

Input resynchronization should occur in cases like 3.3.2 in http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
The non-native decoder at http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/nsUTF8ToUnicode.cpp#233 returns NS_ERROR_UNEXPECTED and will resume after the incomplete sequence, so that the whole sequence is decoded as a single replacement character. With your patch, the native implementation will resume at the second character of the incomplete sequence, which will also fail, and the whole sequence will be decoded as two replacement characters.
(Reporter)

Comment 10

12 years ago
OTOH, NS_ERROR_UNEXPECTED is not returned in a lot of cases in the non native implementation.
Apart from nsUTF8ToUnicode, it is returned by nsUnicodeToJamoTTF::composeHangul, but it's ignored in nsUnicodeToJamoTTF::Convert (interesting that the return code is not checked). It is also returned in nsISO2022JPToUnicodeV2::Convert in some cases.
Most of the nsIUnicodeDecoders return NS_OK_UDEC_MOREOUTPUT or NS_OK_UDEC_MOREINPUT.

What we could do is set NS_ERROR_UNEXPECTED when running as a nsIUnicodeDecoder and NS_ERROR_UENC_NOMAPPING when running as a nsIUnicodeEncoder.
I'm not sure (aDestCharSize == 1) would be the adequate test for that. Testing if the source charset is UCS-2 at initialization could do the trick. What do you think ?
(Assignee)

Comment 11

12 years ago
(In reply to comment #10)

> What we could do is set NS_ERROR_UNEXPECTED when running as a nsIUnicodeDecoder
> and NS_ERROR_UENC_NOMAPPING when running as a nsIUnicodeEncoder.

Yeah, that's what I was shooting for.

> I'm not sure (aDestCharSize == 1) would be the adequate test for that. Testing
> if the source charset is UCS-2 at initialization could do the trick. What do
> you think ?

That might be better, especially if we make "UCS-2" a #define and use it in nsCharsetConverterManager.cpp instead of the current hard-coded string.
(Reporter)

Comment 12

12 years ago
> That might be better, especially if we make "UCS-2" a #define and use it in
> nsCharsetConverterManager.cpp instead of the current hard-coded string.

I guess you meant a const char * instead of a #define. 

(Assignee)

Comment 13

12 years ago
exactly so :)
(Reporter)

Comment 14

12 years ago
Created attachment 216821 [details] [diff] [review]
New patch with proposed changes

Here is the patch implemented as we agreed. Maybe the const name should be changed, i don't know... I was thinking we could also avoid using a const if we used nsnull as the value to say we want UCS-2...
Anyways, is the patch okay as is, or should i make some changes ?
Attachment #216300 - Attachment is obsolete: true
Attachment #216821 - Flags: review?(smontagu)
(Reporter)

Comment 15

12 years ago
ping
(Assignee)

Updated

12 years ago
Attachment #216821 - Flags: review?(smontagu) → review+
(Reporter)

Comment 16

12 years ago
FWIW, the patch is not enough as is, there is a corner case when reaching end of buffer when the converter is called by the html stream parser, which sends the html 4096 bytes at a time, and a multibyte character is cut at the end of the buffer. I'm working on a new and better patch.
(Reporter)

Comment 17

12 years ago
Created attachment 241458 [details] [diff] [review]
New preliminary patch

(Patch against 1.8.0 branch)
This patch is mostly a rewrite of the native uconv module for iconv. It splits the convert methods for conversion to unicode and from unicode, thus removing the ConvertInternal method, and tries to implement at best the nsIUnicodeDecoder and nsIUnicodeEncoder interfaces. The continuation part still has issues, but much less than the current implementation and the previous patch.
I also modified the nsINativeUConvService interface (which means the wince native uconv would also need modification) so that it is cleaner for the UCS_2 thing we introduced in the previous patch. I wonder if it is possible to drop the nsISupports wrapper kinda thing, but I don't know much about that stuff...
Please comment on these changes so that I know if it is suitable to mozilla before finalizing and fixing the last issues.
Assignee: smontagu → mh+mozilla
Attachment #216821 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #241458 - Flags: review?(smontagu)
(Reporter)

Comment 18

12 years ago
Comment on attachment 241458 [details] [diff] [review]
New preliminary patch

I'm still waiting feedback on this patch. Note that in the meanwhile I found out one bug with the patch when mReplaceOnError is true, which is due to bad while condition (*aSrcLength should be replaced by inLeft).
(Reporter)

Updated

12 years ago
Blocks: 356654

Comment 19

12 years ago
(In reply to comment #14)

> used nsnull as the value to say we want UCS-2...

Why do you use UCS-2? Our internal character encoding (AString, PRUnichar *) is UTF-16. Am I missing anything? 

BTW, glibc iconv adds a BOM unless the endianness is explicitly specified. So, it needs to be either 'UTF-16LE' or 'UTF-16BE' (see xpcom/io/nsNativeCharsetUtils.cpp)

(Reporter)

Comment 20

12 years ago
(In reply to comment #19)
> (In reply to comment #14)
> 
> > used nsnull as the value to say we want UCS-2...
> 
> Why do you use UCS-2? Our internal character encoding (AString, PRUnichar *) is
> UTF-16. Am I missing anything?

I haven't checked that yet, but was considering to, since UCS-2 only maps the BMP. I only used UCS-2 because it is what was already used by the code... Anyways, switching to UTF16 will require some changes to the continuation trick...
Comment on attachment 241458 [details] [diff] [review]
New preliminary patch

you must change the IID when changing an interface.

+static const char * UCS_2 = "UCS-2";

and this should be an array instead of a pointer
Attachment #241458 - Flags: review?(smontagu) → review-
(Reporter)

Comment 22

12 years ago
other comments on the approach would be much appreciated ;)
(Reporter)

Comment 23

12 years ago
Created attachment 254071 [details] [diff] [review]
New patch

This patch has been tested on little and big endians, and has been used in debian for epiphany/galeon/... for a few weeks.

Note it also implements a fix for bug #356654 (though I don't know if it would work nice with other iconv implementations than the glibc's) and a workaround to make is do the same as the non-native implementation for iso-8859-1/windows-1252 (i.e the contrary of solving bug #288904)
Attachment #241458 - Attachment is obsolete: true
Attachment #254071 - Flags: review?(cbiesinger)

Comment 24

11 years ago
The latest patch fails against the trunk. 

[snip]
patching file intl/uconv/native/nsNativeUConvService.cpp
Hunk #10 FAILED at 477.
Hunk #11 succeeded at 501 (offset 3 lines).
Hunk #12 FAILED at 540.
2 out of 12 hunks FAILED -- saving rejects to file intl/uconv/native/nsNativeUConvService.cpp.rej
[/snip]

Also seems that some of the upstream changes have introduced a compile failure, so this bug is blocked by bug #372648 at the moment.
(Reporter)

Comment 25

11 years ago
Created attachment 265513 [details] [diff] [review]
New patch against trunk

This patch is against trunk, and also fixes an issue when UTF16 input doesn't have a mapping in the output charset. Though there is still an issue with the replacement character when mReplaceOnError is set but I would like to be sure what mReplaceChar is supposed to be. Actually the whole concept of replace character on unicodeencoder sounds broken to me, the replacement character depeding on the destination character set.
Attachment #254071 - Attachment is obsolete: true
Attachment #265513 - Flags: review?(cbiesinger)
Attachment #254071 - Flags: review?(cbiesinger)
Mike, can you make sure your patch is still valid on the trunk?

biesi, do you have time to review this sometime?
QA Contact: amyy → i18n
(Reporter)

Comment 27

11 years ago
Created attachment 284739 [details] [diff] [review]
New patch against trunk

A one line change was necessary to fit changes on trunk since May. I also added myself to contributors in intl/uconv/native/nsNativeUConvService.cpp.
Attachment #265513 - Attachment is obsolete: true
Attachment #284739 - Flags: review?(cbiesinger)
Attachment #265513 - Flags: review?(cbiesinger)
(Reporter)

Comment 28

10 years ago
(In reply to comment #19)
> BTW, glibc iconv adds a BOM unless the endianness is explicitly specified. So,
> it needs to be either 'UTF-16LE' or 'UTF-16BE' (see
> xpcom/io/nsNativeCharsetUtils.cpp)

Unfortunately, it adds a BOM even when UTF-16LE or UTF-16BE is specified :-/

(Reporter)

Comment 29

10 years ago
(In reply to comment #28)
> (In reply to comment #19)
> > BTW, glibc iconv adds a BOM unless the endianness is explicitly specified. So,
> > it needs to be either 'UTF-16LE' or 'UTF-16BE' (see
> > xpcom/io/nsNativeCharsetUtils.cpp)
> 
> Unfortunately, it adds a BOM even when UTF-16LE or UTF-16BE is specified :-/

Actually, it doesn't *add* it, it just doesn't remove it in conversions from UTF-8 to UTF-16LE or UTF-16BE... 

Comment on attachment 284739 [details] [diff] [review]
New patch against trunk

Simon, got a chance to look over this?
Attachment #284739 - Flags: review?(smontagu)
(Reporter)

Comment 31

10 years ago
Comment on attachment 284739 [details] [diff] [review]
New patch against trunk

This patch still has problems, and I stopped maintaining it...
Attachment #284739 - Flags: review?(smontagu)
Attachment #284739 - Flags: review?(cbiesinger)
(Reporter)

Updated

8 years ago
Status: ASSIGNED → NEW
(Reporter)

Updated

8 years ago
Assignee: mh+mozilla → smontagu
(Assignee)

Updated

7 years ago
Depends on: 644801
(Assignee)

Comment 32

7 years ago
Native uconv is gone.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.