35004 - UTF-8 converter stops conversion after encountering invalid characters

Reporter

Description

•

25 years ago

Mozilla stops displaying after reading invalid utf-8 bit characters. Example String is "Subject: Re: [utf_fr] Re: [FAQ] Comment lire et ÃÂ©crire" The story : I received an email in utf-8 from an OE user that had illegally some UTF8 characters in headers. (it probably has been converted from QP to 8 bit by an auto-converter) This one displays correctly. I've read it with NS 4.72, and answered back, resulting in invalid UTF-8 characters in the header as well as in the body of the message. As this message is in a mailing list, I got it back home and I just can't read the resulting mail in NS 6 PR1. The message windows stops displaying when it encourter the first invalid character both in the body and header. I saved the mail and opened the file in browser. In ISO-8859-1 the whole content is displayed. In UTF-8, the display stops after first invalid character. I include the file in wich I saved the mail as an attachment. When displaying it, switching to UTF-8 encoding should show the bug.

Jean-Marc Desperrier

Reporter

Comment 1

•

25 years ago

Attached file test case — Details

Jean-Marc Desperrier

Reporter

Comment 2

•

25 years ago

Thanks to the string I included earlier,the bug page too will show the problem if you switch to utf-8. Disabled auto-detection of charset during the display of the test page if you want to see the problem. I had it set and it would try to display my page as japanese, even when telling it's utf-8. By the way, under NS 4.72 in the sample, the header is not displayed correctly, but the body has no problem.

nhottanscp

Comment 3

•

25 years ago

I can reproduce both in mail and browser.

Status: UNCONFIRMED → NEW

Ever confirmed: true

nhottanscp

Comment 4

•

25 years ago

Looks like this is a converter problem. You can reproduce it by viewing the attachment of this bug in browser. Reassign to cata and cc to ftang.

Assignee: nhotta → cata

cata

Updated

•

25 years ago

Status: NEW → ASSIGNED

Target Milestone: --- → M18

Jean-Marc Desperrier

Reporter

Comment 5

•

25 years ago

If you have read the story part in my comment, you have seen the incorrect UTF-8 data was generated by NS 4.72. Indeed I have just found out that NS 4.72 incorrectly encodes the character "à" as C3 20 instead of C3 A0. (C3 20 is not just wrong, it's a forbidden sequence in UTF-8). So in an environment where both NS 4.72 and NS 6/Mozilla are used, such invalid UTF-8 data will be encountered quite often. I have reported this bug on the newsgroup netscape.communicator, but Mozilla will probably have to accept (or at least behave nicely with) those invalid characters.

Giacomo Magnini

Comment 6

•

25 years ago

This looks similar to #25037. Maybe they are the same or depend on each other. Can somebody take a look?

Katsuhiko Momoi

Comment 7

•

25 years ago

No, this bug is not the same as Bug 25037. That bug was a temporary regression and was fixed. This one still occurs and is attributable to a different cause.

Frank Tang

Assignee

Updated

•

25 years ago

Keywords: nsbeta3, polish

Frank Tang

Assignee

Comment 8

•

25 years ago

Reassinging to myself.

Assignee: cata → ftang

Status: ASSIGNED → NEW

Frank Tang

Assignee

Comment 9

•

25 years ago

mark as assign

Status: NEW → ASSIGNED

Frank Tang

Assignee

Comment 10

•

24 years ago

nhotta- pleaes take a look at it since the problem is in MailNews the converter should stop and report error when it hit illegal UTF8. The caller should do error handling nicer. In the brwoser, what I do is *eat* one byte , reset the decoder and then continue.

Assignee: ftang → nhotta

Status: ASSIGNED → NEW

John G. Myers

Comment 11

•

24 years ago

Adding a testcase.

URL: http://www.cl.cam.ac.uk/~mgk25/ucs/ex...

nhottanscp

Updated

•

24 years ago

Status: NEW → ASSIGNED

lchiang

Updated

•

24 years ago

Keywords: mailtrack

Frank Tang

Assignee

Comment 12

•

24 years ago

nsbeta3- per i18n bug meeting. This will only impact UTF8 mail, right ? how many people send illegal UTF-8 mail ?

Whiteboard: nsbeta3-

John G. Myers

Comment 13

•

24 years ago

This affects more than just mail, this affects all UTF-8 content. This has potential security implications, as it may be possible to get malicious content past input validators.

Whiteboard: nsbeta3-

Jean-Marc Desperrier

Reporter

Comment 14

•

24 years ago

F. Tang, please consider my 2000-05-02 06:54 comment. Every user of NS 4.xx will frequently send invalid data when trying to use UTF-8 in mail or news, because a OxA0 in the sequence will get replaced by 0x20. This also happens with a greek character. The browser behaviour (eat one byte, reset) sounds adequate. It shouldn't take very long to implement that in mail/news ?

Jaime Rodriguez, Jr.

Comment 15

•

24 years ago

Need more information. Ftang to investigate.

Jean-Marc Desperrier

Reporter

Comment 16

•

24 years ago

About that NS 4.xx bug generating invalid utf-8 : The greek character concerned is capital pi (CE A0 in utf-8), as also the unicode character 01A0 (C6 A0). I suspect NS 4.xx destroys utf-8 characters every time they end with A0. It's still happening with NS 4.75.

Katsuhiko Momoi

Comment 17

•

24 years ago

I filed a Browser bug, Bug 50049, which seems to be due to the same weakness in our code. The point of the other bug is that this problem manifests in more ways than just in mis-created mail messages and that we should reallt fix this problem for beta3. jgmyers also mentions another possible way in which the weakness can be exploited and the test cases above.

Jaime Rodriguez, Jr.

Comment 18

•

24 years ago

Marking nsbeta3+ per I18N Bug Triage. Fix this is Browser and Mail.

Whiteboard: [nsbeta3+]

nhottanscp

Comment 19

•

24 years ago

So fix for mail and browser? Then we better change UTF-8 decoder.

Frank Tang

Assignee

Comment 20

•

24 years ago

we may need to change both the decoder and the caller. The following fix make the Browser do not display as blank. However, I think some of the caller still need to do error handling . also in http://warp/u/ftang/tmp/fix35004.txt Index: nsUTF8ToUnicode.cpp =================================================================== RCS file: /m/pub/mozilla/intl/uconv/ucvlatin/nsUTF8ToUnicode.cpp,v retrieving revision 1.7 diff -u -r1.7 nsUTF8ToUnicode.cpp --- nsUTF8ToUnicode.cpp 2000/04/25 13:50:41 1.7 +++ nsUTF8ToUnicode.cpp 2000/08/26 07:53:30 @@ -85,7 +85,7 @@ nsresult res; // conversion result - for(in=aSrc,out=aDest,res=nsnull;((in < inend) && (out < outend)); in++) + for(in=aSrc,out=aDest,res=NS_OK;((in < inend) && (out < outend)); in++) { if(0 == mState) { if( 0 == (0x80 & (*in))) { @@ -138,7 +138,7 @@ if(0 == --mState) { if(mUcs4 >= 0x00010000) { - if(mUcs4 >= 0x001F0000) { + if(mUcs4 >= 0x00110000) { *out++ = 0xFFFD; } else { mUcs4 -= 0x00010000; @@ -168,11 +168,13 @@ } //output not finished, output buffer too short - if ((in < inend) && (out >= outend)) res = NS_OK_UDEC_MOREOUTPUT; + if((NS_OK == res) && (in < inend) && (out >= outend)) + res = NS_OK_UDEC_MOREOUTPUT; //last USC4 is incomplete, make sure the caller //returns with properly aligned continuation of the buffer - if (mState != 0) res = NS_OK_UDEC_MOREINPUT; + if ((NS_OK == res) && (mState != 0)) + res = NS_OK_UDEC_MOREINPUT; *aSrcLength = in - aSrc; *aDestLength = out - aDest; the above patch also partial fix 41425. can some one give me a full list of problem area. Then we can test it. My patch fix the browser show blank issue. It won't treat the 0x20 as 0xA0. It will CONTINUE. What other places have this issue. can this fix it ? If not, we should know why, maybe we need to add code to the caller also.

Frank Tang

Assignee

Updated

•

24 years ago

Whiteboard: [nsbeta3+] → [nsbeta3+]Possible patch in hand. Need code review and verify with other cases

Frank Tang

Assignee

Comment 21

•

24 years ago

With my latest patch, I can see http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt at least. The showing may not be as the expect result as the author wrote.

R.K.Aa.

Comment 22

•

24 years ago

Is bug 50459 a duplicate of this one?

Jean-Marc Desperrier

Reporter

Comment 23

•

24 years ago

You did the analyze, R.K.Aa 2 : bug 50459 is not fully a duplicate, but it shows that if ever something is interpreted in utf-8 when it shouldn't be, this bug increases the problem. I consider the behaviour of not interprating 0x20 as 0xA0 as adequate, users of NS 4.x who need utf-8 and adequate i18n support, really should update to version 6 anyway. By continuing, the content of the mail/web page should be interpretable, even if not fully correct.

nhottanscp

Comment 24

•

24 years ago

Attached image Screen shot of the test case view after Frank's patch — Details

nhottanscp

Comment 25

•

24 years ago

The mail view of the test case does not stop and continues conversion after Frank's patch, reassign to him.

Assignee: nhotta → ftang

Status: ASSIGNED → NEW

Jean-Marc Desperrier

Reporter

Comment 26

•

24 years ago

I think the result is great. I thought you were goind to eat the missing character, but replacing it by a question mark is better. Just one room for enhancement : There should be a visual difference between an invalid character and a character that can't be displayed because the font is missing. Was the last resort font mecanism finally incorporated inside Mozilla ? (this was discussed in the thread named Request for review on fix of #35910 - PostScript printing error, on netscape.public.mozilla.i18n in april).

Frank Tang

Assignee

Updated

•

24 years ago

Status: NEW → ASSIGNED

Whiteboard: [nsbeta3+]Possible patch in hand. Need code review and verify with other cases → [nsbeta3+]patch reviewed, wait for tree open to check in.

Frank Tang

Assignee

Comment 27

•

24 years ago

fix and check in.

Status: ASSIGNED → RESOLVED

Closed: 24 years ago

Resolution: --- → FIXED

John G. Myers

Comment 28

•

24 years ago

Per the testcase URL, the converter still eats one character following an incomplete UTF-8 sequence.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

John G. Myers

Comment 29

•

24 years ago

Attached patch Proposed fix — Details — Splinter Review

Frank Tang

Assignee

Comment 30

•

24 years ago

check in the new fix john proposed

Status: REOPENED → ASSIGNED

Frank Tang

Assignee

Comment 31

•

24 years ago

fix check in.

Status: ASSIGNED → RESOLVED

Closed: 24 years ago → 24 years ago

Resolution: --- → FIXED

Katsuhiko Momoi

Comment 32

•

24 years ago

QA contact to marina. Please use the test case by jean-Marc and see if the problem occurs any more

QA Contact: momoi → marina

Jean-Marc Desperrier

Reporter

Comment 33

•

24 years ago

I tried with build 2000091021. My test case displays as expected. The "à" character becomes "? ". It doesn't look very great for this test sample, but it's probably the standard way it has to be done. I also watched the test case of Marcus Kuhn. It works quite good exept that it accepts invalid characters representation of for example the character "/". This _might_ mean a security problem if some day this decoder is used for something else than displaying ("decoding an utf-8 sequence before doing file access"). The last resort font mecanism is not implemented in Mozilla as I see. Invalid UTF-8 should be diplayed in a way that's different from the way a character that can not be found in the current font is displayed.

John G. Myers

Comment 34

•

24 years ago

The "accepting invalid overlong sequences" problem is bug 50702.

Jean-Marc Desperrier

Reporter

Comment 35

•

24 years ago

And what about opening a bug about the fact "invalid" and "not found in available fonts" characters are displayed the same ? It would link to "last resort font" references for suggestions about the way the characters that are not found in the available fonts are displayed.

marina

Comment 36

•

24 years ago

i see the correct display in 2000-09-20 build, verifying as fixed

Status: RESOLVED → VERIFIED

Myk Melez [:myk] [@mykmelez]

Updated

•

20 years ago

Product: MailNews → Core

Nobody; OK to take it and work on it

Updated

•

17 years ago

Product: Core → MailNews Core

test case 25 years ago Jean-Marc Desperrier 4.89 KB, text/plain		Details
Screen shot of the test case view after Frank's patch 24 years ago nhottanscp 140.78 KB, image/jpeg		Details
Proposed fix 24 years ago John G. Myers 564 bytes, patch		Details \| Diff \| Splinter Review