Closed Bug 35004 Opened 24 years ago Closed 24 years ago

UTF-8 converter stops conversion after encountering invalid characters

Categories

(MailNews Core :: Internationalization, defect, P3)

x86
Linux
defect

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jmdesp, Assigned: ftang)

References

()

Details

(Keywords: polish, Whiteboard: [nsbeta3+]patch reviewed, wait for tree open to check in.)

Attachments

(3 files)

Mozilla stops displaying after reading invalid utf-8 bit characters.
Example String is "Subject: Re: [utf_fr] Re: [FAQ] Comment lire et écrire"

The story : I received an email in utf-8 from an OE user that had illegally some
UTF8 characters in headers. (it probably has been converted from QP to 8 bit by
an auto-converter)

This one displays correctly.

I've read it with NS 4.72, and answered back, resulting in invalid UTF-8
characters in the header as well as in the body of the message.
As this message is in a mailing list, I got it back home and I just can't read
the resulting mail in NS 6 PR1.

The message windows stops displaying when it encourter the first invalid
character both in the body and header.

I saved the mail and opened the file in browser.
In ISO-8859-1 the whole content is displayed.
In UTF-8, the display stops after first invalid character.

I include the file in wich I saved the mail as an attachment.
When displaying it, switching to UTF-8 encoding should show the bug.
Attached file test case
Thanks to the string I included earlier,the bug page too will show the problem
if you switch to utf-8.
Disabled auto-detection of charset during the display of the test page if you
want to see the problem. I had it set and it would try to display my page as
japanese, even when telling it's utf-8.

By the way, under NS 4.72 in the sample, the header is not displayed correctly,
but the body has no problem.


I can reproduce both in mail and browser.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Looks like this is a converter problem.
You can reproduce it by viewing the attachment of this bug in browser.
Reassign to cata and cc to ftang.
Assignee: nhotta → cata
Status: NEW → ASSIGNED
Target Milestone: --- → M18
If you have read the story part in my comment, you have seen the incorrect UTF-8 
data was generated by NS 4.72.
Indeed I have just found out that NS 4.72 incorrectly encodes the character "à" 
as C3 20 instead of C3 A0. (C3 20 is not just wrong, it's a forbidden sequence 
in UTF-8).
So in an environment where both NS 4.72 and NS 6/Mozilla are used, such invalid 
UTF-8 data will be encountered quite often.
I have reported this bug on the newsgroup netscape.communicator, but Mozilla 
will probably have to accept (or at least behave nicely with) those invalid 
characters.
This looks similar to #25037. Maybe they are the same or depend on each other.
Can somebody take a look?
No, this bug is not the same as Bug 25037. That bug was a temporary 
regression and was fixed. This one still occurs and is attributable to a different cause.
Keywords: nsbeta3, polish
Reassinging to myself.
Assignee: cata → ftang
Status: ASSIGNED → NEW
mark as assign
Status: NEW → ASSIGNED
nhotta- pleaes take a look at it since the problem is in MailNews
the converter should stop and report error when it hit illegal UTF8. The caller 
should do error handling nicer. In the brwoser, what I do is *eat* one byte , 
reset the decoder and then continue. 
Assignee: ftang → nhotta
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
Keywords: mailtrack
nsbeta3- per i18n bug meeting. This will only impact UTF8 mail, right ? how many 
people send illegal UTF-8 mail ?
Whiteboard: nsbeta3-
This affects more than just mail, this affects all UTF-8 content.  This has 
potential security implications, as it may be possible to get malicious content 
past input validators.
Whiteboard: nsbeta3-
F. Tang, please consider my 2000-05-02 06:54 comment.
Every user of NS 4.xx will frequently send invalid data when trying to use UTF-8 
in mail or news, because a OxA0 in the sequence will get replaced by 0x20. This 
also happens with a greek character.
The browser behaviour (eat one byte, reset) sounds adequate.
It shouldn't take very long to implement that in mail/news ?
Need more information.  Ftang to investigate.
About that NS 4.xx bug generating invalid utf-8 : The greek character concerned 
is capital pi (CE A0 in utf-8), as also the unicode character 01A0 (C6 A0).
I suspect NS 4.xx destroys utf-8 characters every time they end with A0.
It's still happening with NS 4.75.
I filed a Browser bug, Bug 50049, which seems to be due to the same weakness in our
code. The point of the other bug is that this problem manifests in more ways than just in
mis-created mail messages and that we should reallt fix this problem for beta3. 
jgmyers also mentions another possible way in which the weakness can be exploited
and the test cases above.
Marking nsbeta3+ per I18N Bug Triage. Fix this is Browser and Mail.
Whiteboard: [nsbeta3+]
So fix for mail and browser? Then we better change UTF-8 decoder.
we may need to change both the decoder and the caller. 
The following fix make the Browser do not display as blank. However, I think 
some of the caller still need to do error handling .
also in http://warp/u/ftang/tmp/fix35004.txt

Index: nsUTF8ToUnicode.cpp
===================================================================
RCS file: /m/pub/mozilla/intl/uconv/ucvlatin/nsUTF8ToUnicode.cpp,v
retrieving revision 1.7
diff -u -r1.7 nsUTF8ToUnicode.cpp
--- nsUTF8ToUnicode.cpp 2000/04/25 13:50:41     1.7
+++ nsUTF8ToUnicode.cpp 2000/08/26 07:53:30
@@ -85,7 +85,7 @@
 
    nsresult res;       // conversion result
 
-   for(in=aSrc,out=aDest,res=nsnull;((in < inend) && (out < outend)); in++)
+   for(in=aSrc,out=aDest,res=NS_OK;((in < inend) && (out < outend)); in++)
    {
       if(0 == mState) {
          if( 0 == (0x80 & (*in))) {
@@ -138,7 +138,7 @@
                         if(0 == --mState)
              {
                  if(mUcs4 >= 0x00010000) {
-                    if(mUcs4 >= 0x001F0000) {
+                    if(mUcs4 >= 0x00110000) {
                       *out++ = 0xFFFD;
                     } else {
                       mUcs4 -= 0x00010000;
@@ -168,11 +168,13 @@
    }
 
    //output not finished, output buffer too short
-   if ((in < inend) && (out >= outend)) res = NS_OK_UDEC_MOREOUTPUT;
+   if((NS_OK == res) && (in < inend) && (out >= outend)) 
+       res = NS_OK_UDEC_MOREOUTPUT;
 
    //last USC4 is incomplete, make sure the caller 
    //returns with properly aligned continuation of the buffer
-   if (mState != 0) res = NS_OK_UDEC_MOREINPUT;
+   if ((NS_OK == res) && (mState != 0))
+       res = NS_OK_UDEC_MOREINPUT;
 
    *aSrcLength = in - aSrc;
    *aDestLength  = out - aDest;

the above patch also partial fix 41425.
can some one give me a full list of problem area. Then we can test it. My patch 
fix the browser show blank issue. It won't treat the 0x20 as 0xA0. It will 
CONTINUE. What other places have this issue. can this fix it ? If not, we should  
know why, maybe we need to add code to the caller also.
Whiteboard: [nsbeta3+] → [nsbeta3+]Possible patch in hand. Need code review and verify with other cases
With my latest patch, I can see 
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
at least. The showing may not be as the expect result as the author wrote.
Is bug 50459 a duplicate of this one?
You did the analyze, R.K.Aa 2 : bug 50459 is not fully a duplicate, but it shows 
that if ever something is interpreted in utf-8 when it shouldn't be, this bug 
increases the problem.

I consider the behaviour of not interprating 0x20 as 0xA0 as adequate, users of 
NS 4.x who need utf-8 and adequate i18n support, really should update to version 
6 anyway. By continuing, the content of the mail/web page should be 
interpretable, even if not fully correct.
The mail view of the test case does not stop and continues conversion after 
Frank's patch, reassign to him.
Assignee: nhotta → ftang
Status: ASSIGNED → NEW
I think the result is great.
I thought you were goind to eat the missing character, but replacing it by a 
question mark is better.
Just one room for enhancement : There should be a visual difference between an 
invalid character and a character that can't be displayed because the font is 
missing.
Was the last resort font mecanism finally incorporated inside Mozilla ?
(this was discussed in the thread named Request for review on fix of #35910 - 
PostScript printing error, on netscape.public.mozilla.i18n in april).
Status: NEW → ASSIGNED
Whiteboard: [nsbeta3+]Possible patch in hand. Need code review and verify with other cases → [nsbeta3+]patch reviewed, wait for tree open to check in.
fix and check in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Per the testcase URL, the converter still eats one character following an
incomplete UTF-8 sequence.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch Proposed fixSplinter Review
check in the new fix john proposed
Status: REOPENED → ASSIGNED
fix check in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago24 years ago
Resolution: --- → FIXED
QA contact to marina.
Please use the test case by jean-Marc
and see if the problem occurs any more
QA Contact: momoi → marina
I tried with build 2000091021.
My test case displays as expected. The "à" character becomes "? ".
It doesn't look very great for this test sample, but it's probably the standard
way  it has to be done.

I also watched the test case of Marcus Kuhn.
It works quite good exept that it accepts invalid characters representation of
for example the character "/". This _might_ mean a security problem if some day
this decoder is used for something else than displaying ("decoding an utf-8
sequence before doing file access").

The last resort font mecanism is not implemented in Mozilla as I see.

Invalid UTF-8 should be diplayed in a way that's different from the way a
character that can not be found in the current font is displayed.
The "accepting invalid overlong sequences" problem is bug 50702.
And what about opening a bug about the fact "invalid" and "not found in
available fonts" characters are displayed the same ?

It would link to "last resort font" references for suggestions about the way the
characters that are not found in the available fonts are displayed.
i see the correct display in 2000-09-20 build, verifying as fixed
Status: RESOLVED → VERIFIED
Product: MailNews → Core
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: