Closed
Bug 35004
Opened 25 years ago
Closed 24 years ago
UTF-8 converter stops conversion after encountering invalid characters
Categories
(MailNews Core :: Internationalization, defect, P3)
Tracking
(Not tracked)
VERIFIED
FIXED
M18
People
(Reporter: jmdesp, Assigned: ftang)
References
()
Details
(Keywords: polish, Whiteboard: [nsbeta3+]patch reviewed, wait for tree open to check in.)
Attachments
(3 files)
4.89 KB,
text/plain
|
Details | |
140.78 KB,
image/jpeg
|
Details | |
564 bytes,
patch
|
Details | Diff | Splinter Review |
Mozilla stops displaying after reading invalid utf-8 bit characters.
Example String is "Subject: Re: [utf_fr] Re: [FAQ] Comment lire et écrire"
The story : I received an email in utf-8 from an OE user that had illegally some
UTF8 characters in headers. (it probably has been converted from QP to 8 bit by
an auto-converter)
This one displays correctly.
I've read it with NS 4.72, and answered back, resulting in invalid UTF-8
characters in the header as well as in the body of the message.
As this message is in a mailing list, I got it back home and I just can't read
the resulting mail in NS 6 PR1.
The message windows stops displaying when it encourter the first invalid
character both in the body and header.
I saved the mail and opened the file in browser.
In ISO-8859-1 the whole content is displayed.
In UTF-8, the display stops after first invalid character.
I include the file in wich I saved the mail as an attachment.
When displaying it, switching to UTF-8 encoding should show the bug.
Reporter | ||
Comment 1•25 years ago
|
||
Reporter | ||
Comment 2•25 years ago
|
||
Thanks to the string I included earlier,the bug page too will show the problem
if you switch to utf-8.
Disabled auto-detection of charset during the display of the test page if you
want to see the problem. I had it set and it would try to display my page as
japanese, even when telling it's utf-8.
By the way, under NS 4.72 in the sample, the header is not displayed correctly,
but the body has no problem.
Comment 3•25 years ago
|
||
I can reproduce both in mail and browser.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 4•25 years ago
|
||
Looks like this is a converter problem.
You can reproduce it by viewing the attachment of this bug in browser.
Reassign to cata and cc to ftang.
Assignee: nhotta → cata
Reporter | ||
Comment 5•25 years ago
|
||
If you have read the story part in my comment, you have seen the incorrect UTF-8
data was generated by NS 4.72.
Indeed I have just found out that NS 4.72 incorrectly encodes the character "à"
as C3 20 instead of C3 A0. (C3 20 is not just wrong, it's a forbidden sequence
in UTF-8).
So in an environment where both NS 4.72 and NS 6/Mozilla are used, such invalid
UTF-8 data will be encountered quite often.
I have reported this bug on the newsgroup netscape.communicator, but Mozilla
will probably have to accept (or at least behave nicely with) those invalid
characters.
Comment 6•25 years ago
|
||
This looks similar to #25037. Maybe they are the same or depend on each other.
Can somebody take a look?
Comment 7•25 years ago
|
||
No, this bug is not the same as Bug 25037. That bug was a temporary
regression and was fixed. This one still occurs and is attributable to a different cause.
Assignee | ||
Updated•25 years ago
|
Assignee | ||
Comment 8•25 years ago
|
||
Reassinging to myself.
Assignee: cata → ftang
Status: ASSIGNED → NEW
Assignee | ||
Comment 10•24 years ago
|
||
nhotta- pleaes take a look at it since the problem is in MailNews
the converter should stop and report error when it hit illegal UTF8. The caller
should do error handling nicer. In the brwoser, what I do is *eat* one byte ,
reset the decoder and then continue.
Assignee: ftang → nhotta
Status: ASSIGNED → NEW
Comment 11•24 years ago
|
||
Adding a testcase.
Updated•24 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Comment 12•24 years ago
|
||
nsbeta3- per i18n bug meeting. This will only impact UTF8 mail, right ? how many
people send illegal UTF-8 mail ?
Whiteboard: nsbeta3-
Comment 13•24 years ago
|
||
This affects more than just mail, this affects all UTF-8 content. This has
potential security implications, as it may be possible to get malicious content
past input validators.
Whiteboard: nsbeta3-
Reporter | ||
Comment 14•24 years ago
|
||
F. Tang, please consider my 2000-05-02 06:54 comment.
Every user of NS 4.xx will frequently send invalid data when trying to use UTF-8
in mail or news, because a OxA0 in the sequence will get replaced by 0x20. This
also happens with a greek character.
The browser behaviour (eat one byte, reset) sounds adequate.
It shouldn't take very long to implement that in mail/news ?
Comment 15•24 years ago
|
||
Need more information. Ftang to investigate.
Reporter | ||
Comment 16•24 years ago
|
||
About that NS 4.xx bug generating invalid utf-8 : The greek character concerned
is capital pi (CE A0 in utf-8), as also the unicode character 01A0 (C6 A0).
I suspect NS 4.xx destroys utf-8 characters every time they end with A0.
It's still happening with NS 4.75.
Comment 17•24 years ago
|
||
I filed a Browser bug, Bug 50049, which seems to be due to the same weakness in our
code. The point of the other bug is that this problem manifests in more ways than just in
mis-created mail messages and that we should reallt fix this problem for beta3.
jgmyers also mentions another possible way in which the weakness can be exploited
and the test cases above.
Comment 18•24 years ago
|
||
Marking nsbeta3+ per I18N Bug Triage. Fix this is Browser and Mail.
Whiteboard: [nsbeta3+]
Comment 19•24 years ago
|
||
So fix for mail and browser? Then we better change UTF-8 decoder.
Assignee | ||
Comment 20•24 years ago
|
||
we may need to change both the decoder and the caller.
The following fix make the Browser do not display as blank. However, I think
some of the caller still need to do error handling .
also in http://warp/u/ftang/tmp/fix35004.txt
Index: nsUTF8ToUnicode.cpp
===================================================================
RCS file: /m/pub/mozilla/intl/uconv/ucvlatin/nsUTF8ToUnicode.cpp,v
retrieving revision 1.7
diff -u -r1.7 nsUTF8ToUnicode.cpp
--- nsUTF8ToUnicode.cpp 2000/04/25 13:50:41 1.7
+++ nsUTF8ToUnicode.cpp 2000/08/26 07:53:30
@@ -85,7 +85,7 @@
nsresult res; // conversion result
- for(in=aSrc,out=aDest,res=nsnull;((in < inend) && (out < outend)); in++)
+ for(in=aSrc,out=aDest,res=NS_OK;((in < inend) && (out < outend)); in++)
{
if(0 == mState) {
if( 0 == (0x80 & (*in))) {
@@ -138,7 +138,7 @@
if(0 == --mState)
{
if(mUcs4 >= 0x00010000) {
- if(mUcs4 >= 0x001F0000) {
+ if(mUcs4 >= 0x00110000) {
*out++ = 0xFFFD;
} else {
mUcs4 -= 0x00010000;
@@ -168,11 +168,13 @@
}
//output not finished, output buffer too short
- if ((in < inend) && (out >= outend)) res = NS_OK_UDEC_MOREOUTPUT;
+ if((NS_OK == res) && (in < inend) && (out >= outend))
+ res = NS_OK_UDEC_MOREOUTPUT;
//last USC4 is incomplete, make sure the caller
//returns with properly aligned continuation of the buffer
- if (mState != 0) res = NS_OK_UDEC_MOREINPUT;
+ if ((NS_OK == res) && (mState != 0))
+ res = NS_OK_UDEC_MOREINPUT;
*aSrcLength = in - aSrc;
*aDestLength = out - aDest;
the above patch also partial fix 41425.
can some one give me a full list of problem area. Then we can test it. My patch
fix the browser show blank issue. It won't treat the 0x20 as 0xA0. It will
CONTINUE. What other places have this issue. can this fix it ? If not, we should
know why, maybe we need to add code to the caller also.
Assignee | ||
Updated•24 years ago
|
Whiteboard: [nsbeta3+] → [nsbeta3+]Possible patch in hand. Need code review and verify with other cases
Assignee | ||
Comment 21•24 years ago
|
||
With my latest patch, I can see
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
at least. The showing may not be as the expect result as the author wrote.
Comment 22•24 years ago
|
||
Is bug 50459 a duplicate of this one?
Reporter | ||
Comment 23•24 years ago
|
||
You did the analyze, R.K.Aa 2 : bug 50459 is not fully a duplicate, but it shows
that if ever something is interpreted in utf-8 when it shouldn't be, this bug
increases the problem.
I consider the behaviour of not interprating 0x20 as 0xA0 as adequate, users of
NS 4.x who need utf-8 and adequate i18n support, really should update to version
6 anyway. By continuing, the content of the mail/web page should be
interpretable, even if not fully correct.
Comment 24•24 years ago
|
||
Comment 25•24 years ago
|
||
The mail view of the test case does not stop and continues conversion after
Frank's patch, reassign to him.
Assignee: nhotta → ftang
Status: ASSIGNED → NEW
Reporter | ||
Comment 26•24 years ago
|
||
I think the result is great.
I thought you were goind to eat the missing character, but replacing it by a
question mark is better.
Just one room for enhancement : There should be a visual difference between an
invalid character and a character that can't be displayed because the font is
missing.
Was the last resort font mecanism finally incorporated inside Mozilla ?
(this was discussed in the thread named Request for review on fix of #35910 -
PostScript printing error, on netscape.public.mozilla.i18n in april).
Assignee | ||
Updated•24 years ago
|
Status: NEW → ASSIGNED
Whiteboard: [nsbeta3+]Possible patch in hand. Need code review and verify with other cases → [nsbeta3+]patch reviewed, wait for tree open to check in.
Assignee | ||
Comment 27•24 years ago
|
||
fix and check in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Comment 28•24 years ago
|
||
Per the testcase URL, the converter still eats one character following an
incomplete UTF-8 sequence.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 29•24 years ago
|
||
Assignee | ||
Comment 31•24 years ago
|
||
fix check in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → FIXED
Comment 32•24 years ago
|
||
QA contact to marina.
Please use the test case by jean-Marc
and see if the problem occurs any more
QA Contact: momoi → marina
Reporter | ||
Comment 33•24 years ago
|
||
I tried with build 2000091021.
My test case displays as expected. The "à" character becomes "? ".
It doesn't look very great for this test sample, but it's probably the standard
way it has to be done.
I also watched the test case of Marcus Kuhn.
It works quite good exept that it accepts invalid characters representation of
for example the character "/". This _might_ mean a security problem if some day
this decoder is used for something else than displaying ("decoding an utf-8
sequence before doing file access").
The last resort font mecanism is not implemented in Mozilla as I see.
Invalid UTF-8 should be diplayed in a way that's different from the way a
character that can not be found in the current font is displayed.
Comment 34•24 years ago
|
||
The "accepting invalid overlong sequences" problem is bug 50702.
Reporter | ||
Comment 35•24 years ago
|
||
And what about opening a bug about the fact "invalid" and "not found in
available fonts" characters are displayed the same ?
It would link to "last resort font" references for suggestions about the way the
characters that are not found in the available fonts are displayed.
Comment 36•24 years ago
|
||
i see the correct display in 2000-09-20 build, verifying as fixed
Status: RESOLVED → VERIFIED
Updated•20 years ago
|
Product: MailNews → Core
Updated•17 years ago
|
Product: Core → MailNews Core
You need to log in
before you can comment on or make changes to this bug.
Description
•