Closed Bug 128896 Opened 23 years ago Closed 22 years ago

UTF8 decoder should recover from errors (CSS not applied)

Categories

(Core :: CSS Parsing and Computation, defect, P2)

x86
All
defect

Tracking


VERIFIED FIXED
mozilla1.0

People

(Reporter: ezh, Assigned: smontagu)


Details

(Keywords: testcase, topembed, Whiteboard: [adt2])

Attachments

(4 files, 3 obsolete files)

1. Load this page.
2. CSS is not applied in Moz, but works fine in Opera, IE.

Tested on 2 PCs.

moz 2002022808
The page *is* sending the stylesheet with the correct content-type.
Does it work with an @charset rule in the CSS file?  I'm guessing we're using
the document encoding for the CSS file, which is UTF-8.
Seeing this as well on Linux 2002030223 trunk build

The stylesheet does have 2 errors in it:
http://jigsaw.w3.org/css-validator/validator?uri=http://www.eyp.ee/styles.css&warning=1&profile=css2

However, they are not what's causing this problem.
Apparently this line is what's causing it:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Removing this line from the HTML fixes it.
OS: Windows 2000 → All
bz: So what was the reason we want to use the document charset as the fallback?
Specifying a charset with @charset or in the link element gets it to work, as 
does removing the Content-Type meta-header as Alexey noted. It does seem that 
the document charset is taken as the default for the stylesheet.

The CSS file contains some characters with umlauts (in comments). It seems the 
UTF-8 decoder chokes on these and the whole stylesheet is discarded.
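To make this concrete, here is a minimal standalone sketch (not Mozilla code; the comment text is made up) of why a single Latin-1 umlaut byte makes a strict UTF-8 decode fail outright:

// Standalone sketch (not Mozilla code): why one ISO-8859-1 umlaut kills a
// strict UTF-8 decode. In Latin-1, 'a-umlaut' is the single byte 0xE4; to a
// UTF-8 decoder, 0xE4 is the lead byte of a three-byte sequence, so the
// plain ASCII that follows it is an invalid continuation.
#include <cstdio>
#include <cstdint>
#include <cstddef>

// Returns the index of the first malformed byte, or -1 if the input is valid.
static long FirstUtf8Error(const uint8_t* buf, size_t len) {
  size_t i = 0;
  while (i < len) {
    uint8_t b = buf[i];
    size_t trail;
    if (b < 0x80)                trail = 0;  // ASCII
    else if ((b & 0xE0) == 0xC0) trail = 1;  // 2-byte lead
    else if ((b & 0xF0) == 0xE0) trail = 2;  // 3-byte lead
    else if ((b & 0xF8) == 0xF0) trail = 3;  // 4-byte lead
    else return (long)i;                     // stray continuation byte
    for (size_t j = 1; j <= trail; ++j)
      if (i + j >= len || (buf[i + j] & 0xC0) != 0x80)
        return (long)i;                      // missing/bad continuation
    i += trail + 1;
  }
  return -1;
}

int main() {
  // A made-up CSS comment, "/* märkus */", as ISO-8859-1 bytes.
  const uint8_t latin1[] = { '/', '*', ' ', 'm', 0xE4, 'r', 'k', 'u', 's',
                             ' ', '*', '/' };
  // Prints 4: the 0xE4. A decoder that treats this as fatal discards
  // everything -- here, the entire stylesheet.
  printf("first invalid byte at index %ld\n",
         FirstUtf8Error(latin1, sizeof(latin1)));
  return 0;
}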
Attached file testcase
Keywords: testcase
We do it for a few reasons:

1)  Lots of Japanese pages are UTF-8 with the sheet also encoded as UTF-8 and
    not specifying any @charset
2)  The CSS spec says nothing about fallbacks for encoding at this point except
    vague handwaving at the XML spec and saying something about "Mechanisms of
    the language of the referencing document (e.g., in HTML, the "charset"
    attribute of the LINK element). "
3)  It makes sense that any "internal" linked resources that do not specify
    otherwise would be in the same encoding as the document.  We certainly
    assume that for javascript...

Marking this a duplicate of our bug about ignoring decoding errors like this when possible.

*** This bug has been marked as a duplicate of 107790 ***
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → DUPLICATE
reopening and marking nsbeta1+ per triage meeting, where bug 107790 was marked
nsbeta1-, but we thought this specific issue should be fixed.

Attachment 71802 [details] [diff] fixes this problem.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
->smontagu, since he seems to have a fix (I think)
Assignee: dbaron → smontagu
Status: REOPENED → NEW
really marking nsbeta1+
Keywords: nsbeta1+
nhotta, can you review the patch? Would there be a better way to do this?
See also www.opera.com
There is actually an evangelism bug on opera.com -- it was felt that this was a
big enough site to be worth evangelizing, since technically what they're doing
(serving up a style sheet with no charset info) leads to undefined behavior.
Simon, I thought the caller would be changed instead of the converter. Could you
ask Frank to review? I am not familiar with the converter code. For UTF-8 error
handling, you need to make sure the security issue is taken care of. Is it okay
to just skip invalid bytes and proceed?
On the security issue, my patch continues to treat sequences that are decodable
but not minimum length as fatal errors, so I think we are safe here.
Blocks: 106843
Status: NEW → ASSIGNED
Summary: CSS not applied → UTF8 decoder should recover from errors (CSS not applied)
Comment on attachment 72722 [details] [diff] [review]
The patch from bug 107790 (diff -uw)

This patch certainly seems right to me (and the re-indentation is badly
needed!).  It might be better to call the constant NS_UTF8_ERROR_STATE, since
NS_UTF8_ERROR sounds somewhat like an nsresult value.  I wonder if you want to
fix the one other case (the non-shortest form case) to do the same thing (put
in 0xFFFD and continue), which would allow you to remove the |res| variable
(and switch the order of the statements at the end and just use |return|).  I
also wonder if the NS_OK_UDEC_MOREOUTPUT test should change from (mState != 0)
to (mState != 0 && mState != NS_UTF8_ERROR_STATE).
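For illustration, a hedged sketch of what that decoder-side behavior could look like -- this is NOT the actual attachment 72722 diff, the class below is hypothetical, it is simplified to the BMP, and it omits the overlong-form check, which the real patch keeps fatal (see the security discussion below):

// Sketch of decoder-side recovery: on a malformed sequence, emit U+FFFD,
// enter a named error state, and resynchronize at the next plausible lead
// byte instead of failing the whole conversion.
#include <vector>
#include <cstdint>
#include <cstddef>

// Named _STATE so it cannot be misread as an nsresult value.
static const int NS_UTF8_ERROR_STATE = -1;

class ReplacingUtf8Decoder {
  int mState = 0;      // continuation bytes still expected (0 = none)
  uint32_t mUcs4 = 0;  // code point being assembled
public:
  void Convert(const uint8_t* src, size_t len, std::vector<char16_t>& out) {
    for (size_t i = 0; i < len; ++i) {
      uint8_t b = src[i];
      if (mState == 0 || mState == NS_UTF8_ERROR_STATE) {
        if (b < 0x80)                { out.push_back(b); mState = 0; }
        else if ((b & 0xE0) == 0xC0) { mUcs4 = b & 0x1F; mState = 1; }
        else if ((b & 0xF0) == 0xE0) { mUcs4 = b & 0x0F; mState = 2; }
        else if (mState != NS_UTF8_ERROR_STATE) {
          out.push_back(0xFFFD);        // substitute instead of failing...
          mState = NS_UTF8_ERROR_STATE; // ...and hunt for the next lead byte
        }                               // (already in error state: skip byte)
      } else if ((b & 0xC0) == 0x80) {  // expected continuation byte
        mUcs4 = (mUcs4 << 6) | (b & 0x3F);
        if (--mState == 0) out.push_back((char16_t)mUcs4);
      } else {
        out.push_back(0xFFFD);
        mState = NS_UTF8_ERROR_STATE;
        --i;  // reprocess as a possible lead (unsigned wrap at 0 is benign:
              // the loop's ++i brings it back to index 0)
      }
    }
  }
  bool NeedsMoreOutput() const {
    // the other suggested tweak: a pending error state is not a buffered
    // partial character
    return mState != 0 && mState != NS_UTF8_ERROR_STATE;
  }
};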
I'm not sure what the security issue in comment 16 and comment 17 is, but:
 * If it's a buffer overrun issue, I don't see how changing what I suggested in
comment 18 would create any problems.
 * If it's an issue where some caller depends on a 1-to-1 mapping between UTF8
and the decoded UCS2, then there might be a need to report the error in some way
in all three cases, not just the one.
Quoting from http://www.ietf.org/rfc/rfc2279.txt:

6.  Security Considerations

   Implementors of UTF-8 need to consider the security aspects of how
   they handle illegal UTF-8 sequences.  It is conceivable that in some
   circumstances an attacker would be able to exploit an incautious
   UTF-8 parser by sending it an octet sequence that is not permitted by
   the UTF-8 syntax.

   A particularly subtle form of this attack could be carried out
   against a parser which performs security-critical validity checks
   against the UTF-8 encoded form of its input, but interprets certain
   illegal octet sequences as characters.  For example, a parser might
   prohibit the NUL character when encoded as the single-octet sequence
   00, but allow the illegal two-octet sequence C0 80 and interpret it
   as a NUL character.  Another example might be a parser which
   prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
   illegal octet sequence 2F C0 AE 2E 2F.

I don't see any impediment here to what dbaron suggests in comment 18.
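A standalone illustration (not Mozilla code; the helper is hypothetical) of the overlong-form check this implies. For two-byte sequences, lead bytes C0 and C1 can only encode values below U+0080, which always fit in one byte, so any sequence they start is overlong and must stay a fatal error:

#include <cstdio>
#include <cstdint>

// A two-byte lead of C0 or C1 can only produce an overlong encoding.
static bool IsOverlongTwoByteLead(uint8_t lead) {
  return lead == 0xC0 || lead == 0xC1;
}

int main() {
  // C0 80 would lax-decode to NUL; 2F C0 AE 2E 2F would lax-decode to "/../".
  const uint8_t nulAttack[]  = { 0xC0, 0x80 };
  const uint8_t pathAttack[] = { 0x2F, 0xC0, 0xAE, 0x2E, 0x2F };
  printf("C0 80:          %s\n",
         IsOverlongTwoByteLead(nulAttack[0]) ? "rejected" : "accepted");
  printf("2F C0 AE 2E 2F: %s\n",
         IsOverlongTwoByteLead(pathAttack[1]) ? "rejected" : "accepted");
  return 0;
}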
1. The real problem is that nsCSSLoader does not tell the decoder to recover from
the error. The decoder already has a method to recover from an error: the Reset()
method.
2. The real problem code is in SheetLoadData::OnStreamComplete:

content/html/style/src/nsCSSLoader.cpp

if (decoder) {
  PRInt32 unicodeLength = 0;
  if (NS_SUCCEEDED(decoder->GetMaxLength(aString, aStringLen, &unicodeLength))) {
    PRUnichar *unicodeString = nsnull;
    strUnicodeBuffer = new nsString;
    if (nsnull == strUnicodeBuffer) {
      result = NS_ERROR_OUT_OF_MEMORY;
    } else {
      // make space for the decoding
      strUnicodeBuffer->SetCapacity(unicodeLength);
      unicodeString = (PRUnichar *) strUnicodeBuffer->get();
      result = decoder->Convert(aString, (PRInt32 *) &aStringLen,
                                unicodeString, &unicodeLength);
      if (NS_SUCCEEDED(result)) {
        strUnicodeBuffer->SetLength(unicodeLength);
      } else {
#ifdef DEBUG
        /*
         * We currently fail often because of XML documents
         * which are in UTF-8 importing stylesheets which
         * use an ISO-8859-1 char, especially in comments in
         * the sheet.  See
         * http://bugzilla.mozilla.org/show_bug.cgi?id=106843
         *
         * This assertion should at least make it possible
         * to catch such errors in chrome, pending this bug
         * being fixed
         */

        nsCAutoString uriStr;
        mURL->GetSpec(uriStr);
        nsCAutoString errorMessage;
        errorMessage = NS_LITERAL_CSTRING("Decoding sheet from ") +
                       uriStr +
                       NS_LITERAL_CSTRING(" failed");
        NS_ASSERTION(PR_FALSE, errorMessage.get());
#endif // DEBUG
        strUnicodeBuffer->SetLength(0);
      }
    }
  }
  NS_RELEASE(decoder);
}

The way it calls the converter is WRONG; when the Convert function fails, it
should do what the parser does in nsScanner::Append:

htmlparser/src/nsScanner.cpp

do {
  PRInt32 srcLength = aLen;
  res = mUnicodeDecoder->Convert(aBuffer, &srcLength, unichars,
                                 &unicharLength);

  totalChars += unicharLength;
  // Continuation of failure case
  if (NS_FAILED(res)) {
    // if we failed, we consume one byte, replace it with U+FFFD
    // and try the conversion again.
    unichars[unicharLength++] = (PRUnichar)0xFFFD;
    unichars = unichars + unicharLength;
    unicharLength = unicharBufLen - (++totalChars);

    mUnicodeDecoder->Reset();

    if (((PRUint32) (srcLength + 1)) > aLen) {
      srcLength = aLen;
    }
    else {
      srcLength++;
    }

    aBuffer += srcLength;
    aLen -= srcLength;
  }
} while (NS_FAILED(res) && (aLen > 0));


Please do NOT put your fix in the UTF8 decoder only, but fix nsCSSLoader.cpp;
otherwise other charsets (say Shift_JIS) will still fail even after you land your
current patch.

>UTF8 decoder should recover from errors
It WILL, if the caller calls the Reset() method.
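A hedged sketch of that caller-side recovery loop (the Decoder interface below stands in for nsIUnicodeDecoder and is an assumption of this sketch, not the real XPCOM signature):

#include <string>
#include <cstdint>
#include <cstddef>

struct Decoder {
  // Convert up to *srcLen bytes into up to *dstLen char16_t units; on a
  // malformed sequence return false, with *srcLen / *dstLen set to what was
  // consumed / produced before the error.
  virtual bool Convert(const uint8_t* src, size_t* srcLen,
                       char16_t* dst, size_t* dstLen) = 0;
  virtual void Reset() = 0;  // clear internal decoder state after an error
  virtual ~Decoder() {}
};

// Decode a whole sheet, recovering from errors the way the parser does.
// Assumes Convert always makes progress (consumes bytes or fills output).
std::u16string DecodeSheet(Decoder& decoder, const uint8_t* buf, size_t len) {
  std::u16string out;
  char16_t chunk[256];
  while (len > 0) {
    size_t srcLen = len;
    size_t dstLen = sizeof(chunk) / sizeof(chunk[0]);
    bool ok = decoder.Convert(buf, &srcLen, chunk, &dstLen);
    out.append(chunk, dstLen);
    if (!ok) {
      out.push_back(0xFFFD);       // replace the bad byte with U+FFFD
      decoder.Reset();             // and let the decoder start fresh
      if (srcLen < len) ++srcLen;  // skip past the offending byte
    }
    buf += srcLen;
    len -= srcLen;
  }
  return out;
}

Because the loop lives in the caller, a decoder for any charset (Shift_JIS included) gets the same recovery, which is the point being made above.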



Priority: -- → P2
Whiteboard: adt2
Target Milestone: --- → mozilla1.0
Isn't that going to cause severe problems with error handling in encodings like
UTF-16, where consuming one byte can totally change the meaning of the entire
sequence?  Or does UTF-16 have protections against a 1-byte shift?
The point is that duplicating the logic from comment 121 in _every_ place that 
calls converters makes less sense than moving it into the converters... But 
that's bug 107790, I guess.
>The point is that duplicating the logic from comment 121 in _every_
> place that calls converters makes less sense than moving it into 
>the converters... But that's bug 107790, I guess.
1. Either fix needs duplicated code, whether duplicated in the caller or in
the ~80 unicode converters, and
2. Only the caller can decide the logic for how to recover from the error. In the
case of HTML and CSS the behavior is probably the same, but in the case of
MIME-2 decoding or other uses, the behavior will be different.

>Isn't that going to cause severe problems with error handling 
>in encodings like UTF-16, where consuming one byte can totally
> change the meaning of the entire sequence?  
You will have this problem regardless of where you recover from the error, because
the error could be ONE extra byte in the UTF-16 or TWO extra bytes, and you cannot
tell which is the case from either the caller or the converter. The so-called
severe problem is introduced when the extra byte(s) are put into the UTF-16
stream, not in any of our code.


Boris Zbarsky (out of town till mid-April):
There are two reasons we don't want to change the converter but the caller:
1. it means we would need to change at least 5 to 10 different unicode decoders, and
2. it would break the MIME-2 decoder, which currently depends on this behavior.
Attached patch Patch on caller
I need a new testcase. The referenced URL doesn't seem to encounter any
decoding errors any more, nor does www.opera.com.
Attachment #72721 - Attachment is obsolete: true
Attachment #72722 - Attachment is obsolete: true
Just revert the patch to bug 106674 and see whether mailnews is all broken or 
not.  :)
Attachment #76426 - Attachment mime type: text/html → text/plain
Attachment #76426 - Attachment is obsolete: true
Attachment #76426 - Attachment mime type: text/plain → text/html
BTW, reverting bug 106674 doesn't test this patch, because the decoder is called
from nsScanner::Append.
Impact platform: ALL
Impacted language users: 560M, 100% of internet users
Probability of hitting the problem: HIGH, we have seen major web sites with this problem
Severity in the worst case: part of the stylesheet won't be applied to the page,
and major display problems may happen
Way to recover after hitting the problem: ask the webmaster to label the style sheet
with @charset
Risk of the fix: LOW, we have similar code in the parser
Potential benefit of fixing this problem: a lot of stylesheet issues resolved
A real-world testcase: go to http://www.toyota.com/ and notice that the top
banner is displayed properly.  Then go to either http://www.toyota.com/matrix/
or http://www.toyota.com/camry/ , each of which has a UTF-8 META (other models
may have this as well).  The banner on the model pages is broken up, and in
general no CSS at all is applied on the entire page.  Their CSS validates and is
served as text/css, although their HTML does not validate.  Also, their popdown
menus are sniffing badly, but that's an evang issue.  I've confirmed that saving
a local copy of the model pages and deleting the character-encoding line results
in the CSS being applied to the document; restoring the line causes the CSS to
drop away.
Keywords: topembed
Comment on attachment 76351 [details] [diff] [review]
Patch on caller

r=ftang
Attachment #76351 - Flags: review+
Please ask for sr=.
http://www.toyota.com/matrix/ and http://www.toyota.com/camry/ display correctly
with my patch.
Breaks on top sites.
Whiteboard: adt2 → [adt2]
https://www.radiolinja.fi/ - same thing?
Comment on attachment 76351 [details] [diff] [review]
Patch on caller

sr=attinasi
Attachment #76351 - Flags: superreview+
>https://www.radiolinja.fi/ - same thing?

It doesn't seem to be. The source page isn't UTF-8 and the style sheet seems to
be all ASCII anyway. What problem(s) are you seeing there?
For https://www.radiolinja.fi/ I added bug 136204.
Please approve the patch for this bug. This is a high-impact, mid-risk bug;
similar code has already been used in the HTML parser for many months.
Keywords: adt1.0.0
adt1.0.0+ (on ADT's behalf) for approval to check in to 1.0. Please check this in today.
Keywords: adt1.0.0 → adt1.0.0+
Comment on attachment 76351 [details] [diff] [review]
Patch on caller

a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #76351 - Flags: approval+
Fix checked in
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 22 years ago
Resolution: --- → FIXED
View these websites:

http://www.geocities.com/i18nguy/unicode-example.html
http://www.columbia.edu/kermit/utf8.html

In Red Hat Linux 7.3 buildID: 2002-04-11-06trunk, I see all the international
language characters displayed correctly.

Whereas in Win2000 buildID: 2002-04-10-06trunk, for many of these characters I
see "?????" displayed instead of the respective language character.

Any comments?


*** Bug 136284 has been marked as a duplicate of this bug. ***
Verified fixed on trunk on the following platforms/builds:

win2000 build id: 2002-04-15-10trunk
redhat linux 7.3 build id: 2002-04-11-10trunk

Status: RESOLVED → VERIFIED
*** Bug 146514 has been marked as a duplicate of this bug. ***
Madhur, you're simply missing fonts that contain these characters on Windows.
I recommend installing "Arial Unicode MS", which contains the entire Unicode 2.1 set.
It comes with MS Office 2000 and XP; you have to explicitly install it under:
Office Shared Features -> International Support -> Universal Font
Checked in a reftest for this at Håkan's suggestion.
Flags: in-testsuite+