Closed
Bug 128896
Opened 23 years ago
Closed 22 years ago
UTF8 decoder should recover from errors (CSS not applied)
Categories
(Core :: CSS Parsing and Computation, defect, P2)
Tracking
(VERIFIED FIXED, target milestone mozilla1.0)
People
(Reporter: ezh, Assigned: smontagu)
References
Details
(Keywords: testcase, topembed, Whiteboard: [adt2])
Attachments
(4 files, 3 obsolete files)
265 bytes, text/html
3.40 KB, patch (ftang: review+, attinasi: superreview+, asa: approval+)
132 bytes, text/css
466 bytes, text/html
1. Load this page.
2. CSS is not applied in Mozilla, but works fine in Opera and IE.
Tested on two PCs. Mozilla build 2002022808.
Comment 1•23 years ago
The page *is* sending the stylesheet with the correct content-type.
Does it work with an @charset rule in the CSS file? I'm guessing we're using the document encoding for the CSS file, which is UTF-8.
Comment 3•23 years ago
Seeing this as well on a Linux 2002030223 trunk build. The stylesheet does have 2 errors in it: http://jigsaw.w3.org/css-validator/validator?uri=http://www.eyp.ee/styles.css&warning=1&profile=css2 However, they are apparently not what's causing this problem; this line is: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> Removing this line from the HTML fixes it.
OS: Windows 2000 → All
bz: So what was the reason we want to use the document charset as the fallback?
Comment 5•23 years ago
Specifying a charset with @charset or in the link element gets it to work, as does removing the Content-Type meta-header as Alexey noted. It does seem that the document charset is taken as the default for the stylesheet. The CSS file contains some characters with umlauts (in comments). It seems the UTF-8 decoder chokes on these and the whole stylesheet is discarded.
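A minimal UTF-8 well-formedness check (an illustrative sketch, not Mozilla's actual decoder) shows why raw ISO-8859-1 umlaut bytes choke a strict UTF-8 decoder: in UTF-8, bytes at or above 0x80 must form multi-byte sequences with a specific lead-byte/continuation-byte pattern, so a lone Latin-1 0xF6 ('ö') or 0xE4 ('ä') is malformed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Returns true if `bytes` is well-formed UTF-8. Overlong forms,
// surrogates, and truncated sequences are rejected.
bool IsValidUtf8(const std::string& bytes) {
    size_t i = 0;
    while (i < bytes.size()) {
        uint8_t b = static_cast<uint8_t>(bytes[i]);
        size_t len;
        uint32_t cp, min;
        if (b < 0x80)                { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;                       // stray continuation or invalid lead byte
        if (i + len > bytes.size()) return false; // truncated sequence
        for (size_t j = 1; j < len; ++j) {
            uint8_t c = static_cast<uint8_t>(bytes[i + j]);
            if ((c & 0xC0) != 0x80) return false; // not a 10xxxxxx continuation
            cp = (cp << 6) | (c & 0x3F);
        }
        if (len > 1 && cp < min) return false;    // overlong encoding
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}
```

A raw Latin-1 'ä' (0xE4) followed by ASCII text is therefore rejected, while the proper two-byte UTF-8 form 0xC3 0xA4 passes.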
Comment 6•23 years ago
Comment 7•23 years ago
We do it for a few reasons:
1) Lots of Japanese pages are UTF-8 with the sheet also encoded as UTF-8 and not specifying any @charset.
2) The CSS spec says nothing about fallbacks for encoding at this point except vague handwaving at the XML spec and saying something about "Mechanisms of the language of the referencing document (e.g., in HTML, the "charset" attribute of the LINK element)."
3) It makes sense that any "internal" linked resources that do not specify otherwise would be in the same encoding as the document. We certainly assume that for javascript...
Marking duplicate of our bug to ignore decoding errors like this when possible. *** This bug has been marked as a duplicate of 107790 ***
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → DUPLICATE
Assignee
Comment 8•23 years ago
Reopening and marking nsbeta1+ per triage meeting, where bug 107790 was marked nsbeta1-, but we thought this specific issue should be fixed. Attachment 71802 [details] [diff] fixes this problem.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
->smontagu, since he seems to have a fix (I think)
Assignee: dbaron → smontagu
Status: REOPENED → NEW
Assignee
Comment 11•23 years ago
Assignee
Comment 12•23 years ago
Assignee
Comment 13•23 years ago
nhotta, can you review the patch? Would there be a better way to do this?
Reporter
Comment 14•23 years ago
See also www.opera.com
Comment 15•23 years ago
There is actually an evangelism bug on opera.com -- it was felt that this was a big enough site to be worth evangelizing, since technically what they're doing (serving up a style sheet with no charset info) leads to undefined behavior.
Comment 16•23 years ago
Simon, I thought the caller would be changed instead of the converter. Could you ask Frank to review? I am not familiar with the converter code. For UTF-8 error handling, you need to make sure the security issue is taken care of. Is it okay to just skip invalid bytes and proceed?
Assignee
Comment 17•23 years ago
On the security issue, my patch continues to treat sequences that are decodable but not minimum length as fatal errors, so I think we are safe here.
Assignee
Updated•22 years ago
Status: NEW → ASSIGNED
Summary: CSS not applied → UTF8 decoder should recover from errors (CSS not applied)
Comment 18•22 years ago
Comment on attachment 72722 [details] [diff] [review] The patch from bug 107790 (diff -uw)

This patch certainly seems right to me (and the re-indentation is badly needed!). It might be better to call the constant NS_UTF8_ERROR_STATE, since NS_UTF8_ERROR sounds somewhat like an nsresult value. I wonder if you want to fix the one other case (the non-shortest-form case) to do the same thing (put in 0xFFFD and continue), which would allow you to remove the |res| variable (and switch the order of the statements at the end and just use |return|). I also wonder if the NS_OK_UDEC_MOREOUTPUT test should change from (mState != 0) to (mState != 0 && mState != NS_UTF8_ERROR_STATE).
Comment 19•22 years ago
I'm not sure what the security issue in comment 16 and comment 17 is, but:
* If it's a buffer overrun issue, I don't see how changing what I suggested in comment 18 would create any problems.
* If it's an issue where some caller depends on a 1-to-1 mapping between UTF8 and the decoded UCS2, then there might be a need to report the error in some way in all three cases, not just the one.
Assignee
Comment 20•22 years ago
Quoting from http://www.ietf.org/rfc/rfc2279.txt:

6. Security Considerations

Implementors of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

A particularly subtle form of this attack could be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but allow the illegal two-octet sequence C0 80 and interpret it as a NUL character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F.

I don't see any impediment here to what dbaron suggests in comment 18.
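The overlong-NUL attack quoted above can be made concrete with a small sketch (illustrative only; DecodeTwoByte is a hypothetical helper, not Mozilla code). An incautious parser that skips the shortest-form check turns the illegal sequence C0 80 into U+0000, exactly the case RFC 2279 warns about.

```cpp
#include <cassert>
#include <cstdint>

// Decodes one 2-byte UTF-8 sequence. With rejectOverlong=false it behaves
// like the incautious parser described in RFC 2279's security section.
// Returns the code point, or -1 on malformed input.
int32_t DecodeTwoByte(uint8_t b1, uint8_t b2, bool rejectOverlong) {
    if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80) return -1;
    int32_t cp = ((b1 & 0x1F) << 6) | (b2 & 0x3F);
    if (rejectOverlong && cp < 0x80) return -1;  // overlong: fits in one byte
    return cp;
}
```

This is why the patch keeps treating decodable-but-not-minimum-length sequences as fatal (comment 17): rejecting overlong forms closes the hole while still allowing recovery from other errors.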
Comment 21•22 years ago
1. The real problem is that nsCSSLoader does not tell the decoder to recover from the error. The decoder has a method to recover from the error: the Reset() method.

2. The real problem code is in SheetLoadData::OnStreamComplete, content/html/style/src/nsCSSLoader.cpp:

    if (decoder) {
      PRInt32 unicodeLength = 0;
      if (NS_SUCCEEDED(decoder->GetMaxLength(aString, aStringLen, &unicodeLength))) {
        PRUnichar *unicodeString = nsnull;
        strUnicodeBuffer = new nsString;
        if (nsnull == strUnicodeBuffer) {
          result = NS_ERROR_OUT_OF_MEMORY;
        } else {
          // make space for the decoding
          strUnicodeBuffer->SetCapacity(unicodeLength);
          unicodeString = (PRUnichar *) strUnicodeBuffer->get();
          result = decoder->Convert(aString, (PRInt32 *) &aStringLen,
                                    unicodeString, &unicodeLength);
          if (NS_SUCCEEDED(result)) {
            strUnicodeBuffer->SetLength(unicodeLength);
          } else {
    #ifdef DEBUG
            /*
             * We currently fail often because of XML documents
             * which are in UTF-8 importing stylesheets which
             * use an ISO-8859-1 char, especially in comments in
             * the sheet. See http://bugzilla.mozilla.org/show_bug.cgi?id=106843
             *
             * This assertion should at least make it possible
             * to catch such errors in chrome, pending this bug
             * being fixed
             */
            nsCAutoString uriStr;
            mURL->GetSpec(uriStr);
            nsCAutoString errorMessage;
            errorMessage = NS_LITERAL_CSTRING("Decoding sheet from ") +
                           uriStr +
                           NS_LITERAL_CSTRING(" failed");
            NS_ASSERTION(PR_FALSE, errorMessage.get());
    #endif // DEBUG
            strUnicodeBuffer->SetLength(0);
          }
        }
      }
      NS_RELEASE(decoder);
    }

The way it calls the converter is WRONG. When the Convert function fails, it should do what the parser does in nsScanner::Append, htmlparser/src/nsScanner.cpp:

    do {
      PRInt32 srcLength = aLen;
      res = mUnicodeDecoder->Convert(aBuffer, &srcLength, unichars, &unicharLength);

      totalChars += unicharLength;
      // Continuation of failure case
      if (NS_FAILED(res)) {
        // if we failed, we consume one byte, replace it with U+FFFD
        // and try the conversion again.
        unichars[unicharLength++] = (PRUnichar)0xFFFD;
        unichars = unichars + unicharLength;
        unicharLength = unicharBufLen - (++totalChars);

        mUnicodeDecoder->Reset();

        if (((PRUint32)(srcLength + 1)) > aLen) {
          srcLength = aLen;
        } else {
          srcLength++;
        }

        aBuffer += srcLength;
        aLen -= srcLength;
      }
    } while (NS_FAILED(res) && (aLen > 0));

Please do NOT put your fix in the UTF8 decoder only, but fix nsCSSLoader.cpp; otherwise, other charsets (say Shift_JIS) will still fail even after you land your current patch.

>UTF8 decoder should recover from errors

It WILL, if the caller calls the Reset() method.
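The recovery pattern quoted from nsScanner::Append can be sketched in isolation (ToyAsciiDecoder and ConvertWithRecovery are hypothetical stand-ins, not the real nsIUnicodeDecoder API): when Convert fails, the caller emits U+FFFD for the offending byte, calls Reset(), skips one byte, and retries with the remainder.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for a unicode decoder: converts ASCII and stops with an
// error at the first byte >= 0x80. *consumed reports how many bytes it
// successfully converted before the error.
struct ToyAsciiDecoder {
    bool Convert(const uint8_t* src, size_t len,
                 std::vector<char16_t>* out, size_t* consumed) {
        for (size_t i = 0; i < len; ++i) {
            if (src[i] >= 0x80) { *consumed = i; return false; }
            out->push_back(src[i]);
        }
        *consumed = len;
        return true;
    }
    void Reset() {}  // a real decoder clears its internal state here
};

// Caller-side recovery loop, in the spirit of nsScanner::Append: on error,
// substitute U+FFFD, Reset() the decoder, skip the bad byte, and continue.
std::vector<char16_t> ConvertWithRecovery(const std::string& bytes) {
    ToyAsciiDecoder decoder;
    std::vector<char16_t> out;
    const uint8_t* src = reinterpret_cast<const uint8_t*>(bytes.data());
    size_t len = bytes.size();
    while (len > 0) {
        size_t consumed = 0;
        if (decoder.Convert(src, len, &out, &consumed)) break;
        out.push_back(0xFFFD);   // replacement character for the bad byte
        decoder.Reset();
        src += consumed + 1;     // consume the offending byte and continue
        len -= consumed + 1;
    }
    return out;
}
```

The same caller-side loop works for any decoder that reports how far it got, which is the point of comment 21: fixing nsCSSLoader this way covers Shift_JIS and the other ~80 converters too, not just UTF-8.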
Updated•22 years ago
Priority: -- → P2
Whiteboard: adt2
Target Milestone: --- → mozilla1.0
Isn't that going to cause severe problems with error handling in encodings like UTF-16, where consuming one byte can totally change the meaning of the entire sequence? Or does UTF-16 have protections against a 1-byte shift?
Comment 23•22 years ago
The point is that duplicating the logic from comment 21 in _every_ place that calls converters makes less sense than moving it into the converters... But that's bug 107790, I guess.
Comment 24•22 years ago
>The point is that duplicating the logic from comment 21 in _every_ place that calls converters makes less sense than moving it into the converters... But that's bug 107790, I guess.

1. Either fix needs duplicated code: either in the caller, or in the ~80 unicode converters.
2. Only the caller can decide the logic of how to recover from the error. In the case of HTML and CSS the behavior is probably the same, but in the case of MIME-2 decoding or other uses the behavior will be different.

>Isn't that going to cause severe problems with error handling in encodings like UTF-16, where consuming one byte can totally change the meaning of the entire sequence?

You will have this problem regardless of where you recover from the error, because the error could be ONE extra byte in the UTF-16 or TWO extra bytes, and you cannot tell which is the case from either the caller or the converter. The so-called severe problem is introduced when the extra byte(s) are put into the UTF-16, not by any of our code.
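The UTF-16 concern discussed above can be illustrated with a toy big-endian UTF-16 reader (a hypothetical sketch, not Mozilla code): dropping or inserting a single byte shifts every subsequent pair, so each following code unit is reinterpreted.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Reads UTF-16BE code units two bytes at a time starting at `offset`.
// Starting one byte off pairs each high byte with the wrong low byte.
std::vector<uint16_t> DecodeUtf16BE(const std::string& bytes, size_t offset) {
    std::vector<uint16_t> units;
    for (size_t i = offset; i + 2 <= bytes.size(); i += 2) {
        units.push_back((static_cast<uint8_t>(bytes[i]) << 8) |
                        static_cast<uint8_t>(bytes[i + 1]));
    }
    return units;
}
```

With the bytes 00 41 00 42 ("AB"), starting at offset 0 yields U+0041 U+0042, while starting at offset 1 yields the single unrelated code unit 0x4100 -- which is why a one-byte skip is a questionable recovery strategy for UTF-16, wherever it is implemented.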
Comment 25•22 years ago
Boris Zbarsky (out of town till mid-April): There are two reasons we don't want to change the converter but the caller: 1. it means we need to change at least 5 to 10 different unicode decoders; 2. it will break the MIME-2 decoder, which currently depends on this.
Assignee
Comment 26•22 years ago
I need a new testcase. The referenced URL doesn't seem to encounter any decoding errors anymore, and neither does www.opera.com.
Attachment #72721 -
Attachment is obsolete: true
Attachment #72722 -
Attachment is obsolete: true
Comment 27•22 years ago
Just revert the patch to bug 106674 and see whether mailnews is all broken or not. :)
Assignee
Comment 28•22 years ago
Assignee
Comment 29•22 years ago
Assignee
Updated•22 years ago
Attachment #76426 -
Attachment mime type: text/html → text/plain
Assignee
Updated•22 years ago
Attachment #76426 -
Attachment is obsolete: true
Attachment #76426 -
Attachment mime type: text/plain → text/html
Assignee
Comment 30•22 years ago
Assignee
Comment 31•22 years ago
BTW, reverting bug 106674 doesn't test this patch, because the decoder is called from nsScanner::Append.
Comment 32•22 years ago
Impact platform: ALL
Impacted language users: 560M, 100% of internet users
Probability of hitting the problem: HIGH; we saw major web sites with this problem
Severity in the worst case: part of the stylesheet won't be applied to the page; major display problems may happen
Way to recover after hitting the problem: ask the webmaster to label the style sheet with @charset
Risk of the fix: LOW; we have similar code in the parser
Potential benefit of fixing this problem: fixes a lot of stylesheet issues
Comment 33•22 years ago
A real-world testcase: go to http://www.toyota.com/ and notice that the top banner is displayed properly. Then go to either http://www.toyota.com/matrix/ or http://www.toyota.com/camry/ , either of which have UTF-8 METAs (other models may have this as well). The banner on the model pages is broken up, and in general no CSS at all is applied on the entire page. Their CSS validates and is served text/css, although their HTML does not validate. Also, their popdown menus are sniffing badly, but that's an evang issue. I've confirmed that saving a local copy of the model pages and deleting the character-encoding line results in the CSS being applied to the document; restoring the line causes the CSS to drop away.
Comment 34•22 years ago
Added topembed and cc'd momoi.
Comment 35•22 years ago
Comment on attachment 76351 [details] [diff] [review] Patch on caller r=ftang
Attachment #76351 -
Flags: review+
Comment 36•22 years ago
Please ask for sr=.
Assignee
Comment 37•22 years ago
http://www.toyota.com/matrix/ and http://www.toyota.com/camry/ display correctly with my patch.
Reporter
Comment 39•22 years ago
https://www.radiolinja.fi/ - same thing?
Comment 40•22 years ago
Comment on attachment 76351 [details] [diff] [review] Patch on caller sr=attinasi
Attachment #76351 -
Flags: superreview+
Assignee
Comment 41•22 years ago
>https://www.radiolinja.fi/ - same thing?
It doesn't seem to be. The source page isn't UTF-8 and the style sheet seems to
be all ASCII anyway. What problem(s) are you seeing there?
Reporter
Comment 42•22 years ago
For https://www.radiolinja.fi/ I added bug 136204.
Comment 43•22 years ago
Please approve this for checkin. This is a high-impact, mid-risk bug; similar code has already been used in the HTML parser for many months.
Keywords: adt1.0.0
Comment 44•22 years ago
adt1.0.0+ (on ADT's behalf) for approval to checkin to 1.0. Pls check this in today.
Updated•22 years ago
Comment 45•22 years ago
Comment on attachment 76351 [details] [diff] [review] Patch on caller a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #76351 -
Flags: approval+
Assignee
Comment 46•22 years ago
Fix checked in
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 22 years ago
Resolution: --- → FIXED
Comment 47•22 years ago
View these websites: http://www.geocities.com/i18nguy/unicode-example.html and http://www.columbia.edu/kermit/utf8.html. In Red Hat Linux 7.3, build ID 2002-04-11-06trunk, I see all the international language characters displayed right, whereas in Win2000, build ID 2002-04-10-06trunk, for many of these characters I see "?????" displayed instead of their respective language characters. Any comments?
Comment 48•22 years ago
*** Bug 136284 has been marked as a duplicate of this bug. ***
Comment 49•22 years ago
Verified fixed in trunk on the following platforms/builds: Win2000 build ID 2002-04-15-10trunk; Red Hat Linux 7.3 build ID 2002-04-11-10trunk.
Status: RESOLVED → VERIFIED
*** Bug 146514 has been marked as a duplicate of this bug. ***
Comment 51•22 years ago
Madhur, you're simply missing fonts that contain these characters on Windows. I recommend installing "Arial Unicode MS", which contains the entire Unicode 2.1 set. It comes with MS Office 2000 and XP; you have to explicitly install it under: Office Shared Features->International Support->Universal Font.
Assignee
Comment 52•16 years ago
Checked in a reftest for this at Håkan's suggestion.
Flags: in-testsuite+