Closed Bug 128896 Opened 23 years ago Closed 22 years ago

UTF8 decoder should recover from errors (CSS not applied)

Categories

(Core :: CSS Parsing and Computation, defect, P2)

x86
All
defect

Tracking


VERIFIED FIXED
mozilla1.0

People

(Reporter: ezh, Assigned: smontagu)


Details

(Keywords: testcase, topembed, Whiteboard: [adt2])

Attachments

(4 files, 3 obsolete files)

1. Load this page.
2. CSS is not applied in Moz, but works fine in Opera, IE.

Tested on 2 PCs.

moz 2002022808
The page *is* sending the stylesheet with the correct content-type.
Does it work with an @charset rule in the CSS file?  I'm guessing we're using
the document encoding for the CSS file, which is UTF-8.
Seeing this as well on Linux 2002030223 trunk build

The stylesheet does have 2 errors in it:
http://jigsaw.w3.org/css-validator/validator?uri=http://www.eyp.ee/styles.css&warning=1&profile=css2

However, they are not what's causing this problem.
Apparently this line is what's causing it:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Removing this line from the HTML fixes it.
OS: Windows 2000 → All
bz: So what was the reason we want to use the document charset as the fallback?
Specifying a charset with @charset or in the link element gets it to work, as 
does removing the Content-Type meta-header as Alexey noted. It does seem that 
the document charset is taken as the default for the stylesheet.

The CSS file contains some characters with umlauts (in comments). It seems the 
UTF-8 decoder chokes on these and the whole stylesheet is discarded.
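To make this concrete, here is a minimal standalone sketch (not Mozilla code; the comment text is made up) of why a single Latin-1 umlaut byte makes a strict UTF-8 decode fail outright:

// Standalone sketch (not Mozilla code): why one ISO-8859-1 umlaut kills a
// strict UTF-8 decode. In Latin-1, 'a-umlaut' is the single byte 0xE4; to a
// UTF-8 decoder, 0xE4 is the lead byte of a three-byte sequence, so the
// plain ASCII that follows it is an invalid continuation.
#include <cstdio>
#include <cstdint>
#include <cstddef>

// Returns the index of the first malformed byte, or -1 if the input is valid.
static long FirstUtf8Error(const uint8_t* buf, size_t len) {
  size_t i = 0;
  while (i < len) {
    uint8_t b = buf[i];
    size_t trail;
    if (b < 0x80)                trail = 0;  // ASCII
    else if ((b & 0xE0) == 0xC0) trail = 1;  // 2-byte lead
    else if ((b & 0xF0) == 0xE0) trail = 2;  // 3-byte lead
    else if ((b & 0xF8) == 0xF0) trail = 3;  // 4-byte lead
    else return (long)i;                     // stray continuation byte
    for (size_t j = 1; j <= trail; ++j)
      if (i + j >= len || (buf[i + j] & 0xC0) != 0x80)
        return (long)i;                      // missing/bad continuation
    i += trail + 1;
  }
  return -1;
}

int main() {
  // A made-up CSS comment, "/* märkus */", as ISO-8859-1 bytes.
  const uint8_t latin1[] = { '/', '*', ' ', 'm', 0xE4, 'r', 'k', 'u', 's',
                             ' ', '*', '/' };
  // Prints 4: the 0xE4. A decoder that treats this as fatal discards
  // everything -- here, the entire stylesheet.
  printf("first invalid byte at index %ld\n",
         FirstUtf8Error(latin1, sizeof(latin1)));
  return 0;
}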
Attached file testcase
Keywords: testcase
We do it for a few reasons:

1)  Lots of Japanese pages are UTF-8 with the sheet also encoded as UTF-8 and
    not specifying any @charset
2)  The CSS spec says nothing about fallbacks for encoding at this point except
    vague handwaving at the XML spec and saying something about "Mechanisms of
    the language of the referencing document (e.g., in HTML, the "charset"
    attribute of the LINK element). "
3)  It makes sense that any "internal" linked resources that do not specify
    otherwise would be in the same encoding as the document.  We certainly
    assume that for javascript...

Marking this a duplicate of our bug about ignoring decoding errors like this when possible.

*** This bug has been marked as a duplicate of 107790 ***
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → DUPLICATE
reopening and marking nsbeta1+ per triage meeting, where bug 107790 was marked
nsbeta1-, but we thought this specific issue should be fixed.

Attachment 71802 [details] [diff] fixes this problem.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
->smontagu, since he seems to have a fix (I think)
Assignee: dbaron → smontagu
Status: REOPENED → NEW
really marking nsbeta1+
Keywords: nsbeta1+
nhotta, can you review the patch? Would there be a better way to do this?
See also www.opera.com
There is actually an evangelism bug on opera.com -- it was felt that this was a
big enough site to be worth evangelizing, since technically what they're doing
(serving up a style sheet with no charset info) leads to undefined behavior.
Simon, I thought the caller would be changed instead of the converter. Could you
ask Frank to review? I am not familiar with the converter code. For UTF-8 error
handling, you need to make sure the security issue is taken care of. Is it okay
to just skip invalid bytes and proceed?
On the security issue, my patch continues to treat sequences that are decodable
but not minimum length as fatal errors, so I think we are safe here.
Blocks: 106843
Status: NEW → ASSIGNED
Summary: CSS not applied → UTF8 decoder should recover from errors (CSS not applied)
Comment on attachment 72722 [details] [diff] [review]
The patch from bug 107790 (diff -uw)

This patch certainly seems right to me (and the re-indentation is badly
needed!).  It might be better to call the constant NS_UTF8_ERROR_STATE, since
NS_UTF8_ERROR sounds somewhat like an nsresult value.  I wonder if you want to
fix the one other case (the non-shortest form case) to do the same thing (put
in 0xFFFD and continue), which would allow you to remove the |res| variable
(and switch the order of the statements at the end and just use |return|).  I
also wonder if the NS_OK_UDEC_MOREOUTPUT test should change from (mState != 0)
to (mState != 0 && mState != NS_UTF8_ERROR_STATE).
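For illustration, a hedged sketch of what that decoder-side behavior could look like -- this is NOT the actual attachment 72722 diff, the class below is hypothetical, it is simplified to the BMP, and it omits the overlong-form check, which the real patch keeps fatal (see the security discussion below):

// Sketch of decoder-side recovery: on a malformed sequence, emit U+FFFD,
// enter a named error state, and resynchronize at the next plausible lead
// byte instead of failing the whole conversion.
#include <vector>
#include <cstdint>
#include <cstddef>

// Named _STATE so it cannot be misread as an nsresult value.
static const int NS_UTF8_ERROR_STATE = -1;

class ReplacingUtf8Decoder {
  int mState = 0;      // continuation bytes still expected (0 = none)
  uint32_t mUcs4 = 0;  // code point being assembled
public:
  void Convert(const uint8_t* src, size_t len, std::vector<char16_t>& out) {
    for (size_t i = 0; i < len; ++i) {
      uint8_t b = src[i];
      if (mState == 0 || mState == NS_UTF8_ERROR_STATE) {
        if (b < 0x80)                { out.push_back(b); mState = 0; }
        else if ((b & 0xE0) == 0xC0) { mUcs4 = b & 0x1F; mState = 1; }
        else if ((b & 0xF0) == 0xE0) { mUcs4 = b & 0x0F; mState = 2; }
        else if (mState != NS_UTF8_ERROR_STATE) {
          out.push_back(0xFFFD);        // substitute instead of failing...
          mState = NS_UTF8_ERROR_STATE; // ...and hunt for the next lead byte
        }                               // (already in error state: skip byte)
      } else if ((b & 0xC0) == 0x80) {  // expected continuation byte
        mUcs4 = (mUcs4 << 6) | (b & 0x3F);
        if (--mState == 0) out.push_back((char16_t)mUcs4);
      } else {
        out.push_back(0xFFFD);
        mState = NS_UTF8_ERROR_STATE;
        --i;  // reprocess as a possible lead (unsigned wrap at 0 is benign:
              // the loop's ++i brings it back to index 0)
      }
    }
  }
  bool NeedsMoreOutput() const {
    // the other suggested tweak: a pending error state is not a buffered
    // partial character
    return mState != 0 && mState != NS_UTF8_ERROR_STATE;
  }
};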
I'm not sure what the security issue in comment 16 and comment 17 is, but:
 * If it's a buffer overrun issue, I don't see how changing what I suggested in
comment 18 would create any problems.
 * If it's an issue where some caller depends on a 1-to-1 mapping between UTF8
and the decoded UCS2, then there might be a need to report the error in some way
in all three cases, not just the one.
Quoting from http://www.ietf.org/rfc/rfc2279.txt:

6.  Security Considerations

   Implementors of UTF-8 need to consider the security aspects of how
   they handle illegal UTF-8 sequences.  It is conceivable that in some
   circumstances an attacker would be able to exploit an incautious
   UTF-8 parser by sending it an octet sequence that is not permitted by
   the UTF-8 syntax.

   A particularly subtle form of this attack could be carried out
   against a parser which performs security-critical validity checks
   against the UTF-8 encoded form of its input, but interprets certain
   illegal octet sequences as characters.  For example, a parser might
   prohibit the NUL character when encoded as the single-octet sequence
   00, but allow the illegal two-octet sequence C0 80 and interpret it
   as a NUL character.  Another example might be a parser which
   prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
   illegal octet sequence 2F C0 AE 2E 2F.

I don't see any impediment here to what dbaron suggests in comment 18.
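A standalone illustration (not Mozilla code; the helper is hypothetical) of the overlong-form check this implies. For two-byte sequences, lead bytes C0 and C1 can only encode values below U+0080, which always fit in one byte, so any sequence they start is overlong and must stay a fatal error:

#include <cstdio>
#include <cstdint>

// A two-byte lead of C0 or C1 can only produce an overlong encoding.
static bool IsOverlongTwoByteLead(uint8_t lead) {
  return lead == 0xC0 || lead == 0xC1;
}

int main() {
  // C0 80 would lax-decode to NUL; 2F C0 AE 2E 2F would lax-decode to "/../".
  const uint8_t nulAttack[]  = { 0xC0, 0x80 };
  const uint8_t pathAttack[] = { 0x2F, 0xC0, 0xAE, 0x2E, 0x2F };
  printf("C0 80:          %s\n",
         IsOverlongTwoByteLead(nulAttack[0]) ? "rejected" : "accepted");
  printf("2F C0 AE 2E 2F: %s\n",
         IsOverlongTwoByteLead(pathAttack[1]) ? "rejected" : "accepted");
  return 0;
}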
1. The real problem is that nsCSSLoader does not tell the decoder to recover from
the error. The decoder already has a method to recover from an error: the Reset()
method.
2. The real problem code is in SheetLoadData::OnStreamComplete:

content/html/style/src/nsCSSLoader.cpp

if (decoder) {
  PRInt32 unicodeLength = 0;
  if (NS_SUCCEEDED(decoder->GetMaxLength(aString, aStringLen, &unicodeLength))) {
    PRUnichar *unicodeString = nsnull;
    strUnicodeBuffer = new nsString;
    if (nsnull == strUnicodeBuffer) {
      result = NS_ERROR_OUT_OF_MEMORY;
    } else {
      // make space for the decoding
      strUnicodeBuffer->SetCapacity(unicodeLength);
      unicodeString = (PRUnichar *) strUnicodeBuffer->get();
      result = decoder->Convert(aString, (PRInt32 *) &aStringLen,
                                unicodeString, &unicodeLength);
      if (NS_SUCCEEDED(result)) {
        strUnicodeBuffer->SetLength(unicodeLength);
      } else {
#ifdef DEBUG
        /*
         * We currently fail often because of XML documents
         * which are in UTF-8 importing stylesheets which
         * use an ISO-8859-1 char, especially in comments in
         * the sheet.  See
         * http://bugzilla.mozilla.org/show_bug.cgi?id=106843
         *
         * This assertion should at least make it possible
         * to catch such errors in chrome, pending this bug
         * being fixed
         */

        nsCAutoString uriStr;
        mURL->GetSpec(uriStr);
        nsCAutoString errorMessage;
        errorMessage = NS_LITERAL_CSTRING("Decoding sheet from ") +
                       uriStr +
                       NS_LITERAL_CSTRING(" failed");
        NS_ASSERTION(PR_FALSE, errorMessage.get());
#endif // DEBUG
        strUnicodeBuffer->SetLength(0);
      }
    }
  }
  NS_RELEASE(decoder);
}

The way it calls the converter is WRONG; when the Convert function fails, it
should do what the parser does in nsScanner::Append:

htmlparser/src/nsScanner.cpp

do {
  PRInt32 srcLength = aLen;
  res = mUnicodeDecoder->Convert(aBuffer, &srcLength, unichars,
                                 &unicharLength);

  totalChars += unicharLength;
  // Continuation of failure case
  if (NS_FAILED(res)) {
    // if we failed, we consume one byte, replace it with U+FFFD
    // and try the conversion again.
    unichars[unicharLength++] = (PRUnichar)0xFFFD;
    unichars = unichars + unicharLength;
    unicharLength = unicharBufLen - (++totalChars);

    mUnicodeDecoder->Reset();

    if (((PRUint32) (srcLength + 1)) > aLen) {
      srcLength = aLen;
    }
    else {
      srcLength++;
    }

    aBuffer += srcLength;
    aLen -= srcLength;
  }
} while (NS_FAILED(res) && (aLen > 0));


Please do NOT put your fix in the UTF8 decoder only, but fix nsCSSLoader.cpp;
otherwise other charsets (say Shift_JIS) will still fail even after you land your
current patch.

>UTF8 decoder should recover from errors
It WILL, if the caller calls the Reset() method.
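A hedged sketch of that caller-side recovery loop (the Decoder interface below stands in for nsIUnicodeDecoder and is an assumption of this sketch, not the real XPCOM signature):

#include <string>
#include <cstdint>
#include <cstddef>

struct Decoder {
  // Convert up to *srcLen bytes into up to *dstLen char16_t units; on a
  // malformed sequence return false, with *srcLen / *dstLen set to what was
  // consumed / produced before the error.
  virtual bool Convert(const uint8_t* src, size_t* srcLen,
                       char16_t* dst, size_t* dstLen) = 0;
  virtual void Reset() = 0;  // clear internal decoder state after an error
  virtual ~Decoder() {}
};

// Decode a whole sheet, recovering from errors the way the parser does.
// Assumes Convert always makes progress (consumes bytes or fills output).
std::u16string DecodeSheet(Decoder& decoder, const uint8_t* buf, size_t len) {
  std::u16string out;
  char16_t chunk[256];
  while (len > 0) {
    size_t srcLen = len;
    size_t dstLen = sizeof(chunk) / sizeof(chunk[0]);
    bool ok = decoder.Convert(buf, &srcLen, chunk, &dstLen);
    out.append(chunk, dstLen);
    if (!ok) {
      out.push_back(0xFFFD);       // replace the bad byte with U+FFFD
      decoder.Reset();             // and let the decoder start fresh
      if (srcLen < len) ++srcLen;  // skip past the offending byte
    }
    buf += srcLen;
    len -= srcLen;
  }
  return out;
}

Because the loop lives in the caller, a decoder for any charset (Shift_JIS included) gets the same recovery, which is the point being made above.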



Priority: -- → P2
Whiteboard: adt2
Target Milestone: --- → mozilla1.0
Isn't that going to cause severe problems with error handling in encodings like
UTF-16, where consuming one byte can totally change the meaning of the entire
sequence?  Or does UTF-16 have protections against a 1-byte shift?
The point is that duplicating the logic from comment 121 in _every_ place that 
calls converters makes less sense than moving it into the converters... But 
that's bug 107790, I guess.
>The point is that duplicating the logic from comment 121 in _every_
> place that calls converters makes less sense than moving it into 
>the converters... But that's bug 107790, I guess.
1. Either fix needs duplicated code, whether duplicated in the caller or in
the ~80 unicode converters, and
2. Only the caller can decide the logic for how to recover from the error. In the
case of HTML and CSS the behavior is probably the same, but in the case of
MIME-2 decoding or other uses, the behavior will be different.

>Isn't that going to cause severe problems with error handling 
>in encodings like UTF-16, where consuming one byte can totally
> change the meaning of the entire sequence?  
You will have this problem regardless of where you recover from the error, because
the error could be ONE extra byte in the UTF-16 or TWO extra bytes, and you cannot
tell which is the case from either the caller or the converter. The so-called
severe problem is introduced when the extra byte(s) are put into the UTF-16
stream, not in any of our code.


Boris Zbarsky (out of town till mid-April):
There are two reasons we don't want to change the converter but the caller:
1. it means we would need to change at least 5 to 10 different unicode decoders, and
2. it would break the MIME-2 decoder, which currently depends on this behavior.
Attached patch Patch on caller
I need a new testcase. The referenced URL doesn't seem to encounter any
decoding errors any more, nor does www.opera.com.
Attachment #72721 - Attachment is obsolete: true
Attachment #72722 - Attachment is obsolete: true
Just revert the patch to bug 106674 and see whether mailnews is all broken or 
not.  :)
Attachment #76426 - Attachment mime type: text/html → text/plain
Attachment #76426 - Attachment is obsolete: true
Attachment #76426 - Attachment mime type: text/plain → text/html
BTW, reverting bug 106674 doesn't test this patch, because the decoder is called
from nsScanner::Append.
Impact platform: ALL
Impacted language users: 560M, 100% of internet users
Probability of hitting the problem: HIGH, we have seen major web sites with this problem
Severity in the worst case: part of the stylesheet won't be applied to the page,
and major display problems may happen
Way to recover after hitting the problem: ask the webmaster to label the style sheet
with @charset
Risk of the fix: LOW, we have similar code in the parser
Potential benefit of fixing this problem: a lot of stylesheet issues resolved
A real-world testcase: go to http://www.toyota.com/ and notice that the top
banner is displayed properly.  Then go to either http://www.toyota.com/matrix/
or http://www.toyota.com/camry/ , each of which has a UTF-8 META (other models
may have this as well).  The banner on the model pages is broken up, and in
general no CSS at all is applied on the entire page.  Their CSS validates and is
served as text/css, although their HTML does not validate.  Also, their popdown
menus are sniffing badly, but that's an evang issue.  I've confirmed that saving
a local copy of the model pages and deleting the character-encoding line results
in the CSS being applied to the document; restoring the line causes the CSS to
drop away.
Keywords: topembed
Comment on attachment 76351 [details] [diff] [review]
Patch on caller

r=ftang
Attachment #76351 - Flags: review+
Please ask for sr=.
http://www.toyota.com/matrix/ and http://www.toyota.com/camry/ display correctly
with my patch.
Breaks on top sites.
Whiteboard: adt2 → [adt2]
https://www.radiolinja.fi/ - same thing?
Comment on attachment 76351 [details] [diff] [review]
Patch on caller

sr=attinasi
Attachment #76351 - Flags: superreview+
>https://www.radiolinja.fi/ - same thing?

It doesn't seem to be. The source page isn't UTF-8 and the style sheet seems to
be all ASCII anyway. What problem(s) are you seeing there?
For https://www.radiolinja.fi/ I added bug 136204.
Please approve the patch for this bug. This is a high-impact, mid-risk bug;
similar code has already been used in the HTML parser for many months.
Keywords: adt1.0.0
adt1.0.0+ (on ADT's behalf) for approval to check in to 1.0. Please check this in today.
Keywords: adt1.0.0 → adt1.0.0+
Comment on attachment 76351 [details] [diff] [review]
Patch on caller

a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #76351 - Flags: approval+
Fix checked in
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 22 years ago
Resolution: --- → FIXED
View these websites:

http://www.geocities.com/i18nguy/unicode-example.html
http://www.columbia.edu/kermit/utf8.html

In Red Hat Linux 7.3 buildID: 2002-04-11-06trunk, I see all the international
language characters displayed correctly.

Whereas in Win2000 buildID: 2002-04-10-06trunk, for many of these characters I
see "?????" displayed instead of the respective language character.

Any comments?


*** Bug 136284 has been marked as a duplicate of this bug. ***
Verified fixed on trunk on the following platforms/builds:

win2000 build id: 2002-04-15-10trunk
redhat linux 7.3 build id: 2002-04-11-10trunk

Status: RESOLVED → VERIFIED
*** Bug 146514 has been marked as a duplicate of this bug. ***
Madhur, you're simply missing fonts that contain these characters on Windows.
I recommend installing "Arial Unicode MS", which contains the entire Unicode 2.1 set.
It comes with MS Office 2000 and XP; you have to explicitly install it under:
Office Shared Features -> International Support -> Universal Font
Checked in a reftest for this at Håkan's suggestion.
Flags: in-testsuite+