Bug 128896 (Closed)
Opened 23 years ago, Closed 23 years ago
UTF8 decoder should recover from errors (CSS not applied)
Categories: Core :: CSS Parsing and Computation, defect, P2
Status: VERIFIED FIXED
Target Milestone: mozilla1.0
People: (Reporter: ezh, Assigned: smontagu)
Keywords: testcase, topembed; Whiteboard: [adt2]
Attachments: (4 files, 3 obsolete files)
- 265 bytes, text/html (Details)
- 3.40 KB, patch; ftang: review+, attinasi: superreview+, asa: approval+ (Details | Diff | Splinter Review)
- 132 bytes, text/css (Details)
- 466 bytes, text/html (Details)
1. Load this page.
2. CSS is not applied in Moz, but works fine in Opera and IE.
Tested on two PCs.
moz 2002022808
Comment 1 • 23 years ago
The page *is* sending the stylesheet with the correct content-type.
Does it work with an @charset rule in the CSS file? I'm guessing we're using
the document encoding for the CSS file, which is UTF-8.
Comment 3 • 23 years ago
Seeing this as well on a Linux 2002030223 trunk build.
The stylesheet does have 2 errors in it:
http://jigsaw.w3.org/css-validator/validator?uri=http://www.eyp.ee/styles.css&warning=1&profile=css2
However, they are not what's causing this problem.
Apparently this line is what's causing it:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Removing this line from the HTML fixes it.
OS: Windows 2000 → All
bz: So what was the reason we want to use the document charset as the fallback?
Comment 5 • 23 years ago
Specifying a charset with @charset or in the link element gets it to work, as
does removing the Content-Type meta-header as Alexey noted. It does seem that
the document charset is taken as the default for the stylesheet.
The CSS file contains some characters with umlauts (in comments). It seems the
UTF-8 decoder chokes on these and the whole stylesheet is discarded.
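The failure mode described here is easy to reproduce outside Mozilla: a Latin-1 'ä' is the single byte 0xE4, which UTF-8 reserves as the lead byte of a three-byte sequence, so a strict decoder sees a malformed sequence and gives up. A minimal structural validity check, as an illustration only (this is not Mozilla's actual nsUTF8ToUnicode code):

```cpp
#include <cassert>
#include <cstddef>

// How many continuation bytes a UTF-8 lead byte demands,
// or -1 if the byte can never start a sequence.
static int utf8_trail_count(unsigned char b) {
    if (b < 0x80) return 0;   // ASCII
    if (b < 0xC0) return -1;  // bare continuation byte (10xxxxxx)
    if (b < 0xE0) return 1;   // 2-byte sequence
    if (b < 0xF0) return 2;   // 3-byte sequence
    if (b < 0xF8) return 3;   // 4-byte sequence
    return -1;                // 0xF8-0xFF: never valid
}

// Structural well-formedness only; overlong forms are not rejected here.
bool is_structurally_valid_utf8(const unsigned char* p, std::size_t len) {
    for (std::size_t i = 0; i < len; ) {
        int trail = utf8_trail_count(p[i]);
        if (trail < 0) return false;
        if (i + trail >= len) return false;  // sequence truncated
        for (int t = 1; t <= trail; ++t)
            if ((p[i + t] & 0xC0) != 0x80) return false;  // not 10xxxxxx
        i += static_cast<std::size_t>(trail) + 1;
    }
    return true;
}
```

UTF-8 encodes 'ä' as the two bytes C3 A4, which passes; the bare Latin-1 byte E4 is a lead byte with no continuations, which fails, and a decoder that treats that as fatal ends up discarding the whole sheet.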
Comment 6 • 23 years ago
Comment 7 • 23 years ago
We do it for a few reasons:
1) Lots of Japanese pages are UTF-8 with the sheet also encoded as UTF-8 and
not specifying any @charset
2) The CSS spec says nothing about fallbacks for encoding at this point except
vague handwaving at the XML spec and saying something about "Mechanisms of
the language of the referencing document (e.g., in HTML, the "charset"
attribute of the LINK element). "
3) It makes sense that any "internal" linked resources that do not specify
otherwise would be in the same encoding as the document. We certainly
assume that for javascript...
Marking this as a duplicate of our bug about ignoring decoding errors like this when possible.
*** This bug has been marked as a duplicate of 107790 ***
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → DUPLICATE
Assignee
Comment 8 • 23 years ago
Reopening and marking nsbeta1+ per triage meeting, where bug 107790 was marked nsbeta1-, but we thought this specific issue should be fixed.
Attachment 71802 [details] [diff] fixes this problem.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
->smontagu, since he seems to have a fix (I think)
Assignee: dbaron → smontagu
Status: REOPENED → NEW
Assignee
Comment 11 • 23 years ago
Assignee
Comment 12 • 23 years ago
Assignee
Comment 13 • 23 years ago
nhotta, can you review the patch? Would there be a better way to do this?
Reporter
Comment 14 • 23 years ago
See also www.opera.com
Comment 15 • 23 years ago
There is actually an evangelism bug on opera.com; it was felt that this was a big enough site to be worth evangelizing, since technically what they're doing (serving up a style sheet with no charset info) leads to undefined behavior.
Comment 16 • 23 years ago
Simon, I thought the caller would be changed instead of the converter. Could you ask Frank to review? I am not familiar with the converter code. For UTF-8 error handling, you need to make sure the security issue is taken care of. Is it okay to just skip invalid bytes and proceed?
Assignee
Comment 17 • 23 years ago
On the security issue, my patch continues to treat sequences that are decodable
but not minimum length as fatal errors, so I think we are safe here.
Assignee
Updated • 23 years ago
Status: NEW → ASSIGNED
Summary: CSS not applied → UTF8 decoder should recover from errors (CSS not applied)
Comment on attachment 72722 [details] [diff] [review]
The patch from bug 107790 (diff -uw)
This patch certainly seems right to me (and the re-indentation is badly
needed!). It might be better to call the constant NS_UTF8_ERROR_STATE, since
NS_UTF8_ERROR sounds somewhat like an nsresult value. I wonder if you want to
fix the one other case (the non-shortest form case) to do the same thing (put
in 0xFFFD and continue), which would allow you to remove the |res| variable
(and switch the order of the statements at the end and just use |return|). I
also wonder if the NS_OK_UDEC_MOREOUTPUT test should change from (mState != 0)
to (mState != 0 && mState != NS_UTF8_ERROR_STATE).
I'm not sure what the security issue in comment 16 and comment 17 is, but:
* If it's a buffer overrun issue, I don't see how changing what I suggested in
comment 18 would create any problems.
* If it's an issue where some caller depends on a 1-to-1 mapping between UTF8
and the decoded UCS2, then there might be a need to report the error in some way
in all three cases, not just the one.
Assignee
Comment 20 • 23 years ago
Quoting from http://www.ietf.org/rfc/rfc2279.txt:
6. Security Considerations
Implementors of UTF-8 need to consider the security aspects of how
they handle illegal UTF-8 sequences. It is conceivable that in some
circumstances an attacker would be able to exploit an incautious
UTF-8 parser by sending it an octet sequence that is not permitted by
the UTF-8 syntax.
A particularly subtle form of this attack could be carried out
against a parser which performs security-critical validity checks
against the UTF-8 encoded form of its input, but interprets certain
illegal octet sequences as characters. For example, a parser might
prohibit the NUL character when encoded as the single-octet sequence
00, but allow the illegal two-octet sequence C0 80 and interpret it
as a NUL character. Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F.
I don't see any impediment here to what dbaron suggests in comment 18.
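The RFC's attack hinges on overlong (non-shortest-form) sequences: C0 80 decodes arithmetically to NUL, and the C0 AE inside 2F C0 AE 2E 2F to '.', but both are illegal. A sketch of a single-sequence decoder that rejects them by checking the decoded value against the minimum for its length — an illustration under the RFC's rules, not the actual patch:

```cpp
#include <cassert>
#include <cstddef>

// Decode one UTF-8 sequence; returns the code point, or -1 on any error
// (truncation, bad continuation, or an overlong form). Surrogate and
// > U+10FFFF checks are omitted for brevity.
long decode_one_utf8(const unsigned char* p, std::size_t len, std::size_t* used) {
    if (len == 0) return -1;
    unsigned char b = p[0];
    int trail;
    long cp, min;
    if (b < 0x80)      { trail = 0; cp = b;        min = 0x00;    }
    else if (b < 0xC0) { return -1; }  // bare continuation byte
    else if (b < 0xE0) { trail = 1; cp = b & 0x1F; min = 0x80;    }
    else if (b < 0xF0) { trail = 2; cp = b & 0x0F; min = 0x800;   }
    else if (b < 0xF8) { trail = 3; cp = b & 0x07; min = 0x10000; }
    else return -1;
    if (len < static_cast<std::size_t>(trail) + 1) return -1;  // truncated
    for (int t = 1; t <= trail; ++t) {
        if ((p[t] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (p[t] & 0x3F);
    }
    if (cp < min) return -1;  // overlong: rejects C0 80, C0 AE, etc.
    *used = static_cast<std::size_t>(trail) + 1;
    return cp;
}
```

This is the "minimum length" check comment 17 refers to: such sequences are decodable but not shortest-form, so they stay fatal.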
Comment 21 • 23 years ago
1. The real problem is that nsCSSLoader does not tell the decoder to recover from the error. The decoder has a method to recover from the error: the Reset() method.
2. The problem code is in SheetLoadData::OnStreamComplete, in content/html/style/src/nsCSSLoader.cpp:
if (decoder) {
  PRInt32 unicodeLength = 0;
  if (NS_SUCCEEDED(decoder->GetMaxLength(aString, aStringLen, &unicodeLength))) {
    PRUnichar *unicodeString = nsnull;
    strUnicodeBuffer = new nsString;
    if (nsnull == strUnicodeBuffer) {
      result = NS_ERROR_OUT_OF_MEMORY;
    } else {
      // make space for the decoding
      strUnicodeBuffer->SetCapacity(unicodeLength);
      unicodeString = (PRUnichar *) strUnicodeBuffer->get();
      result = decoder->Convert(aString, (PRInt32 *) &aStringLen,
                                unicodeString, &unicodeLength);
      if (NS_SUCCEEDED(result)) {
        strUnicodeBuffer->SetLength(unicodeLength);
      } else {
#ifdef DEBUG
        /*
         * We currently fail often because of XML documents
         * which are in UTF-8 importing stylesheets which
         * use an ISO-8859-1 char, especially in comments in
         * the sheet. See http://bugzilla.mozilla.org/show_bug.cgi?id=106843
         *
         * This assertion should at least make it possible
         * to catch such errors in chrome, pending this bug
         * being fixed
         */
        nsCAutoString uriStr;
        mURL->GetSpec(uriStr);
        nsCAutoString errorMessage;
        errorMessage = NS_LITERAL_CSTRING("Decoding sheet from ") +
                       uriStr +
                       NS_LITERAL_CSTRING(" failed");
        NS_ASSERTION(PR_FALSE, errorMessage.get());
#endif // DEBUG
        strUnicodeBuffer->SetLength(0);
      }
    }
  }
  NS_RELEASE(decoder);
}
The way it calls the converter is WRONG. When the Convert function fails, it should do what the parser does in nsScanner::Append, in htmlparser/src/nsScanner.cpp:
do {
  PRInt32 srcLength = aLen;
  res = mUnicodeDecoder->Convert(aBuffer, &srcLength, unichars, &unicharLength);

  totalChars += unicharLength;
  // Continuation of failure case
  if (NS_FAILED(res)) {
    // if we failed, we consume one byte, replace it with U+FFFD
    // and try the conversion again.
    unichars[unicharLength++] = (PRUnichar)0xFFFD;
    unichars = unichars + unicharLength;
    unicharLength = unicharBufLen - (++totalChars);

    mUnicodeDecoder->Reset();

    if (((PRUint32) (srcLength + 1)) > aLen) {
      srcLength = aLen;
    }
    else {
      srcLength++;
    }

    aBuffer += srcLength;
    aLen -= srcLength;
  }
} while (NS_FAILED(res) && (aLen > 0));
Please do NOT put your fix in the UTF-8 decoder only, but fix nsCSSLoader.cpp; otherwise other charsets (say Shift_JIS) will still fail even if you land your current patch.
>UTF8 decoder should recover from errors
It WILL, if the caller calls the Reset() method.
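The contract ftang describes (a strict Convert() that stops at the first bad byte, plus Reset()) can be exercised end-to-end with a toy model. Everything below is hypothetical illustration, not Mozilla code: ToyUtf8Decoder stands in for nsIUnicodeDecoder (handling only ASCII and 2-byte UTF-8 sequences for brevity), and DecodeWithRecovery replays the consume-one-byte, emit-U+FFFD, Reset() pattern from nsScanner::Append:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for nsIUnicodeDecoder: a strict decoder whose
// Convert() stops at the first bad byte (updating *srcLen/*dstLen to
// what it actually consumed/produced) and whose Reset() clears the
// error state.
struct ToyUtf8Decoder {
    bool mError = false;
    void Reset() { mError = false; }
    bool Convert(const unsigned char* src, std::size_t* srcLen,
                 char16_t* dst, std::size_t* dstLen) {
        std::size_t si = 0, di = 0;
        bool ok = true;
        while (si < *srcLen && di < *dstLen) {
            unsigned char b = src[si];
            if (b < 0x80) {
                dst[di++] = b; ++si;
            } else if (b >= 0xC2 && b < 0xE0 && si + 1 < *srcLen &&
                       (src[si + 1] & 0xC0) == 0x80) {
                dst[di++] = static_cast<char16_t>(((b & 0x1F) << 6) |
                                                  (src[si + 1] & 0x3F));
                si += 2;
            } else {
                mError = true; ok = false; break;  // stop at the bad byte
            }
        }
        *srcLen = si; *dstLen = di;
        return ok;
    }
};

// The caller-side recovery pattern from nsScanner::Append: on failure,
// emit U+FFFD for the offending byte, Reset() the decoder, and retry.
std::u16string DecodeWithRecovery(const unsigned char* src, std::size_t len) {
    ToyUtf8Decoder decoder;
    std::u16string out;
    char16_t buf[64];
    while (len > 0) {
        std::size_t srcLen = len, dstLen = 64;
        bool ok = decoder.Convert(src, &srcLen, buf, &dstLen);
        out.append(buf, dstLen);
        src += srcLen; len -= srcLen;
        if (!ok) {
            out.push_back(0xFFFD);  // replace the bad byte
            decoder.Reset();
            ++src; --len;           // consume one byte and try again
        }
    }
    return out;
}
```

With this shape the recovery logic lives entirely in the caller, which is the point of the comment above: any converter with a Reset() gets the same treatment, not just UTF-8.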
Updated • 23 years ago
Priority: -- → P2
Whiteboard: adt2
Target Milestone: --- → mozilla1.0
Isn't that going to cause severe problems with error handling in encodings like
UTF-16, where consuming one byte can totally change the meaning of the entire
sequence? Or does UTF-16 have protections against a 1-byte shift?
Comment 23 • 23 years ago
The point is that duplicating the logic from comment 21 in _every_ place that calls converters makes less sense than moving it into the converters... But that's bug 107790, I guess.
Comment 24 • 23 years ago
>The point is that duplicating the logic from comment 21 in _every_
>place that calls converters makes less sense than moving it into
>the converters... But that's bug 107790, I guess.
1. Either fix needs duplicated code, whether in the caller or in the ~80 unicode converters, and
2. only the caller can decide the logic for how to recover from the error. In the case of HTML and CSS the behavior is probably the same, but in the case of MIME-2 decoding or other uses the behavior will be different.
>Isn't that going to cause severe problems with error handling
>in encodings like UTF-16, where consuming one byte can totally
> change the meaning of the entire sequence?
You will have this problem regardless of where you recover from the error, because the error could be ONE extra byte in the UTF-16 or TWO extra bytes, and you cannot tell which is the case from either the caller or the converter. The so-called severe problem is introduced when the extra byte(s) are put into the UTF-16, not in any of our code.
Comment 25 • 23 years ago
Boris Zbarsky (out of town till mid-April):
There are two reasons we want to change the caller and not the converter:
1. it means we would need to change at least 5 to 10 different unicode decoders;
2. it would break the MIME-2 decoder, which currently depends on this behavior.
Assignee
Comment 26 • 23 years ago
I need a new testcase. The referenced URL doesn't seem to encounter any decoding errors anymore, and neither does www.opera.com.
Attachment #72721 - Attachment is obsolete: true
Attachment #72722 - Attachment is obsolete: true
Comment 27 • 23 years ago
Just revert the patch to bug 106674 and see whether mailnews is all broken or
not. :)
Assignee
Comment 28 • 23 years ago
Assignee
Comment 29 • 23 years ago
Assignee
Updated • 23 years ago
Attachment #76426 - Attachment mime type: text/html → text/plain
Assignee
Updated • 23 years ago
Attachment #76426 - Attachment is obsolete: true
Attachment #76426 - Attachment mime type: text/plain → text/html
Assignee
Comment 30 • 23 years ago
Assignee
Comment 31 • 23 years ago
BTW, reverting bug 106674 doesn't test this patch, because the decoder is called
from nsScanner::Append.
Comment 32 • 23 years ago
Impact platform: ALL
Impacted language users: 560M, 100% of internet users
Probability of hitting the problem: HIGH, we saw major web sites with this problem
Severity if the problem is hit, in the worst case: part of the stylesheet won't be applied to the page; major display problems may happen
Way to recover after hitting the problem: ask the webmaster to label the style sheet with @charset
Risk of the fix: LOW, we have similar code in the parser
Potential benefit of fixing this problem: resolves a lot of stylesheet issues
Comment 33 • 23 years ago
A real-world testcase: go to http://www.toyota.com/ and notice that the top banner is displayed properly. Then go to either http://www.toyota.com/matrix/ or http://www.toyota.com/camry/ , each of which has a UTF-8 META (other models may have this as well). The banner on the model pages is broken up, and in general no CSS at all is applied on the entire page. Their CSS validates and is served as text/css, although their HTML does not validate. (Also, their popdown menus are sniffing badly, but that's an evang issue.) I've confirmed that saving a local copy of the model pages and deleting the character-encoding line results in the CSS being applied to the document; restoring the line causes the CSS to drop away.
Comment 34 • 23 years ago
Added topembed and cc'd momoi.
Comment 35 • 23 years ago
Comment on attachment 76351 [details] [diff] [review]
Patch on caller
r=ftang
Attachment #76351 - Flags: review+
Comment 36 • 23 years ago
Please ask for sr=.
Assignee
Comment 37 • 23 years ago
http://www.toyota.com/matrix/ and http://www.toyota.com/camry/ display correctly
with my patch.
Reporter
Comment 39 • 23 years ago
https://www.radiolinja.fi/ - same thing?
Comment 40 • 23 years ago
Comment on attachment 76351 [details] [diff] [review]
Patch on caller
sr=attinasi
Attachment #76351 - Flags: superreview+
Assignee
Comment 41 • 23 years ago
>https://www.radiolinja.fi/ - same thing?
It doesn't seem to be. The source page isn't UTF-8 and the style sheet seems to
be all ASCII anyway. What problem(s) are you seeing there?
Reporter
Comment 42 • 23 years ago
For https://www.radiolinja.fi/ I filed bug 136204.
Comment 43 • 23 years ago
Please approve this for checkin. This is a high-impact, mid-risk bug; similar code has already been used in the HTML parser for many months.
Keywords: adt1.0.0
Comment 44 • 23 years ago
adt1.0.0+ (on ADT's behalf) for approval to checkin to 1.0. Pls check this in today.
Updated • 23 years ago
Comment 45 • 23 years ago
Comment on attachment 76351 [details] [diff] [review]
Patch on caller
a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #76351 - Flags: approval+
Assignee
Comment 46 • 23 years ago
Fix checked in
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 23 years ago
Resolution: --- → FIXED
Comment 47 • 23 years ago
View these websites:
http://www.geocities.com/i18nguy/unicode-example.html
http://www.columbia.edu/kermit/utf8.html
In Red Hat Linux 7.3, build ID 2002-04-11-06trunk, I see all the international language characters displayed correctly, whereas in Win2000, build ID 2002-04-10-06trunk, for many of these characters I see "?????" displayed instead of the respective language character.
Any comments?
Comment 48 • 23 years ago
*** Bug 136284 has been marked as a duplicate of this bug. ***
Comment 49 • 23 years ago
Verified fixed on trunk on the following platforms/builds:
Win2000 build ID: 2002-04-15-10trunk
Red Hat Linux 7.3 build ID: 2002-04-11-10trunk
Status: RESOLVED → VERIFIED
*** Bug 146514 has been marked as a duplicate of this bug. ***
Comment 51 • 23 years ago
Madhur, you're simply missing fonts that contain these characters on Windows. I recommend installing "Arial Unicode MS", which contains the entire Unicode 2.1 set. It comes with MS Office 2000 and XP; you have to explicitly install it under Office Shared Features -> International Support -> Universal Font.
Assignee
Comment 52 • 17 years ago
Checked in a reftest for this at Håkan's suggestion.
Flags: in-testsuite+