Closed Bug 174351 Opened 22 years ago Closed 15 years ago

Encoding errors aren't treated as fatal XML errors

Categories

(Core :: XML, defect, P3)

Tracking

RESOLVED FIXED
mozilla1.9.2a1

People

(Reporter: hsivonen, Assigned: m_kato)


Attachments

(1 file, 8 obsolete files)

Build ID: 2002100108

Reproducible: Always.

Steps to reproduce:
1) Load http://bugzilla.mozilla.org/buglist.cgi?bug_id=136650&format=rdf

Actual results:
The document contains a byte 0xC4. This byte shows up as a question mark.
However, the byte is surrounded by bytes whose most significant bit is zero.
Such a byte sequence is not legal UTF-8. Neither the document nor the HTTP
headers declare a character encoding and there's no BOM, so the document should
be treated as UTF-8-encoded.
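To illustrate the structural rule being violated: a lead byte such as 0xC4 opens a multi-byte sequence and must be followed by continuation bytes of the form 10xxxxxx. A simplified validity check (it deliberately ignores overlong forms and surrogates, so it is an illustration, not a complete validator) might look like:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simplified UTF-8 structural check: verifies lead/continuation byte
// patterns only. It does NOT reject overlong forms or surrogates.
bool IsValidUtf8(const std::string& bytes) {
  size_t i = 0;
  while (i < bytes.size()) {
    uint8_t b = static_cast<uint8_t>(bytes[i]);
    size_t len;
    if (b < 0x80)                len = 1;  // ASCII
    else if ((b & 0xE0) == 0xC0) len = 2;  // 110xxxxx: 2-byte lead (e.g. 0xC4)
    else if ((b & 0xF0) == 0xE0) len = 3;  // 1110xxxx: 3-byte lead
    else if ((b & 0xF8) == 0xF0) len = 4;  // 11110xxx: 4-byte lead
    else return false;                     // stray continuation byte
    if (i + len > bytes.size()) return false;  // truncated sequence
    for (size_t j = 1; j < len; ++j)
      if ((static_cast<uint8_t>(bytes[i + j]) & 0xC0) != 0x80)
        return false;  // lead byte not followed by a continuation byte
    i += len;
  }
  return true;
}
```

On the document in question, the lone 0xC4 between ASCII bytes fails the continuation check, which is exactly the fatal-error condition the spec quote below describes.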

The XML spec (2nd ed) says:
"It is a fatal error if an XML entity is determined (via default, encoding
declaration, or higher-level protocol) to be in a certain encoding but contains
octet sequences that are not legal in that encoding. It is also a fatal error if
an XML entity contains no encoding declaration and its content is not legal
UTF-8 or UTF-16."

Mozilla is letting a fatal XML error pass, which is bad since enforcing the
well-formedness constraints is important in order to keep XML from degenerating
into tag soup. Presumably the data goes through a lenient encoding converter
before reaching the XML parser.

Expected results:
Expected Mozilla to display an error report page similar to those displayed when
there's a tag mismatch.

Additional information:
Bug 174340 filed about Bugzilla serving the badly encoded document.
Yeah... we may need to conditionally instantiate a fault-intolerant nsScanner...
OS: MacOS X → All
Hardware: Macintosh → All
QA Contact: petersen → rakeshmishra
Blocks: 182751
Target Milestone: --- → mozilla1.4beta
Target Milestone: mozilla1.4beta → ---
QA Contact: rakeshmishra → ashishbhatt
No longer blocks: 182751
Summary: Fatal UTF-8 errors aren't treated as fatal errors → Fatal UTF-8 errors aren't treated as fatal XML errors
*** Bug 230821 has been marked as a duplicate of this bug. ***
Summary: Fatal UTF-8 errors aren't treated as fatal XML errors → Encoding errors aren't treated as fatal XML errors
Flags: blocking1.7a+
Flags: blocking1.6+
Flags: blocking1.4.2+
only drivers set blocking flags
Flags: blocking1.7a+
Flags: blocking1.6+
Flags: blocking1.4.2+
The converters now insert a placeholder character when the input contains a
nonsensical byte. This bug could be fixed in a somewhat dirty way by using a
placeholder character that is illegal in XML.
> The converters now insert a placeholder character when the input contains a
> nonsensical byte.

They do?  They shouldn't be....  It's the responsibility of the converter's
caller to deal with nonsensical bytes; in the case of HTML and XML parsing, see
http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsScanner.cpp#364, I believe.
The problem, of course, is that the scanner is instantiated before we have any
idea whether we're parsing XML or HTML or what.
>> The converters now insert a placeholder character when the input contains a
>> nonsensical byte.
>
>They do?

Yes, often. All the table-based converters pad the tables with U+FFFD
REPLACEMENT CHARACTER, and silently convert undefined bytes to it, and plenty of
places in the algorithmic converters insert the same character without returning
failure.
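A minimal sketch of the two policies at issue, silent substitution versus signaling an error to the caller (the enum names echo the kOnError_* constants discussed later in this bug, but this is illustrative code, not the actual Mozilla converters):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch, not the real converters: contrast the "recover"
// policy (emit U+FFFD and continue) with the "signal" policy (stop at the
// first undecodable byte so the caller can treat it as fatal).
enum ErrBehavior { kOnError_Recover, kOnError_Signal };

// Toy decoder: treats any byte >= 0x80 as undecodable. Returns false only
// when kOnError_Signal is set and an undecodable byte is hit; aOut holds
// whatever was decoded up to that point.
bool Decode(const std::vector<uint8_t>& aSrc, ErrBehavior aBehavior,
            std::vector<char16_t>* aOut) {
  for (uint8_t b : aSrc) {
    if (b < 0x80) {
      aOut->push_back(b);
    } else if (aBehavior == kOnError_Recover) {
      aOut->push_back(0xFFFD);  // silent substitution, as described above
    } else {
      return false;  // signal the error; caller decides it is fatal
    }
  }
  return true;
}
```

With kOnError_Recover the caller never learns an error occurred, which is why an XML parser downstream of such a converter cannot raise the fatal error the spec requires.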

>They shouldn't be....

File bugs. :)
In any case, the UTF-8 converter doesn't do that, which is the part most
relevant to XML...

I've filed bug 231267
smontagu: Would changing 0xFFFD to 0x001A (SUBSTITUTE) in the converters and the
scanner have ill effects on other parts of Mozilla?
Fixing nsScanner::Append's handling of the error case when the parser context's
doctype is eXML should fix the UTF-8 problem.
BTW, making the scanner know whether it's HTML or XML isn't hard (I just did it
in my tree attempting to fix this).  But I don't see an obvious way to go from
nsScanner::Append to nsExpatDriver::HandleError.
Blocks: 234628
No longer blocks: 234628
Keywords: helpwanted
*** Bug 267189 has been marked as a duplicate of this bug. ***
We really should fix this since we're creating interoperability problems by allowing it.
Flags: wanted1.9.1+
Peter, do you think you'd have cycles for this? If not let me know. I doubt Heikki is planning to work on this any time soon :)
Assignee: hjtoi-bugzilla → peterv
I'll take a look at this.
Status: NEW → ASSIGNED
Priority: -- → P3
QA Contact: ashishbhatt → xml
Target Milestone: --- → mozilla1.9.1
Attached patch a patch for xml error detection (obsolete) — Splinter Review
Assignee: peterv → m_kato
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
Peter, any updates on this?
Schrep: Makoto Kato took this one.
Makoto Kato: are you still working on this?
(In reply to comment #19)
> Schrep: Makoto Kato took this one.
> Makoto Kato: are you still working on this?
> 

I will submit new patch and send review today.
Attachment #327210 - Attachment is obsolete: true
Attachment #331104 - Flags: superreview?(mrbkap)
Attachment #331104 - Flags: review?(mrbkap)
Error reporting doesn't really work with this patch: it points to the wrong position in the source. You would need to add code to HandleError to get the correct position in the file from the scanner. Another, simpler approach would be to allow consumers of the encoders to set the replacement character they want to use, and then the nsExpatDriver could set it to an invalid XML character (0xFFFF) through aParserContext.mScanner in WillBuildModel. That way Expat will catch the error and the error message would point to the right position. I think the only downside here is that the error message would be "not well-formed", which isn't technically correct. XML_ERROR_INCORRECT_ENCODING is the wrong error code to use in any case.
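The 0xFFFF trick works because XML 1.0's Char production excludes U+FFFE and U+FFFF while permitting the usual replacement character U+FFFD, so Expat itself will flag the substituted character as not well-formed. A minimal check of that production (illustrative only, not the actual Expat code):

```cpp
#include <cassert>
#include <cstdint>

// XML 1.0 "Char" production:
//   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
//            | [#x10000-#x10FFFF]
// U+FFFD (the converters' usual replacement) is legal, while U+FFFE and
// U+FFFF are not, so substituting 0xFFFF forces a well-formedness error.
bool IsXmlChar(uint32_t c) {
  return c == 0x9 || c == 0xA || c == 0xD ||
         (c >= 0x20 && c <= 0xD7FF) ||
         (c >= 0xE000 && c <= 0xFFFD) ||
         (c >= 0x10000 && c <= 0x10FFFF);
}
```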
Attached patch patch v3 (obsolete) — Splinter Review
Attachment #331104 - Attachment is obsolete: true
Attachment #331314 - Flags: review?(peterv)
Attachment #331104 - Flags: superreview?(mrbkap)
Attachment #331104 - Flags: review?(mrbkap)
Peter, could you review my patch v3?
Attachment #331314 - Flags: superreview?(peterv)
Comment on attachment 331314 [details] [diff] [review]
patch v3

I don't think we want to change the replacement character in the scanner from 0xFFFD to 0xFFFF, that would also affect HTML. As I mentioned before (in comment 22), I think there should be a way for the nsExpatDriver to set the error character through the scanner. I also think the scanner should then set the error character on the encoder, but I'm not a peer for the I18N library so you'd have to check with Simon Montagu or Jungshik Shin if that's the right thing to do.
Attachment #331314 - Flags: superreview?(peterv)
Attachment #331314 - Flags: review?(peterv)
Attachment #331314 - Flags: review-
Attached patch patch v4 (obsolete) — Splinter Review
Attachment #331314 - Attachment is obsolete: true
Attachment #332357 - Flags: review?(smontagu)
Note that v4 doesn't do what I proposed either.
Comment on attachment 332357 [details] [diff] [review]
patch v4

The changes to nsJapaneseToUnicode.cpp regress bug 108136.

> NS_IMETHODIMP nsTableDecoderSupport::ConvertNoBuff(const char * aSrc, 
>                                                    PRInt32 * aSrcLength, 
>                                                    PRUnichar * aDest, 
>                                                    PRInt32 * aDestLength)
> {
>   return nsUnicodeDecodeHelper::ConvertByTable(aSrc, aSrcLength,
>                                                aDest, aDestLength,
>                                                mScanClass, 
>-                                               mShiftInTable, mMappingTable);
>+                                               mShiftInTable, mMappingTable,
>+                                               mErrBehavior == kOnError_Signal ? PR_TRUE : PR_FALSE);
> }

"? PR_TRUE : PR_FALSE" is redundant here and elsewhere

> nsresult nsUnicodeDecodeHelper::ConvertByFastTable(

>+  for (; src<srcEnd;) {
>+    *dest = aFastTable[*src++];
>+    if (*dest == 0xfffb && aErrorSignal) {

Typo for 0xfffd?
Attachment #332357 - Flags: review?(smontagu) → review-
Attached patch patch v5 (obsolete) — Splinter Review
Attachment #332357 - Attachment is obsolete: true
Attachment #335358 - Flags: review?(smontagu)
Attachment #335358 - Flags: review?(smontagu) → review-
Comment on attachment 335358 [details] [diff] [review]
patch v5

I'm sorry: I made a mistake in my earlier comment "The changes to nsJapaneseToUnicode.cpp regress bug 108136." I should have said that they regressed bug 116882.
Attached patch patch v6 (obsolete) — Splinter Review
Attachment #335358 - Attachment is obsolete: true
Attachment #337431 - Flags: review?(smontagu)
Attachment #337431 - Flags: review?(smontagu) → review+
Attachment #337431 - Flags: superreview?(peterv)
Blocks: 289947
This should probably be a separate bug, but I wanted to report it here just in case there is some reason this case should not be triggering an XML Parsing Error message.

With this patch applied it is no longer possible to view the following URL:

http://wiki.mozilla-russia.org/index.php/Mozilla%20Firefox

Attempting to view this page results in the rather ugly, non-user-friendly XML Parsing Error message:

XML Parsing Error: not well-formed
Location: http://wiki.mozilla-russia.org/index.php/Mozilla%20Firefox
Line Number 26, Column 699:

An error message in this form should never be presented to a user because of errors in content. For chrome errors I guess it is OK, but for content errors, if browser.xul.error_pages.enabled is true, a XUL error page should be displayed.

I think we are probably correct in following the spec in not displaying the page in this case, because the validator at http://validator.w3.org/ refuses to attempt to validate the page because of this parsing error. It displays:

Sorry, I am unable to validate this document because on line 26  it contained one or more bytes that I cannot interpret as utf-8  (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

The error was: utf8 "\xD0" does not map to Unicode 

The issue with this particular error is that this page currently displays with the latest release version of every browser I have tried despite this error.  It is all well and good to say we are the only ones following the spec correctly, but we all know that argument is relatively worthless in the real world.

The other thing that is an issue about this particular case is that this is a wiki, so you can't just file a tech evangelism bug and get the page fixed on the server. Unless the wiki software is doing proper character encoding tests when people post, new pages will just be added that trigger this issue.

So, I think what is needed is more user-friendly XUL error pages for all XML parsing errors that occur in content and, for this error in particular, the page needs a "Try to Display Anyway" button that tries to parse and display the page using the previous, more lenient rules.
IE treats this as not well-formed (not for XHTML, which it lacks support for, but for generic XML and feeds), as does WebKit trunk (as part of the Acid3 fixes). As someone who primarily uses WebKit, I've not seen anything break as a result.

(In reply to comment #32)
> The other thing that is an issue about this particular case is that this is a
> wiki, so you can't just file a tech evangelism bug and get the page fixed on
> the server. Unless the wiki software is doing proper character encoding tests
> when people post, new pages will just be added that trigger this issue.

It is perfectly possible to write code that deals with it. The wiki, if it is going to output XHTML, should deal with it. If it can't be bothered checking everything in great detail, it should just use HTML (if it does use XHTML, it should probably use a serializer and thus not have to worry about such issues). Even if this is not checked, you'd then have to make sure you didn't have things like U+FFFF (a permanent noncharacter, equally breaking a well-formedness constraint).
Bill, the wiki in question is going out of its way to send different markup to IE and Firefox, and the markup it sends to Firefox is buggy.  It should just stop doing that (or fix its bug).

There are existing bugs on improving the error one gets on invalid XML.
Comment on attachment 337431 [details] [diff] [review]
patch v6

Doesn't do what I proposed, but more importantly:

>+      if (aScanner.HasInvalidCharacter())
>+        length = aScanner.FirstInvalidCharacterPosition();

is wrong given

>+      PRInt32 FirstInvalidCharacterPosition()
>+      {
>+        return Distance(mCurrentPosition, mFirstInvalidPosition);
>+      }

mFirstInvalidPosition can be in a different buffer from mCurrentPosition; length must only be the length of the buffer or less.
Attachment #337431 - Flags: superreview?(peterv) → superreview-
Attached patch patch v7 (obsolete) — Splinter Review
Attachment #337431 - Attachment is obsolete: true
At the first call of nsScanner::Append(), we don't know whether the current markup is HTML or XML.
If we set the invalid character (0xffff) for XML, we need to patch the current buffer (or reload?).
Kato, I think you should re-request review for this.
Attachment #344880 - Flags: superreview?(peterv)
Comment on attachment 344880 [details] [diff] [review]
patch v7

Peter, does this patch address your concerns?
Comment on attachment 344880 [details] [diff] [review]
patch v7

>diff --git a/intl/uconv/public/nsIUnicodeDecoder.h b/intl/uconv/public/nsIUnicodeDecoder.h

>+  NS_IMETHOD SetInputErrorBehavior(PRInt32 aBehavior) = 0;

This should be just virtual void.

>+  NS_IMETHOD GetCharacterForUnMapped(PRUnichar *aMap) = 0;

This should return PRUnichar.

>diff --git a/intl/uconv/ucvja/nsJapaneseToUnicode.cpp b/intl/uconv/ucvja/nsJapaneseToUnicode.cpp

>@@ -210,6 +221,10 @@
>    }
>    *aDestLen = dest - aDest;
>    return NS_OK;
>+error_invalidchar:
>+   *aDestLen = dest-aDest;

Spaces around -

>@@ -303,6 +325,8 @@

>                  // others 
>+                if (mErrBehavior == kOnError_Signal)
>+                  goto error_invalidchar;

Spacing is off.

>@@ -394,6 +426,10 @@
>    }
>    *aDestLen = dest - aDest;
>    return NS_OK;
>+error_invalidchar:
>+   *aDestLen = dest-aDest;

Spaces around -

>diff --git a/parser/htmlparser/src/nsScanner.cpp b/parser/htmlparser/src/nsScanner.cpp

>+        // if we failed, we consume one byte, replace it with unmapped

... with the unmapped

>+    while (--aErrorPos >= 0)
>+      ++mFirstInvalidPosition;

mFirstInvalidPosition.advance(aErrorPos)?

>+void nsScanner::UseDecoderUnmappedChar(PRBool aUseDecoderChar)

>+    if (mHasInvalidCharacter)
>+      ReplaceCharacter(mFirstInvalidPosition, 0xffff);
>+    mHasInvalidCharacter = PR_FALSE;

I'm not sure why this gets set to PR_FALSE again?

>diff --git a/parser/htmlparser/src/nsScanner.h b/parser/htmlparser/src/nsScanner.h

>+      PRPackedBool    mUseDecoderUnmappedChar;

Would it make more sense to make this PRUnichar mUnmappedCharOverride, with default value 0x0, set it to 0xFFFF for XML?
Attachment #344880 - Flags: superreview?(peterv) → superreview-
Attachment #361753 - Flags: superreview?
Attachment #361753 - Flags: superreview? → superreview?(peterv)
Comment on attachment 361753 [details] [diff] [review]
Patch v8 - addressing sr comments

>diff --git a/intl/uconv/public/nsIUnicodeDecoder.h b/intl/uconv/public/nsIUnicodeDecoder.h

>+  /**
>+   * Specify what to do when a character cannot be mapped into unicode
>+   */
>+  virtual void SetInputErrorBehavior(PRInt32 aBehavior) = 0;

Need to document the values (refer to the enum or something).

>+
>+  /**
>+   * return the UNICODE charater for unmapped character

chara*c*ter

>diff --git a/intl/uconv/util/nsUCSupport.cpp b/intl/uconv/util/nsUCSupport.cpp

>@@ -138,16 +151,23 @@ NS_IMETHODIMP nsBufferDecoderSupport::Co

>+    if (res == NS_ERROR_ILLEGAL_INPUT) {
>+      if (mErrBehavior == kOnError_Signal) {

Merge those ifs.

>+        break;
>+      }
>+    }

>diff --git a/parser/htmlparser/src/nsParser.cpp b/parser/htmlparser/src/nsParser.cpp

>+    mParserContext->mScanner->OverrideReplacementCharacter(0xffff);

Please move this into nsExpatDriver::WillBuildModel instead.

>diff --git a/parser/htmlparser/src/nsScanner.cpp b/parser/htmlparser/src/nsScanner.cpp

>@@ -179,18 +183,30 @@ nsresult nsScanner::SetDocumentCharset(c

>+     mUnicodeDecoder->SetInputErrorBehavior(nsIUnicodeDecoder::kOnError_Signal);
>+     if (mReplacementCharacter != 0xffff)

I think this should be |mReplacementCharacter == 0xfffd|

>+        // 0xffff is invalid character for XML (expat)
>+        if (mReplacementCharacter != 0xffff) {

Same here, == 0xfffd (and drop the comment).

>+          if (errorPos == -1)

and merge the if's

>+            errorPos = totalChars;
>+        }

>diff --git a/parser/htmlparser/src/nsScanner.h b/parser/htmlparser/src/nsScanner.h

>+       * Override replacement character used by IUnicodeDecoder.
>+       * Default behavior is that it uses IUnicodeDecoder's mapping.

nsIUnicodeDecoder

>+      nsScannerIterator            mFirstInvalidPosition; // The position we detects invalid character

The position of the first invalid character that was detected
Attachment #361753 - Flags: superreview?(peterv)
Attachment #361753 - Flags: superreview+
Attachment #361753 - Flags: review+
(In reply to comment #42)
> I think this should be |mReplacementCharacter == 0xfffd|

Hmm, I guess that won't work. It's a bit annoying to hardcode this 0xffff in so many places. That's why I proposed 0x0 as default, you could then just check for == 0x0 where you currently check != 0xffff and instead of setting mReplacementCharacter to GetCharacterForUnMapped just call GetCharacterForUnMapped in the one place that it's read, making

        unichars[unicharLength++] = mReplacementCharacter;

into something like

        unichars[unicharLength++] = mReplacementCharacter == 0x0 ?
                                    mUnicodeDecoder->GetCharacterForUnMapped() :
                                    mReplacementCharacter;

I don't think we should be worried about the overhead of repeatedly calling GetCharacterForUnMapped in case of multiple errors.
http://hg.mozilla.org/mozilla-central/rev/5953efc48779
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Attachment #344880 - Attachment is obsolete: true
Attachment #361753 - Attachment is obsolete: true
Keywords: helpwanted
Target Milestone: mozilla1.9.1 → mozilla1.9.2a1
Flags: wanted1.9.1+ → wanted1.9.1-
Sayrer, why did you change this to wanted1.9.1-?
(In reply to comment #46)
> Sayrer, why did you change this to wanted1.9.1-?

It's going to break a bunch of pages and feeds. Seems a little late for the 3.1 branch, don't you think?
That said, both Safari 4.0 and Opera 10.0 will exhibit this behavior as part of their Acid3 compliance.

What's the argument for excluding this while 3.1 is still in beta?

It would certainly be interesting to see the real world metrics of what would break.
We have the fix on trunk and it will be in the next release. Getting one more point in Acid3 for 3.1 is not important.
Fair enough :)

I know Acid3 isn't important by itself :p - the fact that Safari, Opera, and Chrome will or do exhibit this behavior is surely relevant though?
Unless I'm missing something, IE has been doing this properly since XML was introduced.
Julian, IE doesn't get application/xhtml+xml content (which is possibly mis-encoded) served to it, so its situation is quite different.
And just to make it more clear, this is likely to break at least some pages.  Maybe they're already broken in Safari and Opera.  Maybe they browser-sniff.  But I can pretty much guarantee that the number of pages that used to work and now won't will be nonzero.  That's fine, but I don't think we want to be taking that sort of change this late in the cycle...  _if_ this were to land in b3 I _might_ be ok with that.  Anything later than that is unacceptable.  My gut feeling is that this is not exactly urgent given the history of the bug, and a 6-month lag in getting the fix out won't kill anyone.
I assume that the last few comments are using "pages break" in the sense "broken pages used to be unreadable, but now aren't displayed at all"?

FWIW, we already know of at least one page that breaks in that sense: http://wiki.mozilla-russia.org/index.php/Mozilla%20Firefox referenced in comment 32.
> "broken pages used to be unreadable, but now aren't displayed at all"?

Or "page was readable except for one single mis-encoded character but now isn't displayed at all".
The reality is that errors will only get fixed in most sites when content breaks in the browser, not beforehand.

In that regard, it makes no real difference if it is now or six months down the line.

For that reason, if it were my choice, I would seek to get this committed for Beta 3.

If standards compliance is claimed in markup, it should be held to that contract.

Leaving this just prolongs interoperability issues and, as new content is created, just exacerbates the issue.
(In reply to comment #57)
> The reality is that errors will only get fixed in most sites when content
> breaks in the browser, not before hand.
> 
> In that regard, it makes no real difference if it is now or six months down the
> line.

Don't underestimate the power of nightly users and beta users to notify sites to get their content fixed.  A few extra months of that can make a huge difference in web compatibility.

(As for me, I think this would have been good in the previous beta, but I think now's really too late to do it for this release cycle.  Standards compatibility is a long-haul process; not every win has to be made NOW, for THIS RELEASE, to slightly overemphasize some people's demands a little.)
Nick, we have people whose job it is to contact sites that break (which we find out about from our nightly users) and get them to fix their stuff.  It's a lot easier for that to work if the sites have as much lead time as possible between things changing on our end and us shipping.

I'm basically with Jeff on the scheduling here.
It's also going to affect us more than, say, Safari, because they ignore these errors in their feed parser. The current arrangement allowed us to get a lot of XML edge cases right that Safari doesn't but IE does, and also avoid a lot of fatal errors that come down to "smart quotes" and things.

I'm a little annoyed that this is going to make a bunch of feed work for me when we don't get a lot of parsing bugs reported there as it stands.
Now with a Beta4 planned, might this Bug be a case of:
"- complete: no follow up fixes or initial implementations that commit us to
extending the schedule further
 - proven: patches with tests, security reviews and performance impact
analysis, baked on trunk for a good amount of time
 - valuable: we need to understand the benefit of taking the change, and why
it's needed for 3.1 instead of later
 - removable: backout-ready, if anything goes wrong"

Or still no way for 1.9.1?
I would argue that this is not 1.9.1 material, even given more beta cycles. I'd like this to be baked on the trunks for many months before even considering this for a final release, which would mean it'll be in 1.9.2 (or whatever the number of the next release will be).
I think this also needs a more user-friendly (or at least better-looking) display of the error message, now that this error can be triggered by content rather than just chrome errors. Perhaps I will file a followup bug on that.
Depends on: 498607
Depends on: 545061
Depends on: 651074
You need to log in before you can comment on or make changes to this bug.