Closed Bug 107790 Opened 23 years ago Closed 9 years ago

nsIUnicodeDecoder, add an option to proceed the conversion skipping errors.

Categories: Core :: Internationalization, defect
Platform: x86, Windows 2000
Type: defect
Priority: Not set
Severity: normal
Status: RESOLVED WORKSFORME
People: Reporter: nhottanscp, Assigned: smontagu
Attachments: 2 files

The current behavior of the Unicode decoder, when the input is not in the valid range for the charset, is to abort the process and return an error.
This is not convenient for callers who want to ignore those out-of-range characters and continue the conversion.
Can we do something similar to nsIUnicodeEncoder::SetOutputErrorBehavior?
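
To illustrate what is being asked for, here is a minimal, self-contained sketch (not Mozilla code; StrictDecode and DecodeSkippingErrors are made-up stand-ins for a strict decoder and the caller-side workaround): on every error the caller keeps the output produced so far, emits U+FFFD for the offending byte, and resumes. The request is for the decoder itself to offer this behavior as an option, roughly the way nsIUnicodeEncoder::SetOutputErrorBehavior does on the encoding side.

// Sketch only: a strict "decoder" that rejects non-ASCII bytes, plus the
// skip-and-replace loop callers currently have to write around it.
#include <cstddef>
#include <cstdio>
#include <string>

// Stub for a strict decoder: converts ASCII bytes, stops at the first byte it
// cannot map, and reports how much input it consumed before the error.
static bool StrictDecode(const unsigned char* src, size_t srcLen,
                         std::u16string& dst, size_t& consumed) {
  for (consumed = 0; consumed < srcLen; ++consumed) {
    if (src[consumed] >= 0x80) {
      return false;  // byte outside the "charset"
    }
    dst.push_back(static_cast<char16_t>(src[consumed]));
  }
  return true;
}

// Caller-side workaround: keep the partial output, substitute U+FFFD for the
// bad byte, skip it, and continue instead of aborting the whole conversion.
static std::u16string DecodeSkippingErrors(const unsigned char* src,
                                            size_t srcLen) {
  std::u16string out;
  size_t pos = 0;
  while (pos < srcLen) {
    size_t consumed = 0;
    if (StrictDecode(src + pos, srcLen - pos, out, consumed)) {
      break;               // reached the end cleanly
    }
    pos += consumed;       // keep what was converted before the error
    out.push_back(0xFFFD); // REPLACEMENT CHARACTER for the offending byte
    ++pos;                 // skip it and resume
  }
  return out;
}

int main() {
  const unsigned char input[] = {'a', 0xE9, 'b', 'c'};  // 0xE9 is illegal here
  std::u16string result = DecodeSkippingErrors(input, sizeof(input));
  std::printf("decoded %zu UTF-16 code units\n", result.size());  // prints 4
  return 0;
}

With the decoder doing this internally, callers such as the HTTP parser and the CSS loader could drop their own copies of this loop.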
Changing the summary: this bug is for nsIUnicodeDecoder, not nsIUnicodeEncoder.
Summary: nsIUnicodeEncoder, add an option to proceed the conversion skipping errors. → nsIUnicodeDecoder, add an option to proceed the conversion skipping errors.
Blocks: 106843
Can it be done for 0.9.6 (less than a week from now)?
*** Bug 107712 has been marked as a duplicate of this bug. ***
Why do we need this?
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.9
We need this because the HTTP parser is already implementing this functionality on its own. The CSS loader also needs this functionality. Rather than having both of them implement it, it makes a lot more sense to move it into the Unicode decoder, which should be able to handle conversion errors much better because it has more knowledge of exactly what state it is in when the error occurs.

In particular, give bug 106843 a read for an example of why this is wanted.
*** Bug 114209 has been marked as a duplicate of this bug. ***
*** Bug 115805 has been marked as a duplicate of this bug. ***
Resetting the target milestone to --- and giving this to shanjian. It seems like a big chunk of work.
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Target Milestone: mozilla0.9.9 → ---
Could we get a realistic target assessment here so we know whether to work
around this in the CSSLoader, please?  The "chunk of work" is already done in
the CSS Parser and needs to happen in the CSS Loader too unless the decoders do it.
I'm actually a former victim of this "bug". Well, I think the character encoding workaround for CSS shouldn't be done. Just because somebody screws up or mixes encodings (like me...) isn't worth creating a workaround. It's an HTML editor's fault and must therefore be fixed by HTML editors, not by the browser. Supporting inconsistently encoded HTML pages is vital for being able to read them (or parts at least), but CSS is not.
I think this is not a bug and must not be fixed. Support standards and don't make **** code work like MSIE does.
*** Bug 125331 has been marked as a duplicate of this bug. ***
> Just because somebody screws or mixes encodings

Well..... There is no standard saying what encoding should be used for
stylesheets when none is specified in the sheet.  We use the document encoding,
but a slightly different reading of things could lead to ISO-8859-1 being used
as default by a different browser.  Furthermore, Mozilla itself used to default
to ISO-8859-1 instead of the document charset.  So web authors may be expecting
that behavior...
> So web authors may be expecting that behavior...

Yep, that was how I became aware of this "bug", because all older versions didn't choke on that one ;)

BTW, I think there is a standard, although not directly mentioned... Posting <form> data, for example, implies that the HTML document's character encoding is to be used for encoding the form data too. The "accept-charset" attribute changes this behaviour. The same goes for the <link> tag, but with the attribute "charset".
http://www.w3.org/TR/REC-html40/struct/links.html#edef-LINK
http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.3
As you can read there, accept-charset and charset are #IMPLIED. Given this information, I still think this isn't a bug but a standards-compliant implementation. My interpretation of the W3C definitions might be wrong though...
Blocks: 126643
Keywords: mozilla1.0
Simon, this is the bug we were talking about.
Attached patch: Possible fix
This patch is with diff -u, and includes standardization of tabs and indents
Keywords: nsbeta1
Hmmm, my patch doesn't really address the central issue of this bug as reported,
although it does fix a number of sites that are currently broken. Maybe it
should be punted to a new bug. 

The comments in nsIUnicodeDecoder.h say:

 * Error conditions: 
 * If the read value does not belong to this character set, one should 
 * replace it with the Unicode special 0xFFFD. When an actual input error is 
 * encountered, like a format error, the converter stop and return error.
 * Hoever, we should keep in mind that we need to be lax in decoding.

I believe that the specific case of UTF-8 is a classic example where we should
be lax. In the real-world examples where we are decoding an ISO-8859-1 page as
if it were UTF-8, being lax will let us retrieve all the characters <= 0x7E
correctly and render as expected.

This argument does not apply to sequences which are decodable but illegal
because they are not the minimum encoding, which is why I continue to return an
error in these cases.
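
To make that distinction concrete, here is a minimal, self-contained sketch (not the actual Mozilla decoder; LaxDecodeUTF8 is a made-up name, and 3- and 4-byte sequences are left out for brevity): bytes that do not form a valid sequence become U+FFFD, so an ISO-8859-1 page mis-labelled as UTF-8 keeps everything up to 0x7E, while overlong ("non-minimal") 2-byte forms are still treated as a hard error.

#include <cstddef>
#include <optional>
#include <string>

// Lax UTF-8 decode limited to 1- and 2-byte sequences for brevity.
// Stray or unsupported bytes become U+FFFD; overlong 2-byte forms
// (lead bytes 0xC0/0xC1) still fail the whole conversion.
std::optional<std::u16string> LaxDecodeUTF8(const unsigned char* src, size_t len) {
  std::u16string out;
  for (size_t i = 0; i < len; ) {
    unsigned char b = src[i];
    if (b < 0x80) {                       // ASCII: always recoverable
      out.push_back(static_cast<char16_t>(b));
      ++i;
    } else if (b == 0xC0 || b == 0xC1) {  // overlong (non-minimal) encoding
      return std::nullopt;                // refuse, per the argument above
    } else if (b >= 0xC2 && b <= 0xDF &&  // well-formed 2-byte sequence
               i + 1 < len && (src[i + 1] & 0xC0) == 0x80) {
      out.push_back(static_cast<char16_t>(((b & 0x1F) << 6) | (src[i + 1] & 0x3F)));
      i += 2;
    } else {                              // undecodable byte: be lax
      out.push_back(0xFFFD);
      ++i;
    }
  }
  return out;
}

Feeding it the ISO-8859-1 bytes for "café" (63 61 66 E9) yields "caf" followed by U+FFFD, whereas the overlong pair C0 AF makes the call return no result at all.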
Giving it to Simon since he is working on it.
Assignee: shanjian → smontagu
*** Bug 128896 has been marked as a duplicate of this bug. ***
For risk reasons, we should fix the particular issue instead of the general issue.
So, nsbeta1-; file a separate bug for the CSS issue and nominate that bug for nsbeta1.
Keywords: nsbeta1 → nsbeta1-
So, per ftang's comment, should we reopen bug 128896?
OK, reopening 128896
Status: NEW → ASSIGNED
*** Bug 278291 has been marked as a duplicate of this bug. ***
QA Contact: teruko → i18n
Fixed long ago.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME