Closed Bug 107790 Opened 23 years ago Closed 9 years ago

nsIUnicodeDecoder, add an option to proceed the conversion skipping errors.

Categories: Core :: Internationalization, defect
Platform: x86, Windows 2000
Type: defect
Priority: Not set
Severity: normal
Status: RESOLVED WORKSFORME
People: Reporter: nhottanscp, Assigned: smontagu
Attachments: 2 files

The current behavior of the Unicode decoder, when the input is not in the valid range for the charset, is to abort the process and return an error.
This is not convenient for callers who want to ignore those out-of-range characters and continue the conversion.
Can we do something similar to nsIUnicodeEncoder::SetOutputErrorBehavior?
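
To illustrate what is being asked for, here is a minimal, self-contained sketch (not Mozilla code; StrictDecode and DecodeSkippingErrors are made-up stand-ins for a strict decoder and the caller-side workaround): on every error the caller keeps the output produced so far, emits U+FFFD for the offending byte, and resumes. The request is for the decoder itself to offer this behavior as an option, roughly the way nsIUnicodeEncoder::SetOutputErrorBehavior does on the encoding side.

// Sketch only: a strict "decoder" that rejects non-ASCII bytes, plus the
// skip-and-replace loop callers currently have to write around it.
#include <cstddef>
#include <cstdio>
#include <string>

// Stub for a strict decoder: converts ASCII bytes, stops at the first byte it
// cannot map, and reports how much input it consumed before the error.
static bool StrictDecode(const unsigned char* src, size_t srcLen,
                         std::u16string& dst, size_t& consumed) {
  for (consumed = 0; consumed < srcLen; ++consumed) {
    if (src[consumed] >= 0x80) {
      return false;  // byte outside the "charset"
    }
    dst.push_back(static_cast<char16_t>(src[consumed]));
  }
  return true;
}

// Caller-side workaround: keep the partial output, substitute U+FFFD for the
// bad byte, skip it, and continue instead of aborting the whole conversion.
static std::u16string DecodeSkippingErrors(const unsigned char* src,
                                            size_t srcLen) {
  std::u16string out;
  size_t pos = 0;
  while (pos < srcLen) {
    size_t consumed = 0;
    if (StrictDecode(src + pos, srcLen - pos, out, consumed)) {
      break;               // reached the end cleanly
    }
    pos += consumed;       // keep what was converted before the error
    out.push_back(0xFFFD); // REPLACEMENT CHARACTER for the offending byte
    ++pos;                 // skip it and resume
  }
  return out;
}

int main() {
  const unsigned char input[] = {'a', 0xE9, 'b', 'c'};  // 0xE9 is illegal here
  std::u16string result = DecodeSkippingErrors(input, sizeof(input));
  std::printf("decoded %zu UTF-16 code units\n", result.size());  // prints 4
  return 0;
}

With the decoder doing this internally, callers such as the HTTP parser and the CSS loader could drop their own copies of this loop.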
Changing the summary: this bug is for nsIUnicodeDecoder, not nsIUnicodeEncoder.
Summary: nsIUnicodeEncoder, add an option to proceed the conversion skipping errors. → nsIUnicodeDecoder, add an option to proceed the conversion skipping errors.
Blocks: 106843
Can it be done for 0.9.6 (less than a week from now)?
*** Bug 107712 has been marked as a duplicate of this bug. ***
Why do we need this?
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.9
We need this because the HTTP parser is already implementing this functionality on its own. The CSS loader also needs this functionality. Rather than having both of them implement it, it makes a lot more sense to move it into the Unicode decoder, which should be able to handle conversion errors much better because it has more knowledge of exactly what state it is in when the error occurs.

In particular, give bug 106843 a read for an example of why this is wanted.
*** Bug 114209 has been marked as a duplicate of this bug. ***
*** Bug 115805 has been marked as a duplicate of this bug. ***
Resetting the target milestone to --- and giving this to shanjian. It seems like a big chunk of work.
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Target Milestone: mozilla0.9.9 → ---
Could we get a realistic target assessment here so we know whether to work
around this in the CSSLoader, please?  The "chunk of work" is already done in
the CSS Parser and needs to happen in the CSS Loader too unless the decoders do it.
I'm actually a former victim of this "bug". Well, I think the character encoding workaround for CSS shouldn't be done. Just because somebody screws up or mixes encodings (like me...) isn't worth creating a workaround. It's an HTML editor's fault and must therefore be fixed by HTML editors, not by the browser. Supporting inconsistently encoded HTML pages is vital for being able to read them (or parts at least), but CSS is not.
I think this is not a bug and must not be fixed. Support standards and don't make **** code work like MSIE does.
*** Bug 125331 has been marked as a duplicate of this bug. ***
> Just because somebody screws or mixes encodings

Well..... There is no standard saying what encoding should be used for
stylesheets when none is specified in the sheet.  We use the document encoding,
but a slightly different reading of things could lead to ISO-8859-1 being used
as default by a different browser.  Furthermore, Mozilla itself used to default
to ISO-8859-1 instead of the document charset.  So web authors may be expecting
that behavior...
> So web authors may be expecting that behavior...

Yep, that was how I became aware of this "bug", because all older versions didn't choke on that one ;)

BTW, I think there is a standard, although not directly mentioned... Posting <form> data, for example, implies that the HTML document's character encoding is to be used for encoding the form data too. The "accept-charset" attribute changes this behaviour. The same goes for the <link> tag, but with the attribute "charset".
http://www.w3.org/TR/REC-html40/struct/links.html#edef-LINK
http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.3
As you can read there, accept-charset and charset are #IMPLIED. Given this information, I still think this isn't a bug but a standards-compliant implementation. My interpretation of the W3C definitions might be wrong though...
Blocks: 126643
Keywords: mozilla1.0
Simon, this is the bug we were talking about.
Attached patch: Possible fix
This patch is with diff -u, and includes standardization of tabs and indents
Keywords: nsbeta1
Hmmm, my patch doesn't really address the central issue of this bug as reported,
although it does fix a number of sites that are currently broken. Maybe it
should be punted to a new bug. 

The comments in nsIUnicodeDecoder.h say:

 * Error conditions: 
 * If the read value does not belong to this character set, one should 
 * replace it with the Unicode special 0xFFFD. When an actual input error is 
 * encountered, like a format error, the converter stop and return error.
 * Hoever, we should keep in mind that we need to be lax in decoding.

I believe that the specific case of UTF-8 is a classic example where we should
be lax. In the real-world examples where we are decoding an ISO-8859-1 page as
if it were UTF-8, being lax will let us retrieve all the characters <= 0x7E
correctly and render as expected.

This argument does not apply to sequences which are decodable but illegal
because they are not the minimum encoding, which is why I continue to return an
error in these cases.
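
To make that distinction concrete, here is a minimal, self-contained sketch (not the actual Mozilla decoder; LaxDecodeUTF8 is a made-up name, and 3- and 4-byte sequences are left out for brevity): bytes that do not form a valid sequence become U+FFFD, so an ISO-8859-1 page mis-labelled as UTF-8 keeps everything up to 0x7E, while overlong ("non-minimal") 2-byte forms are still treated as a hard error.

#include <cstddef>
#include <optional>
#include <string>

// Lax UTF-8 decode limited to 1- and 2-byte sequences for brevity.
// Stray or unsupported bytes become U+FFFD; overlong 2-byte forms
// (lead bytes 0xC0/0xC1) still fail the whole conversion.
std::optional<std::u16string> LaxDecodeUTF8(const unsigned char* src, size_t len) {
  std::u16string out;
  for (size_t i = 0; i < len; ) {
    unsigned char b = src[i];
    if (b < 0x80) {                       // ASCII: always recoverable
      out.push_back(static_cast<char16_t>(b));
      ++i;
    } else if (b == 0xC0 || b == 0xC1) {  // overlong (non-minimal) encoding
      return std::nullopt;                // refuse, per the argument above
    } else if (b >= 0xC2 && b <= 0xDF &&  // well-formed 2-byte sequence
               i + 1 < len && (src[i + 1] & 0xC0) == 0x80) {
      out.push_back(static_cast<char16_t>(((b & 0x1F) << 6) | (src[i + 1] & 0x3F)));
      i += 2;
    } else {                              // undecodable byte: be lax
      out.push_back(0xFFFD);
      ++i;
    }
  }
  return out;
}

Feeding it the ISO-8859-1 bytes for "café" (63 61 66 E9) yields "caf" followed by U+FFFD, whereas the overlong pair C0 AF makes the call return no result at all.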
Giving it to Simon since he is working on it.
Assignee: shanjian → smontagu
*** Bug 128896 has been marked as a duplicate of this bug. ***
For risk reasons, we should fix the particular issue instead of the general issue.
So, nsbeta1-; file a separate bug for the CSS issue and nominate that bug for nsbeta1.
Keywords: nsbeta1 → nsbeta1-
So, per ftang's comment, should we reopen bug 128896?
OK, reopening 128896
Status: NEW → ASSIGNED
*** Bug 278291 has been marked as a duplicate of this bug. ***
QA Contact: teruko → i18n
Fixed long ago.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME