Closed Bug 638379 Opened 9 years ago Closed 7 years ago

make nsIUnicodeDecoder::SetInputErrorBehavior reliable and stop using nsIUnicodeDecoder::Reset for error handling

Categories

(Core :: Internationalization, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla20

People

(Reporter: dbaron, Assigned: emk)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 1 obsolete file)

We currently use nsIUnicodeDecoder::Reset for error handling, though we've added nsIUnicodeDecoder::SetInputErrorBehavior that's implemented in some decoders.  If it were implemented reliably in all decoders, we could avoid the Reset() pattern.  This Reset() pattern is bad (see bug 335944) because it confuses resetting a decoder for an entirely new stream with this error handling behavior.  This means that, for example, we might redo BOM detection in the middle of a stream after error handling.
Do we actually want the decoders to do BOM detection at all? Which callers benefit from it? For markup parsers, I'd be happier to have BOM sniffing be entirely the parser's responsibility. See also bug 634541.
I wrote a workaround in bug 784367.
Duplicate of this bug: 782721
Assignee: smontagu → VYV03354
Attachment #687025 - Flags: review?(smontagu)
Henri, asking you for review because you reviewed bug 663057.
Attachment #687026 - Flags: review?(hsivonen)
Attachment #687026 - Attachment is patch: true
Comment on attachment 687025 [details] [diff] [review]
Part 1: Implement kOnError_Recover to the UTF-8 decoder

Review of attachment 687025 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/uconv/src/nsScriptableUConv.cpp
@@ +260,5 @@
>  
> +  if (NS_FAILED(rv) || !ccm) {
> +    return rv;
> +  }
> +  // get charset atom due to getting unicode converter

Please remove this dead comment while you're here.

@@ +283,5 @@
> +  // Some callers depend on the UTF-8 decoder throws.
> +  nsAutoCString charset;
> +  ccm->GetCharsetAlias(mCharset.get(), charset);
> +  if (charset.EqualsLiteral("UTF-8")) {
> +    mDecoder->SetInputErrorBehavior(nsIUnicodeDecoder::kOnError_Signal);

I don't really understand this bit. Why only UTF-8? And if it's "some callers", shouldn't the callers be setting the behaviour?

I also don't like the call to GetCharsetAlias when GetUnicodeDecoder has already done that. You can leave it for now if there's no choice, but maybe we should have an API on the decoder/encoder to return its canonical charset name. Does that make sense?
(In reply to Simon Montagu from comment #8)
> Please remove this dead comment while you're here.

Will do.

> @@ +283,5 @@
> I don't really understand this bit. Why only UTF-8?

Because only UTF-8 used to ignore kOnError_Recover.

> And if it's "some
> callers", shouldn't the callers be setting the behaviour?

Fixing in-tree callers would be easy.
But "some callers" include third-party add-ons because nsIScriptableUnicodeConverter is scriptable. So it's nearly impossible to audit and fix all callers.
Moreover, nsIScriptableUnicodeConverter does not provide a way to set the behavior. I want to avoid as far as possible changing this legacy interface. Eventually it should be replaced by TextEncoder and TextDecoder.

> I also don't like the call to GetCharsetAlias when GetUnicodeDecoder has
> already done that. You can leave it for now if there's no choice, but maybe
> we should have an API on the decoder/encoder to return its canonical charset
> name. Does that make sense?

Another option is exposing nsCharsetAlias::GetPreferredInternal to nsIScriptableUnicodeConverter and use it with GetUnicodeDecoderRaw.
- Removed a bogus comment.
- Elaborated a comment a bit more.
- nsScriptableUnicodeConverter is now a friend of nsCharsetAlias.

Try result:
https://tbpl.mozilla.org/?tree=Try&rev=d02d6a13249c
Attachment #687025 - Attachment is obsolete: true
Attachment #687025 - Flags: review?(smontagu)
Attachment #689963 - Flags: review?(smontagu)
Attachment #689963 - Flags: review?(smontagu) → review+
Comment on attachment 687027 [details] [diff] [review]
Part 3: Remove workaround for unreliable inputErrorBehavior

r+ on the assumption that the default behavior across all decoders really is to produce the REPLACEMENT CHARACTER in the decoder and never to return NS_ERROR_ILLEGAL_INPUT.
Attachment #687027 - Flags: review?(hsivonen) → review+
Comment on attachment 687026 [details] [diff] [review]
Part 2: Fix nsMIMEHeaderParamImpl

Why is this change needed?
Blocks: 819868
Filed bug 819868 as a follow-up.
Also filed bug 819869.
Blocks: 819869
(In reply to Henri Sivonen (:hsivonen) from comment #12)
> Comment on attachment 687026 [details] [diff] [review]
> Part 2: Fix nsMIMEHeaderParamImpl
> 
> Why is this change needed?

netwerk/test/unit/test_MIME_params.js failed without the change. It seems that the function expects that the conversion will fail if the input is invalid as UTF-8.
Unlike nsIScriptableUnicodeConverter, ConvertStringToUTF8 already had aAllowSubstitution parameter. So I changed the caller instead of simulating the old quirks in nsUTF8ConverterService.
Comment on attachment 687026 [details] [diff] [review]
Part 2: Fix nsMIMEHeaderParamImpl

OK. Thanks for the explanation.
Attachment #687026 - Flags: review?(hsivonen) → review+
Keywords: checkin-needed
Depends on: 820791
Depends on: 858433
Depends on: 859172
Depends on: 843434
You need to log in before you can comment on or make changes to this bug.