Closed Bug 1363281 Opened 3 years ago Closed 3 years ago

Port bug 1261841 |Replace uconv with encoding_rs| to mailnews

Categories

(MailNews Core :: Internationalization, enhancement)

enhancement
Not set

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 56.0

People

(Reporter: jorgk-bmo, Assigned: jorgk-bmo)

References

()

Details

Attachments

(8 files, 20 obsolete files)

10.02 KB, patch
jcranmer
: review+
Details | Diff | Splinter Review
65.65 KB, patch
mkmelin
: review+
hsivonen
: feedback+
Details | Diff | Splinter Review
2.60 KB, patch
Details | Diff | Splinter Review
4.59 KB, patch
Details | Diff | Splinter Review
1.82 KB, patch
Details | Diff | Splinter Review
3.41 KB, patch
aceman
: review+
Details | Diff | Splinter Review
2.57 KB, patch
Details | Diff | Splinter Review
1.07 KB, patch
Details | Diff | Splinter Review
+++ This bug was initially created as a clone of Bug #1261841 +++

Time to make as start.
Attached patch WIP: 1363281-uconv.patch (obsolete) — Splinter Review
This fixes a few compile errors. Roughly ten more source files to go.
Assignee: nobody → jorgk
Status: NEW → ASSIGNED
Attached patch WIP: 1363281-uconv.patch (v2) (obsolete) — Splinter Review
More work required in nsMsgI18N.cpp, MIME and nsMsgCompose.cpp.
Attachment #8872464 - Attachment is obsolete: true
Attached patch WIP: 1363281-uconv.patch (v3). (obsolete) — Splinter Review
nsMsgI18N.cpp compiles now. Still compile errors in MIME and nsMsgCompose.cpp.

Also in: nsUTF7ToUnicode.cpp and nsUnicodeToUTF7.cpp.

Henri, can you bring me up to speed with UTF-7. What's the story? Apparently used for IMAP.
Flags: needinfo?(hsivonen)
Looks like mailnews had its own UTF-7 decoder. And that relied on the removed nsUCSupport.h with classes like nsEncoderSupport.
Attached patch WIP: 1363281-uconv.patch (v4). (obsolete) — Splinter Review
This compiles all the way through (well, so far).

Work left is marked MORE WORK in:
- nsMsgCompose.cpp
- mimemoz2.cpp
- mimetext.cpp (fight over mozilla::Decoder* or UniquePtr<mozilla::Decoder>)

Also, compilation of nsUnicodeToUTF7.cpp and nsUTF7ToUnicode.cpp disabled.
That could be a bucket of fun since the underlying framework as gone away.
Attachment #8872556 - Attachment is obsolete: true
Attachment #8872656 - Attachment is obsolete: true
I looked through my e-mail database for UTF-7. So its needed for IMAP and by the looks of it we've already handled it in the source code I stopped compiling for now.

So how can that be adapted to the new scheme?

In fact, the compile didn't go though, choking on:
https://dxr.mozilla.org/comm-central/rev/62db49c50e60514a4178651227629a9625381cb4/mailnews/build/nsMailModule.cpp#1333
https://dxr.mozilla.org/comm-central/rev/62db49c50e60514a4178651227629a9625381cb4/mailnews/build/nsMailModule.cpp#1371

So I think what I've done so far, some little tweaks here and there to use the new encoder/decoder, is nothing compared to integrating UTF-7 support into the new scheme.
Attached patch WIP: 1363281-uconv.patch (v4b). (obsolete) — Splinter Review
nsMailModule.cpp forced to compile by removing the UTF-7 stuff. Now it really compiles all the way to the end. It even starts and runs and shows some messages.
Attachment #8872689 - Attachment is obsolete: true
Didn't investigate the feasibility, but maybe we can bite the bullet and move to a JavaScript utf-7 encoder. 
https://github.com/kkaefer/utf7 is apparently what gaia email ended up using for that.
And yes, we need utf-7 for email since the IMAP spec used that :(
https://tools.ietf.org/html/rfc3501#section-5.1.3
Attached patch WIP: 1363281-uconv.patch (v4c). (obsolete) — Splinter Review
OK, of the open issues I fixed the pointer business in MIME. So issues left are:
- nsMsgCompose.cpp
- mimemoz2.cpp
- UTF-7 stuff.
The first two are easy, I'll do them in the next 24 hours.
Attachment #8872710 - Attachment is obsolete: true
Well, I renamed an IMAP folder from HUHU to hühü and saw no problem.
(In reply to Magnus Melin from comment #10)
> Didn't investigate the feasibility, but maybe we can bite the bullet and
> move to a JavaScript utf-7 encoder. 
We have a functioning UTF-7 encoder/decoder. The problem is that previously it was registered with Gecko as a "custom" encoder and it shared a common interface with other decoders, see nsEncoderSupport.

So if the set of encoders is now "closed" and can't be extended, which I don't know, then all we can do is something like:

if (charset == "utf-7")
  call mailnews' own utf-7 encoder
else
  call the gecko decoder.

That would be very cumbersome since encoder objects are retrieved once for a given charset and then passed around, previously as nsIUnicodeEncoder*, now as mozilla::Encoder*.

I hope Henri will advise us on the issue.
IMAP modified UTF-7 ("x-imap4-modified-utf7" in old code) and email UTF-7 ("UTF-7" in old code) are different.

I suggest handling the IMAP modified UTF-7 case outside the converter framework: refactoring the existing code for usage directly from the IMAP code without the converter participating in the converter framework used for headers and message content at all.

In https://groups.google.com/d/msg/mozilla.dev.apps.thunderbird/8R8HU3gvif4/NJzZ9QkNBAAJ I said of email UTF-7:
"I suggest introducing a wrapper type mozilla::EmailEncoding that wraps a pointer to mozilla::Encoding or nullptr. If [the wrapped pointer is] nullptr, the type represents UTF-7, otherwise it represents a Web encoding and delegates to the wrapped encoding."

Dropping UTF-7 support for message body and headers would be a fine outcome if it's feasible to do so. IIRC, jcranmer has studied the feasibility and, as noted above, the Gaia email app ended up implementing UTF-7 decoding.

It's sad that in 2017 we are still discussing UTF-7, when http://web.archive.org/web/19981202094534/http://www.imc.org/imcr-010.html from *1998* said "Fortunately, very few vendors implemented UTF-7, and its use is strongly discouraged in Internet mail."
Flags: needinfo?(hsivonen) → needinfo?(Pidgeot18)
Attached patch WIP: 1363281-uconv.patch (v5). (obsolete) — Splinter Review
Fixed nsMsgCompose.cpp. No idea why there was such complicated code to convert UTF-8 to UTF-16, hmm. One left to go: mimemoz2.cpp.
Attachment #8872770 - Attachment is obsolete: true
Attached patch WIP: 1363281-uconv.patch (v6) (obsolete) — Splinter Review
So this compiles and identifies all the call sites that need to be changed.

Henri, would you be able to look this over and point all the errors I made. Most likely there are many and some stuff is clumsy :-( - I also left questions in comments in the code.

I've basically looked at https://hg.mozilla.org/try/rev/6767a326ba28 while doing the conversion.

I didn't read all the documentation in Encoding.h, I just wanted to get a feeling for the size of the job.

I wasn't able to work out how to set the replacement character. Also, I copied the code for inserting '?', but I'm asking myself whether '?' is valid to be inserted into the encoded stream.

Also please keep in mind that Thunderbird code is old and ugly, particularly the MIME code from the year 2000 which started as C code.

Oh, in MIME, class MimeDisplayOptions, I'm using mozilla::Encoder* instead of UniquePtr<mozilla::Encoder> as members simply since those objects are assigned in mimeeobj.cpp |MimeDisplayOptions newopt = *obj->options;|, and that caused a compile error (asking for a cast). Most likely that's also wrong.

Finally keep in mind that we Thunderbird people are "jacks of all trades, masters of none". We have to interface with all areas of Mozilla core, DOM, editor, storage, JS, Intl, just to name a few and we really can't know all that stuff as well as the highly qualified Mozilla module owners and peers.

Lastly:
(In reply to Henri Sivonen (:hsivonen) from comment #15)
> I suggest handling the IMAP modified UTF-7 case outside the converter
> framework: refactoring the existing code for usage directly from the IMAP
> code without the converter participating in the converter framework used for
> headers and message content at all.
Agreed. I have yet to investigate where this is used in the IMAP code. So the UTF-7 stuff will be covered in a second patch.
Attachment #8872950 - Attachment is obsolete: true
Attachment #8873187 - Flags: feedback?(hsivonen)
(In reply to Jorg K (GMT+2) from comment #17)
> I wasn't able to work out how to set the replacement character.

There isn't a way to set it, by design. For decode, there should be no legitimate reason to use a replacement other than U+FFFD. For encode, there is only one replacement approach--decimal NCRs--that's relevant to the two cases where legacy (i.e. non-UTF-8) encode is exposed in the Web Platform. For cases other than URL query string parsing and form submission, output should be UTF-8 anyway, in which case there's no need for replacement.

If you want a different replacement, you need to implement it yourself on top of the lower-level API (the *WithoutReplacement entry points).

> Also, I
> copied the code for inserting '?', but I'm asking myself whether '?' is
> valid to be inserted into the encoded stream.

It's valid in the sense that it's an ASCII byte that's not an ISO-2022-JP escape or a byte whose meaning differs between ISO-2022-JP ASCII and Roman modes.

The easy way to avoid having to consider this issue is to make Thunderbird always and only use UTF-8 for outgoing email.
(In reply to Henri Sivonen (:hsivonen) from comment #18)
> There isn't a way to set it, by design.

See https://docs.rs/encoding_rs/0.6.10/encoding_rs/#no-convenience-api-for-custom-replacements for design philosophy.
Thanks for the comments so far, I hope you can give me some feedback on the code as well.

(In reply to Henri Sivonen (:hsivonen) from comment #18)
> The easy way to avoid having to consider this issue is to make Thunderbird
> always and only use UTF-8 for outgoing email.
I know you've been advocating this for years ;-) - bug 862292. I'm not up-to-speed with this discussion, but I understand that conservative Japanese forces what to adhere to ISO-2022-JP, CTE 7bit and even plain text.
Comment on attachment 8873187 [details] [diff] [review]
WIP: 1363281-uconv.patch (v6)

Review of attachment 8873187 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/base/util/nsMsgI18N.cpp
@@ +50,5 @@
>    }
>    else if (!PL_strcasecmp(aCharset, "UTF-8")) {
>      CopyUTF16toUTF8(inString, outString);
>      return NS_OK;
>    }

The above optimization is built into mozilla::Encoding, so the above is now unnecessary.

@@ +63,2 @@
>    bool mappingFailure = false;
>    outString.Truncate();

Changing the length of a buffer wrapped as a span without rewrapping it is not OK.

@@ +68,5 @@
> +    size_t read;
> +    size_t written;
> +    mozilla::Tie(result, read, written) =
> +      encoder->EncodeFromUTF16WithoutReplacement(src, dst, false);
> +    if (result != mozilla::kInputEmpty && result != mozilla::kOutputFull) {

It seems that upon kOutputFull, there's nothing that would grow outString and rewrap it as a dst span.

@@ +75,5 @@
> +        // an unmappable character, because otherwise
> +        // we'd have gotten `kOutputFull`.
> +        dst[written++] = '?';
> +      } else {
> +        mappingFailure = true;

Seems odd that this case doesn't add the replacement but doesn't break out early, either.

@@ +119,1 @@
>    outString.Truncate();

Truncating here seems unnecessary.

@@ +219,5 @@
> +    return false;
> +  auto encoder = encoding->NewEncoder();
> +
> +  uint8_t buffer[512];
> +  auto src = mozilla::MakeSpan(nsDependentString(inString));

You can use MakeCStringSpan and avoid the intermediate nsDependentString.

::: mailnews/compose/src/nsMsgCompose.cpp
@@ +3056,5 @@
> +  // Should I use DecodeToUTF16() (with replacement) like in MIME which processes
> +  // potentially bad input data. Hmm.
> +  if (!inStr.IsEmpty()) {
> +    nsAutoString tmp;
> +    CopyUTF8toUTF16(inStr, tmp);

CopyUTF8toUTF16 has undesirable behavior if the input is invalid. Unless you know it to be valid UTF-8, UTF_8_ENCODING->DecodeToUTF16WithoutBOMHandling() makes more sense.

::: mailnews/import/outlook/src/MapiMessage.cpp
@@ +578,5 @@
>  
> +  uint8_t buffer[512];
> +  auto src = mozilla::MakeSpan(m_body);
> +  auto bufferSpan = mozilla::MakeSpan(buffer);
> +  auto dst = bufferSpan.To(bufferSpan.Length());

What's the purpose of this line and having dst distinct from bufferSpan?

::: mailnews/mime/src/mimetext.h
@@ +65,5 @@
>    char *cbuffer;      /* Buffer used for charset conversion. */
>    int32_t cbuffer_size;
>  
> +  mozilla::Decoder *inputDecoder;
> +  mozilla::Encoder *utf8Encoder;

There is no need to pivot through UTF-16 to convert to UTF-8. All this code should become a direct decode to UTF-8. (Without holding the decoder as a raw pointer. If not streaming, there's no need to hold a decoder at all and the methods on mozilla::Encoding should suffice.)
Attachment #8873187 - Flags: feedback?(hsivonen) → feedback-
Attached patch WIP: 1363281-uconv.patch (v7) (obsolete) — Splinter Review
Thank you for you feedback, Henri. I hope you don't mind another round.

I've addressed all your feedback. I cleaned up the MIME code and I'm now using methods on mozilla::Encoding where possible.

The MIME function ConvertToUTF8() is still a mess. There doesn't appear to be a method on mozilla::Encoding to do the direct decoding to UTF-8.

Also, I don't see a way to calculate a good buffer size to start with. There is also no example using this function. So your suggestions would be greatly appreciated.
Attachment #8873187 - Attachment is obsolete: true
Attachment #8873658 - Flags: feedback?(hsivonen)
Comment on attachment 8873658 [details] [diff] [review]
WIP: 1363281-uconv.patch (v7)

Review of attachment 8873658 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/base/util/nsMsgI18N.cpp
@@ +51,5 @@
> +
> +  auto encoding = mozilla::Encoding::ForLabelNoReplacement(nsDependentCString(aCharset));
> +  if (!encoding)
> +    return NS_ERROR_UCONV_NOCONV;
> +  auto encoder = encoding->NewEncoder();

Unused variable.
(In reply to Henri Sivonen (:hsivonen) from comment #15)
> Dropping UTF-7 support for message body and headers would be a fine outcome
> if it's feasible to do so. IIRC, jcranmer has studied the feasibility and,
> as noted above, the Gaia email app ended up implementing UTF-7 decoding.

I definitely seem to recall that a major client was liable to end up using UTF-7 in some configuration. (Looking back at bug 414064, apparently DSNs from Exchange were UTF-7, blech). Surprisingly, searching for fixed UTF-7 bugs in bugzilla turned up bug 614617, where one of our engineers thought it a good idea to send out UTF-7 emails in 2010?

I have a test that we pull in UTF-7 for RFC 2047-decoding, but I'm not sure about body decoding tests for UTF-7. The test for x-mac-croatian can be ditched (I'm assuming encoding-rs drops support for those, the x-mac-croatian stuff is mostly about testing fallout of dumb Thunderbird defaults we changed a long while ago). Unfortunately, we've got some decoding being done in C++ code and some other decoding being done in JS code, which is not ideal for creating new encoding data structures.

Looking on DXR, the imap4-modified-utf7 converter is only being called via the two nsMsgI18N.h helper functions. Switching those to using a hardcoded one-shot converter should be feasible if necessary (we definitely don't need streaming). The downside is that it makes a modicum of sense to reuse the UTF-7 converter with the imap4-modified-utf7 encoder which is, ugh. The other option is switching it all to a non-streaming variant and making the JSMime fake converter just pass everything off to the decoder at the end of the stream rather than trying to build a streaming decoder (UTF-7 should be rare enough that I'm fine making it pay a steep performance penalty).
Flags: needinfo?(Pidgeot18)
Comment on attachment 8873658 [details] [diff] [review]
WIP: 1363281-uconv.patch (v7)

Review of attachment 8873658 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/mime/src/mimemoz2.cpp
@@ +816,5 @@
>  
> +  auto encoding = mozilla::Encoding::ForLabelNoReplacement(nsDependentCString(input_charset));
> +  if (!encoding) {
> +    // Assume input is UTF-8.
> +    encoding = UTF_8_ENCODING;

This means that XSS-dangerous encodings like HZ and ISO-2022-KR are decoded as UTF-8 rather than being decoded as the replacement encoding.

Is that the desired outcome? That is, is it certain that the output of this function will never be parsed as HTML with scripting enabled? Parsing it as HTML at all matters in terms of loading external images. (That is, some other process filtering external images before the message reaches Thunderbird and Thunderbird would disagree on the presence of external images.)

In general the `ForLabelNoReplacement` variant should only be used if the action taken for `nullptr` return value is to stop and report an error.

If the action is to use some other encoding as fallback, in general, the `ForLabel` variant should be used instead to avoid XSS issues.

(I'll add this to the API docs.)

@@ +821,4 @@
>    }
> +  // Is there no method on Encoding to do the conversion to UTF-8?
> +  // With buffer allocation, etc.
> +  auto decoder = encoding->NewDecoder();

As noted in the other comment, the streaming semantics of this conversion are unclear to me. If it's non-streaming, it would be nice for the outparam to be nsACString& in which case you could use a convenience method on mozilla::Encoding. It it's streaming, then the decoder should survive over multiple calls to this method.

If the desired semantics are non-streaming but the output has to be a raw buffer (not an XPCOM string), then instantiating a decoder here makes sense.

@@ +825,2 @@
>  
> +  // XXX Is it possbile to get a good start length??

mozilla::Decoder::MaxUTF8BufferLength() if you want simple.

If you want complicated but more memory-efficient, you can start with MaxUTF8BufferLengthWithoutReplacement() first and then realloc the buffer to MaxUTF8BufferLength() if you get kOutputFull.

The non-streaming convenience methods that require nsACString output take care of this.

@@ +831,2 @@
>  
> +  auto src = mozilla::AsBytes(mozilla::MakeSpan(nsDependentCString(stringToUse, inLength)));

No need for the intermediate nsDependentCString.

::: mailnews/mime/src/modlmime.h
@@ +222,5 @@
>     into HTML.  (This should return bytes which may appear in an HTML file,
>     ie, we must be able to scan through the string to search for "<" and
>     turn it in to "&lt;", and so on.)
>  
>     `input' is a non-NULL-terminated string of a single line from the message.

The streaming semantics of this function are unclear to me. The comment suggests that the input always ends in a line break or was immediately followed by one. In that case, for encodings other than ISO-2022-JP, UTF-16LE, UTF-16BE and replacement it doesn't matter whether the conversion is done in a streaming manner (same converter instance for all lines) or not (new converter for each line).

For ISO-2022-JP streaming vs. non-streaming conversion doesn't matter if input conforms to the extra email requirement of shifting back to the ASCII state at the end of each line. Can input be trusted to conform to that requirement, though?

Also, is there a guarantee that this method is never used with UTF-16LE/UTF-16BE input? Or if it is that the caller is able to locate non-ASCII-compatible line breaks?
Attachment #8873658 - Flags: feedback?(hsivonen) → feedback-
Thanks for your comments. Just a brief reply for now.

(In reply to Henri Sivonen (:hsivonen) from comment #25)
> > +  if (!encoding) {
> > +    // Assume input is UTF-8.
> > +    encoding = UTF_8_ENCODING;
> This means that XSS-dangerous encodings like HZ and ISO-2022-KR are decoded
> as UTF-8 rather than being decoded as the replacement encoding.
That was motivated by the current code:
https://dxr.mozilla.org/comm-central/rev/62db49c50e60514a4178651227629a9625381cb4/mailnews/mime/src/mimetext.cpp#369
    MIME_get_unicode_decoder(text->charset, getter_AddRefs(text->inputDecoder));
  // If no decoder found, use ""UTF-8"", that will map most non-US-ASCII chars as invalid
  // A pure-ASCII only decoder would be better, but there is none
  if (text->inputDecoder == nullptr)
    MIME_get_unicode_decoder("UTF-8", getter_AddRefs(text->inputDecoder));

> As noted in the other comment, the streaming semantics of this conversion
> are unclear to me. If it's non-streaming, it would be nice for the outparam
> to be nsACString& in which case you could use a convenience method on
> mozilla::Encoding. It it's streaming, then the decoder should survive over
> multiple calls to this method.
[snip]
> The non-streaming convenience methods that require nsACString output take
> care of this.
For all I can tell, non-streaming, converting a line at a time, there was even the assumptions that lines are 72 characters long, but in fact, according to RCF 821 they can be 1000 characters long.
https://dxr.mozilla.org/comm-central/rev/62db49c50e60514a4178651227629a9625381cb4/mailnews/mime/src/mimemoz2.cpp#816

I have considered changing the output parameter to an nsACString&, but I'd have to see how far that ripples up. There is not a single ns*String used in mimetext.cpp (well, one use to get a preference value).

The "convenience function" appears to be DecodeWithoutBOMHandlingAndWithoutReplacement() which has two interfaces converting to both UTF-16 and UTF-8.
(In reply to Jorg K (GMT+2) from comment #26)
> The "convenience function" appears to be
> DecodeWithoutBOMHandlingAndWithoutReplacement() which has two interfaces
> converting to both UTF-16 and UTF-8.

DecodeWithoutBOMHandling(), rather, since the old code did replacement.
(In reply to Henri Sivonen (:hsivonen) from comment #25)
> > +  // XXX Is it possbile to get a good start length??
> mozilla::Decoder::MaxUTF8BufferLength() if you want simple.
[snip]
> The non-streaming convenience methods that require nsACString output take
> care of this.
I was tempted to use the convenience method, but looking at |inline std::tuple<std::string, bool> decode_without_bom_handling| I see that it uses |size_t needed = decoder->max_utf8_buffer_length(bytes.size());|. So I might as well go with the simple MaxUTF8BufferLength() and won't be worse.
Attached patch 1363281-uconv.patch (v8) (obsolete) — Splinter Review
OK, this takes the last feedback on board. Maybe use interdiff to see the few changes.

Sorry I can't answer all questions about the legacy MIME code.
Attachment #8873658 - Attachment is obsolete: true
Attachment #8873914 - Flags: feedback?(hsivonen)
Attached patch 1363281-uconv.patch (v8b) (obsolete) — Splinter Review
I did something really silly and ran the wrong binary for testing (old 32bit build instead of new 64bit build due to wrong shortcut). Switching to the right binary I noticed some things that didn't work. So with this patch, the application starts, folders and message are displayed. I mildly tested replying to a message and sending a message (exercises code in nsMsgCompose.cpp and nsMsgI18N.cpp).

However, IMAP folder names look really funny since there is currently no UTF-7 support. Obviously my comment #13 was wrong.

Henri, for further feedback, please interdiff v7 and v8b.
Attachment #8873914 - Attachment is obsolete: true
Attachment #8873914 - Flags: feedback?(hsivonen)
Attachment #8874190 - Flags: feedback?(hsivonen)
To look at the UTF-7 issue, I've added some debug and I see:
=== nsMsgI18NConvertToUnicode: x-imap4-modified-utf7 |INBOX.&xUjFSMVIxUjFSMVIxUg-|
=== nsMsgI18NConvertFromUnicode: x-imap4-modified-utf7 |Trash|

So as expected, the IMAP folder names pass though nsMsgI18N.cpp. They should not pass through the compose code or the MIME code (or the MAPI code). So I will add a special case there to get IMAP going again. That shouldn't be too hard with the existing code in  nsUTF7ToUnicode.cpp and nsUnicodeToUTF7.cpp.
This hacks MUTF-7 to Unicode into existence using the existing code. So is this the approach we're going to take? It's basically what Joshua called:
> Switching those [calls in nsMsgI18N.cpp] to using a hard-coded one-shot converter should be feasible if necessary.

With this patch, IMAP folders look good again.

The patch is WIP and the Unicode to MUTF-7 part is still missing but not hard to do.
Attached patch 1363281-utf7-imap.patch (v2) (obsolete) — Splinter Review
OK, this does the UFT-7 encoding to and from Unicode using the existing code.

Tested by renaming an IMAP folder called 안a안b안c안d안e안f안gAAABBBCCC to 안a안b안c안d안e안f안gAAABBBDDD (while having the buffer size set to 10 to exercise the loop).

So the two patches together move TB into the new age of Rust.
Attachment #8874205 - Attachment is obsolete: true
Attachment #8874213 - Flags: feedback?(mkmelin+mozilla)
Attachment #8874213 - Flags: feedback?(Pidgeot18)
Comment on attachment 8874213 [details] [diff] [review]
1363281-utf7-imap.patch (v2)

Review of attachment 8874213 [details] [diff] [review]:
-----------------------------------------------------------------

So, ideally, I'd like to see the x-imap4-modified-utf7 removed from the catch-all functions and just have the code be inlined directly into CopyUTF16ToMUTF7 and its inverse (which means moving their implementation from an inline function in nsMsgI18N.h to nsMsgI18N.cpp). In particular, our current implementation makes it possible to say Content-Type: text/plain; charset="x-imap4-modified-utf7" and trigger using MUTF7, which I don't think is desirable. On the other hand, we are going to need to directly call the UTF-7 decoder from somewhere for jsmime and libmime, and keeping the similarity of code for these pieces may make some sense. I don't exactly see where you're triggering the UTF-7 decoder in such a way that we can decode UTF-7 text/ parts or UTF-7 in RFC 2047 encoded-words, so I can't comment on what work is necessary.

::: mailnews/base/util/nsMsgI18N.cpp
@@ +52,5 @@
>      LossyCopyUTF16toASCII(inString, outString);
>      return NS_OK;
>    }
> +  else if (!PL_strcasecmp(aCharset, "x-imap4-modified-utf7")) {
> +    nsUnicodeToMUTF7 *converter = new nsUnicodeToMUTF7();

You don't need to heap-allocate here. Just say nsUnicodeToMUTF7 converter;, and then you don't need to delete it later.

@@ +53,5 @@
>      return NS_OK;
>    }
> +  else if (!PL_strcasecmp(aCharset, "x-imap4-modified-utf7")) {
> +    nsUnicodeToMUTF7 *converter = new nsUnicodeToMUTF7();
> +    char buffer[IMAP_UTF7_BUF_LENGTH];

Please don't use a stack buffer. Use at least static instead.

@@ +57,5 @@
> +    char buffer[IMAP_UTF7_BUF_LENGTH];
> +    char16_t *in = (char16_t *)inString.get();
> +    int32_t inLen = inString.Length();
> +    outString.Truncate();
> +    while (inLen) {

while (inLen > 0) is better.

@@ +118,5 @@
>      return NS_ERROR_UNEXPECTED;
>    }
> +  else if (!PL_strcasecmp(aCharset, "x-imap4-modified-utf7")) {
> +    nsMUTF7ToUnicode *converter = new nsMUTF7ToUnicode();
> +    char16_t buffer[IMAP_UTF7_BUF_LENGTH];

So, for converting UTF7 to unicode, it's the case that we have a good upper bound on the size of the string we need--namely, the converted string can't be larger than the input string (it doesn't quite hold true for UTF-8--a span of 3*n UCS-2 characters use 8*n + 2 characters in UTF-7 while they could use 9*n characters in UTF-8).

So rather than allocating a temporary buffer and the copying elements in, just ensure that the output string has sufficient space, convert into that buffer. Something like (untested):

// Add a comment about UTF-7 being strictly larger in encoding size than UTF-16.
nsMUTF7ToUnicode converter;
int32_t outLen = inString.Length();
outString.SetCapacity(outLen);
int32_t inLen = outLen;
converter->ConvertNoBuff(in, &inLen, outString.BeginWriting(), &outLen);
MOZ_ASSERT(inLen == inString.Length(), "UTF-7 should not produce a longer output");
outString.SetLength(outLen);
Attachment #8874213 - Flags: feedback?(Pidgeot18) → feedback+
I tried to follow Joshua's advice to move the UTF-7 stuff into CopyUTF16toMUTF7() and its friend. Having done this, I started TB (after having renamed my test folder on a different machine/profile) and I was greeted by:
===
The folder 'h&APw-h&APw-' could not be found, so filter(s) associated with this folder will be disabled. Verify that the folder exists, and that filters point to a valid destination folder.
===

So some other parts of the system run the catch-all functions. So all I can take on board are the other suggestions. Updated patch coming.
Damn, as much as I tried, I could reproduce this condition any more. So let's attach both versions.
Attached patch 1363281-utf7-imap.patch (v3) (obsolete) — Splinter Review
Version with the UTF-7 stuff moved out of the catch-all function.
Processing still in the catch-all function, but Joshua's remaining comments addressed.

I noticed that I can't rename a folder to "Hö" with a trailing Umlaut, so something's wrong here.
Attachment #8874213 - Attachment is obsolete: true
Attachment #8874213 - Flags: feedback?(mkmelin+mozilla)
Attached patch 1363281-utf7-imap.patch (v4) (obsolete) — Splinter Review
OK, fixed the "can't rename to 'Hö'" problem, there was a "finish" call missing. This is following all of Joshua's suggestions now, but I'm afraid we'll see a funny error message as per comment #35. Hmm, no idea how to trigger it.
Attachment #8874262 - Attachment is obsolete: true
Attachment #8874263 - Attachment is obsolete: true
Attachment #8874265 - Flags: feedback?(Pidgeot18)
For the record: This
converter.charset = "x-imap4-modified-utf7";
https://dxr.mozilla.org/comm-central/rev/62db49c50e60514a4178651227629a9625381cb4/mailnews/test/fakeserver/imapd.js#307
will of course not work any more.
(In reply to Jorg K (GMT+2) from comment #28)
> (In reply to Henri Sivonen (:hsivonen) from comment #25)
> > > +  // XXX Is it possbile to get a good start length??
> > mozilla::Decoder::MaxUTF8BufferLength() if you want simple.
> [snip]
> > The non-streaming convenience methods that require nsACString output take
> > care of this.
> I was tempted to use the convenience method, but looking at |inline
> std::tuple<std::string, bool> decode_without_bom_handling| I see that it
> uses |size_t needed = decoder->max_utf8_buffer_length(bytes.size());|. So I
> might as well go with the simple MaxUTF8BufferLength() and won't be worse.

It seems that you are reading the sample code that uses the C++ standard library. That code isn't used in Gecko with XPCOM/MFBT types.

The Gecko-relevant bit is:
https://hg.mozilla.org/try/file/6767a326ba28/intl/encoding_glue/src/lib.rs#l316
which tries to use decoder.max_utf8_buffer_length_without_replacement() first and stretches to decoder.max_utf8_buffer_length() if applicable.

Using the convenience method would also avoid heap-allocating a new decoder for each converted line. (Since the convenience-method is Rust-backed, it can stack-allocate the decoder. Stack allocation isn't available to C++ due to there being no stable ABI for returning a Decoder by value to C. Recycling a heap-allocation is available to C++, though, in the form of NewDecoderWithoutBOMHandlingInto().)
Yep, I was reading in |inline std::tuple<std::string, bool> decode_without_bom_handling(| in the first part of the patch. It didn't look like Rust to me ;-) - Sorry.

I'll look into using NewDecoderWithoutBOMHandlingInto() or would you prefer to drill open that call stack and use nsACString? That can be done, too, and then I don't have to worry about the fine print.
Comment on attachment 8874190 [details] [diff] [review]
1363281-uconv.patch (v8b)

Review of attachment 8874190 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/base/util/nsMsgI18N.cpp
@@ +46,5 @@
>               !PL_strcasecmp(aCharset, "us-ascii") ||
>               !PL_strcasecmp(aCharset, "ISO-8859-1"))) {
>      LossyCopyUTF16toASCII(inString, outString);
>      return NS_OK;
>    }

us-ascii and ISO-8859-1 are no longer canonical names and the above behavior differs from letting encoding_rs handle them (as labels for windows-1252). Is the special-casing of us-ascii and ISO-8859-1 above actually desirable?

@@ +214,5 @@
> +      // Didn't use all the input but the outout isn't full, hence
> +      // there was an unencodable character.
> +      result = false;
> +      break;
> +    }

It seems that the loop should be broken out of (without setting result = false) on kInputEmpty to avoid and infinite loop.

::: mailnews/import/outlook/src/MapiMessage.cpp
@@ +586,5 @@
> +    if (result != mozilla::kInputEmpty && result != mozilla::kOutputFull) {
> +      // Didn't use all the input but the outout isn't full, hence
> +      // there was an unencodable character.
> +      return false;
> +    }

This loop, too, needs breaking out of (or just "return true;") on kInputEmpty.
Attachment #8874190 - Flags: feedback?(hsivonen) → feedback-
(In reply to Jorg K (GMT+2) from comment #42)
> Yep, I was reading in |inline std::tuple<std::string, bool>
> decode_without_bom_handling(| in the first part of the patch. It didn't look
> like Rust to me ;-) - Sorry.
> 
> I'll look into using NewDecoderWithoutBOMHandlingInto() or would you prefer
> to drill open that call stack and use nsACString? That can be done, too, and
> then I don't have to worry about the fine print.

I recommend switching the outparam to nsACString unless that's particularly messy. Note that if the input is in an nsACString, too, the conversion step becomes a mere nsStringBuffer refcount bump when possible.
Thanks for the comments. Today is a public holiday in Germany, but I'll get to it soon.
Henri, you questioned this code:

https://dxr.mozilla.org/comm-central/rev/62db49c50e60514a4178651227629a9625381cb4/mailnews/base/util/nsMsgI18N.cpp#45
  else if (!aReportUencNoMapping && (!*aCharset ||
             !PL_strcasecmp(aCharset, "us-ascii") ||
             !PL_strcasecmp(aCharset, "ISO-8859-1"))) {
    LossyCopyUTF16toASCII(inString, outString);
    return NS_OK;
  }

How about this:
https://dxr.mozilla.org/comm-central/rev/62db49c50e60514a4178651227629a9625381cb4/mailnews/base/util/nsMsgI18N.cpp#127
  else if (!*aCharset || !PL_strcasecmp(aCharset, "us-ascii") ||
           !PL_strcasecmp(aCharset, "ISO-8859-1")) {
    // Despite its name, it also works for Latin-1.
    CopyASCIItoUTF16(inString, outString);
    return NS_OK;
  }
  else if (!PL_strcasecmp(aCharset, "UTF-8")) {
    if (MsgIsUTF8(inString)) {
      nsAutoString tmp;
      CopyUTF8toUTF16(inString, tmp);
      if (!tmp.IsEmpty() && tmp.First() == char16_t(0xFEFF))
        tmp.Cut(0, 1);
      outString.Assign(tmp);
      return NS_OK;
    }
    NS_WARNING("Invalid UTF-8 string");
    return NS_ERROR_UNEXPECTED;
  }
(In reply to Jorg K (GMT+2) from comment #46)
> Henri, you questioned this code:
> 
> https://dxr.mozilla.org/comm-central/rev/
> 62db49c50e60514a4178651227629a9625381cb4/mailnews/base/util/nsMsgI18N.cpp#45
>   else if (!aReportUencNoMapping && (!*aCharset ||
>              !PL_strcasecmp(aCharset, "us-ascii") ||
>              !PL_strcasecmp(aCharset, "ISO-8859-1"))) {
>     LossyCopyUTF16toASCII(inString, outString);
>     return NS_OK;
>   }
> 
> How about this:
> https://dxr.mozilla.org/comm-central/rev/
> 62db49c50e60514a4178651227629a9625381cb4/mailnews/base/util/nsMsgI18N.cpp#127
>   else if (!*aCharset || !PL_strcasecmp(aCharset, "us-ascii") ||
>            !PL_strcasecmp(aCharset, "ISO-8859-1")) {
>     // Despite its name, it also works for Latin-1.
>     CopyASCIItoUTF16(inString, outString);
>     return NS_OK;
>   }

Same issue but the other way round.

>   else if (!PL_strcasecmp(aCharset, "UTF-8")) {
>     if (MsgIsUTF8(inString)) {
>       nsAutoString tmp;
>       CopyUTF8toUTF16(inString, tmp);
>       if (!tmp.IsEmpty() && tmp.First() == char16_t(0xFEFF))
>         tmp.Cut(0, 1);
>       outString.Assign(tmp);
>       return NS_OK;
>     }
>     NS_WARNING("Invalid UTF-8 string");
>     return NS_ERROR_UNEXPECTED;
>   }

This doesn't handle invalid UTF-8 as gracefully as encoding_rs doing replacement.
Attached patch 1363281-uconv.patch (v9) (obsolete) — Splinter Review
OK, I switched to nsACString. It wasn't messy, I had looked at it before but I wasn't aware of the benefits of using the "convenience function".

You may give more feedback now or later. I haven't removed the questionable code quoted in comment #46 yet.
Attachment #8874190 - Attachment is obsolete: true
Attachment #8874375 - Flags: feedback?(hsivonen)
Interdiff v8b to v9 works here:
https://bugzilla.mozilla.org/attachment.cgi?oldid=8874190&action=interdiff&newid=8874375&headers=1

Oops, I didn't notice comment #47 while busily attaching patches. So remove those three special cases, right?

Cutting 0xFEFF is the BOM-handling, right? So should I not use the DecodeWithoutBOMHandlingAndWithoutReplacement() and instead use the one with BOM handling?
Comment on attachment 8874375 [details] [diff] [review]
1363281-uconv.patch (v9)

Review of attachment 8874375 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/base/util/nsMsgI18N.cpp
@@ +212,5 @@
> +      encoder->EncodeFromUTF16WithoutReplacement(src, dst, false);
> +    if (result == mozilla::kInputEmpty) {
> +      // All converted successfully.
> +      break;
> +    } else if (result != mozilla::kInputEmpty && result != mozilla::kOutputFull) {

No need for "else" or "result != mozilla::kInputEmpty" on this line anymore.

::: mailnews/import/outlook/src/MapiMessage.cpp
@@ +585,5 @@
> +      encoder->EncodeFromUTF16WithoutReplacement(src, dst, false);
> +    if (result == mozilla::kInputEmpty) {
> +      // All converted successfully.
> +      break;
> +    } else if (result != mozilla::kInputEmpty && result != mozilla::kOutputFull) {

No need for "else" or "result != mozilla::kInputEmpty" on this line anymore.

::: mailnews/mime/src/mimemoz2.cpp
@@ +821,3 @@
>    }
>  
> +  rv = encoding->DecodeWithoutBOMHandling(nsDependentCString(stringToUse, inLength), outString);

nsDependentString doesn't give you the potential zero-copy benefit. Propagating the XPCOM stringness of the input further up the call stack might. The zero-copy case can work only when the input string's buffer is a heap-allocated nsStringBuffer.
Attachment #8874375 - Flags: feedback?(hsivonen) → feedback+
(In reply to Jorg K (GMT+2) from comment #49)
> Interdiff v8b to v9 works here:
> https://bugzilla.mozilla.org/attachment.
> cgi?oldid=8874190&action=interdiff&newid=8874375&headers=1
> 
> Oops, I didn't notice comment #47 while busily attaching patches. So remove
> those three special cases, right?
> 
> Cutting 0xFEFF is the BOM-handling, right? So should I not use the
> DecodeWithoutBOMHandlingAndWithoutReplacement() and instead use the one with
> BOM handling?

"BOM removal" means removing the BOM for this encoding if one is there (i.e. if this encoding is UTF-8, removing the UTF-8 BOM but treating other BOMs as garbage input). "Without BOM handling" means leaving the BOM there in the output. The third default unqualified option means potentially switching encoding according to the BOM and removing the BOM.

Since this is converting line-by-line and it would be inappropriate to BOM-sniff line-by-line, this should be WithoutBOMHandling.
Henri, thanks for all your help, advice and feedback.

Before I make the final changes addressing comment #50, let me ask/confirm again:
- I'll remove the legacy us-ascii/ISO-8859-1 special casing (comment #47).
- This code |CopyUTF8toUTF16(inString, tmp); if (!tmp.IsEmpty() && tmp.First() == char16_t(0xFEFF))|
  removes the BOM from the UTF-16 string.
  Surely, as you pointed out, this is inappropriate for line-by-line processing but it may
  be needed on the first line. So what do I do with this code? Will WithoutBOMHandling remove it?

Sorry about the NI, I don't know how closely you follow bugs.
Flags: needinfo?(hsivonen)
(In reply to Jorg K (GMT+2) from comment #52)
> - This code |CopyUTF8toUTF16(inString, tmp); if (!tmp.IsEmpty() &&
> tmp.First() == char16_t(0xFEFF))|
>   removes the BOM from the UTF-16 string.
>   Surely, as you pointed out, this is inappropriate for line-by-line
> processing but it may
>   be needed on the first line. So what do I do with this code? Will
> WithoutBOMHandling remove it?

DecodeWithoutBOMHandling won't remove it. UTF_8_ENCODING->DecodeWithBOMRemoval() will remove the UTF-8 BOM but will treat UTF-16LE and UTF-16BE BOMs as garbage entering the decoder like any other garbage. UTF_8_ENCODING->Decode() will remove any of the three BOMs and perform a UTF-16LE or UTF-16BE decode instead of UTF-8 decode if the BOM was the BOM for UTF-16LE or UTF-16BE, respectively, and the second item in the return tuple indicates which encoding was actually used.

So to replace the above-quoted code, you want UTF_8_ENCODING->DecodeWithBOMRemoval().
Flags: needinfo?(hsivonen)
Attached patch 1363281-uconv.patch (v10) (obsolete) — Splinter Review
I hope this is the last iteration. Removed doubtful code and used UTF_8_ENCODING->DecodeWithBOMRemoval().

Thanks again for all your help.
Attachment #8874375 - Attachment is obsolete: true
Attachment #8874907 - Flags: feedback?(hsivonen)
Attachment #8874265 - Attachment is obsolete: true
Attachment #8874265 - Flags: feedback?(Pidgeot18)
Attachment #8874911 - Flags: feedback?(Pidgeot18)
Comment on attachment 8874907 [details] [diff] [review]
1363281-uconv.patch (v10)

Review of attachment 8874907 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/base/util/nsMsgI18N.cpp
@@ +41,5 @@
>    }
> +
> +  auto encoding = mozilla::Encoding::ForLabelNoReplacement(nsDependentCString(aCharset));
> +  if (!encoding)
> +    return NS_ERROR_UCONV_NOCONV;

Sorry about not mentioning this before, but, to be safe, you might check if you got UTF_16LE_ENCODING or UTF_16BE_ENCODING and error out in those cases, too. (If you don't, those actually produce UTF-8 output to match URL parsing and form submission behavior.) Or assert, since, AFAIK, UTF-16 blocking is supposed to happen elsewhere.

@@ +71,2 @@
>    else if (!PL_strcasecmp(aCharset, "UTF-8")) {
> +    return UTF_8_ENCODING->DecodeWithBOMRemoval(inString, outString);

This special case means that the BOM is removed only if aCharset is "UTF-8", but "UTF-16BE", "UTF-16LE" and "utf-8" don't get the BOM removed below. I'm not sure if that matters, but your use of mozilla::Encoding::ForLabelNoReplacement instead of mozilla::Encoding::ForName below implies that you don't trust callers to use "UTF-8" consistently and never to use "utf-8".
Attachment #8874907 - Flags: feedback?(hsivonen) → feedback+
Thanks, I'll look into it. Also, I forgot to address your comment #50 comment completely:
> No need for "else" or "result != mozilla::kInputEmpty" on this line anymore.
I've removed the unnecessary "result != mozilla::kInputEmpty" and added a check for the UTF-16 (outgoing) encoding.

Sadly I don't understand the other comment:
> >    else if (!PL_strcasecmp(aCharset, "UTF-8")) {
> > +    return UTF_8_ENCODING->DecodeWithBOMRemoval(inString, outString);
> This special case means that the BOM is removed only if aCharset is "UTF-8",
> but "UTF-16BE", "UTF-16LE" and "utf-8" don't get the BOM removed below.
Well, "utf-8" in lower case is covered by the case-insensitive compare, no? What am I missing?
Attachment #8874907 - Attachment is obsolete: true
(In reply to Jorg K (GMT+2) from comment #58)
> Sadly I don't understand the other comment:
> > >    else if (!PL_strcasecmp(aCharset, "UTF-8")) {
> > > +    return UTF_8_ENCODING->DecodeWithBOMRemoval(inString, outString);
> > This special case means that the BOM is removed only if aCharset is "UTF-8",
> > but "UTF-16BE", "UTF-16LE" and "utf-8" don't get the BOM removed below.
> Well, "utf-8" in lower case is covered by the case-insensitive compare, no?
> What am I missing?

My mistake. Sorry. I didn't notice the comparison was case-insensitive. Then the difference is only with surrounding whitespace, which I guess isn't an issue in mailnews, and the two UTF-16 variants, which may or may not be an issue.
Comment on attachment 8875618 [details] [diff] [review]
1363281-uconv.patch (v11)

If you could kindly stick another f+ onto it, I think I'm done here for now.

When bug 1261841 lands, I imaging on mozilla56 after the next branch date, I will land this and we'll have a least a working TB. We'll see what tests fail and I'll fix them after the event.

I little hard to push M-C changes to C-C try, especially with large patches like in this case.
Attachment #8875618 - Flags: review?(mkmelin+mozilla)
Attachment #8875618 - Flags: feedback?(hsivonen)
Comment on attachment 8874911 [details] [diff] [review]
1363281-utf7-imap.patch (v4) - refreshed

Whoever comes first ;-) I will have to land this as bustage fix if the review doesn't come on time.
Attachment #8874911 - Flags: review?(mkmelin+mozilla)
Attachment #8874911 - Flags: review?(Pidgeot18)
Attachment #8874911 - Flags: feedback?(Pidgeot18)
Attachment #8875618 - Flags: feedback?(hsivonen) → feedback+
Attachment #8874911 - Flags: review?(Pidgeot18) → review+
Comment on attachment 8875618 [details] [diff] [review]
1363281-uconv.patch (v11)

Joshua, maybe you want to look this over, too. Or you could rs it. I went through many iterations with Henri until he was finally satisfied ;-) - well, he asked some questions in the process that I couldn't answer.
Attachment #8875618 - Flags: review?(Pidgeot18)
Attachment #8874911 - Flags: review?(mkmelin+mozilla)
Comment on attachment 8875618 [details] [diff] [review]
1363281-uconv.patch (v11)

Review of attachment 8875618 [details] [diff] [review]:
-----------------------------------------------------------------

Skimmed through it, rs=mkmelin

::: mailnews/base/util/nsMsgI18N.cpp
@@ +57,5 @@
> +    if (aReportUencNoMapping) {
> +      rv = NS_ERROR_UENC_NOMAPPING;
> +    } else {
> +      rv = NS_OK;
> +    }

could just be 

  rv = (aReportUencNoMapping) ? NS_ERROR_UENC_NOMAPPING : NS_OK;

@@ +182,5 @@
> +
> +  uint8_t buffer[512];
> +  auto src = mozilla::MakeStringSpan(inString);
> +  auto dst = mozilla::MakeSpan(buffer);
> +  for (;;) {

while(true) ?

::: mailnews/import/outlook/src/MapiMessage.cpp
@@ +576,5 @@
>  
> +  uint8_t buffer[512];
> +  auto src = mozilla::MakeSpan(m_body);
> +  auto dst = mozilla::MakeSpan(buffer);
> +  for (;;) {

while (true)
Attachment #8875618 - Flags: review?(mkmelin+mozilla) → review+
Thanks, |for(;;)| seems to be Henri's style which I've just copied. I'll fix the nits before landing it.
https://hg.mozilla.org/comm-central/rev/f2532fa2989b44536653261e53c179edf27c5400
Bustage fix to reduce the massive Xpcshell test failure. If decoding from UTF-7 is really needed during the tests, we'll need to revisit this later.

I ran three of the tests that were failing, and they passed with this patch.
There are also many Mozmill test failures reported. I ran a few manually and they didn't fail:
mozmake SOLO_TEST=composition/test-image-display.js mozmill-one
mozmake SOLO_TEST=composition/test-reply-format-flowed.js mozmill-one
mozmake SOLO_TEST=composition/test-send-format.js mozmill-one
mozmake SOLO_TEST=composition/test-drafts.js mozmill-one

So it looks like preceding test fail and don't clean up properly, so subsequent ones fail.

One that fails for real is:
mozmake SOLO_TEST=composition/test-multipart-related.js mozmill-one
which gives:
SUMMARY-UNEXPECTED-FAIL | c:\mozilla-source\comm-central\mail\test\mozmill\composition\test-multipart-related.js | test-multipart-related.js::test_basic_multipart_related
  EXCEPTION: Expected HTML to refer to cid part1.A82A7414.A4F11358@foo.invalid
    at: test-multipart-related.js line 117

mozmake SOLO_TEST=composition/test-charset-edit.js mozmill-one
fails with:
SUMMARY-UNEXPECTED-FAIL | c:\mozilla-source\comm-central\mail\test\mozmill\composition\test-charset-edit.js | test-charset-edit.js::test_wrong_reply_charset
  EXCEPTION: a != b: 'UTF-8' != 'windows-1252'.
    at: test-folder-display-helpers.js line 2890
       assert_true test-folder-display-helpers.js:2890 11
       assert_equals test-folder-display-helpers.js:2877 3
       test_wrong_reply_charset test-charset-edit.js:108 3
and
SUMMARY-UNEXPECTED-FAIL | c:\mozilla-source\comm-central\mail\test\mozmill\composition\test-charset-edit.js | test-charset-edit.js::test_no_mojibake
  EXCEPTION: a != b: 'x-mac-croatian' != 'utf-7'.
    at: test-folder-display-helpers.js line 2890
       assert_true test-folder-display-helpers.js:2890 11
       assert_equals test-folder-display-helpers.js:2877 3
       test_no_mojibake test-charset-edit.js:141 3

mozmake SOLO_TEST=composition/test-charset-upgrade.js mozmill-one
fails with:
SUMMARY-UNEXPECTED-FAIL | c:\mozilla-source\comm-central\mail\test\mozmill\composition\test-charset-upgrade.js | test-charset-upgrade.js::test_encoding_upgrade_html_compose
  EXCEPTION: a != b: 'UTF-8' != 'windows-1252'.
    at: test-folder-display-helpers.js line 2890
       assert_true test-folder-display-helpers.js:2890 11
       assert_equals test-folder-display-helpers.js:2877 3
       test_encoding_upgrade_html_compose test-charset-upgrade.js:65 3
and
SUMMARY-UNEXPECTED-FAIL | c:\mozilla-source\comm-central\mail\test\mozmill\composition\test-charset-upgrade.js | test-charset-upgrade.js::test_encoding_upgrade_plaintext_compose
  EXCEPTION: a != b: 'UTF-8' != 'windows-1252'.
    at: test-folder-display-helpers.js line 2890
       assert_true test-folder-display-helpers.js:2890 11
       assert_equals test-folder-display-helpers.js:2877 3
       test_encoding_upgrade_plaintext_compose test-charset-upgrade.js:147 3

Also spotted:
JavaScript error: resource:///modules/jsmime.jsm, line 39: : Component returned failure code: 0x80500001 [nsIScriptableUnicodeConverter.charset]

Xpcshell results from the latest push have just arrived, nine real failures after the the landing in comment #68:
TEST-UNEXPECTED-FAIL | mailnews/base/test/unit/test_getMsgTextFromStream.js | run_test/< - [run_test/< : 77] "" == "This message has no content type. This should be assumed to be text/plain."
TEST-UNEXPECTED-FAIL | mailnews/compose/test/unit/test_attachment.js | testInput0 - [testInput0 : 92] "Content-Disposition: attachment;\\r\\n =?UTF-8?Q?_!=22#$%&'=28=29*+=2c-./0123456789:;<=3d>=3f@ABCDEFGHIJKLM?==?UTF-8?Q?NOPQRSTUVW == "Content-Disposition: attachment;\\r\\n filename*0*=UTF-8''%20%21%22%23%24%25%26%27%28%29%2A%2B%2C%2D%2E%2F%30%31;\\r\\n filename*1* 
TEST-UNEXPECTED-FAIL | mailnews/compose/test/unit/test_messageHeaders.js | testContentHeaders - [testContentHeaders : 71] "8bit" == "7bit" 
TEST-UNEXPECTED-FAIL | mailnews/compose/test/unit/test_longLines.js | testBodyWithLongLine - [testBodyWithLongLine : 67] "8bit" == "base64" 
TEST-UNEXPECTED-FAIL | mailnews/imap/test/unit/test_mailboxes.js | checkDiscovery - [checkDiscovery : 38] false == true
TEST-UNEXPECTED-FAIL | mailnews/imap/test/unit/test_mailboxes.js | checkDiscovery - [checkDiscovery : 38] false == true
TEST-UNEXPECTED-FAIL | mailnews/intl/test/unit/test_decode_utf-7_internal.js | checkDecode - [checkDecode : 31] "+LGI--+ITIipSIp- +AocCeQ-oddns +Ad0CjQ- s+ATECZQKH- p+AlAB3QJ5- u+AlACVA- no+Ao4- +Al8-I" == "Ɫ-Ⅎ⊥∩ ʇɹoddns ǝʍ sıɥʇ pɐǝɹ uɐɔ noʎ ɟI"
TEST-UNEXPECTED-FAIL | mailnews/intl/test/unit/test_encode_utf-7_internal.js | checkEncode - [checkEncode : 47] "?-??? ??oddns ?? s??? p??? u?? no? ?I" == "+LGI--+ITIipSIp- +AocCeQ-oddns +Ad0CjQ- s+ATECZQKH- p+AlAB3QJ5- u+AlACVA- no+Ao4- +Al8-I"
TEST-UNEXPECTED-FAIL | mailnews/mime/test/unit/test_jsmime_charset.js | run_test - [run_test : 27] "=?UTF-7?Q?+AKM-1?=" == "£1"

So I guess I'll be here for a while :-(
https://hg.mozilla.org/comm-central/rev/322986aa2dbabda0ba031d76ba2dc959c7cc9f7a
For the time being, we can't decode UTF-7 in JS since we only have a C++ interface for it. Previously the nsIScriptableUnicodeConverter could be used for this purpose.

I'll file another bug to reinstate testing removed here. This will fix/silence
test_jsmime_charset.js that won't be restored and
test_encode_utf-7_internal.js, test_decode_utf-7_internal.js which may be restored and
test_mailboxes.js which should be restored.

The following tests still need investigation:
test_getMsgTextFromStream.js, test_attachment.js, test_messageHeaders.js, test_longLines.js since there the behaviour changed in an unexpected way. Also the Mozmill need to be investigated.
Filed bug 1372899 for UTF-7 follow up work so the slashing can continue here ;-)
Attached patch 1363281-follow-up5.patch - WIP. (obsolete) — Splinter Review
test_getMsgTextFromStream.js can be fixed by restoring this code:
https://hg.mozilla.org/comm-central/rev/f558febc1ead57c76ba5eb5af4443a67680f7fa4#l3.117
since in this test an empty charset is passed.
I noticed that a message written with UTF-8 encoding is stored as an empty message. So that's likely to cause a few test failures ;-)
https://hg.mozilla.org/comm-central/rev/542d84bf54f1b21ce9979fe82bbb1c46a78f41f5

There was a silly typo/cut&paste bug that disabled encoding message content into the target charset, that is, all saved drafts and sent messages were empty, not real good for an e-mail program.

Also I reinstated the option to pass an empty charset.

Three of the four failing Xpcshell tests now pass, there is still trouble with test_longLines.js which fails when testing ISO-2022-JP in plain text mode. Somehow the converter flags a replacement and hence the content is encoded in UTF-8 instead which is undesired/unexpected.

We'll see how the Mozmill tests come out with this latest patch. Since encoding works now, they should be whole lot better, maybe still failing on x-mac-croation. We'll see.
Attachment #8877746 - Attachment is obsolete: true
Henri, I'm having trouble with encoding in ISO-2022-JP. My test encodes this message
  htmlMessage = "<html><head>" +
    "<meta http-equiv=\"content-type\" content=\"text/html; charset=ISO-2022-JP\">" +
    "</head><body>" + longMultibyteLineJapanese + "</body></html>";
where
let longMultibyteLineJapanese = "語".repeat(450);
https://dxr.mozilla.org/comm-central/rev/322986aa2dbabda0ba031d76ba2dc959c7cc9f7a/mailnews/compose/test/unit/test_longLines.js#165

That test passes. The next test encodes the same message, but forced to plain text. Basically, all the tags are removed and only |longMultibyteLineJapanese| is encoded. This time I get a NS_OK_HAD_REPLACEMENTS, so I abort using ISO-2022-JP and fall back to UTF-8, which makes the test fail.

I'm a little puzzled by this. How can a Japanese string embedded in HTML tags be encoded correctly, yet, the string without the tags has replacements? Weird. I tried to reproduce scenario manually, that is compose a message with that string, send as plain text. This works, the message is encoded as ISO-2022-JP.

So my questions:
- Any hints how to debug this? A little hard to dump out the input string at time of encoding.
- Under which circumstances would encoding to ISO-2022-JP return "had replacements" if the text consists of
  語 and maybe a linefeed?
- Is there the remote chance of a bug in the encoder that the status is not reliable?
Flags: needinfo?(hsivonen)
https://hg.mozilla.org/comm-central/rev/c744963d757b8843f5cc4724cfe384038562d154

Aceman, please do a post-landing review here.

test-charset-edit.js failed since it was still using UTF-7 which is no longer supported as body charset. Sadly, it threw before cleaning up the draft it created which made all subsequent tests fail. I've fix this and I am using UFT-16 instead now, which is not perfect, but at least maintains the test.

With this landing, Mozmill should be green. The only failure left should be xpcshell test_longLines.js as per comment #75.

Summarising landing so far:
Comment #65:
https://hg.mozilla.org/comm-central/rev/f558febc1ead57c76ba5eb5af4443a67680f7fa4 - main code
https://hg.mozilla.org/comm-central/rev/ac30eed3d7e54d44f6824d9da864d3a89579e76c - UTF-7 support
https://hg.mozilla.org/comm-central/rev/6e8cb90569125b04aadbb9f34c2ecc560c07837b - mozconfig

Comment #66 and comment #67:
https://hg.mozilla.org/comm-central/rev/8af60715cdafb3bff5ee7ef2522e73ea88a2489f - compile error, follow-up 2
https://hg.mozilla.org/comm-central/rev/a3b550af3c582739249505897fab03c3b51b08e3 - compile error, follow-up 3

Comment #70:
https://hg.mozilla.org/comm-central/rev/322986aa2dbabda0ba031d76ba2dc959c7cc9f7a - remove utf-7-based tests,
follow-up 4, see follow-up bug 1372899.

Comment #74:
https://hg.mozilla.org/comm-central/rev/542d84bf54f1b21ce9979fe82bbb1c46a78f41f5 - corrected encoding, follow-up 5

This comment:
https://hg.mozilla.org/comm-central/rev/c744963d757b8843f5cc4724cfe384038562d154 - fix test-charset-edit.js,
follow-up 6.

Not a nice way to work, but after the initial landing in comment #65 we were basically in bustage-fix mode.
Attachment #8877952 - Flags: review?(acelists)
(In reply to Jorg K (GMT+2) from comment #75)
> So my questions:
> - Any hints how to debug this? A little hard to dump out the input string at
> time of encoding.

Trying to dump the string as close to the encode step as possible seems to be the best bet apart from gdb.

To debug in gdb, the code of interest starts at https://dxr.mozilla.org/comm-central/source/mozilla/third_party/rust/encoding_rs/src/iso_2022_jp.rs#471

However, to debug it, you need to build Thunderbird with -Zdebug-macros in RUSTFLAGS. In the Firefox case, RUSTFLAGS is set in config/rules.mk. With that, setting a breakpoint by line number should work.

> - Under which circumstances would encoding to ISO-2022-JP return "had
> replacements" if the text consists of
>   語 and maybe a linefeed?

There shouldn't be any such circumstances. I just wrote a test case in Rust feeding 語, 語, LF and 語 to the encoder individually and it didn't misbehave.

> - Is there the remote chance of a bug in the encoder that the status is not
> reliable?

*Remote* chance exists, of course, but per my testing above, I don't think that's the issue here.
Flags: needinfo?(hsivonen)
(In reply to Henri Sivonen (:hsivonen) from comment #77)
> Trying to dump the string as close to the encode step as possible seems to
> be the best bet apart from gdb.

Dumping the encoder output would be useful, too.
Attached file 1363281.log - log of test (obsolete) —
OK, here we have the log. The previous test with the HTML message succeeds:

 0:02.52 TEST_STATUS: Thread-1 testBodyWithLongLine PASS [testBodyWithLongLine : 65] 1 == 1
 0:02.52 TEST_STATUS: Thread-1 testBodyWithLongLine PASS [testBodyWithLongLine : 67] "text/html; charset=ISO-2022-JP" == "text/html; charset=ISO-2022-JP"
 0:02.53 TEST_STATUS: Thread-1 testBodyWithLongLine PASS [testBodyWithLongLine : 65] 1 == 1
 0:02.53 TEST_STATUS: Thread-1 testBodyWithLongLine PASS [testBodyWithLongLine : 67] "base64" == "base64"
 0:02.53 TEST_STATUS: Thread-1 testBodyWithLongLine PASS [testBodyWithLongLine : 73] true == true
 0:02.54 TEST_STATUS: Thread-1 testBodyWithLongLine PASS [testBodyWithLongLine : 48] "<html><head><meta http-equiv=\\"content-type\\" content=\\"text/html; charset=ISO-2022-JP\\"></head><body>Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬× == "<html><head><meta http-equiv=\\"content-type\\" content=\\"text/html; charset=ISO-2022-JP\\"></head><body>Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×Þ¬×

For the failing test, I added this code:
  if (rv == NS_OK_HAD_REPLACEMENTS) {
    printf("=== %s\n", aCharset);
    for (uint32_t i=0; i<inString.Length();i++) printf("=== %lx\n", inString.CharAt(i));
    printf("=== Out\n");
    for (uint32_t j=0; j<outString.Length();j++) printf("=== %x\n", outString.CharAt(j));
    rv = aReportUencNoMapping ? NS_ERROR_UENC_NOMAPPING : NS_OK;
  }

and you can see the result in the log. Maybe I should be dumping it out some other way. I'm surprised that the input doesn't appear to be printing more than 0xFF. Also, what would be the replacement character in the output?
Flags: needinfo?(hsivonen)
Attachment #8877960 - Attachment mime type: text/x-log → text/plain
Hmm, never mind, something is wrong here, the input that shows the replacement is:
"=== 3c <"
"=== 68 h"
"=== 74 t"
"=== 6d m"
"=== 6c l"
"=== 3e >"
"=== 3c <"
"=== 68 h"
"=== 65 e"
"=== 61 a"
"=== 64 d"
"=== 3e >"
etc. In the plain text test case only the raw text without the tags should have been presented for conversion. I'll have check what's going on there.
Flags: needinfo?(hsivonen)
Looks like I'm not quite done here:

TEST-UNEXPECTED-FAIL | mailnews/base/test/unit/test_bug366491.js | xpcshell return code: 1 [log…]
PROCESS-CRASH | mailnews/base/test/unit/test_bug366491.js | application crashed [@ mozalloc_abort(char const*)] [log…]
TEST-UNEXPECTED-FAIL | mailnews/db/gloda/test/unit/test_mime_attachments_size.js | xpcshell return code: 1 [log…]
PROCESS-CRASH | mailnews/db/gloda/test/unit/test_mime_attachments_size.js | application crashed [@ mozalloc_abort(char const*)] [log…]
TEST-UNEXPECTED-FAIL | mailnews/extensions/bayesian-spam-filter/test/unit/test_bug228675.js | xpcshell return code: 1 [log…]
PROCESS-CRASH | mailnews/extensions/bayesian-spam-filter/test/unit/test_bug228675.js | application crashed [@ mozalloc_abort(char const*)] [log…]
TEST-UNEXPECTED-FAIL | mailnews/extensions/bayesian-spam-filter/test/unit/test_customTokenization.js | xpcshell return code: 1 [log…]
PROCESS-CRASH | mailnews/extensions/bayesian-spam-filter/test/unit/test_customTokenization.js | application crashed [@ mozalloc_abort(char const*)] [log…]
TEST-UNEXPECTED-FAIL | mailnews/extensions/bayesian-spam-filter/test/unit/test_junkAsTraits.js | xpcshell return code: 1 [log…]
PROCESS-CRASH | mailnews/extensions/bayesian-spam-filter/test/unit/test_junkAsTraits.js | application crashed [@ mozalloc_abort(char const*)] [log…]
TEST-UNEXPECTED-FAIL | mailnews/extensions/bayesian-spam-filter/test/unit/test_traits.js | xpcshell return code: 1 [log…]
PROCESS-CRASH | mailnews/extensions/bayesian-spam-filter/test/unit/test_traits.js | application crashed [@ mozalloc_abort(char const*)] [log…]
TEST-UNEXPECTED-FAIL | mailnews/imap/test/unit/test_partsOnDemand.js | xpcshell return code: 1 [log…]
PROCESS-CRASH | mailnews/imap/test/unit/test_partsOnDemand.js | application crashed [@ mozalloc_abort(char const*)] [log…]
TEST-UNEXPECTED-FAIL | mailnews/imap/test/unit/test_partsOnDemand.js | xpcshell return code: 1 [log…]
PROCESS-CRASH | mailnews/imap/test/unit/test_partsOnDemand.js | application crashed [@ mozalloc_abort(char const*)] [log…]
TEST-UNEXPECTED-FAIL | mailnews/mime/test/unit/test_attachment_size.js | xpcshell return code: 1 [log…]
PROCESS-CRASH | mailnews/mime/test/unit/test_attachment_size.js | application crashed [@ mozalloc_abort(char const*)] [log…]

https://archive.mozilla.org/pub/thunderbird/tinderbox-builds/comm-central-macosx64-debug/1497556368/comm-central_yosemite_r7-debug_test-xpcshell-bm106-tests1-macosx-build30.txt.gz says:

14:50:45     INFO -  PID 7304 | [7304] ###!!! ASSERTION: nsTDependentString must wrap only null-terminated strings. You are probably looking for nsTDependentSubstring.: 'mData[mLength] == 0', file /builds/slave/tb-c-cen-m64-d-000000000000000/build/objdir-tb/dist/include/nsTString.h, line 460
14:50:45     INFO -  PID 7304 | Hit MOZ_CRASH() at /builds/slave/tb-c-cen-m64-d-000000000000000/build/mozilla/memory/mozalloc/mozalloc_abort.cpp:33
14:50:45     INFO -  <<<<<<<
14:50:45     INFO -  mozcrash Copy/paste: /builds/slave/test/build/macosx64-minidump_stackwalk /var/folders/93/f4rdtxvx5xg5dzcxcm29xd4c00000w/T/xpc-other-Zr_IEs/D63F1B09-4E1B-49F1-85C2-C7C4EB663C4B.dmp /builds/slave/test/build/symbols
14:50:59     INFO -  mozcrash Saved minidump as /builds/slave/test/build/blobber_upload_dir/D63F1B09-4E1B-49F1-85C2-C7C4EB663C4B.dmp
14:50:59     INFO -  mozcrash Saved app info as /builds/slave/test/build/blobber_upload_dir/D63F1B09-4E1B-49F1-85C2-C7C4EB663C4B.extra
14:50:59  WARNING -  PROCESS-CRASH | mailnews/extensions/bayesian-spam-filter/test/unit/test_bug228675.js | application crashed [@ mozalloc_abort(char const*)]
14:50:59     INFO -  Crash dump filename: /var/folders/93/f4rdtxvx5xg5dzcxcm29xd4c00000w/T/xpc-other-Zr_IEs/D63F1B09-4E1B-49F1-85C2-C7C4EB663C4B.dmp
14:50:59     INFO -  Operating system: Mac OS X
14:50:59     INFO -                    10.10.5 14F27
14:50:59     INFO -  CPU: amd64
14:50:59     INFO -       family 6 model 69 stepping 1
14:50:59     INFO -       4 CPUs
14:50:59     INFO -  GPU: UNKNOWN
14:50:59     INFO -  Crash reason:  EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
14:50:59     INFO -  Crash address: 0x0
14:50:59     INFO -  Process uptime: 0 seconds
14:50:59     INFO -  Thread 0 (crashed)
14:50:59     INFO -   0  libmozglue.dylib!mozalloc_abort(char const*) [mozalloc_abort.cpp:75be6742abb9 : 33 + 0x0]
14:50:59     INFO -      rax = 0x0000000000000000   rdx = 0x00007fff759f11f8
14:50:59     INFO -      rcx = 0x0000000000000000   rbx = 0x00007fff759f1c50
14:50:59     INFO -      rsi = 0x0000350000003500   rdi = 0x0000340000003503
14:50:59     INFO -      rbp = 0x00007fff57592690   rsp = 0x00007fff57592680
14:50:59     INFO -       r8 = 0x00007fff57592630    r9 = 0x00007fff76274300
14:50:59     INFO -      r10 = 0x0000000000000000   r11 = 0x0000000000000246
14:50:59     INFO -      r12 = 0x0000000000000001   r13 = 0x000000010e1e2fc8
14:50:59     INFO -      r14 = 0x00007fff57592700   r15 = 0x00007fff57595b51
14:50:59     INFO -      rip = 0x000000010fecba99
14:50:59     INFO -      Found by: given as instruction pointer in context
14:50:59     INFO -   1  XUL!Abort [nsDebugImpl.cpp:75be6742abb9 : 451 + 0x5]
14:50:59     INFO -      rbx = 0x00007fff759f1c50   rbp = 0x00007fff575926a0
14:50:59     INFO -      rsp = 0x00007fff575926a0   r12 = 0x0000000000000001
14:50:59     INFO -      r13 = 0x000000010e1e2fc8   r14 = 0x00007fff57592700
14:50:59     INFO -      r15 = 0x00007fff57595b51   rip = 0x0000000108e4e769
14:50:59     INFO -      Found by: call frame info
14:50:59     INFO -   2  XUL!NS_DebugBreak [nsDebugImpl.cpp:75be6742abb9 : 438 + 0x8]
14:50:59     INFO -      rbx = 0x00007fff759f1c50   rbp = 0x00007fff57592b30
14:50:59     INFO -      rsp = 0x00007fff575926b0   r12 = 0x0000000000000001
14:50:59     INFO -      r13 = 0x000000010e1e2fc8   r14 = 0x00007fff57592700
14:50:59     INFO -      r15 = 0x00007fff57595b51   rip = 0x0000000108e4e540
14:50:59     INFO -      Found by: call frame info
14:50:59     INFO -   3  XUL!ConvertToUTF8(char const*, int, char const*, nsACString&) [nsTString.h:75be6742abb9 : 458 + 0x25]
14:50:59     INFO -      rbx = 0x000000010fb19010   rbp = 0x00007fff57592bf0
14:50:59     INFO -      rsp = 0x00007fff57592b40   r12 = 0x000000000000003a
14:50:59     INFO -      r13 = 0x000000011a031000   r14 = 0x00007fff57592c00
14:50:59     INFO -      r15 = 0x00007fff57592b50   rip = 0x0000000108d8f7e5

So something wrong in the new ConvertToUTF8().
Attachment #8875618 - Flags: review?(Pidgeot18)
Comment on attachment 8877952 [details] [diff] [review]
1363281-follow-up6.patch - fix test-charset-edit.js

Review of attachment 8877952 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me, but what was the reason to also drop the japanese text ? What is the encoding of the 'nonASCII' string?

::: mail/test/mozmill/composition/test-charset-edit.js
@@ +140,5 @@
>    let msg = select_click_row(0);
>    assert_selected_and_displayed(mc, msg);
> +  assert_equals(getMsgHeaders(msg).get("").charset, "UTF-16LE");
> +  // We need to check for includes() here since the message
> +  // also contains CRLF, which is interepreted as invalid UTF-16.

Typo.
Attachment #8877952 - Flags: review?(acelists) → review+
Sorry about the typo.

The Japanese text was fine when I had UTF-7 as decodable but invalid charset since UTF-7 is pure ASCII.

Now I was looking for a non-ASCII string with an obvious UFT-16 encoding and found the € sign. The encoding is UTF-16LE, so the 0x20 and 0xAC get swapped around. UFT-16 is quite complicated, so this was my best bet:
https://en.wikipedia.org/wiki/UTF-16#Examples
Depends on: 1511950
You need to log in before you can comment on or make changes to this bug.