Use UTF-8 for all outgoing email

Status: NEW, Unassigned
Product: MailNews Core
Component: Composition
Opened: 5 years ago
Updated: 7 months ago
Reporter: hsivonen
Keywords: intl
Firefox Tracking Flags: Not tracked

(Reporter)

Description

5 years ago
Dealing with character encodings adds a lot of complexity to code. Moreover, there is even more complexity arising from MailNews supporting more character encodings than what are healthy to support for the Web. Meanwhile, UTF-8 can represent everything that is possible to write in e-mail, was invented a bit over 20 years ago and has been broadly supported for at least 10 years or so.

Considering that virtually everyone has been able to receive UTF-8-encoded e-mail for years by now, I suggest making MailNews Core always send e-mail as UTF-8 so that the code for managing the outgoing encoding can be removed and, therefore, complexity, including user interface complexity, be reduced.

Comment 1

5 years ago
That's technically a duplicate of bug 224391 but sure worth revisiting after a substantial time has passed. Looking at bug 410333, there have been issues with specific locales, thus the default certainly will still have to be configurable by individual localizations. While mailnews.send_default_charset is defined in mailnews.js, it refers to messenger.properties which is local to each application (mail/ vs. suite/), so that's an application-specific decision.
Keywords: intl

Comment 2

5 years ago
tl;dr - Defaulting to UTF-8 is probably safe now; removing support for emitting non-UTF-8 might not be.

(In reply to Henri Sivonen (:hsivonen) from comment #0)
> Dealing with character encodings adds a lot of complexity to code. Moreover,
> there is even more complexity arising from MailNews supporting more
> character encodings than what are healthy to support for the Web. Meanwhile,
> UTF-8 can represent everything that is possible to write in e-mail, was
> invented a bit over 20 years ago and has been broadly supported for at least
> 10 years or so.
> 
> Considering that virtually everyone has been able to receive UTF-8-encoded
> e-mail for years by now, I suggest making MailNews Core always send e-mail
> as UTF-8 so that the code for managing the outgoing encoding can be removed
> and, therefore, complexity, including user interface complexity, be reduced.

At this point, I'm looking at making RFC 2047 encoded words always use UTF-8, largely because the JS TextEncoder service I'm using only supports it, but also because it's painful enough to get the algorithm right with UTF-8; any other multibyte charset is annoying.

(In reply to rsx11m from comment #1)
> That's technically a duplicate of bug 224391 but sure worth revisiting after
> a substantial time has passed. Looking at bug 410333, there have been issues
> with specific locales, thus the default certainly will still have to be
> configurable by individual localizations. While
> mailnews.send_default_charset is defined in mailnews.js, it refers to
> messenger.properties which is local to each application (mail/ vs. suite/),
> so that's an application-specific decision.

I recall UTF-8 versus non-UTF-8 coming up before in mdat or tb-planning. As of 2008, UTF-8 support in Japan was insufficient enough to cause bug 448842 to be filed. In a conversation where I brought that bug up a few years later, it was suggested that this is no longer a major issue.

Rereading bug 224391, it seems the other area of possible contention is use of UTF-8 in Usenet. Citing http://www.crampe.eu.org/statfr/tout.html:
  36.58% :	Mozilla
  19.23% :	MesNews
  10.23% :	Microsoft Outlook Express
  9.47% :	G2
  5.84% :	MacSOUP
  2.68% :	Microsoft Windows Live Mail
  2.62% :	Pan
  1.72% :	Microsoft Windows Mail
  1.61% :	[40tude Dialog]
  1.31% :	slrn
  1.25% :	bleachbot
  1.23% :	Forte Agent 
  6.20% :	[everybody else]

Mozilla, G2, Microsoft * all support UTF-8 unless their architecture is dumb. MesNews has some mumbles in their documentation about HTML, so I suspect they support UTF-8 as well. MacSOUP's documentation lists that it added support for UTF-8 in 2.5, which appears to date to about the same time as OS X 10.1 (!). Pan appears to use UTF-8 internally; slrn and Forte Agent both appear to support UTF-8 as well. I can't find any documentation to suggest whether or not UTF-8 is supported in 40tude, and bleachbot is just a posting bot, so it doesn't count.

To sum it up: I'm willing to believe that there are no major barriers to defaulting to UTF-8. Removing the machinery that allows encoding to legacy charsets is probably not tenable at this point in time. I would suggest, if we do this, that we keep the ability to quickly undo the change until it has fully baked on the ESR branches.

Comment 3

5 years ago
The ability to support other encodings certainly should be retained. The user may want an explicit encoding different from UTF-8 or the default could be different for a specific locale based on the localizer's judgment. Thus, only changing mailnews.send_default_charset seems to be the way to go. Removing the messenger.properties dependency should work as long as localizers can override in all-l10n.js.

The other question is whether or not to change the handling of mailnews.view_default_charset in the process. It may be a bit more tricky here, especially given that a message may not have any charset definition at all, at which point defaulting to UTF-8 rather than a local encoding may be the wrong thing to do (most likely that's for a different bug though, but should be considered to maintain parity between sending and receiving ends).
(Reporter)

Comment 4

5 years ago
(In reply to Joshua Cranmer [:jcranmer] from comment #2)
> tl;dr - Defaulting to UTF-8 is probably safe now;

I guess even that would be progress.

> removing support for emitting non-UTF-8 might not be.

:-( Then we wouldn't be able to actually simplify code.

(In reply to rsx11m from comment #3)
> The
> user may want an explicit encoding different from UTF-8 

"May want to" is a terrible reason. This should be based on compat data.

> or the default could be different for a specific locale based on the localizer's judgment.

Again, things should be based on compat data instead of localizers second-guessing data.

> but
> should be considered to maintain parity between sending and receiving ends).

Why is parity needed there? I'm suggesting always sending out UTF-8 no matter what sort of message is being replied to. If the recipient could receive a fresh non-reply UTF-8 message, they can deal with UTF-8 replies, too.

Comment 5

5 years ago
My understanding is that many non-English locales treat UTF-8-encoded messages as inherently more likely to be spam than messages in their language-specific encoding. This info may be a few years out of date, but if we're going to go down this road, we should be really sure that it's no longer the case.

Comment 6

5 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #4)
> (In reply to Joshua Cranmer [:jcranmer] from comment #2)
> > tl;dr - Defaulting to UTF-8 is probably safe now;
> 
> I guess even that would be progress.
> 
> > removing support for emitting non-UTF-8 might not be.
> 
> :-( Then we wouldn't be able to actually simplify code.

The most complicated code with respect to charset encoding is preparing RFC 2047 encoded words, which I am already moving to be UTF-8-only. For everything else, the code basically looks like this:

let binaryData = charset-encode(charset, text);
[ If you get an exception, fall back to UTF-8 ]
let asciiData = qpEncode | base64Encode | percentEncode (binaryData);

So unless you're planning on ripping out all of our non-UTF-8 charset encoders, there isn't a whole lot of code-simplification going on.
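
To make the two-step flow above concrete, here is a minimal JavaScript sketch. It is not the actual MailNews code: `encoders`, `charsetEncode`, and `qpEncode` are hypothetical names, the encoder table covers only ISO-8859-1 and UTF-8, and the quoted-printable step deliberately ignores line-length limits and trailing-whitespace rules.

```javascript
// Hypothetical encoder table: charset label -> function returning bytes.
// Only two entries are sketched; a real client would carry many more.
const encoders = {
  "utf-8": (text) => new TextEncoder().encode(text),
  "iso-8859-1": (text) => {
    const bytes = new Uint8Array(text.length);
    for (let i = 0; i < text.length; i++) {
      const unit = text.charCodeAt(i);
      // Anything above U+00FF has no ISO-8859-1 representation.
      if (unit > 0xff) throw new RangeError("unencodable character");
      bytes[i] = unit;
    }
    return bytes;
  },
};

// Step 1: charset-encode; on any failure (unencodable text or an unknown
// label) fall back to UTF-8 and report the charset actually used.
function charsetEncode(charset, text) {
  try {
    return { charset, bytes: encoders[charset.toLowerCase()](text) };
  } catch (e) {
    return { charset: "UTF-8", bytes: encoders["utf-8"](text) };
  }
}

// Step 2: make the bytes ASCII-safe. Only quoted-printable is shown, and
// only its byte escaping (no soft line breaks).
function qpEncode(bytes) {
  let out = "";
  for (const b of bytes) {
    // Printable ASCII and space pass through; '=' (0x3D) must be escaped.
    const literal = (b >= 0x20 && b <= 0x7e) && b !== 0x3d;
    out += literal
      ? String.fromCharCode(b)
      : "=" + b.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}
```

In this sketch the fallback lives entirely in `charsetEncode`, so the transfer-encoding step never needs to know which charset was chosen. That is the sense in which dropping legacy charsets would remove little code beyond the encoder table itself.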

> > but
> > should be considered to maintain parity between sending and receiving ends).
> 
> Why is parity needed there? I'm suggesting always sending out UTF-8 no
> matter what sort of message is being replied to. If the recipient could
> receive a fresh non-reply UTF-8 message, they can deal with UTF-8 replies,
> too.

A. I don't 100% trust libmime to work properly if we start using different charsets when including original message data (go figure).
B. There is more at stake here than just email--Usenet has slightly different conventions. When the message is clearly in the full throes of MIME (i.e., it uses quoted-printable or base64 for message data), then there is almost no concern. If the message is using 8-bit non-UTF-8 body parts, there is a small chance that downstream people might be expecting that charset in particular and not UTF-8.

This is all speculative, though, and I definitely want real-world usage feedback before we decide to irrevocably pull the plug on making non-UTF-8 messages.

Comment 7

5 years ago
As an example of a preference other than UTF-8 in some locales: the GB18030 charset can encode any Unicode codepoint, but it is biased in favour of Chinese (while UTF-8 is biased in favour of ASCII and Latin); also, it must (by law) be available on all computer hardware and software sold in mainland China. If Thunderbird cannot send email in GB18030, it might quite well get banned from distribution in China.

I agree with jcranmer (comment #2): “Defaulting to UTF-8 is probably safe now; removing support for emitting non-UTF-8 might not be.”
(Reporter)

Comment 8

5 years ago
(In reply to Tony Mechelynck [:tonymec] from comment #7)
> If Thunderbird cannot send email in GB18030, it might quite well get
> banned from distribution in China.

I think we shouldn't make decisions based on speculation about what regulation actually says. This should be checked with legal when we'd otherwise be ready to make output UTF-8-only. Specifically, [citation needed] for PRC regulation requiring GB18030 encoding support for software *output* (as opposed to supporting the particular set of characters however encoded or supporting the GB18030 encoding for input).

Comment 9

5 years ago
Google gives a bunch of matches. Do you consider IANA authoritative enough?

(Quoting http://www.iana.org/assignments/charset-reg/GB18030)
>    GB18030 is a "mandatory" standard: starting September 1, 2001, all
>    operating systems sold in Mainland China must support this
>    standard.  (Embedded systems and PDAs are currently exempt.)
>    Eventually, end-user applications must also fully support the
>    GB18030 standard--mere UTF-8 support is not enough. [...]

Now you can argue what "support" means, but in the context of e-mail and news messages, I don't see why they would require it only on one end (receiving yes, sending no). Given that it's their answer to UTF-8 (and essentially their national version thereof), that interpretation wouldn't make much sense.

(In reply to Henri Sivonen (:hsivonen) from comment #4)
> > or the default could be different for a specific locale based on the localizer's judgment.
> 
> Again, things should be based on compat data instead of localizers
> second-guessing data.

The case discussed right now is part of the reason why localizers should be involved in this discussion. They should have a better understanding about culture and technical or legal issues that "compat data" may not register.

Comment 10

5 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #8)
> I think we shouldn't make decisions based on speculation about what
> regulation actually says. This should be checked with legal when we'd
> otherwise be ready to make output UTF-8-only.

That is rather upside-down. First you have to figure out what factors need to be considered in deciding whether or not all locales can be forced to emit UTF-8 only, then you can go ahead and figure out which parts of the supporting code to remove and how. And I absolutely don't see a need to rush any decision in this regard. It would be a major step whose consequences need to be carefully considered.

Comment 11

There's actually a Usenet thread going on right now that uses a lot of Unicode characters in place of regular ASCII text (for example:

Ḭ իԲνҿ ท
(Reporter)

Updated

4 years ago
Depends on: 918294
(Reporter)

Comment 12

4 years ago
FWIW, the Russian Localization of Thunderbird defaults to UTF-8 for outgoing email already.
(Reporter)

Comment 13

4 years ago
Filed bug 941545 as a first step.
(Reporter)

Comment 14

2 years ago
(In reply to Joshua Cranmer [:jcranmer] from comment #2)
> As
> of 2008, UTF-8 support in Japan was insufficient enough to cause bug 448842
> to be filed.

Gmail has eliminated their UTF-8 avoidance pref. They now send all outgoing email as UTF-8, even when the user uses Japanese-localized Gmail in a Japanese-localized browser from an IP address assigned to Japan and uses only ISO-2022-JP-encodable characters. This has been the case since at least May 2015.

Stuff like bug 1202401 keeps coming up. I think the Thunderbird developers' time (and mine when changing things on the Gecko side) would be better allocated to other things--as it could be if all the code dealing with UTF-8 avoidance in the sending side of mailnews was gone once and for all.

Comment 15

So in bug 998191, I landed a change that forced all RFC 2047 encoding to use UTF-8 [1], and bug 1324443 made the RFC 2231 encoding use it as well. In searching for bugs, I see no evidence that anyone has complained about this change or noticed it (although I haven't exactly been exhaustive in thinking up synonyms for "garbled subject on outgoing messages" for uninformed bug filers).

This strongly suggests that there is no software in any of our domains that cannot support UTF-8 at a technical level. That said, the question is still unanswered as to whether or not the policy configurations permit UTF-8. I'm uncomfortable making this sort of change without testing it first--and the easiest way to test is to make the change in the first place.

I'm open to testing this, so long as we have a plan in place for backing the change out quickly if the answer turns out to be "no, we can't do this." Discovering the answer "yes" would require a few months of baking in the public release before I could breathe a sigh of relief.

[1] I made this decision largely on the basis that emitting RFC 2047 correctly (i.e., getting the encoded-word breaks correct) is difficult enough as it is. Requiring UTF-8 means that I can take advantage of its self-synchronizing nature to avoid breaking in the middle of a character (which the old encoder was capable of doing, since it worked by deleting characters one by one until it fit).
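
As an illustration of the self-synchronizing property mentioned in [1], the following JavaScript sketch splits UTF-8 bytes into RFC 2047 "B" encoded words without cutting a character in half. It is a sketch under stated assumptions, not the encoder that actually landed: `splitUtf8`, `toBase64`, and `encodeWords` are hypothetical names, and the 45-byte chunk size is derived here from RFC 2047's 75-character limit per encoded word (75 minus the 12 characters of `=?UTF-8?B?` and `?=` leaves 63, which holds 60 base64 characters, i.e. 45 bytes).

```javascript
// Split valid UTF-8 bytes into chunks of at most maxBytes each, never
// cutting inside a multi-byte character.
function splitUtf8(bytes, maxBytes) {
  // A UTF-8 character is at most 4 bytes, so smaller limits cannot work.
  if (maxBytes < 4) throw new RangeError("maxBytes must be at least 4");
  const chunks = [];
  let start = 0;
  while (start < bytes.length) {
    let end = Math.min(start + maxBytes, bytes.length);
    // Continuation bytes have the form 10xxxxxx. If the cut would land on
    // one, back up until it sits on the first byte of a character.
    while (end < bytes.length && (bytes[end] & 0xc0) === 0x80) end--;
    chunks.push(bytes.subarray(start, end));
    start = end;
  }
  return chunks;
}

// Base64 helper that works where btoa exists and falls back to Node's Buffer.
function toBase64(bytes) {
  const binary = String.fromCharCode(...bytes);
  return typeof btoa === "function"
    ? btoa(binary)
    : Buffer.from(binary, "latin1").toString("base64");
}

// Produce one encoded word per chunk; header folding between the words is
// left to the caller.
function encodeWords(text) {
  const bytes = new TextEncoder().encode(text);
  return splitUtf8(bytes, 45).map((c) => "=?UTF-8?B?" + toBase64(c) + "?=");
}
```

Because a lead byte can be recognized by looking at a single byte, finding a safe break costs at most three steps backwards. A stateful charset such as ISO-2022-JP offers no comparable local test, which is presumably part of why the old delete-one-character-and-retry approach was fragile.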
(Reporter)

Comment 16

7 months ago
(In reply to Joshua Cranmer [:jcranmer] from comment #15)
> This strongly suggests that there is no software in any of our domains that
> cannot support UTF-8 at a technical level.

Email gateways in Japan have now had two years of Gmail not sending ISO-2022-JP to deal with, so hopefully the gateways have been forced to cope by now (if they hadn't been already two years ago).

> That said, the question is still
> unanswered as to whether or not the policy configurations permit UTF-8.

Is this in reference to GB18030? As far as I can tell as a person not residing in China and not reading Chinese, the use of GB18030-the-encoding in protocol output is not of interest and the issue that matters is that software can handle the Unicode ranges that are applicable to languages used in China and are designated as such by GB18030-the-standard. (Gecko is able to.)

Note that UTF-only formats like JSON seem to be doing fine in this respect.

Comment 17

(In reply to Henri Sivonen (:hsivonen) from comment #16)
> (In reply to Joshua Cranmer [:jcranmer] from comment #15)
> > That said, the question is still
> > unanswered as to whether or not the policy configurations permit UTF-8.
> 
> Is this in reference to GB18030? As far as I can tell as a person not
> residing in China and not reading Chinese, the use of GB18030-the-encoding
> in protocol output is not of interest and the issue that matters is that
> software can handle the Unicode ranges that are applicable to languages used
> in China and are designated as such by GB18030-the-standard. (Gecko is able
> to.)

No, it's a reference to idiots who might do things like "mark all UTF-8 mail as spam" or NNTP servers that reject all UTF-8 posts. As I said, the best approach here is to just flip the switch while being prepared to unflip it if people start complaining.