Closed Bug 1571672 Opened 5 years ago Closed 4 years ago

Enable UTF-8 IMAP store (RFC 6855: add Support for UTF8=ACCEPT and UTF8=ONLY)

Categories

(MailNews Core :: Networking: IMAP, enhancement, P1)

enhancement

Tracking

(thunderbird_esr78 wontfix, thunderbird82 wontfix)

RESOLVED FIXED
83 Branch
Tracking Status
thunderbird_esr78 --- wontfix
thunderbird82 --- wontfix

People

(Reporter: vesely, Assigned: gds)

References

Details

Attachments

(1 file, 20 obsolete files)

82.05 KB, patch
mkmelin
: review+
Details | Diff | Splinter Review

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0

Steps to reproduce:

Open a message that contains non-ASCII header fields from a Courier-IMAP mailstore.

Actual results:

Get a warning like:
Message 2211 appears to be a Unicode message and your E-mail reader did not enable Unicode support. Please use an E-mail reader that supports IMAP with UTF-8.

Note that alert messages are displayed in an unpredictable fashion.

Expected results:

I noticed no hiccups when receiving messages with utf-8 header fields. I can copy them to and from local folders without harm. Note, in particular, that Courier-IMAP does not downgrade messages —that operation is unsafe if it involves, for example, an utf-8 local part.

Because of bug #1563891, I cannot reply to messages from utf-8 local parts addresses. However, they are received well. The proper action would be to let the server know, by adding an IMAP command after login. For example:

server: * CAPABILITY IMAP4rev1 UIDPLUS [...] ENABLE UTF8=ACCEPT
server: 5 OK CAPABILITY completed
client: 6 ENABLE UTF8=ACCEPT
server: * ENABLED UTF8=ACCEPT
server: 6 OK Options enabled

Component: Untriaged → Networking: IMAP
Product: Thunderbird → MailNews Core

There's more than just sending ENABLE UTF8. The complete specification is RFC 6855.

Seems like a larger job.

Hi Jorg,
(In reply to Jorg K (GMT+2) from comment #2)

Seems like a larger job.

IME, the behavior is fine as is. Yes, there may be cases... there always are. To send ENABLE UTF8 will make it official that messages can have utf-8 values in the header. That's the right direction anyway. Possible issues will be addressed when they emerge, as usual.

The alternative, i.e. not enabling UTF8=ACCEPT, can be worse with varying IMAP servers. As RFC 6855 suggests, servers can hide EAI mail or produce surrogates. That would be decisively annoying. Users unable to get their messages won't like it —and we know sporadic unencoded Subject:'s leaked into our Inboxes even before our servers supported SMTPUTF8.

So, I ran into this recently when trying to figure out why Seamonkey's message filtering was no longer working and why I kept getting errors in my huge inbox. (I use Courier IMAP which appears to implement 6855/6856 rather aggressively).

Everything on my linux box has been UTF-8 by default for a couple of decades. Is there any magic flag in about:config or anything lower level to just make Thunderbird/Seamonkey send out this UTF8=ACCEPT so stuff starts working again? I am totally fine with anything manual even if you guys want to be cautious about doing it by default, and all my other clients still work fine.

(In reply to nemo from comment #4)

Everything on my linux box has been UTF-8 by default for a couple of decades. Is there any magic flag in about:config or anything lower level to just make Thunderbird/Seamonkey send out this UTF8=ACCEPT so stuff starts working again? I am totally fine with anything manual even if you guys want to be cautious about doing it by default, and all my other clients still work fine.

I don't see anywhere in the tb code where it responds to imap capability UTF8=ACCEPT. However, this doesn't look like it would be hard to do. Just send ". enable UTF8=ACCEPT" whenever a new connection is authenticated if the capability is present (and ENABLE capability is present which seem like it would be implied by UTF8=ACCEPT).

It could also be disabled by a new pref in case it causes problems.

Should I put this on my list and do this?

Go for it! :)

Attached patch UTF8-accept.diff (obsolete) — Splinter Review

Here's a possible fix for this bug. It now recognizes the UTF8=ACCEPT capability and, on entering AUTHENTICATED imap state, sends the "ENABLE UTF8=ACCEPT" command. Since no action is needed on the OK response, the response is not parsed and acted upon (except to log a possible NO or BAD response).

There are a few other capabilities defined by the same RFC that defines UTF8=ACCEPT. Should we also recognize UTF8=ONLY, UTF8=ALL and UTF8=USER as described in https://tools.ietf.org/html/rfc6855?

I tested this with gmail that advertises UTF8=ACCEPT and I see the enable occur and it succeeds. I haven't tested for any other different behavior. If anyone can test this in a useful way I can make a "try" build. Just need to know your architecture (win64, OSX etc) and if you want it applied to daily trunk or 68.

Assignee: nobody → gds
Attachment #9145349 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9145349 [details] [diff] [review]
UTF8-accept.diff

Review of attachment 9145349 [details] [diff] [review]:
-----------------------------------------------------------------

One thing we should make sure to handle is 

   Once an IMAP client has enabled UTF-8 support with the "ENABLE
   UTF8=ACCEPT" command, it MUST NOT issue a "SEARCH" command that
   contains a charset specification.  If an IMAP server receives such a
   "SEARCH" command in that situation, it SHOULD reject the command with
   a "BAD" response (due to the conflicting charset labels).

I think we also want a server pref to disable it, especially for debugging purposes. Like e.g. condstore https://searchfox.org/comm-central/search?q=condstore&path=

::: mailnews/imap/src/nsImapCore.h
@@ +143,5 @@
>  const eIMAPCapabilityFlag kHasSpecialUseCapability =      0x200000000LL;  /* RFC 6154: Sent, Draft etc. folders */
>  const eIMAPCapabilityFlag kGmailImapCapability =          0x400000000LL;  /* X-GM-EXT-1 capability extension for gmail */
>  const eIMAPCapabilityFlag kHasXOAuth2Capability =         0x800000000LL;  /* AUTH XOAUTH2 extension */
>  const eIMAPCapabilityFlag kHasClientIDCapability =        0x1000000000LL; /* ClientID capability */
> +const eIMAPCapabilityFlag kHasUTF8AcceptCapability =      0x2000000000LL; /* UTF8=ACCEPT capability */

please add: ", see RFC 6855"

::: mailnews/imap/src/nsImapProtocol.cpp
@@ +5317,5 @@
>    if (NS_SUCCEEDED(rv)) ParseIMAPandCheckForNewMail();
>  }
>  
> +// Should we also enable UTF8=ONLY, UTF8=ALL and UTF8=USER as described in
> +// RFC 6855?

No, UTF8=USER and UTF8=ALL are listed as obsolete in that RFC.
Also, it says " For the client, "ENABLE UTF8=ACCEPT" is always used -- never "ENABLE UTF8=ONLY".
Attachment #9145349 - Flags: review?(mkmelin+mozilla) → feedback+
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Priority: -- → P1
Attached patch UTF8-accept-v2.diff (obsolete) — Splinter Review

I tested this on emails with Russian characters and it works as expected. I added the ability to inhibit the "enable UTF8=ACCEPT". I also I remove the "CHARSET <specification>" sub-string from the search string when UTF8=ACCEPT is in effect. There is probably a better way to do what I did but it does work. Tested with gmail and if "CHARSET <specification>" not removed, gmail returns an error and the search fails. Also, if server advertises UTF8=ONLY, I treat it like UTF8=ACCEPT; I think that is correct. (I don't check for the UTF8=<> items you pointed out as obsolete.)

Attachment #9145349 - Attachment is obsolete: true
Attachment #9146988 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9146988 [details] [diff] [review]
UTF8-accept-v2.diff

Review of attachment 9146988 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/imap/public/nsIImapIncomingServer.idl
@@ +76,5 @@
>  
> +  /**
> +   * See IMAP RFC 6855 
> +   **/
> +  attribute boolean disableUtf8Accept;

usually better to do the positive form. maybe useUTF8Accept

(like useCondStore, and useCompressDeflate)

::: mailnews/imap/src/nsImapProtocol.cpp
@@ +7560,5 @@
> +  if ((GetServerStateParser().GetCapabilityFlag() & kHasUTF8AcceptCapability) &&
> +      !m_disableUtf8Accept &&
> +     (searchCriteria == PL_strcasestr(searchCriteria, "SEARCH CHARSET "))) {
> +    nsAutoCString tempString(searchCriteria);
> +    // Example search string: "SEARCH CHARSET UTF-8 TEXT {6}"

Hmm... do we ever send a charset? I didn't find where. Maybe it should be fixed at that point if we do.
Attachment #9146988 - Flags: review?(mkmelin+mozilla) → review?

Hmm... do we ever send a charset? I didn't find where. Maybe it should be fixed at that point if we do.

Yes we do send "search charset <spec> ..." here:
https://searchfox.org/comm-central/rev/edbbc461dc1489444ca7ab72fe5da32c826b28fe/mailnews/base/search/src/nsMsgImapSearch.cpp#110

I too preferred to just skip putting it in at that point instead of taking it out later. However, we only want to not put it in when capability UTF8=ACCEPT has been enabled. I don't know how make this known in nsMsgImapSearch.cpp when the enable occurs in nsImapProtocol.cpp. Maybe an access function of some sort?

Edit: Looking closer at this, I can't figure it out. If you really want this and think it can be done, I need some good advice on how to do it.

usually better to do the positive form. maybe useUTF8Accept

No problem, I can do this.

I guess you should be able to get the server from the folder and then the pref from the server. https://searchfox.org/comm-central/rev/edbbc461dc1489444ca7ab72fe5da32c826b28fe/mailnews/base/search/src/nsMsgSearchTerm.cpp#1638

Then that would need to be passed on to the searchadapter - https://searchfox.org/comm-central/rev/edbbc461dc1489444ca7ab72fe5da32c826b28fe/mailnews/base/search/src/nsMsgImapSearch.cpp#17
(btw, gross that the .h file there is "nsMsgSearchImap.h !!)

Attached patch UTF8-accept-v3.diff (obsolete) — Splinter Review

Thanks for the hints. I think I did what you asked. With UTF8=ACCEPT in effect, It now leaves off the "CHARSET <name>" where the search string is created instead of removing it later before it is sent.
.
One thing I'm not sure is right: The function Encode() is declared as static. I needed to use the member variable m_scope inside of it but that's not allowed. I removed the static and it worked with m_scope inside. I then went back and put the static back and passed the variable m_scope into Encode() as a new parameter, called scope and used it inside. That also worked and seems better to add a parameter rather than remove the static designation.

It seems to be working OK but I probably need to test it a bit more and I've left a few printf's in the diff for now. Also, there are some blanks and end of lines I need to remove.

(btw, gross that the .h file there is "nsMsgSearchImap.h !!)

Yes, I noticed that too. ...SearchImap.h and ...ImapSearch.cpp

Attachment #9146988 - Attachment is obsolete: true
Attachment #9146988 - Flags: review?
Attachment #9149255 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9149255 [details] [diff] [review]
UTF8-accept-v3.diff

Review of attachment 9149255 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me. r=mkmelin with a few minor things changed

::: mailnews/base/search/src/nsMsgImapSearch.cpp
@@ +124,5 @@
> +    if (csname) {
> +      // We have a "CHARSET <name>" string which is typically appended to 
> +      // "SEARCH". But don't append it if server has UTF8=ACCEPT and ENABLE 
> +      // capabilities and if pref allowUtf8Accept is true. 
> +      printf("gds: getting imapFolder\n");

please remove the debugs

::: mailnews/imap/public/nsIImapIncomingServer.idl
@@ +75,5 @@
>    attribute long autoSyncMaxAgeDays;
>  
> +  /**
> +   * See IMAP RFC 6855 
> +   **/

nit: ending is just a single star
*/

@@ +76,5 @@
>  
> +  /**
> +   * See IMAP RFC 6855 
> +   **/
> +  attribute boolean allowUtf8Accept;

the casing is usually UTF8, so
allowUTF8Accept

::: mailnews/mailnews.js
@@ +576,5 @@
>  
>  // Can we change the store type without conversion? (=has the store been used)
>  pref("mail.server.default.canChangeStoreType", false);
>  
> +// Set this false to inhibit enabling of UTF8=ACCEPT if problems arise.

Please change to
// Enable using UTF8=ACCEPT (RFC 6855).
Attachment #9149255 - Flags: review?(mkmelin+mozilla) → review+
Comment on attachment 9149255 [details] [diff] [review]
UTF8-accept-v3.diff

Running with this for a bit, I now notice that Gmail mangled my folder names (I only saw this after a restart or two). 
My Junk mail for that account is in Swedish so it's Skräppost, but it's now showing as Skrᅢᄂppost.

Will have to investigate. Perhaps best to have the pref off for now
Attachment #9149255 - Flags: review+

2020-05-24 11:36:00.874872 UTC - [(null) 7281: IMAP]: I/IMAP 0x7f0b5fadd800:imap.gmail.com:A:CreateNewLineFromSocket: * XLIST (\HasNoChildren \Spam) "/" "[Gmail]/Skräppost"

Maybe it's Thunderbird who doesn't handle that...

2020-05-24 11:36:25.662826 UTC - [(null) 7281: Main Thread]: D/IMAP proposed url = [Gmail]/Skräppo folder for connection INBOX has To Wait = false can run = false
2020-05-24 11:36:26.021864 UTC - [(null) 7281: IMAP]: I/IMAP 0x7f0b5f160000:imap.gmail.com:A:SendData: 6 select "[Gmail]/Skräppo"
2020-05-24 11:36:26.052594 UTC - [(null) 7281: IMAP]: D/IMAP ReadNextLine [stream=0x7f0b5ed2b3a0 nb=66 needmore=0]
2020-05-24 11:36:26.052611 UTC - [(null) 7281: IMAP]: I/IMAP 0x7f0b5f160000:imap.gmail.com:A:CreateNewLineFromSocket: 6 NO [NONEXISTENT] Unknown Mailbox: [Gmail]/Skräppo (Failure)

Or maybe have the pref on ifdef nightly.

(In reply to Magnus Melin [:mkmelin] from comment #17)

Running with this for a bit, I now notice that Gmail mangled my folder names
(I only saw this after a restart or two).
My Junk mail for that account is in Swedish so it's Skräppost, but it's now
showing as Skrᅢᄂppost

Pre Unicode IMAP used a modified UTF-7 encoding. Courier-MTA documents this here. However, after playing with iconv for a few minutes, I haven't been able to find out how the 'ᅢᄂ' came about.

Attached patch 1571672-UTF8-accept-v4.patch (obsolete) — Splinter Review

Magnus, I made the changes you requested. I left the enabling pref true. Not sure if that's how you want it.

Attachment #9149255 - Attachment is obsolete: true
Attachment #9153343 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9153343 [details] [diff] [review]
1571672-UTF8-accept-v4.patch

Review of attachment 9153343 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/mailnews.js
@@ +574,5 @@
>  // Can we change the store type without conversion? (=has the store been used)
>  pref("mail.server.default.canChangeStoreType", false);
>  
> +// Enable using UTF8=ACCEPT (RFC 6855).
> +pref("mail.server.default.allow_utf8_accept", true);

You still have it on by default.

But, can we resolve the folder naming thing? I imagine it could be around here: https://searchfox.org/comm-central/rev/f8b764c571b8503c951ba26a7acc248e211b91d7/mailnews/imap/src/nsImapMailFolder.cpp#451

Potentially all the places we do UTF7 I suppose: https://searchfox.org/comm-central/search?q=CopyUTF16toMUTF7&path=
Attachment #9153343 - Flags: review?(mkmelin+mozilla)

You still have it on by default.

Ok so I assume that by this:

Or maybe have the pref on ifdef nightly.

you wanted it off by default. Wasn't sure what you meant by "ifdef nightly".

But, can we resolve the folder naming thing?

I duplicated the problem and will take a look at it. Thanks for the links.

Tb is not currently designed, as far as I can tell, to accept folder names from the imap server in UTF8 except as single byte 7-bit ascii. In fact I guess it was assumed to be illegal (Re: Bug 63186 comment 23). TB still assumes that any mailbox/folder name returned by the server that contains a character needing more than one UTF8 byte is encoded in MUTF7 (as 7-bit ascii). For example, if I create the mailbox Skräppost--gdsat gmail.com and UTF8=ACCEPT is enabled, gmail returns in the imap LIST response Skräppost--gds where the ä is 8-bit UTF8 chars 0xc3 0xa4 and the others are their 7-bit ascii values. The mailbox displays incorrectly in tb (with Korean characters) and when clicked-on, the mailbox name encoded in the imap SELECT is also invalid and rejected by the server. There are also problems with the mbox and .msf filenames.

But with UTF8=ACCEPT in effect (or if not in effect), I can create in tb the folder Skräppost--eds. Tb encodes it as MUTF7Skr&AOQ-ppost--eds for the imap CREATE command. Gmail returns the same MUTF7 string in imap LIST and the mailbox displays correctly in tb as Skräppost--eds and works correctly when clicked on (the MUTF7 is sent in imap SELECT).

The display of Korean characters is caused by function
nsresult CopyMUTF7toUTF16(const nsACString& aSrc, nsAString& aDest) in mailnews/base/util/nsMsgI18N.cpp: https://searchfox.org/comm-central/rev/83ae42aa38bc39143b0f6d93bb69a40094c4b515/mailnews/base/util/nsMsgI18N.cpp#141
Looking with debugger at aSrc:

(gdb) p aSrc
$1 = (const nsACString &) @0x7fffffff75e8: {<mozilla::detail::nsTStringRepr<char>> = {
mData = 0x7fffd5e56c88 "Skräppost--gds", mLength = 15,
mDataFlags = (mozilla::detail::StringDataFlags::TERMINATED | mozilla::detail::StringDataFlags::REFCOUNTED),
mClassFlags = mozilla::detail::StringClassFlags::NULL_TERMINATED}, static kMaxCapacity = 2147483637}
(gdb) x/15bx aSrc.mData
0x7fffd5e56c88: 0x53 0x6b 0x72 0xc3 0xa4 0x70 0x70 0x6f
0x7fffd5e56c90: 0x73 0x74 0x2d 0x2d 0x67 0x64 0x73

The 2-byte UTF8 value is highlighted above. This gets converted from MUTF7 to UTF16 but it's not MUTF7; it is UTF8. Looking with debugger after conversion:
(gdb) p aDest
$3 = (nsAString &) @0x7fffffff7768: {<mozilla::detail::nsTStringRepr<char16_t>> = {
mData = 0x7fffffff777c u"Skrᅢᄂppost--gds", mLength = 15,
mDataFlags = (mozilla::detail::StringDataFlags::TERMINATED | mozilla::detail::StringDataFlags::INLINE),
mClassFlags = (mozilla::detail::StringClassFlags::INLINE | mozilla::detail::StringClassFlags::NULL_TERMINATED)}, static kMaxCapacity = 1073741817}
(gdb) x /30bx aDest.mData
0x7fffffff777c: 0x53 0x00 0x6b 0x00 0x72 0x00 0xc3 0xff
0x7fffffff7784: 0xa4 0xff 0x70 0x00 0x70 0x00 0x6f 0x00
0x7fffffff778c: 0x73 0x00 0x74 0x00 0x2d 0x00 0x2d 0x00
0x7fffffff7794: 0x67 0x00 0x64 0x00 0x73 0x00

The 2-byte UTF8 chars, highlighted above, are converted to U-FFC3 and U-FFA4 or ᅢand ᄂ

This seems to only really affect the "Pretty" name displayed in the folder list. I don't think it affects the mbox/.msf file names or the mailbox name sent to the server in imap commands (the "online" name). I can fix this in CopyMUTF7toUTF16() by checking aSrc using IsAscii() and if it returns false because aSrc is not a pure 7-bit string, do CopyUTF8toUTF16(aSrc, aDest) instead. This converts the 2-byte UTF8 character to the correct UTF-16 value U-00E4 and it displays correctly in the list.

Another problem is that when the UTF8 folder name containing "ä" is sent by tb to the imap server, the "online name" is corrupted to either one character short and/or the "ä" changed to "ä" (U-00c3,U-00a4). This would cause a failure on the imap SELECT server response due to an invalid mailbox name. I was never able to figure out why but removing this line fixes at least the "too short" problem:
https://searchfox.org/comm-central/rev/83ae42aa38bc39143b0f6d93bb69a40094c4b515/mailnews/imap/src/nsImapMailFolder.cpp#1896
Apparently the call to GetCharPropterty() right before this leaves off the last char(s) when the string has 8-bit chars. From what I can see, the value assigned to m_onlineFolderName is already OK so skipping this causes no problems. (Note: The original developer possibly alluded to this problem here: https://searchfox.org/comm-central/rev/83ae42aa38bc39143b0f6d93bb69a40094c4b515/mailnews/imap/src/nsImapMailFolder.cpp#1854)

The final problem is that on tb restart, the folders containing 8-bit UTF8s would alternately appear and then not appear in the folder list on subsequent restarts. Also the mbox/.msf files would change names (between containing "ä" and "ä") or have new numerical dash suffixes added on restart. Also, all headers and messages bodies for this folder were all re-downloaded on each restart. I think this was due to several places where UTF16 was converted to ASCII in a lossy manner. So changing several calls from NS_LossyConvertUTF16toASCII() to NS_ConvertUTF16toUTF8() fixed that problem.

The attached utf8.diff show the proposed fixes along with a lot of debug printf's and in a very rough form. It seems to fix the utf8 problem and I haven't noticed any other problems that my changes might have caused. But more testing is definitely required.

utf8.diff doesn't show anything from the original patch.

Attachment #9156449 - Flags: feedback?(mkmelin+mozilla)
Attached patch utf8-clean.diff (obsolete) — Splinter Review

This cleans up utf8.diff to remove all the printfs and other debug stuff that make it hard to parse and only shows the essential changes in two files. With attachment 9153343 [details] [diff] [review] applied this seems to work correctly when applied on top of that patch. (It at least fixes the problem identified by Magnus in comment 17.)
Note: There are several places where I replaced CopyASCIItoUTF16() with CopyUTF8toUTF16() that I failed to mention in my previous (long) comment.

Attachment #9156449 - Attachment is obsolete: true
Attachment #9156449 - Flags: feedback?(mkmelin+mozilla)
Attachment #9156480 - Flags: feedback?(mkmelin+mozilla)

(In reply to gene smith from comment #25)

Tb is not currently designed, as far as I can tell, to accept folder names from the imap server in UTF8 except as single byte 7-bit ascii. In fact I guess it was assumed to be illegal (Re: Bug 63186 comment 23). TB still assumes that any mailbox/folder name returned by the server that contains a character needing more than one UTF8 byte is encoded in MUTF7 (as 7-bit ascii). For example, if I create the mailbox ```Skräppost--gds`` [...]

FWIW, Courier-IMAP fixes that:

courier@wmail:Maildir$ grep Skr imaplog.dat
* LIST (\Unmarked \HasNoChildren) "." "INBOX.Skr&AOQ-ppost"
* LIST (\Unmarked \HasNoChildren) "." "INBOX.Skr&AOQ-ppost"
WRITE: * LSUB (\Unmarked \HasNoChildren) "." "INBOX.Skr&AOQ-ppost"
WRITE: * LSUB (\Unmarked \HasNoChildren) "." "INBOX.Skr&AOQ-ppost"
READ: QUOTED_STRING: INBOX.Skr&AOQ-ppost
READ: QUOTED_STRING: INBOX.Skr&AOQ-ppost
WRITE: * MYRIGHTS "INBOX.Skr&AOQ-ppost" "acdilrsw"
READ: QUOTED_STRING: INBOX.Skr&AOQ-ppost
WRITE: * QUOTAROOT "INBOX.Skr&AOQ-ppost" "ROOT"

But

courier@wmail:Maildir$ ls -ld .Sk*
drwx------ 6 courier courier 4096 Jun 13 17:23 .Skräppost

FWIW, Courier-IMAP fixes that:

Is the grep result what courier sends to tb with UTF7=ACCEPT enabled? If so, courier isn't reporting the mailbox/folder name in UTF8 but in MUTF7. Therefore tb with just my original patch should work fine with courier. But with gmail, it returns the folder name as UTF8 so something like my utf8-clean.diff is also needed.

From comment 18, here's how gmail returns the folder name with UTF8=ACCEPT enabled:

2020-05-24 11:36:00.874872 UTC - [(null) 7281: IMAP]: I/IMAP 0x7f0b5fadd800:imap.gmail.com:A:CreateNewLineFromSocket: * XLIST (\HasNoChildren \Spam) "/" "[Gmail]/Skräppost"

With UTF8=ACCEPT not enabled gmail returns MUTF7 just like courier:

2020-05-24 11:36:00.874872 UTC - [(null) 7281: IMAP]: I/IMAP 0x7f0b5fadd800:imap.gmail.com:A:CreateNewLineFromSocket: * XLIST (\HasNoChildren \Spam) "/" "[Gmail]/Skr&AOQ-ppost"

(In reply to gene smith from comment #28)

FWIW, Courier-IMAP fixes that:

Is the grep result what courier sends to tb with UTF7=ACCEPT enabled?M

Oops, no, sorry, I thought it was clear I'm using a stock Thunderbird package (68.8.0). There is no ACCEPT string in the log, as it starts after login.
Here's a telnet session:

Connected to wmail.tana.it.
Escape character is '^]'.
* OK [CAPABILITY IMAP4rev1 UIDPLUS CHILDREN NAMESPACE THREAD=ORDEREDSUBJECT THREAD=REFERENCES SORT QUOTA AUTH=CRAM-MD5 AUTH=CRAM-SHA1 IDLE ACL ACL2=UNION STARTTLS XCOURIEROUTBOX=INBOX.SendMailViaImap XMAGICTRASH ENABLE UTF8=ACCEPT] Courier-IMAP ready. Copyright 1998-2019 Double Precision, Inc.  See COPYING for distribution information.
a login **** ****
a OK LOGIN Ok.
a list "" "INBOX.Sk*"
* LIST (\Unmarked \HasNoChildren) "." "INBOX.Skr&AOQ-ppost"
a OK LIST completed
a status INBOX.Skr&AOQ-ppost messages
* STATUS "INBOX.Skr&AOQ-ppost" (MESSAGES 1)
a OK STATUS Completed.
a enable utf8=accept
* ENABLED UTF8=ACCEPT
a OK Options enabled
a list "" "INBOX.Sk*"
* LIST (\Unmarked \HasNoChildren) "." "INBOX.Skräppost"
a OK LIST completed
a status INBOX.Skr&AOQ-ppost messages
a NO Mailbox does not exist, or must be subscribed to.
a status INBOX.Skräppost messages
a NO Mailbox does not exist, or must be subscribed to.
a status "INBOX.Skräppost" messages
* STATUS "INBOX.Skräppost" (MESSAGES 1)
a OK STATUS Completed.
a logout
* BYE Courier-IMAP server shutting down
a OK LOGOUT completed
Connection closed by foreign host.

Is it painful to install a patched TB?

Ok, I see. So courier and gmail behave the same once UTF8=ACCEPT is enabled and the string has to be quoted which tb always does AFAIK.

Is it painful to install a patched TB?

I can build version of tb 68 using the "try" server with my patches if this is what you want. Sounds like just a linux version would be OK based on your original comment 0. When built I can provide a link. You just un-pack the tarball, cd into the new tb directory and run ./thunderbird. No install is needed.

I attempt to build ESR 68 but I would have had to make unrelated changes to get it to merge and build. If you really want patched 68 let me know and I can do it, but I just went ahead and did the build on top of the latest trunk/daily from comm-central. The try build is here:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=e9ab71c1f3aff80d9aa3c8027e1ea757095d0682
You can click on the green "B" and see the Job Details below. Then you can download and unpack the item called target.tar.bz2 and run thunderbird. The direct link is this:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/S9FhRXY2RXWXaw5ZOfXkjQ/runs/0/artifacts/public/build/target.tar.bz2

When you run the daily build with my patches applied, you will see something about profiles. I don't know what it all means but you can bypass that by running tb like this:
./thunderbird -p -allow-downgrade
and then select your usual profile (typically default). Not sure you really need the -allow-downgrade.
The enable UTF8=ACCEPT is true by default. Magnus asked to to make the pref false by default but I haven't done that yet.

(In reply to gene smith from comment #31)

./thunderbird -p -allow-downgrade

It works! The only thing I noticed, apart from UI changes, if a yellow warning triangle near the Daily Menu saying "PopmailListRecipients requires new permissions".

Server side logs differ:

--- /home/ale/tmp/imaplog.dat   2020-06-13 17:46:16.000000000 +0200
+++ /home/ale/tmp/imaplog-TBpatch.dat   2020-06-15 11:08:06.000000000 +0200
@@ -5,13 +5,20 @@
 READ: EOL
 WRITE: * NAMESPACE (("INBOX." ".")) NIL (("#shared." ".")("shared." "."))
 4 OK NAMESPACE completed.
+READ: NUMBER: 5
+READ: ATOM: ENABLE
+READ: ATOM: UTF8=ACCEPT
+READ: EOL
+WRITE: * ENABLED UTF8=ACCEPT
+5 OK Options enabled
 READ: NUMBER: 4
-READ: ATOM: NAMESPACE
+READ: ATOM: ENABLE
+READ: ATOM: UTF8=ACCEPT
+READ: NUMBER: 6
 READ: EOL
-WRITE: * NAMESPACE (("INBOX." ".")) NIL (("#shared." ".")("shared." "."))
-4 OK NAMESPACE completed.
-READ: NUMBER: 5
 READ: ATOM: LIST
+WRITE: * ENABLED UTF8=ACCEPT
+4 OK Options enabled
 READ: QUOTED_STRING: 
 READ: QUOTED_STRING: INBOX.*
 READ: EOL
@@ -21,7 +28,7 @@
 READ: QUOTED_STRING: INBOX.*
 READ: EOL
 WRITE: * LIST (\Unmarked \HasNoChildren) "." "INBOX.Sent.S2018"
-* LIST (\Unmarked \HasNoChildren) "." "INBOX.Skr&AOQ-ppost"
+* LIST (\Unmarked \HasNoChildren) "." "INBOX.Skräppost"

The enable UTF8=ACCEPT is true by default. Magnus asked to to make the pref false by default but I haven't done that yet.

Yes, it's true. I found it in mail.server.default.allow_utf8_accept. Of course, TB would only enable UTF8=ACCEPT when servers advertise the corresponding capability. At standard use, TB knows better than users whether UTF8 is supported, so it doesn't make much sense to keep it false, except possibly on beta releases. To require that users use Config Editor to set that value in case their server support UTF8 sounds overelaborate.

Looking at Wikipedia table, there are just 6 servers having a green "Yes" in both IMAP and IDN/UTF8 columns, CommuniGate Pro, Courier Mail Server, Dovecot, MDaemon, Microsoft Exchange Server, and Oracle Communications Messaging Server. How many of them should be tested before concluding there are no problems?

It works! The only thing I noticed, apart from UI changes, if a yellow warning triangle near the Daily Menu saying "PopmailListRecipients requires new permissions".

Ok, glad to hear that it worked! Don't think I changed anything about POP so might be some other existing bug/feature.

Do you have an email that was rejected before but now works? I haven't done anything that touches internal headers that might be in UTF8 that might be rejected with the error like you show above:

"Message 2211 appears to be a Unicode message and your E-mail reader did not enable Unicode support. Please use an E-mail reader that supports IMAP with UTF-8."

Do you see this in tb or in your server log?
I only changed the mailbox name handling.
I guess I'm asking if you now don't see the message with the "try" version with the "problem" email? If so, could you attach a sample email that is fixed by the "try" tb version?

I have an assortment of test imap servers that I can access. The only one I see that advertises UTF8=ACCEPT is gmail. I don't see it with Dovecot, MDaemon or Outlook.com (M/S Exchange I assume). I don't have Courier, CG-Pro or Oracle.

(In reply to gene smith from comment #33)

Do you have an email that was rejected before but now works? I haven't done anything that touches internal headers that might be in UTF8 that might be rejected with the error like you show above:
"Message 2211 appears to be a Unicode message and your E-mail reader did not enable Unicode support. Please use an E-mail reader that supports IMAP with UTF-8."

Actually not. That behavior happened in the first UTF-8 release of Courier-IMAP, and was later hacked to return all messages, leaving it to the client to do what it can. The message appears transiently as a warning, not every time, but doesn't prevent fetching the message. Remarkably, TB 68.8.0 works nicely. When replying to an UTF-8 address, the message is not sent due to bug 1563891. However, while the message used to look right before hitting «Send», Daily 79.0a1 (2020-06-14) worsenes the appearance, as the field becomes mim?@fo?.it (see image) —but that's another bug.

In the words of Sam Varshavchik (Courier developer):

The alert is trigged by 8-bit characters in headers but the IMAP client did  
not enable UTF-8 mode.

Not really a bug, but a situation that's falling through the cracks, and  
this serves as a visual heads-up that things have entered "garbage-in,  
garbage-out" state of affairs.

*

That state of affairs is the reason why I'd opt for default true...

I have an assortment of test imap servers that I can access. The only one I see that advertises UTF8=ACCEPT is gmail. I don't see it with Dovecot, MDaemon or Outlook.com (M/S Exchange I assume). I don't have Courier, CG-Pro or Oracle.

Fixed Wikipedia table, thanks.

Comment on attachment 9156480 [details] [diff] [review]
utf8-clean.diff

Review of attachment 9156480 [details] [diff] [review]:
-----------------------------------------------------------------

These changes look alright to me

::: mailnews/imap/src/nsImapMailFolder.cpp
@@ +1910,5 @@
>  
>    if (!onlineName.IsEmpty())
> +  {
> +    // gds: breaks m_onlineFolder name. Don't do this!
> +    //m_onlineFolderName.Assign(onlineName);

https://searchfox.org/comm-central/rev/b0826913e4dd6d1913c9974fcc4dcf89a29a6f6b/mailnews/db/msgdb/public/nsIDBFolderInfo.idl#67

Probably needs to be changed to AUTF8String. ACString is ASCII only.
Attachment #9156480 - Flags: feedback?(mkmelin+mozilla) → feedback+

Magnus, Sorry for the delay. I will try to finish this up tomorrow (Wednesday).

Summary: Enable UTF-8 IMAP store (RFC 6855) → Enable UTF-8 IMAP store (RFC 6855: add Support for UTF8=ACCEPT and UTF8=ONLY)
Attached patch 1571672-UTF8-accept-v5.diff (obsolete) — Splinter Review

Due to bit rot this updates 1571672-UTF8-accept-v4.patch with no other changes to work with current tip. It still needs to be integrated with ut8-clean.diff, after some more testing and changes, to provide the complete patch.

Attachment #9153343 - Attachment is obsolete: true

(In reply to Magnus Melin [:mkmelin] from comment #35)

https://searchfox.org/comm-central/rev/b0826913e4dd6d1913c9974fcc4dcf89a29a6f6b/mailnews/db/msgdb/public/nsIDBFolderInfo.idl#67

Probably needs to be changed to AUTF8String. ACString is ASCII only.

Changing that didn't help. All it changed was a comment in the resulting .h file. Also changed it on setCharProperty() and made no difference either.
It seems that the m_onlineFolderName is 1 char too short when the code is not commented out. The server reports the correct UTF8 folder name: "Skräppost--gds". But the online name tb uses to access the server is "Skräppost--gd" (length=14) instead of the correct "Skräppost--gds" (length=15) where the 'ä' is length 2. When it reads the string from the database, the lenght is 14 but when it writes it the length is 15, so don't know where the missing byte is down in the "mork" stuff. I never could find where it actually stores the 15 (or 14) bytes.
Not sure, but maybe this is the "crunch" referred to here:
https://searchfox.org/comm-central/rev/57ab5f2d0feff8be94d400b745d06aa55dbde61c/mailnews/imap/src/nsImapMailFolder.cpp#1853
It does end up corrupting m_onlineFolderName if that's what he meant by "crunch". But I don't think the code was tested originally with UTF8.

(In reply to Magnus Melin [:mkmelin] from comment #39)

Did you change the CopyASCIItoUTF16 to use CopyUTF8toUTF16? https://searchfox.org/comm-central/rev/57ab5f2d0feff8be94d400b745d06aa55dbde61c/mailnews/imap/src/nsImapMailFolder.cpp#1858

Yes, been doing that all along.

I found a problem here https://searchfox.org/comm-central/rev/c3b6043c388edb0c72244958863fe2cbaf73149c/mailnews/db/msgdb/src/nsMsgDatabase.cpp#3394. Length() returns the number of UTF16 16-bit words and not the number of bytes after conversion to UTF8. That's why a byte was cut off of the onlineName UTF8 string. Doing strlen() on the null terminated converted string gives the right byte length (see diff fragment beflow). A similar use of strlen() is done here: https://searchfox.org/comm-central/rev/c3b6043c388edb0c72244958863fe2cbaf73149c/mailnews/base/src/nsMsgFolderCacheElement.cpp#107

/* static */ struct mdbYarn* nsMsgDatabase::nsStringToYarn(
     struct mdbYarn* yarn, const nsAString& str) {
   yarn->mYarn_Buf = ToNewCString(NS_ConvertUTF16toUTF8(str));
-  yarn->mYarn_Size = str.Length() + 1;
+  yarn->mYarn_Size = strlen((const char*)yarn->mYarn_Buf) + 1;
   yarn->mYarn_Fill = yarn->mYarn_Size - 1;
   yarn->mYarn_Form =
       0;  // what to do with this? we're storing csid in the msg hdr...
   return yarn;
 }

After this, I realized that if you create or rename a folder name with non-ascii UTF8, it is sent to the server as mUTF7. It works OK and looks right in tb but when you go to webmail you see the strange mUTF7 string for the folder name. (I've only checked this on gmail.) I'm still in the process of making create and rename send UTF8 when UTF8=ACCEPT is enabled.
Edit 22 days later: In my patch I now send create and rename strings as UTF8 when UTF8=ACCEPT is in effect. So looks ok in webmail now.

See Also: → 1563891

Well, trying to get this to work has not been fun. With no offline store, when I open a message with multiple attachments and the folder name contains one or more non-ascii chars such as ä, the url is OK and the whole message is fetched and stored to RAM cache. Then each attachment (part) is accessed from cache and displayed inline. Then another URL occurs to fetch the whole message again (not sure why) but this time the url does not contain the proper 2 byte utf8 sequence for ä but instead contains the "replacement character" sequence that displays as a question mark inside a diamond (0xef, 0xbf, 0xbd). The 2nd fetch with the bad URL often causes a hang since it can't get a protocol connection.

I finally found a way to fix this. The bad utf8 was caused by the ä being encoded as a single byte 0xE4 instead of utf8, 0xc3,0xa4. Converting from ASCII to UTF16 and then back to UTF8 fixes it. However it only works when the UTF8 character in the folder name can be represented as a 2 byte "latin" utf8. If you want to have a 3 or more byte "non-latin" utf8 char in the folder name, like ᄂ (U+FFA4, or 0xef,0xbe,0xa4), it won't work.

With offline store the same problem occurs. The messages are all stored OK to mbox file with correctly encoded UTF8 URLs. But when an individual message is accessed, the URL again has the invalid replacement characters. The same fix also works to convert the URL from single byte to multi-byte UTF8 for only the "latin" chars.

The mystery is why the URL to access messages starts out OK, even with non-latin, but then gets "corrupted". (The corruption seem to be that the folder name starts out with proper UTF8 encoding but then somehow gets changed to ISO-8859-1 single byte per char style. It's fixable for "latin" chars that can be represented as a single byte but not for non-latin that won't fit in a single byte.)

Here's the diff fragment for the "fix" I'm referring to above:

diff --git a/mailnews/base/util/nsMsgDBFolder.cpp b/mailnews/base/util/nsMsgDBFolder.cpp
--- a/mailnews/base/util/nsMsgDBFolder.cpp
+++ b/mailnews/base/util/nsMsgDBFolder.cpp
@@ -2843,17 +2843,25 @@ nsresult nsMsgDBFolder::parseURI(bool ne
     nsAutoCString fileName;
     nsAutoCString escapedFileName;
     url->GetFileName(escapedFileName);
     if (!escapedFileName.IsEmpty()) {
       // XXX conversion to unicode here? is fileName in UTF8?
       // yes, let's say it is in utf8
       MsgUnescapeString(escapedFileName, 0, fileName);
       NS_ASSERTION(mozilla::IsUtf8(fileName), "fileName is not in UTF-8");
-      CopyUTF8toUTF16(fileName, mName);
+      if (!mozilla::IsUtf8(fileName))
+      {
+        static nsAutoString temp;
+        printf("gds: 1. fileName is not in UTF-8=%s\n", fileName.get()); 
+        CopyASCIItoUTF16(fileName, temp /*mName*/);
+        mName = temp;
+      }
+      else 
+        CopyUTF8toUTF16(fileName, mName);
     }
   }
 
   // grab the server by parsing the URI and looking it up
   // in the account manager...
   // But avoid this extra work by first asking the parent, if any
   nsCOMPtr<nsIMsgIncomingServer> server = do_QueryReferent(mServer, &rv);
   if (NS_FAILED(rv)) {
@@ -2888,16 +2896,23 @@ nsresult nsMsgDBFolder::parseURI(bool ne
   // now try to find the local path for this folder
   if (server) {
     nsAutoCString newPath;
     nsAutoCString escapedUrlPath;
     nsAutoCString urlPath;
     url->GetFilePath(escapedUrlPath);
     if (!escapedUrlPath.IsEmpty()) {
       MsgUnescapeString(escapedUrlPath, 0, urlPath);
+      if (!mozilla::IsUtf8(urlPath))
+      {
+        static nsAutoString temp;
+        printf("gds: 2. urlPath is not in UTF-8=%s\n", urlPath.get()); 
+        CopyASCIItoUTF16(urlPath, temp);
+        urlPath = NS_ConvertUTF16toUTF8(temp);
+      }
 
       // transform the filepath from the URI, such as
       // "/folder1/folder2/foldern"
       // to
       // "folder1.sbd/folder2.sbd/foldern"
       // (remove leading / and add .sbd to first n-1 folders)
       // to be appended onto the server's path

I have it working properly now for your typical operations like create, rename, [un]subscribe, select, fetch, expunge etc. However, there is still a problem when the the folder name contains a non-latin1 character. That is, it works as long as the unicode value is less than or equal to U+00FF. If you have a folder with a non-latin1 character like U+FFC3 "ᅢ" you will see this warning at the console:

[138089, Main Thread] WARNING: char16_t out of char range; high bits of data lost: 0xffc3: file /home/gene/mozilla/js/xpconnect/src/XPCConvert.cpp, line 408

It comes from here: https://searchfox.org/comm-central/rev/e3ec4fdf727ca65e4acbe42fc4888275977eaf5a/mozilla/js/xpconnect/src/XPCConvert.cpp#406. At this point, it is only a warning. The actual loss of the high byte occurs at this point: https://searchfox.org/comm-central/rev/e3ec4fdf727ca65e4acbe42fc4888275977eaf5a/mozilla/js/xpconnect/src/XPCConvert.cpp#604 and inside the call here: https://searchfox.org/comm-central/rev/e3ec4fdf727ca65e4acbe42fc4888275977eaf5a/mozilla/js/src/jsapi.cpp#4539

This only occurs when accessing mem cached or offline store for messages with attachments and unusual chars (greater than U+00FF) are present in the folder name. (I assume that these "unusual" characters are needed in the many non-latin1 languages.)

Note: There is an bug reported that proposes to change the above warning to a crash, at least for DEBUG builds (bug 1557254). However, after some complaints the associated patch was backed out. The last comment was to re-land that patch but that was about a year ago.

See Also: → 1557254

Alessandro, is it possible for you to provide me with a temporary Courier imap account for testing this? So far, I have only tested with gmail and the only local imap server I have set up here and sort-of working is Dovecot and it doesn't support UTF8=ACCEPT capability. Also, none of my other external accounts I have set up for imap testing support it, outlook, yahoo etc.

Flags: needinfo?(vesely)

Thanks! Test account working and tests in progress. No surprises yet.

Flags: needinfo?(vesely)
Attached patch utf8=accept;comm;with-debug.diff (obsolete) — Splinter Review

This is the diff to comm-central that I am currently running and seems to be working OK. Any folder can be named with a UTF-8 name and works with the basic IMAP functions. It also now allows the folder name to contain non-latin1 characters but only if the subsequent diff to mozilla code (in the next attachment) is applied.
This still contains a lot of printf's and commented out alternatives. These need to be cleaned up for a formal review.

Attachment #9156480 - Attachment is obsolete: true
Attachment #9161408 - Attachment is obsolete: true

To allow non-latin1 UTF-8 characters in folder names with UTF8=ACCEPT in effect, this change to the mozilla-central code seems to be needed. Without this, any UTF-16 character that is greater than U+00FF loses the top byte. This allows the top bytes to be kept in the resulting UTF-8 string instead of being discarded and allows folder names such as "Skrê_ꯍ" to be created in tb.
This address the issue described in comment 42.

Attached patch utf8=accept;comm.diff (obsolete) — Splinter Review

This is the cleaned-up diff against comm-central. Removed the printfs and the various alternatives that were commented out. Functionality is equivalent to attachment 9166487 [details] [diff] [review]

Attached patch utf8=accept;mozilla.diff (obsolete) — Splinter Review

This is the cleaned-up diff against mozilla-central. Removed the printfs and the various alternatives that were commented out. Functionality is equivalent to attachment 9166491 [details] [diff] [review] [diff] [review]

(In reply to Alessandro Vesely from comment #0)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0

Steps to reproduce:

Open a message that contains non-ASCII header fields from a Courier-IMAP mailstore.

Actual results:

Get a warning like:
Message 2211 appears to be a Unicode message and your E-mail reader did not enable Unicode support. Please use an E-mail reader that supports IMAP with UTF-8.

Alessandro, where do you see this warning? This must be in your Courier log, right?. I have only tested with various non-ASCII mailbox names and not with internal non-ASCII headers. Can you provide an example email that has non-ASCII headers that I can test? (I'm not sure what these look like.)

Anyhow, after working on this for probably much too long I have wondered as to the usefulness of all this. Testing with gmail and courier, it appears that even with UTF8=ACCEPT turned off you can still create and have mailbox names with any unicode character. The only difference is that they are encoded in mUTF7 instead of UTF8. So I don't see a lot of improvement here.

Another thing is that the RFC https://tools.ietf.org/html/rfc6855#section-3 it states that "All IMAP servers that support "UTF8=ACCEPT" SHOULD accept UTF-8 in mailbox names...". I take this to mean that some servers may still accept and report mailbox name as mUTF7 even with UTF8=ACCEPT in effect. But it appears that gmail and courier (the only two I have found that really support UTF8=ACCEPT) do report via LIST the mailbox names as UTF8 and accept UTF8 when mailbox is created, renamed and selected when in UTF8=ACCEPT mode.

I do know, at least with gmail, that if you create a mailbox with a mUTF7 string when in UTF8=ACCEPT mode you will see in webmail the literal mUTF7 string and not the intended unicode string. See comment 40 above.

So it seems that the lack of UTF8=ACCEPT support is a non-problem for gmail. Problems have only been reported for courier and often the "solution" is to switch to dovecot. So I'm asking if this complex and high-risk change is really needed due to issues with one server type Courier?

(In reply to Jorg K (CEST = GMT+2) from comment #2)

Seems like a larger job.

Very true!

(In reply to gene smith from comment #46)

Created attachment 9166491 [details] [diff] [review]
utf8=accept;mozilla;with-debug.diff

To allow non-latin1 UTF-8 characters in folder names with UTF8=ACCEPT in effect, this change to the mozilla-central code seems to be needed. Without this, any UTF-16 character that is greater than U+00FF loses the top byte. This allows the top bytes to be kept in the resulting UTF-8 string instead of being discarded and allows folder names such as "Skrê_ꯍ" to be created in tb.

What properties is it that cause a problem? I'm assuming they just have a wrong string type.

(In reply to Magnus Melin [:mkmelin] from comment #50)

What properties is it that cause a problem? I'm assuming they just have a wrong string type.

Don't know which "properties" cause the problem and don't really even know what you mean by "property". I can tell that the problem occurs at two places where JS_EncodeStringToBuffer() is called
here
https://searchfox.org/comm-central/source/mozilla/js/xpconnect/src/XPCConvert.cpp#605
and here
https://searchfox.org/comm-central/source/mozilla/js/xpconnect/src/XPCConvert.cpp#727
Edit: Permalinks not working currently for some reason. Replaced "rev/<long number>" with "source" now works. No idea why?

The first call at line 604 is for string types:
case nsXPTType::T_CHAR_STR:
case nsXPTType::T_PSTRING_SIZE_IS:

The other call at line 726 is for string type:
case nsXPTType::T_CSTRING:

In both cases the problem occurs when a UTF16 string is converted to a byte (UTF8) string and the upper byte is discarded. My fix is to keep the upper byte by decoding it in a different way (only if new parameter "losses" is true). Apparently with mozilla losses are never expected. With TB losses now only occur because folder names at this point are in UTF-16 and, with UTF8=ACCEPT active both byte are significant. With just mUTF7, the upper byte is always zero so no problem occurs.

The actual discard/loss of the upper byte occurs here when decoding 2-byte char strings:https://searchfox.org/comm-central/source/mozilla/js/src/jsapi.cpp#4532

I really have no idea what controls this. But I see this code occur in conjunction with a when a message in a UTF8 encoded folder name with non-latin1 chars (>U+00FF) is accessed and it has attachments inline. It occurs after the full message is fetched and after the inline parts are accessed from offline store or ram cache depending on configuration.

Another different problem also occurs when the compose window appears in response to a reply, forward, edit-as-new or edit draft when the recipient addresses are written to the fields. This is described in bug 1563891 comment 23.

(In reply to gene smith from comment #49)

(In reply to Alessandro Vesely from comment #0)

Alessandro, where do you see this warning? This must be in your Courier log, right?.

Never mind. I already asked this and you answered above. It's the courier log. Sorry. Oh well, still wrong. It is seen in tb. See below.
Still sample emails with UTF8 encoded headers might be helpful.

(In reply to gene smith from comment #49)

Get a warning like:
Message 2211 appears to be a Unicode message and your E-mail reader did not enable Unicode support. Please use an E-mail reader that supports IMAP with UTF-8.

Alessandro, where do you see this warning? This must be in your Courier log, right?

No, Courier sends it via IMAP and TB displays it in a tiny window which disappears after s short time even if you don't dismiss it. It happened the first time I opened the mail I sent you from my Sent folder. I should've been quicker with gnome-screenshot. Reopening it doesn't generate a new warning, even after restarting TB. Probably Courier tries and minimize those warnings.

Another thing is that the RFC https://tools.ietf.org/html/rfc6855#section-3 it states that "All IMAP servers that support "UTF8=ACCEPT" SHOULD accept UTF-8 in mailbox names...".

I'd guess it's not "MUST" to allow for ASCII-only file systems.

[...] So I'm asking if this complex and high-risk change is really needed due to issues with one server type Courier?

Personally, I think UTF-8 can be used as an universal encoding for quite a long time. The only issues, w.r.t. ASCII only are some string operations like truncating a string or align it to a given column, which require programmer's attention. Oh, and IDNA before DNS lookup . Having to deal with obsolete clients is a PITA. It cannot be done cleanly. For comparison, consider how log did utilities to convert filenames into DOS 8.3 format last. Sooner or later obsolete uses have to be dropped. The sooner the better.

Thank you for carrying Thunderbird into modern-day.

Yes, I see the message now with the sample email you sent to my test courier account. Of course I only see it with UTF8=ACCEPT disabled in my patched code. It maybe only happens on the first message access after starting tb, definitely not every time. It is actaully an "Alert" message as part of the OK response on fetching the full message body:

<full body fetch>
:
2020-07-30 18:27:56.657180 UTC - [(null) 3858: IMAP]: I/IMAP 0x7f1e20e2f000:mail.tana.it:S-INBOX:STREAM:CLOSE: Normal Message End Download Stream
2020-07-30 18:27:56.931924 UTC - [(null) 3858: Main Thread]: D/IMAP Updating stored message size from 1309, new size 1309
2020-07-30 18:27:56.932148 UTC - [(null) 3858: IMAP]: D/IMAP ReadNextLine [rv=0x0 stream=0x7f1e1f30b560 nb=216 needmore=0]
2020-07-30 18:27:56.932168 UTC - [(null) 3858: IMAP]: I/IMAP 0x7f1e20e2f000:mail.tana.it:S-INBOX:CreateNewLineFromSocket: * OK [ALERT] Message 3 appears to be a Unicode message and your E-mail reader did not enable Unicode support. Please use an E-mail reader that supports IMAP with UTF-8 (see https://tools.ietf.org/html/rfc6855.html)
2020-07-30 18:27:56.932188 UTC - [(null) 3858: IMAP]: D/IMAP SetConnectionStatus(0x0)
2020-07-30 18:27:56.996999 UTC - [(null) 3858: IMAP]: D/IMAP ReadNextLine [rv=0x0 stream=0x7f1e1f30b560 nb=24 needmore=0]
2020-07-30 18:27:56.997029 UTC - [(null) 3858: IMAP]: I/IMAP 0x7f1e20e2f000:mail.tana.it:S-INBOX:CreateNewLineFromSocket: 19 OK FETCH completed.

Also I see now that non-ASCII header just means something as simple as a U+2014 (Long Dash) in the Subject as in your example. Also, see non-ASCII in the CC address.

The non-ASCII displays correctly in TB main screen with and without UTF8=ACCEPT enabled as expected. The only problem I see is that on "Show Source" the UTF8 is not shown correctly. It seems to be decoding each UTF8 3-byte sequence as a separate IS0-8859-1 characters. So the long dash appears as 3 separate characters —.

(In reply to gene smith from comment #54)

The only problem I see is that on "Show Source" the UTF8 is not shown correctly. It seems to be decoding each UTF8 3-byte sequence as a separate IS0-8859-1 characters. So the long dash appears as 3 separate characters —.

"Show Source" has a View/Text Encoding menu where you can correct that. The default values are probably adjustable, but I haven't found out how, yet.

(In reply to Alessandro Vesely from comment #55)

"Show Source" has a View/Text Encoding menu where you can correct that. The default values are probably adjustable, but I haven't found out how, yet.

Yes, that fixes it if you change from Western to Unicode. Seems like the setting for Incoming mail with the fonts setting should set the default for view source but it doesn't.

There's a bug in my APPEND of UTF8 messages found when trying to save to SENT when sending a UTF8 message. See bug 1563891 comment 22. Probably applies to any situation where imap APPEND occurs.
Thanks Alessandro for researching this and finding the cause.
I failed to implement this detail in the utf8=accept rfc: https://tools.ietf.org/html/rfc6855#section-4

Sam Varshavchik researched and found, see Python bug. I merely pushed the papers. The Python patch which fixed that looks quite simple...

Attached patch utf8Accept-comm-some-debug.diff (obsolete) — Splinter Review

This fixes the append problem. If utf8=accept is enabled, it adds the UTF8, parenthesis and the tilde in the same manner as the python patch pointed to in comment 58. Append to Sent folder now works OK with no alerts with Alessandro's and gmail server.

It also fixes another problem revealed after some email exchanges with Arnt Gulbrandsen (co-author of imap enable RFC and others too I think). He recommended that I go ahead and attempt the enable of utf8=accept if ENABLE capability is detected. If the server doesn't support utf8=accept you still get an OK response from the ENABLE command but you don't get the untagged

* ENABLED UTF8=ACCEPT

response. Before I wasn't even looking for this but just assumed that it worked if the ENABLE and UTF8=ACCEPT capabilities are both present and if OK tagged response occurs (which it always does unless you, for example, misspell ENABLE).

I also now set a flag accessible with m_imapServerSink-> and imapServer-> since there are several other places that utf8=accept needs to be known. Before I was just looking at capabilities and not whether it was really enabled at the server.

The .diff still contains some debug and printfs that need to be cleaned up. It also contain the changes for SMTPUTF8 from bug #1563891 that need to be taken out for this bug.

Attachment #9166879 - Attachment is obsolete: true

To clarify (In reply to gene smith from comment #51)

(In reply to Magnus Melin [:mkmelin] from comment #50)

What properties is it that cause a problem? I'm assuming they just have a wrong string type.

Don't know which "properties" cause the problem and don't really even know what you mean by "property".

I was referring to the object properties as declared in the .idl files, and what string types they have there

  • string / wstring: old style string pointers (char*) - ASCII only
  • ACString: ASCII only, nicer c++ interface
  • AUTF8String/ AString are the non-ASCII versions

If something you're passing around doesn't have the right string type you'll get in trouble for non-ascii characters.

Feel free to needinfo if you have specific questions. Questions inside comments can get lost if I don't have time to answer when I read it.

Magnus, thanks for the explanation. I am now looking again at this bug. But I have another question that is OT to this bug.
I just applied my patch again to a pristine trunk that only affects comm-central code. For some reason, it's re-building a large chunk of mozilla-central code starting with

3:09.45 devtools/shared/heapsnapshot
 3:09.52 devtools/shared/heapsnapshot/tests/gtest
 3:09.58 docshell/base
 3:09.66 docshell/shistory
 3:11.04 dom/abort
 3:20.84 dom/animation
 3:23.96 dom/audiochannel
:

and it takes 30 to 40 minutes to finish on this old (slow) laptop. Does mozilla code have dependencies on comm code? I've only just recently been seeing this. I may be touching something in comm-central that affects the javascript engine and forcing a rebuild? (A full "clobber" build takes about an hour so I'm not see that.)

Sounds like bug 1658188.

Attached patch utf8-accept-comm-works.diff (obsolete) — Splinter Review

Ok, made some progress on this thanks to Jorg's suggestion to use DebugDumpJSStack() at the point where the upper byte is lost in js/xpconnect/src/XPCConvert.cpp. The JS stack dumps shows me where in the JS code the upper byte is getting ignored. Then I found the relevant .idl file and typically changed "string" type to "AUTF8String" and then fixed the many compile errors that "ripple" out. In some cases that change was just from ACString to AUTF8String in the .idl resulting in no compiler errors (not 100% sure these changes are needed, but the "string" to AUTF8String definitely are).

Now I'm not seeing any warnings about the truncation of char16 values and the folder names containing non-latin1 chars are working correctly. However, I still need to go over everything again and make sure all is still working and then clean up the changes and submit an official patch for review. Also, need to run this through the unit tests (probably should do that first).

For now I am leaving in my changes to the mozilla code to detect and correct losses and dump the JS stack for my debug purposes. For reference, I'll attach those changes next.

Attachment #9166487 - Attachment is obsolete: true
Attachment #9166491 - Attachment is obsolete: true
Attachment #9166880 - Attachment is obsolete: true
Attachment #9167800 - Attachment is obsolete: true

For reference, here is my temporary changes to mozilla code to detect when char16's are truncated to char8. When this is detected, DebugDumpJSStack() is called to print the JS stack to help in determining where the problem is occurring. This is not readily evident from just the C++ stack visible in gdb.

I did some more testing by attempting to designate drafts, archives, template and sent as folders containing non-ascii utf8. Drafts and archives worked OK with my latest diff. But when I tried trash with a non-ascii chars, e.g., Skräpppost, problems occurred. For example, what looks like a duplicate folder was created but with MUTF7 encoding that became the designated trash folder. (Both folder appeared in the list with pretty name Skräpppost.)

The attached fix-trash-utf8.diff shows the changes I made to just fix this problem. I still need to test some more and also try my current changes with template and sent special folders having non-ascii utf8 chars.

Noticed another problem when testing the trash folder selection issue. If the folder selected for trash is non-ascii utf8, on startup and the server setup page invoked, it fails to initialize the "picker" widget for the trash folder name; it only says "Choose folder...". The correct trash folder is still actually selected according to the icon and tb behavior, so this is a display problem.

I'm not sure if it has anything to do with this comment in the code: https://searchfox.org/comm-central/rev/ec64095d118d299477d3e5f9fd2dc11912d0b573/mailnews/base/prefs/content/am-server.js#531, but this is where the folder
selected for trash is displayed. (Looking with tb's built-in debugger, I don't actually see the "try" for select failing and causing the "catch" to occur.)

Another random observation is that you can't choose an arbitrary trash folder (e.g., on another account or Local Folders) like you can for Sent, Drafts, etc. (I thought maybe I had broken it while tweaking the code, but I hadn't.) Apparently this was requested but never acted upon in bug 277854.

(In reply to gene smith from comment #66)

Noticed another problem when testing the trash folder selection issue. If the folder selected for trash is non-ascii utf8, on startup and the server setup page invoked, it fails to initialize the "picker" widget for the trash folder name; it only says "Choose folder...". The correct trash folder is still actually selected according to the icon and tb behavior, so this is a display problem.

The attached incremental diff to am-server.js shows a fix for the problem. It only show the changes to fix this problem required only in am-server.js function setupImapDeleteUI(). A full cleaned-up patch is still needed for the full bug fix showing all the files and functions.

I'm not sure if it has anything to do with this comment in the code: https://searchfox.org/comm-central/rev/ec64095d118d299477d3e5f9fd2dc11912d0b573/mailnews/base/prefs/content/am-server.js#531, but this is where the folder
selected for trash is displayed. (Looking with tb's built-in debugger, I don't actually see the "try" for select failing and causing the "catch" to occur.)

No, this comment seems to be not an issue anymore. i never see the "select" actually fail and trigger the catch. The "select" can return false which causes the "Choose folder..." to be displayed. I now check the return value for SelectFolder() to know that the trash folder name is probably stored utf8 instead of the normal mutf7 due to UTF8=ACCEPT and process the strings differently.

This is a raw diff of all changes working correctly to support utf8=accept. It includes some commented out code and and fairly useful printfs for debug and verification purposes.
I am marking all the older attachments as obsolete but they are still useful for documentation and explanation purposes.
Next I will attach the complete cleaned-up patch for feedback and/or review.

Attachment #9171491 - Attachment is obsolete: true
Attachment #9171492 - Attachment is obsolete: true
Attachment #9171977 - Attachment is obsolete: true
Attachment #9172582 - Attachment is obsolete: true
Attached patch 1571672-UTF8-accept-v6.patch (obsolete) — Splinter Review

Here's the full patch without printfs and ready for review.
There is included some changes in nsImapProtocol.cpp to work around bug 1661992 to avoid an immediate crash so those changes will be removed in later patch versions (function Suspend() and Resume()).

Overall, this required a lot of changes to allow folder names to have non-ascii utf-8 chars at all usage points. I think I have checked everything except namespace and server directory name as utf-8.

Reporter Alessandro, thanks for letting me use your Courier server to test this. I have also tested it on gmail which also support UTF8=ACCEPT capability. Any testing you could do on this would be appreciated. If that's possible let me know and I will produce a "try" build for you to use.

I have manually run my changed version through all the comm unit tests and don't see any failure that can be attributed to my changes.

Finally, clang formatting has yet to be done.

Attachment #9172963 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9172963 [details] [diff] [review]
1571672-UTF8-accept-v6.patch

Review of attachment 9172963 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay, seems to work! r=mkmelin

Before landing please clean up the debug, run clang-format and a successful try run

::: mailnews/base/prefs/content/am-server.js
@@ +530,2 @@
>    // TODO: There is something wrong here, selectFolder() fails even if the
> +  // folder does exist. Try to fix in bug 802609. gds: XXX I don't see the problem here.

If you don't see it I think we can just drop this TODO.

@@ +535,4 @@
>    } catch (ex) {
>      trashPopup.parentNode.setAttribute("label", trashFolder.prettyName);
> +    // gds XXX maybe: selected = false;
> +  }

please remove the debug

@@ +609,5 @@
> +    let utf8Decoder = new TextDecoder("utf-8");
> +    try {
> +      unesc = utf8Decoder.decode(typedarray);
> +    } catch (e) {
> +      dump("gds: failed decode\n")  // XXX maybe don't need catch?

probably not no

::: mailnews/compose/src/nsMsgComposeService.cpp
@@ +417,5 @@
>          if (NS_SUCCEEDED(GetOrigWindowSelection(type, aMsgWindow, selHTML)))
>            pMsgComposeParams->SetHtmlToQuote(selHTML);
>        }
>  
> +      if (originalMsgURI.Length() > 0) {

would prefer if (!originalMsgURI.IsEmpty())

::: mailnews/imap/src/nsImapProtocol.cpp
@@ +9631,5 @@
>    return NS_OK;
>  }
>  
>  NS_IMETHODIMP nsImapMockChannel::Cancel(nsresult status) {
> +  printf("gds: nsImapMockChannel::Cancel\n");

please remove debug
Attachment #9172963 - Flags: review?(mkmelin+mozilla) → review+
Attached patch 1571672-UTF8-accept-v7.patch (obsolete) — Splinter Review

Did clang-format on the cpp/h files. Skipped all the other, js, idl. I think that's right.
Did try build. Fixed a reported lint problem on a .js file. See https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=53e34454281dab86162c3108569aa5c2a65db3d6

Attachment #9172959 - Attachment is obsolete: true
Attachment #9172963 - Attachment is obsolete: true
Attachment #9177560 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9177560 [details] [diff] [review]
1571672-UTF8-accept-v7.patch

Review of attachment 9177560 [details] [diff] [review]:
-----------------------------------------------------------------

::: mail/base/content/mailWindowOverlay.js
@@ +1669,5 @@
>    let uniqueAddresses = MailServices.headerParser.removeDuplicateAddresses(
>      addresses,
>      ""
>    );
> +  let numAddresses = MailServices.headerParser.parseEncodedHeaderW(

why this change?

I remember doing that but never made a note as to why. So I went back and retested and tried to recall why I did it. It looks like the change is not needed but doesn't break anything. I think I was concern that with non-ascii addresses that the parsing would fail and produce the wrong count of the number of addresses. However unlike with Bug 1658361 that also changed parseEncodedHeader to parseEncodedHeadeW, the actual parsed address strings are not used here but just the count of the number of address strings so the proper count is returned which determines if the "Reply All" button is enabled or not.

I'll remove this from the patch since it really doesn't apply to this bug even if there was a problem.

Attached patch 1571672-UTF8-accept-v8.patch (obsolete) — Splinter Review

Removed changes to mailWindowOverlay.js from the patch per comment 72 and comment 73. Otherwise same as v7 patch version.

Attachment #9177560 - Attachment is obsolete: true
Attachment #9177560 - Flags: review?(mkmelin+mozilla)
Attachment #9177801 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9177801 [details] [diff] [review]
1571672-UTF8-accept-v8.patch

Review of attachment 9177801 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay, this looks good. Thanks for fixing it! r=mkmelin

Please do a try run after you fix the few nits below

::: mailnews/base/prefs/content/am-server.js
@@ +588,5 @@
>    // the path of the URI like we do in nsImapIncomingServer::DiscoveryDone().
>    // Note that the path is returned with a leading slash which we need to remove.
>    var folderPath = Services.io.newURI(folder.URI).pathQueryRef.substring(1);
>    // We need to convert that from MUTF-7 to Unicode.
>    var manager = Cc["@mozilla.org/charset-converter-manager;1"].getService(

please move this down to where it's first used

@@ +600,5 @@
> +  if (/[\x80-\xff]/.exec(unesc)) {
> +    // Foldername appears to be utf8 due to UTF8=ACCEPT capability in effect.
> +    // Convert the value to a typed-array of utf-8 bytes.
> +    var typedarray = new Uint8Array(unesc.length);
> +    for (var i = 0; i < unesc.length; i++) {

nit: please use "let"

::: mailnews/imap/src/nsImapIncomingServer.cpp
@@ +348,5 @@
> +      NS_ENSURE_SUCCESS(rv, rv);
> +      nsCString trashURI;
> +      trashFolder->GetURI(trashURI);
> +      GetMsgFolderFromURI(trashFolder, trashURI, getter_AddRefs(trashFolder));
> +      if (NS_SUCCEEDED(rv) && trashFolder) {

should probably be rv= GetMsgFolderFromURI here. (though existing code, yes)

::: mailnews/mailnews.js
@@ +616,5 @@
>  // Can we change the store type without conversion? (=has the store been used)
>  pref("mail.server.default.canChangeStoreType", false);
>  
> +// Enable use of imap capability UTF8=ACCEPT described in RFC 6855.
> +pref("mail.server.default.allow_utf8_accept", false);

I think this should be true by default. The pref is needed only to debug problems and/or workaround server bugs.
Attachment #9177801 - Flags: review?(mkmelin+mozilla) → review+

Made the requested changes and ran a try build:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=8b625e4f213380c84e9209dc9bda6ebbac20feb1

I've fixed the whitespace error shown.

Note: I had problems with pushing my changes to try-comm and did several re-pushes since I thought I must be doing something wrong. But apparently my initial push was OK and the server was just busy or something, so they all queued up and they all finally ran. Sorry for the extra traffic. Anyhow, the link above is to my first push.

On my later re-push there was an X* error on the windows build but I see similar errors on other people's pushes so I'm assuming it's OK:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=1289707e4621d0de2d171552b7db185e7352cce7

Attachment #9177801 - Attachment is obsolete: true
Attachment #9180234 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9180234 [details] [diff] [review]
1571672-UTF8-accept-v9.patch

Review of attachment 9180234 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/compose/src/nsMsgCompose.cpp
@@ +399,1 @@
>          if (!newUrl) continue;

there's an (existing) rv check missing here. will add
Attachment #9180234 - Flags: review?(mkmelin+mozilla) → review+

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/0054ea73e658
Add support for imap capability UTF8=ACCEPT (RFC 6855). r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 83 Branch
Blocks: 1512053
Regressions: 1686034
No longer regressions: 1686034
Regressions: 1686034

As mentioned in bug 1686034 comment #15, it's unclear why URI's were moved to UTF-8 here
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l2.12
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l4.12
and elsewhere in the patch when maybe proper escaping as for mailbox: URIs could have done the trick.

There's some very hacky stuff, for example CopyMUTF7toUTF16() can now also handle UTF-8 input. Overall it's not clear when a folder name is transported as UTF-8 and when it's transported as MUTF-7, IOW, which parts of the code now always expect UTF-8. Here here, the name is now unconditionally converted to UTF-8:
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l16.68

The test which decides whether the input is UTF-7 or UTF-8 is here:
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l5.22
And it's based on whether the folder name is ASCII. Looks like creating a folder by the name of test&AOk- now results in creating a folder testé.

This hunk here https://hg.mozilla.org/comm-central/rev/0054ea73e658#l16.149 is unnecessary. The original code compiles.

BTW, apart from bug 1686034, this also caused bug 1685033 and bug 1686415.

Back at comment 49 I question the usefulness of this UTF8=ACCEPT stuff since MUTF7 allows you to create any mailbox name you want without using UTF8. This was after much frustration getting anything to work without causing other problems.

Anyhow, I'm not sure what you mean by "proper escaping" to fix the problem.

I did just verify that even with UTF8=ACCEPT disabled in config editor, even on an account that doesn't support UTF8=ACCEPT capability, that a folder created as test&AOk- produces a folder named testé. I haven't tried running without it, but I assume this is a regression due to the patch for the current bug.

Anyhow, I'm not sure what you mean by "proper escaping" to fix the problem.

Well, please read bug 1686034 comment #15 again. It shows that mailbox: URLs escape testé as mailbox-message://nobody@Local%20Folders/test%C3%A9#1 in nsMailboxService::MessageURIToMsgHdr(). The é is C3A9 in UTF-8 and it's percent encoded.

You try to push UTF-8 into the URI without escaping here and in bug 1686034 (and bug 1685033) to get imap-message://klaus.bartosch%40gmail.com@imap.gmail.com/INBOX/testé#1.

That may be an option, but it's certainly not the way other URL types are handled.

Overall, as stated in comment #79, it's unclear which encodings are used in a string retrieved from an IMAP folder name, so sadly CopyMUTF7toUTF16() now tries to guess based on the ASCII-ness of the input.

I'm also wondering if/why https://hg.mozilla.org/comm-central/rev/0054ea73e658#l16.68 is correct. IMHO, instead of spot-fixing any encoding issues, if would be good to design/define at which locations in the code which encoding is used. There is a comment to that extent here:
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l16.131

MUTF-7 is used for folder names in almost all IMAP servers except Gmail and Courier if I read comment #49 correctly. Sure, moving towards UTF-8 is the way to go, but I suspect that MUTF-7 is here to stay for another decade.

All that said, sadly the changes here introduced a heap of regressions due to incorrect charset handling: bug 1686034, bug 1685033 and bug 1686415. So whilst getting rid of a "relic from the past", Gmail processing went broken in a few ways. So even if you wanted to use the new feature, you can't.

I've looked through the patch again. Looks like the idea was to make folder names UTF-8 internally, and only convert them to MUTF-7 at the "last minute":
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l23.38
That also answers the question re. https://hg.mozilla.org/comm-central/rev/0054ea73e658#l16.68.

In view of bug 1686034 comment #24 all the NS_ConvertUTF16toUTF8/NS_ConvertUTF8ToUTF16 should be inspected to check whether they should be NS_CopyUnicodeToNative/NS_CopyNativeToUnicode.

If would also be good to determine whether a folder name coming back from IMAP is MUTF-7 or UTF-8. That test shouldn't be done in CopyMUTF7toUTF16(). Surely, if the server doesn't have UTF-8 support, MUTF-7 is returned. It's only questionable whether servers with UTF-8 support are guaranteed to return UTF-8 and not MUTF-7 any more.

Looks like a trash folder name is constructed here:
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l19.48
Looking at bug 1686415 I wonder whether some processing was missed for the other "special" folders, like Sent, etc.

Further nits are the inconsistent spelling of UTF-8 and MUTF-7 throughout the patch, I see: uft8, uft-8, mUTF7, MUTF7, mutf7, mutf-7.

Comment on attachment 9180234 [details] [diff] [review]
1571672-UTF8-accept-v9.patch

Review of attachment 9180234 [details] [diff] [review]:
-----------------------------------------------------------------

A few comments, some of them already made elsewhere. The main problem here seems to be that in many places in the code it's unknown which encoding is actually present. Additionally, all the UTF8toUTF16 and UTF16toUTF8 need to be checked to make sure those strings don't come from the file system, in which they will not necessarily be Unicode. Unless on Windows UTF-8 is set as system locale, file/folder names are encoded in the local ANSI code page. See bug 1686034 comment #26 for possible side effects. Not only various accented Western characters in names should be checked but also some CJK text.

::: mailnews/base/prefs/content/am-server.js
@@ +530,2 @@
>    try {
> +    var selected = trashPopup.selectFolder(trashFolder);

var in a try block is bad style. It should be a let before the try block. I'm surprised linting doesn't complain here.

::: mailnews/base/src/nsMsgI18N.cpp
@@ +137,5 @@
>    return NS_OK;
>  }
>  
>  nsresult CopyMUTF7toUTF16(const nsACString& aSrc, nsAString& aDest) {
> +  if (NS_IsAscii(aSrc.BeginReading(), aSrc.Length())) {

This is hacky and such a test for UTF-7 should have been done elsewhere.

::: mailnews/imap/src/nsImapIncomingServer.cpp
@@ +1108,3 @@
>          nsImapUrl::UnescapeSlashes(folderName.BeginWriting());
> +        // Set length to preclude showing a null "box" symbol.
> +        folderName.SetLength(strlen(folderName.BeginWriting()));

This is totally ugly and hacky. Setting the null-byte inside the function and the length outside is wrong.

@@ +2309,5 @@
>  
> +  // Extra step here prevents chars flagged as invalid, don't know why.
> +  nsAutoString temp;
> +  temp.Assign(nsDependentString(aName));
> +  return SubscribeToFolder(temp, true, nullptr);

This is unnecessary.

::: mailnews/imap/src/nsImapMailFolder.cpp
@@ +488,5 @@
>            int32_t leafPos = currentFolderNameStr.RFindChar(delimiter);
>            if (leafPos > 0) currentFolderNameStr.Cut(0, leafPos + 1);
>  
>            // take the utf7 full online name, and determine the utf7 leaf name
> +          CopyUTF8toUTF16(onlineFullUtf7Name, utf7LeafName);

The code doesn't match the comment. Is it UTF-8 or UTF-7 now?

@@ +1775,1 @@
>      m_onlineFolderName.Assign(onlineName);

Can you please explain this change. What are "replacement chars"? If there unwanted characters, that points to some other issue.

@@ +8193,5 @@
>      NS_ENSURE_SUCCESS(rv, rv);
>  
>      // XXX : Fix this non-sense by fixing AddSubfolderWithPath
>      nsAutoString unicodeLeafName;
> +    CopyUTF8toUTF16(utf7LeafName, unicodeLeafName);

utf7LeafName contains MUTF-7, that's 100% sure since you populated it with rv = CopyUTF16toMUTF7(folderName, utf7LeafName);
a few lines above.
I'd still be curious to know why you unconditionally use MUTF-7 here. This code is in nsImapMailFolder::RenameSubFolders(). Are you sure that if you rename from testé to testè the result will be UTF-8?

::: mailnews/imap/src/nsImapService.cpp
@@ +2224,5 @@
>          urlSpec.Append(hierarchyDelimiter);
>        }
>  
>        nsAutoCString utfNewName;
> +      utfNewName.Append(NS_ConvertUTF16toUTF8(newFolderName));

Why append to a new string?

@@ +2895,5 @@
>        nsAutoCString utfFolderName;
> +      // folderName is already mutf7 at this point with not utf8=accept. When
> +      // utf8=accept in effect it is utf8. So change to mutf7 not needed at
> +      // all it seems.
> +      utfFolderName.Append(NS_ConvertUTF16toUTF8(folderName));

Why use append on a new string? Are you aware of this syntax:
NS_ConvertUTF16toUTF8 utfFolderName(folderName);
which you have elsewhere in your patch.

A comment of the form "it seems" is not very encouraging. Referencing MUFT-7 here is confusing, since the old code that converted to MUTF-7 is gone. It would be good to know why the name is already MUTF-7.

(In reply to Klaus B. from comment #85)

Using TB 78, I have a folder named Skr&AOQ-ppost on disk. It should be converted when switching to UTF-8.

I don't know Gmail, but the first time I installed an UTF-8 version of Courier, before launching the new daemon I had to convert any directory on disk. Courier provides a conversion utility for that.

Switching to UTF-8 simplifies IMAP a bit, although, having to support both UTF-8 and mUTF-7 actually complicates it. I'm not clear how much longer will we have to endure UTF-16.

Using TB 78, I have a folder named Skr&AOQ-ppost on disk. It should be converted when switching to UTF-8.

Exactly, I see testé and testé.msf now. On Windows, you will run into trouble when your local ANSI page can't hold the name. I'm looking into bug 1686034 comment #26 right now and the reason why this fails is that the filename is NOT UTF-8 when returned from Windows.

UTF-16 is the internal encoding in Mozilla, also used in JavaScript and for Windows WCHAR type. It will be forever.

A bit of pedantism. The various wchar types out there were invented in the era of UCS-2 and are mostly UTF-16 oblivious. UTF-16 surrogate pairs had to be hacked in later. Windows wchar falls in that camp - it is UTF-16 surrogate pair oblivious and thus mangling is quite possible. This is the reason for WTF-8. http://simonsapin.github.io/wtf-8/

Otherwise, yeah, we're stuck w/ UTF-16 for a good long while. Hopefully not forever. Windows is trying to slowly move off of it too. Maybe in another hundred years the limitations of UTF-16 in the UTF-8 encoding length can be dropped ☺
https://utf8everywhere.org/

Anyway, carry on - I'm very much looking forward to this bug landing!

(In reply to Alessandro Vesely from comment #87)

(In reply to Klaus B. from comment #85)

I don't know Gmail, but the first time I installed an UTF-8 version of Courier, before launching the new daemon I had to convert any directory on disk. Courier provides a conversion utility for that.

Does that utility convert files in tb profile files or Courier's own files? Just curious.

Anyway, carry on - I'm very much looking forward to this bug landing!

This bug has landed months ago and is now causing problems in beta, see comment #84. Those bugs revealed some fundamental issues, so it's going to be a while before this is ready for prime-time.

Thanks for the links, fascinating reading :-)

Does that utility convert files in tb profile files or Courier's own files?

Looks like it re-encodes MUTF-7 file names to UTF-8. As far as I know, (some) IMAP servers use maildir, just like TB.

(In reply to Klaus B. from comment #91)

Does that utility convert files in tb profile files or Courier's own files?

Looks like it re-encodes MUTF-7 file names to UTF-8. As far as I know, (some) IMAP servers use maildir, just like TB.

It also converts maildrop recipes, if they were created by Courier's automaton.

(In reply to nemo from comment #89)

https://utf8everywhere.org/

That manifesto mentions Java but not Javascript.

Magnus, can you please set bug 1686415, bug 1685033, bug 1685450 and bug 1687452 as additional regressions. There are also various issues that slipped the review, see comment #86. Finally, there are a few call sites in the code base https://searchfox.org/comm-central/search?q=CopyUTF16toMUTF7&path=&case=false&regexp=false where CopyUTF16toMUTF7 is called unconditionally. That is most likely wrong.

Backing out this bug here would be the way to go since you're already in a "whack a mole" situation where more bugs are bound to pop up.

Flags: needinfo?(mkmelin+mozilla)

And bug 1685449 appears to be another regression.

Regressions: 1687452

I don't think we want to back it out, but rather fix the discovered problems.

Flags: needinfo?(mkmelin+mozilla)
Regressions: 1686415, 1685033, 1685450
Regressions: 1687727
Regressions: 1687938
Regressions: 1685449

To all the people following this bug: All (known) regressions have been fixed, so this should work perfectly in TB Daily and TB 86 beta soon to come.

Note that if your file system can handle the names, you will get the mailbox name as file name. If your filesystem can't handle the name, it will be "hashed up". For example, on my US-English Windows machine with ANSI/windows-1252 locale, a folder with name テストテストテスト ends up as 61ef77d0.msf.

Regressions: 1737514
Regressions: 1739784
Regressions: 1739789
Regressions: 1739814
Regressions: 1739903
Regressions: 1746348
See Also: → 1817822
See Also: → 1818676
See Also: → 1828199
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: