1783538 - For host with self-signed certificate, error code returned to thunderbird is not always in class ERROR_CLASS_BAD_CERT

Reporter

Description

•

2 years ago

+++ This bug was initially created as a clone of Bug #1764770 +++
Thunderbird itself now calls the "Add Security Exception" dialog when it sees an error indicating that the imap server's certificate is bad to allow the user to override the exception, such as 0x805a3ff2. However, randomly with a probability of about .5, a non-security error will be returned for the same server such as one indicating that the connection has been dropped: 0x804b0014.

With a bad certificate, the networking code should only returned errors in the class ERROR_CLASS_BAD_CERT.

Here's an example of two attempted connections returning the unexpected and expected codes. This is from an IMAP:5 protocol log:

I/IMAP 7fe897924000:wally.dbnet.lan:NA:CreateNewLineFromSocket: clearing IMAP_CONNECTION_IS_OPEN - rv = 804b0014

I/IMAP 7fe89900e800:wally.dbnet.lan:NA:CreateNewLineFromSocket: clearing IMAP_CONNECTION_IS_OPEN - rv = 805a3ff2

Please see bug 1764770 comment 48 for more details.

gene smith

Reporter

Updated

•

2 years ago

Comment 1

•

2 years ago

I think the comment here describes what is going on:

We could in theory overwrite a previously-set error code, but we'll always have some sort of error.

It appear the the "bad certificate" error code is recorded first and then the "connection reset" code overwrites it and is reported to TB. (This is probably random and can happen in the opposite order too.)
I have no clue as to how to fix it properly or if it can or should be fixed. But I proved that that is what is happening by patching some low level moz code and checking for the hard-coded error codes in question. With this experimental change, I prevent the overwrite of the already recorded "bad certificate" error code and TB no longer sees the "connection reset" error code so the "Get Messages" button always invokes the override dialog:

diff --git a/nsprpub/pr/src/misc/prerror.c b/nsprpub/pr/src/misc/prerror.c
--- a/nsprpub/pr/src/misc/prerror.c
+++ b/nsprpub/pr/src/misc/prerror.c
@@ -18,17 +18,23 @@ PR_IMPLEMENT(PRInt32) PR_GetOSError(void
 {
     PRThread *thread = PR_GetCurrentThread();
     return thread->osErrorCode;
 }
 
 PR_IMPLEMENT(void) PR_SetError(PRErrorCode code, PRInt32 osErr)
 {
     PRThread *thread = PR_GetCurrentThread();
-    thread->errorCode = code;
+    if (code == -5961) // set code to connection reset
+      printf("gds: errorCode = %d\n", thread->errorCode);
+    // if thread has code bad cert and trying to set to conn reset, keep as is.
+    if (thread->errorCode == -16370 && code == -5961)
+      printf("gds: keep code at %d, bad cert\n", thread->errorCode);
+    else
+      thread->errorCode = code;
     thread->osErrorCode = osErr;
     thread->errorStringLength = 0;
 }

Dragana Damjanovic [:dragana]

Updated

•

2 years ago

Component: Networking → Security: PSM

Dana Keeler (she/her) (use needinfo) [:keeler]

Comment 2

•

2 years ago

Can you use https://mozilla.github.io/mozregression/quickstart.html to determine when this started happening? Thanks!

Flags: needinfo?(gds)

gene smith

Reporter

Comment 3

•

2 years ago

•

Edited

I found that up to TB nightly 2022-06-13 that clicking "get messages" always brings up the exception dialog since only "bad certificate" error (0x805a3ff2) is returned by the mozilla code.
But at TB daily build 2022-06-14, only the TCP disconnected error is returned (0x804b0014) so the exception dialog never appears.
Then on and after 2022-06-15, there appears to be a "partial fix" in that both 0x805a3ff2 and 0x804b0014 occur randomly so sometimes the override dialog appears and sometimes it doesn't on click of "Get Messages".

So in conclusion, the regression occurs with tb daily build 2022-06-14:
https://archive.mozilla.org/pub/thunderbird/nightly/2022/06/2022-06-14-10-56-18-comm-central/

Daily 2022-06-14 is defined by these revisions per the *.txt files:

20220614105618
https://hg.mozilla.org/comm-central/rev/9163e343a98baab477ae50139f37a2909a646b6b
https://hg.mozilla.org/mozilla-central/rev/ac931cd7399bfdc23b6c427a10e050b3dc074a0e

Looking at the corresponding push logs, I couldn't identify any obvious causes for this bug.

Flags: needinfo?(gds)

Dana Keeler (she/her) (use needinfo) [:keeler]

Comment 4

•

2 years ago

Looks like there are some networking changes around that time.

Component: Security: PSM → Networking

gene smith

Reporter

Comment 5

•

2 years ago

•

Edited

The 2022-06-13 nightly, which doesn't have the issue, was the last version without the new "grease" parameters. When I change security.tls.ech.grease_probability from default of 50 to 0 on a recent nightly TB build, the bug goes away and I only see the "bad certificate" error and don't see the "tcp disconnect" errors when clicking "Get Messages".
So this workaround fixes the immediate TB issue but whatever the "grease probability" setting at 0% may break, I have no real idea.
Re: https://datatracker.ietf.org/doc/html/draft-ietf-tls-grease-01

Teal Dulcet [:tdulcet] (TB Council)

Comment 6

•

2 years ago

(In reply to gene smith from comment #5)

When I change security.tls.ech.grease_probability from default of 50 to 0 on a recent nightly TB build, the bug goes away and I only see the "bad certificate" error and don't see the "tcp disconnect" errors when clicking "Get Messages".

Thanks for debugging this and finding a workaround. That default value of 50 is currently only set on Nightly, so it looks like this was regressed by Bug 1774001. For some reason that is a blocked security bug, but you can see the changes here and here. Also see Bug 1767974. That is great news that this does not affect TB Beta or 102 ESR.

edgul

Updated

•

2 years ago

Component: Networking → Security

Dana Keeler (she/her) (use needinfo) [:keeler]

Comment 7

•

2 years ago

Looks like maybe an ECH-incompatible server/middlebox?

Flags: needinfo?(djackson)

Steven Michaud [:smichaud] (Retired)

Comment 8

•

2 years ago

•

Edited

Does any server software support ECH (Encrypted Client Hello) yet? I checked Dovecot and Postfix (including grepping their source code) and didn't find any indication of support in either. (In both cases I grepped the source code for the latest release versions.)

gene smith

Reporter

Comment 9

•

2 years ago

From Teal in comment 6:

That default value of 50 is currently only set on Nightly,

I'm seeing it at 50% on tb beta 104.0b4. I think it's because I ran nightly and and the values present there stuck when running beta with same profile.

From Dana in comment 7:

Looks like maybe an ECH-incompatible server/middlebox?

Only thing between my laptop running tb and imap server for testing (dovecot on fedora 35 I think) is asus home wifi router.

From Steven in comment 8:

Does any server software support ECH (Encrypted Client Hello) yet? I checked Dovecot and Postfix (including grepping their source code) and didn't find any indication of support in either. (In both cases I grepped the source code for the latest release versions.)

Thanks. Good to know that.

Dana Keeler (she/her) (use needinfo) [:keeler]

Comment 10

•

2 years ago

(In reply to gene smith from comment #9)

Only thing between my laptop running tb and imap server for testing (dovecot on fedora 35 I think) is asus home wifi router.

Try a different router?

Flags: needinfo?(gds)

gene smith

Reporter

Comment 11

•

2 years ago

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #10)

(In reply to gene smith from comment #9)

Only thing between my laptop running tb and imap server for testing (dovecot on fedora 35 I think) is asus home wifi router.

Try a different router?

Don't have another router. Besides, commenter Teal sees the same issue and I think his server is somewhere on the "real" internet.

Flags: needinfo?(gds)

Steven Michaud [:smichaud] (Retired)

Comment 12

•

2 years ago

(In reply to Steven Michaud [:smichaud] (Retired) from comment #8)

Does any server software support ECH (Encrypted Client Hello) yet? I checked Dovecot and Postfix (including grepping their source code) and didn't find any indication of support in either. (In both cases I grepped the source code for the latest release versions.)

For what it's worth, Wikipedia's article on Server Name Indication (a predecessor of ECH) lists some server software that supports it. They're all web servers.

gene smith

Reporter

Comment 13

•

2 years ago

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #10)

(In reply to gene smith from comment #9)

Only thing between my laptop running tb and imap server for testing (dovecot on fedora 35 I think) is asus home wifi router.

Try a different router?

I connected the laptop running TB to the fedora server box running dovecot directly with an ethernet cable to eliminate the router. With grease probability set to 50 I still see about half the responses as 804b0014 (tcp disconnect) and half as 805a3ff2 (bad cert) as when a router is in between.
When I set the probability to 0 all the responses are 805a3ff2 (bad cert) as they should be.
Running dovecot 2.3.19.1 on fedora 36 (not 35 as stated above).
So it appears that the router (or lack of a router) is not a factor.

Teal Dulcet [:tdulcet] (TB Council)

Comment 14

•

2 years ago

(In reply to gene smith from comment #11)

Besides, commenter Teal sees the same issue and I think his server is somewhere on the "real" internet.

Yes, the server I am having issues with is on the public internet.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #12)

For what it's worth, Wikipedia's article on Server Name Indication (a predecessor of ECH) lists some server software that supports it. They're all web servers.

Yeah, according to the Mozilla Security Blog, enabling ECH requires setting a HTTPSSVC DNS record, which seems geared towards web servers.

However, Mozilla's ECH implementation requires DoH (DNS over HTTPS, see Bug 1500289), which for some reason is disabled by default in Thunderbird, so ECH should not even be being attempted. Actually, Mozilla's two TRR/DoH providers are not listed in TB Beta or Daily (another regression, see Bug 1757761), so users could not use ECH even if it was supported by their mail server without first knowing the (hidden) Mozilla TRR URIs and then adding a "Custom" provider. For reference, the Cloudflare URI seems to be: https://mozilla.cloudflare-dns.com/dns-query based on the network.trr.default_provider_uri preference.

Magnus Melin [:mkmelin]

Comment 15

•

2 years ago

Testing with self signed certificates, I see this very frequently.
Regression range, from comment 3: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3c71dd51a31a126cda6ac40eec82ae1090ad8537&tochange=ac931cd7399bfdc23b6c427a10e050b3dc074a0e

gene smith

Reporter

Comment 16

•

2 years ago

(In reply to Magnus Melin [:mkmelin] from comment #15)

Testing with self signed certificates, I see this very frequently.
Regression range, from comment 3: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3c71dd51a31a126cda6ac40eec82ae1090ad8537&tochange=ac931cd7399bfdc23b6c427a10e050b3dc074a0e

Re: comment 5
The regression is one of the commits with the word "Grease" or "ECH" in the title.
If they can't/won't fix it, maybe tb could just default security.tls.ech.grease_probability to 0?
If that's possible, it fixes the need to retry "Get Messages" but does it affect security in TB?

Dennis Jackson

Comment 17

•

2 years ago

Thank you for the reports Gene, Teal and Magnus.

Encrypted Client Hello (ECH) can operate in two modes: GREASE and 'real'. In GREASE mode, NSS adds a dummy ECH extension which is syntactically valid but filled with random data. All correct TLS server implementations are required to ignore unrecognised extensions and carry on the connection. There's no actual encryption of the SNI or other fields and the client is fully expecting the server to ignore the extension.

From Gene's report, it seems we can rule out any kind of network interference from a third party and so it initially sounded to me like a bug in the dovecot TLS server implementation. However, its pretty unusual that all the reports involve an untrusted certificate, since dovecot ought to also be breaking connections with a valid cert. That suggests the bug is probably in the error handling code in Thunderbird.

In checkHandshake, we set wantRetry to true if the connection used GREASE. This in turn causes the error to be set to PR_CONNECT_RESET_ERROR, but this behaviour has long been used to work around TLS version intolerances and shouldn't be problematic.

Our options here are to either improve the retry mechanism in Firefox to better report the underlying connection state and so allow Thunderbird to continue with an invalid cert error or else to have Thunderbird adopt a similar retry mechanism as Firefox.

In the short term, setting security.tls.ech.grease_probability to 0 has no security consequences. However, this isn't a good long term fix because a) we do want to use GREASE and b) this mechanism will still break the next time we add an additional retry mechanism to PSM.

Flags: needinfo?(djackson)

Magnus Melin [:mkmelin]

Comment 18

•

2 years ago

(In reply to Dennis Jackson from comment #17)

From Gene's report, it seems we can rule out any kind of network interference from a third party and so it initially sounded to me like a bug in the dovecot TLS server implementation. However, its pretty unusual that all the reports involve an untrusted certificate, since dovecot ought to also be breaking connections with a valid cert. That suggests the bug is probably in the error handling code in Thunderbird.

It is IMO pretty unlikely to be a bug in Thunderbird. For the cases we noticed, it's cases where we expect that the certificate may be self-signed, and we try to catch that case to be able to allow the user to add the security exception if needed. Gene saw it with the C++ imap code, and I've seen it with POP3 js code (which uses TCPSocket). That is, reacting to a TCPSocketErrorEvent: https://searchfox.org/comm-central/rev/c2de20c00273b0d8795bc2f9b316bc0a37ef5290/mailnews/local/src/Pop3Client.jsm#351

Thunderbird is not really handling the error code in any way other than probing what we got back and checking IF there is a possibility to do a cert override for that. But that now only works sometimes since the general error is instead often obtained instead of ERROR_CLASS_BAD_CERT

Our options here are to either improve the retry mechanism in Firefox to better report the underlying connection state and so allow Thunderbird to continue with an invalid cert error or else to have Thunderbird adopt a similar retry mechanism as Firefox.

AFAIK it's not caused any errors once we actually have a proper secure connection. It's a problem for the initial connection if what we're getting back is ERROR_CLASS_BAD_CERT, but we can't detect that.

Dennis Jackson

Comment 19

•

2 years ago

@Magnus - The key bit is the code in PSM here. We overwrite the error code in order to prompt Firefox to retry the connection. The problem is that a) that's a bad way to signal we need to retry and b) apparently Thunderbird does not use the same retry logic.

gene smith

Reporter

Comment 20

•

2 years ago

(In reply to Dennis Jackson from comment #19)

@Magnus - The key bit is the code in PSM here. We overwrite the error code in order to prompt Firefox to retry the connection. The problem is that a) that's a bad way to signal we need to retry and b) apparently Thunderbird does not use the same retry logic.

TB doesn't do any low level (TCP or TLS) retries. If there's a non-overridable TLS error, TB c++ puts up an alert pop-up and the user can't access their mailboxes until they fix the problem. If there's an overridable TLS error (like self-signed cert or issuer unknown) again, the user can't access mailboxes but there is no alert. For overridable TLS errors the user must click the "Get Messages" button which trigger a select URL. The URL expects to see the overridable TLS error response and invoke the override dialog to clear the error and allow mailbox access. But, after clicking "Get Messages" if the select URL sees PR_CONNECT_RESET_ERROR, which will currently happen about half the time with grease prob. at 50%, the override dialog will not be invoked.

FWIW, this small diff fixes the problem for TB with grease probability at 50%. It's actually just a generalization of my diff in comment 1 that also sort-of "fixed" the problem. My thought is that if there's an overridable cert error, there is no point in triggering a retry so just leave the err value at what it is and don't set it to PR_CONNECT_RESET_ERROR. But there's probably more to this that I'm not understanding.

diff --git a/security/manager/ssl/nsNSSIOLayer.cpp b/security/manager/ssl/nsNSSIOLayer.cpp
--- a/security/manager/ssl/nsNSSIOLayer.cpp
+++ b/security/manager/ssl/nsNSSIOLayer.cpp
@@ -1287,20 +1287,20 @@ int32_t checkHandshake(int32_t bytesTran
     if (handleHandshakeResultNow) {
       wantRetry = retryDueToTLSIntolerance(PR_END_OF_FILE_ERROR, socketInfo);
     }
   }
 
   if (wantRetry) {
     MOZ_LOG(gPIPNSSLog, LogLevel::Debug,
             ("[%p] checkHandshake: will retry with lower max TLS version\n",
              ssl_layer_fd));
-    // We want to cause the network layer to retry the connection.
-    err = PR_CONNECT_RESET_ERROR;
+    // We (may?) want to cause the network layer to retry the connection.
+    if (!ErrorIsOverridable(err)) err = PR_CONNECT_RESET_ERROR;
     if (wasReading) bytesTransfered = -1;
   }
 
   // TLS intolerant servers only cause the first transfer to fail, so let's
   // set the HandshakePending attribute to false so that we don't try the logic
   // above again in a subsequent transfer.
   if (handleHandshakeResultNow) {
     // Report the result once for each handshake. Note that this does not
     // get handshakes which are cancelled before any reads or writes

Daniel Veditz [:dveditz]

Comment 21

•

2 years ago

Moving back to PSM given comment 19. Better (permanent) link for that comment:
https://hg.mozilla.org/mozilla-central/file/7dfc84c3338c5c69e618d04783a275534b5f6641/security/manager/ssl/nsNSSIOLayer.cpp#l1296

Component: Security → Security: PSM

Dana Keeler (she/her) (use needinfo) [:keeler]

Updated

•

2 years ago

Severity: -- → S3

Priority: -- → P3

Whiteboard: [psm-backlog]

Magnus Melin [:mkmelin]

Comment 22

•

1 year ago

I think this bug disappeared. Gene, can you confirm?

gene smith

Reporter

Comment 23

•

1 year ago

(In reply to Magnus Melin [:mkmelin] from comment #22)

I think this bug disappeared. Gene, can you confirm?

Yes, It appears to have been fixed. My "grease" pref settings are all at default and I don't have any patches related to this in effect in comm or moz code. I also get an exception dialog on first click of "get messages" after I've deleted my self-signed certificate for my local dovecot.

So looks good now.

Magnus Melin [:mkmelin]

Updated

•

1 year ago

Status: NEW → RESOLVED

Closed: 1 year ago

Resolution: --- → WORKSFORME