Closed Bug 1783538 Opened 2 years ago Closed 1 year ago

For host with self-signed certificate, error code returned to thunderbird is not always in class ERROR_CLASS_BAD_CERT

Categories

(Core :: Security: PSM, defect, P3)

defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: gds, Unassigned)

References

()

Details

(Whiteboard: [psm-backlog])

+++ This bug was initially created as a clone of Bug #1764770 +++
Thunderbird itself now calls the "Add Security Exception" dialog when it sees an error indicating that the imap server's certificate is bad to allow the user to override the exception, such as 0x805a3ff2. However, randomly with a probability of about .5, a non-security error will be returned for the same server such as one indicating that the connection has been dropped: 0x804b0014.

With a bad certificate, the networking code should only returned errors in the class ERROR_CLASS_BAD_CERT.

Here's an example of two attempted connections returning the unexpected and expected codes. This is from an IMAP:5 protocol log:

I/IMAP 7fe897924000:wally.dbnet.lan:NA:CreateNewLineFromSocket: clearing IMAP_CONNECTION_IS_OPEN - rv = 804b0014

I/IMAP 7fe89900e800:wally.dbnet.lan:NA:CreateNewLineFromSocket: clearing IMAP_CONNECTION_IS_OPEN - rv = 805a3ff2

Please see bug 1764770 comment 48 for more details.

See Also: → 1764770

I think the comment here describes what is going on:

We could in theory overwrite a previously-set error code, but we'll always have some sort of error.

It appear the the "bad certificate" error code is recorded first and then the "connection reset" code overwrites it and is reported to TB. (This is probably random and can happen in the opposite order too.)
I have no clue as to how to fix it properly or if it can or should be fixed. But I proved that that is what is happening by patching some low level moz code and checking for the hard-coded error codes in question. With this experimental change, I prevent the overwrite of the already recorded "bad certificate" error code and TB no longer sees the "connection reset" error code so the "Get Messages" button always invokes the override dialog:

diff --git a/nsprpub/pr/src/misc/prerror.c b/nsprpub/pr/src/misc/prerror.c
--- a/nsprpub/pr/src/misc/prerror.c
+++ b/nsprpub/pr/src/misc/prerror.c
@@ -18,17 +18,23 @@ PR_IMPLEMENT(PRInt32) PR_GetOSError(void
 {
     PRThread *thread = PR_GetCurrentThread();
     return thread->osErrorCode;
 }
 
 PR_IMPLEMENT(void) PR_SetError(PRErrorCode code, PRInt32 osErr)
 {
     PRThread *thread = PR_GetCurrentThread();
-    thread->errorCode = code;
+    if (code == -5961) // set code to connection reset
+      printf("gds: errorCode = %d\n", thread->errorCode);
+    // if thread has code bad cert and trying to set to conn reset, keep as is.
+    if (thread->errorCode == -16370 && code == -5961)
+      printf("gds: keep code at %d, bad cert\n", thread->errorCode);
+    else
+      thread->errorCode = code;
     thread->osErrorCode = osErr;
     thread->errorStringLength = 0;
 }
Component: Networking → Security: PSM

Can you use https://mozilla.github.io/mozregression/quickstart.html to determine when this started happening? Thanks!

Flags: needinfo?(gds)

I found that up to TB nightly 2022-06-13 that clicking "get messages" always brings up the exception dialog since only "bad certificate" error (0x805a3ff2) is returned by the mozilla code.
But at TB daily build 2022-06-14, only the TCP disconnected error is returned (0x804b0014) so the exception dialog never appears.
Then on and after 2022-06-15, there appears to be a "partial fix" in that both 0x805a3ff2 and 0x804b0014 occur randomly so sometimes the override dialog appears and sometimes it doesn't on click of "Get Messages".

So in conclusion, the regression occurs with tb daily build 2022-06-14:
https://archive.mozilla.org/pub/thunderbird/nightly/2022/06/2022-06-14-10-56-18-comm-central/

Daily 2022-06-14 is defined by these revisions per the *.txt files:

20220614105618
https://hg.mozilla.org/comm-central/rev/9163e343a98baab477ae50139f37a2909a646b6b
https://hg.mozilla.org/mozilla-central/rev/ac931cd7399bfdc23b6c427a10e050b3dc074a0e

Looking at the corresponding push logs, I couldn't identify any obvious causes for this bug.

Flags: needinfo?(gds)

Looks like there are some networking changes around that time.

Component: Security: PSM → Networking

The 2022-06-13 nightly, which doesn't have the issue, was the last version without the new "grease" parameters. When I change security.tls.ech.grease_probability from default of 50 to 0 on a recent nightly TB build, the bug goes away and I only see the "bad certificate" error and don't see the "tcp disconnect" errors when clicking "Get Messages".
So this workaround fixes the immediate TB issue but whatever the "grease probability" setting at 0% may break, I have no real idea.
Re: https://datatracker.ietf.org/doc/html/draft-ietf-tls-grease-01

(In reply to gene smith from comment #5)

When I change security.tls.ech.grease_probability from default of 50 to 0 on a recent nightly TB build, the bug goes away and I only see the "bad certificate" error and don't see the "tcp disconnect" errors when clicking "Get Messages".

Thanks for debugging this and finding a workaround. That default value of 50 is currently only set on Nightly, so it looks like this was regressed by Bug 1774001. For some reason that is a blocked security bug, but you can see the changes here and here. Also see Bug 1767974. That is great news that this does not affect TB Beta or 102 ESR.

Component: Networking → Security

Looks like maybe an ECH-incompatible server/middlebox?

Flags: needinfo?(djackson)

Does any server software support ECH (Encrypted Client Hello) yet? I checked Dovecot and Postfix (including grepping their source code) and didn't find any indication of support in either. (In both cases I grepped the source code for the latest release versions.)

From Teal in comment 6:

That default value of 50 is currently only set on Nightly,

I'm seeing it at 50% on tb beta 104.0b4. I think it's because I ran nightly and and the values present there stuck when running beta with same profile.

From Dana in comment 7:

Looks like maybe an ECH-incompatible server/middlebox?

Only thing between my laptop running tb and imap server for testing (dovecot on fedora 35 I think) is asus home wifi router.

From Steven in comment 8:

Does any server software support ECH (Encrypted Client Hello) yet? I checked Dovecot and Postfix (including grepping their source code) and didn't find any indication of support in either. (In both cases I grepped the source code for the latest release versions.)

Thanks. Good to know that.

(In reply to gene smith from comment #9)

Only thing between my laptop running tb and imap server for testing (dovecot on fedora 35 I think) is asus home wifi router.

Try a different router?

Flags: needinfo?(gds)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #10)

(In reply to gene smith from comment #9)

Only thing between my laptop running tb and imap server for testing (dovecot on fedora 35 I think) is asus home wifi router.

Try a different router?

Don't have another router. Besides, commenter Teal sees the same issue and I think his server is somewhere on the "real" internet.

Flags: needinfo?(gds)

(In reply to Steven Michaud [:smichaud] (Retired) from comment #8)

Does any server software support ECH (Encrypted Client Hello) yet? I checked Dovecot and Postfix (including grepping their source code) and didn't find any indication of support in either. (In both cases I grepped the source code for the latest release versions.)

For what it's worth, Wikipedia's article on Server Name Indication (a predecessor of ECH) lists some server software that supports it. They're all web servers.

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #10)

(In reply to gene smith from comment #9)

Only thing between my laptop running tb and imap server for testing (dovecot on fedora 35 I think) is asus home wifi router.

Try a different router?

I connected the laptop running TB to the fedora server box running dovecot directly with an ethernet cable to eliminate the router. With grease probability set to 50 I still see about half the responses as 804b0014 (tcp disconnect) and half as 805a3ff2 (bad cert) as when a router is in between.
When I set the probability to 0 all the responses are 805a3ff2 (bad cert) as they should be.
Running dovecot 2.3.19.1 on fedora 36 (not 35 as stated above).
So it appears that the router (or lack of a router) is not a factor.

(In reply to gene smith from comment #11)

Besides, commenter Teal sees the same issue and I think his server is somewhere on the "real" internet.

Yes, the server I am having issues with is on the public internet.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #12)

For what it's worth, Wikipedia's article on Server Name Indication (a predecessor of ECH) lists some server software that supports it. They're all web servers.

Yeah, according to the Mozilla Security Blog, enabling ECH requires setting a HTTPSSVC DNS record, which seems geared towards web servers.

However, Mozilla's ECH implementation requires DoH (DNS over HTTPS, see Bug 1500289), which for some reason is disabled by default in Thunderbird, so ECH should not even be being attempted. Actually, Mozilla's two TRR/DoH providers are not listed in TB Beta or Daily (another regression, see Bug 1757761), so users could not use ECH even if it was supported by their mail server without first knowing the (hidden) Mozilla TRR URIs and then adding a "Custom" provider. For reference, the Cloudflare URI seems to be: https://mozilla.cloudflare-dns.com/dns-query based on the network.trr.default_provider_uri preference.

(In reply to Magnus Melin [:mkmelin] from comment #15)

Testing with self signed certificates, I see this very frequently.
Regression range, from comment 3: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3c71dd51a31a126cda6ac40eec82ae1090ad8537&tochange=ac931cd7399bfdc23b6c427a10e050b3dc074a0e

Re: comment 5
The regression is one of the commits with the word "Grease" or "ECH" in the title.
If they can't/won't fix it, maybe tb could just default security.tls.ech.grease_probability to 0?
If that's possible, it fixes the need to retry "Get Messages" but does it affect security in TB?

Thank you for the reports Gene, Teal and Magnus.

Encrypted Client Hello (ECH) can operate in two modes: GREASE and 'real'. In GREASE mode, NSS adds a dummy ECH extension which is syntactically valid but filled with random data. All correct TLS server implementations are required to ignore unrecognised extensions and carry on the connection. There's no actual encryption of the SNI or other fields and the client is fully expecting the server to ignore the extension.

From Gene's report, it seems we can rule out any kind of network interference from a third party and so it initially sounded to me like a bug in the dovecot TLS server implementation. However, its pretty unusual that all the reports involve an untrusted certificate, since dovecot ought to also be breaking connections with a valid cert. That suggests the bug is probably in the error handling code in Thunderbird.

In checkHandshake, we set wantRetry to true if the connection used GREASE. This in turn causes the error to be set to PR_CONNECT_RESET_ERROR, but this behaviour has long been used to work around TLS version intolerances and shouldn't be problematic.

Our options here are to either improve the retry mechanism in Firefox to better report the underlying connection state and so allow Thunderbird to continue with an invalid cert error or else to have Thunderbird adopt a similar retry mechanism as Firefox.

In the short term, setting security.tls.ech.grease_probability to 0 has no security consequences. However, this isn't a good long term fix because a) we do want to use GREASE and b) this mechanism will still break the next time we add an additional retry mechanism to PSM.

Flags: needinfo?(djackson)

(In reply to Dennis Jackson from comment #17)

From Gene's report, it seems we can rule out any kind of network interference from a third party and so it initially sounded to me like a bug in the dovecot TLS server implementation. However, its pretty unusual that all the reports involve an untrusted certificate, since dovecot ought to also be breaking connections with a valid cert. That suggests the bug is probably in the error handling code in Thunderbird.

It is IMO pretty unlikely to be a bug in Thunderbird. For the cases we noticed, it's cases where we expect that the certificate may be self-signed, and we try to catch that case to be able to allow the user to add the security exception if needed. Gene saw it with the C++ imap code, and I've seen it with POP3 js code (which uses TCPSocket). That is, reacting to a TCPSocketErrorEvent: https://searchfox.org/comm-central/rev/c2de20c00273b0d8795bc2f9b316bc0a37ef5290/mailnews/local/src/Pop3Client.jsm#351

Thunderbird is not really handling the error code in any way other than probing what we got back and checking IF there is a possibility to do a cert override for that. But that now only works sometimes since the general error is instead often obtained instead of ERROR_CLASS_BAD_CERT

Our options here are to either improve the retry mechanism in Firefox to better report the underlying connection state and so allow Thunderbird to continue with an invalid cert error or else to have Thunderbird adopt a similar retry mechanism as Firefox.

AFAIK it's not caused any errors once we actually have a proper secure connection. It's a problem for the initial connection if what we're getting back is ERROR_CLASS_BAD_CERT, but we can't detect that.

@Magnus - The key bit is the code in PSM here. We overwrite the error code in order to prompt Firefox to retry the connection. The problem is that a) that's a bad way to signal we need to retry and b) apparently Thunderbird does not use the same retry logic.

(In reply to Dennis Jackson from comment #19)

@Magnus - The key bit is the code in PSM here. We overwrite the error code in order to prompt Firefox to retry the connection. The problem is that a) that's a bad way to signal we need to retry and b) apparently Thunderbird does not use the same retry logic.

TB doesn't do any low level (TCP or TLS) retries. If there's a non-overridable TLS error, TB c++ puts up an alert pop-up and the user can't access their mailboxes until they fix the problem. If there's an overridable TLS error (like self-signed cert or issuer unknown) again, the user can't access mailboxes but there is no alert. For overridable TLS errors the user must click the "Get Messages" button which trigger a select URL. The URL expects to see the overridable TLS error response and invoke the override dialog to clear the error and allow mailbox access. But, after clicking "Get Messages" if the select URL sees PR_CONNECT_RESET_ERROR, which will currently happen about half the time with grease prob. at 50%, the override dialog will not be invoked.

FWIW, this small diff fixes the problem for TB with grease probability at 50%. It's actually just a generalization of my diff in comment 1 that also sort-of "fixed" the problem. My thought is that if there's an overridable cert error, there is no point in triggering a retry so just leave the err value at what it is and don't set it to PR_CONNECT_RESET_ERROR. But there's probably more to this that I'm not understanding.

diff --git a/security/manager/ssl/nsNSSIOLayer.cpp b/security/manager/ssl/nsNSSIOLayer.cpp
--- a/security/manager/ssl/nsNSSIOLayer.cpp
+++ b/security/manager/ssl/nsNSSIOLayer.cpp
@@ -1287,20 +1287,20 @@ int32_t checkHandshake(int32_t bytesTran
     if (handleHandshakeResultNow) {
       wantRetry = retryDueToTLSIntolerance(PR_END_OF_FILE_ERROR, socketInfo);
     }
   }
 
   if (wantRetry) {
     MOZ_LOG(gPIPNSSLog, LogLevel::Debug,
             ("[%p] checkHandshake: will retry with lower max TLS version\n",
              ssl_layer_fd));
-    // We want to cause the network layer to retry the connection.
-    err = PR_CONNECT_RESET_ERROR;
+    // We (may?) want to cause the network layer to retry the connection.
+    if (!ErrorIsOverridable(err)) err = PR_CONNECT_RESET_ERROR;
     if (wasReading) bytesTransfered = -1;
   }
 
   // TLS intolerant servers only cause the first transfer to fail, so let's
   // set the HandshakePending attribute to false so that we don't try the logic
   // above again in a subsequent transfer.
   if (handleHandshakeResultNow) {
     // Report the result once for each handshake. Note that this does not
     // get handshakes which are cancelled before any reads or writes
Component: Security → Security: PSM
Severity: -- → S3
Priority: -- → P3
Whiteboard: [psm-backlog]

I think this bug disappeared. Gene, can you confirm?

(In reply to Magnus Melin [:mkmelin] from comment #22)

I think this bug disappeared. Gene, can you confirm?

Yes, It appears to have been fixed. My "grease" pref settings are all at default and I don't have any patches related to this in effect in comm or moz code. I also get an exception dialog on first click of "get messages" after I've deleted my self-signed certificate for my local dovecot.

So looks good now.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.