[autoconfig] Exchange config fails due to TLS OCSP requests
Categories
(Thunderbird :: Account Manager, defect)
Tracking
(thunderbird_esr6069+ fixed, thunderbird_esr6869+ fixed, thunderbird69 fixed, thunderbird70 fixed)
People
(Reporter: BenB, Assigned: neil)
References
Details
Attachments
(1 file, 4 obsolete files)
3.79 KB,
patch
|
neil
:
review+
jorgk-bmo
:
approval-comm-beta+
jorgk-bmo
:
approval-comm-esr60+
jorgk-bmo
:
approval-comm-esr68+
|
Details | Diff | Splinter Review |
In the automatic account creation dialog:
Situation:
When we find an Exchange config, we fetch addons.json to get the addons that can support Exchange servers in Thunderbird. We request https://live.thunderbird.net/autoconfig/addons.json , which redirects to https://autoconfig.thunderbird.net/addons.json .
In FetchHTTP, we are settings a timeout of 5 seconds for any HTTPS calls that we make during the autoconfiguration. That is because we contact a number of different servers, including servers from the ISP, and they are often badly configured and do not respond in time or at all. To avoid the user waiting a lot time, we set this timeout. An HTTP server that needs more than 5 seconds to respond, I would consider it broken. DNS can add time, but that's usually no more than 1 second for a cold cache. Then we added HTTPS into the mix. TLS setup take a moment, but usually only 200 ms.
There's a utility called hey which can measure the time taken to make HTTP requests. With its default test settings of 200 requests, max 50 simultaneously, the slowest request only took 1.3671 seconds, with 25% of requests actually finishing in half the time. Given that we're using a timeout of 5 second, this should be plenty of time for the request to finish.
Bug:
However, Thunderbird enabled OSCP by default, for TLS certificate revokation checks. That happens transparently during the TLS setup and makes another server request, to the CA. And apparently, that call is slooooooow.
On the same machine that ran hey, the call to https://live.thunderbird.net/autoconfig/addons.json took up to 4532ms on TLS setup alone! Consequently, the redirect calls to https://autoconfig.thunderbird.net/addons.json were cancelled due to the timeout.
Impact:
In the statistics, I am seeing a massive discrepancy between people who fetched addons.json (from the server perspective) and who installed Owl. The discrepancy cannot be explained with people not wanting the addon, it's way too huge. This bug is a possible explanation: The users request addons.json, but never receive it, so they cannot install the addon.
Reporter | ||
Comment 1•6 years ago
|
||
Solution:
3 options:
- Remove the redirect on the server (cuts the time in half)
- Increase the timeout to 15 seconds, only for this addons.json call
- Pre-warm the OSCP by making TLS requests to these 2 servers at the time when we open the dialog. That would fetch the OSCP while the user types, it would be cached, so at the time we call the servers, we already have the OSCP response and the calls are much faster. Additional advantage: The MX and ISPDB requests would also be faster.
We should combine at least 2 of the above solutions, i.e. pre-warm (to reduce user wait time) and increase timeout (to increase reliability).
Reporter | ||
Updated•6 years ago
|
Assignee | ||
Comment 2•6 years ago
|
||
(In reply to Ben Bucksch from comment #1)
- Increase the timeout to 15 seconds, only for this addons.json call
Actually the issue affects all calls, for instance I made a test capture in a new profile and it was the request tohttps://live.thunderbird.net
redirected tohttps://autoconfig.thunderbird.net
that failed (again presumably due to an OCSP timeout, as retrying the request succeeded).
Assignee | ||
Comment 3•6 years ago
|
||
Sorry, that wasn't clear of me, because the request for addons.json
uses the same servers. Sorry about that. I think the request I was looking at was for the ISPdata.
Assignee | ||
Comment 4•6 years ago
|
||
I increased the timeout to 10 seconds for all requests because we now make a lot of requests to various servers as part of autodiscover.
Assignee | ||
Comment 5•6 years ago
|
||
Comment 6•6 years ago
|
||
(In reply to Ben Bucksch (:BenB) from comment #0)
Bug:
However, Thunderbird enabled OSCP by default, for TLS certificate revokation checks. That happens transparently during the TLS setup and makes another server request, to the CA. And apparently, that call is slooooooow.On the same machine that ran hey, the call to https://live.thunderbird.net/autoconfig/addons.json took up to 4532ms on TLS setup alone! Consequently, the redirect calls to https://autoconfig.thunderbird.net/addons.json were cancelled due to the timeout.
I don't really pretend to fully understand OSCP, but are you saying that any https request Thunderbird makes takes an additional 4000ms because of this? That can't be the case -- my locally installed beta doesn't behave this way. Even on a new profile it takes well under 200ms to contact live.thunderbird.net in the account setup process. Is this an outlier? Where is the CA and why does it take so long to connect to it? Does OCSP really make an additional, sequential server request for each https URL? The certificate on live.thunderbird.net and the one on autoconfig are the exact same certificate.
Basically what I'm saying here is that this kind of delay can't possibly be acceptable, normal operation and, if it is actually widespread, should be considered a bug.
autoconfig latency is poor(~800-1100ms) in many regions because it's not behind any CDN for logging purposes(Cloudflare does not allow log access without an Enterprise account), but we may not need real IPs in those logs anymore, so I could put it back behind Cloudflare. I would need to remove the rate limiter first.
Reporter | ||
Comment 7•6 years ago
|
||
Reporter | ||
Updated•6 years ago
|
Reporter | ||
Updated•6 years ago
|
Assignee | ||
Comment 8•6 years ago
|
||
(In reply to Andrei Hajdukewycz from comment #6)
the call to https://live.thunderbird.net/autoconfig/addons.json took up to 4532ms on TLS setup alone! Consequently, the redirect calls to https://autoconfig.thunderbird.net/addons.json were cancelled due to the timeout.
I don't really pretend to fully understand OSCP, but are you saying that any https request Thunderbird makes takes an additional 4000ms because of this? That can't be the case -- my locally installed beta doesn't behave this way. Even on a new profile it takes well under 200ms to contact live.thunderbird.net in the account setup process. Is this an outlier?
All I can tell you is what I observed on my machine while I was doing a lot of testing of new profiles, where the developer tools claim that it's spending that long in TLS setup, and warming up the request significantly cuts down the time taken for the real request.
Reporter | ||
Comment 9•6 years ago
•
|
||
Hey sancus, not every request takes 4000 ms, but some do. For example, I do not see that problem on my machine. The tests were performed on Neil's machine. He saw it in some, but not all requests. Still, it was fairly reproducible for him. He could code and test against these timeouts.
Of course, the OSCP checks are cached for a certain amount of time, so you would only see this in a new profile. That made it harder to discover the bug.
this kind of delay can't possibly be acceptable, normal operation and, if it is actually widespread, should be considered a bug.
I concur, and I am glad you consider it the same.
As to whether this is widespread: I don't know any tools or facilities that would both allow me to query from different DSL connections around the world, and also do it in the same way as Thunderbird does, with OSCP. Commandline tools normally don't bother with this, as evidenced by the good and fast 'hay' results that Neil saw on the same machine where Thunderbird requests are slow.
I can only go by symptoms, and if the various server calls don't give a reliable response, that would explain a number of symptoms that I see in the statistics that were inexplicable so far. But I cannot be certain.
autoconfig latency is poor(~800-1100ms) in many regions
While 1100 ms is certainly not fast, it's still acceptable. I would worry upwards of 2000ms, or outliers above 4000ms, including all redirects commulative.
Assignee | ||
Comment 10•6 years ago
|
||
Reporter | ||
Comment 11•6 years ago
|
||
The pre-warming would help in any case. We already know we will make these calls, and that the OSCP check is necessary, so we can make the check earlier, and shave off 200ms in the fast case, and rescue an otherwise failing call in the worst case. We chose HTTP OPTIONS as method, to avoid messing up the stats, and give less work for the server. But still, that would increase the server calls, so that's why I've asked you. sancus, do you agree with this change?
Reporter | ||
Comment 12•6 years ago
|
||
Comment 13•6 years ago
|
||
(In reply to Ben Bucksch (:BenB) from comment #11)
But still, that would increase the server calls, so that's why I've asked you. sancus, do you agree with this change?
The additional server calls are fine, no problem.
Assignee | ||
Updated•6 years ago
|
Reporter | ||
Updated•6 years ago
|
Comment 14•6 years ago
|
||
Sigh, no HG header, the patch doesn't apply to trunk (I fixed that) and no commit message. What's a good commit message, certainly not the bug summary.
Reporter | ||
Comment 15•6 years ago
|
||
[autoconfig] Increase timeouts and pre-fetch TLS OSCP to avoid ISPDB/addon.json fetch failures. r=BenB
Reporter | ||
Comment 16•6 years ago
|
||
@Neil: Please provide a patch with HG header, this summary, and a longer explanation about the 2 fixes.
Comment 17•6 years ago
|
||
Oh, I was going to land it now, I can do the HG header, the summary, but not the explanation. While you're at it, please rebase to latest trunk, there was a merge failure in emailWizard.js.
Assignee | ||
Comment 18•6 years ago
|
||
Sorry for forgetting the commit message.
Comment 19•6 years ago
|
||
Almost perfect now. The line with the commit message needs the bug number, bug I'll add that, don't worry.
Assignee | ||
Comment 20•6 years ago
|
||
(In reply to Jorg K from comment #19)
Almost perfect now. The line with the commit message needs the bug number, bug I'll add that, don't worry.
Bah, I didn't see this, otherwise I would have fixed my other attachment just now. Sorry about that.
Comment 21•6 years ago
|
||
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/e056b7118c3a
Mitigate first-time HTTPS overhead to avoid ISPDB/addon.json fetch failures. r=BenB
Comment 22•6 years ago
|
||
hg qimport bz:1572418 failed here :-(
You guys cause me head-aches. One says ...
[autoconfig] Increase timeouts and pre-fetch TLS OSCP to avoid ISPDB/addon.json fetch failures. r=BenB
@Neil: Please provide a patch with HG header, this summary, and a longer explanation about the 2 fixes.
... and the other provides this commit message:
Mitigate first-time HTTPS overhead
So I combined the two since most people won't understand the TLS OSCP.
And then the explanation:
- Preflight a couple of URLs that we know we're likely to access for speed
- Increase the timeout on those two URLs for reliability
Why does one sentence say "a couple of" and the other "two". It's the same two you pre-fetch. What's "preflight"? I fixed that explanation as I deemed fit. Sigh, I just noticed that "for speed" had it's counterpart in "for reliability", so I shouldn't have removed it. Too bad.
Assignee | ||
Comment 23•6 years ago
|
||
(In reply to Jorg K from comment #22)
You guys cause me head-aches.
Actually I had a (real) headache at the time, which I'm going to blame for overlooking that Ben said "this summary". Sorry about that.
Assignee | ||
Comment 24•6 years ago
|
||
Updated•6 years ago
|
Comment 25•6 years ago
|
||
TB 69 beta 4:
https://hg.mozilla.org/releases/comm-beta/rev/5477fcc25b929b4d74bc4cb60ed24cccd030bb0e
Reporter | ||
Comment 26•6 years ago
|
||
Assignee | ||
Comment 27•6 years ago
|
||
(In reply to Ben Bucksch from comment #26)
This should uplift only in combination with bug 1571772.
Have you got your bugs mixed up? Bug 1572467 is the one that belongs with bug 1571772.
Comment 28•6 years ago
|
||
Well, I'd be inclined to to throw bug 1572418 (this one here) and bug 1571772 and bug 1572467 into the same lot and uplift all or none.
Reporter | ||
Comment 29•6 years ago
•
|
||
Neil wrote in comment 27:
Have you got your bugs mixed up? Bug 1572467 is the one that belongs with bug 1571772.
No, I really meant bug 1571772 and this bug 1572418 together. The relation is on the high level, not on the technical level.
Jörg wrote in comment 28:
I'd be inclined to to throw bug 1572418 (this one here) and bug 1571772 and bug 1572467 into the same lot and uplift all or none.
Agreed. That's what I meant.
Assignee | ||
Comment 30•6 years ago
|
||
(In reply to Ben Bucksch from comment #29)
Agreed. That's what I meant.
Sorry, I was confused because you made different requests on bug 1572467.
Updated•6 years ago
|
Comment 31•6 years ago
|
||
TB 68.1 ESR:
https://hg.mozilla.org/releases/comm-esr68/rev/1b43445ce5f59d9932741c5abd188f6ece733afe
Updated•6 years ago
|
Comment 32•6 years ago
|
||
TB 60.9 ESR:
https://hg.mozilla.org/releases/comm-esr60/rev/ee3679e4a7199ba12c08a251080575d961b4b420
Reporter | ||
Comment 33•6 years ago
|
||
Thanks, Jörg!
Reporter | ||
Updated•6 years ago
|
Comment 34•5 years ago
•
|
||
There seems something wrong with the logic of this whole adjustment:
The addons.json used to determine if there's a suitable add-on for the protocol (exchange) couldn't be loaded due to a delay in lookup. But that is only (should only) be called after we found a config. And to find a config we ask the ISPDB, which is on same server we just queried to get a configuration in the first place! All network setup should already be done when we go into the next call.
Am I missing something?
(I think increasing the timeout is alright though.)
Reporter | ||
Comment 35•5 years ago
•
|
||
It's true that the addons.json is loaded only after we found a config that has an Exchange server in it. There are currently 2 ways to get an Exchange config: Via Microsoft AutoDiscover protocol, or via ISPDB. The AutoDiscover protocol has been there since last year (you had reviewed that change). This exists to support on-premise Exchange servers. That ISPDB can also return an Exchange config is rather new and was added to support Office365, which has a config in the ISPDB.
We have so many different autoconfig methods, and each need network calls, which are slow, that we start all the network calls in parallel. We query the email domain with our own autoconfig method. We query the ISPDB for the domain. We get the MX host, and query the ISPDB for that MX domain. We do AutoDiscover. Unfortunately, because many networks are configured to DROP packets, some of these queries have to time out before they finally fail, so many of these network calls can take 10s. If you chain them, one after the other, it becomes really slow. But even though the network calls are all started at the same time, we use them in the order that I mentioned, and if a more preferred method succeeds, we abort the rest. This change has been very much needed and dramatically increased the speed of the configuration from what could be 1.5 minutes to 10 seconds. Note that most calls are to the very domain where you host your emails, in fact the very first call is to the domain, and it's always been like that.
I think what happened here is that the AutoDiscover results were quick, because they came from the on-premise Exchange and the company-internal network, but the ISPDB and addons.json calls were slow due to the OSCP lookups, and the corresponding DNS lookups. Even if the ISPDB call happens first, and takes 7 seconds, and the addons.json call starts 1 second later while the ISPDB call is still in flight, the addons.json call would still wait for the same DNS lookups and OSCP queries, and block as well, and timeout as well.
Please don't concentrate on the addons.json call here. That's just how we found it. But the same problem also happened for the ISPDB call. These calls also timed out, even though I have no statistics about it. Given that they happen later than the ISPDB, I would expect that even more of the ISPDB calls timed out, based on the same reasoning that you mentioned, namely that the ISPDB call happens first.
Comment 36•5 years ago
|
||
That the calls to ispdb or addons.json has an explicit timeout at all seems like a bug. It should just be a normal fetch
and results will show once done.
Description
•