Closed Bug 1572418 Opened 3 months ago Closed 2 months ago

[autoconfig] Exchange config fails due to TLS OSCP requests

Categories

(Thunderbird :: Account Manager, defect)

defect
Not set

Tracking

(thunderbird_esr6069+ fixed, thunderbird_esr6869+ fixed, thunderbird69 fixed, thunderbird70 fixed)

RESOLVED FIXED
Thunderbird 70.0
Tracking Status
thunderbird_esr60 69+ fixed
thunderbird_esr68 69+ fixed
thunderbird69 --- fixed
thunderbird70 --- fixed

People

(Reporter: BenB, Assigned: neil)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 4 obsolete files)

In the automatic account creation dialog:

Situation:
When we find an Exchange config, we fetch addons.json to get the addons that can support Exchange servers in Thunderbird. We request https://live.thunderbird.net/autoconfig/addons.json , which redirects to https://autoconfig.thunderbird.net/addons.json .

In FetchHTTP, we are settings a timeout of 5 seconds for any HTTPS calls that we make during the autoconfiguration. That is because we contact a number of different servers, including servers from the ISP, and they are often badly configured and do not respond in time or at all. To avoid the user waiting a lot time, we set this timeout. An HTTP server that needs more than 5 seconds to respond, I would consider it broken. DNS can add time, but that's usually no more than 1 second for a cold cache. Then we added HTTPS into the mix. TLS setup take a moment, but usually only 200 ms.

There's a utility called hey which can measure the time taken to make HTTP requests. With its default test settings of 200 requests, max 50 simultaneously, the slowest request only took 1.3671 seconds, with 25% of requests actually finishing in half the time. Given that we're using a timeout of 5 second, this should be plenty of time for the request to finish.

Bug:
However, Thunderbird enabled OSCP by default, for TLS certificate revokation checks. That happens transparently during the TLS setup and makes another server request, to the CA. And apparently, that call is slooooooow.

On the same machine that ran hey, the call to https://live.thunderbird.net/autoconfig/addons.json took up to 4532ms on TLS setup alone! Consequently, the redirect calls to https://autoconfig.thunderbird.net/addons.json were cancelled due to the timeout.

Impact:
In the statistics, I am seeing a massive discrepancy between people who fetched addons.json (from the server perspective) and who installed Owl. The discrepancy cannot be explained with people not wanting the addon, it's way too huge. This bug is a possible explanation: The users request addons.json, but never receive it, so they cannot install the addon.

Solution:
3 options:

  1. Remove the redirect on the server (cuts the time in half)
  2. Increase the timeout to 15 seconds, only for this addons.json call
  3. Pre-warm the OSCP by making TLS requests to these 2 servers at the time when we open the dialog. That would fetch the OSCP while the user types, it would be cached, so at the time we call the servers, we already have the OSCP response and the calls are much faster. Additional advantage: The MX and ISPDB requests would also be faster.

We should combine at least 2 of the above solutions, i.e. pre-warm (to reduce user wait time) and increase timeout (to increase reliability).

Assignee: nobody → ben.bucksch
Assignee: ben.bucksch → neil

(In reply to Ben Bucksch from comment #1)

  1. Increase the timeout to 15 seconds, only for this addons.json call
    Actually the issue affects all calls, for instance I made a test capture in a new profile and it was the request to https://live.thunderbird.net redirected to https://autoconfig.thunderbird.net that failed (again presumably due to an OCSP timeout, as retrying the request succeeded).

Sorry, that wasn't clear of me, because the request for addons.json uses the same servers. Sorry about that. I think the request I was looking at was for the ISPdata.

Attached patch Proposed patch (obsolete) — Splinter Review

I increased the timeout to 10 seconds for all requests because we now make a lot of requests to various servers as part of autodiscover.

Attachment #9084019 - Flags: review?(ben.bucksch)
Attached patch Specific requests only (obsolete) — Splinter Review
Attachment #9084129 - Flags: review?(ben.bucksch)

(In reply to Ben Bucksch (:BenB) from comment #0)

Bug:
However, Thunderbird enabled OSCP by default, for TLS certificate revokation checks. That happens transparently during the TLS setup and makes another server request, to the CA. And apparently, that call is slooooooow.

On the same machine that ran hey, the call to https://live.thunderbird.net/autoconfig/addons.json took up to 4532ms on TLS setup alone! Consequently, the redirect calls to https://autoconfig.thunderbird.net/addons.json were cancelled due to the timeout.

I don't really pretend to fully understand OSCP, but are you saying that any https request Thunderbird makes takes an additional 4000ms because of this? That can't be the case -- my locally installed beta doesn't behave this way. Even on a new profile it takes well under 200ms to contact live.thunderbird.net in the account setup process. Is this an outlier? Where is the CA and why does it take so long to connect to it? Does OCSP really make an additional, sequential server request for each https URL? The certificate on live.thunderbird.net and the one on autoconfig are the exact same certificate.

Basically what I'm saying here is that this kind of delay can't possibly be acceptable, normal operation and, if it is actually widespread, should be considered a bug.

autoconfig latency is poor(~800-1100ms) in many regions because it's not behind any CDN for logging purposes(Cloudflare does not allow log access without an Enterprise account), but we may not need real IPs in those logs anymore, so I could put it back behind Cloudflare. I would need to remove the rate limiter first.

Comment on attachment 9084129 [details] [diff] [review]
Specific requests only

Good patch.

Please change HEAD to OPTIONS

r+ BenB with that change
Attachment #9084129 - Flags: review?(ben.bucksch) → review+
Attachment #9084019 - Attachment is obsolete: true
Attachment #9084019 - Flags: review?(ben.bucksch)
Attachment #9084019 - Flags: review-

(In reply to Andrei Hajdukewycz from comment #6)

the call to https://live.thunderbird.net/autoconfig/addons.json took up to 4532ms on TLS setup alone! Consequently, the redirect calls to https://autoconfig.thunderbird.net/addons.json were cancelled due to the timeout.

I don't really pretend to fully understand OSCP, but are you saying that any https request Thunderbird makes takes an additional 4000ms because of this? That can't be the case -- my locally installed beta doesn't behave this way. Even on a new profile it takes well under 200ms to contact live.thunderbird.net in the account setup process. Is this an outlier?

All I can tell you is what I observed on my machine while I was doing a lot of testing of new profiles, where the developer tools claim that it's spending that long in TLS setup, and warming up the request significantly cuts down the time taken for the real request.

Hey sancus, not every request takes 4000 ms, but some do. For example, I do not see that problem on my machine. The tests were performed on Neil's machine. He saw it in some, but not all requests. Still, it was fairly reproducible for him. He could code and test against these timeouts.

Of course, the OSCP checks are cached for a certain amount of time, so you would only see this in a new profile. That made it harder to discover the bug.

this kind of delay can't possibly be acceptable, normal operation and, if it is actually widespread, should be considered a bug.

I concur, and I am glad you consider it the same.

As to whether this is widespread: I don't know any tools or facilities that would both allow me to query from different DSL connections around the world, and also do it in the same way as Thunderbird does, with OSCP. Commandline tools normally don't bother with this, as evidenced by the good and fast 'hay' results that Neil saw on the same machine where Thunderbird requests are slow.

I can only go by symptoms, and if the various server calls don't give a reliable response, that would explain a number of symptoms that I see in the statistics that were inexplicable so far. But I cannot be certain.

autoconfig latency is poor(~800-1100ms) in many regions

While 1100 ms is certainly not fast, it's still acceptable. I would worry upwards of 2000ms, or outliers above 4000ms, including all redirects commulative.

Attached patch Use OPTIONS (obsolete) — Splinter Review
Attachment #9084129 - Attachment is obsolete: true
Attachment #9084282 - Flags: review?(ben.bucksch)

The pre-warming would help in any case. We already know we will make these calls, and that the OSCP check is necessary, so we can make the check earlier, and shave off 200ms in the fast case, and rescue an otherwise failing call in the worst case. We chose HTTP OPTIONS as method, to avoid messing up the stats, and give less work for the server. But still, that would increase the server calls, so that's why I've asked you. sancus, do you agree with this change?

Comment on attachment 9084282 [details] [diff] [review]
Use OPTIONS

r+ BenB

Thanks.
Attachment #9084282 - Flags: review?(ben.bucksch) → review+

(In reply to Ben Bucksch (:BenB) from comment #11)

But still, that would increase the server calls, so that's why I've asked you. sancus, do you agree with this change?

The additional server calls are fine, no problem.

Keywords: checkin-needed
Summary: Autoconfig: Exchange config fails due to TLS OSCP requests → [autoconfig] Exchange config fails due to TLS OSCP requests

Sigh, no HG header, the patch doesn't apply to trunk (I fixed that) and no commit message. What's a good commit message, certainly not the bug summary.

Flags: needinfo?(ben.bucksch)
Keywords: checkin-needed

[autoconfig] Increase timeouts and pre-fetch TLS OSCP to avoid ISPDB/addon.json fetch failures. r=BenB

Flags: needinfo?(ben.bucksch)

@Neil: Please provide a patch with HG header, this summary, and a longer explanation about the 2 fixes.

Attached patch 1572418.diff - rebased (obsolete) — Splinter Review

Oh, I was going to land it now, I can do the HG header, the summary, but not the explanation. While you're at it, please rebase to latest trunk, there was a merge failure in emailWizard.js.

Sorry for forgetting the commit message.

Attachment #9084282 - Attachment is obsolete: true
Attachment #9084987 - Attachment is obsolete: true
Attachment #9085041 - Flags: review+

Almost perfect now. The line with the commit message needs the bug number, bug I'll add that, don't worry.

Keywords: checkin-needed

(In reply to Jorg K from comment #19)

Almost perfect now. The line with the commit message needs the bug number, bug I'll add that, don't worry.

Bah, I didn't see this, otherwise I would have fixed my other attachment just now. Sorry about that.

Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/e056b7118c3a
Mitigate first-time HTTPS overhead to avoid ISPDB/addon.json fetch failures. r=BenB

Status: NEW → RESOLVED
Closed: 2 months ago
Keywords: checkin-needed
Resolution: --- → FIXED

hg qimport bz:1572418 failed here :-(

You guys cause me head-aches. One says ...
[autoconfig] Increase timeouts and pre-fetch TLS OSCP to avoid ISPDB/addon.json fetch failures. r=BenB
@Neil: Please provide a patch with HG header, this summary, and a longer explanation about the 2 fixes.

... and the other provides this commit message:
Mitigate first-time HTTPS overhead

So I combined the two since most people won't understand the TLS OSCP.

And then the explanation:

  • Preflight a couple of URLs that we know we're likely to access for speed
  • Increase the timeout on those two URLs for reliability

Why does one sentence say "a couple of" and the other "two". It's the same two you pre-fetch. What's "preflight"? I fixed that explanation as I deemed fit. Sigh, I just noticed that "for speed" had it's counterpart in "for reliability", so I shouldn't have removed it. Too bad.

Target Milestone: --- → Thunderbird 70.0

(In reply to Jorg K from comment #22)

You guys cause me head-aches.

Actually I had a (real) headache at the time, which I'm going to blame for overlooking that Ben said "this summary". Sorry about that.

Comment on attachment 9085041 [details] [diff] [review]
Rebase and add message and description

[Approval Request Comment]
Regression caused by (bug #): 
User impact if declined: Risk of new user being unable to autoconfigure correctly
Testing completed (on c-c, etc.): 
Risk to taking this patch (and alternatives if risky):
Attachment #9085041 - Flags: approval-comm-beta?
Attachment #9085041 - Flags: approval-comm-beta? → approval-comm-beta+
Comment on attachment 9085041 [details] [diff] [review]
Rebase and add message and description

This should uplift only in combination with bug 1571772.
Attachment #9085041 - Flags: approval-comm-esr68?
Attachment #9085041 - Flags: approval-comm-esr60?

(In reply to Ben Bucksch from comment #26)

This should uplift only in combination with bug 1571772.

Have you got your bugs mixed up? Bug 1572467 is the one that belongs with bug 1571772.

Well, I'd be inclined to to throw bug 1572418 (this one here) and bug 1571772 and bug 1572467 into the same lot and uplift all or none.

Neil wrote in comment 27:

Have you got your bugs mixed up? Bug 1572467 is the one that belongs with bug 1571772.

No, I really meant bug 1571772 and this bug 1572418 together. The relation is on the high level, not on the technical level.

Jörg wrote in comment 28:

I'd be inclined to to throw bug 1572418 (this one here) and bug 1571772 and bug 1572467 into the same lot and uplift all or none.

Agreed. That's what I meant.

(In reply to Ben Bucksch from comment #29)

Agreed. That's what I meant.

Sorry, I was confused because you made different requests on bug 1572467.

Attachment #9085041 - Flags: approval-comm-esr68? → approval-comm-esr68+
Attachment #9085041 - Flags: approval-comm-esr60? → approval-comm-esr60+

Thanks, Jörg!

You need to log in before you can comment on or make changes to this bug.