Closed Bug 1356611 Opened 5 years ago Closed 5 years ago

necko is spending a lot of time verifying certificates via JoinConnection (http2)

Categories

(Core :: Networking, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: keeler, Unassigned)

Details

Attachments

(1 file)

See bug 1353216 comment 33.
STR:

* Enable http2 and certificate transparency (set security.pki.certificate_transparency.mode to 1)
* Install the profiler add-on installed ( https://perf-html.io )
* Configure it to capture the socket thread (so settings -> threads = socket)
* Load https://tv.yahoo.co.jp/listings/fm/
* ctrl + shift + 2
* Select the socket thread, filter on "verify"

(See e.g. https://perfht.ml/2pixfR6 )

It seems there are many code paths in necko that hit JoinConnection, which is actually a bit expensive since it builds a certificate chain (particularly so if CT is enabled, which involves a number of ECDSA verifications, which are not performant at the moment). Either something is amiss here or we need to add some sort of cache that will cut down on how often this is happening.
I think I can make it so that IsAcceptableHost() is called a lot less - maybe even less than on 54 though 55 coalesces some new cases so I can't promise. I'll do that in this bug.

yahoo is a classic test case for the coalescing code :)

I guess its not really necessary that I understand, but can you help me understand why IsAcceptableHost() needs to do all the crypto dance? (or more likely if we should be doing something different?) We're only calling this on connections with complete handshakes and we're looking to know if another name is covered by the already validated cert. does that require crypto or sct work? maybe this is another optimization that can be made.
Flags: needinfo?(dkeeler)
sorta good news - the new coalescing scheme is actually better here than <= 54 

54: 345 calls
55-2-days-ago: 247
mc today: 189

that's with no cache. still going to add that :)
Since we're asking "is this certificate acceptable for this hostname?" we have to ensure that all of the same hostname-specific checks are performed. The certificate has already validated correctly for TLSServerAuth, so things like key usage and name constraints actually don't have to be checked again, and it would be tempting to just call CheckCertHostname. Unfortunately, this could cause a pinning bypass since pinsets are keyed on hostname. We could just call both CheckCertHostname and re-do the pinset validation given the new hostname, but that requires a full certificate chain, which we don't have (this is a shortcoming we've encountered in many places - maybe it's time to actually keep around the validated certificate chain). However, even if we did have the certificate chain, if the pinset validation fails, what's the right thing to do there? We could just say "no, this host can't join this", but that leaves out the possibility that there is a different certificate chain that does match the pinset (at that point, since this is only a performance hit, maybe it would be best to just say "no, don't join"). Finally, I'm not entirely certain that there aren't other hostname-specific checks I've forgotten about (I almost forgot about the pin checks, to be honest). For instance, we would have to be sure that a host couldn't join a connection where the user had overridden an error (my understanding is necko wouldn't actually know if this was the case or not unless it checked).
Flags: needinfo?(dkeeler)
thanks.. that's interesting and I had thought about the pinset issues afterwards.

The bulk of the savings is going to come from a straight up cache - I'm validating a patch for that right now.. but maybe its still worth finding an optimal path for first check (e.g. is CT relevant other than on first receipt of the cert?)
the only tricky part here is that joinconnection(foo) can fail because the handshake isn't complete and pass after it is.. I don't believe there is anything else in it that should cause a change in result after the handshake, david wdyt?

The easiest thing to do at the necko level is to refuse to join until server bytes have been read - we know that doesn't happen until after the handshake and we know every server sends the h2 preamble immediately.
Comment on attachment 8858394 [details]
Bug 1356611 - per connection cache of JoinConnecton()

https://reviewboard.mozilla.org/r/130336/#review133066

::: netwerk/protocol/http/Http2Session.cpp:4236
(Diff revision 1)
> +    key2.Append(':');
> +    key2.Append('k');
> +    key2.AppendInt(port);
> +    if (!mJoinConnectionCache.Get(key2)) {
> +      mJoinConnectionCache.Put(key2, joinedReturn);
> -  }
> +    }

nit: I'm not sure, but it looks like mozreview is telling me there's a hard tab here (and on the next bracket down).
Attachment #8858394 - Flags: review?(hurley) → review+
David, do you want to try one of these builds and evaluate it?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c95cf3ee535e3a3f8c9eb648ccdd7dbcb3ecac2f

it varies a lot by site (and this is only an issue for h2 based sites).. but some random testing of fb, goog, and y! seemed to reduce the number of joinConnection() calls by ~95%
(In reply to Patrick McManus [:mcmanus] from comment #5)
> the only tricky part here is that joinconnection(foo) can fail because the
> handshake isn't complete and pass after it is.. I don't believe there is
> anything else in it that should cause a change in result after the
> handshake, david wdyt?

Given the code we have now, after the handshake, the JoinConnection cert verification could fail if the certificate has expired in the meantime (as in, it was already about to expire when we verified it the first time), if the user has changed trust settings on the root, if the user gets updated revocation information that indicates the certificate has been revoked, if the user gets updated pinning information (e.g. HPKP with includeSubdomains set, and the joining hostname is a subdomain), etc.. That said, since we've already trusted that certificate and joined it at least once, I'm not sure we need to care about those edge-cases, since the user is already open to attack at that point.

(In reply to Patrick McManus [:mcmanus] from comment #8)
> David, do you want to try one of these builds and evaluate it?
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=c95cf3ee535e3a3f8c9eb648ccdd7dbcb3ecac2f
> 
> it varies a lot by site (and this is only an issue for h2 based sites).. but
> some random testing of fb, goog, and y! seemed to reduce the number of
> joinConnection() calls by ~95%

In testing https://tv.yahoo.co.jp/listings/fm/ I'm seeing a similar improvement in the time spent verifying CT. You might also ask the other folks that investigated bug 1353216 and bug 1356099 (e.g. Ehsan) if they see similar results.
fwiw this patch will definitely only apply to 55. If we wanted something for 54 we could have something conceptually similar but probably a little more complex to apply to the old coalescing logic.
Comment on attachment 8858394 [details]
Bug 1356611 - per connection cache of JoinConnecton()

https://reviewboard.mozilla.org/r/130336/#review133068

::: netwerk/protocol/http/Http2Session.cpp:4236
(Diff revision 1)
> +    key2.Append(':');
> +    key2.Append('k');
> +    key2.AppendInt(port);
> +    if (!mJoinConnectionCache.Get(key2)) {
> +      mJoinConnectionCache.Put(key2, joinedReturn);
> -  }
> +    }

I double checked the patch and the whitespace around there is ok.

I think that's just mozreview diff magic related to the bracing moving around.. its flagging a } as special that previously was part of the removed for(){} and now is part of that if(){}

or something..

diff is hard.
Pushed by mcmanus@ducksong.com:
https://hg.mozilla.org/integration/autoland/rev/b29865f8d36b
per connection cache of JoinConnecton() r=nwgh
thanks
Flags: needinfo?(mcmanus)
Pushed by mcmanus@ducksong.com:
https://hg.mozilla.org/integration/autoland/rev/2d2f8ee3dd5b
per connection cache of JoinConnecton() r=nwgh
https://hg.mozilla.org/mozilla-central/rev/2d2f8ee3dd5b
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
this fix actually moved the "transaction wait time for dispatch metric" by a couple of ms for a lot of transactions from telemetry. thanks david for clueing me in.

http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/369/alerts/?from=2017-04-17&to=2017-04-17
You need to log in before you can comment on or make changes to this bug.