1356611 - necko is spending a lot of time verifying certificates via JoinConnection (http2)

Dana Keeler (she/her) (use needinfo) [:keeler] (on leave)

Reporter

Description

•

7 years ago

See bug 1353216 comment 33.
STR:

* Enable http2 and certificate transparency (set security.pki.certificate_transparency.mode to 1)
* Install the profiler add-on installed ( https://perf-html.io )
* Configure it to capture the socket thread (so settings -> threads = socket)
* Load https://tv.yahoo.co.jp/listings/fm/
* ctrl + shift + 2
* Select the socket thread, filter on "verify"

(See e.g. https://perfht.ml/2pixfR6 )

It seems there are many code paths in necko that hit JoinConnection, which is actually a bit expensive since it builds a certificate chain (particularly so if CT is enabled, which involves a number of ECDSA verifications, which are not performant at the moment). Either something is amiss here or we need to add some sort of cache that will cut down on how often this is happening.

Patrick McManus [:mcmanus]

Comment 1

•

7 years ago

I think I can make it so that IsAcceptableHost() is called a lot less - maybe even less than on 54 though 55 coalesces some new cases so I can't promise. I'll do that in this bug.

yahoo is a classic test case for the coalescing code :)

I guess its not really necessary that I understand, but can you help me understand why IsAcceptableHost() needs to do all the crypto dance? (or more likely if we should be doing something different?) We're only calling this on connections with complete handshakes and we're looking to know if another name is covered by the already validated cert. does that require crypto or sct work? maybe this is another optimization that can be made.

Flags: needinfo?(dkeeler)

Patrick McManus [:mcmanus]

Comment 2

•

7 years ago

sorta good news - the new coalescing scheme is actually better here than <= 54 

54: 345 calls
55-2-days-ago: 247
mc today: 189

that's with no cache. still going to add that :)

Dana Keeler (she/her) (use needinfo) [:keeler] (on leave)

Reporter

Comment 3

•

7 years ago

Since we're asking "is this certificate acceptable for this hostname?" we have to ensure that all of the same hostname-specific checks are performed. The certificate has already validated correctly for TLSServerAuth, so things like key usage and name constraints actually don't have to be checked again, and it would be tempting to just call CheckCertHostname. Unfortunately, this could cause a pinning bypass since pinsets are keyed on hostname. We could just call both CheckCertHostname and re-do the pinset validation given the new hostname, but that requires a full certificate chain, which we don't have (this is a shortcoming we've encountered in many places - maybe it's time to actually keep around the validated certificate chain). However, even if we did have the certificate chain, if the pinset validation fails, what's the right thing to do there? We could just say "no, this host can't join this", but that leaves out the possibility that there is a different certificate chain that does match the pinset (at that point, since this is only a performance hit, maybe it would be best to just say "no, don't join"). Finally, I'm not entirely certain that there aren't other hostname-specific checks I've forgotten about (I almost forgot about the pin checks, to be honest). For instance, we would have to be sure that a host couldn't join a connection where the user had overridden an error (my understanding is necko wouldn't actually know if this was the case or not unless it checked).

Flags: needinfo?(dkeeler)

Patrick McManus [:mcmanus]

Comment 4

•

7 years ago

thanks.. that's interesting and I had thought about the pinset issues afterwards.

The bulk of the savings is going to come from a straight up cache - I'm validating a patch for that right now.. but maybe its still worth finding an optimal path for first check (e.g. is CT relevant other than on first receipt of the cert?)

Patrick McManus [:mcmanus]

Comment 5

•

7 years ago

the only tricky part here is that joinconnection(foo) can fail because the handshake isn't complete and pass after it is.. I don't believe there is anything else in it that should cause a change in result after the handshake, david wdyt?

The easiest thing to do at the necko level is to refuse to join until server bytes have been read - we know that doesn't happen until after the handshake and we know every server sends the h2 preamble immediately.

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/130336/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/130336/

u408661

Comment 7

•

7 years ago

mozreview-review

Comment on attachment 8858394 [details]
Bug 1356611 - per connection cache of JoinConnecton()

https://reviewboard.mozilla.org/r/130336/#review133066

::: netwerk/protocol/http/Http2Session.cpp:4236
(Diff revision 1)
> +    key2.Append(':');
> +    key2.Append('k');
> +    key2.AppendInt(port);
> +    if (!mJoinConnectionCache.Get(key2)) {
> +      mJoinConnectionCache.Put(key2, joinedReturn);
> -  }
> +    }

nit: I'm not sure, but it looks like mozreview is telling me there's a hard tab here (and on the next bracket down).

Attachment #8858394 - Flags: review?(hurley) → review+

Patrick McManus [:mcmanus]

Comment 8

•

7 years ago

David, do you want to try one of these builds and evaluate it?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c95cf3ee535e3a3f8c9eb648ccdd7dbcb3ecac2f

it varies a lot by site (and this is only an issue for h2 based sites).. but some random testing of fb, goog, and y! seemed to reduce the number of joinConnection() calls by ~95%

Dana Keeler (she/her) (use needinfo) [:keeler] (on leave)

Reporter

Comment 9

•

7 years ago

(In reply to Patrick McManus [:mcmanus] from comment #5)
> the only tricky part here is that joinconnection(foo) can fail because the
> handshake isn't complete and pass after it is.. I don't believe there is
> anything else in it that should cause a change in result after the
> handshake, david wdyt?

Given the code we have now, after the handshake, the JoinConnection cert verification could fail if the certificate has expired in the meantime (as in, it was already about to expire when we verified it the first time), if the user has changed trust settings on the root, if the user gets updated revocation information that indicates the certificate has been revoked, if the user gets updated pinning information (e.g. HPKP with includeSubdomains set, and the joining hostname is a subdomain), etc.. That said, since we've already trusted that certificate and joined it at least once, I'm not sure we need to care about those edge-cases, since the user is already open to attack at that point.

(In reply to Patrick McManus [:mcmanus] from comment #8)
> David, do you want to try one of these builds and evaluate it?
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=c95cf3ee535e3a3f8c9eb648ccdd7dbcb3ecac2f
> 
> it varies a lot by site (and this is only an issue for h2 based sites).. but
> some random testing of fb, goog, and y! seemed to reduce the number of
> joinConnection() calls by ~95%

In testing https://tv.yahoo.co.jp/listings/fm/ I'm seeing a similar improvement in the time spent verifying CT. You might also ask the other folks that investigated bug 1353216 and bug 1356099 (e.g. Ehsan) if they see similar results.

Patrick McManus [:mcmanus]

Comment 10

•

7 years ago

fwiw this patch will definitely only apply to 55. If we wanted something for 54 we could have something conceptually similar but probably a little more complex to apply to the old coalescing logic.

Patrick McManus [:mcmanus]

Comment 11

•

7 years ago

mozreview-review

Comment on attachment 8858394 [details]
Bug 1356611 - per connection cache of JoinConnecton()

https://reviewboard.mozilla.org/r/130336/#review133068

::: netwerk/protocol/http/Http2Session.cpp:4236
(Diff revision 1)
> +    key2.Append(':');
> +    key2.Append('k');
> +    key2.AppendInt(port);
> +    if (!mJoinConnectionCache.Get(key2)) {
> +      mJoinConnectionCache.Put(key2, joinedReturn);
> -  }
> +    }

I double checked the patch and the whitespace around there is ok.

I think that's just mozreview diff magic related to the bracing moving around.. its flagging a } as special that previously was part of the removed for(){} and now is part of that if(){}

or something..

diff is hard.

Pulsebot

Comment 12

•

7 years ago

Pushed by mcmanus@ducksong.com:
https://hg.mozilla.org/integration/autoland/rev/b29865f8d36b
per connection cache of JoinConnecton() r=nwgh

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 13

•

7 years ago

Backed out for frequent windows xpcshell failures like https://treeherder.mozilla.org/logviewer.html#?job_id=91848418&repo=autoland

https://hg.mozilla.org/integration/autoland/rev/292f19781579

Flags: needinfo?(mcmanus)

Comment hidden (mozreview-request)

Comment on attachment 8858394 [details]
Bug 1356611 - per connection cache of JoinConnecton()

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/130336/diff/1-2/

Comment hidden (mozreview-request)

Comment on attachment 8858394 [details]
Bug 1356611 - per connection cache of JoinConnecton()

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/130336/diff/2-3/

Patrick McManus [:mcmanus]

Comment 16

•

7 years ago

thanks

Flags: needinfo?(mcmanus)

Patrick McManus [:mcmanus]

Comment 17

•

7 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=162a12108bb9d988650f1aa9813416ae0ea20834

Pulsebot

Comment 18

•

7 years ago

Pushed by mcmanus@ducksong.com:
https://hg.mozilla.org/integration/autoland/rev/2d2f8ee3dd5b
per connection cache of JoinConnecton() r=nwgh

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 19

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/2d2f8ee3dd5b

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox55: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla55

Patrick McManus [:mcmanus]

Comment 20

•

7 years ago

this fix actually moved the "transaction wait time for dispatch metric" by a couple of ms for a lot of transactions from telemetry. thanks david for clueing me in.

http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/369/alerts/?from=2017-04-17&to=2017-04-17

Dana Keeler (she/her) (use needinfo) [:keeler] (on leave)

Reporter

Comment 21

•

7 years ago

Cool!