ssl parallelism to new host restricted to 1

RESOLVED FIXED in Firefox 49

Status

()

Core
Networking: HTTP
RESOLVED FIXED
5 years ago
2 years ago

People

(Reporter: mcmanus, Assigned: mcmanus)

Tracking

Trunk
mozilla49
x86_64
Windows 7
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox49 fixed)

Details

(Whiteboard: [necko-active])

Attachments

(1 attachment, 6 obsolete attachments)

(Assignee)

Description

5 years ago
on first connect to a new ssl host we restrict the parallelism to 1 until the ssl handshake is complete. We do that in order to identify spdy hosts (which won't use the parallelism).

In retrospect that is optimizing for the wrong thing. It is designed to avoid connection terminations but what we really want to avoid is un-necessary parallel active connections. To fix this we can just lift the restrict-to-1 restriction in the case of hosts which aren't known to speak spdy (hosts known to speak that will still have it in place). If we should happen to have more than 1 active spdy connection to the same host then we will just gracefully shut down the extra ones.

In practice I expect any extra connections to carry 0 transactions - because the first one to establish will be able to dispatch the whole queue pretty much immediately due to the muxxing nature of spdy. but HTTP/1 connections will get more parallelism and therefore better latency properties.

There is a significant side effect here. The first SSL connection to a host is also the one that will trigger OCSP (which is afterwards cached) and also establish resumable SSL sessions. So with this change we can expect to see more OCSP and non-resumed handshakes - though it should all be done in parallel and I would expect not actually add latency.

https://tbpl.mozilla.org/?tree=Try&rev=02d84ac9e07f
(Assignee)

Comment 2

5 years ago
Created attachment 741599 [details] [diff] [review]
patch 0

jason r+'d this in person
Attachment #741599 - Flags: review+
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/83a790e5acd8 as the apparent cause of fairly frequent Linux failures in test_spdy.js, like https://tbpl.mozilla.org/php/getParsedLog.php?id=22224595&tree=Mozilla-Inbound (2 in 10 frequent in the retriggers on your push on Linux64 opt, which seems to have been the most failure-prone).
Interestingly enough it also seems, probably, perhaps, I think maybe, to have moved the heinous complex of Android shutdown certificate crashes in bug 761987 and friends away from being shutdown crashes and into being earlier on during test crashes. I triggered a crapload of Android jsreftests on your push in https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=96a806212cac and got several of those in-a-test ones; I triggered a crapload on your parent push in https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=ba1bd4eccd7c and I'm quite looking forward to seeing whether I can manage to not get any at all there.
And indeed, all of the bug 761987 crashes on your parent were during shutdown, so somehow that patch made Android both more likely to crash @nssCertificate_Destroy and made it do it during test runs instead of only at shutdown.
(Assignee)

Comment 7

5 years ago
dougt, did you say there was a promising mismatched allocator patch for 761987? (see comments 5 and 6 in this bug)

I want to repush this patch when I figure out comment 4, but I'd hate to spread out some orange that at least is centralized right now. The patch will give us a latency win in setting up parallel http/1 ssl connections.
Flags: needinfo?(doug.turner)

Comment 8

5 years ago
I was hopeful that it was bug 850332 which landed a week ago.
Flags: needinfo?(doug.turner)
(In reply to Patrick McManus [:mcmanus] from comment #7)
> dougt, did you say there was a promising mismatched allocator patch for
> 761987? (see comments 5 and 6 in this bug)
> 
> I want to repush this patch when I figure out comment 4, but I'd hate to
> spread out some orange that at least is centralized right now. The patch
> will give us a latency win in setting up parallel http/1 ssl connections.

There must be a race condition in the certificate validation logic. I will fix it as soon as I can find it, which I am going to start on today.
(Assignee)

Updated

5 years ago
Depends on: 866867
(Assignee)

Updated

5 years ago
Depends on: 761987
(Assignee)

Comment 10

5 years ago
Created attachment 743233 [details] [diff] [review]
patch 1

updated for whitespace and bitrot
Attachment #741599 - Attachment is obsolete: true
Attachment #743233 - Flags: review+
(Assignee)

Comment 11

5 years ago
Created attachment 8368640 [details] [diff] [review]
ssl parallelism for some cert handshakes limited to 1
(Assignee)

Updated

5 years ago
Attachment #8368640 - Flags: review+
(Assignee)

Updated

5 years ago
Attachment #743233 - Attachment is obsolete: true
(Assignee)

Comment 12

5 years ago
the new patch is just updated for bitrot.

there is a try build over here
https://tbpl.mozilla.org/?tree=Try&rev=9532f191d24e

I triggered ~20 JS1 runs on android 4.. and got 2 free arenalist shutdown crashes from it.

I more or less read that as situation unchanged - what do you think brian?
Flags: needinfo?(brian)
(Assignee)

Updated

5 years ago
Depends on: 963317
Thanks Patrick. We need to fix the underlying issue. I will work with cviecco and keeler to solve it.

FWIW, I suspect the main problem is that the socket transport service thread is not implementing the nsNSSShutDownObject interface, and thus it doesn't know to stop all the SSL threads and avoid starting up new ones when NSS shuts down. If there are any networking team people available to work on that particular part of the problem, please let me know.
Depends on: 934902
Flags: needinfo?(brian)
(Assignee)

Comment 14

5 years ago
i'd be happy to help myself
(In reply to Patrick McManus [:mcmanus] from comment #12)
> I triggered ~20 JS1 runs on android 4.. and got 2 free arenalist shutdown
> crashes from it.
> 
> I more or less read that as situation unchanged - what do you think brian?

Yesterday I ran 775 runs of jreftest J1 and there were 3 such crashes on Android 2.2 opt and 1 such crash on Android 4.0 opt. 4/775 = 5/1000 failures. Also, those failures occur intermittently even without this patch on a different try run. So, I don't think that the crash bugs need to block this bug from landing. What do you think, Patrick?

https://tbpl.mozilla.org/?tree=Try&rev=b70890f16240
No longer depends on: 871575, 934902, 963317
Flags: needinfo?(mcmanus)
(Assignee)

Comment 16

4 years ago
dkeeler landed something a few weeks ago that I wondered at the time if it would help with my experience in comment 12.. perhaps that is the improvement you see in comment 15.

let me think for a bit about what that was.. I don't recall exactly off the top of my head.
Flags: needinfo?(mcmanus)
I think it is bug 967629 that you are referring to.
(Assignee)

Comment 18

4 years ago
(In reply to Brian Smith (:briansmith, was :bsmith; NEEDINFO? for response) from comment #17)
> I think it is bug 967629 that you are referring to.

yes - thank you.

imo - let's go forward with this right after we get a fresh ff31 tree, to avoid a crash landing.
(Assignee)

Comment 19

4 years ago
Created attachment 8507384 [details] [diff] [review]
ssl parallelism to new host should not be 1
Attachment #8507384 - Flags: review?(hurley)
(Assignee)

Comment 20

4 years ago
this fell off my radar accidentally - we really want this patch.

I've un-bit rotted it and improved the logic for quickly activating transactions that are attached to secondary connections when a spdy connection is negotiated.
Comment on attachment 8507384 [details] [diff] [review]
ssl parallelism to new host should not be 1

Review of attachment 8507384 [details] [diff] [review]:
-----------------------------------------------------------------

This LGTM, but let's make sure we get a whole bunch of linux xpcshell runs before landing to make sure this isn't still susceptible to the intermittent orange that philor saw on the original patch (comment 4). I don't *think* it will be, but better safe than sorry.
Attachment #8507384 - Flags: review?(hurley) → review+
(Assignee)

Comment 23

4 years ago
try is complaining about an unused rv (debug only rv) on opt anyhow.. will respin and rerun tests.
(Assignee)

Comment 27

4 years ago
comments 25 and 26 are the checkin and backout of the patch from comment 21.

the version from comment 24 should have been commited - the only difference is a DebugOnly declaration that is needed to avoid breaking opt builds that error on unused variables.
https://hg.mozilla.org/mozilla-central/rev/06acd829f970
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Backed out for causing topcrashes. I'm also respinning nightlies.

https://hg.mozilla.org/mozilla-central/rev/da125623d9cb
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla36 → ---
Depends on: 1089638
Depends on: 1089749

Updated

4 years ago
Depends on: 1089766
Version: 18 Branch → Trunk
Depends on: 1090238
(Assignee)

Comment 31

4 years ago
Created attachment 8528591 [details] [diff] [review]
ssl parallelism to new host should not be 1
Attachment #8528591 - Flags: review?(hurley)
(Assignee)

Updated

4 years ago
Attachment #8507384 - Attachment is obsolete: true
(Assignee)

Updated

4 years ago
Attachment #8368640 - Attachment is obsolete: true
(Assignee)

Comment 32

4 years ago
Created attachment 8528594 [details] [diff] [review]
interdiff

this is the interdiff from the previously reviewed version to this one. fyi
(Assignee)

Comment 33

4 years ago
The previous version had code that dealt with what happens when we get a confirmed spdy/h2 session and still have other connections pending. It tried to migrate the transactions on those half open connections onto the new one.

That was too clever by half. The transactions are actually listed twice - once as a weak ref on the half open connection and once in the pending queue for the origin connection entry. Because the patch only looked at the former it allowed for the transaction to be executed twice.

I say it was too clever, because all that needed to be done was to Abandon() the half open connection. Because the transaction is still in the pending queue it will automatically be assigned to the new connection the next time the queues are considered (which happens at the end of the function).

The previous patch also had the same problem as I fixed in 1104993 (which led to me filing that bug).
Attachment #8528591 - Flags: review?(hurley) → review+
(Assignee)

Comment 34

4 years ago
given that this patch has proven tricky to land I'm going to wait to do so until gecko-37 is created.
(Assignee)

Updated

4 years ago
Target Milestone: --- → mozilla37
https://hg.mozilla.org/mozilla-central/rev/61ee2e053920
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago4 years ago
Resolution: --- → FIXED
(Assignee)

Comment 37

4 years ago
this caused crashes (bug 1069838) so I'll backout while its investigated.

m-i is closed right now - so the backout is pending.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 39

4 years ago
comment 39 is the backout of the comment 35 push
Keywords: leave-open
(Assignee)

Updated

4 years ago
Depends on: 1073825
(Assignee)

Updated

4 years ago
Depends on: 1111217
(Assignee)

Updated

4 years ago
Keywords: leave-open
(Assignee)

Updated

2 years ago
Whiteboard: [necko-active]
(Assignee)

Comment 42

2 years ago
failures from dec 14 include (comment 37 had a typo)

https://bugzilla.mozilla.org/show_bug.cgi?id=1089638
which was probably a dup of
https://bugzilla.mozilla.org/show_bug.cgi?id=1111217
which in turn was fixed by cset
https://hg.mozilla.org/integration/mozilla-inbound/rev/511d0d709406

none of that is directly caused by 865314 but because cancel's of h2 spiked with this patch, the above mentioned crashes spiked too. Indeed you can find > 1000 crashes related to 1111217 in the last 4 weeks of crash stats data, but they are pretty much limited to esr 31 (predates the fix).

So nothing should prevent landing this patch, except brad lassey ran a try build of 865314 plus 1111217 and experienced two crashes that had useless stacks. Now useless stacks have not really been a signature part of this bug, so I suspect that perhaps it was more indicative of something about try (or an ephemeral unrelated issue).
(Assignee)

Comment 43

2 years ago
hey brad, in the past you've had a knack for making this patch crash. I feel like its ready to reland but we still have an unresolved issue from 1111217 where you had a couple crashes with useless stacks that weren't explained. Enough water under the bridge to retest I think. Could you run it for a day and see if its stable for you?

there is a try build in comment 41

the try builds sets network.http.connection-retry-timeout to 1 just to make sure to exercise some code on the try tests. Feel free to set it back to 250 (which is how it will land) or just enjoy the background connections!
Flags: needinfo?(blassey.bugs)
No crashes, but the last two mornings this build has been completely hung. Unfortunately I don't have symbols so I can't get a stack for where we're hung.
Flags: needinfo?(blassey.bugs)
(Assignee)

Comment 45

2 years ago
Created attachment 8751206 [details] [diff] [review]
allow 6 parallel connects to new https origin
(Assignee)

Updated

2 years ago
Attachment #8751206 - Flags: review+
(Assignee)

Updated

2 years ago
Attachment #8528594 - Attachment is obsolete: true
(Assignee)

Updated

2 years ago
Attachment #8528591 - Attachment is obsolete: true
(Assignee)

Updated

2 years ago
Target Milestone: mozilla37 → mozilla49

Comment 47

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/b1e8d6cf54e6
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago2 years ago
status-firefox49: --- → fixed
Resolution: --- → FIXED
(Assignee)

Updated

2 years ago
Depends on: 1275917
You need to log in before you can comment on or make changes to this bug.