Closed Bug 865314 Opened 11 years ago Closed 8 years ago

ssl parallelism to new host restricted to 1

Categories

(Core :: Networking: HTTP, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox49 --- fixed

People

(Reporter: mcmanus, Assigned: mcmanus)

References

Details

(Whiteboard: [necko-active])

Attachments

(1 file, 6 obsolete files)

on first connect to a new ssl host we restrict the parallelism to 1 until the ssl handshake is complete. We do that in order to identify spdy hosts (which won't use the parallelism).

In retrospect that is optimizing for the wrong thing. It is designed to avoid connection terminations but what we really want to avoid is un-necessary parallel active connections. To fix this we can just lift the restrict-to-1 restriction in the case of hosts which aren't known to speak spdy (hosts known to speak that will still have it in place). If we should happen to have more than 1 active spdy connection to the same host then we will just gracefully shut down the extra ones.

In practice I expect any extra connections to carry 0 transactions - because the first one to establish will be able to dispatch the whole queue pretty much immediately due to the muxxing nature of spdy. but HTTP/1 connections will get more parallelism and therefore better latency properties.

There is a significant side effect here. The first SSL connection to a host is also the one that will trigger OCSP (which is afterwards cached) and also establish resumable SSL sessions. So with this change we can expect to see more OCSP and non-resumed handshakes - though it should all be done in parallel and I would expect not actually add latency.

https://tbpl.mozilla.org/?tree=Try&rev=02d84ac9e07f
Attached patch patch 0 (obsolete) — Splinter Review
jason r+'d this in person
Attachment #741599 - Flags: review+
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/83a790e5acd8 as the apparent cause of fairly frequent Linux failures in test_spdy.js, like https://tbpl.mozilla.org/php/getParsedLog.php?id=22224595&tree=Mozilla-Inbound (2 in 10 frequent in the retriggers on your push on Linux64 opt, which seems to have been the most failure-prone).
Interestingly enough it also seems, probably, perhaps, I think maybe, to have moved the heinous complex of Android shutdown certificate crashes in bug 761987 and friends away from being shutdown crashes and into being earlier on during test crashes. I triggered a crapload of Android jsreftests on your push in https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=96a806212cac and got several of those in-a-test ones; I triggered a crapload on your parent push in https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=ba1bd4eccd7c and I'm quite looking forward to seeing whether I can manage to not get any at all there.
And indeed, all of the bug 761987 crashes on your parent were during shutdown, so somehow that patch made Android both more likely to crash @nssCertificate_Destroy and made it do it during test runs instead of only at shutdown.
dougt, did you say there was a promising mismatched allocator patch for 761987? (see comments 5 and 6 in this bug)

I want to repush this patch when I figure out comment 4, but I'd hate to spread out some orange that at least is centralized right now. The patch will give us a latency win in setting up parallel http/1 ssl connections.
Flags: needinfo?(doug.turner)
I was hopeful that it was bug 850332 which landed a week ago.
Flags: needinfo?(doug.turner)
(In reply to Patrick McManus [:mcmanus] from comment #7)
> dougt, did you say there was a promising mismatched allocator patch for
> 761987? (see comments 5 and 6 in this bug)
> 
> I want to repush this patch when I figure out comment 4, but I'd hate to
> spread out some orange that at least is centralized right now. The patch
> will give us a latency win in setting up parallel http/1 ssl connections.

There must be a race condition in the certificate validation logic. I will fix it as soon as I can find it, which I am going to start on today.
Depends on: 866867
Depends on: 761987
Attached patch patch 1 (obsolete) — Splinter Review
updated for whitespace and bitrot
Attachment #741599 - Attachment is obsolete: true
Attachment #743233 - Flags: review+
Attachment #8368640 - Flags: review+
Attachment #743233 - Attachment is obsolete: true
the new patch is just updated for bitrot.

there is a try build over here
https://tbpl.mozilla.org/?tree=Try&rev=9532f191d24e

I triggered ~20 JS1 runs on android 4.. and got 2 free arenalist shutdown crashes from it.

I more or less read that as situation unchanged - what do you think brian?
Flags: needinfo?(brian)
Depends on: 963317
Thanks Patrick. We need to fix the underlying issue. I will work with cviecco and keeler to solve it.

FWIW, I suspect the main problem is that the socket transport service thread is not implementing the nsNSSShutDownObject interface, and thus it doesn't know to stop all the SSL threads and avoid starting up new ones when NSS shuts down. If there are any networking team people available to work on that particular part of the problem, please let me know.
Depends on: 934902
Flags: needinfo?(brian)
i'd be happy to help myself
(In reply to Patrick McManus [:mcmanus] from comment #12)
> I triggered ~20 JS1 runs on android 4.. and got 2 free arenalist shutdown
> crashes from it.
> 
> I more or less read that as situation unchanged - what do you think brian?

Yesterday I ran 775 runs of jreftest J1 and there were 3 such crashes on Android 2.2 opt and 1 such crash on Android 4.0 opt. 4/775 = 5/1000 failures. Also, those failures occur intermittently even without this patch on a different try run. So, I don't think that the crash bugs need to block this bug from landing. What do you think, Patrick?

https://tbpl.mozilla.org/?tree=Try&rev=b70890f16240
No longer depends on: 871575, 934902, 963317
Flags: needinfo?(mcmanus)
dkeeler landed something a few weeks ago that I wondered at the time if it would help with my experience in comment 12.. perhaps that is the improvement you see in comment 15.

let me think for a bit about what that was.. I don't recall exactly off the top of my head.
Flags: needinfo?(mcmanus)
I think it is bug 967629 that you are referring to.
(In reply to Brian Smith (:briansmith, was :bsmith; NEEDINFO? for response) from comment #17)
> I think it is bug 967629 that you are referring to.

yes - thank you.

imo - let's go forward with this right after we get a fresh ff31 tree, to avoid a crash landing.
Attachment #8507384 - Flags: review?(hurley)
this fell off my radar accidentally - we really want this patch.

I've un-bit rotted it and improved the logic for quickly activating transactions that are attached to secondary connections when a spdy connection is negotiated.
Comment on attachment 8507384 [details] [diff] [review]
ssl parallelism to new host should not be 1

Review of attachment 8507384 [details] [diff] [review]:
-----------------------------------------------------------------

This LGTM, but let's make sure we get a whole bunch of linux xpcshell runs before landing to make sure this isn't still susceptible to the intermittent orange that philor saw on the original patch (comment 4). I don't *think* it will be, but better safe than sorry.
Attachment #8507384 - Flags: review?(hurley) → review+
try is complaining about an unused rv (debug only rv) on opt anyhow.. will respin and rerun tests.
comments 25 and 26 are the checkin and backout of the patch from comment 21.

the version from comment 24 should have been commited - the only difference is a DebugOnly declaration that is needed to avoid breaking opt builds that error on unused variables.
https://hg.mozilla.org/mozilla-central/rev/06acd829f970
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Backed out for causing topcrashes. I'm also respinning nightlies.

https://hg.mozilla.org/mozilla-central/rev/da125623d9cb
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla36 → ---
Depends on: 1089638
Depends on: 1089749
Depends on: 1089766
Version: 18 Branch → Trunk
Depends on: 1090238
Attachment #8528591 - Flags: review?(hurley)
Attachment #8507384 - Attachment is obsolete: true
Attachment #8368640 - Attachment is obsolete: true
Attached patch interdiff (obsolete) — Splinter Review
this is the interdiff from the previously reviewed version to this one. fyi
The previous version had code that dealt with what happens when we get a confirmed spdy/h2 session and still have other connections pending. It tried to migrate the transactions on those half open connections onto the new one.

That was too clever by half. The transactions are actually listed twice - once as a weak ref on the half open connection and once in the pending queue for the origin connection entry. Because the patch only looked at the former it allowed for the transaction to be executed twice.

I say it was too clever, because all that needed to be done was to Abandon() the half open connection. Because the transaction is still in the pending queue it will automatically be assigned to the new connection the next time the queues are considered (which happens at the end of the function).

The previous patch also had the same problem as I fixed in 1104993 (which led to me filing that bug).
Attachment #8528591 - Flags: review?(hurley) → review+
given that this patch has proven tricky to land I'm going to wait to do so until gecko-37 is created.
Target Milestone: --- → mozilla37
https://hg.mozilla.org/mozilla-central/rev/61ee2e053920
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
this caused crashes (bug 1069838) so I'll backout while its investigated.

m-i is closed right now - so the backout is pending.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
comment 39 is the backout of the comment 35 push
Keywords: leave-open
Depends on: 1073825
Depends on: 1111217
Keywords: leave-open
Whiteboard: [necko-active]
failures from dec 14 include (comment 37 had a typo)

https://bugzilla.mozilla.org/show_bug.cgi?id=1089638
which was probably a dup of
https://bugzilla.mozilla.org/show_bug.cgi?id=1111217
which in turn was fixed by cset
https://hg.mozilla.org/integration/mozilla-inbound/rev/511d0d709406

none of that is directly caused by 865314 but because cancel's of h2 spiked with this patch, the above mentioned crashes spiked too. Indeed you can find > 1000 crashes related to 1111217 in the last 4 weeks of crash stats data, but they are pretty much limited to esr 31 (predates the fix).

So nothing should prevent landing this patch, except brad lassey ran a try build of 865314 plus 1111217 and experienced two crashes that had useless stacks. Now useless stacks have not really been a signature part of this bug, so I suspect that perhaps it was more indicative of something about try (or an ephemeral unrelated issue).
hey brad, in the past you've had a knack for making this patch crash. I feel like its ready to reland but we still have an unresolved issue from 1111217 where you had a couple crashes with useless stacks that weren't explained. Enough water under the bridge to retest I think. Could you run it for a day and see if its stable for you?

there is a try build in comment 41

the try builds sets network.http.connection-retry-timeout to 1 just to make sure to exercise some code on the try tests. Feel free to set it back to 250 (which is how it will land) or just enjoy the background connections!
Flags: needinfo?(blassey.bugs)
No crashes, but the last two mornings this build has been completely hung. Unfortunately I don't have symbols so I can't get a stack for where we're hung.
Flags: needinfo?(blassey.bugs)
Attachment #8751206 - Flags: review+
Attachment #8528594 - Attachment is obsolete: true
Attachment #8528591 - Attachment is obsolete: true
Target Milestone: mozilla37 → mozilla49
https://hg.mozilla.org/mozilla-central/rev/b1e8d6cf54e6
Status: REOPENED → RESOLVED
Closed: 10 years ago8 years ago
Resolution: --- → FIXED
Depends on: 1275917
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: