Closed Bug 511393 Opened 10 years ago Closed 8 years ago

Use multiple SSL worker threads (nsSSLThread)

Categories

(Core :: Security: PSM, defect)

defect
Not set

Tracking

()

RESOLVED WONTFIX

People

(Reporter: KaiE, Assigned: briansmith)

References

Details

Attachments

(1 file, 1 obsolete file)

Today I wrote bug 508633 comment 2.

This reminded me that I had intended to reduce this bottleneck.
I had already implemented a patch 6 months ago in bug 390036 comment 87 then forgot about it, but maybe the timing wasn't right to land it.

Now the patch has bitrotted and needs to be merged to trunk.
We should do so soon, now that we are far away from the next big release.
Attached patch Patch v1 (obsolete) — Splinter Review
Good news, the patch did NOT bitrot, it still applies cleanly!

Anyone interested to review it?

Should I produce a tryserver build so that people can test it?
I brought the patch to a state where it worked for me in February, but still, the patch is new and risky.
Comment on attachment 395305 [details] [diff] [review]
Patch v1

I've been a bit too enthusiastic, applies, but doesn't build...
Attachment #395305 - Attachment is obsolete: true
Attached patch Patch v2Splinter Review
Merged to trunk. Manual testing looked ok.

Patch contains two fprintf statements for debugging, which will obviouosly removed before a final patch.

This patch uses multiple worker threads, each using its own pollable event.
Attachment #395343 - Flags: review?
I'll with love take a look at this :)
Attachment #395343 - Flags: review? → review?(honzab.moz)
Comment on attachment 395343 [details] [diff] [review]
Patch v2

Thanks Honza!

Yesterday I ran the full mochitest text suite. I got 62 test-unexpected-fail, but none of them was related to https. All tests that used https worked.
Comment on attachment 395343 [details] [diff] [review]
Patch v2

After few hours spent I think I really understand this code. For now I have to give r- for the following reasons:

- You are initiating all threads with a common mutex, the nsSSLIOLayerHelpers::mutex. In requestRead and requestWrite methods checkHandshake method may get called and it happens under this lock. checkHandshake may re-enter it. I strongly suggest to use own lock held in nsSSLThreadControl and in thread implementation use just this->mMutex then.

- You often have a block kept under the lock of the thread (this->mMutex) and then other block, immediately after it, that is kept under the nsSSLIOLayerHelpers::mutex, that is the same as this->mMutex. Somewhere you close/open/close the same lock w/o an obvious need to do that (nsSSLThread.cpp:618 after patch applied for instance)

- In branches of condition 'if (!this_socket_is_busy && si->HandshakeTimeout())' you do 't->restoreOriginalSocket_locked()' where t is at that moment obviously null. It would crash. You have to search for the thread the socket belongs to. However, I'm not sure it belongs to any at that moment and you really have to restore the socket, it should not be switched to pollable event at this point, or should it?

- You should local copy and then null out the nsSSLThreadControl::mThreads array under the common lock before destruction in nsSSLThreadControl::Stop() to avoid assigning and working with a thread just before it's going to be freed from memory. Just a precaution. I base this suggestion on fact that access to mThreads is always under the lock.

Then some cosmetics:

- s/scheduled_to_destroyed/scheduled_to_destroy/
- nsSSLThreadControl should have all definitions together, at the top or at the bottom, if you don't have a special reason to keep this way
Attachment #395343 - Flags: review?(honzab.moz) → review-
Comment on attachment 395343 [details] [diff] [review]
Patch v2

And one more detail, but not probably blocking this bug (might be a follow up?): turn the static arrays of events, sockets keeping events etc to class members of nsSSLThread, let it be managed by the thread instance.
Thanks for your thorough review. I will address your comments separately.

(In reply to comment #6)
> 
> - You are initiating all threads with a common mutex, the
> nsSSLIOLayerHelpers::mutex. In requestRead and requestWrite methods
> checkHandshake method may get called and it happens under this lock.
> checkHandshake may re-enter it. I strongly suggest to use own lock held in
> nsSSLThreadControl and in thread implementation use just this->mMutex then.

Thanks for catching this. I agree, checkHandshake should not get called with the lock held. Although I don't understand why you think it can be re-entered, checkHandshake may bring up UI etc., and we really should call it while not holding a lock.

But I think, maybe we don't need a separate lock?

In all 3 scenarios, we immediately exit the function after calling checkHandshake. I think it is sufficient to unlock prior to the call to checkHandshake, do you agree?
(In reply to comment #8)
> In all 3 scenarios, we immediately exit the function after calling
> checkHandshake. I think it is sufficient to unlock prior to the call to
> checkHandshake, do you agree?

Yes, I agree. I checked that all other objects' methods we access in checkHandshake are called just from there or unlocked from other places. So it could work. However, remember, that the change with lock you do might bring hidden problems. We have to test carefully.
And to answer why it is possible to reenter the lock, nsSSLThread::checkHandshake calls nsSSLIOLayerHelpers::rememberPossibleTLSProblemSite that calls nsSSLIOLayerHelpers::addIntolerantSite that locks the helper lock.
And we should figure out bug 480878 before we make these changes.
And probably also bug 532482.
Blocks: 508633
The only motivation for multiple SSL worker threads is the blocking nature of OCSP and/or CRL and/or AIA fetching, right?

If so, why not use the non-blocking libpkix API instead of using multiple threads?
I talked to Kai on IRC about an alternate approach that is also different from the one I suggested in comment 13. We will discuss it on the phone on Thursday morning before the normal NSS teleconference to see if we agree that this approach will or not work. If it won't work, then we will pribably fix this by implementing the multiple SSL worker threads instead.

The new idea is:

* Remove nsSSLThread completely.
* Change the signature of AuthCertificateCallback to match SSLBadCertHandler and rename it to StartCertAuth.
* Write a new cert auth hook SaveParametersForCertAuth that saves its parameters and unconditionally returns SECFailure.
* Replace "SSL_AuthCertificateHook(sslSock, AuthCertificateCallback, 0)" 
     with "SSL_AuthCertificateHook(sslSock, SaveParametersForCertAuth, 0)."
* Replace "SSL_BadCertHook(fd, nsNSSBadCertHandler, infoObject)"
     with "SSL_BadCertHook(fd, StartCertAuth, infoObject)."
* Move all the code in StartCertAuth after the call to PSM_SSL_PKIX_AuthCertificate to a new function AfterCertAuth.
* Change (or overload) nsCertVerificationThread::addJob() to return a pollable event.
* Change nsCertVerificationThread to maintain a pollable event that is set when certificate verification is finished; make addJob return the FD of the pollable event.
* Replace the call to PSM_SSL_PKIX_AuthCertificate with a call to nsCertVerificationThread::addJob(), change the file descriptor for the SSL socket to be the FD for the pollable event, and return SECWouldBlock.
* (In a follow-up bug) change nsCertVerificationThread so that it can verify multiple certificate chains in parallel.

The above steps will cause StartCertAuth to be called unconditionally during certificate validation. StartCertAuth returning SECWouldBlock will cause the SSL handshake to be paused in a non-blocking manner so that the main thread can continue working while certificate verification takes place. 

(Note that BadCertHook can return SECWouldBlock but the AuthCertificateHook can't, which is why the above steps are kind of roundabout; a clearer alternative would be to change libssl to handle the AuthCertificateHook returning SECWouldBlock in a way similar to the way it handles the BadCertHook returning SECWouldBlock. I propose that instead we just keep all the changes for this bug internal to PSM so we don't need to change NSS for this.)

Replacing the SSL socket's FD with the pollable event will result in the SSL transport being restarted up when verification is complete. When the SSL transport is woken up, it must always check to see if it was woken up because of an event on its SSL FD or because of an event on the certificate verification pollable event. If the later, it must switch its FD back to the SSL FD, call AfterCertAuth, and then call SSL_RestartHandshakeAfterServerCert to resume the SSL handshake.

The result of all of this would be that SSL traffic on one connection would no longer block SSL traffic on other connections.
Summary: Allow multiple SSL worker threads → De-serialize SSL traffic
Other important side effects of the change in comment 14:

* Necko/PSM would become simpler to maintain, which is an important goal for the Necko team.

* The bugs about the SSL thread accessing main-thread-only resources off the main thread would be fixed automatically, because the SSL thread would no longer exist. (Note that any similar bugs in nsCertVerificationThread would continue to exist; however, there are fewer of these, and switching to libpkix by default will eliminate them, I believe.)
Assignee: kaie → bsmith
bsmith: SSL_RestartHandshakeAfterServerCert is broken even though it is exported
from ssl.def.  Adam Langley wrote a patch to fix it in 542832.
(In reply to comment #15)
> 
> * The bugs about the SSL thread accessing main-thread-only resources off the
> main thread would be fixed automatically, because the SSL thread would no
> longer exist.


I'm worried you might have false expectations.

I believe the SSL thread isn't the culprit for those bugs.

Necko uses multiple threads itself. Necko's socket transport threads are the ones that execute the read/write calls, and those are the ones that call into the I/O layers implemented by PSM and NSS.

When this happens, we access PSM resources from secondary threads (let's call the main thread the primary thread, and all other threads are secondary threads).

It has always been like that, even before we introduced the SSL thread.

Even if you removed the SSL thread, you would still have secondary threads executing PSM code that accesses Gecko resources.
I would have preferred if you had kept the ideas and bugs separate.
I would have preferred to keep this bug for the idea of multiple SSL worker threads, and not hijack it for the new idea.
As per Wan-Teh's comment, we would have to add the dependency to bug 511393, but that only applies to the new idea.
Depends on: 542832
The original idea to introduce multiple SSL worker threads will immediately allow for multiple parallel OCSP requests.

If your alternative solution doesn't allow that right away, it's not an improvement.
(In reply to comment #10)
> And to answer why it is possible to reenter the lock,
> nsSSLThread::checkHandshake calls
> nsSSLIOLayerHelpers::rememberPossibleTLSProblemSite that calls
> nsSSLIOLayerHelpers::addIntolerantSite that locks the helper lock.

The solution is probbaly to use a "monitor" (not a "lock").
> Change (or overload) nsCertVerificationThread::addJob() to return a 
> pollable event.

The SSL thread uses long lived pollable event objects.
I'm worried dynamically creating lots of pollable events might cause problems.

They are considered expensive objects. I think it would be better to have a pool and reuse the objects (this is the approach that my other thread used).

Some Firewall products track new server sockets on localhost (e.g. to identify backdoors). By using a pool of pollable events you also reduce the amount of interaction events.
> Replace "SSL_BadCertHook(fd, nsNSSBadCertHandler, infoObject)"
>     with "SSL_BadCertHook(fd, StartCertAuth, infoObject)."

I don't understand what the bad-cert-hook has to do with regular SSL connections, which don't involve bad certs.

In your proposal, StartCertAuth only gets called when a cert is bad. I believe that's not what you want.

While I'm worried you have too much indirection in your proposal in general, I believe this is a place where it's obviously too indirect.

It seems, your plan is to wait until a cert is rejected as bad, and only then you want to start a verification? That doesn't sound right.

I might misunderstand your plan.
(In reply to comment #17)
> > * The bugs about the SSL thread accessing main-thread-only resources off the
> > main thread would be fixed automatically, because the SSL thread would no
> > longer exist.
> 
> I believe the SSL thread isn't the culprit for those bugs.
>
> Even if you removed the SSL thread, you would still have secondary threads
> executing PSM code that accesses Gecko resources.

Thank you for pointing this out.

(In reply to comment #19)
> The original idea to introduce multiple SSL worker threads will immediately
> allow for multiple parallel OCSP requests.
> 
> If your alternative solution doesn't allow that right away, it's not an
> improvement.

It would still be an improvement because all the traffic besides the processing of the server Certificate message would become de-serialized. It is true that the processing of certificates would still be serialized, but then we could implement a solution very similar to your multi-threaded solution for nsSSLThread, except for nsCertVerificationThread instead, like you mentioned in comment 21. 

(In reply to comment #21)
> I'm worried dynamically creating lots of pollable events might cause
> problems.
> 
> They are considered expensive objects. I think it would be better to have a
> pool and reuse the objects (this is the approach that my other thread used).

Seems reasonable to me.

(In reply to comment #22)
> > Replace "SSL_BadCertHook(fd, nsNSSBadCertHandler, infoObject)"
> >     with "SSL_BadCertHook(fd, StartCertAuth, infoObject)."
> 
> I don't understand what the bad-cert-hook has to do with regular SSL
> connections, which don't involve bad certs.
>
> It seems, your plan is to wait until a cert is rejected as bad, and only
> then you want to start a verification? That doesn't sound right.
> 
> I might misunderstand your plan.

The problem is that libssl doesn't let the cert auth hook return SECWouldBlock; it only lets the bad cert hook return SECWouldBlock. The code for this in libssl is this:

    rv = (SECStatus)(*ss->authCertificate)(ss->authCertificateArg, ss->fd,
					   PR_TRUE, isServer);
    if (rv) {
	errCode = PORT_GetError();
	if (!ss->handleBadCert) {
	    goto bad_cert;
	}
	rv = (SECStatus)(*ss->handleBadCert)(ss->badCertArg, ss->fd);
	if ( rv ) {
	    if ( rv == SECWouldBlock ) {

I agree this isn't straightforward but I think it can be understood clearly with an appropriate comment. I think this would be much easier to understand and maintain than the code in nsSSLIOLayer. (Note that the simplification we are hoping for isn't the removal of the SSL thread, but rather reduction in complexity of nsSSLIOLayer).
I agree with Kai that Brian's idea have to have its own bug.

Pollable events:
- first, pollable events are not able to be created on some systems with certain firewalls like ZoneAlarm, unconditionally, so we have to have a solution to wake up/don't deadlock even w/o pollable events (necko does)
- they are expensive, on the windows platform it is a real client/server socket on localhost
- if we have a pool for the Brian's solution then I think it is the same as having the pool of ssl threads each having its own pollable event (bellow I outline how we can implement the Brian's idea w/o a pollable event at all)
- when a socket gets "replaced" with a pollable event, we must forward PR_Poll to that pollable event, the rest of socket functions must work with the original ssl/tcp socket stack

I don't see any progress on review/update the patch in bug 542832.  It blocks the Brian's idea.  With it, we need NSS modifications anyway.  It would be appropriate to also enhance the NSS code (straightforward and simple) to accept SECWouldBlock from authCertificate callbacks, as we have to wait for a new NSS release anyway.


Brian's idea, pros/cons:
+ simplifies the code by removing ssl thread (huge advantage from my point of view)
+ fixes two crash bugs we have in the ssl thread code, that are hard to trace and actually block creating multiple threads what could make the problem even deeper and harder to trace
+ may fix some (at least I know about one) smaller bugs with call of necko code on a bad thread
+ we may try to solve this w/o a pollable event at all - by overriding just and only the |poll| FD method implementation and posting a dummy event to the socket thread (we will have to make it reachable from PSM) what triggers the necko's pollable event to simply wake polling up
+/- I expect it to be simpler to implement then to finish and tune the multiple ssl threads patch (but this is a weak argument; we don't know what regression this may bring)
+/- when we fallback to use pllable events, we will be limited in parallel verifications by the pollable events number, but it is similar to the multiple ssl threads problem
- needs an update to NSS, there is an old bitrotted patch, though

Kai's multiple ssl threads pros/cons:
+ with the current trend of CPU architecture we gain performance by concurrent encryption operations processing spread across multiple CPU cores
+ might also have some other performance benefits (just a feeling, no objective data on that)
- more complex code based on an already complex and buggy code

I would personally give the Brian's idea a try, but let's move it to a separate bug, summary comments from here, and revert this bug to the original title.
OS: Linux → All
Hardware: x86 → All
Summary: De-serialize SSL traffic → Use multiple SSL worker threads (nsSSLThread)
The idea discussed in comment 14 through comment 24 is now bug 674147. The discussion of pollable events in those comments is still relevant to this bug.
No longer depends on: 542832
We implemented the solution from bug 674147, removing the SSL worker thread, so we don't need to do this. The patch from bug 674147 uses a pool of threads to do certificate validation, de-serializing SSL handshakes.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.