Last Comment Bug 674147 - remove the SSL thread (nsSSLThread)
: remove the SSL thread (nsSSLThread)
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Security: PSM (show other bugs)
: Trunk
: All All
: -- normal with 1 vote (vote)
: mozilla11
Assigned To: Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
:
:
Mentors:
: 634084 642697 (view as bug list)
Depends on: 705755 706955 759592 807711 542832 682329 686150 686248 698552 698988 703508 706728 710176 712363 716636 934902
Blocks: 536815 615342 360420 394830 480878 502707 508633 SPDY 532482 538283 548253 616072 617381 634084 636810 653565 658479 666999 698359 719887
  Show dependency treegraph
 
Reported: 2011-07-25 19:48 PDT by Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
Modified: 2013-11-12 14:25 PST (History)
22 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
WIP: hackish implementation to show that the idea seems to work (132.41 KB, patch)
2011-07-25 20:01 PDT, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
no flags Details | Diff | Splinter Review
Part 1: Move certificate validation code to new file, preserving history (48.25 KB, patch)
2011-11-19 13:10 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
honzab.moz: review+
Details | Diff | Splinter Review
Part 2: Everything else (sorry) (141.76 KB, patch)
2011-11-19 13:16 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
no flags Details | Diff | Splinter Review
Part 2: Everything else (sorry) [v2, fixes leak and potential very unlikely crash] (141.83 KB, patch)
2011-11-19 13:49 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
no flags Details | Diff | Splinter Review
SSL_RestartHandshakeAfterServerCert patches [not for checkin] (35.66 KB, patch)
2011-11-19 14:05 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
no flags Details | Diff | Splinter Review
[not for checkin] diff -w of AuthCertificate and DispatchCertErrorRunnable (26.50 KB, patch)
2011-11-19 14:19 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
no flags Details | Diff | Splinter Review
Part 2: Everything else (sorry) [v3, fixes leak and a not-so-unlikely crash at shutdown] (141.89 KB, patch)
2011-11-19 16:59 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
no flags Details | Diff | Splinter Review
[not for checkin] diff -w of AuthCertificate and DispatchCertErrorRunnable (31.99 KB, patch)
2011-11-19 17:08 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
no flags Details | Diff | Splinter Review
Part 2: Everything else (sorry) [v4] (141.75 KB, patch)
2011-11-21 10:51 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
honzab.moz: review+
Details | Diff | Splinter Review
Part 1: Move certificate validation code to new file, preserving history [v2] (48.25 KB, patch)
2011-11-29 18:50 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
brian: review+
kaie: superreview+
Details | Diff | Splinter Review
Part 2: Everything else (sorry) [v5] (145.40 KB, patch)
2011-11-29 18:51 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
brian: review+
Details | Diff | Splinter Review
Part 2: [v5] - merged (145.37 KB, patch)
2011-11-30 13:36 PST, Honza Bambas (:mayhemer)
no flags Details | Diff | Splinter Review
Part 2: [v5] - merged (146.29 KB, patch)
2011-11-30 14:48 PST, Honza Bambas (:mayhemer)
no flags Details | Diff | Splinter Review
Part 2: Everything else (sorry) [v6] (145.93 KB, patch)
2011-11-30 18:09 PST, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
honzab.moz: review+
kaie: superreview+
Details | Diff | Splinter Review

Description Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-07-25 19:48:03 PDT
+++ This bug was initially created as a clone of Bug #511393 +++

This was discussed in Bug 511393 Comment 14 through Comment 24. See Bug 511393 for the full discussion of the pros/cons of choosing to do this vs. implementing multiple SSL threads.

In Bug 511393 Comment 14 I wrote:
Brian Smith wrote:
> I talked to Kai on IRC about an alternate approach that is also different
> from the one I suggested in comment 13. We will discuss it on the phone on
> Thursday morning before the normal NSS teleconference to see if we agree
> that this approach will or not work. If it won't work, then we will pribably
> fix this by implementing the multiple SSL worker threads instead.
> 
> The new idea is:
> 
> * Remove nsSSLThread completely.
> * Change the signature of AuthCertificateCallback to match SSLBadCertHandler
> and rename it to StartCertAuth.
> * Write a new cert auth hook SaveParametersForCertAuth that saves its
> parameters and unconditionally returns SECFailure.
> * Replace "SSL_AuthCertificateHook(sslSock, AuthCertificateCallback, 0)" 
>      with "SSL_AuthCertificateHook(sslSock, SaveParametersForCertAuth, 0)."
> * Replace "SSL_BadCertHook(fd, nsNSSBadCertHandler, infoObject)"
>      with "SSL_BadCertHook(fd, StartCertAuth, infoObject)."
> * Move all the code in StartCertAuth after the call to
> PSM_SSL_PKIX_AuthCertificate to a new function AfterCertAuth.
> * Change (or overload) nsCertVerificationThread::addJob() to return a
> pollable event.
> * Change nsCertVerificationThread to maintain a pollable event that is set
> when certificate verification is finished; make addJob return the FD of the
> pollable event.
> * Replace the call to PSM_SSL_PKIX_AuthCertificate with a call to
> nsCertVerificationThread::addJob(), change the file descriptor for the SSL
> socket to be the FD for the pollable event, and return SECWouldBlock.
> * (In a follow-up bug) change nsCertVerificationThread so that it can verify
> multiple certificate chains in parallel.
> 
> The above steps will cause StartCertAuth to be called unconditionally during
> certificate validation. StartCertAuth returning SECWouldBlock will cause the
> SSL handshake to be paused in a non-blocking manner so that the main thread
> can continue working while certificate verification takes place. 
> 
> (Note that BadCertHook can return SECWouldBlock but the AuthCertificateHook
> can't, which is why the above steps are kind of roundabout; a clearer
> alternative would be to change libssl to handle the AuthCertificateHook
> returning SECWouldBlock in a way similar to the way it handles the
> BadCertHook returning SECWouldBlock. I propose that instead we just keep all
> the changes for this bug internal to PSM so we don't need to change NSS for
> this.)
> 
> Replacing the SSL socket's FD with the pollable event will result in the SSL
> transport being restarted up when verification is complete. When the SSL
> transport is woken up, it must always check to see if it was woken up
> because of an event on its SSL FD or because of an event on the certificate
> verification pollable event. If the later, it must switch its FD back to the
> SSL FD, call AfterCertAuth, and then call
> SSL_RestartHandshakeAfterServerCert to resume the SSL handshake.
> 
> The result of all of this would be that SSL traffic on one connection would
> no longer block SSL traffic on other connections.

And in Bug 511393 Comment 15 I wrote:
> Other important side effects of the change:
> 
> * Necko/PSM would become simpler to maintain, which is an important goal for
> the Necko team.
> 
> * [Some] bugs about the SSL thread accessing main-thread-only resources
> off the main thread would be fixed automatically, because the SSL
> thread would no longer exist. (Note that any similar bugs in 
> nsCertVerificationThread [and the socket transport thread] would
> continue to exist; however, there are fewer of these, and switching
> to libpkix by default will eliminate them, I believe.)

Kai noted:
> Even if you removed the SSL thread, you would still have secondary
> threads executing PSM code that accesses Gecko resources.
[...]
> The SSL thread uses long lived pollable event objects.
> I'm worried dynamically creating lots of pollable events
> might cause problems.

Honza wrote:
> + we may try to solve this w/o a pollable event at all -
> by overriding just and only the |poll| FD method
> implementation and posting a dummy event to the socket
> thread (we will have to make it reachable from PSM) what
> triggers the necko's pollable event to simply wake polling
> up
Comment 1 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-07-25 19:53:53 PDT
Kai also noted that fixing this bug would still cause some parts of SSL to be serialized because it doesn't allow for multiple certificate chain validations. My plan is to de-serialize them in the separate follow-up bug 674148.
Comment 2 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-07-25 20:01:12 PDT
Created attachment 548357 [details] [diff] [review]
WIP: hackish implementation to show that the idea seems to work

This is a WIP. nsSSLThread is gone and SSL still works, mostly. I am not sure yet how to handle renegotiations though, and there is a still a considerable amount of work to do. This may depend on the the patches in bug 651523 and bug 593077. It includes a modified version of the patch in bug 542832, so don't apply the patch in that bug if you want to check it out.
Comment 3 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-07-25 20:07:47 PDT
Honza, my WIP implements a variant of the idea you proposed. But, instead of a dummy event, I actually put the call to SSL_RestartHandshakeAfterServerCert in the event; this way, all the calls into libssl that might do I/O or buffer management will happen on the socket transport thread.
Comment 4 Kai Engert (:kaie) 2011-07-28 09:03:01 PDT
Have you tested with a proxy configuration?
Comment 5 Kai Engert (:kaie) 2011-07-28 09:14:13 PDT
Patch does not apply to mozilla-central.
Could you please attach one that does (and includes all necessary base patches)?
Comment 6 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-09-18 23:13:31 PDT
Patrick told me that removing the SSL thread would help his SPDY implementation. I was also told (by Josh? Patrick? Jason?) that we expect this to improve SPDY performance as well.
Comment 7 Honza Bambas (:mayhemer) 2011-10-08 09:02:07 PDT
Comment on attachment 548357 [details] [diff] [review]
WIP: hackish implementation to show that the idea seems to work

Review of attachment 548357 [details] [diff] [review]:
-----------------------------------------------------------------

I think this is the right way to go, I didn't go deeply by your new logic, though.

Brian, what are the next steps you plan on this?

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +249,5 @@
> +  // Unconditionally return SECFailure so that libssl will call StartAuthCertificate.
> +  // We need to return SECWouldBlock after starting certificate verification,
> +  // but libssl interprets any result other than SECSuccess from AuthCertificate as
> +  // failure. libssl processes SECWouldBlock from the BadCertHook in the way we need,
> +  // so we just defer all the work to that.

I think we should fix NSS to accept SECWouldBlock from this callback too.  The code in NSS to call both auth callback and bad cert callback is close to each other and it is quit logical to allow auth callback return would-block too.  That change would be fairly simple.

I don't think it is quit right to have a code like this in the production code.

@@ +2297,5 @@
> +    MutexAutoLock lock(socketInfo->mutex);
> +    if (socketInfo->mCertToVerify) {
> +      PR_LOG(gPIPNSSLog, PR_LOG_DEBUG, ("[%p] poll using pollable event %d\n", (void*)fd, (int) in_flags));
> +      *out_flags = 0;
> +      return 0;

Yes, this was exactly my idea :)

The log comment might be wrong.

::: security/nss/cmd/lib/SSLerrs.h
@@ +403,3 @@
>  ER3(SSL_ERROR_UNSAFE_NEGOTIATION,              (SSL_ERROR_BASE + 113),
>  "Peer attempted old style (potentially vulnerable) handshake.")
>  

Huh?  The number is wrong.

::: security/nss/cmd/tstclnt/tstclnt.c
@@ +1088,5 @@
>  	    /* Read from socket and write to stdout */
>  	    nb = PR_Recv(pollset[SSOCK_FD].fd, buf, sizeof buf, 0, maxInterval);
>  	    FPRINTF(stderr, "%s: Read from server %d bytes\n", progName, nb);
>  	    if (nb < 0) {
> +		if (PR_GetError() == SSL_ERROR_HANDSHAKE_SUSPENDED) {

I don't see any place this gets set.

::: security/nss/lib/ssl/sslsecur.c
@@ +158,4 @@
>  	++loopCount;
>      /* This code must continue to loop on SECWouldBlock, 
>       * or any positive value.	See XXX_1 comments.
>       */

I know this the change from bug 542832 but it really needs an explanation.  It also conflicts with the comment bellow.
Comment 8 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-10-08 10:47:08 PDT
(In reply to Honza Bambas (:mayhemer) from comment #7)
> Brian, what are the next steps you plan on this?

I have an improved version of the patch that is rebased on the XPCOM proxy work and which address some problems with this one. There is a problem with the bad cert handling still. I will fix it early next week and post it (and the entire queue it depends on) for review.

> I think we should fix NSS to accept SECWouldBlock from [the Auth Cert
> Callback] too. The code in NSS to call both auth callback and bad cert
> callback is close to each other and it is quit logical to allow auth
> callback return would-block too.  That change would be fairly simple.

I don't care either way. The main reason for doing it this way is to maintain compatibility with other NSS-based applications. You could return SECWouldBlock from AuthCertificate and your bad cert handler would be called right away. I do not know if any (non-Mozilla) products rely on that behavior. 

> I don't think it is quit right to have a code like this in the production
> code.

I agree that it is really ugly, and I know Kai does too. Though, it is not totally without precedent; IIRC, Chromium does something very similar.

Regarding the changes to NSS: I have since found a better way of doing this in the NSS code. I will post my patch in bug 542832 once everything is working perfectly.
Comment 9 Honza Bambas (:mayhemer) 2011-10-19 09:41:29 PDT
A side note: if we do parallel connections to a host:port we don't have a cached ssl session for, we should probably do the handshake just by one of the sockets and all other let wait until the full handshake is done.  Those waiting sockets should then use the just created cached session.  This way we could save load on our side as well as on the server side.

I'm not sure, but the current ssl thread implementation may enforce this behavior.

I don't have a knowledge whether NSS does this somehow internally.
Comment 10 Wan-Teh Chang 2011-10-19 11:40:32 PDT
Honza: Yes, that's a good idea.  I believe NSS doesn't do that
internally.

Note: this optimization is difficult when SSL False Start is
enabled.  With SSL False Start, the SSL handshake completion
callback is called before the handshake is truly completed
and the session ID written to the session ID cache.
Comment 12 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-10-19 12:11:31 PDT
I filed bug 695789 for Honza's suggestion in comment 9.
Comment 13 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-10-26 14:08:22 PDT
Honza, it will be easier to land the final version of this patch if we have bug 686248 fixed. If you don't have time to review the patch in bug 686248, then I will be able to de-tangle them, but I would like to avoid doing so.
Comment 14 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-10-26 14:09:27 PDT
Thanks for your try submission (http://hg.mozilla.org/try/pushloghtml?changeset=8dbbf3bd423e).  It's the best!

Watch https://tbpl.mozilla.org/?tree=Try&rev=8dbbf3bd423e for your results to come in.

Builds and logs will be available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bsmith@mozilla.com-8dbbf3bd423e.
Comment 15 Patrick McManus [:mcmanus] 2011-10-27 07:26:39 PDT
I tried out the patch series in the try build from comment 14 (all the patches not just the ones explicitly labeled remove-ssl-thread), and with them applied I can no longer connect to https://www.cotendo.com/

I get a SSL_ERROR_RX_UNEXPECTED_CERTIFICATE error (see stack below) that I don't get without the patch set.

This host does speak NPN and SPDY along with regular HTTPS, though my problem report is done without any of my spdy patches applied (so we are not sending a NPN list nor trying to speak spdy or using forcehandshake - its just the patchset from comment 14) so I would expect to see regular HTTPS happen.

I tried other known SPDY/NPN hosts (e.g. mail.google.com) and was able to handshake fine.

Breakpoint 1, ssl3_HandleCertificate (ss=0x7fffd5c6a000, b=0x7fffd6bb6000 "", length=3087) at ssl3con.c:7842
7842		errCode = SSL_ERROR_RX_UNEXPECTED_CERTIFICATE;
(gdb) bt
#0  ssl3_HandleCertificate (ss=0x7fffd5c6a000, b=0x7fffd6bb6000 "", length=3087) at ssl3con.c:7842
#1  0x00007ffff6a273ca in ssl3_HandleHandshakeMessage (ss=0x7fffd5c6a000, b=0x7fffd6bb6000 "", length=3087) at ssl3con.c:8708
#2  0x00007ffff6a27a58 in ssl3_HandleHandshake (ss=0x7fffd5c6a000, origBuf=0x7fffd5c6a390) at ssl3con.c:8871
#3  0x00007ffff6a28576 in ssl3_HandleRecord (ss=0x7fffd5c6a000, cText=0x7fffe53fe780, databuf=0x7fffd5c6a390) at ssl3con.c:9171
#4  0x00007ffff6a2975d in ssl3_GatherCompleteHandshake (ss=0x7fffd5c6a000, flags=0) at ssl3gthr.c:209
#5  0x00007ffff6a2c5a8 in ssl_GatherRecord1stHandshake (ss=0x7fffd5c6a000) at sslcon.c:1258
#6  0x00007ffff6a3917e in ssl_Do1stHandshake (ss=0x7fffd5c6a000) at sslsecur.c:155
#7  0x00007ffff6a3bb1f in ssl_SecureSend (ss=0x7fffd5c6a000,
Comment 16 Patrick McManus [:mcmanus] 2011-10-27 07:28:01 PDT
(In reply to Patrick McManus from comment #15)
> I tried out the patch series in the try build from comment 14 (all the
> patches not just the ones explicitly labeled remove-ssl-thread), and with
> them applied I can no longer connect to https://www.cotendo.com/
> 

I forgot to mention that I confirmed that is a regression from mozilla-central.
Comment 17 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-10-27 09:35:13 PDT
Thanks Patrick. I will relay your report over to bug 542832, since that seems to be the cause of the regression.
Comment 18 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-10-28 00:38:57 PDT
I got www.cotendo.com working. Will post the new set of patches in the morning.

There is one problem left (AFAICT): The SSL_ERROR_RX_UNEXPECTED_CERTIFICATE error should have been reported in an error page, but instead no error page was displayed. I have a pretty good idea why. Will fix tomorrow.

Watch https://tbpl.mozilla.org/?tree=Try&rev=c8fc5cc5dc40 for your results to come in.

Builds and logs will be available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bsmith@mozilla.com-c8fc5cc5dc40.
Comment 19 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-10-30 22:57:42 PDT
Patrick, when you ran into the SSL_ERROR_RX_UNEXPECTED_CERTIFICATE error, was the error reported using the "fancy" error pages, or did the connection just seem to hang with the previous page's contents shown?
Comment 20 Patrick McManus [:mcmanus] 2011-10-31 05:51:27 PDT
(In reply to Brian Smith (:bsmith) from comment #19)
> Patrick, when you ran into the SSL_ERROR_RX_UNEXPECTED_CERTIFICATE error,
> was the error reported using the "fancy" error pages, or did the connection
> just seem to hang with the previous page's contents shown?

no fancy page. (it didn't hang in the spinning sense, but nothing was updated and the connection was closed.)
Comment 21 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-19 13:10:50 PST
Created attachment 575693 [details] [diff] [review]
Part 1: Move certificate validation code to new file, preserving history

This patch moves all the certificate path validation code from nsNSSCallbacks.cpp into a new file, SSLServerCertVerification.cpp, preserving the Mercurial change history. This is useful because the AuthCertificateCallback will be made a member of the SSLServerCertVerification class in the next patch, and it allows us to keep the entire definition of SSLServerCertVerification in one file (including even the class definition).
Comment 22 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-19 13:14:07 PST
There is a bug in the splinter code review tool regarding "hg copy" operations. It cannot be used to review this patch. The best way to review the patch is to use a visual diff tool like DiffMerge or kdiff, to verify that no code changes were made other than moving the relevant code to the new file.
Comment 23 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-19 13:16:34 PST
Created attachment 575694 [details] [diff] [review]
Part 2: Everything else (sorry)

I spent a long time trying to figure out a good way to break this up into smaller pieces for review, but it would take *forever*. Instead, if there are things that you want broken out or explained in more detail than what is in the comments, please don't hesitate to ask for it. I know it sucks to review a giant patch like this.
Comment 24 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-19 13:44:39 PST
Tips for people reviewing the code: Earlier this week, I found I had overlooked some important logic in nsSSLThread.cpp that isn't usually necessary. (e.g. things that have to do with TLS intolerance.) It is a good idea to look at the log of nsSSLThread.cpp to see what bug fixes were made to it, to verify that those things were not overlooked in my patch. (I just did this over the course of the last two days and found a few bugs in doing so.)

The test coverage for this code (e.g. TLS intolerance) is very poor. I will be working on more/better tests Monday.
Comment 25 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-19 13:49:07 PST
Created attachment 575695 [details] [diff] [review]
Part 2: Everything else (sorry) [v2, fixes leak and potential very unlikely crash]
Comment 26 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-19 14:05:13 PST
Created attachment 575698 [details] [diff] [review]
SSL_RestartHandshakeAfterServerCert patches [not for checkin]

Here is the try run currently in progress:
https://tbpl.mozilla.org/?tree=Try&rev=b64e5b061571

To build:
1. python client.py update_nss HEAD
2. Apply the patches from this attachment (which are from bug 542832)
3. Apply the patch in bug 703508
4. Apply the two patches above.
Comment 27 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-19 14:19:15 PST
Created attachment 575701 [details] [diff] [review]
[not for checkin] diff -w of AuthCertificate and DispatchCertErrorRunnable

Here is a diff -u of parts where there indention changes obscure the real changes.
Comment 28 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-19 16:59:47 PST
Created attachment 575715 [details] [diff] [review]
Part 2: Everything else (sorry) [v3, fixes leak and a not-so-unlikely crash at shutdown]

Here is a tryserver run that looks like it is going to be substantially more green than the last one I posted:
https://tbpl.mozilla.org/?tree=Try&rev=a4d0af78a5a8
Comment 29 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-19 17:08:53 PST
Created attachment 575717 [details] [diff] [review]
[not for checkin] diff -w of AuthCertificate and DispatchCertErrorRunnable

I didn't leave enough of the diff metadata for splinter to parse this diff -w patch. Here's another attempt at getting something that is readable in splinter.
(It would be better to review these changes in DiffMerge or another graphical diff program that knows how to handle changes of indention.)
Comment 30 Patrick McManus [:mcmanus] 2011-11-21 08:52:34 PST
Using the build instructions from comment 26 I frequently receive the following assertion:

###!!! ASSERTION: Should not have an error message for successful certificate verification: 'errorCode != 0', file ../../../../../spdy2/security/manager/ssl/src/nsNSSIOLayer.cpp, line 872

for instance https://www.paypal.com/ generates it.

Things seem to work fine.
Comment 31 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-21 10:51:10 PST
Created attachment 575907 [details] [diff] [review]
Part 2: Everything else (sorry) [v4]

That assertion shouldn't have been in the patch. (Originally, that function took an additional parameter and that assertion was testing additional conditions.)
Comment 32 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-21 10:54:08 PST
Comment on attachment 575907 [details] [diff] [review]
Part 2: Everything else (sorry) [v4]

The interdiff tool seems busted too :(. The only change I made was to remove the unnecessary assertion that Patrick pointed out.
Comment 33 Honza Bambas (:mayhemer) 2011-11-22 12:17:50 PST
Comment on attachment 575693 [details] [diff] [review]
Part 1: Move certificate validation code to new file, preserving history

When applying this patch with hg 1.9.2 I'm getting the following:

$ hg qpush -v
applying psm-move-AuthCertificateCallback.patch
patching file security/manager/ssl/src/Makefile.in
patching file security/manager/ssl/src/SSLServerCertVerification.cpp
list index out of range
security/manager/ssl/src/Makefile.in
patch failed, rejects left in working dir
errors during apply, please fix and refresh psm-move-AuthCertificateCallback.patch

SSLServerCertVerification.cpp is not created, nor any rej file for it.  Copying the patch for just the file to a new patch solves the problem.
Comment 34 Honza Bambas (:mayhemer) 2011-11-23 05:32:38 PST
Comment on attachment 575693 [details] [diff] [review]
Part 1: Move certificate validation code to new file, preserving history

Review of attachment 575693 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab

Except the problem with qpush.  You have to split this patch to 3 pieces (one patch per one file) to get this in the queue.
Comment 35 Honza Bambas (:mayhemer) 2011-11-24 11:34:09 PST
Comment on attachment 575693 [details] [diff] [review]
Part 1: Move certificate validation code to new file, preserving history

BTW: some other parts could be moved out from nsNSSIOLayer.cpp, e.g. client authentication stuff.  But not in this bug...
Comment 36 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-24 21:54:46 PST
In some places, I set the error code to an NSPR PR_xxx_ERROR error code. We may need to update the SECStatus -> nsresult mapping code in Necko to handle NSPR errors. We also should investigate the code in DocShell that decides which error codes should result in an error page.
Comment 37 Honza Bambas (:mayhemer) 2011-11-28 09:40:06 PST
Comment on attachment 575907 [details] [diff] [review]
Part 2: Everything else (sorry) [v4]

Review of attachment 575907 [details] [diff] [review]:
-----------------------------------------------------------------

Very nice!  Thanks.

r=honzab (please use this signature in commit messages) with:

You are regressing bug 378629:
https://bugzilla.mozilla.org/show_bug.cgi?id=356470#c54
https://bugzilla.mozilla.org/show_bug.cgi?id=378629#c11

PSMSend must watch for libssl sending just request - 1 amount of data (= a "short write").  In that case, the simplest solution is IMO that we have to limit next data send to just 1 byte, simply explained in a pseudo code:

PSMSend(buffer, count)
{
  static flag = false;
  if (flag)
    count = 1;
    flag = false;

  written = lower->send(buffer, count)
  if (written == count - 1)
    flag = true;
}

We can do this in a separate bug, but ASAP.  The final solution is up to you, but I'd like to review and test it.

::: netwerk/base/src/nsSocketTransportService2.cpp
@@ +54,5 @@
>  
>  #include "mozilla/FunctionTimer.h"
>  
> +// XXX: There is no good header file to put these in. :(
> +namespace mozilla { namespace psm {

namespace mozilla {
namespace psm {

@@ +616,5 @@
>  nsSocketTransportService::Run()
>  {
>      SOCKET_LOG(("STS thread init\n"));
>  
> +    psm::InitializeSSLServerCertVerificationThreads();

Why don't you just put this in nsNSSComponent::Init() and ::Shutdown() ?

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +42,5 @@
>   * ***** END LICENSE BLOCK ***** */
> +
> +/* 
> + * All I/O is done on the socket transport service (STS) thread, including all
> + * calls into libssl. That is, all SSL_* functions must be called on the STS

Personally I would rather use "network thread" instead of "STS thread" in comments.  It's commonly referred in the code as "network" and is not confusing with an abbreviation that overlaps with strict transport security.

Or don't use the abbreviation and refer fully as "socket transport thread".

@@ +60,5 @@
> + *
> + * Consequently, we must always call the CERT_*Verify* cert functions off the
> + * socket transport thread. To accomplish this, our auth cert hook dispatches a
> + * SSLServerCertVerification job to a pool of background threads, and then
> + * immediatley return SECWouldBlock to libssl. These jobs are where the

returns

@@ +69,5 @@
> + * thread so that HTTP requests--in particular, the OCSP/CRL/cert
> + * requests needed for cert verification as mentioned above--can be processed.
> + *
> + * Once the CERT_*Verify* function returns, the cert verification job posts a
> + * SSLServerCertVerificationResult on the STS thread that will notify libssl

_to_ the STS thread?

@@ +114,5 @@
> + *
> + * SSLServerCertVerificationResult must be dispatched to the socket transport
> + * thread because we must only call SSL_* functions on the socket transport
> + * thread since they may do I/O and because many parts of nsNSSSocketInfo and
> + * the PSM NSS I/O layer are not thread-safe.

And also because it wakes the thread and polling up, right?

--

You might also want to add we process bad certs our self and don't rely on NSS bad cert hook any more.

@@ +137,5 @@
>  #endif
>  
> +namespace mozilla { namespace psm {
> +
> +namespace {

Why nested anonymous namespace?

@@ +139,5 @@
> +namespace mozilla { namespace psm {
> +
> +namespace {
> +// do not use a nsCOMPtr to avoid static initializer/destructor
> +nsIThreadPool * gThreadPool = nsnull;

I would call this gCertVerificationThreadPool

@@ +182,5 @@
> +{
> +  if (gThreadPool) {
> +    gThreadPool->Shutdown();
> +    gThreadPool->Release();
> +    gThreadPool = nsnull;

NS_RELEASE(gThreadPool);

@@ +186,5 @@
> +    gThreadPool = nsnull;
> +  }
> +}
> +
> +namespace {

Why nested anonymous namespace?

@@ +188,5 @@
> +}
> +
> +namespace {
> +
> +class SSLServerCertVerification : public nsRunnable

Maybe call this SSLServerCertVerificationJob ?

@@ +242,2 @@
>                                  survivingParams->GetRawPointerForNSS(),
>                                  cvout, pinarg);

Indention.

@@ +500,5 @@
>    return rv;
>  }
> +
> +/*static*/ SECStatus
> +SSLServerCertVerification::Dispatch(const void * fdForLogging,

SSLServerCertVerificationJob::Start ?

@@ +513,5 @@
> +    PR_SetError(PR_INVALID_STATE_ERROR, 0);
> +    return SECFailure;
> +  }
> +
> +  if (!serverCert) {

According the previous condition, it seems you never get here if this is true.

@@ +525,5 @@
> +    return SECFailure;
> +  }
> +  
> +  SSLServerCertVerification * job
> +    = new SSLServerCertVerification(fdForLogging, *socketInfo, *serverCert);

I think you should use nsRefPtr here, unless I'm missing something.

@@ +528,5 @@
> +  SSLServerCertVerification * job
> +    = new SSLServerCertVerification(fdForLogging, *socketInfo, *serverCert);
> +
> +  socketInfo->SetCertVerificationWaiting();
> +  nsresult nrv = gThreadPool->Dispatch(job, NS_DISPATCH_NORMAL);

You should check gThreadPool is non-null.

@@ +564,5 @@
> +    SECStatus rv = AuthCertificate(nssShutdownPrevention);
> +    if (rv == SECSuccess) {
> +      nsRefPtr<SSLServerCertVerificationResult> restart 
> +        = new SSLServerCertVerificationResult(*mSocketInfo, 0);
> +      restart->Dispatch();

Maybe have DispatchCertVerificationResult method?

@@ +595,5 @@
> +  failure->Dispatch();
> +  return NS_OK;
> +}
> +
> +SSLServerCertVerification::SSLServerCertVerification(

Please put constructors and destructors as the first in methods definition.

@@ +613,5 @@
> +} // anonymous namespace
> +
> +// Extracts whatever information we need out of fd (using SSL_*) and passes it
> +// to SSLServerCertVerification::Dispatch. SSLServerCertVerification should
> +// never call do any operations on fd except logging.

Check this comment grammar.

@@ +617,5 @@
> +// never call do any operations on fd except logging.
> +SECStatus
> +AuthCertificateHook(void *arg, PRFileDesc *fd, PRBool checkSig, PRBool isServer)
> +{
> +  // TODO: Assert that we're on the socket transport thread

Why is this assertion a TODO?  I suppose it is ensured.

@@ +647,5 @@
> +    = do_GetService(NS_SOCKETTRANSPORTSERVICE_CONTRACTID);
> +  NS_ASSERTION(sts, "Failed to get socket transport service");
> +  nsCOMPtr<nsIEventTarget> stsTarget(do_QueryInterface(sts));
> +  NS_ASSERTION(stsTarget,
> +               "Failed to get socket transport service event target");

Maybe simpler would be to get directly nsIEventTarget and log rv of do_GetService.

::: security/manager/ssl/src/SSLServerCertVerification.h
@@ +71,5 @@
> +public:
> +  void Dispatch();
> +  SSLServerCertVerificationResult(nsNSSSocketInfo & socketInfo,
> +                                  PRErrorCode errorCode);
> +  NS_DECL_NSIRUNNABLE

Please in this order:

NS_DECL_NSIRUNNABLE

SSLServerCertVerificationResult(nsNSSSocketInfo & socketInfo,
PRErrorCode errorCode);

void Dispatch();

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +857,5 @@
> +void
> +nsNSSSocketInfo::SetCertVerificationWaiting()
> +{
> +  NS_ASSERTION(mCertVerificationState == before_cert_verification,
> +               "Invalid state transition to waiting_for_cert_verification");

What about renegotiation?  Wouldn't the state be after_cert_verification?

@@ +865,5 @@
> +void
> +nsNSSSocketInfo::SetCertVerificationResult(PRErrorCode errorCode)
> +{
> +  NS_ASSERTION(mCertVerificationState == waiting_for_cert_verification,
> +                "Invalid state transition to cert_verification_finished");

Indention

@@ +870,5 @@
> +
> +  if (errorCode != 0) {
> +    SetCanceled(errorCode, CertErrorMessage);
> +  } else if (mFd) {
> +    // We haven't closed the connection alraedy, so restart it

already (typo)

@@ +1425,2 @@
>    // this function is called, the SSL thread is waiting on this thread (the
>    // main thread). So, no mutex is necessary for SetCanceled()/GetError*().

s/SSL thread/network thread/ in this comment.

@@ +1466,5 @@
>        : nsNSSSocketInfo::SuppressedErrorMessage;
>    socketInfo->SetCanceled(err, errorMessageType);
>  }
>  
> +namespace {

Why do you enclose this in mozilla::psm::`anonymous namespace' ?

@@ +1483,5 @@
> +  NS_ASSERTION(isPSMSocket, "not a PSM socket");
> +  NS_ASSERTION(socketInfo, "nsNSSSocketInfo was null for a PSM socket");
> +  NS_ASSERTION(fd->lower, "PSM socket has no lower layer");
> +
> +  if (!isPSMSocket || !fd->lower || !socketInfo ||

Redundant check for !fd->lower (you probably want to remove it from the first condition)

@@ +1492,5 @@
> +
> +  if (socketInfo->GetErrorCode()) {
> +    // If we get here, it is probably because cert verification failed and this
> +    // is the first I/O attempt since that failure.
> +    PR_SetError(socketInfo->GetErrorCode(), 0);

Maybe cache the error code in local var?

@@ +1512,4 @@
>      return PR_FAILURE;
>    
> +  PRStatus status = PR_SUCCESS;  
> +  status = fd->lower->methods->connect(fd->lower, addr, timeout);

Probably no need to do status = PR_SUCCESS; first.

@@ +1689,3 @@
>  }
>  
>  PRStatus nsNSSSocketInfo::CloseSocketAndDestroy()

Don't you want to inline this method to nsSSLIOLayerClose directly?  Also when compared to nsNSSIOLayer before ssl thread.

@@ +1700,5 @@
> +  
> +  // the nsNSSSocketInfo instance can out-live the connection, so we need some
> +  // indication that the connection has been closed. mFd == nsnull is that
> +  // indication. This is needed, for example, when the connection is closed
> +  // before we have finished validationg the server's certificate.

validating (typo)

@@ +1899,5 @@
>      
>      // This is the common place where we trigger an error message on a SSL socket.
>      // This might be reached at any time of the connection.
> +    if (!wantRetry && (IS_SSL_ERROR(err) || IS_SEC_ERROR(err)) &&
> +        !socketInfo->GetErrorCode()) {

This needs a comment with good explanation.  As I understand, this is here to prevent dispatch of SSLErrorRunnable more then ones since it sets the error code on the socket info object, right?

@@ +1952,5 @@
>  
> +  // TODO: 
> +  // if ((in_flags & PR_POLL_EXCEPT) == 0) { 
> +  //     NS_WARNING("Every caller should be polling for errors");
> +  // }

Maybe remove this code and have a new bug if there is anything you want to fix.

@@ +2037,5 @@
>  
>  static PRStatus PR_CALLBACK PSMGetsockname(PRFileDesc *fd, PRNetAddr *addr)
>  {
>    nsNSSShutDownPreventionLock locker;
> +  if (getSocketInfoIfRunning(fd, locker) == nsnull)

!getSocketInfoIfRunning() is better, also on all other places.

@@ +3282,3 @@
>  }
>  
>  class CertErrorRunnable : public SyncRunnableBase

Not sure on 100% on this, but would be nice to move this class and closely related stuff to SSLServerCertVerification or nsNSSCallbacks.

nsNSSIOLayer should be strictly about the socket and I/O implementation.

Probably not for this bug..

@@ +3305,2 @@
>    virtual void RunOnTargetThread();
> +  nsCOMPtr<nsIRunnable> result; // out

mResult

@@ +3309,2 @@
>    
> +  const void * const mFdForLogging; // may become an invalid pointer; do not deference

dereference?

@@ +3326,5 @@
> +// Returns SECFailure with the error code set if it does not dispatch the
> +// CertErrorRunnable. In that case, the caller should dispatch its own 
> +// SSLServerCertVerificationResult with the error code from PR_GetError().
> +SECStatus
> +DispatchCertErrorRunnable(PRErrorCode defaultErrorCodeToReport,

This should be called HandleBadCertificate.
Add a comment this is called on the cert verification thread.

@@ +3499,5 @@
>  
> +} } // namespace mozilla::psm
> +
> +void
> +nsNSSSocketInfo::SetStatusErrorBits(nsIX509Cert & cert,

Use * instead of &.

@@ +3652,5 @@
> +    DispatchToMainThreadAndWait(); // step 1
> +
> +    // step 3
> +    if (!result) {
> +      // Either the dispatch failed or there CheckCertOverrides wrongly

s/there/on the main thread/ ?

@@ +3674,5 @@
> +  // thread. This is exactly the state we need to be in to call
> +  // CheckCertOverrides; CheckCertOverrides must block events on both of
> +  // these threads because it calls nsNSSSocketInfo::GetInterface(),
> +  // which calls nsHttpConnection::GetInterface() through
> +  // nsNSSSocketInfo::::mCallbacks. nsHttpConnection::GetInterface must always

::::
Comment 38 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-28 10:43:02 PST
(In reply to Honza Bambas (:mayhemer) from comment #37)
> You are regressing bug 378629:
> https://bugzilla.mozilla.org/show_bug.cgi?id=356470#c54
> https://bugzilla.mozilla.org/show_bug.cgi?id=378629#c11
>
> We can do this in a separate bug, but ASAP.  The final solution is up to
> you, but I'd like to review and test it.

I filed bug 705755.

> > +namespace mozilla { namespace psm {
> 
> namespace mozilla {
> namespace psm {

I don't care much either way, but:
1. Other code I wrote in PSM already uses this style.
2. bsmedberg suggested I add this style to the coding style guide.

> @@ +616,5 @@
> >  nsSocketTransportService::Run()
> >  {
> >      SOCKET_LOG(("STS thread init\n"));
> >  
> > +    psm::InitializeSSLServerCertVerificationThreads();
> 
> Why don't you just put this in nsNSSComponent::Init() and ::Shutdown() ?

1. Startup perf. nsSocketTransportService::Run() runs on a background thread. nsNSSComponent::Init() runs on the main thread and blocks startup. I didn't measure how much time it takes to create the thread pool; however, generally my plan is to remove as much stuff work from nsNSSComponent::Init() as possible and avoid adding work to it whenever possible--not just to reduce any CPU-level performance impact, but also to delay loading as many NSS DLLs during startup on the main thread as possible.

2. I tried to explain this in the comments for InitializeSSLServerCertVerificationThreads. We need these threads for the entire lifetime of the socket transport thread, and only for that lifetime. This way, we can easily see the code is correct. If we instead initialize and tear down the thread pool in nsNSSComponent, it is more difficult to reason about the correctness, because of the complicated shutdown dance we do in nsNSSComponent.

> @@ +114,5 @@
> > + *
> > + * SSLServerCertVerificationResult must be dispatched to the socket transport
> > + * thread because we must only call SSL_* functions on the socket transport
> > + * thread since they may do I/O and because many parts of nsNSSSocketInfo and
> > + * the PSM NSS I/O layer are not thread-safe.
> 
> And also because it wakes the thread and polling up, right?

Right. I think I put this in a comment somewhere else, but I will add it here too.

> You might also want to add we process bad certs our self and don't rely on
> NSS bad cert hook any more.

This is a good thing for us all to be aware of, but I plan to soon change the bad cert handling even more in bug 650296 so that this comment wouldn't make much sense if I were to add it.

> > +namespace {
> 
> Why nested anonymous namespace?

From looking at code by bsmedberg and others, this seems to be the new way in the codebase to declare file-scoped static functions and variables.

> Please in this order:
> 
> NS_DECL_NSIRUNNABLE
> 
> SSLServerCertVerificationResult(nsNSSSocketInfo & socketInfo,
> PRErrorCode errorCode);
> 
> void Dispatch();

Is this explained somewhere? I thought we usually do constructors and destructors first, then overrides, then new methods?

> ::: security/manager/ssl/src/nsNSSIOLayer.cpp
> @@ +857,5 @@
> > +void
> > +nsNSSSocketInfo::SetCertVerificationWaiting()
> > +{
> > +  NS_ASSERTION(mCertVerificationState == before_cert_verification,
> > +               "Invalid state transition to waiting_for_cert_verification");
> 
> What about renegotiation?  Wouldn't the state be after_cert_verification?

Indeed. And, there seems to be something wrong with my build since apparently such assertions are never being brought to my attention.

> >  PRStatus nsNSSSocketInfo::CloseSocketAndDestroy()
> 
> Don't you want to inline this method to nsSSLIOLayerClose directly?  Also
> when compared to nsNSSIOLayer before ssl thread.

When we fix 427948, we will probably need to use the nsNSSSocketInfo mutex in CloseSocketAndDestroy, and the mutex is private, so I think we should just leave it as it is.

> > +    if (!wantRetry && (IS_SSL_ERROR(err) || IS_SEC_ERROR(err)) &&
> > +        !socketInfo->GetErrorCode()) {
> 
> This needs a comment with good explanation.  As I understand, this is here
> to prevent dispatch of SSLErrorRunnable more then ones since it sets the
> error code on the socket info object, right?

That, and plus, it ensures that we don't replace the first error encountered with a different one. I will double-check this logic.

> >  class CertErrorRunnable : public SyncRunnableBase
> 
> Not sure on 100% on this, but would be nice to move this class and closely
> related stuff to SSLServerCertVerification or nsNSSCallbacks.
> 
> nsNSSIOLayer should be strictly about the socket and I/O implementation.
> 
> Probably not for this bug..

I agree on all points. I filed bug 705773.

> > +void
> > +nsNSSSocketInfo::SetStatusErrorBits(nsIX509Cert & cert,
> 
> Use * instead of &.

I like to use references to indicate that the parameter is never null so that it is clear that we don't have to handle the null case. IIRC, I do this in several places, so we should decide on a consistent rule.

I would rather not rename SSLServerCertVerification to SSLServerCertVerificationJob. SSLServerCertVerification is already a noun, it already has a very long name, and also it isn't derived from nsCertVerificationJob like the other *Job classes.
Comment 39 Honza Bambas (:mayhemer) 2011-11-28 11:18:19 PST
(In reply to Brian Smith (:bsmith) from comment #38)
> I filed bug 705755.
Thanks
> 
> > > +namespace mozilla { namespace psm {
> 2. bsmedberg suggested I add this style to the coding style guide.
OK
> > Why don't you just put this in nsNSSComponent::Init() and ::Shutdown() ?
> 
> 1. Startup perf. nsSocketTransportService::Run() runs on a background
> thread. nsNSSComponent::Init() runs on the main thread and blocks startup. I
> didn't measure how much time it takes to create the thread pool; however,
> generally my plan is to remove as much stuff work from
> nsNSSComponent::Init() as possible and avoid adding work to it whenever
> possible--not just to reduce any CPU-level performance impact, but also to
> delay loading as many NSS DLLs during startup on the main thread as possible.
> 
> 2. I tried to explain this in the comments for
> InitializeSSLServerCertVerificationThreads. We need these threads for the
> entire lifetime of the socket transport thread, and only for that lifetime.
> This way, we can easily see the code is correct. If we instead initialize
> and tear down the thread pool in nsNSSComponent, it is more difficult to
> reason about the correctness, because of the complicated shutdown dance we
> do in nsNSSComponent.

I'm still not very convinced this code belongs here.  nsNSSComponent over-lives sockets.  Creation of the pool costs creation of just a monitor (lock + cond var).  Threads are created on an even post.  Shutdown means to wait for all threads to join.

> > You might also want to add we process bad certs our self and don't rely on
> > NSS bad cert hook any more.
> 
> This is a good thing for us all to be aware of, but I plan to soon change
> the bad cert handling even more in bug 650296 so that this comment wouldn't
> make much sense if I were to add it.

Until that bug land that comments makes probably sense.  Up to you.

> > Why nested anonymous namespace?
> 
> From looking at code by bsmedberg and others, this seems to be the new way
> in the codebase to declare file-scoped static functions and variables.

Huh...  It prolongs lines in the debugger quit a lot.  OK then.

> 
> > Please in this order:
> > 
> > NS_DECL_NSIRUNNABLE
> > 
> > SSLServerCertVerificationResult(nsNSSSocketInfo & socketInfo,
> > PRErrorCode errorCode);
> > 
> > void Dispatch();
> 
> Is this explained somewhere? I thought we usually do constructors and
> destructors first, then overrides, then new methods?

I use the order I suggest based on what I see in existing files and expect in new files then.

> Indeed. And, there seems to be something wrong with my build since
> apparently such assertions are never being brought to my attention.

XPCOM_DEBUG_BREAK ?

> 
> > >  PRStatus nsNSSSocketInfo::CloseSocketAndDestroy()
> > 
> > Don't you want to inline this method to nsSSLIOLayerClose directly?  Also
> > when compared to nsNSSIOLayer before ssl thread.
> 
> When we fix 427948, we will probably need to use the nsNSSSocketInfo mutex
> in CloseSocketAndDestroy, and the mutex is private, so I think we should
> just leave it as it is.
OK

> That, and plus, it ensures that we don't replace the first error encountered
> with a different one. I will double-check this logic.
Thanks.

> I agree on all points. I filed bug 705773.
Thanks

> I like to use references to indicate that the parameter is never null so
> that it is clear that we don't have to handle the null case. IIRC, I do this
> in several places, so we should decide on a consistent rule.

Hmm.. understand.  It is OK in case of the proofOfLock case, but here I'd rather see non-null assertion and use of *.  Mixing '.Method()' and '->Method()' is not good IMO, but up to you.

> and also it isn't derived from
> nsCertVerificationJob like the other *Job classes.

Well, but the name doesn't enough suggest it is something being processed in the background.  Therefor I wanted to suffix it.  Don't be afraid of long names when they make sense :)
Comment 40 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-28 11:34:54 PST
(In reply to Honza Bambas (:mayhemer) from comment #39)
> > > Why don't you just put this in nsNSSComponent::Init() and ::Shutdown() ?
> > 
> > 1. Startup perf. nsSocketTransportService::Run() runs on a background
> > thread. nsNSSComponent::Init() runs on the main thread and blocks startup. I
> > didn't measure how much time it takes to create the thread pool; however,
> > generally my plan is to remove as much stuff work from
> > nsNSSComponent::Init() as possible and avoid adding work to it whenever
> > possible--not just to reduce any CPU-level performance impact, but also to
> > delay loading as many NSS DLLs during startup on the main thread as possible.
> > 
> > 2. I tried to explain this in the comments for
> > InitializeSSLServerCertVerificationThreads. We need these threads for the
> > entire lifetime of the socket transport thread, and only for that lifetime.
> > This way, we can easily see the code is correct. If we instead initialize
> > and tear down the thread pool in nsNSSComponent, it is more difficult to
> > reason about the correctness, because of the complicated shutdown dance we
> > do in nsNSSComponent.
> 
> I'm still not very convinced this code belongs here.  nsNSSComponent
> over-lives sockets.  Creation of the pool costs creation of just a monitor
> (lock + cond var).  Threads are created on an even post.  Shutdown means to
> wait for all threads to join.

There is one more reason I forgot to mention: This way, the thread pool is created and accessed only on the socket transport thread, so no locks are required in SSLServerCertVerification for this. If we do it in nsNSSComponent, we end up needing locks. I suggest we do it the way I have it now and see how it goes.
 
> 
> > 
> > > Please in this order:
> > > 
> > > NS_DECL_NSIRUNNABLE
> > > 
> > > SSLServerCertVerificationResult(nsNSSSocketInfo & socketInfo,
> > > PRErrorCode errorCode);
> > > 
> > > void Dispatch();
> > 
> > Is this explained somewhere? I thought we usually do constructors and
> > destructors first, then overrides, then new methods?
> 
> I use the order I suggest based on what I see in existing files and expect
> in new files then.

OK.

> > I like to use references to indicate that the parameter is never null so
> > that it is clear that we don't have to handle the null case. IIRC, I do this
> > in several places, so we should decide on a consistent rule.
> 
> Hmm.. understand.  It is OK in case of the proofOfLock case, but here I'd
> rather see non-null assertion and use of *.  Mixing '.Method()' and
> '->Method()' is not good IMO, but up to you.

It seems like this is an issue that should already have been decided in the coding guidelines, but I can't find it. I will start a flamewar on dev.platform about it now and then decipher the results of it after I have made the other changes.

> Well, but the name doesn't enough suggest it is something being processed in
> the background.  Therefor I wanted to suffix it.  Don't be afraid of long
> names when they make sense :)

OK. I will rename it.
Comment 41 Honza Bambas (:mayhemer) 2011-11-28 11:58:27 PST
(In reply to Brian Smith (:bsmith) from comment #40)
> There is one more reason I forgot to mention: This way, the thread pool is
> created and accessed only on the socket transport thread, so no locks are
> required in SSLServerCertVerification for this. If we do it in
> nsNSSComponent, we end up needing locks. I suggest we do it the way I have
> it now and see how it goes.

That's a good argument.  However, we should one day do this somehow code-independently thought something like observer notification or so.  On the other hand, you can post events to network thread to init/shutdown the pool.
Comment 42 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-29 18:50:38 PST
Created attachment 577831 [details] [diff] [review]
Part 1: Move certificate validation code to new file, preserving history [v2]

This patch applies fine in my version of hg so I didn't split it up.
Comment 43 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-29 18:51:31 PST
Created attachment 577832 [details] [diff] [review]
Part 2: Everything else (sorry) [v5]
Comment 44 Kai Engert (:kaie) 2011-11-30 08:26:37 PST
+// Called when the STS thread finishes, to destroy the thread pool. Since the
+// STS service has stopped processing events, it will not attempt any more SSL
+// I/O operations, so it is clearly safe to shut down the SSL cert verification
+// infrastructure. Also, the STS will not dispatch many SSL verification result

typo? dispatch *any* ?


+// events at this point, so any pending cert verifications will (correctly) fail
+// at the point they are dispatched dispatched.

typo: double "dispatched"



+SECStatus
+AuthCertificateHook(void *arg, PRFileDesc *fd, PRBool checkSig, PRBool isServer)
+{
+  // TODO: Assert that we're on the socket transport thread
+  PR_LOG(gPIPNSSLog, PR_LOG_DEBUG,
+         ("[%p] starting AuthCertificateHook\n", fd));
+      
+  CERTCertificate *serverCert = SSL_PeerCertificate(fd);


You ignore the isServer value, and instead hardcode to !isServer.
Let's not remove wisdom from the code.

Please carry the bool around and keep the code that
dynamically decides about the usage.

    /* this may seem backwards, but isn't. */
    certUsage = isServer ? certUsageSSLClient : certUsageSSLServer;
    certificateusage = isServer ? certificateUsageSSLClient : certificateUsageSSLServer;
Comment 45 Kai Engert (:kaie) 2011-11-30 08:56:54 PST
-    if (port == 443) {
-      params[0] = host.get();
-    }
-    else {
-      hostWithPort = host;
+    hostWithPort.AssignASCII(host);


Why do you change this to ASCII? What about hostnames with international characters?
Comment 46 Honza Bambas (:mayhemer) 2011-11-30 10:47:20 PST
(In reply to Kai Engert (:kaie) from comment #45)
> -    if (port == 443) {
> -      params[0] = host.get();
> -    }
> -    else {
> -      hostWithPort = host;
> +    hostWithPort.AssignASCII(host);
> 
> 
> Why do you change this to ASCII? What about hostnames with international
> characters?

'host' is in ASCII here: 
http://hg.mozilla.org/mozilla-central/annotate/9ae1d4f44b8b/netwerk/protocol/http/HttpBaseChannel.cpp#l143
http://hg.mozilla.org/mozilla-central/annotate/ca140190529a/netwerk/protocol/http/nsHttpConnectionMgr.cpp#l1361
http://hg.mozilla.org/mozilla-central/annotate/ca140190529a/netwerk/base/src/nsSocketTransport2.cpp#l978

-> nsNSSSocketInfo::mHostName is filled with value from nsHttpConnectionInfo::Host().
Comment 47 Kai Engert (:kaie) 2011-11-30 10:59:09 PST
When testing https://kuix.de:9449 I run into:

###!!! ASSERTION: GetErrorMessage called for cert error without cert: 'mErrorMessageType != CertErrorMessage || (mSSLStatus != nsnull && mSSLStatus->mServerCert)', file ...../mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp, line 482

(note this was patch v4)
Comment 48 Kai Engert (:kaie) 2011-11-30 11:03:13 PST
I read through all of your patch, and it looks like a great solution and a very good patch, nicely written comments.

For the large blocks of code that moved (such as the error processing code), I didn't do a careful comparison to find changes. If you would like to have changes hidden in there reviewed, it might make sense to attach manual diffs of such blocks.
Comment 49 Kai Engert (:kaie) 2011-11-30 11:04:42 PST
I've done local testing, and can confirm OCSP through a proxy still works, too.

(I doesn't work if the proxy requires authentication, but I remember you said you're already aware of that regression - which exists independent of this patch.)
Comment 50 Honza Bambas (:mayhemer) 2011-11-30 11:06:36 PST
(In reply to Kai Engert (:kaie) from comment #44)
> +SECStatus
> +AuthCertificateHook(void *arg, PRFileDesc *fd, PRBool checkSig, PRBool
> isServer)
> +{
> +  // TODO: Assert that we're on the socket transport thread
> +  PR_LOG(gPIPNSSLog, PR_LOG_DEBUG,
> +         ("[%p] starting AuthCertificateHook\n", fd));
> +      
> +  CERTCertificate *serverCert = SSL_PeerCertificate(fd);
> 
> 
> You ignore the isServer value, and instead hardcode to !isServer.
> Let's not remove wisdom from the code.

There is no support for an SSL server in Necko and I believe to introduce it would lead to changes to PSM as well, and also probably to this particular piece of code we discuss here too.  

But that doesn't necessarily mean to remove the related code from PSM.  I personally don't have that strong opinion on this, but preserving it is wiser.
Comment 51 Honza Bambas (:mayhemer) 2011-11-30 11:08:07 PST
(In reply to Kai Engert (:kaie) from comment #48)
> For the large blocks of code that moved (such as the error processing code),
> I didn't do a careful comparison to find changes. If you would like to have
> changes hidden in there reviewed, it might make sense to attach manual diffs
> of such blocks.

Kai, I did a manual line to line compare and all is ok with that patch.
Comment 52 Honza Bambas (:mayhemer) 2011-11-30 11:11:21 PST
(In reply to Kai Engert (:kaie) from comment #47)
> When testing https://kuix.de:9449 I run into:
> 
> ###!!! ASSERTION: GetErrorMessage called for cert error without cert:
> 'mErrorMessageType != CertErrorMessage || (mSSLStatus != nsnull &&
> mSSLStatus->mServerCert)', file
> ...../mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp, line 482
> 
> (note this was patch v4)

With v5 I'm getting "Peer's Certificate has been revoked." page, no assertion failures, is that expected?
Comment 53 Kai Engert (:kaie) 2011-11-30 11:18:12 PST
(In reply to Honza Bambas (:mayhemer) from comment #52)
> 
> With v5 I'm getting "Peer's Certificate has been revoked." page, no
> assertion failures, is that expected?

Yes, "revoked" is expected, that's a cert that is supposed to trigger one of our internal cert blacklists.

If the assertion is gone, great. I should try to apply the newer patches.
Comment 54 Kai Engert (:kaie) 2011-11-30 12:38:05 PST
(In reply to Kai Engert (:kaie) from comment #53)
> 
> If the assertion is gone, great. I should try to apply the newer patches.

doesn't apply cleanly. I applied part1-v5, worked fine, then tried part2-v5, failed:

error: patch failed: security/manager/ssl/src/nsNSSIOLayer.cpp:292
error: security/manager/ssl/src/nsNSSIOLayer.cpp: patch does not apply
error: patch failed: security/manager/ssl/src/nsNSSIOLayer.h:159
error: security/manager/ssl/src/nsNSSIOLayer.h: patch does not apply
Comment 55 Honza Bambas (:mayhemer) 2011-11-30 13:36:45 PST
Created attachment 578064 [details] [diff] [review]
Part 2: [v5] - merged

This is the merged patch that I have in my queue (was just missing update to OverridableCertError).
Comment 56 Honza Bambas (:mayhemer) 2011-11-30 14:48:32 PST
Created attachment 578098 [details] [diff] [review]
Part 2: [v5] - merged

qref...
Comment 57 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-30 15:11:15 PST
(In reply to Honza Bambas (:mayhemer) from comment #50)
> (In reply to Kai Engert (:kaie) from comment #44)
> > You ignore the isServer value, and instead hardcode to !isServer.
> > Let's not remove wisdom from the code.
> 
> But that doesn't necessarily mean to remove the related code from PSM.  I
> personally don't have that strong opinion on this, but preserving it is
> wiser.

Normally, I would do such cleanup in a separate bug, but I did in as part of this patch in this bug for a few reasons:

* SSL_RestartHandshakeAfterAuthCertificate, the foundation of all of this, is documented as only working for clients, because nobody has contributed the server-side code for it. (In fact, in the current libssl implementation of ssl_HandleCertificate, for the server role we treat SECWouldBlock from the auth certificate callback as an error and unconditionally fail with a "feature not implemented for servers" error code.)

* I had to create a new functions; these functions would have had to have their own isServer parameters, which just perpetuates the mistaken notion that the code could possibly work for the server role.

* By making the assumption that this is all client-side code, we can simplify other parts of the patch. For example, we don't have to guard against the peer's hostname being NULL or empty when calling CERT_VerifyCertName, because this isn't allowed in the client role. Making the this change made it clearer that it was OK to use nsNSSSocketInfo::GetHostName() the way I use it in this patch.

I am against the idea of adding this logic back at all (even in a separate bug), because:

* The isServer cases are all dead code, like Honza noted. It isn't clear how many bugs there would be in this code, if it were to execute in the server role, and we have no realistic way of testing it in the server role. It's better to just be clear and accept reality that Necko and PSM is client-only code.

* There *are* other parts of PSM that assume the client role, especially the code that configures the socket. E.g. the part that configures the socket does not configure the session cache, and this would lead to crashes if we were to operate PSM in the server role.
Comment 58 Honza Bambas (:mayhemer) 2011-11-30 15:18:14 PST
(In reply to Brian Smith (:bsmith) from comment #57)
> * There *are* other parts of PSM that assume the client role, especially the
> code that configures the socket. E.g. the part that configures the socket
> does not configure the session cache, and this would lead to crashes if we
> were to operate PSM in the server role.

Yes, there would be other work to do if we would want to implement SSL server support in Necko/PSM.

Then, please make the callback return SECFailure unconditionally on isServer == true and comment that the callback is valid only for client socket implementation.
Comment 59 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-11-30 18:09:08 PST
Created attachment 578150 [details] [diff] [review]
Part 2: Everything else (sorry) [v6]

Honza, please review the interdiff, which implements the extra checks you asked for.

This is building on try now:
https://tbpl.mozilla.org/?tree=Try&rev=c92d042ba24e
Comment 60 Kai Engert (:kaie) 2011-12-01 08:52:14 PST
Are you sure besides the list of patches in comment 26, nothing else is necessary? patch v6 still doesn't apply for me (conflicts in nsNSSIOLayer)
Comment 61 Kai Engert (:kaie) 2011-12-01 08:56:04 PST
Has anybody tested Thunderbird with this patch already?
Comment 62 Honza Bambas (:mayhemer) 2011-12-01 09:56:21 PST
(In reply to Kai Engert (:kaie) from comment #60)
> Are you sure besides the list of patches in comment 26, nothing else is
> necessary? patch v6 still doesn't apply for me (conflicts in nsNSSIOLayer)

Apply the newer version of bug 703508 patch.

(In reply to Kai Engert (:kaie) from comment #61)
> Has anybody tested Thunderbird with this patch already?

I'll do it.
Comment 63 Honza Bambas (:mayhemer) 2011-12-01 10:00:52 PST
Comment on attachment 578150 [details] [diff] [review]
Part 2: Everything else (sorry) [v6]

r=honzab for the interdiff.
Comment 64 Honza Bambas (:mayhemer) 2011-12-01 11:54:44 PST
(In reply to Honza Bambas (:mayhemer) from comment #62)
> > Has anybody tested Thunderbird with this patch already?
> 
> I'll do it.

Works perfectly for me.
Comment 65 Kai Engert (:kaie) 2011-12-01 12:32:31 PST
I've built thunderbird trunk, and I've tested it with my primary work profile, which uses multiple IMAP/SSL accounts, and also using the nightly lightning calendar plugin, and my profile is setup to use two calendars over https.

I couldn't find any issues caused by the SSL thread removal patch.

(Both with and without this patch I see a busy loop while loading the calendars, it's painfully slow, might be an unrelated regression.)
Comment 66 Kai Engert (:kaie) 2011-12-01 13:12:39 PST
Some superreview level comments:

In the past there was resistance (not from me) to put any PSM related code into any other module such as core Necko. I guess this resistance is gone?

With your patch, I believe for the first time, there will be a direct function call from Necko to PSM (not using XPCOM interfaces). If that's OK with the Necko team, it's OK with me, too.

This
  // XXX: There is no good header file to put these in. :(
isn't clean.

If I were a strict superreviewer, I would have to ask you to find a better solution.

A better solution would be:
- define a new interface like nsIPSMCertVerifyThreadPool (as a service)
- add this IDL file somewhere in Necko
- implement it inside PSM
  (not using any of the XPCOM init macros that trigger NSS init,
   just one of the standard ones)
- instead of a direct function call, create the service,
  which inits the pool
- inside PSM, query the service
Comment 67 Honza Bambas (:mayhemer) 2011-12-01 13:21:02 PST
(In reply to Kai Engert (:kaie) from comment #66)
> Some superreview level comments:
> 
> In the past there was resistance (not from me) to put any PSM related code
> into any other module such as core Necko. I guess this resistance is gone?
> 
> With your patch, I believe for the first time, there will be a direct
> function call from Necko to PSM (not using XPCOM interfaces). If that's OK
> with the Necko team, it's OK with me, too.
> 
> This
>   // XXX: There is no good header file to put these in. :(
> isn't clean.
> 
> If I were a strict superreviewer, I would have to ask you to find a better
> solution.
> 
> A better solution would be:
> - define a new interface like nsIPSMCertVerifyThreadPool (as a service)
> - add this IDL file somewhere in Necko
> - implement it inside PSM
>   (not using any of the XPCOM init macros that trigger NSS init,
>    just one of the standard ones)
> - instead of a direct function call, create the service,
>   which inits the pool
> - inside PSM, query the service


I didn't like this too, but we could handle it in a follow up.  

I think this should be initialized in PSM, not in Necko.  Necko is stupid from this point of view and doesn't (have to) know there is some pool for verifying whatever.

Only argument I have seen valid is to have better thread synchronization of the gCertVerificationThreadPool global - access it just on the network thread.

But that can be solved with sending an event to that thread or by introducing a callback that network thread invokes during the start and the end.

All of this is relatively complicated and we should find a really good way to architecture this.  I think followup is fine for these reasons (I will submit it now).
Comment 68 Patrick McManus [:mcmanus] 2011-12-01 13:36:00 PST
so close I can taste it :)

thanks!
Comment 69 Kai Engert (:kaie) 2011-12-01 13:40:39 PST
Comment on attachment 577831 [details] [diff] [review]
Part 1: Move certificate validation code to new file, preserving history [v2]

sr=kaie
Comment 70 Kai Engert (:kaie) 2011-12-01 13:42:57 PST
Comment on attachment 578150 [details] [diff] [review]
Part 2: Everything else (sorry) [v6]

I'm ok with solving this in a follow up bug, but I'd like to ask you to please address the follow up bug quickly, should someone kick us for landing it in this state.

sr=kaie
Comment 72 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-12-01 16:21:29 PST
(In reply to Kai Engert (:kaie) from comment #70)
> Part 2: Everything else (sorry) [v6]
> 
> I'm ok with solving this in a follow up bug, but I'd like to ask you to
> please address the follow up bug quickly, should someone kick us for landing
> it in this state.

Kai, what is "this" that you are referring to? Bug 705755?
Comment 73 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-12-01 16:26:45 PST
I see you referring to bug 706955. (Our email was down, didn't notice all the comments that occurred right before I checked in.)
Comment 74 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-12-14 22:57:08 PST
*** Bug 634084 has been marked as a duplicate of this bug. ***
Comment 75 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2011-12-15 05:05:59 PST
*** Bug 642697 has been marked as a duplicate of this bug. ***
Comment 76 Kai Engert (:kaie) 2011-12-18 07:49:39 PST
I would like to remind you that this bug depends on completing NSS bug 542832 in time.

It's necessary to use a released version of NSS that contains the code from bug 542832. This requires that the work in that bug gets completed.

Given that the current mozilla-central cycle is nearing its end, I would like to remind you that shipping an unreleased new NSS API in a Mozilla release it not acceptable.

I understand it would be a nightmare for you to back out the SSL thread removal, so please make sure that you will be allowed to update Aurora with the full official release of NSS, probably 3.12.2, or back out. (Even this Aurora path requires that you get bug 542832 done soon.)

Note You need to log in before you can comment on or make changes to this bug.