Closed Bug 358248 Opened 18 years ago Closed 18 years ago

MP SSL Cache thread LockPoller is never stopped

Categories

(NSS :: Libraries, defect, P1)

x86
Linux

Tracking

(Not tracked)

VERIFIED FIXED
3.11.5

People

(Reporter: rcritten, Assigned: wtc)

Details

Attachments

(1 file, 2 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1b2) Gecko/20060910 SeaMonkey/1.1b
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1b2) Gecko/20060910 SeaMonkey/1.1b

Tested on Fedora Core 4 running NSPR 4.6.3, NSS 3.11.3 and mod_nss 1.0.5. It crashes in a similar way on Solaris 9 and 10.

A standard multi-process NSS server program starts up something like this:

PR_Init();
SSL_ConfigMPServerSIDCache();
NSS_Initialize();
(and setup policy, ciphers, etc).

And then at some point it might shut down with:

SSL_ShutdownServerSessionIDCache();
NSS_Shutdown();

The call to SSL_ConfigMPServerSIDCache() launches a new thread, LockPoller, that wakes up every 30 seconds and checks the value of cache->sharedCache->stopPolling to see if it should exit. If it is ever PR_TRUE then the thread will exit gracefully.

This condition will NEVER be true. Nowhere in the NSS code is this set to PR_TRUE.

What I see with mod_nss is memory corruption and eventually a core when I attempt to restart apache with httpd -k restart. 

What happens with a restart in Apache is all modules are unloaded and reloaded. In the  case of mod_nss this causes the shutdown routines to run (as above) and then another initialization of NSS>

On a few occasions Apache crashed in LockPoller testing the value of cache->stopPolling. As I looked at this more I determined that even with a graceful shutdown this thread will never go away.

To test this I set the value of the timeout in LockPoller to 1 second and then add this to CacheClose() at the very top:

     cache->sharedCache->stopPolling = PR_TRUE;
     PR_Sleep(PR_SecondsToInterval(2));

So obviously this is not graceful code but it does make mod_nss stop crashing on restarts by giving LockPoller a chance to quit before continuing the cache cleanup.

In summary, the problem is that the LockPoller continues to run after shutting down the cache. So the next time it comes out of the PR_Sleep() it craps out because its memory has been freed.

Reproducible: Always

Steps to Reproduce:
1. Build and install mod_nss 
2. Start Apache
3. Restart Apache with httpd -k restart

Actual Results:  
core dump. The parent httpd process will die but its children will live.

Expected Results:  
No crash.
Status: UNCONFIRMED → NEW
Ever confirmed: true
This bug is that SSL_ShutdownServerSessionIDCache does not
stop and join with the LockPoller thread.  It is not a
duplicate of bug 339466, which is the problem with forking
caused by the LockPoller thread.  The fixes for these two
bugs could share the code for stopping and joining with the
LockPoller thread.
Attached patch Proposed patch v1 (obsolete) — Splinter Review
Rob, please test this patch.  Thanks.
Assignee: nobody → wtchang
Priority: -- → P1
Target Milestone: --- → 3.11.5
Comment on attachment 245473 [details] [diff] [review]
Proposed patch v1

>+/* Stop the thread that polls cache for expired locks */
>+static SECStatus 
>+StopLockPoller(cacheDesc *cache)
>+{
>+    if (cache->poller) {

I'm about 99% sure that should be  
      if (!cache->poller) {

>+	return SECSuccess;
>+    }
Attachment #245473 - Attachment description: Proposed patch → Proposed patch v1
Attachment #245473 - Flags: review-
Attached patch Proposed patch v2 (obsolete) — Splinter Review
Nelson, thank you for catching the error.

Rob, please test this new patch.
Attachment #245473 - Attachment is obsolete: true
patch v2 is working ok for me in some limited tested with Apache and mod_nss. I can now restart the server without crashing when using the forked model.
Comment on attachment 245501 [details] [diff] [review]
Proposed patch v2

Rob, thank you for testing this patch.

Bob, Nelson, please review this patch.  Please speak out if you think
this patch needs more testing and should not go into NSS 3.11.4.

I wrote the new StopLockPoller function to be the opposite of
LaunchLockPoller.  In particular, since SSL_ConfigMPServerSIDCache
calls LaunchLockPoller last, I have SSL_ShutdownServerSessionIDCache
call StopLockPoller first.

Let me know if you want me to change "StopLockPoller" to
"ShutdownLockPoller".
Attachment #245501 - Flags: superreview?(nelson)
Attachment #245501 - Flags: review?(rrelyea)
Comment on attachment 245501 [details] [diff] [review]
Proposed patch v2

r+ assuming that is it not a problem for us to have a joinable thread hanging around from the life of the process. (This can happen if the PULL time is set to '0' in the environment variable and LockPoller exits immediately) - Thread state is kept until the Join later when the system is shutdown.

bob
Attachment #245501 - Flags: review?(rrelyea) → review+
Comment on attachment 245501 [details] [diff] [review]
Proposed patch v2

I share Bob's concern about what happens when the envariable
NSS_SSL_SERVER_CACHE_MUTEX_TIMEOUT is set to 0. 
If the thread is going to be joinable, then we should test this 
envariable before launching, and avoid the launch when it's zero.
Attachment #245501 - Flags: superreview?(nelson) → superreview-
I implemented Nelson's suggested change.  Rob, could you please test
this patch?
Attachment #245501 - Attachment is obsolete: true
Attachment #246729 - Flags: superreview?(rrelyea)
Attachment #246729 - Flags: review?(nelson)
Comment on attachment 246729 [details] [diff] [review]
Proposed patch v3

I forgot to describe this patch.

To pass the value of the environment variable
NSS_SSL_SERVER_CACHE_MUTEX_TIMEOUT to the LockPoller thread,
I added a new field 'mutexTimeout' to the cacheDescStr
structure.

Note that the comment "Only the private copies of these pointers are valid"
may need to be updated because of the addition of 'mutexTimeout'.
I don't know how to update the comment because there is already a
non-pointer structure field 'shared'.

The name 'mutexTimeout' was chosen after the environment
variable NSS_SSL_SERVER_CACHE_MUTEX_TIMEOUT.  Suggestions
of a better name are welcome.

So the local variable 'expiration' is now simply an alias of
cache->mutexTimeout.  I kept the local variable 'expiration'
to minimize the diffs.  I can replace 'expiration' by
cache->mutexTimeout if you want.

I could initialize cache->mutexTimeout to SID_LOCK_EXPIRATION_TIMEOUT
in the InitCache function rather than in the LockPoller function.
I am initializing cache->mutexTimeout to SID_LOCK_EXPIRATION_TIMEOUT
in LockPoller to stay similar to the original code.
Comment on attachment 246729 [details] [diff] [review]
Proposed patch v3

r=nelson
Attachment #246729 - Flags: review?(nelson) → review+
Attachment #246729 - Flags: superreview?(rrelyea) → superreview+
I checked in the patch on the NSS trunk (NSS 3.12) and NSS_3_11_BRANCH
(NSS 3.11.5).  Rob, please verify this bug by testing the current
NSS_3_11_BRANCH.

Checking in sslsnce.c;
/cvsroot/mozilla/security/nss/lib/ssl/sslsnce.c,v  <--  sslsnce.c
new revision: 1.39; previous revision: 1.38
done

Checking in sslsnce.c;
/cvsroot/mozilla/security/nss/lib/ssl/sslsnce.c,v  <--  sslsnce.c
new revision: 1.36.2.3; previous revision: 1.36.2.2
done
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
I tested NSS_3_11_BRANCH and Apache 2.0.54. It works for me.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: