SSL session cache locking issue with NT fibers

RESOLVED FIXED

Status

NSS
Libraries
RESOLVED FIXED
16 years ago
16 years ago

People

(Reporter: Julien Pierre, Assigned: Julien Pierre)

Tracking

x86
Windows NT

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments)

(Assignee)

Description

16 years ago
When NT fibers are used in an SSL server, there is a possibility of deadlock in
the current (3.4) implementation of the session cache. The code acquires a Win32
mutex which is a blocking system call. If the fiber yields after acquiring the
mutex (as a result of sleep or NSPR I/O for example), another fiber could be
scheduled on the same native thread and block indefinitely when trying to
acquire the same Win32 mutex. This would be a deadlock. It is somewhat of a
theoritical problem that hasn't been demonstrated to occur with the current
code, but it is conceivable in the future that if the cache gets modified this
might happen. For this reason it is   necessary to implement a robust locking
mechanism that works with fibers.

If the fiber application is single-process, the solution is to use a PRLock for
locking, as PRLocks are safe for use with NT fibers and won't cause deadlocks.
However, PRLocks are local to a process and cannot work if the NT fiber
application is multi-process. In that case, a 2-level mechanism for locking is
required. First, a PRLock is used to prevent the aforementioned problem with NT
fibers within a process. Then, the Win32 mutex can be safely acquired to make
sure that the operation is happening safely accross processes.

The implementation of this scheme however is a little bit complicated,
especially because the code is also shared with the Win95 implementation of locking.
The patch involves modifying the Win32 sslMutex object from a lone Win32 mutex
(which, as demonstrated above, isn't adequate for fiber applications) to a
structure that can contain both a PRLock and a Win32 mutex, as well as a flag
indicating whether the mutex is cross-process (shared) or not. This information
was already passed to sslMutex_init but is needed further at runtime so it is
now stored.

From then on there are 3 cases :
1) for a single-process application, on either Win95 or WINNT :
Only the PRLock in the structure gets used . The higher-level sslMutex_ calls
fall back to generic single_process_sslMutex calls.

2) for a multi-process application on Win95.
Only the Win32 mutex in the structure gets used . This is equivalent to the old
implementation and is adequate since fibers are not available on WINNT. The
Win32  mutex provides exclusion from any threads in any process and is therefore
adequate.

3) for a multi-process application on WINNT.

This is the most complicated case. It may end up being of somewhat academic use,
as no multi-process servers are currently known to use fibers on NT. selfserv is
in the process of being modified to do that but iWS probably won't be.

For this case, the parent process creates sslMutex objects containing only a
Win32 mutex in shared memory, at the time the multi-process session cache is
initialized.

When child processes of the application execute, the session cache inheritance
function makes a local copy of the sidCacheLock structures, and then updates
them to add a PRLock into the sslMutex object. This has to be done in the
high-level code in sslsnce.c as the lower-level code only manipulates sslMutex
objects which are initially created in shared memory. Fortunately this code is
very well isolated within an #ifdef WINNT .
Later on, when locking is needed, the PRLock is first acquired for fiber &
thread safety, then the Win32 mutex for process-safety.
(Assignee)

Comment 1

16 years ago
Created attachment 51290 [details] [diff] [review]
proposed patch
(Assignee)

Comment 2

16 years ago
I have attached a proposed patch. Please review and comment. I may not exactly
have preserved all the existing code style yet but don't expect this to be quite
the final patch for this yet because of its size.

In addition to the above-mentioned changes, I have also changed the Unix
implementations to always use a PRLock if the application is single-process. If
the application is multi-process, the code falls back to the older Unix
implementations of sslMutex. This should cause performance to improve on
single-process SSL servers for Unix.
(Assignee)

Comment 3

16 years ago
It appears there is still a problem right now on Unix (Solaris). Please hold off
the review. Thanks.
(Assignee)

Comment 4

16 years ago
Created attachment 51366 [details] [diff] [review]
new proposed patch
(Assignee)

Comment 5

16 years ago
The problem with Solaris was related to values I wasn't returning properly to
the higher level. I fixed the formatting to be more consistent with the rest of
the code, and to look OK in vi. I also added an implementation for "other"
platforms (non-Windows and non-Unix) that will work in single-process mode.

I just attached a new patch which solves all the known issues. I have tested it
with selfserv in single and multi-process on NT with fibers, NT with native
threads, Solaris, and AIX .

Comment 6

16 years ago
Marking NEW so it can get a review/sr= for the patch.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: patch, review
Keyser Sosez: This is an NSS change for servers, not going into mozilla,
not subject to the usual browser review rules.
(Assignee)

Comment 8

16 years ago
Created attachment 52159 [details] [diff] [review]
one more patch for review, hopefully good enough for check-in

Comment 9

16 years ago
Julien,

I reviewed your latest patch (attachment 52159 [details] [diff] [review]).  In general
your patch is good.  Most of my suggested changes are just
stylistic changes.

1. In single_process_sslMutex_Destroy(), we should return
   SECFailure after the PORT_SetError() call.

2. In SECStatus sslMutex_ChildInit(sslMutex *sem), it is
   better to rename the 'sem' parameter as 'pMutex'.  Also,
   not only the child processes but the parent process
   should also call this function, right?  I would suggest
   that you not add the new function sslMutex_ChildInit().
   We can just call single_process_sslMutex_Init().

3. In the Windows version of sslMutex_Destroy, the data type
   of 'retvalue' should be SECStatus.

4. In sslmutex.h, it is better to say
   #include "prtypes.h"
   #include "prlock.h"
   
   <foo.h> should only be used for standard headers.  NSPR is
   not there yet ;-)

5. In sslmutex.h, the Unix version of sslMutex,  it would be
   better to rename 'sslMutx' as something like 'sem' or 'sslSem'
   to suggest the fact that it is a sem_t.

6. In sslsnce.c, it is better to use PORT_Alloc() in place of malloc().
   We also need to make sure that the new WINNT sidCacheLocks are freed
   when we destroy the cache.


Updated

16 years ago
Attachment #52159 - Flags: review+
Attachment #52159 - Flags: needs-work+
(Assignee)

Comment 10

16 years ago
1. fixed
2. I have renamed the childinit function to sslMutex_2LevelInit, and created a
matching sslMutex_2LevelDestroy . This needed to be done in the parent as well
which was not the case, so I have fixed that now.
3. fixed
4. fixed
5. fixed
6. I have a comment in the code about the need to free the memory, when a
session cache free function is created
(Assignee)

Comment 11

16 years ago
Created attachment 52276 [details] [diff] [review]
last patch hopefully

Comment 12

16 years ago
Comment on attachment 52276 [details] [diff] [review]
last patch hopefully

This patch is good.  I only suggested one minor
change (make sslMutex_2LevelDestroy static).
Attachment #52276 - Flags: review+
(Assignee)

Comment 13

16 years ago
Checked in the fix.
Status: NEW → RESOLVED
Last Resolved: 16 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.