Closed Bug 291982 Opened 19 years ago Closed 19 years ago

NSPR Crash in prtpool io_wstart due to corrupt job list

Categories

(NSPR :: NSPR, defect)

x86
Windows 2000
defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kamil, Assigned: wtc)

Details

Attachments

(1 file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

This is against the 4.4.1 code base.

The problem occurs if nspr detects that a job has been cancelled at the top of
its poll loop.

The nspr job queue is driven by a loop that executes poll.  Each job on and fd
is placed in a poll set, and this set is used to drive work on those jobs.   The
problem has to do with the initialization and cleanup of the jobs that drive the
poll set.

Nspr uses a circular  doubly linked list of jobs for its job queue.  A marker
value is kept at the head to indicate start and end.  This queue is protected by
a lock.  This  job queue is walked to create the poll set.  When jobs are
cancelled, they must be removed from the job queue.  Removal is not handled by
the Cancel function, but rather removal is done by the poll loop.  This is to
keep the list consistent across unlocked calls to PR_Poll.

The linked list is managed by the PRCList functions out of pr/inlcude/prclist.h.

Initial the list looks like this:
  L (next = L, prev = L);
We insert an accept job (J1) into the list, and the list now looks like this:
  L (next = J1, prev = J1) 
  J1 (next = L, prev = L)
Now we hit the following code (from pr/src/misc/prtpool.c @ approx line 361,
nspr 4.4.1).
This is in the poll loop:

        PR_Lock(tp->ioq.lock);
        for (qp = tp->ioq.list.next; qp != &tp->ioq.list; qp = qp->next) {
            jobp = JOB_LINKS_PTR(qp);
            if (jobp->cancel_io) {
                CANCEL_IO_JOB(jobp);
                continue;
            }
            if (pollfds_used == (pollfd_cnt))
                break;
            pollfds[pollfds_used].fd = jobp->iod->socket;
            pollfds[pollfds_used].in_flags = jobp->io_poll_flags;
            pollfds[pollfds_used].out_flags = 0;
            polljobs[pollfds_used] = jobp;

            pollfds_used++;
        }

Where CANCEL_IO_JOB is the following MACRO (also from prtpool.c):
 
#define CANCEL_IO_JOB(jobp)                             \
                PR_BEGIN_MACRO                          \
                jobp->cancel_io = PR_FALSE;             \
                jobp->on_ioq = PR_FALSE;                \
                PR_REMOVE_AND_INIT_LINK(&jobp->links);  \
                tp->ioq.cnt--;                          \
                PR_NotifyCondVar(jobp->cancel_cv);      \
                PR_END_MACRO

The problem occurs where J1 has just been cancelled and is still in the queue. 
The poll loop has a  notify descriptor that can be used to wake Poll if an event
occurs. Cancelation proceeds by setting jobp->cancel_io true,  and waking Poll.
  The case we are encountering here, is one where the cancelation happens as the
loop has released its lock on the job list, and is in the process of going back
around to restart its loop and grabs the lock again.

So the top loop looks like this:
        for (qp = tp->ioq.list.next; qp != &tp->ioq.list; qp = qp->next) {
qp = j1
j1->cancel_io = true
so we execute the CANCEL_IO_JOB macro.  It is the call to
PR_REMOVE_AND_INIT_LINK that gets us hosed.
The PR_REMOVE_AND_INIT_LINK is as follows (pr/include/prclist.h @ approx line 111):

/*
** Remove the element "_e" from it's circular list. Also initializes the
** linkage.
*/
#define PR_REMOVE_AND_INIT_LINK(_e)    \
    PR_BEGIN_MACRO             \
    (_e)->prev->next = (_e)->next; \
    (_e)->next->prev = (_e)->prev; \
    (_e)->next = (_e);         \
    (_e)->prev = (_e);         \
    PR_END_MACRO

This makes our list look like this:
	L  (next = L, prev = L)
	J1 (next = J1,prev = J1)

Now if we return to our loop, you will notice that the loop termination
condition is qp != &tp->ioq.list
and the increment condition is qp = qp->next. qp != &tp->ioq.list simply means,
that the qp pointer
has not run back around to the list begin/end delimiter (which I called L
above).  But its
the qp = qp->next that really gets us hosed.  Once The PR_REMOVE_AND_INIT_LINK
macro has be executed
J1->next == J1.   And if it wasn't for the fact that pollfd_cnt has been set
earlier in the poll loop
we would sit here and spin.

But we don't.  Instead, we fill the pollfd set with pointers to an fd and job
info that is about to become invalid.   These then get passed to the PRPoll call.

How does this lead to a crash on close:
Our code is doing the following:  
queueAccept Job (j1),
cancelJob		(j1)		//we do this to stop the job
joinJob			(j1)		//we do this so that we can cleanly          
                                        //cleanup, and so that nspr 
					//deletes its job datastructures. 
                                        //(this deletes J1) 
close			(j1.socket) //we do this to cleanup the socket.

The point at which we crash is after the close call where the cancel has hit the
NSPR bug described above.  The Poll call still has in its poll set the fd of the
now defunct job.  When the fd is closed, the Poll call returns.  It then
atemptes to operate on a deleted job pointer.
Specifically I can get it to reliably do:
                    PR_Lock(tp->ioq.lock);
                    if (jobp->cancel_io) {
                        CANCEL_IO_JOB(jobp); 	<<< dies in here on the garbage CVar
                        PR_Unlock(tp->ioq.lock);
                        continue;
                    }
But wierd crashes down in PR_Poll also common.

This crash is racy, but for some reason, in the debugger I can get to happen all
the time.  In any case, the fix is fixing nspr.   I have been working on the
4.4.1 code base.


Reproducible: Always

Steps to Reproduce:
Create a socket, listen on it, queue an accept job on it.
Call CancelJob, JoinJob, and the PR_Close on the fd.

Actual Results:  
The conditions are racy, but a crash in io_wstart or one of the functions it
calls occurs.  I have been able to get my test code to exhibit this consistently
in visual C++.

Expected Results:  
No crashed.  Removed the pointer from the job list, without corrupting its poll set.
Thank you very much for the detailed bug report.
Although I haven't tried to reproduce the bug, I
understand your bug description and confirmed this
is indeed a bug.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Target Milestone: --- → 4.6
Attached patch Proposed patchSplinter Review
I don't have the setup to test this fix, so
please review and test it.

The fix is to save qp->next in a variable at the
beginning of each iteration of the for loop, so
that it doesn't matter if qp->next is changed (to
qp) in the middle of the loop.

I found two for loops with this problem, so I
fixed them both.
I applied the patch to prtpool.c, my code no longer crashes in the debugger.
I stepped through the first loop as well, in a condition that would generate the
fault.  It nolonger occurs.  Finally, I ran through the code change on paper to
verify list consistency.  It seems to be ok.
Comment on attachment 181905 [details] [diff] [review]
Proposed patch

Requesting NSPRPUB_PRE_4_2_CLIENT_BRANCH approval for
aviary1.1a and mozilla1.8b2.

The Mozilla clients are not using the API in question,
so this fix has zero risk for the Mozilla clients.

I wrote the patch and the bug submittor reviewed and
tested it.
Attachment #181905 - Flags: approval1.8b2?
Attachment #181905 - Flags: approval-aviary1.1a?
Comment on attachment 181905 [details] [diff] [review]
Proposed patch

a=asa
Attachment #181905 - Flags: approval1.8b2?
Attachment #181905 - Flags: approval1.8b2+
Attachment #181905 - Flags: approval-aviary1.1a?
Attachment #181905 - Flags: approval-aviary1.1a+
Patch checked in on the NSPR trunk for NSPR 4.6.

Checking in prtpool.c;
/cvsroot/mozilla/nsprpub/pr/src/misc/prtpool.c,v  <--  prtpool.c
new revision: 3.9; previous revision: 3.8
done

Patch checked in on the NSPRPUB_PRE_4_2_CLIENT_BRANCH
for Mozilla 1.8 Beta 2.

Checking in prtpool.c;
/cvsroot/mozilla/nsprpub/pr/src/misc/prtpool.c,v  <--  prtpool.c
new revision: 3.5.4.4; previous revision: 3.5.4.3
done
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: