Closed
Bug 291982
Opened 19 years ago
Closed 19 years ago
NSPR Crash in prtpool io_wstart due to corrupt job list
Categories
(NSPR :: NSPR, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
4.6
People
(Reporter: kamil, Assigned: wtc)
Details
Attachments
(1 file)
1.10 KB,
patch
|
asa
:
approval-aviary1.1a1+
asa
:
approval1.8b2+
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0 This is against the 4.4.1 code base. The problem occurs if nspr detects that a job has been cancelled at the top of its poll loop. The nspr job queue is driven by a loop that executes poll. Each job on and fd is placed in a poll set, and this set is used to drive work on those jobs. The problem has to do with the initialization and cleanup of the jobs that drive the poll set. Nspr uses a circular doubly linked list of jobs for its job queue. A marker value is kept at the head to indicate start and end. This queue is protected by a lock. This job queue is walked to create the poll set. When jobs are cancelled, they must be removed from the job queue. Removal is not handled by the Cancel function, but rather removal is done by the poll loop. This is to keep the list consistent across unlocked calls to PR_Poll. The linked list is managed by the PRCList functions out of pr/inlcude/prclist.h. Initial the list looks like this: L (next = L, prev = L); We insert an accept job (J1) into the list, and the list now looks like this: L (next = J1, prev = J1) J1 (next = L, prev = L) Now we hit the following code (from pr/src/misc/prtpool.c @ approx line 361, nspr 4.4.1). This is in the poll loop: PR_Lock(tp->ioq.lock); for (qp = tp->ioq.list.next; qp != &tp->ioq.list; qp = qp->next) { jobp = JOB_LINKS_PTR(qp); if (jobp->cancel_io) { CANCEL_IO_JOB(jobp); continue; } if (pollfds_used == (pollfd_cnt)) break; pollfds[pollfds_used].fd = jobp->iod->socket; pollfds[pollfds_used].in_flags = jobp->io_poll_flags; pollfds[pollfds_used].out_flags = 0; polljobs[pollfds_used] = jobp; pollfds_used++; } Where CANCEL_IO_JOB is the following MACRO (also from prtpool.c): #define CANCEL_IO_JOB(jobp) \ PR_BEGIN_MACRO \ jobp->cancel_io = PR_FALSE; \ jobp->on_ioq = PR_FALSE; \ PR_REMOVE_AND_INIT_LINK(&jobp->links); \ tp->ioq.cnt--; \ PR_NotifyCondVar(jobp->cancel_cv); \ PR_END_MACRO The problem occurs where J1 has just been cancelled and is still in the queue. The poll loop has a notify descriptor that can be used to wake Poll if an event occurs. Cancelation proceeds by setting jobp->cancel_io true, and waking Poll. The case we are encountering here, is one where the cancelation happens as the loop has released its lock on the job list, and is in the process of going back around to restart its loop and grabs the lock again. So the top loop looks like this: for (qp = tp->ioq.list.next; qp != &tp->ioq.list; qp = qp->next) { qp = j1 j1->cancel_io = true so we execute the CANCEL_IO_JOB macro. It is the call to PR_REMOVE_AND_INIT_LINK that gets us hosed. The PR_REMOVE_AND_INIT_LINK is as follows (pr/include/prclist.h @ approx line 111): /* ** Remove the element "_e" from it's circular list. Also initializes the ** linkage. */ #define PR_REMOVE_AND_INIT_LINK(_e) \ PR_BEGIN_MACRO \ (_e)->prev->next = (_e)->next; \ (_e)->next->prev = (_e)->prev; \ (_e)->next = (_e); \ (_e)->prev = (_e); \ PR_END_MACRO This makes our list look like this: L (next = L, prev = L) J1 (next = J1,prev = J1) Now if we return to our loop, you will notice that the loop termination condition is qp != &tp->ioq.list and the increment condition is qp = qp->next. qp != &tp->ioq.list simply means, that the qp pointer has not run back around to the list begin/end delimiter (which I called L above). But its the qp = qp->next that really gets us hosed. Once The PR_REMOVE_AND_INIT_LINK macro has be executed J1->next == J1. And if it wasn't for the fact that pollfd_cnt has been set earlier in the poll loop we would sit here and spin. But we don't. Instead, we fill the pollfd set with pointers to an fd and job info that is about to become invalid. These then get passed to the PRPoll call. How does this lead to a crash on close: Our code is doing the following: queueAccept Job (j1), cancelJob (j1) //we do this to stop the job joinJob (j1) //we do this so that we can cleanly //cleanup, and so that nspr //deletes its job datastructures. //(this deletes J1) close (j1.socket) //we do this to cleanup the socket. The point at which we crash is after the close call where the cancel has hit the NSPR bug described above. The Poll call still has in its poll set the fd of the now defunct job. When the fd is closed, the Poll call returns. It then atemptes to operate on a deleted job pointer. Specifically I can get it to reliably do: PR_Lock(tp->ioq.lock); if (jobp->cancel_io) { CANCEL_IO_JOB(jobp); <<< dies in here on the garbage CVar PR_Unlock(tp->ioq.lock); continue; } But wierd crashes down in PR_Poll also common. This crash is racy, but for some reason, in the debugger I can get to happen all the time. In any case, the fix is fixing nspr. I have been working on the 4.4.1 code base. Reproducible: Always Steps to Reproduce: Create a socket, listen on it, queue an accept job on it. Call CancelJob, JoinJob, and the PR_Close on the fd. Actual Results: The conditions are racy, but a crash in io_wstart or one of the functions it calls occurs. I have been able to get my test code to exhibit this consistently in visual C++. Expected Results: No crashed. Removed the pointer from the job list, without corrupting its poll set.
Assignee | ||
Comment 1•19 years ago
|
||
Thank you very much for the detailed bug report. Although I haven't tried to reproduce the bug, I understand your bug description and confirmed this is indeed a bug.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Target Milestone: --- → 4.6
Assignee | ||
Comment 2•19 years ago
|
||
I don't have the setup to test this fix, so please review and test it. The fix is to save qp->next in a variable at the beginning of each iteration of the for loop, so that it doesn't matter if qp->next is changed (to qp) in the middle of the loop. I found two for loops with this problem, so I fixed them both.
I applied the patch to prtpool.c, my code no longer crashes in the debugger. I stepped through the first loop as well, in a condition that would generate the fault. It nolonger occurs. Finally, I ran through the code change on paper to verify list consistency. It seems to be ok.
Assignee | ||
Comment 4•19 years ago
|
||
Comment on attachment 181905 [details] [diff] [review] Proposed patch Requesting NSPRPUB_PRE_4_2_CLIENT_BRANCH approval for aviary1.1a and mozilla1.8b2. The Mozilla clients are not using the API in question, so this fix has zero risk for the Mozilla clients. I wrote the patch and the bug submittor reviewed and tested it.
Attachment #181905 -
Flags: approval1.8b2?
Attachment #181905 -
Flags: approval-aviary1.1a?
Comment 5•19 years ago
|
||
Comment on attachment 181905 [details] [diff] [review] Proposed patch a=asa
Attachment #181905 -
Flags: approval1.8b2?
Attachment #181905 -
Flags: approval1.8b2+
Attachment #181905 -
Flags: approval-aviary1.1a?
Attachment #181905 -
Flags: approval-aviary1.1a+
Assignee | ||
Comment 6•19 years ago
|
||
Patch checked in on the NSPR trunk for NSPR 4.6. Checking in prtpool.c; /cvsroot/mozilla/nsprpub/pr/src/misc/prtpool.c,v <-- prtpool.c new revision: 3.9; previous revision: 3.8 done Patch checked in on the NSPRPUB_PRE_4_2_CLIENT_BRANCH for Mozilla 1.8 Beta 2. Checking in prtpool.c; /cvsroot/mozilla/nsprpub/pr/src/misc/prtpool.c,v <-- prtpool.c new revision: 3.5.4.4; previous revision: 3.5.4.3 done
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•