Open Bug 301231 Opened 19 years ago Updated 2 years ago

PR Thread pool code does not allow PR_Close to interrupt pending PR_read/PR_write

Categories

(NSPR :: NSPR, defect)

defect

Tracking

(Not tracked)

People

(Reporter: kamil, Unassigned)

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

I'm using the nspr 4.4.1 thread pool code patched to fix:
https://bugzilla.mozilla.org/show_bug.cgi?id=291982

I have a write job pending for an fd and I wish to close the fd.  There is no
way to do this (on debug bits) without exposing a race condition that leads to
an invalid io method assert.
(#3 0xfedf466c in PR_Assert (s=0xfee30328 "!\"I/O method is invalid\"",
file=0xfee30348 "priometh.c",).

The problem is as follows:
A write job is queued for an fd. Another thread wishes to close the fd, and
clean up the write job cleanly.  The job function looks approximately as follows:
void foo(void* bar){
  //do some stuff
  PR_Write(...)
  //do some more stuff
}
Also, assuming joinable jobs, there are three functions from the nspr api that
need to be used: PR_CancelJob, PR_JoinJob and PR_Close.

The closing function looks something like this:
PR_CancelJob
PR_Close
PR_JoinJob

Note that we can't Cancel, Join, then Close, because, if the write is sending a
 large buffer, we might wait for some time before we can clean up the fd and
actually call close.  We need PR_Close to close the fd and interrupt the write
in progress. 

The PR_Cancel will stop the job if it is still in the io queue. But once the job
is in its job function, it is no-longer on the io queue, and the cancel returns:
PR_INVALID_STATE_ERROR .  At this point it is not possible for the closing
thread to know if the job thread has called PR_Write yet, or not.  If it has not
then the closing thread must not call PR_Close before the job thread enters
PR_Write.  If it does, then the PR_write encounters an invalid fd and asserts
that the I/O error methods are invalid.  However, there is no way for the job
thread to notify the closing thread that it is entering PR_Write without leaving
open the window that between the notification and the PR_Write, PR_Close is
called.  Condvars, locking around PR_Write, atomic increments, none of these work.

The problem is, that in debug builds, an invalid fd causes an assert.  In
optimized builds it does the right thing, and simply returns an error (though
that error is just -1).

Reproducible: Sometimes

Steps to Reproduce:
Create an application that reads and writes from a socket using the thread pool
code.  Have a thread that tries  to close the sockets while reads and writes are
in progress.

This is a race condition, so it sometimes happens and sometimes does not,
depending on interleaving of threads.  It should be possible to generate it by
stopping threads in the debugger.
Actual Results:  
Assert.

Expected Results:  
_PR_InvalidInt should cause the operation to return PR_FAILURE and a failure
code in debug mode rather than asserting.

In optimized mode it returns -1.

This is also a problem on PR_Read, but generally PR_Read returns as much data as
is available on the fd, and is called once the socket has become readable, so
the chances that it will block for a long time are slim.  Thus it would be
acceptable to wait for it complete.  PR_Write does not have this feature, if you
write a big buffer.  Finally, while PR_Writev or some other function with a
timeout could be used but that would either beat the **** out of the system (if
the write was failing on timeout all the time) or cause close to wait sometimes
several seconds in cases where a socket needs to be dropped immediately.

I have seen this on xp and on solaris 9.
Kamil,

I'm not familiar with the NSPR job or thread pool API, so some of my comments
may not be applicable here. However, having done a lot of multithreaded
programming with NSPR sockets, I think that trying to interrupt the socket I/O
with PR_Close is not the right way to approach your problem. PR_Close not only
ends up calling the PRFileDesc's close method, which ends up calling the system
close() call on the socket, but it calls all the intermediate layers close
methods, which may cause freeing of memory. This makes it invalid to perform any
other operation on the PRFileDesc while PR_Close is going on, or after it. That
would probably mean that you should have no jobs pending on a PRFileDesc before
you call PR_Close .
There is no synchronization inherent to PR_Close . Your best bet is for your
application to ensure that only one thread is using the PRFileDesc before it
calls PR_Close, otherwise segmentation violations are likely to occur . If you
rewrote all the NSPR layers from top to bottom you could conceivably write them
in a way that allows the behavior you want (possibly by keeping some structures
around even after PR_Close is done), but the default I/O layers in NSPR/NSS just
don't do this. They put the responsibility on the caller to finish their I/O
before calling PR_Close .
Look into things like PR_Shutdown() and PR_Interrupt() .
Hi Pierre,
Thanks, for your suggestion.  I'm going to try slapping in a shutdown and see
what happens. 

In the underlying os a close call interrupts pending (blocked) read and write
calls, and that is what I want to happen.  It is in fact also what happens if
I'm already in PR_Write when I call PR_Close.  The optimized behavior of nspr
(returning an error when an invalid fd is used) is also fine, but in debug mode
it asserts.  I can't prevent this from happening via synchronization.  I'm
hoping (based on your suggestion) that shutdown will interrupt the PR_Write call
without rendering the socket invalid, and that then I will be able to call close
once the PR_JoinJob has completed.  If I were using bare os calls this would not
be a problem as I would get back an EBADF if I were using an invalid fd and I
could just keep going.

With respect to some of your other points: 
PR_Interrupt does not apply in this case, as the thread making the write call is
owned by the thread pool, and I can't use it.  Finally, if you only have one
thread at a time accessing the fd, how do you do simultaneous reads and writes?
 In my app, it is possible for one thread to be calling read and one to be
calling write on the same FD.  A third thread may also come in and wipe them
both out with a close.  The first two threads are not owned by me, but rather
the job pool.
PR_Close not only closes the underlying OS file
descriptor but also frees the memory associated
with the PRFileDesc structure.  This is why we
can't use PR_Close to abort pending I/O operations
on a fd.

The reason we must use an assertion failure to
report this in debug builds is that this is
a serious error.  We can only determine this
by reading freed memory (PRFileDesc).  Reading
freed memory is bad, so we must fail an assertion
when we detect that.

I thought about the problem you reported and agreed
that it is a real problem.  The only solution I can
come up with is for you to record the thread that's
doing the PR_Write in your job function, and call
PR_Interrupt on that thread after you call PR_CancelJob.

The only legal way to abort a pending I/O operation
in NSPR is PR_Interrupt.  Even if PR_Shutdown works,
it's undocumented behavior and it may not work on all
platforms.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
QA Contact: wtchang → nspr

The bug assignee didn't login in Bugzilla in the last 7 months and this bug has severity 'major'.
:KaiE, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: wtc → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(kaie)
Severity: major → S4
Flags: needinfo?(kaie)
You need to log in before you can comment on or make changes to this bug.