Open
Bug 301231
Opened 19 years ago
Updated 2 years ago
PR Thread pool code does not allow PR_Close to interrupt pending PR_read/PR_write
Categories
(NSPR :: NSPR, defect)
NSPR
NSPR
Tracking
(Not tracked)
NEW
People
(Reporter: kamil, Unassigned)
Details
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4 I'm using the nspr 4.4.1 thread pool code patched to fix: https://bugzilla.mozilla.org/show_bug.cgi?id=291982 I have a write job pending for an fd and I wish to close the fd. There is no way to do this (on debug bits) without exposing a race condition that leads to an invalid io method assert. (#3 0xfedf466c in PR_Assert (s=0xfee30328 "!\"I/O method is invalid\"", file=0xfee30348 "priometh.c",). The problem is as follows: A write job is queued for an fd. Another thread wishes to close the fd, and clean up the write job cleanly. The job function looks approximately as follows: void foo(void* bar){ //do some stuff PR_Write(...) //do some more stuff } Also, assuming joinable jobs, there are three functions from the nspr api that need to be used: PR_CancelJob, PR_JoinJob and PR_Close. The closing function looks something like this: PR_CancelJob PR_Close PR_JoinJob Note that we can't Cancel, Join, then Close, because, if the write is sending a large buffer, we might wait for some time before we can clean up the fd and actually call close. We need PR_Close to close the fd and interrupt the write in progress. The PR_Cancel will stop the job if it is still in the io queue. But once the job is in its job function, it is no-longer on the io queue, and the cancel returns: PR_INVALID_STATE_ERROR . At this point it is not possible for the closing thread to know if the job thread has called PR_Write yet, or not. If it has not then the closing thread must not call PR_Close before the job thread enters PR_Write. If it does, then the PR_write encounters an invalid fd and asserts that the I/O error methods are invalid. However, there is no way for the job thread to notify the closing thread that it is entering PR_Write without leaving open the window that between the notification and the PR_Write, PR_Close is called. Condvars, locking around PR_Write, atomic increments, none of these work. The problem is, that in debug builds, an invalid fd causes an assert. In optimized builds it does the right thing, and simply returns an error (though that error is just -1). Reproducible: Sometimes Steps to Reproduce: Create an application that reads and writes from a socket using the thread pool code. Have a thread that tries to close the sockets while reads and writes are in progress. This is a race condition, so it sometimes happens and sometimes does not, depending on interleaving of threads. It should be possible to generate it by stopping threads in the debugger. Actual Results: Assert. Expected Results: _PR_InvalidInt should cause the operation to return PR_FAILURE and a failure code in debug mode rather than asserting. In optimized mode it returns -1. This is also a problem on PR_Read, but generally PR_Read returns as much data as is available on the fd, and is called once the socket has become readable, so the chances that it will block for a long time are slim. Thus it would be acceptable to wait for it complete. PR_Write does not have this feature, if you write a big buffer. Finally, while PR_Writev or some other function with a timeout could be used but that would either beat the **** out of the system (if the write was failing on timeout all the time) or cause close to wait sometimes several seconds in cases where a socket needs to be dropped immediately. I have seen this on xp and on solaris 9.
Comment 1•19 years ago
|
||
Kamil, I'm not familiar with the NSPR job or thread pool API, so some of my comments may not be applicable here. However, having done a lot of multithreaded programming with NSPR sockets, I think that trying to interrupt the socket I/O with PR_Close is not the right way to approach your problem. PR_Close not only ends up calling the PRFileDesc's close method, which ends up calling the system close() call on the socket, but it calls all the intermediate layers close methods, which may cause freeing of memory. This makes it invalid to perform any other operation on the PRFileDesc while PR_Close is going on, or after it. That would probably mean that you should have no jobs pending on a PRFileDesc before you call PR_Close . There is no synchronization inherent to PR_Close . Your best bet is for your application to ensure that only one thread is using the PRFileDesc before it calls PR_Close, otherwise segmentation violations are likely to occur . If you rewrote all the NSPR layers from top to bottom you could conceivably write them in a way that allows the behavior you want (possibly by keeping some structures around even after PR_Close is done), but the default I/O layers in NSPR/NSS just don't do this. They put the responsibility on the caller to finish their I/O before calling PR_Close . Look into things like PR_Shutdown() and PR_Interrupt() .
Hi Pierre, Thanks, for your suggestion. I'm going to try slapping in a shutdown and see what happens. In the underlying os a close call interrupts pending (blocked) read and write calls, and that is what I want to happen. It is in fact also what happens if I'm already in PR_Write when I call PR_Close. The optimized behavior of nspr (returning an error when an invalid fd is used) is also fine, but in debug mode it asserts. I can't prevent this from happening via synchronization. I'm hoping (based on your suggestion) that shutdown will interrupt the PR_Write call without rendering the socket invalid, and that then I will be able to call close once the PR_JoinJob has completed. If I were using bare os calls this would not be a problem as I would get back an EBADF if I were using an invalid fd and I could just keep going. With respect to some of your other points: PR_Interrupt does not apply in this case, as the thread making the write call is owned by the thread pool, and I can't use it. Finally, if you only have one thread at a time accessing the fd, how do you do simultaneous reads and writes? In my app, it is possible for one thread to be calling read and one to be calling write on the same FD. A third thread may also come in and wipe them both out with a close. The first two threads are not owned by me, but rather the job pool.
Comment 3•19 years ago
|
||
PR_Close not only closes the underlying OS file descriptor but also frees the memory associated with the PRFileDesc structure. This is why we can't use PR_Close to abort pending I/O operations on a fd. The reason we must use an assertion failure to report this in debug builds is that this is a serious error. We can only determine this by reading freed memory (PRFileDesc). Reading freed memory is bad, so we must fail an assertion when we detect that. I thought about the problem you reported and agreed that it is a real problem. The only solution I can come up with is for you to record the thread that's doing the PR_Write in your job function, and call PR_Interrupt on that thread after you call PR_CancelJob. The only legal way to abort a pending I/O operation in NSPR is PR_Interrupt. Even if PR_Shutdown works, it's undocumented behavior and it may not work on all platforms.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Updated•18 years ago
|
QA Contact: wtchang → nspr
Comment 4•2 years ago
|
||
The bug assignee didn't login in Bugzilla in the last 7 months and this bug has severity 'major'.
:KaiE, could you have a look please?
For more information, please visit auto_nag documentation.
Assignee: wtc → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(kaie)
Updated•2 years ago
|
Severity: major → S4
Flags: needinfo?(kaie)
You need to log in
before you can comment on or make changes to this bug.
Description
•