Closed Bug 303123 Opened 19 years ago Closed 16 years ago

assert in notify_ioq

Categories

(NSPR :: NSPR, defect)

x86
Windows XP
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: kamil, Assigned: wtc)

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

This is on Windows XP running NSPR 4.4.1 (debug), patched for bug
https://bugzilla.mozilla.org/show_bug.cgi?id=291982.

My code does the following:
1. Connect a socket to a remote host.
2. Queue a read job and a write job.
3. When the read job or write job returns, call PR_Recv or PR_Send.  Each of
these calls is made with a timeout of PR_INTERVAL_NO_WAIT.
4. If PR_Recv or PR_Send fails with PR_IO_TIMEOUT_ERROR, then from the job
handler I queue another read job or write job.

The result is that eventually an assert is hit in prtpool.c on line
989 (in notify_ioq).

This appears to have something to do with io_pending already being set when
SendSocket is called from the bowels of notify_ioq.

Reproducible: Always

Steps to Reproduce:
As per above.  Queue a job from the handler of another job, on the same fd.
Actual Results:
Code asserts in NSPR.

Expected Results:
Job should have been queued.

Though it's not clear why a job might complete and then a call like PR_Recv or
PR_Send would return a timeout.
Hi Kamil,

Whenever you get PR_IO_TIMEOUT_ERROR or PR_PENDING_INTERRUPT_ERROR
in the WINNT configuration of NSPR, you need to handle it like this:

#include "priv/pprio.h"  // for PR_NT_CancelIo

    if (PR_Recv(fd, ...) == -1) {
        // PR_Recv failed
        PRErrorCode err = PR_GetError();
#ifdef WINNT
        if (err == PR_IO_TIMEOUT_ERROR || err == PR_PENDING_INTERRUPT_ERROR) {
            // make the socket usable again
            PR_NT_CancelIo(fd);
        }
#endif
        if (err == PR_IO_TIMEOUT_ERROR) {
            // queue another read job
        }
    }

The reason for the PR_NT_CancelIo call is explained in this
(out-of-date) tech note:
http://www.mozilla.org/projects/nspr/tech-notes/ntiotimeoutinterrupt.html

I suggest that you call PR_Recv and PR_Send with a
non-zero timeout.  In particular, for PR_Recv, you can
use PR_INTERVAL_NO_TIMEOUT.  This is because PR_Recv
will return as soon as some data are read.  You can't
use PR_INTERVAL_NO_TIMEOUT for PR_Send because PR_Send
(in blocking mode) will try to send the entire buffer
of data.

Note that if PR_Send times out, *some data* may have been
sent, but NSPR doesn't tell you how many bytes have been
sent, so you have no choice but to close the socket.  It is
okay to continue to use the socket after PR_Recv times out.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Hi Wan-Teh,

Thanks for your response.  It does, however, raise a couple of other questions:
1. Does a similar limitation exist in the other platform implementations of
NSPR?  If I run this code on Solaris, should I expect it to work without the
specified limitations?
2. Is there any way for me to find out how much data PR_Send is guaranteed to
accept without returning the timeout error?  It is important to me that I not
have to close the socket.
The need to call PR_NT_CancelIo only exists in the WINNT
configuration of NSPR.  It does not exist on other platforms
or the "WIN95" (generic WIN32) configuration.  This is why
the special code to recover from PR_IO_TIMEOUT_ERROR
or PR_PENDING_INTERRUPT_ERROR is ifdef'ed with WINNT.

The problem of not knowing how many bytes have been sent
when a *blocking* PR_Send fails with PR_IO_TIMEOUT_ERROR
or PR_PENDING_INTERRUPT_ERROR exists in all configurations
of NSPR.  Unfortunately there is no way for you to find out
how many bytes have been sent.
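
As a rough summary of the recovery rules stated so far, the sketch
below encodes them as a small decision function.  The enum and function
names are hypothetical and are not NSPR identifiers, and mapping every
case not discussed here to "report the error" is an assumption.

```c
/* Hypothetical names; they mirror, but are not, NSPR identifiers. */
typedef enum { OP_RECV, OP_SEND } io_op;
typedef enum { FAIL_TIMEOUT, FAIL_INTERRUPT, FAIL_OTHER } io_failure;
typedef enum {
    ACT_REQUEUE_READ,  /* socket still usable: queue another read job */
    ACT_CLOSE_SOCKET,  /* unknown number of bytes sent: close the socket */
    ACT_REPORT_ERROR   /* assumption: everything else goes to the caller */
} io_action;

/* Nonzero when a WINNT build would have to call PR_NT_CancelIo before
 * the socket can be used again.  Other configurations never need it. */
int needs_cancel_io(io_failure f)
{
    return f == FAIL_TIMEOUT || f == FAIL_INTERRUPT;
}

/* What to do after a *blocking* receive or send has failed. */
io_action next_action(io_op op, io_failure f)
{
    if (f == FAIL_OTHER)
        return ACT_REPORT_ERROR;
    if (op == OP_SEND)
        return ACT_CLOSE_SOCKET;  /* timeout or interrupt: byte count lost */
    /* Receive: a timeout is harmless, so the read can simply be retried. */
    return f == FAIL_TIMEOUT ? ACT_REQUEUE_READ : ACT_REPORT_ERROR;
}
```

A job handler could call next_action() after mapping the value from
PR_GetError() onto io_failure, and call PR_NT_CancelIo first whenever
needs_cancel_io() is nonzero in a WINNT build.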

You can try these options:
1. Use a very large timeout for PR_Send.  When a *blocking*
socket is writable, the only reason PR_Send may block is
that TCP flow control kicks in because the receiver can't
consume the data as fast.  This can happen but it rarely
happens.  If you use a large timeout, when PR_Send times
out, you know something is seriously wrong with the receiver,
so it is fine for you to close the socket.
2. Try setting the socket in non-blocking mode.  I don't
know if the prtpool code works with non-blocking sockets
though.
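
A minimal sketch of why option 2 could help, assuming the send call
reports how many bytes it accepted each time (as a non-blocking PR_Send
does, failing with PR_WOULD_BLOCK_ERROR when it can accept nothing).
The types, the SEND_WOULD_BLOCK value, and fake_send are hypothetical
stand-ins so the example runs without NSPR; whether prtpool itself
cooperates with non-blocking sockets remains untested here.

```c
#include <stddef.h>

/* Hypothetical "would block" result of the stand-in send function. */
#define SEND_WOULD_BLOCK (-2L)

/* Stand-in for a non-blocking send: returns the number of bytes
 * accepted (> 0), SEND_WOULD_BLOCK, or -1 on a hard error. */
typedef long (*send_fn)(void *ctx, const char *buf, size_t len);

typedef struct {
    const char *data;  /* whole message */
    size_t len;        /* total bytes to send */
    size_t sent;       /* bytes accepted so far */
} write_job;

typedef enum { JOB_DONE, JOB_REQUEUE, JOB_FAILED } job_status;

/* Push as much as the socket accepts and record progress in job->sent,
 * so a later write job can resume instead of closing the socket. */
job_status write_job_run(write_job *job, send_fn do_send, void *ctx)
{
    while (job->sent < job->len) {
        long n = do_send(ctx, job->data + job->sent, job->len - job->sent);
        if (n == SEND_WOULD_BLOCK)
            return JOB_REQUEUE;  /* queue another write job */
        if (n <= 0)
            return JOB_FAILED;
        job->sent += (size_t)n;
    }
    return JOB_DONE;
}

/* Test double: accepts at most `chunk` bytes per call until `budget`
 * is used up, then reports that it would block. */
typedef struct { size_t chunk; size_t budget; } fake_socket;

long fake_send(void *ctx, const char *buf, size_t len)
{
    fake_socket *s = (fake_socket *)ctx;
    size_t n = len < s->chunk ? len : s->chunk;
    (void)buf;
    if (s->budget == 0)
        return SEND_WOULD_BLOCK;
    if (n > s->budget)
        n = s->budget;
    s->budget -= n;
    return (long)n;
}
```

Because the job keeps its own count in job->sent, hitting the
would-block case loses no information.  The handler queues another
write job and resumes from that offset.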
QA Contact: wtchang → nspr
No NSPR bug was identified in this bug report.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → WORKSFORME