Closed Bug 100776 Opened 23 years ago Closed 19 years ago

PR_Connect fails inexplicably with localhost and in multi-threaded programs

Categories

(NSPR :: NSPR, defect, P3)

4.1.3
x86
OS/2
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pedemonte, Assigned: julien.pierre)

References

()

Details

Attachments

(5 files, 12 obsolete files)

20.56 KB, application/octet-stream
Details
3.86 KB, text/plain
Details
54.49 KB, patch
mkaply
: review+
Details | Diff | Splinter Review
6.62 KB, patch
Details | Diff | Splinter Review
1.46 KB, patch
Details | Diff | Splinter Review
When trying to connect to http://localhost (with no server running), we should
get a dialog that says 'connection was refused'.  This works on win32 and linux,
but on OS/2, nothing happens.  The throbber runs for a split second then
immediately stops, but no error message is produced.
This seems to be due to an error in the OS/2 TCP/IP implementation.  When doing
a select() on localhost, both win32 and linux will set the exception bit for
that socket, to let you know that you cannot connect to localhost.  On OS/2,
however, we set the read or write bit, but not the exception bit.  So mozilla
tries to read from it and fails, but does not pop up the dialog box.  

My workaround, then, was to call getsockopt when we check the read or write
bits.  If getsockopt says that there is an error for that socket, then we treat
it as an exception.  Otherwise, we treat it as a read or write.
Attached file os2sock.c file (obsolete) —
Attached file os2poll.c file (obsolete) —
Although the problem was mainly in os2poll.c, I also cleaned up os2sock.c, which
had a similar fix for select().  I attached the whole file, though, since there
were many changes.
Attached patch nsprpub\configure.in patch (obsolete) — Splinter Review
Also, added patch to make VACPP builds use OS/2 version of select, while EMX
builds still use the BSD select().
Hi Javier,

The function you need to fix is SocketConnectContinue()
in mozilla/nsprpub/pr/src/io/prsocket.c.  The OS/2
implementation for that function looks like this:

    if (out_flags & PR_POLL_EXCEPT) {
        int len = sizeof(err);
        if (getsockopt(osfd, (int)SOL_SOCKET, SO_ERROR, (char *) &err, &len)
                == SOCKET_ERROR) {
            _PR_MD_MAP_GETSOCKOPT_ERROR(WSAGetLastError());
            return PR_FAILURE;
        }
        if (err != 0) {
            _PR_MD_MAP_CONNECT_ERROR(err);
        } else {
            PR_SetError(PR_UNKNOWN_ERROR, 0);
        }
        return PR_FAILURE;
    }

It seems to me that we should remove the
"if (out_flags & PR_POLL_EXCEPT)".
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
wtc: I just tried messing around with that code, and I was not able to make it
work as a fix for this problem;  I could continue browsing, but the problem
still existed.  Besides, we want to get this fixed in TCP/IP, and if it is
fixed, the code I attached should still work properly.
wtc:

Can we get this one into NSPR?

See Javier's last comments.

Thanks
Javier wrote:
> This seems to be due to an error in the OS/2 TCP/IP implementation.  When doing
> a select() on localhost, both win32 and linux will set the exception bit for
> that socket, to let you know that you cannot connect to localhost.

This is not true.

Unix will mark the socket as both readable and writable if the
non-blocking connect failed.  Unix does not use the exception
bit for this purpose.  Winsock is the only socket implementation
that uses the exception bit to report a non-blocking connect
failure.

> On OS/2, however, we set the read or write bit, but not the exception bit.

This is consistent with the socket implementation on Unix.

This is NOT a bug in the OS/2 TCP/IP implementation, unless
OS/2 needs to be like Winsock.

Javier, what you need to fix is _PR_MD_CONNECT() and
SocketConnectContinue().  I also need you to attach diffs
as opposed to whole files.


Attachment #50093 - Flags: needs-work+
Attachment #50094 - Flags: needs-work+
Comment on attachment 50095 [details] [diff] [review]
nsprpub\configure.in patch

This patch is fine by me.  Michael, could you review this patch too?
Attachment #50095 - Flags: review+
wtc:
I changed the files according to your suggestions.  I removed the code that
handled exceptions differently in os2sock.c and os2poll.c, and made the XP_OS2
section in prsocket.c be more like the XP_UNIX section.  Hoever, this does not
fix the problem of connecting to localhost.  I get no error dialog, it just
stops loading.  I debugged this and it turns out that SocketConnectContinue
never gets hit when trying to load localhost, although _PR_MD_CONNECT and
_PR_MD_PR_POLL do get called.  So where do you suggest I put the exception
checking code?
You need to call getsockopt() in SocketConnectContinue (if
fd->secret->nonblocking is true) and _PR_MD_CONNECT (if
fd->secret->nonblocking is false).
Attached patch patch (obsolete) — Splinter Review
wtc:  I've created a patch based on your suggestions.  I now call getsockopt()
in SocketConnectContinue and _PR_MD_CONNECT.  However, I still get no error
message when connecting to localhost.  I think I also need to be calling
getsockopt in PR_POLL.  Am I missing something here?
Javier,

I implemented my solution against NSPR 4.1.2.
I tried it on Julien's OS/2 machine here and I
was able to get the error code from connect
whether I connected to a remote machine or the
localhost.

Since Julien has not installed autoconf on his
OS/2 machine, I can't build the version of NSPR
that the Mozilla client is using.  I modified
my solution for the NSPR client branch, which
is the patch I just attached (attachment 52007 [details] [diff] [review]).
Since I can't build it, it may need minor tweaking.
But it shows you in concrete terms what I think
is the right solution.

Please let me know if my patch works for you.
I am also wondering if you and I are using
different Winsock implementations.
No, this still won't work for the issue at hand.  Your patch fixes the exception
case in PR_MD_CONNECT.  However, when contacting 'localhost', it goes through
the PR_POLL code.  This is the code that needs to be fixed in this case.  I need
to be calling getsockopt in PR_POLL, when checking the read/write flags.
Also, in regards to your changes, is it correct to completely remove the
exception flag checking?  Is there no exception other than non-blocking
connection failure that could be set?
Javier,

I will have to give up on giving you advice on this
problem.  I am not familiar with OS/2.

Please find an OS/2 developer (for example, mkaply)
to review your patch, and I'll check it in for you.
Thanks.
Javier,

I'd like to understand this bug better. What code is falling back to PR_Poll
when connecting to localhost on OS/2 ? Is that a part of NSPR or some code in
the browser ?

Wan-Teh - it would appear that the NSPR socket code is using a Winsock library
that ships with OS/2, so in that sense it is supposed to behave like Windows.
I'm not clear why the decision was made to use that vs the BSD-like socket API
that is native to OS/2. Either API can be used but I think the Winsock.dll fro
OS/2 is an emulation library layered on top of the other socket API.

Julien:
It seems to me that the socket code that is actually trying to contact the
'localhost' server is in PR_POLL.  I have updated PR_MD_CONNECT and
SocketConnectionContinue properly, but I was still seeing the problem.  It was
only after I updated os2poll.c likewise that I saw the 'connection was refused'
error message.
In regards to your winsock statement, I believe you are referring to the code
segments that are #ifndef BSD_SELECT.  OS/2 supports two version of select: one
is the BSD version, and one is a native OS/2 version, which is more efficient. 
I have coded both versions, but for the Visual Age compiled mozilla we used the
native select, while the EMX/GCC compiler uses the more common BSD version.
Javier,

I wasn't referring to select but to the fact that you use things in your code
snippets like WSAGetLastError() which are part of the Winsock emulation library
of OS/2; ie from winsock.h.

When I wrote socket apps for OS/2 before I used more BSD-like functions . Eg for
the error codes it was sock_errno() , which is from sys/socket.h .

These two headers - winsock.h and sys/socket.h - which both come with the OS/2
toolkit, define two independent sets of socket APIs for OS/2. I believe it is
preferable to use the socket API in sys/socket.h over the one from winsock.h .
It is closer to the native OS/2 stack which is a port from the BSD TCP/IP stack
in AIX in the first place.

Julien,

We don't use any Winsock code at all for the OS/2 port. I'm not sure what you 
are referring to.
Mike,

I was referring to the code quoted by Wan-Teh above on 9/20 at 10:55 which
included a call to WSAGetLastError() in prsocket.c and was referred to as "OS/2
implementation". However I went back and checked the tree and that code is in a
block only for WIN16 or WIN32. The XP_OS/2 code is below and has the proper
implementation which calls sock_errno . Sorry for the confusion. I think Wan-Teh
probably meant that was the Windows implementation.

At 2001-10-03 21:24, I wrote:
> Javier,
>
> [...]
>
> I am also wondering if you and I are using
> different Winsock implementations.

Please disregard that comment.  I was just
passing along a comment of Julien's.

Attached patch another patch (obsolete) — Splinter Review
This is my latest working patch.  Everything works now.  I do have some
questions though.

wtc, in your patch, you removed the exceptions check in PR_MD_CONNECT and
replaced it with a call to getsockopt.  In my patch, I still have the exceptions
check, but when checking if the write bit is set, I call getsockopt to see if
there is an error on the socket.  Do I still need the exception check when using
BSD-socket model?

In this patch, I have now added a call to getsockopt when checking the read and
write bits.  If they have an error on the socket, I treat it as an exception.
Javier,

Although I said earlier that I have given up on
this bug, I still couldn't resist the temptation
to look at your latest patch.

You wrote:
> Do I still need the exception check when using
> BSD-socket model?

No.  BSD-socket model does not use the exception flag.
Only Winsock uses the exception flag.  BSD-socket
model would mark the socket as both readable and writable
if the non-blocking connect failed.

> In this patch, I have now added a call to getsockopt
> when checking the read and write bits.  If they have
> an error on the socket, I treat it as an exception.

This change to PR_Poll is what I have been objecting to.
PR_Poll should not call getsockopt.  It is
SocketConnectContinue and _PR_MD_CONNECT that should call
getsockopt.

Are you saying that the BSD-socket on OS/2 does not
mark the socket as readable and writable when the
non-blocking connect failed?  That would be a bug.

Attachment #50093 - Attachment is obsolete: true
Attachment #50094 - Attachment is obsolete: true
wtc:
OK, so I removed the exception code in _PR_MD_CONNECT, and wrote up PR_POLL like
it was previously.  Now, if I try to connect to a nonexistant URL (such as
www.lalksdfj.com), then I correctly get an error dialog.  However, I still do
not get a dialog to pop up if I try to connect to "localhost".  I am still
seeing the same thing.  Is this handled differently for some reason?
Attachment #51869 - Attachment is obsolete: true
Attachment #52275 - Attachment is obsolete: true
Priority: -- → P1
Target Milestone: --- → 4.2
Attached patch patch using unix socket code (obsolete) — Splinter Review
I went back and took as much of the unix code as I could, in both os2poll.c and
os2sock.c.  While this still does not fix the localhost issue, I like this code
better than what I had previously because it uses the enhancements found inthe
unix code.  Plus, it looks much cleaner.

wtc:  I am trying to debug the localhost issue on linux, but it uses pthreads. 
How can I build nspr on linux without pthreads?
Attachment #50095 - Attachment is obsolete: true
Attachment #52007 - Attachment is obsolete: true
Attachment #52952 - Attachment is obsolete: true
Attached patch removed _PR_MD_CONNECT hack (obsolete) — Splinter Review
Attachment #64044 - Attachment is obsolete: true
I have created a test program for PR_Connect . I couldn't get the NSPR test
build environment to work right so I used NSS coreconf build.

1. Using a regular browser build environment, create a directory and unzip this
file.

2. Run build.cmd . This should pull the latest NSPR and NSS and build them,
including all tools.

3. go to mozilla/security/nss/cmd/connect and run gmake

4. run the connect program from mozilla/dist/os22.45_icc_dbg.obj/bin

eg. connect localhost:1234 should show an error
if you have telnetd listening , connect localhost:23 should succeed
if you don't have an httpd listening, connect localhost:80 should fail

As far as I can tell, PR_Connect works as expected in all cases, so the bug has
to be in the browser code.

If you don't want to do the whole NSS build, just use the connect.exe from the
zip and run it against your existing NSPR DLLs.
Julien "Mad Brain" Pierre wants to own this bug.  His passion
about OS/2 really touched me.
Assignee: wtc → jpierre
Status: ASSIGNED → NEW
The test case I attached was using PR_Connect in blocking mode, which 
is the default. As Wan-Teh told me, the browser uses non-blocking mode. 
I'm attaching an updated connect.cpp that tests non-blocking mode. I 
have reproduced the problem using that test case, but have not yet 
traced through it.
Attached file non-blocking connect test case (obsolete) —
Wan-Teh told me there is a test for this in the NSPR tests, nbconn . I was not
able to get it to build however. See bug 119098 .
I ran "nbconn localhost", and I got this on both OS/2 and Linux (with 
classic nspr threads):

  host: localhost.austin.ibm.com
  Connect in progress
  PR_Poll returns 1
  PR_POLL_WRITE
  PR_GetConnectStatus: connect failed: (-5981, 10061)
  PASS

What does this mean?
Actually, for Linux (classic nspr threads), I get a different error 
code:

PR_GetConnectStatus: connect failed: (-5981, 111)
  host: localhost.austin.ibm.com

This is the host name.

  Connect in progress

This means the non-blocking PR_Connect() failed with
PR_IN_PROGRESS_ERROR, which is the expected error.
We are supposed to call PR_Poll() next to wait for the
completion of the non-blocking connect.

  PR_Poll returns 1

This means PR_Poll returns successfully.

  PR_POLL_WRITE

This means PR_Poll reports that the socket is writable,
which means the non-blocking connect is complete.  We
need to call PR_GetConnectStatus() to find out the
connect completion status.

  PR_GetConnectStatus: connect failed: (-5981, 10061)

This means the non-blocking connect failed.  NSPR error
code is -5981 and OS error code is 10061.

  PASS

The test passed.
The Linux (with pthreads) test case outputs this:

  host: localhost.austin.ibm.com
  Connect in progress
  PR_Poll returns 1
  PR_POLL_ERR
  PR_GetConnectStatus: connect failed: (-5981, 111)
  PASS
So the issue seems to be this:  one Linux w/ pthreads and Win32 (and 
others), PR_Poll returns a PR_POLL_ERR, PR_POLL_EXCEPT, or 
PR_POLL_HUP, all which lead to an error condition in the browser (see 
http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsSocketTrans
port.cpp lines#872 and 
http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsSocketTrans
port.cpp#998).

However, on OS/2 and Linux w/ classic NSPR threads, PR_POLL returns a 
PR_POLL_WRITE (according to the testcase).  The test case knows that 
the connection failed, though, because it calls PR_GetConnectStatus().  
We need to add calls to this function in the browser code.  I'll try 
this now.
OK, here is the deal.

The PR_Poll() call only tells you that the non-blocking
connect has completed.  But it doesn't tell you whether
it was a success or a failure.  In particular, it doesn't
tell you the reason (error code) of the failure.

You need to call PR_GetConnectStatus() to find out the
connect success/fail status and the error code.  Actually,
you should call PR_ConnectContinue() instead of
PR_GetConnectStatus() in NSPR 4.1 or newer.
I just did a LXR search in the mozilla source tree for PR_ConnectContinue
and PR_GetConnectStatus.  I am surprised to find out that mozilla is not
calling either function.  Two conclusions.

1. Mozilla cannot determine the success/failure status of the non-blocking
   connect in a portable manner.  The determination of the success/failure
   status of non-blocking connect is platform depenedent, as Javier has
   found out.  The PR_GetConnectStatus() function (and its replacement
   PR_ConnectContinue()) was designed to hide the non-portable code.

2. When the non-blocking connect fails, Mozilla doesn't have the error code.
Right now, the relevent PR_Poll call is in nsSocketTransportService.cpp#469. 
There is a for loop after that call that loops through each of the sockets.  Is
this where we would call PR_ConnectContinue()?  Or would we call it outside the
loop, right after PR_Poll?
Attached file update to test case
Modified test case to take a timeout parameter, as well as to try both blocking
and non-blocking connections.
I just attached a test case that tries both blocking and non-blocking 
modes. Also, the timeout to PR_Connect is configurable and is passed in 
both tests.

Results :

1) testing on localhost with no daemon and 5 s timeout. This should 
fail

[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect localhost:1234 5
Blocking connection to localhost:1234 failed.
NSPR error : -5981.
OS error : 10061.
Non-blocking connection to localhost:1234 successful.

=> bug in non-blocking code

2) testing on localhost and ftp port, with ftpd listening and 5 s 
timeout. This should succeed

[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect localhost:21   5
Blocking connection to localhost:21 successful.
Non-blocking connection to localhost:21 successful.

=> This is OK

3) testing on www.ibm.com:80 and 5 s timeout. This should succeed

[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect www.ibm.com:80 5
Blocking connection to www.ibm.com:80 successful.
Non-blocking connection to www.ibm.com:80 successful.

[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]

=> this is OK

4) This should fail
[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect www.ibm.com:81 5
Blocking connection to www.ibm.com:81 failed.
NSPR error : -5990.
OS error : 0.
PR_Poll error.
Non-blocking connection to www.ibm.com:81 failed.
NSPR error : -5934.
OS error : 10036.

=> OK

5) Test on www.ibm.com:80 using zero timeout . This should fail in 
blocking mode (0 is not enough time to connect over the Internet), but 
pass in non-blocking mode as timeout is not supposed to affect the 
success/failure in non-blocking mode

[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect www.ibm.com:80 0
Blocking connection to www.ibm.com:80 failed.
NSPR error : -5990.
OS error : 0.
PR_Poll error.
Non-blocking connection to www.ibm.com:80 failed.
NSPR error : -5934.
OS error : 10036.

=> bug in non-blocking mode

6) connecting to my Sun on port 80 on LAN. should fail as I have no 
HTTPD listening on it

[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect strange:80 0
Blocking connection to strange:80 failed.
NSPR error : -5990.
OS error : 0.
Non-blocking connection to strange:80 successful.

=> bug in non-blocking mode

From these tests I conclude that the non-blocking mode is broken about 
half the time, and the blocking mode is working as designed.
Attachment #64239 - Attachment is obsolete: true
I built my test case on Solaris. I got all the expected results except 
for 5) . I may have been incorrect in stating that it should work in 
non-blocking mode. I am using the same timeout for PR_Connect and 
PR_Poll, and in test 5), that is zero. The result is a failure both in 
blocking and non-blocking modes on Solaris when going to www.ibm.com:80 
(which passes with a non-zero timeout).

It would seem from these results that this bug is specific to OS/2.
Julien:  Using the patch in this bug, I get a different result for 
test (1).  Here is my output:

1) [D:\builds\trunk\mozilla\obj\dist\bin]julien localhost:1234 5
Blocking connection to localhost:1234 failed.
NSPR error : -5981.
OS error : 10061.
PR_Connect error.
Non-blocking connection to localhost:1234 failed.
NSPR error : -5981.
OS error : 10061.
Who handles the networking code on the browser side?  Could we get 
that person involved in this bug?  According to the NSPR docs, we 
should be calling PR_GetConnectStatus()/PR_ConnectContinue() after 
doing a PR_Poll, but this is not so in the browser.  
Javier,

I tried your NSPR patch with my test case.
It fixes problems in my test cases 1) as you reported, as well as 6).
Behavior in test 5) is not what I would expect - both blocking & 
non-blocking tests fail - but it is the same behavior as Solaris.
I'd say your patch looks good from this end.

Wan-Teh, what do you think the behavior should be for a non-blocking 
PR_Connect with 0 timeout and subsequent PR_Poll with 0 timeout, if 
there is a daemon listening on the host/port we are trying to connect 
to ? This is what test case 5 does. I assumed it should succeed since 
it works with a 5s timeout and a 0 timeout should not change it - but 
apparently I was wrong. Perhaps the timeout doesn't matter on 
PR_Connect but is still required on PR_Poll ?

Thanks.
A PR_Poll with 0 timeout may not wait long enough for the
completion of the non-blocking connect.  It is reasonable
for this PR_Poll call to return 0, which means it times out.
Attached patch fix memory leakSplinter Review
Attachment #64150 - Attachment is obsolete: true
I am a little confused aboutsome code.  In the part of the socket code 
in unix.c 
(http://lxr.mozilla.org/seamonkey/source/nsprpub/pr/src/md/unix/unix.c
#1142), there is an if statement for "if(!nonblocking && 
EINPROGRESS)".  What does "fd->secret->nonblocking" mean?  If it is 
true, is the socket blocking or nonblocking?  I ask because just above 
this code there is a comment that if "The NSPR socket is nonblocking 
and connect() fails with EINPROGRESS", then we do select and 
getsockopt.  

Basically, the if statement and the comment seem to be saying 
different things.  Is one of these incorrect, or is my understanding 
of the variable 'nonblocking' incorrect?
I have created bug 120181 to handle my solution in necko.  The patch 
attached to that bug, along with the patch here fixes the networking 
issues I was seeing.
wtc:  Can you check in the latest patch here?
Javier,

Good news.  Someone else independently discovered another
problem caused by the lack of PR_ConnectContinue calls in
nsSocketTransportService.cpp.  See bug 106188.

I suggest that you wait until bug 106188 is resolved to
finalize your patch for this bug.
Comment on attachment 64905 [details] [diff] [review]
fix memory leak

Javier, could you ask your colleague Michael Kaply to
review this patch too?

I won't be able to check in the trivial change to
mozilla/configure.in in your patch without a review
by Chris Seawood.  I think that change is fine.
Comment on attachment 64905 [details] [diff] [review]
fix memory leak

r=mkaply
Attachment #64905 - Flags: review+
r=cls on the configure.in changes.
OK, I'm ready to check in Javier's latest patch now,
but found that the Mozilla tree is closed except for
approved checkins.

I don't know how to get the checkin approval for this
bug.
Comment on attachment 64905 [details] [diff] [review]
fix memory leak

I've checked in this patch into the tip of NSPR.

I will check it into the client patch of NSPR after
someone gets the approval for me or when the Mozilla
tree is open for general checkins again.
I'll send a note to drivers@mozilla.org. 0.9.8 hasn't branched yet, so this 
should be able to go on the client branch before that happens.
wtc, please go ahead and check the fix into the client branch too,
a=brendan@mozilla.org.  Thanks,

/be
Comment on attachment 64905 [details] [diff] [review]
fix memory leak

OK, I just checked this patch into the client branch of NSPR.

I also checked in the change to mozilla/configure.in, but I
don't know how to generate mozilla/configure.  I understand
that mozilla/configure will be automatically updated and
checked in by some script of Leaf's.  If you know how to
manually generate mozilla/configure, you can also go ahead
and commit the change.
Marked the bug fixed.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Just wanted to note that Leaf's script has automatically
updated mozilla/configure.
Re-opening this bug.
I have found there are still problems with PR_Connect on OS/2.
I used the strsclnt tool from NSS to stress selfserv, an HTTPS (SSL 
HTTP) web server running on the local machine on port 2000.
As soon as I set -t 2 in strsclnt to have multiple client threads, I 
got bombarded with PR_Connect failures, as follows :

strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.
strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.
strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.
strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.
strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.
strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.
strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.

The connect errors occur regardless of whether I connect to the server 
using 127.0.0.1:2000 or the machine's actual IP address on port 2000. 
They occur as long as the connection is to a local daemon. When 
connecting to a remote server, multithreaded PR_Connect works fine.

If I set -t 1 to make strsclnt single-threaded, then PR_Connect works 
fine ...

I don't know if this is a bug in the OS/2 implementation of PR_Connect. 
It might well be a bug in the OS/2 TCP/IP stack. I reproduced this on 
the IBM TCP/IP 4.3 stack :

[e:\dev\nss\mozilla\dist\os22.45_icc_opt.obj\bin]inetver
Version numbers of TCP/IP protocol drivers:
   SOCKETS.SYS: 6.3000
   AFOS2.SYS:   6.3000
   AFINET.SYS:  6.3001

I don't know of any other OS/2 application that does multithreaded 
connect, so I can't test if this is a stack bug or not. A simple 
standalone test case could probably be written to check for that. Mike, 
I would suggest you inquire the guys at Austin about this.

FYI, the test command I used for this test was :
strsclnt -p 2000 -D -d . -N -t 1 -q -c 1000000 localhost

You can build strsclnt by building NSS standalone (gmake nss_build_all 
in mozilla/security/nss).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I tried this on both my OS/2 machines, which were running the exact 
same version of OS/2 and TCP/IP stack.
This problem did not occur on the AMD K6-2 550 MHz box.
But it did occur on the dual CPU AMD Athlon MP 1500+ box.

So I disabled the second CPU using MPCPUMON, and reran this test. This 
time I got no PR_Connect failures . This is definitely an OS/2 TCP/IP 
stack bug, and it is specific to SMP systems.
Perhaps I should open a different bug against NSPR . In strsclnt we use blocking
NSPR sockets, not non-blocking as mozilla is using, so the behavior is expected
to be different.

I'd like to go back on the hypothesis that this is an OS/2 staak bug. Do we use
any locks in PR_Connect () ? Maybe we have some SMP-unsafe locking mechanism in
use. I don't see any lock in os2sock.c though. But maybe some of the function it
can call do, like PR_Sleep.

FYI, I retested it, and the multithreaded connect fails on all sites, not just
local server ports, when both CPUs are enabled in OS/2, and the test succeeds
when the 2nd CPU is disabled. I just hadn't tried the local vs remote client
from the same SMP OS/2 system when I reopened this bug - I was running the local
test on the SMP box and remote test on the UNI box. Turns out local vs remote
isn't a factor for this, only SMP is.

PR_Lock and PR_Unlock are by definition threadsafe.

I'm not sure what to do with this bug.
Mike,

Please leave the bug opened. When I get time, I will be running some more tests
on this, especially now that I got the SSL server code running on OS/2. I
believe there are several TCP/IP stack bugs that affect both client and server code.
When I got the client and servers to work on OS/2, I was only able to test very
light TCP connection loads successfully. High loads would always result in lots
of connection errors. This is very likely caused by OS/2 bugs, but I still need
to find the specifics, whether the bugs are only on SMP or in UNI too, and
isolate the bugs into test cases ... This will take some more effort.

Attached file multi-threaded PR_Connect program (obsolete) —
This is an updated version of the test program that can run multithreaded, and
with multiple iterations.
I updated the test case to run multithreaded. I then ran it and tried to connect
on daemons listening on my localhost - lpd and ftpd. 
It had a very bad effect on the system. The test program and all my daemons
running on the system crashed, even the ones I wasn't connecting to (telnetd).
The OS became very unstable. I believe I'm hitting some serious limit in the
TCP/IP stack.

The arguments I used in the test were :
connect localhost:21 3 200 100

This runs 100 threads with 200 iterations of the test code that first does a
blocking connect, then a non-blocking connect.

I tried disabling a CPU on my SMP box (dual athlon), and it didn't improve  things.

Seems to me that this might be worth fixing in OS/2 TCP/IP.
Unless you have a "real world" scenario, there won't be a chance of getting this
fixed by the TCP/IP team.
Mike,

I believe there are real-world cases for this. If one is running a proxy server
on OS/2, it will act as a client . If you use that proxy to connect to another
server running on the same OS/2 machine as the proxy, then it will run into the
same issues as I have, once the proxy gets heavily loaded.

It is also very common to load-test servers with clients running on the same
system as the server. This bug makes it impossible. I would think that it would
be a very high priority to fix in a produced called "Warp Server for e-business".
A dead product called WarpServer for ebusiness.

You set your hopes too high.
Do you have a small testcase for this? If not, there is no way to get it fixed.
Attachment #100066 - Attachment is obsolete: true
I updated the test case.
To run, enable FTPD, TELNETD and LPD , then run :
connect localhost:21 3 200 100
connect localhost:23 3 200 100
connect localhost:515 3 200 100

You may need to run then multiple times.
Eventually, all the daemons will crash.
The system will be out of sockets and network applications will be unable to run
for a while until the local sockets have timed out.

This happens even on a UNI ACP2 kernel.
Demoting to P3.
Priority: P1 → P3
I'm running OS/2 with TCP/IP 4.02y.  For me this is a very real problem -- I
need to watch the throbber behavior -- if the throbber stops, it appears to mean
that the attempt to connect to a server failed.  I would much rather have an
explicit error message telling me that my previous click on a link was ineffective.
There's no Mozilla fix here, and fixes aren't going to be made to OS/2 for this.
Status: REOPENED → RESOLVED
Closed: 23 years ago20 years ago
Resolution: --- → WONTFIX
Reopening this because I finally figured out the cause for this PR_Connect issue
. It is a bug in the OS/2 socket library.

The OS/2 TCP/IP library actually modifies the sockaddr structure within the
connect call !

I found this by putting a data change breakpoint in IDEBUG on the same PRNetAddr
structure that I was passing to all my threads doing PR_Connect . And it got hit
in tcpip32.dll
Specifically, the "family" byte gets modified during the execution of the OS/2
"connect" socket call. But it is always reset upon returning from the function.

In the multithreaded case, this caused PR_NETADDR_SIZE to return the wrong
addrlen of 110 bytes (presumably the IPv6 size), which then gets passed to
connect, which fails with EINVAL. This was not easy to debug since
PR_NETADDR_SIZE is a macro. I found it by putting an assertion in MD_CONNECT for
addrlen == 16 ( which is the correct structure size for IPv4)

I was able to workaround the problem in the application program by copying the
PRNetAddr structure before passing it to PR_Connect . This causes the address
structure to be no longer shared accross threads, and then all the connects work
properly .

A workaround for this bug is to copy PRNetAddr to the stack within MD_Connect in
NSPR . It might be a good idea to revisit all the OS/2 socket calls that pass an
address and do the same trick . I will attach a patch just for the connect case.

Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Attached patch work around OS/2 connect bug (obsolete) — Splinter Review
Attachment #167074 - Flags: review?(mkaply)
Comment on attachment 167074 [details] [diff] [review]
work around OS/2 connect bug

Javier did this code, so I'd like him to review it
Attachment #167074 - Flags: review?(mkaply) → review?(jhpedemonte)
Comment on attachment 167074 [details] [diff] [review]
work around OS/2 connect bug

Good job!  Open a bug for the other functions that do this.
Attachment #167074 - Flags: review?(jhpedemonte) → review+
Could you find out how connect() is declared on OS/2?
Wan-Teh,

I checked that the other day, and the sockaddr pointer wasn't declared const .
So technically it may not be a bug. Still quite an unusual thing to do .

Javier,

One would have to first write test programs for all the other OS/2 functions
that take sockaddr* that are usd in NSPR to determine if the problem exists for
each NSPR API . I think it's probably easier to just do the same fix to copy the
structure for all of these functions and be done with it .
Comment on attachment 167074 [details] [diff] [review]
work around OS/2 connect bug

>+    PRNetAddr localaddr = *addr; /* work around a bug in OS/2 . see bugzilla 100776 */

Julien, please add a one-line description of this bug
in the comment.
Attachment #167074 - Attachment is obsolete: true
I just attached a new patch with the comment . I'm targetting this bug to NSPR
4.5.2 .

I also updated the bug subject to reflect what was actually happening - the
error message originally was about a mozilla browser bug report, but NSPR
doesn't report error messages . 

I only tested that my patch fixes the case where a multi-threaded client
program, the NSS strsclnt tool, connects to a remote server on my LAN . This
case was previously broken .

I did not try to check the behavior against a server on the local machine (or no
server on localhost). That still needs to be tested . Javier, could you please
check that ?

Thanks .
Summary: No error message on failure to connect → PR_Connect fails inexplicably with localhost and in multi-threaded programs
Target Milestone: 4.2 → 4.5.2
We don't have any OS/2 machines to test this stuff anymore.
I think we should just check in this patch. It shouldn't break anything.

julian, what do you think?
OS/2 developers: would you like me to check in
this patch?
Yes, please do.
Target Milestone: 4.5.2 → 4.6
I checked in Julien's patch (attachment 168262 [details] [diff] [review],
after editing) on the NSPR trunk (NSPR 4.6) and
NSPRPUB_PRE_4_2_CLIENT_BRANCH (Mozilla 1.8 Beta 2).

Good job, Julien, for tracking this down (comment
84).
Status: REOPENED → RESOLVED
Closed: 20 years ago19 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: