100776 - PR_Connect fails inexplicably with localhost and in multi-threaded programs

Reporter

Description

•

23 years ago

When trying to connect to http://localhost (with no server running), we should
get a dialog that says 'connection was refused'.  This works on win32 and linux,
but on OS/2, nothing happens.  The throbber runs for a split second then
immediately stops, but no error message is produced.

Javier Pedemonte

Reporter

Comment 1

•

23 years ago

This seems to be due to an error in the OS/2 TCP/IP implementation.  When doing
a select() on localhost, both win32 and linux will set the exception bit for
that socket, to let you know that you cannot connect to localhost.  On OS/2,
however, we set the read or write bit, but not the exception bit.  So mozilla
tries to read from it and fails, but does not pop up the dialog box.  

My workaround, then, was to call getsockopt when we check the read or write
bits.  If getsockopt says that there is an error for that socket, then we treat
it as an exception.  Otherwise, we treat it as a read or write.

Javier Pedemonte

Reporter

Comment 2

•

23 years ago

Attached file os2sock.c file (obsolete) — Details

Javier Pedemonte

Reporter

Comment 3

•

23 years ago

Attached file os2poll.c file (obsolete) — Details

Javier Pedemonte

Reporter

Comment 4

•

23 years ago

Although the problem was mainly in os2poll.c, I also cleaned up os2sock.c, which
had a similar fix for select().  I attached the whole file, though, since there
were many changes.

Javier Pedemonte

Reporter

Comment 5

•

23 years ago

Attached patch nsprpub\configure.in patch (obsolete) — Details — Splinter Review

Javier Pedemonte

Reporter

Comment 6

•

23 years ago

Also, added patch to make VACPP builds use OS/2 version of select, while EMX
builds still use the BSD select().

Wan-Teh Chang

Comment 7

•

23 years ago

Hi Javier,

The function you need to fix is SocketConnectContinue()
in mozilla/nsprpub/pr/src/io/prsocket.c.  The OS/2
implementation for that function looks like this:

    if (out_flags & PR_POLL_EXCEPT) {
        int len = sizeof(err);
        if (getsockopt(osfd, (int)SOL_SOCKET, SO_ERROR, (char *) &err, &len)
                == SOCKET_ERROR) {
            _PR_MD_MAP_GETSOCKOPT_ERROR(WSAGetLastError());
            return PR_FAILURE;
        }
        if (err != 0) {
            _PR_MD_MAP_CONNECT_ERROR(err);
        } else {
            PR_SetError(PR_UNKNOWN_ERROR, 0);
        }
        return PR_FAILURE;
    }

It seems to me that we should remove the
"if (out_flags & PR_POLL_EXCEPT)".

Status: UNCONFIRMED → ASSIGNED

Ever confirmed: true

jhp (no longer active)

Comment 8

•

23 years ago

wtc: I just tried messing around with that code, and I was not able to make it
work as a fix for this problem;  I could continue browsing, but the problem
still existed.  Besides, we want to get this fixed in TCP/IP, and if it is
fixed, the code I attached should still work properly.

Mike Kaply [:mkaply]

Comment 9

•

23 years ago

wtc:

Can we get this one into NSPR?

See Javier's last comments.

Thanks

Wan-Teh Chang

Comment 10

•

23 years ago

Javier wrote:
> This seems to be due to an error in the OS/2 TCP/IP implementation.  When doing
> a select() on localhost, both win32 and linux will set the exception bit for
> that socket, to let you know that you cannot connect to localhost.

This is not true.

Unix will mark the socket as both readable and writable if the
non-blocking connect failed.  Unix does not use the exception
bit for this purpose.  Winsock is the only socket implementation
that uses the exception bit to report a non-blocking connect
failure.

> On OS/2, however, we set the read or write bit, but not the exception bit.

This is consistent with the socket implementation on Unix.

This is NOT a bug in the OS/2 TCP/IP implementation, unless
OS/2 needs to be like Winsock.

Javier, what you need to fix is _PR_MD_CONNECT() and
SocketConnectContinue().  I also need you to attach diffs
as opposed to whole files.

Wan-Teh Chang

Updated

•

23 years ago

Attachment #50093 - Flags: needs-work+

Wan-Teh Chang

Updated

•

23 years ago

Attachment #50094 - Flags: needs-work+

Wan-Teh Chang

Comment 11

•

23 years ago

Comment on attachment 50095 [details] [diff] [review]
nsprpub\configure.in patch

This patch is fine by me.  Michael, could you review this patch too?

Attachment #50095 - Flags: review+

jhp (no longer active)

Comment 12

•

23 years ago

wtc:
I changed the files according to your suggestions.  I removed the code that
handled exceptions differently in os2sock.c and os2poll.c, and made the XP_OS2
section in prsocket.c be more like the XP_UNIX section.  Hoever, this does not
fix the problem of connecting to localhost.  I get no error dialog, it just
stops loading.  I debugged this and it turns out that SocketConnectContinue
never gets hit when trying to load localhost, although _PR_MD_CONNECT and
_PR_MD_PR_POLL do get called.  So where do you suggest I put the exception
checking code?

Wan-Teh Chang

Comment 13

•

23 years ago

You need to call getsockopt() in SocketConnectContinue (if
fd->secret->nonblocking is true) and _PR_MD_CONNECT (if
fd->secret->nonblocking is false).

jhp (no longer active)

Comment 14

•

23 years ago

Attached patch patch (obsolete) — Details — Splinter Review

jhp (no longer active)

Comment 15

•

23 years ago

wtc:  I've created a patch based on your suggestions.  I now call getsockopt()
in SocketConnectContinue and _PR_MD_CONNECT.  However, I still get no error
message when connecting to localhost.  I think I also need to be calling
getsockopt in PR_POLL.  Am I missing something here?

Wan-Teh Chang

Comment 16

•

23 years ago

Attached patch A patch from wtc, which implements the solution he outlined. (obsolete) — Details — Splinter Review

Wan-Teh Chang

Comment 17

•

23 years ago

Javier,

I implemented my solution against NSPR 4.1.2.
I tried it on Julien's OS/2 machine here and I
was able to get the error code from connect
whether I connected to a remote machine or the
localhost.

Since Julien has not installed autoconf on his
OS/2 machine, I can't build the version of NSPR
that the Mozilla client is using.  I modified
my solution for the NSPR client branch, which
is the patch I just attached (attachment 52007 [details] [diff] [review]).
Since I can't build it, it may need minor tweaking.
But it shows you in concrete terms what I think
is the right solution.

Please let me know if my patch works for you.
I am also wondering if you and I are using
different Winsock implementations.

jhp (no longer active)

Comment 18

•

23 years ago

No, this still won't work for the issue at hand.  Your patch fixes the exception
case in PR_MD_CONNECT.  However, when contacting 'localhost', it goes through
the PR_POLL code.  This is the code that needs to be fixed in this case.  I need
to be calling getsockopt in PR_POLL, when checking the read/write flags.
Also, in regards to your changes, is it correct to completely remove the
exception flag checking?  Is there no exception other than non-blocking
connection failure that could be set?

Wan-Teh Chang

Comment 19

•

23 years ago

Javier,

I will have to give up on giving you advice on this
problem.  I am not familiar with OS/2.

Please find an OS/2 developer (for example, mkaply)
to review your patch, and I'll check it in for you.
Thanks.

Julien Pierre

Assignee

Comment 20

•

23 years ago

Javier,

I'd like to understand this bug better. What code is falling back to PR_Poll
when connecting to localhost on OS/2 ? Is that a part of NSPR or some code in
the browser ?

Wan-Teh - it would appear that the NSPR socket code is using a Winsock library
that ships with OS/2, so in that sense it is supposed to behave like Windows.
I'm not clear why the decision was made to use that vs the BSD-like socket API
that is native to OS/2. Either API can be used but I think the Winsock.dll fro
OS/2 is an emulation library layered on top of the other socket API.

jhp (no longer active)

Comment 21

•

23 years ago

Julien:
It seems to me that the socket code that is actually trying to contact the
'localhost' server is in PR_POLL.  I have updated PR_MD_CONNECT and
SocketConnectionContinue properly, but I was still seeing the problem.  It was
only after I updated os2poll.c likewise that I saw the 'connection was refused'
error message.
In regards to your winsock statement, I believe you are referring to the code
segments that are #ifndef BSD_SELECT.  OS/2 supports two version of select: one
is the BSD version, and one is a native OS/2 version, which is more efficient. 
I have coded both versions, but for the Visual Age compiled mozilla we used the
native select, while the EMX/GCC compiler uses the more common BSD version.

Julien Pierre

Assignee

Comment 22

•

23 years ago

Javier,

I wasn't referring to select but to the fact that you use things in your code
snippets like WSAGetLastError() which are part of the Winsock emulation library
of OS/2; ie from winsock.h.

When I wrote socket apps for OS/2 before I used more BSD-like functions . Eg for
the error codes it was sock_errno() , which is from sys/socket.h .

These two headers - winsock.h and sys/socket.h - which both come with the OS/2
toolkit, define two independent sets of socket APIs for OS/2. I believe it is
preferable to use the socket API in sys/socket.h over the one from winsock.h .
It is closer to the native OS/2 stack which is a port from the BSD TCP/IP stack
in AIX in the first place.

Mike Kaply [:mkaply]

Comment 23

•

23 years ago

Julien,

We don't use any Winsock code at all for the OS/2 port. I'm not sure what you 
are referring to.

Julien Pierre

Assignee

Comment 24

•

23 years ago

Mike,

I was referring to the code quoted by Wan-Teh above on 9/20 at 10:55 which
included a call to WSAGetLastError() in prsocket.c and was referred to as "OS/2
implementation". However I went back and checked the tree and that code is in a
block only for WIN16 or WIN32. The XP_OS/2 code is below and has the proper
implementation which calls sock_errno . Sorry for the confusion. I think Wan-Teh
probably meant that was the Windows implementation.

Wan-Teh Chang

Comment 25

•

23 years ago

At 2001-10-03 21:24, I wrote:
> Javier,
>
> [...]
>
> I am also wondering if you and I are using
> different Winsock implementations.

Please disregard that comment.  I was just
passing along a comment of Julien's.

jhp (no longer active)

Comment 26

•

23 years ago

Attached patch another patch (obsolete) — Details — Splinter Review

jhp (no longer active)

Comment 27

•

23 years ago

This is my latest working patch.  Everything works now.  I do have some
questions though.

wtc, in your patch, you removed the exceptions check in PR_MD_CONNECT and
replaced it with a call to getsockopt.  In my patch, I still have the exceptions
check, but when checking if the write bit is set, I call getsockopt to see if
there is an error on the socket.  Do I still need the exception check when using
BSD-socket model?

In this patch, I have now added a call to getsockopt when checking the read and
write bits.  If they have an error on the socket, I treat it as an exception.

Wan-Teh Chang

Comment 28

•

23 years ago

Javier,

Although I said earlier that I have given up on
this bug, I still couldn't resist the temptation
to look at your latest patch.

You wrote:
> Do I still need the exception check when using
> BSD-socket model?

No.  BSD-socket model does not use the exception flag.
Only Winsock uses the exception flag.  BSD-socket
model would mark the socket as both readable and writable
if the non-blocking connect failed.

> In this patch, I have now added a call to getsockopt
> when checking the read and write bits.  If they have
> an error on the socket, I treat it as an exception.

This change to PR_Poll is what I have been objecting to.
PR_Poll should not call getsockopt.  It is
SocketConnectContinue and _PR_MD_CONNECT that should call
getsockopt.

Are you saying that the BSD-socket on OS/2 does not
mark the socket as readable and writable when the
non-blocking connect failed?  That would be a bug.

jhp (no longer active)

Comment 29

•

23 years ago

Attached patch patch w/o exception handling in _PR_MD_CONNECT (obsolete) — Details — Splinter Review

jhp (no longer active)

Updated

•

23 years ago

Attachment #50093 - Attachment is obsolete: true

jhp (no longer active)

Updated

•

23 years ago

Attachment #50094 - Attachment is obsolete: true

jhp (no longer active)

Comment 30

•

23 years ago

wtc:
OK, so I removed the exception code in _PR_MD_CONNECT, and wrote up PR_POLL like
it was previously.  Now, if I try to connect to a nonexistant URL (such as
www.lalksdfj.com), then I correctly get an error dialog.  However, I still do
not get a dialog to pop up if I try to connect to "localhost".  I am still
seeing the same thing.  Is this handled differently for some reason?

jhp (no longer active)

Updated

•

23 years ago

Attachment #51869 - Attachment is obsolete: true

jhp (no longer active)

Updated

•

23 years ago

Attachment #52275 - Attachment is obsolete: true

Wan-Teh Chang

Updated

•

23 years ago

Priority: -- → P1

Target Milestone: --- → 4.2

jhp (no longer active)

Comment 31

•

23 years ago

Attached patch patch using unix socket code (obsolete) — Details — Splinter Review

I went back and took as much of the unix code as I could, in both os2poll.c and
os2sock.c.  While this still does not fix the localhost issue, I like this code
better than what I had previously because it uses the enhancements found inthe
unix code.  Plus, it looks much cleaner.

wtc:  I am trying to debug the localhost issue on linux, but it uses pthreads. 
How can I build nspr on linux without pthreads?

Attachment #50095 - Attachment is obsolete: true

Attachment #52007 - Attachment is obsolete: true

Attachment #52952 - Attachment is obsolete: true

jhp (no longer active)

Comment 32

•

23 years ago

Attached patch removed _PR_MD_CONNECT hack (obsolete) — Details — Splinter Review

Attachment #64044 - Attachment is obsolete: true

Julien Pierre

Assignee

Comment 33

•

23 years ago

Attached file test case for PR_Connect and build scripts — Details

I have created a test program for PR_Connect . I couldn't get the NSPR test
build environment to work right so I used NSS coreconf build.

1. Using a regular browser build environment, create a directory and unzip this
file.

2. Run build.cmd . This should pull the latest NSPR and NSS and build them,
including all tools.

3. go to mozilla/security/nss/cmd/connect and run gmake

4. run the connect program from mozilla/dist/os22.45_icc_dbg.obj/bin

eg. connect localhost:1234 should show an error
if you have telnetd listening , connect localhost:23 should succeed
if you don't have an httpd listening, connect localhost:80 should fail

As far as I can tell, PR_Connect works as expected in all cases, so the bug has
to be in the browser code.

If you don't want to do the whole NSS build, just use the connect.exe from the
zip and run it against your existing NSPR DLLs.

Wan-Teh Chang

Comment 34

•

23 years ago

Julien "Mad Brain" Pierre wants to own this bug.  His passion
about OS/2 really touched me.

Assignee: wtc → jpierre

Status: ASSIGNED → NEW

Julien Pierre

Assignee

Comment 35

•

23 years ago

The test case I attached was using PR_Connect in blocking mode, which 
is the default. As Wan-Teh told me, the browser uses non-blocking mode. 
I'm attaching an updated connect.cpp that tests non-blocking mode. I 
have reproduced the problem using that test case, but have not yet 
traced through it.

Julien Pierre

Assignee

Comment 36

•

23 years ago

Attached file non-blocking connect test case (obsolete) — Details

Julien Pierre

Assignee

Comment 37

•

23 years ago

Wan-Teh told me there is a test for this in the NSPR tests, nbconn . I was not
able to get it to build however. See bug 119098 .

jhp (no longer active)

Comment 38

•

23 years ago

I ran "nbconn localhost", and I got this on both OS/2 and Linux (with 
classic nspr threads):

  host: localhost.austin.ibm.com
  Connect in progress
  PR_Poll returns 1
  PR_POLL_WRITE
  PR_GetConnectStatus: connect failed: (-5981, 10061)
  PASS

What does this mean?

jhp (no longer active)

Comment 39

•

23 years ago

Actually, for Linux (classic nspr threads), I get a different error 
code:

PR_GetConnectStatus: connect failed: (-5981, 111)

Wan-Teh Chang

Comment 40

•

23 years ago

  host: localhost.austin.ibm.com

This is the host name.

  Connect in progress

This means the non-blocking PR_Connect() failed with
PR_IN_PROGRESS_ERROR, which is the expected error.
We are supposed to call PR_Poll() next to wait for the
completion of the non-blocking connect.

  PR_Poll returns 1

This means PR_Poll returns successfully.

  PR_POLL_WRITE

This means PR_Poll reports that the socket is writable,
which means the non-blocking connect is complete.  We
need to call PR_GetConnectStatus() to find out the
connect completion status.

  PR_GetConnectStatus: connect failed: (-5981, 10061)

This means the non-blocking connect failed.  NSPR error
code is -5981 and OS error code is 10061.

  PASS

The test passed.

jhp (no longer active)

Comment 41

•

23 years ago

The Linux (with pthreads) test case outputs this:

  host: localhost.austin.ibm.com
  Connect in progress
  PR_Poll returns 1
  PR_POLL_ERR
  PR_GetConnectStatus: connect failed: (-5981, 111)
  PASS

jhp (no longer active)

Comment 42

•

23 years ago

So the issue seems to be this:  one Linux w/ pthreads and Win32 (and 
others), PR_Poll returns a PR_POLL_ERR, PR_POLL_EXCEPT, or 
PR_POLL_HUP, all which lead to an error condition in the browser (see 
http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsSocketTrans
port.cpp lines#872 and 
http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsSocketTrans
port.cpp#998).

However, on OS/2 and Linux w/ classic NSPR threads, PR_POLL returns a 
PR_POLL_WRITE (according to the testcase).  The test case knows that 
the connection failed, though, because it calls PR_GetConnectStatus().  
We need to add calls to this function in the browser code.  I'll try 
this now.

Wan-Teh Chang

Comment 43

•

23 years ago

OK, here is the deal.

The PR_Poll() call only tells you that the non-blocking
connect has completed.  But it doesn't tell you whether
it was a success or a failure.  In particular, it doesn't
tell you the reason (error code) of the failure.

You need to call PR_GetConnectStatus() to find out the
connect success/fail status and the error code.  Actually,
you should call PR_ConnectContinue() instead of
PR_GetConnectStatus() in NSPR 4.1 or newer.

Wan-Teh Chang

Comment 44

•

23 years ago

I just did a LXR search in the mozilla source tree for PR_ConnectContinue
and PR_GetConnectStatus.  I am surprised to find out that mozilla is not
calling either function.  Two conclusions.

1. Mozilla cannot determine the success/failure status of the non-blocking
   connect in a portable manner.  The determination of the success/failure
   status of non-blocking connect is platform depenedent, as Javier has
   found out.  The PR_GetConnectStatus() function (and its replacement
   PR_ConnectContinue()) was designed to hide the non-portable code.

2. When the non-blocking connect fails, Mozilla doesn't have the error code.

jhp (no longer active)

Comment 45

•

23 years ago

Right now, the relevent PR_Poll call is in nsSocketTransportService.cpp#469. 
There is a for loop after that call that loops through each of the sockets.  Is
this where we would call PR_ConnectContinue()?  Or would we call it outside the
loop, right after PR_Poll?

Julien Pierre

Assignee

Comment 46

•

23 years ago

Attached file update to test case — Details

Modified test case to take a timeout parameter, as well as to try both blocking
and non-blocking connections.

Julien Pierre

Assignee

Comment 47

•

23 years ago

I just attached a test case that tries both blocking and non-blocking 
modes. Also, the timeout to PR_Connect is configurable and is passed in 
both tests.

Results :

1) testing on localhost with no daemon and 5 s timeout. This should 
fail

[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect localhost:1234 5
Blocking connection to localhost:1234 failed.
NSPR error : -5981.
OS error : 10061.
Non-blocking connection to localhost:1234 successful.

=> bug in non-blocking code

2) testing on localhost and ftp port, with ftpd listening and 5 s 
timeout. This should succeed

[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect localhost:21   5
Blocking connection to localhost:21 successful.
Non-blocking connection to localhost:21 successful.

=> This is OK

3) testing on www.ibm.com:80 and 5 s timeout. This should succeed

[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect www.ibm.com:80 5
Blocking connection to www.ibm.com:80 successful.
Non-blocking connection to www.ibm.com:80 successful.

[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]

=> this is OK

4) This should fail
[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect www.ibm.com:81 5
Blocking connection to www.ibm.com:81 failed.
NSPR error : -5990.
OS error : 0.
PR_Poll error.
Non-blocking connection to www.ibm.com:81 failed.
NSPR error : -5934.
OS error : 10036.

=> OK

5) Test on www.ibm.com:80 using zero timeout . This should fail in 
blocking mode (0 is not enough time to connect over the Internet), but 
pass in non-blocking mode as timeout is not supposed to affect the 
success/failure in non-blocking mode

[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect www.ibm.com:80 0
Blocking connection to www.ibm.com:80 failed.
NSPR error : -5990.
OS error : 0.
PR_Poll error.
Non-blocking connection to www.ibm.com:80 failed.
NSPR error : -5934.
OS error : 10036.

=> bug in non-blocking mode

6) connecting to my Sun on port 80 on LAN. should fail as I have no 
HTTPD listening on it

[D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect strange:80 0
Blocking connection to strange:80 failed.
NSPR error : -5990.
OS error : 0.
Non-blocking connection to strange:80 successful.

=> bug in non-blocking mode

From these tests I conclude that the non-blocking mode is broken about 
half the time, and the blocking mode is working as designed.

Julien Pierre

Assignee

Updated

•

23 years ago

Attachment #64239 - Attachment is obsolete: true

Julien Pierre

Assignee

Comment 48

•

23 years ago

I built my test case on Solaris. I got all the expected results except 
for 5) . I may have been incorrect in stating that it should work in 
non-blocking mode. I am using the same timeout for PR_Connect and 
PR_Poll, and in test 5), that is zero. The result is a failure both in 
blocking and non-blocking modes on Solaris when going to www.ibm.com:80 
(which passes with a non-zero timeout).

It would seem from these results that this bug is specific to OS/2.

jhp (no longer active)

Comment 49

•

23 years ago

Julien:  Using the patch in this bug, I get a different result for 
test (1).  Here is my output:

1) [D:\builds\trunk\mozilla\obj\dist\bin]julien localhost:1234 5
Blocking connection to localhost:1234 failed.
NSPR error : -5981.
OS error : 10061.
PR_Connect error.
Non-blocking connection to localhost:1234 failed.
NSPR error : -5981.
OS error : 10061.

jhp (no longer active)

Comment 50

•

23 years ago

Who handles the networking code on the browser side?  Could we get 
that person involved in this bug?  According to the NSPR docs, we 
should be calling PR_GetConnectStatus()/PR_ConnectContinue() after 
doing a PR_Poll, but this is not so in the browser.

Julien Pierre

Assignee

Comment 51

•

23 years ago

Javier,

I tried your NSPR patch with my test case.
It fixes problems in my test cases 1) as you reported, as well as 6).
Behavior in test 5) is not what I would expect - both blocking & 
non-blocking tests fail - but it is the same behavior as Solaris.
I'd say your patch looks good from this end.

Wan-Teh, what do you think the behavior should be for a non-blocking 
PR_Connect with 0 timeout and subsequent PR_Poll with 0 timeout, if 
there is a daemon listening on the host/port we are trying to connect 
to ? This is what test case 5 does. I assumed it should succeed since 
it works with a 5s timeout and a 0 timeout should not change it - but 
apparently I was wrong. Perhaps the timeout doesn't matter on 
PR_Connect but is still required on PR_Poll ?

Thanks.

Wan-Teh Chang

Comment 52

•

23 years ago

A PR_Poll with 0 timeout may not wait long enough for the
completion of the non-blocking connect.  It is reasonable
for this PR_Poll call to return 0, which means it times out.

jhp (no longer active)

Comment 53

•

23 years ago

Attached patch fix memory leak — Details — Splinter Review

Attachment #64150 - Attachment is obsolete: true

jhp (no longer active)

Comment 54

•

23 years ago

I am a little confused aboutsome code.  In the part of the socket code 
in unix.c 
(http://lxr.mozilla.org/seamonkey/source/nsprpub/pr/src/md/unix/unix.c
#1142), there is an if statement for "if(!nonblocking && 
EINPROGRESS)".  What does "fd->secret->nonblocking" mean?  If it is 
true, is the socket blocking or nonblocking?  I ask because just above 
this code there is a comment that if "The NSPR socket is nonblocking 
and connect() fails with EINPROGRESS", then we do select and 
getsockopt.  

Basically, the if statement and the comment seem to be saying 
different things.  Is one of these incorrect, or is my understanding 
of the variable 'nonblocking' incorrect?

jhp (no longer active)

Comment 55

•

23 years ago

I have created bug 120181 to handle my solution in necko.  The patch 
attached to that bug, along with the patch here fixes the networking 
issues I was seeing.

jhp (no longer active)

Comment 56

•

23 years ago

wtc:  Can you check in the latest patch here?

Wan-Teh Chang

Comment 57

•

23 years ago

Javier,

Good news.  Someone else independently discovered another
problem caused by the lack of PR_ConnectContinue calls in
nsSocketTransportService.cpp.  See bug 106188.

I suggest that you wait until bug 106188 is resolved to
finalize your patch for this bug.

Wan-Teh Chang

Comment 58

•

23 years ago

Comment on attachment 64905 [details] [diff] [review]
fix memory leak

Javier, could you ask your colleague Michael Kaply to
review this patch too?

I won't be able to check in the trivial change to
mozilla/configure.in in your patch without a review
by Chris Seawood.  I think that change is fine.

Mike Kaply [:mkaply]

Comment 59

•

23 years ago

Comment on attachment 64905 [details] [diff] [review]
fix memory leak

r=mkaply

Attachment #64905 - Flags: review+

hacker formerly known as seawood@netscape.com

Comment 60

•

23 years ago

r=cls on the configure.in changes.

Wan-Teh Chang

Comment 61

•

23 years ago

OK, I'm ready to check in Javier's latest patch now,
but found that the Mozilla tree is closed except for
approved checkins.

I don't know how to get the checkin approval for this
bug.

Wan-Teh Chang

Comment 62

•

23 years ago

Comment on attachment 64905 [details] [diff] [review]
fix memory leak

I've checked in this patch into the tip of NSPR.

I will check it into the client patch of NSPR after
someone gets the approval for me or when the Mozilla
tree is open for general checkins again.

Mike Kaply [:mkaply]

Comment 63

•

23 years ago

I'll send a note to drivers@mozilla.org. 0.9.8 hasn't branched yet, so this 
should be able to go on the client branch before that happens.

Brendan Eich [:brendan]

Comment 64

•

23 years ago

wtc, please go ahead and check the fix into the client branch too,
a=brendan@mozilla.org.  Thanks,

/be

Wan-Teh Chang

Comment 65

•

23 years ago

Comment on attachment 64905 [details] [diff] [review]
fix memory leak

OK, I just checked this patch into the client branch of NSPR.

I also checked in the change to mozilla/configure.in, but I
don't know how to generate mozilla/configure.  I understand
that mozilla/configure will be automatically updated and
checked in by some script of Leaf's.  If you know how to
manually generate mozilla/configure, you can also go ahead
and commit the change.

Wan-Teh Chang

Comment 66

•

23 years ago

Marked the bug fixed.

Status: NEW → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

Wan-Teh Chang

Comment 67

•

23 years ago

Just wanted to note that Leaf's script has automatically
updated mozilla/configure.

Julien Pierre

Assignee

Comment 68

•

23 years ago

Re-opening this bug.
I have found there are still problems with PR_Connect on OS/2.
I used the strsclnt tool from NSS to stress selfserv, an HTTPS (SSL 
HTTP) web server running on the local machine on port 2000.
As soon as I set -t 2 in strsclnt to have multiple client threads, I 
got bombarded with PR_Connect failures, as follows :

strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.
strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.
strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.
strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.
strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.
strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.
strsclnt: PR_Connect returned error -5994:
Some unknown error has occurred.

The connect errors occur regardless of whether I connect to the server 
using 127.0.0.1:2000 or the machine's actual IP address on port 2000. 
They occur as long as the connection is to a local daemon. When 
connecting to a remote server, multithreaded PR_Connect works fine.

If I set -t 1 to make strsclnt single-threaded, then PR_Connect works 
fine ...

I don't know if this is a bug in the OS/2 implementation of PR_Connect. 
It might well be a bug in the OS/2 TCP/IP stack. I reproduced this on 
the IBM TCP/IP 4.3 stack :

[e:\dev\nss\mozilla\dist\os22.45_icc_opt.obj\bin]inetver
Version numbers of TCP/IP protocol drivers:
   SOCKETS.SYS: 6.3000
   AFOS2.SYS:   6.3000
   AFINET.SYS:  6.3001

I don't know of any other OS/2 application that does multithreaded 
connect, so I can't test if this is a stack bug or not. A simple 
standalone test case could probably be written to check for that. Mike, 
I would suggest you inquire the guys at Austin about this.

FYI, the test command I used for this test was :
strsclnt -p 2000 -D -d . -N -t 1 -q -c 1000000 localhost

You can build strsclnt by building NSS standalone (gmake nss_build_all 
in mozilla/security/nss).

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Julien Pierre

Assignee

Comment 69

•

23 years ago

I tried this on both my OS/2 machines, which were running the exact 
same version of OS/2 and TCP/IP stack.
This problem did not occur on the AMD K6-2 550 MHz box.
But it did occur on the dual CPU AMD Athlon MP 1500+ box.

So I disabled the second CPU using MPCPUMON, and reran this test. This 
time I got no PR_Connect failures . This is definitely an OS/2 TCP/IP 
stack bug, and it is specific to SMP systems.

Julien Pierre

Assignee

Comment 70

•

23 years ago

Perhaps I should open a different bug against NSPR . In strsclnt we use blocking
NSPR sockets, not non-blocking as mozilla is using, so the behavior is expected
to be different.

I'd like to go back on the hypothesis that this is an OS/2 staak bug. Do we use
any locks in PR_Connect () ? Maybe we have some SMP-unsafe locking mechanism in
use. I don't see any lock in os2sock.c though. But maybe some of the function it
can call do, like PR_Sleep.

FYI, I retested it, and the multithreaded connect fails on all sites, not just
local server ports, when both CPUs are enabled in OS/2, and the test succeeds
when the 2nd CPU is disabled. I just hadn't tried the local vs remote client
from the same SMP OS/2 system when I reopened this bug - I was running the local
test on the SMP box and remote test on the UNI box. Turns out local vs remote
isn't a factor for this, only SMP is.

Mike Kaply [:mkaply]

Comment 71

•

22 years ago

PR_Lock and PR_Unlock are by definition threadsafe.

I'm not sure what to do with this bug.

Julien Pierre

Assignee

Comment 72

•

22 years ago

Mike,

Please leave the bug opened. When I get time, I will be running some more tests
on this, especially now that I got the SSL server code running on OS/2. I
believe there are several TCP/IP stack bugs that affect both client and server code.
When I got the client and servers to work on OS/2, I was only able to test very
light TCP connection loads successfully. High loads would always result in lots
of connection errors. This is very likely caused by OS/2 bugs, but I still need
to find the specifics, whether the bugs are only on SMP or in UNI too, and
isolate the bugs into test cases ... This will take some more effort.

Julien Pierre

Assignee

Comment 73

•

22 years ago

Attached file multi-threaded PR_Connect program (obsolete) — Details

This is an updated version of the test program that can run multithreaded, and
with multiple iterations.

Julien Pierre

Assignee

Comment 74

•

22 years ago

I updated the test case to run multithreaded. I then ran it and tried to connect
on daemons listening on my localhost - lpd and ftpd. 
It had a very bad effect on the system. The test program and all my daemons
running on the system crashed, even the ones I wasn't connecting to (telnetd).
The OS became very unstable. I believe I'm hitting some serious limit in the
TCP/IP stack.

The arguments I used in the test were :
connect localhost:21 3 200 100

This runs 100 threads with 200 iterations of the test code that first does a
blocking connect, then a non-blocking connect.

I tried disabling a CPU on my SMP box (dual athlon), and it didn't improve  things.

Seems to me that this might be worth fixing in OS/2 TCP/IP.

Mike Kaply [:mkaply]

Comment 75

•

22 years ago

Unless you have a "real world" scenario, there won't be a chance of getting this
fixed by the TCP/IP team.

Julien Pierre

Assignee

Comment 76

•

22 years ago

Mike,

I believe there are real-world cases for this. If one is running a proxy server
on OS/2, it will act as a client . If you use that proxy to connect to another
server running on the same OS/2 machine as the proxy, then it will run into the
same issues as I have, once the proxy gets heavily loaded.

It is also very common to load-test servers with clients running on the same
system as the server. This bug makes it impossible. I would think that it would
be a very high priority to fix in a produced called "Warp Server for e-business".

Mike Kaply [:mkaply]

Comment 77

•

22 years ago

A dead product called WarpServer for ebusiness.

You set your hopes too high.

Mike Kaply [:mkaply]

Comment 78

•

22 years ago

Do you have a small testcase for this? If not, there is no way to get it fixed.

Julien Pierre

Assignee

Comment 79

•

22 years ago

Attached patch update to the PR_Connect test program — Details — Splinter Review

Julien Pierre

Assignee

Updated

•

22 years ago

Attachment #100066 - Attachment is obsolete: true

Julien Pierre

Assignee

Comment 80

•

22 years ago

I updated the test case.
To run, enable FTPD, TELNETD and LPD , then run :
connect localhost:21 3 200 100
connect localhost:23 3 200 100
connect localhost:515 3 200 100

You may need to run then multiple times.
Eventually, all the daemons will crash.
The system will be out of sockets and network applications will be unable to run
for a while until the local sockets have timed out.

This happens even on a UNI ACP2 kernel.

Julien Pierre

Assignee

Comment 81

•

21 years ago

Demoting to P3.

Priority: P1 → P3

Mikus Grinbergs

Comment 82

•

21 years ago

I'm running OS/2 with TCP/IP 4.02y.  For me this is a very real problem -- I
need to watch the throbber behavior -- if the throbber stops, it appears to mean
that the attempt to connect to a server failed.  I would much rather have an
explicit error message telling me that my previous click on a link was ineffective.

Mike Kaply [:mkaply]

Comment 83

•

20 years ago

There's no Mozilla fix here, and fixes aren't going to be made to OS/2 for this.

Status: REOPENED → RESOLVED

Closed: 23 years ago → 20 years ago

Resolution: --- → WONTFIX

Julien Pierre

Assignee

Comment 84

•

20 years ago

Reopening this because I finally figured out the cause for this PR_Connect issue
. It is a bug in the OS/2 socket library.

The OS/2 TCP/IP library actually modifies the sockaddr structure within the
connect call !

I found this by putting a data change breakpoint in IDEBUG on the same PRNetAddr
structure that I was passing to all my threads doing PR_Connect . And it got hit
in tcpip32.dll
Specifically, the "family" byte gets modified during the execution of the OS/2
"connect" socket call. But it is always reset upon returning from the function.

In the multithreaded case, this caused PR_NETADDR_SIZE to return the wrong
addrlen of 110 bytes (presumably the IPv6 size), which then gets passed to
connect, which fails with EINVAL. This was not easy to debug since
PR_NETADDR_SIZE is a macro. I found it by putting an assertion in MD_CONNECT for
addrlen == 16 ( which is the correct structure size for IPv4)

I was able to workaround the problem in the application program by copying the
PRNetAddr structure before passing it to PR_Connect . This causes the address
structure to be no longer shared accross threads, and then all the connects work
properly .

A workaround for this bug is to copy PRNetAddr to the stack within MD_Connect in
NSPR . It might be a good idea to revisit all the OS/2 socket calls that pass an
address and do the same trick . I will attach a patch just for the connect case.

Status: RESOLVED → REOPENED

Resolution: WONTFIX → ---

Julien Pierre

Assignee

Comment 85

•

20 years ago

Attached patch work around OS/2 connect bug (obsolete) — Details — Splinter Review

Julien Pierre

Assignee

Updated

•

20 years ago

Attachment #167074 - Flags: review?(mkaply)

Mike Kaply [:mkaply]

Comment 86

•

20 years ago

Comment on attachment 167074 [details] [diff] [review]
work around OS/2 connect bug

Javier did this code, so I'd like him to review it

Attachment #167074 - Flags: review?(mkaply) → review?(jhpedemonte)

jhp (no longer active)

Comment 87

•

20 years ago

Comment on attachment 167074 [details] [diff] [review]
work around OS/2 connect bug

Good job!  Open a bug for the other functions that do this.

Attachment #167074 - Flags: review?(jhpedemonte) → review+

Wan-Teh Chang

Comment 88

•

20 years ago

Could you find out how connect() is declared on OS/2?

Julien Pierre

Assignee

Comment 89

•

20 years ago

Wan-Teh,

I checked that the other day, and the sockaddr pointer wasn't declared const .
So technically it may not be a bug. Still quite an unusual thing to do .

Javier,

One would have to first write test programs for all the other OS/2 functions
that take sockaddr* that are usd in NSPR to determine if the problem exists for
each NSPR API . I think it's probably easier to just do the same fix to copy the
structure for all of these functions and be done with it .

Wan-Teh Chang

Comment 90

•

20 years ago

Comment on attachment 167074 [details] [diff] [review]
work around OS/2 connect bug

>+    PRNetAddr localaddr = *addr; /* work around a bug in OS/2 . see bugzilla 100776 */

Julien, please add a one-line description of this bug
in the comment.

Julien Pierre

Assignee

Comment 91

•

20 years ago

Attached patch add comment to previous patch — Details — Splinter Review

Attachment #167074 - Attachment is obsolete: true

Julien Pierre

Assignee

Comment 92

•

20 years ago

I just attached a new patch with the comment . I'm targetting this bug to NSPR
4.5.2 .

I also updated the bug subject to reflect what was actually happening - the
error message originally was about a mozilla browser bug report, but NSPR
doesn't report error messages . 

I only tested that my patch fixes the case where a multi-threaded client
program, the NSS strsclnt tool, connects to a remote server on my LAN . This
case was previously broken .

I did not try to check the behavior against a server on the local machine (or no
server on localhost). That still needs to be tested . Javier, could you please
check that ?

Thanks .

Summary: No error message on failure to connect → PR_Connect fails inexplicably with localhost and in multi-threaded programs

Target Milestone: 4.2 → 4.5.2

Mike Kaply [:mkaply]

Comment 93

•

20 years ago

We don't have any OS/2 machines to test this stuff anymore.

Mike Kaply [:mkaply]

Comment 94

•

20 years ago

I think we should just check in this patch. It shouldn't break anything.

julian, what do you think?

Wan-Teh Chang

Comment 95

•

19 years ago

OS/2 developers: would you like me to check in
this patch?

Julien Pierre

Assignee

Comment 96

•

19 years ago

Yes, please do.

Wan-Teh Chang

Updated

•

19 years ago

Target Milestone: 4.5.2 → 4.6

Wan-Teh Chang

Comment 97

•

19 years ago

I checked in Julien's patch (attachment 168262 [details] [diff] [review],
after editing) on the NSPR trunk (NSPR 4.6) and
NSPRPUB_PRE_4_2_CLIENT_BRANCH (Mozilla 1.8 Beta 2).

Good job, Julien, for tracking this down (comment
84).

Status: REOPENED → RESOLVED

Closed: 20 years ago → 19 years ago

Resolution: --- → FIXED

os2sock.c file 23 years ago Javier Pedemonte 22.48 KB, text/plain		Details
os2poll.c file 23 years ago Javier Pedemonte 14.74 KB, text/plain		Details
nsprpub\configure.in patch 23 years ago Javier Pedemonte 656 bytes, patch	wtc : review+	Details \| Diff \| Splinter Review
patch 23 years ago jhp (no longer active) 23.82 KB, patch		Details \| Diff \| Splinter Review
A patch from wtc, which implements the solution he outlined. 23 years ago Wan-Teh Chang 2.49 KB, patch		Details \| Diff \| Splinter Review
another patch 23 years ago jhp (no longer active) 24.26 KB, patch		Details \| Diff \| Splinter Review
patch w/o exception handling in _PR_MD_CONNECT 23 years ago jhp (no longer active) 23.28 KB, patch		Details \| Diff \| Splinter Review
patch using unix socket code 23 years ago jhp (no longer active) 54.43 KB, patch		Details \| Diff \| Splinter Review
removed _PR_MD_CONNECT hack 23 years ago jhp (no longer active) 54.17 KB, patch		Details \| Diff \| Splinter Review
test case for PR_Connect and build scripts 23 years ago Julien Pierre 20.56 KB, application/octet-stream		Details
non-blocking connect test case 23 years ago Julien Pierre 3.31 KB, text/plain		Details
update to test case 23 years ago Julien Pierre 3.86 KB, text/plain		Details
fix memory leak 23 years ago jhp (no longer active) 54.49 KB, patch	mkaply : review+	Details \| Diff \| Splinter Review
multi-threaded PR_Connect program 22 years ago Julien Pierre 7.40 KB, text/plain		Details
update to the PR_Connect test program 22 years ago Julien Pierre 6.62 KB, patch		Details \| Diff \| Splinter Review
work around OS/2 connect bug 20 years ago Julien Pierre 2.21 KB, patch	jhpedemonte : review+	Details \| Diff \| Splinter Review
add comment to previous patch 20 years ago Julien Pierre 1.46 KB, patch		Details \| Diff \| Splinter Review