Closed
Bug 100776
Opened 23 years ago
Closed 19 years ago
PR_Connect fails inexplicably with localhost and in multi-threaded programs
Categories
(NSPR :: NSPR, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
4.6
People
(Reporter: pedemonte, Assigned: julien.pierre)
References
()
Details
Attachments
(5 files, 12 obsolete files)
20.56 KB,
application/octet-stream
|
Details | |
3.86 KB,
text/plain
|
Details | |
54.49 KB,
patch
|
mkaply
:
review+
|
Details | Diff | Splinter Review |
6.62 KB,
patch
|
Details | Diff | Splinter Review | |
1.46 KB,
patch
|
Details | Diff | Splinter Review |
When trying to connect to http://localhost (with no server running), we should get a dialog that says 'connection was refused'. This works on win32 and linux, but on OS/2, nothing happens. The throbber runs for a split second then immediately stops, but no error message is produced.
Reporter | ||
Comment 1•23 years ago
|
||
This seems to be due to an error in the OS/2 TCP/IP implementation. When doing a select() on localhost, both win32 and linux will set the exception bit for that socket, to let you know that you cannot connect to localhost. On OS/2, however, we set the read or write bit, but not the exception bit. So mozilla tries to read from it and fails, but does not pop up the dialog box. My workaround, then, was to call getsockopt when we check the read or write bits. If getsockopt says that there is an error for that socket, then we treat it as an exception. Otherwise, we treat it as a read or write.
Reporter | ||
Comment 2•23 years ago
|
||
Reporter | ||
Comment 3•23 years ago
|
||
Reporter | ||
Comment 4•23 years ago
|
||
Although the problem was mainly in os2poll.c, I also cleaned up os2sock.c, which had a similar fix for select(). I attached the whole file, though, since there were many changes.
Reporter | ||
Comment 5•23 years ago
|
||
Reporter | ||
Comment 6•23 years ago
|
||
Also, added patch to make VACPP builds use OS/2 version of select, while EMX builds still use the BSD select().
Comment 7•23 years ago
|
||
Hi Javier, The function you need to fix is SocketConnectContinue() in mozilla/nsprpub/pr/src/io/prsocket.c. The OS/2 implementation for that function looks like this: if (out_flags & PR_POLL_EXCEPT) { int len = sizeof(err); if (getsockopt(osfd, (int)SOL_SOCKET, SO_ERROR, (char *) &err, &len) == SOCKET_ERROR) { _PR_MD_MAP_GETSOCKOPT_ERROR(WSAGetLastError()); return PR_FAILURE; } if (err != 0) { _PR_MD_MAP_CONNECT_ERROR(err); } else { PR_SetError(PR_UNKNOWN_ERROR, 0); } return PR_FAILURE; } It seems to me that we should remove the "if (out_flags & PR_POLL_EXCEPT)".
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Comment 8•23 years ago
|
||
wtc: I just tried messing around with that code, and I was not able to make it work as a fix for this problem; I could continue browsing, but the problem still existed. Besides, we want to get this fixed in TCP/IP, and if it is fixed, the code I attached should still work properly.
Comment 9•23 years ago
|
||
wtc: Can we get this one into NSPR? See Javier's last comments. Thanks
Comment 10•23 years ago
|
||
Javier wrote: > This seems to be due to an error in the OS/2 TCP/IP implementation. When doing > a select() on localhost, both win32 and linux will set the exception bit for > that socket, to let you know that you cannot connect to localhost. This is not true. Unix will mark the socket as both readable and writable if the non-blocking connect failed. Unix does not use the exception bit for this purpose. Winsock is the only socket implementation that uses the exception bit to report a non-blocking connect failure. > On OS/2, however, we set the read or write bit, but not the exception bit. This is consistent with the socket implementation on Unix. This is NOT a bug in the OS/2 TCP/IP implementation, unless OS/2 needs to be like Winsock. Javier, what you need to fix is _PR_MD_CONNECT() and SocketConnectContinue(). I also need you to attach diffs as opposed to whole files.
Updated•23 years ago
|
Attachment #50093 -
Flags: needs-work+
Updated•23 years ago
|
Attachment #50094 -
Flags: needs-work+
Comment 11•23 years ago
|
||
Comment on attachment 50095 [details] [diff] [review] nsprpub\configure.in patch This patch is fine by me. Michael, could you review this patch too?
Attachment #50095 -
Flags: review+
Comment 12•23 years ago
|
||
wtc: I changed the files according to your suggestions. I removed the code that handled exceptions differently in os2sock.c and os2poll.c, and made the XP_OS2 section in prsocket.c be more like the XP_UNIX section. Hoever, this does not fix the problem of connecting to localhost. I get no error dialog, it just stops loading. I debugged this and it turns out that SocketConnectContinue never gets hit when trying to load localhost, although _PR_MD_CONNECT and _PR_MD_PR_POLL do get called. So where do you suggest I put the exception checking code?
Comment 13•23 years ago
|
||
You need to call getsockopt() in SocketConnectContinue (if fd->secret->nonblocking is true) and _PR_MD_CONNECT (if fd->secret->nonblocking is false).
Comment 14•23 years ago
|
||
Comment 15•23 years ago
|
||
wtc: I've created a patch based on your suggestions. I now call getsockopt() in SocketConnectContinue and _PR_MD_CONNECT. However, I still get no error message when connecting to localhost. I think I also need to be calling getsockopt in PR_POLL. Am I missing something here?
Comment 16•23 years ago
|
||
Comment 17•23 years ago
|
||
Javier, I implemented my solution against NSPR 4.1.2. I tried it on Julien's OS/2 machine here and I was able to get the error code from connect whether I connected to a remote machine or the localhost. Since Julien has not installed autoconf on his OS/2 machine, I can't build the version of NSPR that the Mozilla client is using. I modified my solution for the NSPR client branch, which is the patch I just attached (attachment 52007 [details] [diff] [review]). Since I can't build it, it may need minor tweaking. But it shows you in concrete terms what I think is the right solution. Please let me know if my patch works for you. I am also wondering if you and I are using different Winsock implementations.
Comment 18•23 years ago
|
||
No, this still won't work for the issue at hand. Your patch fixes the exception case in PR_MD_CONNECT. However, when contacting 'localhost', it goes through the PR_POLL code. This is the code that needs to be fixed in this case. I need to be calling getsockopt in PR_POLL, when checking the read/write flags. Also, in regards to your changes, is it correct to completely remove the exception flag checking? Is there no exception other than non-blocking connection failure that could be set?
Comment 19•23 years ago
|
||
Javier, I will have to give up on giving you advice on this problem. I am not familiar with OS/2. Please find an OS/2 developer (for example, mkaply) to review your patch, and I'll check it in for you. Thanks.
Assignee | ||
Comment 20•23 years ago
|
||
Javier, I'd like to understand this bug better. What code is falling back to PR_Poll when connecting to localhost on OS/2 ? Is that a part of NSPR or some code in the browser ? Wan-Teh - it would appear that the NSPR socket code is using a Winsock library that ships with OS/2, so in that sense it is supposed to behave like Windows. I'm not clear why the decision was made to use that vs the BSD-like socket API that is native to OS/2. Either API can be used but I think the Winsock.dll fro OS/2 is an emulation library layered on top of the other socket API.
Comment 21•23 years ago
|
||
Julien: It seems to me that the socket code that is actually trying to contact the 'localhost' server is in PR_POLL. I have updated PR_MD_CONNECT and SocketConnectionContinue properly, but I was still seeing the problem. It was only after I updated os2poll.c likewise that I saw the 'connection was refused' error message. In regards to your winsock statement, I believe you are referring to the code segments that are #ifndef BSD_SELECT. OS/2 supports two version of select: one is the BSD version, and one is a native OS/2 version, which is more efficient. I have coded both versions, but for the Visual Age compiled mozilla we used the native select, while the EMX/GCC compiler uses the more common BSD version.
Assignee | ||
Comment 22•23 years ago
|
||
Javier, I wasn't referring to select but to the fact that you use things in your code snippets like WSAGetLastError() which are part of the Winsock emulation library of OS/2; ie from winsock.h. When I wrote socket apps for OS/2 before I used more BSD-like functions . Eg for the error codes it was sock_errno() , which is from sys/socket.h . These two headers - winsock.h and sys/socket.h - which both come with the OS/2 toolkit, define two independent sets of socket APIs for OS/2. I believe it is preferable to use the socket API in sys/socket.h over the one from winsock.h . It is closer to the native OS/2 stack which is a port from the BSD TCP/IP stack in AIX in the first place.
Comment 23•23 years ago
|
||
Julien, We don't use any Winsock code at all for the OS/2 port. I'm not sure what you are referring to.
Assignee | ||
Comment 24•23 years ago
|
||
Mike, I was referring to the code quoted by Wan-Teh above on 9/20 at 10:55 which included a call to WSAGetLastError() in prsocket.c and was referred to as "OS/2 implementation". However I went back and checked the tree and that code is in a block only for WIN16 or WIN32. The XP_OS/2 code is below and has the proper implementation which calls sock_errno . Sorry for the confusion. I think Wan-Teh probably meant that was the Windows implementation.
Comment 25•23 years ago
|
||
At 2001-10-03 21:24, I wrote: > Javier, > > [...] > > I am also wondering if you and I are using > different Winsock implementations. Please disregard that comment. I was just passing along a comment of Julien's.
Comment 26•23 years ago
|
||
Comment 27•23 years ago
|
||
This is my latest working patch. Everything works now. I do have some questions though. wtc, in your patch, you removed the exceptions check in PR_MD_CONNECT and replaced it with a call to getsockopt. In my patch, I still have the exceptions check, but when checking if the write bit is set, I call getsockopt to see if there is an error on the socket. Do I still need the exception check when using BSD-socket model? In this patch, I have now added a call to getsockopt when checking the read and write bits. If they have an error on the socket, I treat it as an exception.
Comment 28•23 years ago
|
||
Javier, Although I said earlier that I have given up on this bug, I still couldn't resist the temptation to look at your latest patch. You wrote: > Do I still need the exception check when using > BSD-socket model? No. BSD-socket model does not use the exception flag. Only Winsock uses the exception flag. BSD-socket model would mark the socket as both readable and writable if the non-blocking connect failed. > In this patch, I have now added a call to getsockopt > when checking the read and write bits. If they have > an error on the socket, I treat it as an exception. This change to PR_Poll is what I have been objecting to. PR_Poll should not call getsockopt. It is SocketConnectContinue and _PR_MD_CONNECT that should call getsockopt. Are you saying that the BSD-socket on OS/2 does not mark the socket as readable and writable when the non-blocking connect failed? That would be a bug.
Comment 29•23 years ago
|
||
Updated•23 years ago
|
Attachment #50093 -
Attachment is obsolete: true
Updated•23 years ago
|
Attachment #50094 -
Attachment is obsolete: true
Comment 30•23 years ago
|
||
wtc: OK, so I removed the exception code in _PR_MD_CONNECT, and wrote up PR_POLL like it was previously. Now, if I try to connect to a nonexistant URL (such as www.lalksdfj.com), then I correctly get an error dialog. However, I still do not get a dialog to pop up if I try to connect to "localhost". I am still seeing the same thing. Is this handled differently for some reason?
Updated•23 years ago
|
Attachment #51869 -
Attachment is obsolete: true
Updated•23 years ago
|
Attachment #52275 -
Attachment is obsolete: true
Updated•23 years ago
|
Priority: -- → P1
Target Milestone: --- → 4.2
Comment 31•23 years ago
|
||
I went back and took as much of the unix code as I could, in both os2poll.c and os2sock.c. While this still does not fix the localhost issue, I like this code better than what I had previously because it uses the enhancements found inthe unix code. Plus, it looks much cleaner. wtc: I am trying to debug the localhost issue on linux, but it uses pthreads. How can I build nspr on linux without pthreads?
Attachment #50095 -
Attachment is obsolete: true
Attachment #52007 -
Attachment is obsolete: true
Attachment #52952 -
Attachment is obsolete: true
Comment 32•23 years ago
|
||
Attachment #64044 -
Attachment is obsolete: true
Assignee | ||
Comment 33•23 years ago
|
||
I have created a test program for PR_Connect . I couldn't get the NSPR test build environment to work right so I used NSS coreconf build. 1. Using a regular browser build environment, create a directory and unzip this file. 2. Run build.cmd . This should pull the latest NSPR and NSS and build them, including all tools. 3. go to mozilla/security/nss/cmd/connect and run gmake 4. run the connect program from mozilla/dist/os22.45_icc_dbg.obj/bin eg. connect localhost:1234 should show an error if you have telnetd listening , connect localhost:23 should succeed if you don't have an httpd listening, connect localhost:80 should fail As far as I can tell, PR_Connect works as expected in all cases, so the bug has to be in the browser code. If you don't want to do the whole NSS build, just use the connect.exe from the zip and run it against your existing NSPR DLLs.
Comment 34•23 years ago
|
||
Julien "Mad Brain" Pierre wants to own this bug. His passion about OS/2 really touched me.
Assignee: wtc → jpierre
Status: ASSIGNED → NEW
Assignee | ||
Comment 35•23 years ago
|
||
The test case I attached was using PR_Connect in blocking mode, which is the default. As Wan-Teh told me, the browser uses non-blocking mode. I'm attaching an updated connect.cpp that tests non-blocking mode. I have reproduced the problem using that test case, but have not yet traced through it.
Assignee | ||
Comment 36•23 years ago
|
||
Assignee | ||
Comment 37•23 years ago
|
||
Wan-Teh told me there is a test for this in the NSPR tests, nbconn . I was not able to get it to build however. See bug 119098 .
Comment 38•23 years ago
|
||
I ran "nbconn localhost", and I got this on both OS/2 and Linux (with classic nspr threads): host: localhost.austin.ibm.com Connect in progress PR_Poll returns 1 PR_POLL_WRITE PR_GetConnectStatus: connect failed: (-5981, 10061) PASS What does this mean?
Comment 39•23 years ago
|
||
Actually, for Linux (classic nspr threads), I get a different error code: PR_GetConnectStatus: connect failed: (-5981, 111)
Comment 40•23 years ago
|
||
host: localhost.austin.ibm.com This is the host name. Connect in progress This means the non-blocking PR_Connect() failed with PR_IN_PROGRESS_ERROR, which is the expected error. We are supposed to call PR_Poll() next to wait for the completion of the non-blocking connect. PR_Poll returns 1 This means PR_Poll returns successfully. PR_POLL_WRITE This means PR_Poll reports that the socket is writable, which means the non-blocking connect is complete. We need to call PR_GetConnectStatus() to find out the connect completion status. PR_GetConnectStatus: connect failed: (-5981, 10061) This means the non-blocking connect failed. NSPR error code is -5981 and OS error code is 10061. PASS The test passed.
Comment 41•23 years ago
|
||
The Linux (with pthreads) test case outputs this: host: localhost.austin.ibm.com Connect in progress PR_Poll returns 1 PR_POLL_ERR PR_GetConnectStatus: connect failed: (-5981, 111) PASS
Comment 42•23 years ago
|
||
So the issue seems to be this: one Linux w/ pthreads and Win32 (and others), PR_Poll returns a PR_POLL_ERR, PR_POLL_EXCEPT, or PR_POLL_HUP, all which lead to an error condition in the browser (see http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsSocketTrans port.cpp lines#872 and http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsSocketTrans port.cpp#998). However, on OS/2 and Linux w/ classic NSPR threads, PR_POLL returns a PR_POLL_WRITE (according to the testcase). The test case knows that the connection failed, though, because it calls PR_GetConnectStatus(). We need to add calls to this function in the browser code. I'll try this now.
Comment 43•23 years ago
|
||
OK, here is the deal. The PR_Poll() call only tells you that the non-blocking connect has completed. But it doesn't tell you whether it was a success or a failure. In particular, it doesn't tell you the reason (error code) of the failure. You need to call PR_GetConnectStatus() to find out the connect success/fail status and the error code. Actually, you should call PR_ConnectContinue() instead of PR_GetConnectStatus() in NSPR 4.1 or newer.
Comment 44•23 years ago
|
||
I just did a LXR search in the mozilla source tree for PR_ConnectContinue and PR_GetConnectStatus. I am surprised to find out that mozilla is not calling either function. Two conclusions. 1. Mozilla cannot determine the success/failure status of the non-blocking connect in a portable manner. The determination of the success/failure status of non-blocking connect is platform depenedent, as Javier has found out. The PR_GetConnectStatus() function (and its replacement PR_ConnectContinue()) was designed to hide the non-portable code. 2. When the non-blocking connect fails, Mozilla doesn't have the error code.
Comment 45•23 years ago
|
||
Right now, the relevent PR_Poll call is in nsSocketTransportService.cpp#469. There is a for loop after that call that loops through each of the sockets. Is this where we would call PR_ConnectContinue()? Or would we call it outside the loop, right after PR_Poll?
Assignee | ||
Comment 46•23 years ago
|
||
Modified test case to take a timeout parameter, as well as to try both blocking and non-blocking connections.
Assignee | ||
Comment 47•23 years ago
|
||
I just attached a test case that tries both blocking and non-blocking modes. Also, the timeout to PR_Connect is configurable and is passed in both tests. Results : 1) testing on localhost with no daemon and 5 s timeout. This should fail [D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect localhost:1234 5 Blocking connection to localhost:1234 failed. NSPR error : -5981. OS error : 10061. Non-blocking connection to localhost:1234 successful. => bug in non-blocking code 2) testing on localhost and ftp port, with ftpd listening and 5 s timeout. This should succeed [D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect localhost:21 5 Blocking connection to localhost:21 successful. Non-blocking connection to localhost:21 successful. => This is OK 3) testing on www.ibm.com:80 and 5 s timeout. This should succeed [D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect www.ibm.com:80 5 Blocking connection to www.ibm.com:80 successful. Non-blocking connection to www.ibm.com:80 successful. [D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin] => this is OK 4) This should fail [D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect www.ibm.com:81 5 Blocking connection to www.ibm.com:81 failed. NSPR error : -5990. OS error : 0. PR_Poll error. Non-blocking connection to www.ibm.com:81 failed. NSPR error : -5934. OS error : 10036. => OK 5) Test on www.ibm.com:80 using zero timeout . This should fail in blocking mode (0 is not enough time to connect over the Internet), but pass in non-blocking mode as timeout is not supposed to affect the success/failure in non-blocking mode [D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect www.ibm.com:80 0 Blocking connection to www.ibm.com:80 failed. NSPR error : -5990. OS error : 0. PR_Poll error. Non-blocking connection to www.ibm.com:80 failed. NSPR error : -5934. OS error : 10036. => bug in non-blocking mode 6) connecting to my Sun on port 80 on LAN. should fail as I have no HTTPD listening on it [D:\nss\mozilla\dist\os22.45_icc_dbg.obj\bin]connect strange:80 0 Blocking connection to strange:80 failed. NSPR error : -5990. OS error : 0. Non-blocking connection to strange:80 successful. => bug in non-blocking mode From these tests I conclude that the non-blocking mode is broken about half the time, and the blocking mode is working as designed.
Assignee | ||
Updated•23 years ago
|
Attachment #64239 -
Attachment is obsolete: true
Assignee | ||
Comment 48•23 years ago
|
||
I built my test case on Solaris. I got all the expected results except for 5) . I may have been incorrect in stating that it should work in non-blocking mode. I am using the same timeout for PR_Connect and PR_Poll, and in test 5), that is zero. The result is a failure both in blocking and non-blocking modes on Solaris when going to www.ibm.com:80 (which passes with a non-zero timeout). It would seem from these results that this bug is specific to OS/2.
Comment 49•23 years ago
|
||
Julien: Using the patch in this bug, I get a different result for test (1). Here is my output: 1) [D:\builds\trunk\mozilla\obj\dist\bin]julien localhost:1234 5 Blocking connection to localhost:1234 failed. NSPR error : -5981. OS error : 10061. PR_Connect error. Non-blocking connection to localhost:1234 failed. NSPR error : -5981. OS error : 10061.
Comment 50•23 years ago
|
||
Who handles the networking code on the browser side? Could we get that person involved in this bug? According to the NSPR docs, we should be calling PR_GetConnectStatus()/PR_ConnectContinue() after doing a PR_Poll, but this is not so in the browser.
Assignee | ||
Comment 51•23 years ago
|
||
Javier, I tried your NSPR patch with my test case. It fixes problems in my test cases 1) as you reported, as well as 6). Behavior in test 5) is not what I would expect - both blocking & non-blocking tests fail - but it is the same behavior as Solaris. I'd say your patch looks good from this end. Wan-Teh, what do you think the behavior should be for a non-blocking PR_Connect with 0 timeout and subsequent PR_Poll with 0 timeout, if there is a daemon listening on the host/port we are trying to connect to ? This is what test case 5 does. I assumed it should succeed since it works with a 5s timeout and a 0 timeout should not change it - but apparently I was wrong. Perhaps the timeout doesn't matter on PR_Connect but is still required on PR_Poll ? Thanks.
Comment 52•23 years ago
|
||
A PR_Poll with 0 timeout may not wait long enough for the completion of the non-blocking connect. It is reasonable for this PR_Poll call to return 0, which means it times out.
Comment 53•23 years ago
|
||
Attachment #64150 -
Attachment is obsolete: true
Comment 54•23 years ago
|
||
I am a little confused aboutsome code. In the part of the socket code in unix.c (http://lxr.mozilla.org/seamonkey/source/nsprpub/pr/src/md/unix/unix.c #1142), there is an if statement for "if(!nonblocking && EINPROGRESS)". What does "fd->secret->nonblocking" mean? If it is true, is the socket blocking or nonblocking? I ask because just above this code there is a comment that if "The NSPR socket is nonblocking and connect() fails with EINPROGRESS", then we do select and getsockopt. Basically, the if statement and the comment seem to be saying different things. Is one of these incorrect, or is my understanding of the variable 'nonblocking' incorrect?
Comment 55•23 years ago
|
||
I have created bug 120181 to handle my solution in necko. The patch attached to that bug, along with the patch here fixes the networking issues I was seeing.
Comment 56•23 years ago
|
||
wtc: Can you check in the latest patch here?
Comment 57•23 years ago
|
||
Javier, Good news. Someone else independently discovered another problem caused by the lack of PR_ConnectContinue calls in nsSocketTransportService.cpp. See bug 106188. I suggest that you wait until bug 106188 is resolved to finalize your patch for this bug.
Comment 58•23 years ago
|
||
Comment on attachment 64905 [details] [diff] [review] fix memory leak Javier, could you ask your colleague Michael Kaply to review this patch too? I won't be able to check in the trivial change to mozilla/configure.in in your patch without a review by Chris Seawood. I think that change is fine.
Comment 59•23 years ago
|
||
Comment on attachment 64905 [details] [diff] [review] fix memory leak r=mkaply
Attachment #64905 -
Flags: review+
Comment 60•23 years ago
|
||
r=cls on the configure.in changes.
Comment 61•23 years ago
|
||
OK, I'm ready to check in Javier's latest patch now, but found that the Mozilla tree is closed except for approved checkins. I don't know how to get the checkin approval for this bug.
Comment 62•23 years ago
|
||
Comment on attachment 64905 [details] [diff] [review] fix memory leak I've checked in this patch into the tip of NSPR. I will check it into the client patch of NSPR after someone gets the approval for me or when the Mozilla tree is open for general checkins again.
Comment 63•23 years ago
|
||
I'll send a note to drivers@mozilla.org. 0.9.8 hasn't branched yet, so this should be able to go on the client branch before that happens.
Comment 64•23 years ago
|
||
wtc, please go ahead and check the fix into the client branch too, a=brendan@mozilla.org. Thanks, /be
Comment 65•23 years ago
|
||
Comment on attachment 64905 [details] [diff] [review] fix memory leak OK, I just checked this patch into the client branch of NSPR. I also checked in the change to mozilla/configure.in, but I don't know how to generate mozilla/configure. I understand that mozilla/configure will be automatically updated and checked in by some script of Leaf's. If you know how to manually generate mozilla/configure, you can also go ahead and commit the change.
Comment 66•23 years ago
|
||
Marked the bug fixed.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Comment 67•23 years ago
|
||
Just wanted to note that Leaf's script has automatically updated mozilla/configure.
Assignee | ||
Comment 68•23 years ago
|
||
Re-opening this bug. I have found there are still problems with PR_Connect on OS/2. I used the strsclnt tool from NSS to stress selfserv, an HTTPS (SSL HTTP) web server running on the local machine on port 2000. As soon as I set -t 2 in strsclnt to have multiple client threads, I got bombarded with PR_Connect failures, as follows : strsclnt: PR_Connect returned error -5994: Some unknown error has occurred. strsclnt: PR_Connect returned error -5994: Some unknown error has occurred. strsclnt: PR_Connect returned error -5994: Some unknown error has occurred. strsclnt: PR_Connect returned error -5994: Some unknown error has occurred. strsclnt: PR_Connect returned error -5994: Some unknown error has occurred. strsclnt: PR_Connect returned error -5994: Some unknown error has occurred. strsclnt: PR_Connect returned error -5994: Some unknown error has occurred. The connect errors occur regardless of whether I connect to the server using 127.0.0.1:2000 or the machine's actual IP address on port 2000. They occur as long as the connection is to a local daemon. When connecting to a remote server, multithreaded PR_Connect works fine. If I set -t 1 to make strsclnt single-threaded, then PR_Connect works fine ... I don't know if this is a bug in the OS/2 implementation of PR_Connect. It might well be a bug in the OS/2 TCP/IP stack. I reproduced this on the IBM TCP/IP 4.3 stack : [e:\dev\nss\mozilla\dist\os22.45_icc_opt.obj\bin]inetver Version numbers of TCP/IP protocol drivers: SOCKETS.SYS: 6.3000 AFOS2.SYS: 6.3000 AFINET.SYS: 6.3001 I don't know of any other OS/2 application that does multithreaded connect, so I can't test if this is a stack bug or not. A simple standalone test case could probably be written to check for that. Mike, I would suggest you inquire the guys at Austin about this. FYI, the test command I used for this test was : strsclnt -p 2000 -D -d . -N -t 1 -q -c 1000000 localhost You can build strsclnt by building NSS standalone (gmake nss_build_all in mozilla/security/nss).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 69•23 years ago
|
||
I tried this on both my OS/2 machines, which were running the exact same version of OS/2 and TCP/IP stack. This problem did not occur on the AMD K6-2 550 MHz box. But it did occur on the dual CPU AMD Athlon MP 1500+ box. So I disabled the second CPU using MPCPUMON, and reran this test. This time I got no PR_Connect failures . This is definitely an OS/2 TCP/IP stack bug, and it is specific to SMP systems.
Assignee | ||
Comment 70•23 years ago
|
||
Perhaps I should open a different bug against NSPR . In strsclnt we use blocking NSPR sockets, not non-blocking as mozilla is using, so the behavior is expected to be different. I'd like to go back on the hypothesis that this is an OS/2 staak bug. Do we use any locks in PR_Connect () ? Maybe we have some SMP-unsafe locking mechanism in use. I don't see any lock in os2sock.c though. But maybe some of the function it can call do, like PR_Sleep. FYI, I retested it, and the multithreaded connect fails on all sites, not just local server ports, when both CPUs are enabled in OS/2, and the test succeeds when the 2nd CPU is disabled. I just hadn't tried the local vs remote client from the same SMP OS/2 system when I reopened this bug - I was running the local test on the SMP box and remote test on the UNI box. Turns out local vs remote isn't a factor for this, only SMP is.
Comment 71•22 years ago
|
||
PR_Lock and PR_Unlock are by definition threadsafe. I'm not sure what to do with this bug.
Assignee | ||
Comment 72•22 years ago
|
||
Mike, Please leave the bug opened. When I get time, I will be running some more tests on this, especially now that I got the SSL server code running on OS/2. I believe there are several TCP/IP stack bugs that affect both client and server code. When I got the client and servers to work on OS/2, I was only able to test very light TCP connection loads successfully. High loads would always result in lots of connection errors. This is very likely caused by OS/2 bugs, but I still need to find the specifics, whether the bugs are only on SMP or in UNI too, and isolate the bugs into test cases ... This will take some more effort.
Assignee | ||
Comment 73•22 years ago
|
||
This is an updated version of the test program that can run multithreaded, and with multiple iterations.
Assignee | ||
Comment 74•22 years ago
|
||
I updated the test case to run multithreaded. I then ran it and tried to connect on daemons listening on my localhost - lpd and ftpd. It had a very bad effect on the system. The test program and all my daemons running on the system crashed, even the ones I wasn't connecting to (telnetd). The OS became very unstable. I believe I'm hitting some serious limit in the TCP/IP stack. The arguments I used in the test were : connect localhost:21 3 200 100 This runs 100 threads with 200 iterations of the test code that first does a blocking connect, then a non-blocking connect. I tried disabling a CPU on my SMP box (dual athlon), and it didn't improve things. Seems to me that this might be worth fixing in OS/2 TCP/IP.
Comment 75•22 years ago
|
||
Unless you have a "real world" scenario, there won't be a chance of getting this fixed by the TCP/IP team.
Assignee | ||
Comment 76•22 years ago
|
||
Mike, I believe there are real-world cases for this. If one is running a proxy server on OS/2, it will act as a client . If you use that proxy to connect to another server running on the same OS/2 machine as the proxy, then it will run into the same issues as I have, once the proxy gets heavily loaded. It is also very common to load-test servers with clients running on the same system as the server. This bug makes it impossible. I would think that it would be a very high priority to fix in a produced called "Warp Server for e-business".
Comment 77•22 years ago
|
||
A dead product called WarpServer for ebusiness. You set your hopes too high.
Comment 78•22 years ago
|
||
Do you have a small testcase for this? If not, there is no way to get it fixed.
Assignee | ||
Comment 79•22 years ago
|
||
Assignee | ||
Updated•22 years ago
|
Attachment #100066 -
Attachment is obsolete: true
Assignee | ||
Comment 80•22 years ago
|
||
I updated the test case. To run, enable FTPD, TELNETD and LPD , then run : connect localhost:21 3 200 100 connect localhost:23 3 200 100 connect localhost:515 3 200 100 You may need to run then multiple times. Eventually, all the daemons will crash. The system will be out of sockets and network applications will be unable to run for a while until the local sockets have timed out. This happens even on a UNI ACP2 kernel.
Comment 82•21 years ago
|
||
I'm running OS/2 with TCP/IP 4.02y. For me this is a very real problem -- I need to watch the throbber behavior -- if the throbber stops, it appears to mean that the attempt to connect to a server failed. I would much rather have an explicit error message telling me that my previous click on a link was ineffective.
Comment 83•20 years ago
|
||
There's no Mozilla fix here, and fixes aren't going to be made to OS/2 for this.
Status: REOPENED → RESOLVED
Closed: 23 years ago → 20 years ago
Resolution: --- → WONTFIX
Assignee | ||
Comment 84•20 years ago
|
||
Reopening this because I finally figured out the cause for this PR_Connect issue . It is a bug in the OS/2 socket library. The OS/2 TCP/IP library actually modifies the sockaddr structure within the connect call ! I found this by putting a data change breakpoint in IDEBUG on the same PRNetAddr structure that I was passing to all my threads doing PR_Connect . And it got hit in tcpip32.dll Specifically, the "family" byte gets modified during the execution of the OS/2 "connect" socket call. But it is always reset upon returning from the function. In the multithreaded case, this caused PR_NETADDR_SIZE to return the wrong addrlen of 110 bytes (presumably the IPv6 size), which then gets passed to connect, which fails with EINVAL. This was not easy to debug since PR_NETADDR_SIZE is a macro. I found it by putting an assertion in MD_CONNECT for addrlen == 16 ( which is the correct structure size for IPv4) I was able to workaround the problem in the application program by copying the PRNetAddr structure before passing it to PR_Connect . This causes the address structure to be no longer shared accross threads, and then all the connects work properly . A workaround for this bug is to copy PRNetAddr to the stack within MD_Connect in NSPR . It might be a good idea to revisit all the OS/2 socket calls that pass an address and do the same trick . I will attach a patch just for the connect case.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Assignee | ||
Comment 85•20 years ago
|
||
Assignee | ||
Updated•20 years ago
|
Attachment #167074 -
Flags: review?(mkaply)
Comment 86•20 years ago
|
||
Comment on attachment 167074 [details] [diff] [review] work around OS/2 connect bug Javier did this code, so I'd like him to review it
Attachment #167074 -
Flags: review?(mkaply) → review?(jhpedemonte)
Comment 87•20 years ago
|
||
Comment on attachment 167074 [details] [diff] [review] work around OS/2 connect bug Good job! Open a bug for the other functions that do this.
Attachment #167074 -
Flags: review?(jhpedemonte) → review+
Comment 88•20 years ago
|
||
Could you find out how connect() is declared on OS/2?
Assignee | ||
Comment 89•20 years ago
|
||
Wan-Teh, I checked that the other day, and the sockaddr pointer wasn't declared const . So technically it may not be a bug. Still quite an unusual thing to do . Javier, One would have to first write test programs for all the other OS/2 functions that take sockaddr* that are usd in NSPR to determine if the problem exists for each NSPR API . I think it's probably easier to just do the same fix to copy the structure for all of these functions and be done with it .
Comment 90•20 years ago
|
||
Comment on attachment 167074 [details] [diff] [review] work around OS/2 connect bug >+ PRNetAddr localaddr = *addr; /* work around a bug in OS/2 . see bugzilla 100776 */ Julien, please add a one-line description of this bug in the comment.
Assignee | ||
Comment 91•20 years ago
|
||
Attachment #167074 -
Attachment is obsolete: true
Assignee | ||
Comment 92•20 years ago
|
||
I just attached a new patch with the comment . I'm targetting this bug to NSPR 4.5.2 . I also updated the bug subject to reflect what was actually happening - the error message originally was about a mozilla browser bug report, but NSPR doesn't report error messages . I only tested that my patch fixes the case where a multi-threaded client program, the NSS strsclnt tool, connects to a remote server on my LAN . This case was previously broken . I did not try to check the behavior against a server on the local machine (or no server on localhost). That still needs to be tested . Javier, could you please check that ? Thanks .
Summary: No error message on failure to connect → PR_Connect fails inexplicably with localhost and in multi-threaded programs
Target Milestone: 4.2 → 4.5.2
Comment 93•20 years ago
|
||
We don't have any OS/2 machines to test this stuff anymore.
Comment 94•20 years ago
|
||
I think we should just check in this patch. It shouldn't break anything. julian, what do you think?
Comment 95•19 years ago
|
||
OS/2 developers: would you like me to check in this patch?
Assignee | ||
Comment 96•19 years ago
|
||
Yes, please do.
Updated•19 years ago
|
Target Milestone: 4.5.2 → 4.6
Comment 97•19 years ago
|
||
I checked in Julien's patch (attachment 168262 [details] [diff] [review], after editing) on the NSPR trunk (NSPR 4.6) and NSPRPUB_PRE_4_2_CLIENT_BRANCH (Mozilla 1.8 Beta 2). Good job, Julien, for tracking this down (comment 84).
Status: REOPENED → RESOLVED
Closed: 20 years ago → 19 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•