Closed Bug 141614 Opened 23 years ago Closed 17 years ago

Crash in [@SocketConnect]

Categories

(NSPR :: NSPR, defect, P3)

x86
BeOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sergei_d, Assigned: cls)

References

Details

(Keywords: crash, qawanted)

Crash Data

Attachments

(1 file, 6 obsolete files)

This happens AFAIK on net-server builds. Here are two bug reports: http://bezilla.org/modules.php?op=modload&name=Forum&file=viewtopic&topic=18&forum=7&7 and http://bezilla.org/modules.php?op=modload&name=Forum&file=viewtopic&topic=14&forum=7&0 And seems related to net-server issues. Also it seems that remedy is decreasing of connection numbers in file netwerk/base/src/SocketTransportService.h Something in like this: #if defined(XP_BEOS) && !defined(BONE_VERSION) #define MAX_OPEN_CONNECTIONS 16 #else #define MAX_OPEN_CONNECTIONS 50 #endif It was tested with 30.Apr.2002 sources.
Keywords: qawanted
Sergei, it may seem stupid, but could you post a patch for this. it makes life easier for everyone else who is here to help fix the bug.
Ok Paul, done
From active users feedback i think that importance/severity of this bug must be set on very high level!!! It appears to be main problem, preventing majority vanilla BeOS users from Mozilla usage at regular base. BTW, i suspect stack overflow on MD_connect or/and Snooze() calls - is there any way to set bigger stack on build?
As it happens even under BONE with netserver Mozilla builds, it seems that reason is imperfect _MD_connect implementation for net-server in bnet.c. So this very critical crasher should be reassigned to NSPR gurus (maybe Wan-teh ?)
assigning this to Sergei. Changing platform to PC, severity to major.Changing product to NSPR, since the crash is actually in bnet.c. The attached patchis only a temporary fix, and does not solve the problem.
Assignee: new-network-bugs → sergei_d
Severity: normal → major
Status: UNCONFIRMED → NEW
Component: Networking → NSPR
Ever confirmed: true
Product: Browser → NSPR
Hardware: Other → PC
*** Bug 147751 has been marked as a duplicate of this bug. ***
Stack crawls from the when mozilla crashed are also available in comments on Bug#147751. (This way you don't have to go the the bezilla.org forums)
Severity: major → critical
Keywords: crash
Summary: Crash in SocketConnect → Crash in [@SocketConnect]
no wonder. there is ConnectListNode connectList[64]; in mozilla/nsprpub/pr/src/md/beos/bmisc.c (btw, #define MAX_OPEN_CONNECTIONS 50 in mozilla/netwerk/base/src/nsSocketTransportService.h) and absolutely no check for connectCount value in PRInt32 _MD_connect (mozilla/nsprpub/pr/src/md/beos/bnet.c) in this part of code: if( fd->secret->nonblocking && ((err == EAGAIN) || (err == EINPROGRESS))) { PR_Lock(_connectLock); connectList[connectCount].osfd = osfd; memcpy(&connectList[connectCount].addr, addr, addrlen); connectList[connectCount].addrlen = addrlen; connectList[connectCount].timeout = timeout; connectCount++; PR_Unlock(_connectLock); _PR_MD_MAP_CONNECT_ERROR(err); return rv; so it crashes usually when connectCount reaches 80 or more....
Why is that section even there? I don't see any place that entries are actually removed from connectList[] or that connectCount is actually decremented. Sergei, what happens if you just ifdef out everything from the PR_Lock() to the PR_Unlock() ?
yeah, Chris, with only those lines _PR_MD_MAP_CONNECT_ERROR(err); return rv; remaining inside if() statement it works well. No crashes, no suspicious timeouts.
Attached patch patch (obsolete) — Splinter Review
this is patch file accordingly to previous comment
Attached patch full patch (obsolete) — Splinter Review
also removes connectCount and connectList[] from include files in nsprpub Two previous patches must be marked as absolete
Attachment #83198 - Attachment is obsolete: true
Attachment #87120 - Attachment is obsolete: true
Comment on attachment 87147 [details] [diff] [review] full patch Hi Sergei, _connectLock should also be deleted because the variables it protects (connectCount and connectList) are deleted.
Attachment #87147 - Flags: needs-work+
Wan-Teh, if i delete _connectLock void _MD_final_init (void) in bmisc.c will have absolutely empty body? Is it needed at all anymore?
Hrm. So when I removed all references to connectCount, connectList[] & _connectLock, I'm seeing a weird case where net_server dies whenever I try to attach a patch in bugzilla. When I go to the Network preferences panel and tell it to restart Networking sometimes it won't restart until I've exited Mozilla.
Attachment #87147 - Attachment is obsolete: true
connectList and its ilk were added long ago because BeOS' non-BONE (net_server) didn't support nonblocking connects. The solution was polling of connect() calls. You called connect() again with the same arguments. If the result was EAGAIN or somesuch (I don't remember now) you waited some more. If the result was success or failure, report as such. This re-polling logic was put in poll/select, since this is where mozilla (at the time) waited for a socket to become writeable. Has the interaction between mozilla and the NSPR changed, or are you just getting lucky that the connect() is (a) returning fast enough that it's ready by the time you write and (b) is not refused/failed. Or is mozilla using blocking connects now? Also, net_server does not have an implementation to get the status of a connect other than through direct calls to connect. How is the net_server port of NSPR handling this? Thanks! - Matt
Found it. http://bonsai.mozilla.org/cvsview2.cgi? diff_mode=context&whitespace_mode=show&file=bfile.c&root=/cvsroot&subdir=mozilla /nsprpub/pr/src/md/beos&command=DIFF_FRAMESET&rev1=3.7&rev2=3.8 When BONE changes were checked in, the logic to *remove* and *check* connectList entries was toasted in select code. 64 connects at once were fine. (at least at one point) This would even be more than the maximum 50 connection-at-once limit. It's poor form to have a static array, but at the time it was fine. Why y'all have run into problems recently is that the array was never emptied. After 64 nonblocking connection attempts, no matter the outcome, you started exceeding the array size. Solution: Verify that poll/select works for nonblocking sockets in net_server now. If so, remove connectList junk. If connect really isn't working under poll (try going to sites that take a while to connect, not just serve up pages slowly) you'll have to put back connectList or something like it and a select/poll that removes connectList entries after they have succeeded or failed. A better solution than my old way is to spawn a thread that polls for the connect and dumps the status somewhere for poll and PR_GetConnectStatus to read. - Matt
Patches with removed code don't work with https here.
Based on all the problems, someone needs to go fact finding. I no longer have a BeOS partition - BeOS doesn't support my hardware anymore. I have high hopes that someday I can enjoy a rich BeOS-like environment, but not now. Back in the day, I gave up on non-BONE builds and left in the code - which was working fine before the connectList deletion code was removed from socket_io_wait. BONE's release was just around the corner then. I concentrated on making the network stack work with BONE. Someone else took my BONE patches and merged them into the tree. (they really should have commented out all connectList stuff, and left the non-BONE socket_io_wait in, but that's another story) Times have changed. Since we don't have BONE anymore, what needs to be determined is: The *correct* way to do the following under net_server: 1. Nonblocking connects. Do you have to continually poll connect() by re-connecting() with the same arguments as the original call until you get success or failure? (EAGAIN/EWOULDBLOCK should make you try again later) Can you use select/poll() on the socket to get the error? 2. Status from connects. Mozilla uses PR_GetConnectStatus which calls getsockopt(SO_ERROR) to see if a socket is connected and if there was an error on the connection. There is no such function on non-BONE machines, which means you need to track the outcome of a connection somewhere within the NSPR. I once thought that reading 0 bytes from the socket would allow you to safely find out if a socket was alive and connected - never tried it to see if it worked. 3. Does select now work on read and write sockets? The code in socket_io_wait was written for BONE which *assumes* that write sockets work - you'll have a really unstable network stack if you're assuming that write blocking works when it doesn't. If it's any help, I had a small test program I used to test non-blocking socket strangeness that can be used as a starting point. It won't work as a test for the questions above, but it will give you a good starting point as a working program that connects to the website given on the command line, writes the GET request, and reads/echos the HTML that comes back in a nonblocking manner. Find it at ftp://ftp.albany.net/pub/users/maz/nbtest.tar.gz. Good luck! Thanks! - Matt
Matt, what's your hardware issues? (you can contact me off list) Also, A LOT of people still run BONE builds, so we CAN NOT get rid of it. As for your questions, I will see if I can get some help from some other BeOS coders who know the networking code MUCH better than I do (which is very, very little)
Ok, I let another BeOS developer know about the issues we are having. He has looked at this Bug entry, and will also be looking at the source. Here is his response to some of the questions posted here: net_server does indeed support nonblocking connects, as I said before and as it is said in the bugzilla discussion, but only through setsockopt(), not through fcntl(). net_server does not have getsockopt, which it seems that mozilla uses to see if it is connected or not, but they seem to have a workaround for that, although not tested. :/ The big issue with net_server though, is that you cannot select on several sockets at once. Workaround would either be to select all the sockets one by one, or that each socket has its own select thread and the select calls are protected with mutexes. Also the read/write functions need to mutexed in that case.
Attached patch patch to reduce connectCount (obsolete) — Splinter Review
ok, bare with me if this is completely off base, but, it does seem to work, so far. All I did was look at the current _MD_Poll and the old _MD_Poll, which, btw, are still farely similar, and add the code for reducing the connectList[]. Now, I don't see the connectCount getting over 1 or 2! I have not built this under BONE yet, but, it should be ok. I'll check later. I just want to get everyone's opinion on this first.
Attachment #87224 - Attachment is obsolete: true
Paul, seems working fine under BONE. I uploaded this libnspr for more wide testing recently. http://www.bebits.com/bob/12473/libnspr4.so so waiting userbase feedback
Wow! I grabbed the update from BeBits and it really seems to be working well. Processor usage doesn't got to 100% during network activity any more. Great job. This makes bezilla more usable than ever. :-)
You now essential have the code that I originally wrote before BONE was merged. It has deficiencies: 1. PR_GetConnectStatus is still broken 2. PR_Poll/PR_Select still don't behave exactly according to specifications. But it worked, and probably will work well for non-BONE users. I leave it up to some brave soul who wants to work around the remaining implementation details and make a rock-solid non-BONE NSPR. - Matt
True, but, net_server is just plain lousy. If this works, and is good enough, I say we check it in. Even if BeOS was still being actively developed by Be, Inc. or Palm, more than likely, the net_server would be phased out. And, the open source implementation(s) are doing just that, as well. My question would be, did I re-implement the work-around from before properly? I.e. did I put the code in the proper place within _MD_Poll?
Yeah, that is fine. There is no logic to retry connects on a set interval while waiting for select. I am curious if select will return a read or write flag on a waiting to connect socket. If so, this code will work perfectly as is. If select doesn't return, you're only checking for connects after some activity on an unrelated socket or the timeout passed to select has expired. - Matt
ok, I've been using the last patch for a while now, without incident. as for the other functions that need to be implemented that Matt mentioned, well, they were never implemented, and are not part of this bug. the patch attached fixes this bug, so, r=arougthopher@lizardland.net Chris, can you review this as well. I would like to check this in, but, I don't have access to the nspr branch.
*** Bug 157351 has been marked as a duplicate of this bug. ***
Wan-teh or Chris, Could one of you review the latest patch, and check it into the tree. This patch adds A LOT to the stability of mozilla on BeOS. If you have a problem with it, let me know, and I will address it, but I have seen no objections, or, any activity for that matter, on this bug in a long time. Thank you
Status: NEW → ASSIGNED
Arougthopher: could you look at Bug 157351, and verify it if you think it is a dupe? TIA.
Comment on attachment 88237 [details] [diff] [review] patch to reduce connectCount Paul, Your patch is basically fine. I would like to suggest two changes. 1. Style: please use C style comment delimiters /* */ instead of the C++ style comment delimiters //. 2. Safety: please add code that guards against stepping beyond the end of the buffer. In bnet.c, you should say something like: PR_Lock(_connectLock); if (connectCount < sizeof(connectList)/sizeof(connectList[0])) { connectList[connectCount].osfd = osfd; memcpy(&connectList[connectCount].addr, addr, addrlen); connectList[connectCount].addrlen = addrlen; connectList[connectCount].timeout = timeout; connectCount++; PR_Unlock(_connectLock); _PR_MD_MAP_CONNECT_ERROR(err); } else { PR_Unlock(_connectLock); PR_SetError(PR_INSUFFICIENT_RESOURCES_ERROR, 0); } return rv; Alternatively, you can allocate connectList from the heap and dynamically resize it when necessary.
Attachment #88237 - Flags: needs-work+
Updated patch. Ben: I cannot verify it as a dupe, as, I can not reproduce the original problem, even when I don't have this patch applied. (or, at least what I think was the problem) So, my feeling is that it is not a duplicate.
Attachment #88237 - Attachment is obsolete: true
Paul, I will check this in when the code freeze for Mozilla 1.1beta is over. If you want me to check this in sooner, could you please get drivers' approval?
Attachment #91579 - Attachment is obsolete: true
Comment on attachment 91608 [details] [diff] [review] Update patch (fixed indentation, white space, and comments) (checked in) r=wtc. (This is Paul's patch with my edits.)
Attachment #91608 - Flags: review+
Comment on attachment 91608 [details] [diff] [review] Update patch (fixed indentation, white space, and comments) (checked in) a=scc for checkin to the mozilla trunk
Attachment #91608 - Flags: approval+
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Target Milestone: --- → 4.2.2
Fix checked into the tip and NSPRPUB_PRE_4_2_CLIENT_BRANCH (used by mozilla trunk) of NSPR.
*** Bug 157351 has been marked as a duplicate of this bug. ***
Reopening as I see the same problem where net_server dies whenever a form is submitted. Restarting net_server works for other apps but mozilla needs to be restarted to be usable again.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Priority: -- → P1
Target Milestone: 4.2.2 → 4.3
QA Contact: benc → wtc
I doubt this bug is worth "critical" flag. No more crashes - problem is if more perfect implementation is possible.
Priority: P1 → P2
According to http://bugzilla.mozilla.org/show_bug.cgi?id=141614#c41 reassigning to Chris (in other case i prefer to close it)
Assignee: sergei_d → seawood
Status: REOPENED → NEW
-> cls
Assignee: seawood → cls
Severity: critical → major
Priority: P2 → P3
QA Contact: wtchang → nspr
The target milestone was already released. Resetting target milestone.
Target Milestone: 4.3 → ---
Comment on attachment 91608 [details] [diff] [review] Update patch (fixed indentation, white space, and comments) (checked in) This patch was committed. Is this bug now fixed?
Attachment #91608 - Attachment description: Update patch (fixed indentation, white space, and comments) → Update patch (fixed indentation, white space, and comments) (checked in)
(In reply to comment #46) > (From update of attachment 91608 [details] [diff] [review]) > This patch was committed. Is this bug now fixed? > seems no problem anymore
Status: NEW → RESOLVED
Closed: 23 years ago17 years ago
Resolution: --- → FIXED
Target Milestone: --- → 4.2.2
Crash Signature: [@SocketConnect]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: