Closed Bug 141614 Opened 23 years ago Closed 17 years ago

Crash in [@SocketConnect]

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

4.2.2

People

(Reporter: sergei_d, Assigned: cls)

References

Details

(Keywords: crash, qawanted)

Crash Data

Attachments

(1 file, 6 obsolete files)

reduce MAX_OPEN_CONNECTION for net-server 23 years ago Sergei Dolgov 794 bytes, patch		Details \| Diff \| Splinter Review
patch 23 years ago Sergei Dolgov 807 bytes, patch		Details \| Diff \| Splinter Review
full patch 23 years ago Sergei Dolgov 1.95 KB, patch		Details \| Diff \| Splinter Review
Remove all refs to connectList & friends 23 years ago cls 2.32 KB, patch		Details \| Diff \| Splinter Review
patch to reduce connectCount 23 years ago Paul 2.73 KB, patch		Details \| Diff \| Splinter Review
update patch, as per Wan-teh's comments 23 years ago Paul 4.05 KB, patch		Details \| Diff \| Splinter Review
Update patch (fixed indentation, white space, and comments) (checked in) 23 years ago Wan-Teh Chang 3.96 KB, patch	wtc : review+ scc : approval+	Details \| Diff \| Splinter Review

Sergei Dolgov

Reporter

Description

•

23 years ago

This happens AFAIK on net-server builds. Here are two bug reports: http://bezilla.org/modules.php?op=modload&name=Forum&file=viewtopic&topic=18&forum=7&7 and http://bezilla.org/modules.php?op=modload&name=Forum&file=viewtopic&topic=14&forum=7&0 And seems related to net-server issues. Also it seems that remedy is decreasing of connection numbers in file netwerk/base/src/SocketTransportService.h Something in like this: #if defined(XP_BEOS) && !defined(BONE_VERSION) #define MAX_OPEN_CONNECTIONS 16 #else #define MAX_OPEN_CONNECTIONS 50 #endif It was tested with 30.Apr.2002 sources.

benc

Updated

•

23 years ago

Keywords: qawanted

Paul

Comment 1

•

23 years ago

Sergei, it may seem stupid, but could you post a patch for this. it makes life easier for everyone else who is here to help fix the bug.

Sergei Dolgov

Reporter

Comment 2

•

23 years ago

Attached patch reduce MAX_OPEN_CONNECTION for net-server (obsolete) — Details — Splinter Review

Ok Paul, done

Sergei Dolgov

Reporter

Comment 3

•

23 years ago

From active users feedback i think that importance/severity of this bug must be set on very high level!!! It appears to be main problem, preventing majority vanilla BeOS users from Mozilla usage at regular base. BTW, i suspect stack overflow on MD_connect or/and Snooze() calls - is there any way to set bigger stack on build?

Sergei Dolgov

Reporter

Comment 4

•

23 years ago

As it happens even under BONE with netserver Mozilla builds, it seems that reason is imperfect _MD_connect implementation for net-server in bnet.c. So this very critical crasher should be reassigned to NSPR gurus (maybe Wan-teh ?)

Paul

Comment 5

•

23 years ago

assigning this to Sergei. Changing platform to PC, severity to major.Changing product to NSPR, since the crash is actually in bnet.c. The attached patchis only a temporary fix, and does not solve the problem.

Assignee: new-network-bugs → sergei_d

Severity: normal → major

Status: UNCONFIRMED → NEW

Component: Networking → NSPR

Ever confirmed: true

Product: Browser → NSPR

Hardware: Other → PC

Paul

Comment 6

•

23 years ago

*** Bug 147751 has been marked as a duplicate of this bug. ***

Paul

Comment 7

•

23 years ago

Stack crawls from the when mozilla crashed are also available in comments on Bug#147751. (This way you don't have to go the the bezilla.org forums)

timeless

Updated

•

23 years ago

Severity: major → critical

Keywords: crash

Summary: Crash in SocketConnect → Crash in [@SocketConnect]

Sergei Dolgov

Reporter

Comment 8

•

23 years ago

no wonder. there is ConnectListNode connectList[64]; in mozilla/nsprpub/pr/src/md/beos/bmisc.c (btw, #define MAX_OPEN_CONNECTIONS 50 in mozilla/netwerk/base/src/nsSocketTransportService.h) and absolutely no check for connectCount value in PRInt32 _MD_connect (mozilla/nsprpub/pr/src/md/beos/bnet.c) in this part of code: if( fd->secret->nonblocking && ((err == EAGAIN) || (err == EINPROGRESS))) { PR_Lock(_connectLock); connectList[connectCount].osfd = osfd; memcpy(&connectList[connectCount].addr, addr, addrlen); connectList[connectCount].addrlen = addrlen; connectList[connectCount].timeout = timeout; connectCount++; PR_Unlock(_connectLock); _PR_MD_MAP_CONNECT_ERROR(err); return rv; so it crashes usually when connectCount reaches 80 or more....

hacker formerly known as seawood@netscape.com

Updated

•

23 years ago

Keywords: mozilla1.0.1

hacker formerly known as seawood@netscape.com

Comment 9

•

23 years ago

Why is that section even there? I don't see any place that entries are actually removed from connectList[] or that connectCount is actually decremented. Sergei, what happens if you just ifdef out everything from the PR_Lock() to the PR_Unlock() ?

Sergei Dolgov

Reporter

Comment 10

•

23 years ago

yeah, Chris, with only those lines _PR_MD_MAP_CONNECT_ERROR(err); return rv; remaining inside if() statement it works well. No crashes, no suspicious timeouts.

Sergei Dolgov

Reporter

Comment 11

•

23 years ago

Attached patch patch (obsolete) — Details — Splinter Review

this is patch file accordingly to previous comment

timeless

Comment 12

•

23 years ago

please remove all of: http://lxr.mozilla.org/nspr/ident?i=connectList http://lxr.mozilla.org/nspr/ident?i=connectCount

Sergei Dolgov

Reporter

Comment 13

•

23 years ago

Attached patch full patch (obsolete) — Details — Splinter Review

also removes connectCount and connectList[] from include files in nsprpub Two previous patches must be marked as absolete

Wan-Teh Chang

Updated

•

23 years ago

Attachment #83198 - Attachment is obsolete: true

Wan-Teh Chang

Updated

•

23 years ago

Attachment #87120 - Attachment is obsolete: true

Wan-Teh Chang

Comment 14

•

23 years ago

Comment on attachment 87147 [details] [diff] [review] full patch Hi Sergei, _connectLock should also be deleted because the variables it protects (connectCount and connectList) are deleted.

Attachment #87147 - Flags: needs-work+

Sergei Dolgov

Reporter

Comment 15

•

23 years ago

Wan-Teh, if i delete _connectLock void _MD_final_init (void) in bmisc.c will have absolutely empty body? Is it needed at all anymore?

hacker formerly known as seawood@netscape.com

Comment 16

•

23 years ago

Hrm. So when I removed all references to connectCount, connectList[] & _connectLock, I'm seeing a weird case where net_server dies whenever I try to attach a patch in bugzilla. When I go to the Network preferences panel and tell it to restart Networking sometimes it won't restart until I've exited Mozilla.

cls

Assignee

Comment 17

•

23 years ago

Attached patch Remove all refs to connectList & friends (obsolete) — Details — Splinter Review

Attachment #87147 - Attachment is obsolete: true

Matthew Zahorik

Comment 18

•

23 years ago

connectList and its ilk were added long ago because BeOS' non-BONE (net_server) didn't support nonblocking connects. The solution was polling of connect() calls. You called connect() again with the same arguments. If the result was EAGAIN or somesuch (I don't remember now) you waited some more. If the result was success or failure, report as such. This re-polling logic was put in poll/select, since this is where mozilla (at the time) waited for a socket to become writeable. Has the interaction between mozilla and the NSPR changed, or are you just getting lucky that the connect() is (a) returning fast enough that it's ready by the time you write and (b) is not refused/failed. Or is mozilla using blocking connects now? Also, net_server does not have an implementation to get the status of a connect other than through direct calls to connect. How is the net_server port of NSPR handling this? Thanks! - Matt

Matthew Zahorik

Comment 19

•

23 years ago

Found it. http://bonsai.mozilla.org/cvsview2.cgi? diff_mode=context&whitespace_mode=show&file=bfile.c&root=/cvsroot&subdir=mozilla /nsprpub/pr/src/md/beos&command=DIFF_FRAMESET&rev1=3.7&rev2=3.8 When BONE changes were checked in, the logic to *remove* and *check* connectList entries was toasted in select code. 64 connects at once were fine. (at least at one point) This would even be more than the maximum 50 connection-at-once limit. It's poor form to have a static array, but at the time it was fine. Why y'all have run into problems recently is that the array was never emptied. After 64 nonblocking connection attempts, no matter the outcome, you started exceeding the array size. Solution: Verify that poll/select works for nonblocking sockets in net_server now. If so, remove connectList junk. If connect really isn't working under poll (try going to sites that take a while to connect, not just serve up pages slowly) you'll have to put back connectList or something like it and a select/poll that removes connectList entries after they have succeeded or failed. A better solution than my old way is to spawn a thread that polls for the connect and dumps the status somewhere for poll and PR_GetConnectStatus to read. - Matt

Sergei Dolgov

Reporter

Comment 20

•

23 years ago

Patches with removed code don't work with https here.

Matthew Zahorik

Comment 21

•

23 years ago

Based on all the problems, someone needs to go fact finding. I no longer have a BeOS partition - BeOS doesn't support my hardware anymore. I have high hopes that someday I can enjoy a rich BeOS-like environment, but not now. Back in the day, I gave up on non-BONE builds and left in the code - which was working fine before the connectList deletion code was removed from socket_io_wait. BONE's release was just around the corner then. I concentrated on making the network stack work with BONE. Someone else took my BONE patches and merged them into the tree. (they really should have commented out all connectList stuff, and left the non-BONE socket_io_wait in, but that's another story) Times have changed. Since we don't have BONE anymore, what needs to be determined is: The *correct* way to do the following under net_server: 1. Nonblocking connects. Do you have to continually poll connect() by re-connecting() with the same arguments as the original call until you get success or failure? (EAGAIN/EWOULDBLOCK should make you try again later) Can you use select/poll() on the socket to get the error? 2. Status from connects. Mozilla uses PR_GetConnectStatus which calls getsockopt(SO_ERROR) to see if a socket is connected and if there was an error on the connection. There is no such function on non-BONE machines, which means you need to track the outcome of a connection somewhere within the NSPR. I once thought that reading 0 bytes from the socket would allow you to safely find out if a socket was alive and connected - never tried it to see if it worked. 3. Does select now work on read and write sockets? The code in socket_io_wait was written for BONE which *assumes* that write sockets work - you'll have a really unstable network stack if you're assuming that write blocking works when it doesn't. If it's any help, I had a small test program I used to test non-blocking socket strangeness that can be used as a starting point. It won't work as a test for the questions above, but it will give you a good starting point as a working program that connects to the website given on the command line, writes the GET request, and reads/echos the HTML that comes back in a nonblocking manner. Find it at ftp://ftp.albany.net/pub/users/maz/nbtest.tar.gz. Good luck! Thanks! - Matt

Paul

Comment 22

•

23 years ago

Matt, what's your hardware issues? (you can contact me off list) Also, A LOT of people still run BONE builds, so we CAN NOT get rid of it. As for your questions, I will see if I can get some help from some other BeOS coders who know the networking code MUCH better than I do (which is very, very little)

Paul

Comment 23

•

23 years ago

Ok, I let another BeOS developer know about the issues we are having. He has looked at this Bug entry, and will also be looking at the source. Here is his response to some of the questions posted here: net_server does indeed support nonblocking connects, as I said before and as it is said in the bugzilla discussion, but only through setsockopt(), not through fcntl(). net_server does not have getsockopt, which it seems that mozilla uses to see if it is connected or not, but they seem to have a workaround for that, although not tested. :/ The big issue with net_server though, is that you cannot select on several sockets at once. Workaround would either be to select all the sockets one by one, or that each socket has its own select thread and the select calls are protected with mutexes. Also the read/write functions need to mutexed in that case.

Paul

Comment 24

•

23 years ago

Attached patch patch to reduce connectCount (obsolete) — Details — Splinter Review

ok, bare with me if this is completely off base, but, it does seem to work, so far. All I did was look at the current _MD_Poll and the old _MD_Poll, which, btw, are still farely similar, and add the code for reducing the connectList[]. Now, I don't see the connectCount getting over 1 or 2! I have not built this under BONE yet, but, it should be ok. I'll check later. I just want to get everyone's opinion on this first.

Attachment #87224 - Attachment is obsolete: true

Sergei Dolgov

Reporter

Comment 25

•

23 years ago

Paul, seems working fine under BONE. I uploaded this libnspr for more wide testing recently. http://www.bebits.com/bob/12473/libnspr4.so so waiting userbase feedback

adam n

Comment 26

•

23 years ago

Wow! I grabbed the update from BeBits and it really seems to be working well. Processor usage doesn't got to 100% during network activity any more. Great job. This makes bezilla more usable than ever. :-)

Matthew Zahorik

Comment 27

•

23 years ago

You now essential have the code that I originally wrote before BONE was merged. It has deficiencies: 1. PR_GetConnectStatus is still broken 2. PR_Poll/PR_Select still don't behave exactly according to specifications. But it worked, and probably will work well for non-BONE users. I leave it up to some brave soul who wants to work around the remaining implementation details and make a rock-solid non-BONE NSPR. - Matt

Paul

Comment 28

•

23 years ago

True, but, net_server is just plain lousy. If this works, and is good enough, I say we check it in. Even if BeOS was still being actively developed by Be, Inc. or Palm, more than likely, the net_server would be phased out. And, the open source implementation(s) are doing just that, as well. My question would be, did I re-implement the work-around from before properly? I.e. did I put the code in the proper place within _MD_Poll?

Matthew Zahorik

Comment 29

•

23 years ago

Yeah, that is fine. There is no logic to retry connects on a set interval while waiting for select. I am curious if select will return a read or write flag on a waiting to connect socket. If so, this code will work perfectly as is. If select doesn't return, you're only checking for connects after some activity on an unrelated socket or the timeout passed to select has expired. - Matt

Paul

Comment 30

•

23 years ago

ok, I've been using the last patch for a while now, without incident. as for the other functions that need to be implemented that Matt mentioned, well, they were never implemented, and are not part of this bug. the patch attached fixes this bug, so, r=arougthopher@lizardland.net Chris, can you review this as well. I would like to check this in, but, I don't have access to the nspr branch.

Matthias Versen [:Matti]

Comment 31

•

23 years ago

*** Bug 157351 has been marked as a duplicate of this bug. ***

Paul

Comment 32

•

23 years ago

Wan-teh or Chris, Could one of you review the latest patch, and check it into the tree. This patch adds A LOT to the stability of mozilla on BeOS. If you have a problem with it, let me know, and I will address it, but I have seen no objections, or, any activity for that matter, on this bug in a long time. Thank you

Status: NEW → ASSIGNED

benc

Comment 33

•

23 years ago

Arougthopher: could you look at Bug 157351, and verify it if you think it is a dupe? TIA.

Wan-Teh Chang

Comment 34

•

23 years ago

Comment on attachment 88237 [details] [diff] [review] patch to reduce connectCount Paul, Your patch is basically fine. I would like to suggest two changes. 1. Style: please use C style comment delimiters /* */ instead of the C++ style comment delimiters //. 2. Safety: please add code that guards against stepping beyond the end of the buffer. In bnet.c, you should say something like: PR_Lock(_connectLock); if (connectCount < sizeof(connectList)/sizeof(connectList[0])) { connectList[connectCount].osfd = osfd; memcpy(&connectList[connectCount].addr, addr, addrlen); connectList[connectCount].addrlen = addrlen; connectList[connectCount].timeout = timeout; connectCount++; PR_Unlock(_connectLock); _PR_MD_MAP_CONNECT_ERROR(err); } else { PR_Unlock(_connectLock); PR_SetError(PR_INSUFFICIENT_RESOURCES_ERROR, 0); } return rv; Alternatively, you can allocate connectList from the heap and dynamically resize it when necessary.

Attachment #88237 - Flags: needs-work+

Paul

Comment 35

•

23 years ago

Attached patch update patch, as per Wan-teh's comments (obsolete) — Details — Splinter Review

Updated patch. Ben: I cannot verify it as a dupe, as, I can not reproduce the original problem, even when I don't have this patch applied. (or, at least what I think was the problem) So, my feeling is that it is not a duplicate.

Attachment #88237 - Attachment is obsolete: true

Wan-Teh Chang

Comment 36

•

23 years ago

Attached patch Update patch (fixed indentation, white space, and comments) (checked in) — Details — Splinter Review

Paul, I will check this in when the code freeze for Mozilla 1.1beta is over. If you want me to check this in sooner, could you please get drivers' approval?

Attachment #91579 - Attachment is obsolete: true

Wan-Teh Chang

Comment 37

•

23 years ago

Comment on attachment 91608 [details] [diff] [review] Update patch (fixed indentation, white space, and comments) (checked in) r=wtc. (This is Paul's patch with my edits.)

Attachment #91608 - Flags: review+

Scott Collins

Comment 38

•

23 years ago

Comment on attachment 91608 [details] [diff] [review] Update patch (fixed indentation, white space, and comments) (checked in) a=scc for checkin to the mozilla trunk

Attachment #91608 - Flags: approval+

Wan-Teh Chang

Updated

•

23 years ago

Status: ASSIGNED → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

Target Milestone: --- → 4.2.2

Wan-Teh Chang

Comment 39

•

23 years ago

Fix checked into the tip and NSPRPUB_PRE_4_2_CLIENT_BRANCH (used by mozilla trunk) of NSPR.

hacker formerly known as seawood@netscape.com

Comment 40

•

23 years ago

*** Bug 157351 has been marked as a duplicate of this bug. ***

hacker formerly known as seawood@netscape.com

Comment 41

•

23 years ago

Reopening as I see the same problem where net_server dies whenever a form is submitted. Restarting net_server works for other apps but mozilla needs to be restarted to be usable again.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Wan-Teh Chang

Updated

•

23 years ago

Priority: -- → P1

Target Milestone: 4.2.2 → 4.3

benc

Updated

•

23 years ago

QA Contact: benc → wtc

Sergei Dolgov

Reporter

Comment 42

•

22 years ago

I doubt this bug is worth "critical" flag. No more crashes - problem is if more perfect implementation is possible.

Wan-Teh Chang

Updated

•

22 years ago

Priority: P1 → P2

Sergei Dolgov

Reporter

Comment 43

•

22 years ago

According to http://bugzilla.mozilla.org/show_bug.cgi?id=141614#c41 reassigning to Chris (in other case i prefer to close it)

Assignee: sergei_d → seawood

Status: REOPENED → NEW

hacker formerly known as seawood@netscape.com

Comment 44

•

22 years ago

-> cls

Assignee: seawood → cls

Severity: critical → major

Priority: P2 → P3

Reed Loden [:reed]

Updated

•

18 years ago

QA Contact: wtchang → nspr

Christophe Ravel

Comment 45

•

18 years ago

The target milestone was already released. Resetting target milestone.

Target Milestone: 4.3 → ---

Nelson Bolyard (seldom reads bugmail)

Comment 46

•

17 years ago

Comment on attachment 91608 [details] [diff] [review] Update patch (fixed indentation, white space, and comments) (checked in) This patch was committed. Is this bug now fixed?

Attachment #91608 - Attachment description: Update patch (fixed indentation, white space, and comments) → Update patch (fixed indentation, white space, and comments) (checked in)

Sergei Dolgov

Reporter

Comment 47

•

17 years ago

(In reply to comment #46) > (From update of attachment 91608 [details] [diff] [review]) > This patch was committed. Is this bug now fixed? > seems no problem anymore

Wan-Teh Chang

Updated

•

17 years ago

Status: NEW → RESOLVED

Closed: 23 years ago → 17 years ago

Resolution: --- → FIXED

Target Milestone: --- → 4.2.2

Nobody; OK to take it and work on it

Updated

•

14 years ago

Crash Signature: [@SocketConnect]

You need to log in before you can comment on or make changes to this bug.