Closed
Bug 141614
Opened 23 years ago
Closed 17 years ago
Crash in [@SocketConnect]
Categories
(NSPR :: NSPR, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
4.2.2
People
(Reporter: sergei_d, Assigned: cls)
References
Details
(Keywords: crash, qawanted)
Crash Data
Attachments
(1 file, 6 obsolete files)
|
3.96 KB,
patch
|
wtc
:
review+
scc
:
approval+
|
Details | Diff | Splinter Review |
This happens AFAIK on net-server builds.
Here are two bug reports:
http://bezilla.org/modules.php?op=modload&name=Forum&file=viewtopic&topic=18&forum=7&7
and
http://bezilla.org/modules.php?op=modload&name=Forum&file=viewtopic&topic=14&forum=7&0
And seems related to net-server issues.
Also it seems that remedy is decreasing of connection numbers in file
netwerk/base/src/SocketTransportService.h
Something in like this:
#if defined(XP_BEOS) && !defined(BONE_VERSION)
#define MAX_OPEN_CONNECTIONS 16
#else
#define MAX_OPEN_CONNECTIONS 50
#endif
It was tested with 30.Apr.2002 sources.
Sergei, it may seem stupid, but could you post a patch for this. it makes life easier for
everyone else who is here to help fix the bug.
| Reporter | ||
Comment 2•23 years ago
|
||
Ok Paul, done
| Reporter | ||
Comment 3•23 years ago
|
||
From active users feedback i think that importance/severity of this bug must be
set on very high level!!!
It appears to be main problem, preventing majority vanilla BeOS users from
Mozilla usage at regular base.
BTW, i suspect stack overflow on MD_connect or/and Snooze() calls - is there any
way to set bigger stack on build?
| Reporter | ||
Comment 4•23 years ago
|
||
As it happens even under BONE with netserver Mozilla builds, it seems that
reason is imperfect _MD_connect implementation for net-server in bnet.c.
So this very critical crasher should be reassigned to NSPR gurus (maybe Wan-teh ?)
assigning this to Sergei. Changing platform to PC, severity to major.Changing product to NSPR, since the crash is actually in bnet.c. The attached patchis only a temporary fix, and does not solve the problem.
Assignee: new-network-bugs → sergei_d
Severity: normal → major
Status: UNCONFIRMED → NEW
Component: Networking → NSPR
Ever confirmed: true
Product: Browser → NSPR
Hardware: Other → PC
*** Bug 147751 has been marked as a duplicate of this bug. ***
Stack crawls from the when mozilla crashed are also available in comments on
Bug#147751.
(This way you don't have to go the the bezilla.org forums)
Severity: major → critical
Keywords: crash
Summary: Crash in SocketConnect → Crash in [@SocketConnect]
| Reporter | ||
Comment 8•23 years ago
|
||
no wonder.
there is
ConnectListNode connectList[64];
in mozilla/nsprpub/pr/src/md/beos/bmisc.c
(btw, #define MAX_OPEN_CONNECTIONS 50 in
mozilla/netwerk/base/src/nsSocketTransportService.h)
and absolutely no check for connectCount value
in PRInt32 _MD_connect (mozilla/nsprpub/pr/src/md/beos/bnet.c)
in this part of code:
if( fd->secret->nonblocking && ((err == EAGAIN) || (err == EINPROGRESS))) {
PR_Lock(_connectLock);
connectList[connectCount].osfd = osfd;
memcpy(&connectList[connectCount].addr, addr, addrlen);
connectList[connectCount].addrlen = addrlen;
connectList[connectCount].timeout = timeout;
connectCount++;
PR_Unlock(_connectLock);
_PR_MD_MAP_CONNECT_ERROR(err);
return rv;
so it crashes usually when connectCount reaches 80 or more....
Updated•23 years ago
|
Keywords: mozilla1.0.1
Comment 9•23 years ago
|
||
Why is that section even there? I don't see any place that entries are actually
removed from connectList[] or that connectCount is actually decremented.
Sergei, what happens if you just ifdef out everything from the PR_Lock() to the
PR_Unlock() ?
| Reporter | ||
Comment 10•23 years ago
|
||
yeah, Chris, with only those lines
_PR_MD_MAP_CONNECT_ERROR(err);
return rv;
remaining inside if() statement it works well. No crashes, no suspicious timeouts.
| Reporter | ||
Comment 11•23 years ago
|
||
this is patch file accordingly to previous comment
Comment 12•23 years ago
|
||
| Reporter | ||
Comment 13•23 years ago
|
||
also removes connectCount and connectList[] from include files in nsprpub
Two previous patches must be marked as absolete
Updated•23 years ago
|
Attachment #83198 -
Attachment is obsolete: true
Updated•23 years ago
|
Attachment #87120 -
Attachment is obsolete: true
Comment 14•23 years ago
|
||
Comment on attachment 87147 [details] [diff] [review]
full patch
Hi Sergei,
_connectLock should also be deleted because the
variables it protects (connectCount and connectList)
are deleted.
Attachment #87147 -
Flags: needs-work+
| Reporter | ||
Comment 15•23 years ago
|
||
Wan-Teh,
if i delete _connectLock
void
_MD_final_init (void)
in bmisc.c will have absolutely empty body? Is it needed at all anymore?
Comment 16•23 years ago
|
||
Hrm. So when I removed all references to connectCount, connectList[] &
_connectLock, I'm seeing a weird case where net_server dies whenever I try to
attach a patch in bugzilla. When I go to the Network preferences panel and tell
it to restart Networking sometimes it won't restart until I've exited Mozilla.
| Assignee | ||
Comment 17•23 years ago
|
||
Attachment #87147 -
Attachment is obsolete: true
Comment 18•23 years ago
|
||
connectList and its ilk were added long ago because BeOS' non-BONE (net_server)
didn't support nonblocking connects.
The solution was polling of connect() calls. You called connect() again with
the same arguments. If the result was EAGAIN or somesuch (I don't remember
now) you waited some more. If the result was success or failure, report as
such.
This re-polling logic was put in poll/select, since this is where mozilla (at
the time) waited for a socket to become writeable.
Has the interaction between mozilla and the NSPR changed, or are you just
getting lucky that the connect() is (a) returning fast enough that it's ready
by the time you write and (b) is not refused/failed. Or is mozilla using
blocking connects now?
Also, net_server does not have an implementation to get the status of a connect
other than through direct calls to connect. How is the net_server port of
NSPR handling this?
Thanks!
- Matt
Comment 19•23 years ago
|
||
Found it.
http://bonsai.mozilla.org/cvsview2.cgi?
diff_mode=context&whitespace_mode=show&file=bfile.c&root=/cvsroot&subdir=mozilla
/nsprpub/pr/src/md/beos&command=DIFF_FRAMESET&rev1=3.7&rev2=3.8
When BONE changes were checked in, the logic to *remove* and *check*
connectList entries was toasted in select code.
64 connects at once were fine. (at least at one point) This would even be more
than the maximum 50 connection-at-once limit. It's poor form to have a static
array, but at the time it was fine. Why y'all have run into problems recently
is that the array was never emptied. After 64 nonblocking connection attempts,
no matter the outcome, you started exceeding the array size.
Solution: Verify that poll/select works for nonblocking sockets in net_server
now. If so, remove connectList junk. If connect really isn't working under
poll (try going to sites that take a while to connect, not just serve up pages
slowly) you'll have to put back connectList or something like it and a
select/poll that removes connectList entries after they have succeeded or
failed. A better solution than my old way is to spawn a thread that polls for
the connect and dumps the status somewhere for poll and PR_GetConnectStatus to
read.
- Matt
| Reporter | ||
Comment 20•23 years ago
|
||
Patches with removed code don't work with https here.
Comment 21•23 years ago
|
||
Based on all the problems, someone needs to go fact finding. I no longer have a
BeOS partition - BeOS doesn't support my hardware anymore. I have high hopes
that someday I can enjoy a rich BeOS-like environment, but not now.
Back in the day, I gave up on non-BONE builds and left in the code - which was
working fine before the connectList deletion code was removed from
socket_io_wait. BONE's release was just around the corner then. I concentrated
on making the network stack work with BONE. Someone else took my BONE patches
and merged them into the tree. (they really should have commented out all
connectList stuff, and left the non-BONE socket_io_wait in, but that's another
story)
Times have changed. Since we don't have BONE anymore, what needs to be
determined is:
The *correct* way to do the following under net_server:
1. Nonblocking connects. Do you have to continually poll connect() by
re-connecting() with the same arguments as the original call until you get
success or failure? (EAGAIN/EWOULDBLOCK should make you try again later) Can
you use select/poll() on the socket to get the error?
2. Status from connects. Mozilla uses PR_GetConnectStatus which calls
getsockopt(SO_ERROR) to see if a socket is connected and if there was an error
on the connection. There is no such function on non-BONE machines, which means
you need to track the outcome of a connection somewhere within the NSPR. I once
thought that reading 0 bytes from the socket would allow you to safely find out
if a socket was alive and connected - never tried it to see if it worked.
3. Does select now work on read and write sockets? The code in socket_io_wait
was written for BONE which *assumes* that write sockets work - you'll have a
really unstable network stack if you're assuming that write blocking works when
it doesn't.
If it's any help, I had a small test program I used to test non-blocking socket
strangeness that can be used as a starting point. It won't work as a test for
the questions above, but it will give you a good starting point as a working
program that connects to the website given on the command line, writes the GET
request, and reads/echos the HTML that comes back in a nonblocking manner. Find
it at ftp://ftp.albany.net/pub/users/maz/nbtest.tar.gz.
Good luck!
Thanks!
- Matt
Comment 22•23 years ago
|
||
Matt, what's your hardware issues? (you can contact me off list)
Also, A LOT of people still run BONE builds, so we CAN NOT get rid of it.
As for your questions, I will see if I can get some help from some other BeOS
coders who know the networking code MUCH better than I do (which is very, very
little)
Comment 23•23 years ago
|
||
Ok, I let another BeOS developer know about the issues we are having. He has looked at
this Bug entry, and will also be looking at the source. Here is his response to some
of the questions posted here:
net_server does indeed support nonblocking connects, as I said
before and as it is said in the bugzilla discussion, but only
through setsockopt(), not through fcntl().
net_server does not have getsockopt, which it seems that mozilla
uses to see if it is connected or not, but they seem to have a
workaround for that, although not tested. :/
The big issue with net_server though, is that you cannot select
on several sockets at once. Workaround would either be to select
all the sockets one by one, or that each socket has its own
select thread and the select calls are protected with mutexes. Also
the read/write functions need to mutexed in that case.
Comment 24•23 years ago
|
||
ok, bare with me if this is completely off base, but, it does seem to work, so
far. All I did was look at the current _MD_Poll and the old _MD_Poll, which,
btw, are still farely similar, and add the code for reducing the connectList[].
Now, I don't see the connectCount getting over 1 or 2!
I have not built this under BONE yet, but, it should be ok.
I'll check later. I just want to get everyone's opinion on this first.
Attachment #87224 -
Attachment is obsolete: true
| Reporter | ||
Comment 25•23 years ago
|
||
Paul, seems working fine under BONE.
I uploaded this libnspr for more wide testing recently.
http://www.bebits.com/bob/12473/libnspr4.so
so waiting userbase feedback
Comment 26•23 years ago
|
||
Wow! I grabbed the update from BeBits and it really seems to be working well.
Processor usage doesn't got to 100% during network activity any more. Great job.
This makes bezilla more usable than ever. :-)
Comment 27•23 years ago
|
||
You now essential have the code that I originally wrote before BONE was merged.
It has deficiencies:
1. PR_GetConnectStatus is still broken
2. PR_Poll/PR_Select still don't behave exactly according to specifications.
But it worked, and probably will work well for non-BONE users.
I leave it up to some brave soul who wants to work around the remaining
implementation details and make a rock-solid non-BONE NSPR.
- Matt
Comment 28•23 years ago
|
||
True, but, net_server is just plain lousy. If this works, and is good enough, I say we
check it in. Even if BeOS was still being actively developed by Be, Inc. or Palm, more
than likely, the net_server would be phased out. And, the open source implementation(s)
are doing just that, as well.
My question would be, did I re-implement the work-around from before properly?
I.e. did I put the code in the proper place within _MD_Poll?
Comment 29•23 years ago
|
||
Yeah, that is fine.
There is no logic to retry connects on a set interval while waiting for
select. I am curious if select will return a read or write flag on a waiting
to connect socket. If so, this code will work perfectly as is. If select
doesn't return, you're only checking for connects after some activity on an
unrelated socket or the timeout passed to select has expired.
- Matt
Comment 30•23 years ago
|
||
ok, I've been using the last patch for a while now, without incident. as for
the other functions that need to be implemented that Matt mentioned, well, they
were never implemented, and are not part of this bug. the patch attached fixes
this bug, so, r=arougthopher@lizardland.net
Chris, can you review this as well. I would like to check this in, but, I don't
have access to the nspr branch.
Comment 31•23 years ago
|
||
*** Bug 157351 has been marked as a duplicate of this bug. ***
Comment 32•23 years ago
|
||
Wan-teh or Chris,
Could one of you review the latest patch, and check it into the tree. This
patch adds A LOT to the stability of mozilla on BeOS. If you have a problem
with it, let me know, and I will address it, but I have seen no objections, or,
any activity for that matter, on this bug in a long time.
Thank you
Status: NEW → ASSIGNED
Comment 33•23 years ago
|
||
Arougthopher: could you look at Bug 157351, and verify it if you think it is a
dupe? TIA.
Comment 34•23 years ago
|
||
Comment on attachment 88237 [details] [diff] [review]
patch to reduce connectCount
Paul,
Your patch is basically fine. I would like to suggest
two changes.
1. Style: please use C style comment delimiters /* */
instead of the C++ style comment delimiters //.
2. Safety: please add code that guards against stepping
beyond the end of the buffer. In bnet.c, you should say
something like:
PR_Lock(_connectLock);
if (connectCount < sizeof(connectList)/sizeof(connectList[0])) {
connectList[connectCount].osfd = osfd;
memcpy(&connectList[connectCount].addr, addr, addrlen);
connectList[connectCount].addrlen = addrlen;
connectList[connectCount].timeout = timeout;
connectCount++;
PR_Unlock(_connectLock);
_PR_MD_MAP_CONNECT_ERROR(err);
} else {
PR_Unlock(_connectLock);
PR_SetError(PR_INSUFFICIENT_RESOURCES_ERROR, 0);
}
return rv;
Alternatively, you can allocate connectList from the heap
and dynamically resize it when necessary.
Attachment #88237 -
Flags: needs-work+
Comment 35•23 years ago
|
||
Updated patch.
Ben: I cannot verify it as a dupe, as, I can not reproduce the original
problem, even when I don't have this patch applied. (or, at least what I think
was the problem) So, my feeling is that it is not a duplicate.
Attachment #88237 -
Attachment is obsolete: true
Comment 36•23 years ago
|
||
Paul, I will check this in when the code freeze for
Mozilla 1.1beta is over. If you want me to check
this in sooner, could you please get drivers' approval?
Attachment #91579 -
Attachment is obsolete: true
Comment 37•23 years ago
|
||
Comment on attachment 91608 [details] [diff] [review]
Update patch (fixed indentation, white space, and comments) (checked in)
r=wtc. (This is Paul's patch with my edits.)
Attachment #91608 -
Flags: review+
Comment 38•23 years ago
|
||
Comment on attachment 91608 [details] [diff] [review]
Update patch (fixed indentation, white space, and comments) (checked in)
a=scc for checkin to the mozilla trunk
Attachment #91608 -
Flags: approval+
Updated•23 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Target Milestone: --- → 4.2.2
Comment 39•23 years ago
|
||
Fix checked into the tip and NSPRPUB_PRE_4_2_CLIENT_BRANCH
(used by mozilla trunk) of NSPR.
Comment 40•23 years ago
|
||
*** Bug 157351 has been marked as a duplicate of this bug. ***
Comment 41•23 years ago
|
||
Reopening as I see the same problem where net_server dies whenever a form is submitted.
Restarting net_server works for other apps but mozilla needs to be restarted to be usable
again.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•23 years ago
|
Priority: -- → P1
Target Milestone: 4.2.2 → 4.3
| Reporter | ||
Comment 42•22 years ago
|
||
I doubt this bug is worth "critical" flag.
No more crashes - problem is if more perfect implementation is possible.
Updated•22 years ago
|
Priority: P1 → P2
| Reporter | ||
Comment 43•22 years ago
|
||
According to http://bugzilla.mozilla.org/show_bug.cgi?id=141614#c41
reassigning to Chris
(in other case i prefer to close it)
Assignee: sergei_d → seawood
Status: REOPENED → NEW
Updated•18 years ago
|
QA Contact: wtchang → nspr
Comment 45•18 years ago
|
||
The target milestone was already released. Resetting target milestone.
Target Milestone: 4.3 → ---
Comment 46•17 years ago
|
||
Comment on attachment 91608 [details] [diff] [review]
Update patch (fixed indentation, white space, and comments) (checked in)
This patch was committed. Is this bug now fixed?
Attachment #91608 -
Attachment description: Update patch (fixed indentation, white space, and comments) → Update patch (fixed indentation, white space, and comments) (checked in)
| Reporter | ||
Comment 47•17 years ago
|
||
(In reply to comment #46)
> (From update of attachment 91608 [details] [diff] [review])
> This patch was committed. Is this bug now fixed?
>
seems no problem anymore
Updated•17 years ago
|
Status: NEW → RESOLVED
Closed: 23 years ago → 17 years ago
Resolution: --- → FIXED
Target Milestone: --- → 4.2.2
Updated•14 years ago
|
Crash Signature: [@SocketConnect]
You need to log in
before you can comment on or make changes to this bug.
Description
•