Closed Bug 70808 Opened 24 years ago Closed 24 years ago

When net_server is restarted while Mozilla is running, Mozilla uses 100% of CPU

Categories

(Core :: Networking, defect, P3)

x86
BeOS
defect

Tracking

()

RESOLVED FIXED
mozilla0.9.4

People

(Reporter: rseguy, Assigned: cls)

Details

(Keywords: helpwanted)

Attachments

(5 files)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (BeOS; U; BeOS 5.0 BePC; en-US; 0.8) Gecko/20010222
BuildID: 2001022211

Reproducible: Always
Steps to Reproduce:
1. Launch Mozilla.
2. Restart net_server (BeOS -> Preferences -> Network -> Restart Networking).

Actual Results: Mozilla uses 100% of CPU. Quitting Mozilla and re-launching is necessary.

Expected Results: Mozilla should not have used 100% of the CPU.
beos
Assignee: asa → koehler
Component: Browser-General → Networking
QA Contact: doronr → tever
Reporter is this still a problem in the latest nightlies?
I'm afraid I can't help you: networking with net_server has been broken on BeOS since the end of February, so there have been no new nightlies for BeOS since then :-( I don't have BONE, and I haven't finished downloading the Mozilla source from CVS...
I see this as well with my build from earlier tonight.

 thid    total     user   kernel  %cpu  team name    thread name
33913  4778.33  2533.00  2244.00  95.6  mozilla-bin  moz-thread
Status: UNCONFIRMED → NEW
Ever confirmed: true
mass move, v2. qa to me.
QA Contact: tever → benc
This bug still appears in build 2001061213 (Mozilla-i586-pc-beos-0.9.1).
Keywords: helpwanted
Process Controller indicates that one of the two threads called 'moz-thread' is the one using 100% of the CPU.
+qawanted - I have no BeOS system. Is net_server the network access service (the IP stack on your OS)? We have had some reports of racing when losing network connections in the past, but they are generally resolved. Does the same thing happen if you just unplug the network or hang up the modem?
Keywords: qawanted
net_server is the IP stack/network access service of BeOS. The problem doesn't happen if the modem is switched off while Mozilla is running.
This bug still appears in build 2001070302 (Mozilla-i586-pc-beos-0.9.2).
No more working on Bezilla.
Assignee: koehler → nobody
I'm not sure if this is a necko bug or an NSPR bug. If I make sure that USE_POLLABLE_EVENT is not defined for BeOS in netwerk/base/src/nsSocketTransportService.h, then restarting net_server works fine. Darin, is there any harm in undefining this?
Assignee: nobody → cls
Priority: -- → P3
Target Milestone: --- → mozilla0.9.4
if you don't use a pollable event, then the socket thread will get woken up every 5 milliseconds to check for sockets that need to be added to the select list. this code was originally for the Mac, as it was only very recently that NSPR supported pollable events on the Mac.
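For readers unfamiliar with the idea, here is a minimal sketch of what a pollable event buys you. It is not the NSPR or necko code: it assumes a POSIX system and uses a pipe, whereas the BeOS pollable event discussed in this bug is a socket pair to localhost. The names wake_event, wake_event_init, wake_event_signal, wake_event_wait and wake_event_destroy are hypothetical. The waiting thread can block in select() for a long time and still be woken at once, instead of timing out every 5 milliseconds to look for new work.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical stand-in for a pollable event: the read end sits in the
 * select() set, and another thread writes to the write end to wake it. */
typedef struct {
    int read_fd;
    int write_fd;
} wake_event;

/* Creates the pipe. Returns true on success. */
static bool wake_event_init(wake_event *ev) {
    int fds[2];
    if (pipe(fds) != 0) {
        return false;
    }
    ev->read_fd = fds[0];
    ev->write_fd = fds[1];
    return true;
}

/* Wakes the waiting thread by making the read end readable. */
static bool wake_event_signal(wake_event *ev) {
    char byte = 1;
    return write(ev->write_fd, &byte, 1) == 1;
}

/* Waits up to timeout_ms. Returns 1 if woken, 0 on timeout, -1 on error. */
static int wake_event_wait(wake_event *ev, int timeout_ms) {
    fd_set rd;
    struct timeval tv;
    char byte;
    int n;

    FD_ZERO(&rd);
    FD_SET(ev->read_fd, &rd);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(ev->read_fd + 1, &rd, NULL, NULL, &tv);
    if (n < 0) {
        return -1;              /* select itself failed */
    }
    if (n == 0) {
        return 0;               /* nothing to do before the timeout */
    }
    if (read(ev->read_fd, &byte, 1) != 1) {
        return -1;              /* could not drain the wake-up byte */
    }
    return 1;
}

/* Closes both ends of the pipe. */
static void wake_event_destroy(wake_event *ev) {
    close(ev->read_fd);
    close(ev->write_fd);
}
```

In a real socket thread the read end would be one entry among the active sockets in the poll list; the sketch waits on it alone to keep the mechanism visible.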
an alternative solution to this bug would probably be to go offline before restarting net_server, as doing so would destroy the socket transport service and thus kill the pollable event socket pair.
i'm not exactly sure how we're getting into this state, but it appears that we're returning from PR_Poll with PR_POLL_WRITE set on a socket which has no associated write request and hence no data to write. somehow we're calling PR_Poll with PR_POLL_WRITE when we shouldn't, or somehow we're losing the write request too early.
Going offline, then restarting net_server didn't make a difference. The cpu gets pegged whenever you come back online.

Digging deeper, what I'm seeing is this: When necko starts up, 2 sockets are opened to localhost (according to netstat). When net_server is restarted, those sockets go away but PR_Poll is still returning success...that's bad. The implementation of _MD_pr_poll for beos is slightly faulty. It doesn't properly catch errors returned from select (as beos doesn't have poll()).

After fixing that, I'm still seeing the CPU being pegged. nsSocketTransport is missing code to deal with errors from PR_Poll. The comment indicates that this "should never happen" so I'm not quite sure how to deal with it yet.
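The kind of check that was missing can be sketched as follows. This is a POSIX illustration only, not the BeOS _MD_pr_poll code; wait_readable is a hypothetical helper. The point is that a negative return from select() must reach the caller as a failure rather than being treated as "no fds ready" or as success.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: waits up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, and -1 with errno set on error. */
static int wait_readable(int fd, int timeout_ms) {
    fd_set rd;
    struct timeval tv;
    int n;

    /* FD_SET is only defined for descriptors in [0, FD_SETSIZE). */
    if (fd < 0 || fd >= FD_SETSIZE) {
        errno = EBADF;
        return -1;
    }

    FD_ZERO(&rd);
    FD_SET(fd, &rd);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(fd + 1, &rd, NULL, NULL, &tv);
    if (n < 0) {
        /* Propagate the failure; errno (e.g. EBADF) is left for the caller. */
        return -1;
    }
    return n > 0 ? 1 : 0;
}
```

A caller that loops on this helper needs its own error branch as well; otherwise a descriptor that has gone away makes every call return immediately and the loop spins, which matches the pegged CPU seen here.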
The last patch attempts to recreate the PollableEvent after PR_Poll fails. Because net_server doesn't restart instantly, the cpu will be pegged until PR_Poll stops failing....presumably after net_server comes back up. This usually takes 7-10 secs. Note: this hack only works if you are not currently loading a page (ie, nothing else is on mActiveTransportList) when net_server is restarted. I haven't quite figured out how to handle that condition.
> somehow we're calling PR_Poll with PR_POLL_WRITE when we shouldn't, or somehow
> we're losing the write request too early.

I think this may be caused by bug 65909 which states that beos' select implementation is incomplete and returns immediately if you attempt to check the write bits.
the NSPR patch looks correct to me. as for the sockettransportservice patch, it looks fine w/ the exception of some minor nits: 1) thread_event breaks naming convention, and seems a bit vague.. i think it means hadThreadEvent.. is that correct? and if so, why not call it this instead? 2) how about using C++ style comments? otherwise r/sr=darin
Why don't you create a new pollable event immediately after you destroy the old one? This way you can contain the hack in one place and omit the thread_event flag.
I don't attempt to recreate the pollable event immediately because resetting the ip stack (restarting net_server) isn't an atomic operation, so there is a slight delay (7-10 secs) before it becomes available again. During that time, any attempts to recreate the pollable event will fail, so we need to have the thread_event check anyway. I thought it'd be cleaner if we only attempted the recreation in a single place.
When reviewing cls's NSPR patch I noticed two problems.

1. The first problem is in the existing code. In mozilla/nsprpub/pr/src/md/beos/bfile.c, we have:

    timeout -= PR_IntervalNow() - start;
    if(timeout <= 0) {
        /* timed out */
        n = 0;
    }

This code is wrong for two reasons. First, it is only correct the first time it is executed. The second time it is executed, it will subtract too much from 'timeout'. The second reason is that PRIntervalTime is an unsigned type and so 'timeout' will never be less than 0.

Here is one way to do this right.

    PRIntervalTime now, elapsed;

    now = PR_IntervalNow();
    elapsed = (PRIntervalTime) (now - start);
    if (elapsed < timeout) {
        timeout -= elapsed;
        start = now;
    } else {
        /* timed out */
        n = 0;
    }

This should be fixed.

2. In cls's patch, if _MD_pr_poll returns -1 (where it sets rc to -1 if n < 0), it does not call PR_SetError() to set the error codes. Moreover, in the PR_Poll interface, the EBADF error from select() should be returned by returning a positive value (indicating how many fd's are bad) and set PR_POLL_NVAL in the out_flags fields of the bad fd's. You can look at mozilla/nsprpub/pr/src/md/unix/uxpoll.c as an example.

This problem is less serious. I believe that it only makes it harder to diagnose a programming error. (I don't think a fd passed to select() will go bad by itself.) So depending on how motivated you are, you might want to just mark the code with a big "FIXME", perhaps with my comments above.

In summary, it is fine to check in the NSPR patch after adding the suggested "FIXME" and comments. However, a pre-existing problem in that file around the manipulation of 'timeout' should be fixed.
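The timeout bookkeeping from point 1 can be checked in isolation with a small standalone sketch. It is not the bfile.c code: interval_t stands in for PRIntervalTime (an unsigned 32-bit tick count), charge_elapsed is a hypothetical name, and the current time is passed in as a parameter instead of being read from PR_IntervalNow() so the arithmetic can be exercised with fixed values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PRIntervalTime: an unsigned 32-bit tick count. */
typedef uint32_t interval_t;

/* Charges the time elapsed since *start against *timeout.
 * Returns true if the timeout has expired. Otherwise it shrinks *timeout,
 * advances *start to now so the same ticks are not charged twice,
 * and returns false. */
static bool charge_elapsed(interval_t *timeout, interval_t *start,
                           interval_t now) {
    /* Unsigned subtraction stays correct if the tick counter wraps. */
    interval_t elapsed = (interval_t)(now - *start);

    if (elapsed < *timeout) {
        *timeout -= elapsed;
        *start = now;
        return false;
    }
    return true;
}
```

Starting with a timeout of 100 ticks at tick 1000, a call at tick 1030 leaves 70 ticks and a call at tick 1050 leaves 50. The original code, which never advances 'start', would subtract 30 and then 50, leaving 20 after the second pass.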
The BeOS/BONE patch for NSPR has a fixed implementation of PR_Poll based upon the win32 one so it should have the timeout & errno fixes. This win32-based implementation works for the non-BONE ip stack as well so I'll just land that. Marking this bug fixed.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Ever since this patch was committed, it seems that mozilla has become a little "flaky" on beos. It will complain that a site cannot be found, and then, if you try again right afterwards, it works. Now, you may not notice this on a fast internet connection, but on a 28.8k connection like I have (damn crappy phone lines), it happens very, very often. Just thought I'd make a note of it.
So the natural follow-up question would be, does backing out the patch fix the flakiness?
*sigh* Now again for the proper party.....does backing out the patch fix the "flakiness"? I don't see how it would unless net_server is restarting whenever the "flakiness" occurs.
Well, the net_server usually only gets restarted on my machine when I reboot. I don't change my network settings very often. I will try to see if there is a difference, if I back out the patch. Plus, I just posted a build, so we can find out if other people are having the problem as well.
Keywords: qawanted