Closed
Bug 70808
Opened 24 years ago
Closed 24 years ago
When net_server is restarted while Mozilla is running, Mozilla uses 100% of CPU
Categories
(Core :: Networking, defect, P3)
RESOLVED
FIXED
mozilla0.9.4
People
(Reporter: rseguy, Assigned: cls)
Details
(Keywords: helpwanted)
Attachments
(5 files)
727 bytes, patch
354.74 KB, text/plain
826.67 KB, text/plain
1.34 KB, patch
2.12 KB, patch
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (BeOS; U; BeOS 5.0 BePC; en-US; 0.8) Gecko/20010222
BuildID: 2001022211
Reproducible: Always
Steps to Reproduce:
Launch Mozilla
Restart net_server (BeOS -> Preferences -> Network -> Restart Networking)
Actual Results: Mozilla uses 100% of CPU. Quitting Mozilla and re-launching is
necessary.
Expected Results: Mozilla should not have used 100% of the CPU
beos
Assignee: asa → koehler
Component: Browser-General → Networking
QA Contact: doronr → tever
Comment 2 • 24 years ago
Reporter is this still a problem in the latest nightlies?
Comment 3 (Reporter) • 24 years ago
I'm afraid I can't help you: networking with net_server has been broken on BeOS since the end of February, so there have been no new BeOS nightlies since then :-( I don't have BONE, and I haven't finished downloading the Mozilla source from CVS...
Comment 4 (Reporter) • 24 years ago
I'm afraid I can't help you: networking with net_server has been broken on BeOS since the end of February, so there have been no new BeOS nightlies since then :-( I haven't finished downloading the Mozilla source from CVS, and I don't have BONE...
I see this as well with my build from earlier tonight.
thid   total    user     kernel   %cpu  team name    thread name
33913  4778.33  2533.00  2244.00  95.6  mozilla-bin  moz-thread
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 7 (Reporter) • 24 years ago
This bug still appears in build 2001061213 (Mozilla-i586-pc-beos-0.9.1).
Keywords: helpwanted
Comment 8 (Reporter) • 24 years ago
Process Controller shows that one of the two threads named 'moz-thread' is the one using 100% of the CPU.
+qawanted - I have no BeOS system.
Is net_server the network access service (the IP stack on your OS)?
We have had some reports of racing when losing network connections in the past,
but they are generally resolved. Does the same thing happen if you just unplug
the network or hang up the modem?
Keywords: qawanted
Comment 10 (Reporter) • 24 years ago
net_server is the IP stack/network access service of BeOS. The problem doesn't happen if the modem is switched off while Mozilla is running.
Comment 11 (Reporter) • 24 years ago
This bug still appears in build 2001070302 (Mozilla-i586-pc-beos-0.9.2).
Comment 13 • 24 years ago
No longer working on Bezilla.
Comment 14 (Assignee) • 24 years ago
I'm not sure if this is a necko bug or a NSPR bug. If I make sure that
USE_POLLABLE_EVENT is not defined for BeOS in
netwerk/base/src/nsSocketTransportService.h , then restarting net_server works
fine. Darin, is there any harm in undefining this?
Assignee: nobody → cls
Priority: -- → P3
Target Milestone: --- → mozilla0.9.4
Comment 15 (Assignee) • 24 years ago
Comment 16 • 24 years ago
if you don't use a pollable event, then the socket thread will get woken up
every 5 milliseconds to check for sockets that need to be added to the select list.
this code was originally for the MAC, as it was only very recently that NSPR
supported pollable events on the MAC.
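A minimal sketch of the pollable-event idea, for readers unfamiliar with it. It assumes POSIX pipe() and select() instead of NSPR's own implementation (which, per this bug, uses a socket pair on BeOS), and the wake_event names are hypothetical. The point is that the socket thread can block in select() and be woken on demand, with no short periodic timeout.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical stand-in for a pollable event: a pipe whose read end
 * sits in the select() set next to the real sockets. */
typedef struct {
    int read_fd;
    int write_fd;
} wake_event;

/* Returns 0 on success, -1 if the pipe could not be created. */
static int wake_event_init(wake_event *ev)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    ev->read_fd = fds[0];
    ev->write_fd = fds[1];
    return 0;
}

/* Called by another thread to wake the thread blocked in select(). */
static int wake_event_set(const wake_event *ev)
{
    char byte = 1;
    return write(ev->write_fd, &byte, 1) == 1 ? 0 : -1;
}

/* Waits up to timeout_ms for the event.
 * Returns 1 if it fired (and consumes it), 0 on timeout, -1 on error. */
static int wake_event_wait(const wake_event *ev, int timeout_ms)
{
    fd_set readable;
    struct timeval tv;
    char byte;
    int n;

    assert(ev->read_fd >= 0 && ev->read_fd < FD_SETSIZE);
    FD_ZERO(&readable);
    FD_SET(ev->read_fd, &readable);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(ev->read_fd + 1, &readable, NULL, NULL, &tv);
    if (n <= 0)
        return n; /* 0 = timed out, -1 = select() failed */
    return read(ev->read_fd, &byte, 1) == 1 ? 1 : -1;
}

/* Releases both ends of the pipe. */
static void wake_event_destroy(wake_event *ev)
{
    close(ev->read_fd);
    close(ev->write_fd);
    ev->read_fd = ev->write_fd = -1;
}
```

In a real socket thread the read end would be one entry among many in the poll set; if the descriptors behind the event disappear, as happens here when net_server restarts, every wait fails until the event is rebuilt.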
Comment 17 • 24 years ago
an alternative solution to this bug would probably be to go offline before
restarting net_server, as doing so would destroy the socket transport service
and thus kill the pollable event socket pair.
Comment 18 (Assignee) • 24 years ago
Comment 19 • 24 years ago
i'm not exactly sure how we're getting into this state, but it appears that
we're returning from PR_Poll with PR_POLL_WRITE set on a socket which has no
associated write request and hence no data to write. somehow we're calling
PR_Poll with PR_POLL_WRITE when we shouldn't, or somehow we're losing the write
request too early.
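To illustrate why polling for write readiness with nothing queued can peg the CPU, here is a small sketch. It assumes POSIX select() and a Unix-domain socket pair, not NSPR or net_server, and the function names are hypothetical. An idle connected socket is normally writable, so each poll returns at once and the loop never sleeps. Whether this is the exact mechanism in this bug is not established by the thread.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Creates a connected pair of stream sockets. Returns 0 on success. */
static int make_connected_pair(int sv[2])
{
    return socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
}

/* Returns 1 if fd is reported writable without waiting,
 * 0 if it is not, -1 if select() failed. */
static int writable_now(int fd)
{
    fd_set writable;
    struct timeval zero = {0, 0};

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&writable);
    FD_SET(fd, &writable);
    return select(fd + 1, NULL, &writable, NULL, &zero);
}

/* Counts how many polls report "writable" even though the caller has
 * no data queued. Each such wakeup is wasted work. */
static int count_spurious_wakeups(int fd, int iterations)
{
    int i, wakeups = 0;

    for (i = 0; i < iterations; i++) {
        if (writable_now(fd) == 1)
            wakeups++; /* nothing to write, so the loop just spins */
    }
    return wakeups;
}
```

The usual remedy is to request write readiness only while a write request is actually pending.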
Comment 20 (Assignee) • 24 years ago
Comment 21 (Assignee) • 24 years ago
Going offline, then restarting net_server didn't make a difference. The cpu
gets pegged whenever you come back online.
Digging deeper, what I'm seeing is this:
When necko starts up, 2 sockets are opened to localhost (according to netstat).
When net_server is restarted, those sockets go away but PR_Poll is still
returning success...that's bad. The implementation of _MD_pr_poll for beos is
slightly faulty. It doesn't properly catch errors returned from select (as beos
doesn't have poll()). After fixing that, I'm still seeing the CPU being pegged.
nsSocketTransport is missing code to deal with errors from PR_Poll. The
comment indicates that this "should never happen" so I'm not quite sure how to
deal with it yet.
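A rough sketch of what catching errors from select() can look like. It assumes POSIX select() and fcntl(); the poll_entry type and poll_readable function are hypothetical and are not the BeOS _MD_pr_poll code or the attached patch. On EBADF it probes each descriptor and flags the dead ones instead of reporting success.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical, simplified stand-in for a poll descriptor. */
typedef struct {
    int fd;
    int readable; /* out: 1 if data can be read */
    int invalid;  /* out: 1 if the descriptor is no longer valid */
} poll_entry;

/* Polls the entries for readability.
 * Returns the number of entries with an out flag set, 0 on timeout,
 * or -1 (errno set by select) for errors other than EBADF. */
static int poll_readable(poll_entry *entries, int count, int timeout_ms)
{
    fd_set readable;
    struct timeval tv;
    int i, n, max_fd = -1, flagged = 0;

    FD_ZERO(&readable);
    for (i = 0; i < count; i++) {
        assert(entries[i].fd >= 0 && entries[i].fd < FD_SETSIZE);
        entries[i].readable = 0;
        entries[i].invalid = 0;
        FD_SET(entries[i].fd, &readable);
        if (entries[i].fd > max_fd)
            max_fd = entries[i].fd;
    }
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(max_fd + 1, &readable, NULL, NULL, &tv);
    if (n < 0) {
        if (errno != EBADF)
            return -1;
        /* The fd_set is unspecified after a failure, so probe each
         * descriptor directly to find the ones that went away. */
        for (i = 0; i < count; i++) {
            if (fcntl(entries[i].fd, F_GETFD) == -1) {
                entries[i].invalid = 1;
                flagged++;
            }
        }
        return flagged;
    }
    for (i = 0; i < count; i++) {
        if (FD_ISSET(entries[i].fd, &readable)) {
            entries[i].readable = 1;
            flagged++;
        }
    }
    return flagged;
}
```

A caller that sees the invalid flag can then drop or rebuild that descriptor, where the code described above would simply poll it again.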
Comment 22 (Assignee) • 24 years ago
Comment 23 (Assignee) • 24 years ago
Comment 24 (Assignee) • 24 years ago
The last patch attempts to recreate the PollableEvent after PR_Poll fails.
Because net_server doesn't restart instantly, the cpu will be pegged until
PR_Poll stops failing....presumably after net_server comes back up. This
usually takes 7-10 secs.
Note: this hack only works if you are not currently loading a page (ie, nothing
else is on mActiveTransportList) when net_server is restarted. I haven't quite
figured out how to handle that condition.
Comment 25 (Assignee) • 24 years ago
> somehow we're calling PR_Poll with PR_POLL_WRITE when we shouldn't, or somehow
> we're losing the write request too early.
I think this may be caused by bug 65909 which states that beos' select
implementation is incomplete and returns immediately if you attempt to check the
write bits.
Comment 26 • 24 years ago
the NSPR patch looks correct to me.
as for the sockettransportservice patch, it looks fine w/ the exception of some
minor nits:
1) thread_event breaks naming convention, and seems a bit vague.. i think it
means hadThreadEvent.. is that correct? and if so, why not call it this instead?
2) how about using C++ style comments?
otherwise r/sr=darin
Comment 27 • 24 years ago
Why don't you create a new pollable event immediately
after you destroy the old one? This way you can contain
the hack in one place and omit the thread_event flag.
Comment 28 (Assignee) • 24 years ago
I don't attempt to recreate the pollable event immediately because resetting the
ip stack (restarting net_server) isn't an atomic operation so there is a slight
delay (7-10 secs) before it becomes available again. During that time, any
attempts to recreate the pollable event will fail so we need to have the
thread_event check anyway. I thought it'd be cleaner if we only attempted the
recreation in a single place.
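The reasoning above can be modeled with a toy loop. This is only a sketch: the actual patch is not reproduced in this report, and socket_thread, loop_iteration, and create_after_delay are hypothetical names. A flag records whether a usable event exists, and recreation is attempted in exactly one place on each pass until it succeeds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the socket thread's state. */
typedef struct {
    int have_event;      /* plays the role of the thread_event flag */
    int create_attempts; /* how many times recreation was tried */
} socket_thread;

/* Hypothetical factory: returns 1 if the event could be created. */
typedef int (*create_event_fn)(void *ctx);

/* One pass of the loop. poll_ok says whether the poll call succeeded. */
static void loop_iteration(socket_thread *st, int poll_ok,
                           create_event_fn create, void *ctx)
{
    assert(st != NULL && create != NULL);
    if (!poll_ok)
        st->have_event = 0; /* poll failed: drop the stale event */
    if (!st->have_event) {
        /* Single recreation point; it can keep failing while the
         * network stack is still restarting. */
        st->create_attempts++;
        st->have_event = create(ctx);
    }
}

/* Test double: fails until the counter in ctx reaches zero, standing
 * in for the delay before the stack is available again. */
static int create_after_delay(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return 0;
    }
    return 1;
}
```

Because nothing in this model sleeps between failed attempts, it also shows why the CPU stays busy until the stack comes back.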
Comment 29 • 24 years ago
When reviewing cls's NSPR patch I noticed two problems.
1. The first problem is in the existing code. In
mozilla/nsprpub/pr/src/md/beos/bfile.c, we have:
    timeout -= PR_IntervalNow() - start;
    if(timeout <= 0)
    {
        /* timed out */
        n = 0;
    }
This code is wrong for two reasons. First, it is only
correct the first time it is executed. The second time
it is executed, it will subtract too much from 'timeout'.
The second reason is that PRIntervalTime is an unsigned
type and so 'timeout' will never be less than 0.
Here is one way to do this right.
    PRIntervalTime now, elapsed;
    now = PR_IntervalNow();
    elapsed = (PRIntervalTime) (now - start);
    if (elapsed < timeout)
    {
        timeout -= elapsed;
        start = now;
    }
    else
    {
        /* timed out */
        n = 0;
    }
This should be fixed.
2. In cls's patch, if _MD_pr_poll returns -1 (where it sets rc to -1
if n < 0), it does not call PR_SetError() to set the error codes.
Moreover, in the PR_Poll interface, the EBADF error from select()
should be returned by returning a positive value (indicating how
many fd's are bad) and set PR_POLL_NVAL in the out_flags fields of
the bad fd's. You can look at mozilla/nsprpub/pr/src/md/unix/uxpoll.c
as an example.
This problem is less serious. I believe that it only makes it
harder to diagnose a programming error. (I don't think a fd
passed to select() will go bad by itself.) So depending on how
motivated you are, you might want to just mark the code with a
big "FIXME", perhaps with my comments above.
In summary, it is fine to check in the NSPR patch after adding
the suggested "FIXME" and comments. However, a pre-existing
problem in that file around the manipulation of 'timeout' should
be fixed.
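The timeout problem in item 1 can be reproduced in isolation. The sketch below uses uint32_t as a stand-in for PRIntervalTime (an unsigned 32-bit type in NSPR) and passes the clock reading in as a parameter so the behavior is deterministic; buggy_update and fixed_update are hypothetical names, not functions in bfile.c.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for PRIntervalTime. */
typedef uint32_t interval_t;

/* Pattern from the existing code: 'start' is never advanced, so every
 * call subtracts the total time since the first call.
 * Returns 1 if a timeout is detected. */
static int buggy_update(interval_t *timeout, interval_t start,
                        interval_t now)
{
    *timeout -= now - start;
    return *timeout <= 0; /* unsigned: true only when exactly 0 */
}

/* Suggested pattern: subtract only the time elapsed since the last
 * update and compare before subtracting, so nothing can wrap.
 * Returns 1 if the timeout has expired. */
static int fixed_update(interval_t *timeout, interval_t *start,
                        interval_t now)
{
    interval_t elapsed = (interval_t)(now - *start);

    if (elapsed < *timeout) {
        *timeout -= elapsed;
        *start = now;
        return 0;
    }
    *timeout = 0;
    return 1;
}
```

With a timeout of 100 ticks and clock readings of 60 and then 120, the buggy version leaves 40 after the first call, then subtracts 120 from it and wraps to a very large value, so the wait effectively never times out. The fixed version leaves 40 after the first call and reports a timeout on the second.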
Comment 30 (Assignee) • 24 years ago
The BeOS/BONE patch for NSPR has a fixed implementation of PR_Poll based upon
the win32 one so it should have the timeout & errno fixes. This win32-based
implementation works for the non-BONE ip stack as well so I'll just land that.
Marking this bug fixed.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Comment 31 • 24 years ago
Ever since this patch was committed, it seems that mozilla has become a little
"flaky" on beos. It will complain that a site cannot be found, and then, if
you try again right afterwards, it works. Now, you may not notice this on a
fast internet connection, but on a 28.8k connection like I have (damn crappy
phone lines), it happens very, very often.
Just thought I'd make a note of it.
Comment 32 (Assignee) • 24 years ago
So the natural follow-up question would be, does backing out the patch fix the
flakiness?
Comment 33 (Assignee) • 24 years ago
*sigh* Now again for the proper party.....does backing out the patch fix the
"flakiness"? I don't see how it would unless net_server is restarting whenever
the "flakiness" occurs.
Comment 34 • 24 years ago
Well, the net_server usually only gets restarted on my machine when I reboot. I
don't change my network settings very often. I will try to see if there is a
difference, if I back out the patch. Plus, I just posted a build, so we can
find out if other people are having the problem as well.