Closed Bug 70808 Opened 24 years ago Closed 24 years ago

When net_server is restarted while Mozilla is running, Mozilla uses 100% of CPU

Tracking

Status:

RESOLVED FIXED

Milestone:

mozilla0.9.4

People

(Reporter: rseguy, Assigned: cls)

Details

(Keywords: helpwanted)

Attachments

(5 files)

Don't define USE_POLLABLE_EVENT for beos (cls, 24 years ago; 727 bytes, patch)
nsSocketTransport:5 log -- warning: contains noisy debug build assertions (cls, 24 years ago; 354.74 KB, text/plain)
nsSocketTransport log -- going offline, then restarting net_server, then back online (cls, 24 years ago; 826.67 KB, text/plain)
Only ignore failed select() if errno == EINTR (cls, 24 years ago; 1.34 KB, patch)
Attempt to recreate PollableEvent if PR_Poll fails (cls, 24 years ago; 2.12 KB, patch)

Romain SEGUY

Reporter

Description

•

24 years ago

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (BeOS; U; BeOS 5.0 BePC; en-US; 0.8) Gecko/20010222
BuildID: 2001022211

Reproducible: Always

Steps to Reproduce:
Launch Mozilla
Restart net_server (BeOS -> Preferences -> Network -> Restart Networking)

Actual Results: Mozilla uses 100% of CPU. Quitting Mozilla and re-launching is necessary.

Expected Results: Mozilla should not have used 100% of the CPU

timeless

Comment 1

•

24 years ago

beos

Assignee: asa → koehler

Component: Browser-General → Networking

QA Contact: doronr → tever

Keyser Sose

Comment 2

•

24 years ago

Reporter is this still a problem in the latest nightlies?

Romain SEGUY

Reporter

Comment 3

•

24 years ago

I'm afraid I can't help you: networking with net_server has been broken for BeOS since the end of February, so there have been no new nightlies for BeOS since that time :-( I don't have BONE, and I haven't finished downloading the mozilla src from CVS...

Romain SEGUY

Reporter

Comment 4

•

24 years ago

I'm afraid I can't help you: networking with net_server has been broken for BeOS since the end of February, so there have been no new nightlies for BeOS since that time :-( I haven't finished downloading the mozilla src from CVS, and I don't have BONE...

cls

Assignee

Comment 5

•

24 years ago

I see this as well with my build from earlier tonight.

 thid    total     user   kernel  %cpu  team name    thread name
33913  4778.33  2533.00  2244.00  95.6  mozilla-bin  moz-thread

Status: UNCONFIRMED → NEW

Ever confirmed: true

benc

Comment 6

•

24 years ago

mass move, v2. qa to me.

QA Contact: tever → benc

Romain SEGUY

Reporter

Comment 7

•

24 years ago

This bug still appears in build 2001061213 (Mozilla-i586-pc-beos-0.9.1).

timeless

Updated

•

24 years ago

Keywords: helpwanted

Romain SEGUY

Reporter

Comment 8

•

24 years ago

Process Controller indicates that this is one of the two threads called 'moz-thread' that uses 100% of CPU.

benc

Comment 9

•

24 years ago

+qawanted - I have no BeOS system. Is net_server the network access service (IP stack on your OS)? We have had some reports of racing when losing network connections in the past, but they are generally resolved. Does the same thing happen if you just unplug the network or hang up the modem?

Keywords: qawanted

Romain SEGUY

Reporter

Comment 10

•

24 years ago

net_server is the IP stack/network access service of BeOS. The problem doesn't happen if the modem is switched off while Mozilla is running.

Romain SEGUY

Reporter

Comment 11

•

24 years ago

This bug still appears in build 2001070302 (Mozilla-i586-pc-beos-0.9.2).

Yannick Koehler

Comment 12

•

24 years ago

No more working on Bezilla.

Assignee: koehler → nobody

Yannick Koehler

Comment 13

•

24 years ago

No more working on Bezilla

cls

Assignee

Comment 14

•

24 years ago

I'm not sure if this is a necko bug or a NSPR bug. If I make sure that USE_POLLABLE_EVENT is not defined for BeOS in netwerk/base/src/nsSocketTransportService.h , then restarting net_server works fine. Darin, is there any harm in undefining this?

Assignee: nobody → cls

Priority: -- → P3

Target Milestone: --- → mozilla0.9.4

cls

Assignee

Comment 15

•

24 years ago

Attached patch: Don't define USE_POLLABLE_EVENT for beos

Darin Fisher

Comment 16

•

24 years ago

if you don't use a pollable event, then the socket thread will get woken up every 5 milliseconds to check for sockets that need to be added to the select list. this code was originally for the Mac, as it was only very recently that NSPR supported pollable events on the Mac.
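To illustrate the trade-off described in comment 16, here is a minimal sketch of the wake-up idea behind a pollable event. It is not NSPR's implementation: it assumes a POSIX system and uses a plain pipe with select(), and the names wake_event, wake_event_init, wake_event_set, wake_event_wait and wake_event_destroy are hypothetical. The point is that the waiting thread can block with a long timeout and still be woken on demand, instead of waking every few milliseconds to check for new work.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical stand-in for a pollable event: a pipe whose read end
 * is placed in the select() set alongside the real sockets. */
typedef struct {
    int read_fd;
    int write_fd;
} wake_event;

/* Returns 0 on success, -1 if the pipe could not be created. */
int wake_event_init(wake_event *ev)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    ev->read_fd = fds[0];
    ev->write_fd = fds[1];
    return 0;
}

/* Called by another thread to wake the waiter: write one byte. */
int wake_event_set(wake_event *ev)
{
    char b = 1;
    return write(ev->write_fd, &b, 1) == 1 ? 0 : -1;
}

/* Wait up to timeout_ms for the event.
 * Returns 1 if signalled, 0 on timeout, -1 on error. */
int wake_event_wait(wake_event *ev, long timeout_ms)
{
    fd_set rd;
    struct timeval tv;
    char b;
    int n;

    FD_ZERO(&rd);
    FD_SET(ev->read_fd, &rd);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(ev->read_fd + 1, &rd, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    /* Consume the wake-up byte so the next wait blocks again. */
    return read(ev->read_fd, &b, 1) == 1 ? 1 : -1;
}

void wake_event_destroy(wake_event *ev)
{
    close(ev->read_fd);
    close(ev->write_fd);
}
```

If the descriptors behind such an event disappear underneath the process, as the later comments suggest happens when net_server restarts, the wait can no longer block properly, which is consistent with the busy loop reported here.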

Darin Fisher

Comment 17

•

24 years ago

an alternative solution to this bug would probably be to go offline before restarting net_server, as doing so would destroy the socket transport service and thus kill the pollable event socket pair.

cls

Assignee

Comment 18

•

24 years ago

Attached file: nsSocketTransport:5 log -- warning: contains noisy debug build assertions

Darin Fisher

Comment 19

•

24 years ago

i'm not exactly sure how we're getting into this state, but it appears that we're returning from PR_Poll with PR_POLL_WRITE set on a socket which has no associated write request and hence no data to write. somehow we're calling PR_Poll with PR_POLL_WRITE when we shouldn't, or somehow we're losing the write request too early.

cls

Assignee

Comment 20

•

24 years ago

Attached file: nsSocketTransport log -- going offline, then restarting net_server, then back online

cls

Assignee

Comment 21

•

24 years ago

Going offline, then restarting net_server didn't make a difference. The cpu gets pegged whenever you come back online.

Digging deeper, what I'm seeing is this: When necko starts up, 2 sockets are opened to localhost (according to netstat). When net_server is restarted, those sockets go away but PR_Poll is still returning success...that's bad.

The implementation of _MD_pr_poll for beos is slightly faulty. It doesn't properly catch errors returned from select (as beos doesn't have poll()). After fixing that, I'm still seeing the CPU being pegged. nsSocketTransport is missing code to deal with errors from PR_Poll. The comment indicates that this "should never happen" so I'm not quite sure how to deal with it yet.
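The select() error handling discussed in comment 21 can be sketched as follows. This is not the BeOS _MD_pr_poll code; it is a small POSIX illustration of the rule named in the later patch title ("Only ignore failed select() if errno == EINTR"), and the helper name wait_readable is hypothetical. An interrupted call is retried, while any other failure, such as EBADF for a descriptor that has gone away, is passed up to the caller instead of being treated as success.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable.
 * Returns the number of ready descriptors (0 on timeout), or -1 with
 * errno set for any failure other than EINTR. */
int wait_readable(int fd, long timeout_ms)
{
    for (;;) {
        fd_set rd;
        struct timeval tv;
        int n;

        /* select() may modify its arguments, so rebuild them each pass. */
        FD_ZERO(&rd);
        FD_SET(fd, &rd);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        n = select(fd + 1, &rd, NULL, NULL, &tv);
        if (n >= 0)
            return n;           /* ready, or timed out */
        if (errno != EINTR)
            return -1;          /* real error: report it, do not hide it */
        /* Interrupted by a signal: retry. A fuller version would also
         * shorten the timeout by the time already spent waiting. */
    }
}
```

With this shape, a caller polling a descriptor that was closed behind its back sees -1 and EBADF rather than an apparently successful poll.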

cls

Assignee

Comment 22

•

24 years ago

Attached patch: Only ignore failed select() if errno == EINTR

cls

Assignee

Comment 23

•

24 years ago

Attached patch: Attempt to recreate PollableEvent if PR_Poll fails

cls

Assignee

Comment 24

•

24 years ago

The last patch attempts to recreate the PollableEvent after PR_Poll fails. Because net_server doesn't restart instantly, the cpu will be pegged until PR_Poll stops failing....presumably after net_server comes back up. This usually takes 7-10 secs. Note: this hack only works if you are not currently loading a page (ie, nothing else is on mActiveTransportList) when net_server is restarted. I haven't quite figured out how to handle that condition.

cls

Assignee

Comment 25

•

24 years ago

> somehow we're calling PR_Poll with PR_POLL_WRITE when we shouldn't, or somehow
> we're losing the write request too early.

I think this may be caused by bug 65909 which states that beos' select implementation is incomplete and returns immediately if you attempt to check the write bits.

Darin Fisher

Comment 26

•

24 years ago

the NSPR patch looks correct to me. as for the sockettransportservice patch, it looks fine w/ the exception of some minor nits:

1) thread_event breaks naming convention, and seems a bit vague.. i think it means hadThreadEvent.. is that correct? and if so, why not call it this instead?

2) how about using C++ style comments?

otherwise r/sr=darin

Wan-Teh Chang

Comment 27

•

24 years ago

Why don't you create a new pollable event immediately after you destroy the old one? This way you can contain the hack in one place and omit the thread_event flag.

cls

Assignee

Comment 28

•

24 years ago

I don't attempt to recreate the pollable event immediately because resetting the ip stack (restarting net_server) isn't an atomic operation so there is a slight delay (7-10 secs) before it becomes available again. During that time, any attempts to recreate the pollable event will fail so we need to have the thread_event check anyway. I thought it'd be cleaner if we only attempted the recreation in a single place.
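The control flow argued for in comment 28 might look roughly like the sketch below. It is not the attached patch: loop_state, loop_iteration and fake_create are hypothetical names, and the real service loop does far more. It only shows a single recreation point guarded by a flag, so that creation attempts which fail while the IP stack is still down are simply retried on the next pass.

```c
#include <assert.h>

/* Hypothetical creation callback: returns 0 on success, -1 on failure. */
typedef int (*create_fn)(void *ctx);

typedef struct {
    int have_event;   /* plays the role of the thread_event flag */
    int attempts;     /* creation attempts made so far */
} loop_state;

/* One pass of a simplified service loop. */
void loop_iteration(loop_state *st, int poll_failed, create_fn create, void *ctx)
{
    if (poll_failed)
        st->have_event = 0;     /* the old event is presumed dead */

    /* The only place where recreation is attempted. */
    if (!st->have_event) {
        st->attempts++;
        if (create(ctx) == 0)
            st->have_event = 1;
    }
}

/* Demonstration creator: fails until the counter behind ctx reaches
 * zero, imitating a network stack that takes a while to come back. */
int fake_create(void *ctx)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 0;
}
```

Because the flag is checked on every pass anyway, destroying and recreating the event in two separate places would not remove the need for it, which is the point cls makes above.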

Wan-Teh Chang

Comment 29

•

24 years ago

When reviewing cls's NSPR patch I noticed two problems.

1. The first problem is in the existing code. In mozilla/nsprpub/pr/src/md/beos/bfile.c, we have:

    timeout -= PR_IntervalNow() - start;
    if(timeout <= 0) {
        /* timed out */
        n = 0;
    }

This code is wrong for two reasons. First, it is only correct the first time it is executed. The second time it is executed, it will subtract too much from 'timeout'. The second reason is that PRIntervalTime is an unsigned type and so 'timeout' will never be less than 0.

Here is one way to do this right.

    PRIntervalTime now, elapsed;

    now = PR_IntervalNow();
    elapsed = (PRIntervalTime) (now - start);
    if (elapsed < timeout) {
        timeout -= elapsed;
        start = now;
    } else {
        /* timed out */
        n = 0;
    }

This should be fixed.

2. In cls's patch, if _MD_pr_poll returns -1 (where it sets rc to -1 if n < 0), it does not call PR_SetError() to set the error codes. Moreover, in the PR_Poll interface, the EBADF error from select() should be returned by returning a positive value (indicating how many fd's are bad) and set PR_POLL_NVAL in the out_flags fields of the bad fd's. You can look at mozilla/nsprpub/pr/src/md/unix/uxpoll.c as an example.

This problem is less serious. I believe that it only makes it harder to diagnose a programming error. (I don't think a fd passed to select() will go bad by itself.) So depending on how motivated you are, you might want to just mark the code with a big "FIXME", perhaps with my comments above.

In summary, it is fine to check in the NSPR patch after adding the suggested "FIXME" and comments. However, a pre-existing problem in that file around the manipulation of 'timeout' should be fixed.
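The timeout arithmetic from comment 29 can be tried outside NSPR with the small sketch below. It assumes a 32-bit unsigned interval_t as a stand-in for PRIntervalTime, and the function names update_timeout and naive_timed_out are hypothetical. The first function follows the corrected approach quoted above; the second reproduces the criticised check to show why it does not fire when the deadline has been overshot.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-in for PRIntervalTime: an unsigned 32-bit tick count. */
typedef uint32_t interval_t;

/* Corrected bookkeeping: returns 1 and updates *timeout and *start if
 * time remains, or returns 0 if the timeout has expired. The unsigned
 * subtraction stays correct even if the tick counter wraps around. */
int update_timeout(interval_t *timeout, interval_t *start, interval_t now)
{
    interval_t elapsed = (interval_t)(now - *start);
    if (elapsed < *timeout) {
        *timeout -= elapsed;
        *start = now;   /* so the same interval is not subtracted twice */
        return 1;
    }
    return 0;           /* timed out */
}

/* The criticised pattern: subtract, then test for <= 0. With an
 * unsigned type the value wraps to a huge number instead of going
 * negative, so the test only succeeds when the result is exactly 0. */
int naive_timed_out(interval_t timeout, interval_t start, interval_t now)
{
    timeout -= now - start;
    return timeout <= 0;
}
```

For example, with a timeout of 100 ticks and 200 ticks elapsed, naive_timed_out reports that time remains, while update_timeout correctly reports expiry.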

cls

Assignee

Comment 30

•

24 years ago

The BeOS/BONE patch for NSPR has a fixed implementation of PR_Poll based upon the win32 one so it should have the timeout & errno fixes. This win32-based implementation works for the non-BONE ip stack as well so I'll just land that. Marking this bug fixed.

Status: NEW → RESOLVED

Closed: 24 years ago

Resolution: --- → FIXED

Paul

Comment 31

•

24 years ago

Ever since this patch was committed, it seems that mozilla has become a little "flaky" on beos. It will complain that a site cannot be found, and then, if you try again right afterwards, it works. Now, you may not notice this on a fast internet connection, but on a 28.8k connection like I have (damn crappy phone lines), it happens very, very often. Just thought I'd make a note of it.

cls

Assignee

Comment 32

•

24 years ago

So the natural follow-up question would be, does backing out the patch fix the flakiness?

cls

Assignee

Comment 33

•

24 years ago

*sigh* Now again for the proper party.....does backing out the patch fix the "flakiness"? I don't see how it would unless net_server is restarting whenever the "flakiness" occurs.

Paul

Comment 34

•

24 years ago

Well, the net_server usually only gets restarted on my machine when I reboot. I don't change my network settings very often. I will try to see if there is a difference, if I back out the patch. Plus, I just posted a build, so we can find out if other people are having the problem as well.

Peter Bylenga [:PBylenga]

Updated

•

11 years ago

Keywords: qawanted
