Closed Bug 146884 Opened 23 years ago Closed 23 years ago

Mozilla sometimes consumes 100% CPU when pipelining is enabled...

Categories

(Core :: Networking, defect, P1)

x86
All
defect

Tracking

()

VERIFIED FIXED
mozilla1.0.1

People

(Reporter: mozilla-bugs, Assigned: darin.moz)

References

Details

(Keywords: hang, regression, topembed+, Whiteboard: [pipelining][adt2 RTM][verified-trunk])

Attachments

(1 file, 2 obsolete files)

I've seen this happen twice when doing essentially the same thing in both cases (in a secure banking site followed a link while some frame was still saing "Please wait"). Mozilla would hang, one thread would start using all the CPU time and the windows would not even redraw (and, of course, would not react to anything I might do). I attached gdb to Mozilla when it happened the second time, and let it run, interrupting it several times and getting the stack trace for the thread eating all the CPU. Here are the stacks I saw: Thread 3 (Thread 1026 (LWP 12749)): #0 0x4017065e in nsPipe::AddRef () from /usr/lib/libxpcom.so #1 0x401714f6 in nsPipe::nsPipeOutputStream::AddRef () from /usr/lib/libxpcom.so #2 0x40171630 in nsPipe::nsPipeOutputStream::WriteSegments () from /usr/lib/libxpcom.so #3 0x40171b1e in nsPipe::nsPipeOutputStream::WriteFrom () from /usr/lib/libxpcom.so #4 0x407ceec9 in nsHttpTransaction::OnDataWritable () from /usr/lib/mozilla/components/libnecko.so #5 0x407e0448 in nsHttpPipeline::OnDataWritable () from /usr/lib/mozilla/components/libnecko.so #6 0x407ce3a1 in nsHttpConnection::OnDataWritable () from /usr/lib/mozilla/components/libnecko.so #7 0x4079cb11 in nsSocketWriteRequest::OnWrite () from /usr/lib/mozilla/components/libnecko.so #8 0x407994a3 in nsSocketTransport::doReadWrite () from /usr/lib/mozilla/components/libnecko.so #9 0x4079857d in nsSocketTransport::Process () from /usr/lib/mozilla/components/libnecko.so #10 0x4079d538 in nsSocketTransportService::Run () from /usr/lib/mozilla/components/libnecko.so #11 0x4018d1b3 in nsThread::Main () from /usr/lib/libxpcom.so #12 0x4020f69c in _pt_root () from /usr/lib/libnspr4.so #13 0x40221c6f in pthread_start_thread (arg=0x40b16be0) at manager.c:284 Thread 3 (Thread 1026 (LWP 12749)): #0 0x40225b50 in __pthread_unlock (lock=0x447fbd0c) at spinlock.c:260 #1 0x4020a037 in PR_Unlock () from /usr/lib/libnspr4.so #2 0x4020a5e1 in PR_ExitMonitor () from /usr/lib/libnspr4.so #3 0x4018afbf in nsAutoMonitor::Exit () from /usr/lib/libxpcom.so #4 0x40171773 in nsPipe::nsPipeOutputStream::WriteSegments () from /usr/lib/libxpcom.so #5 0x40171b1e in nsPipe::nsPipeOutputStream::WriteFrom () from /usr/lib/libxpcom.so #6 0x407ceec9 in nsHttpTransaction::OnDataWritable () from /usr/lib/mozilla/components/libnecko.so #7 0x407e0448 in nsHttpPipeline::OnDataWritable () from /usr/lib/mozilla/components/libnecko.so #8 0x407ce3a1 in nsHttpConnection::OnDataWritable () from /usr/lib/mozilla/components/libnecko.so #9 0x4079cb11 in nsSocketWriteRequest::OnWrite () from /usr/lib/mozilla/components/libnecko.so #10 0x407994a3 in nsSocketTransport::doReadWrite () from /usr/lib/mozilla/components/libnecko.so #11 0x4079857d in nsSocketTransport::Process () from /usr/lib/mozilla/components/libnecko.so #12 0x4079d538 in nsSocketTransportService::Run () from /usr/lib/mozilla/components/libnecko.so #13 0x4018d1b3 in nsThread::Main () from /usr/lib/libxpcom.so #14 0x4020f69c in _pt_root () from /usr/lib/libnspr4.so #15 0x40221c6f in pthread_start_thread (arg=0x40b16be0) at manager.c:284 #0 0x40223e59 in pthread_self () at pthread.c:701 #1 0x40209fc8 in PR_Lock () from /usr/lib/libnspr4.so #2 0x4020a57e in PR_EnterMonitor () from /usr/lib/libnspr4.so #3 0x40171659 in nsPipe::nsPipeOutputStream::WriteSegments () from /usr/lib/libxpcom.so #4 0x40171b1e in nsPipe::nsPipeOutputStream::WriteFrom () from /usr/lib/libxpcom.so #5 0x407ceec9 in nsHttpTransaction::OnDataWritable () from /usr/lib/mozilla/components/libnecko.so #6 0x407e0448 in nsHttpPipeline::OnDataWritable () from /usr/lib/mozilla/components/libnecko.so #7 0x407ce3a1 in nsHttpConnection::OnDataWritable () from /usr/lib/mozilla/components/libnecko.so #8 0x4079cb11 in nsSocketWriteRequest::OnWrite () from /usr/lib/mozilla/components/libnecko.so #9 0x407994a3 in nsSocketTransport::doReadWrite () from /usr/lib/mozilla/components/libnecko.so #10 0x4079857d in nsSocketTransport::Process () from /usr/lib/mozilla/components/libnecko.so #11 0x4079d538 in nsSocketTransportService::Run () from /usr/lib/mozilla/components/libnecko.so #12 0x4018d1b3 in nsThread::Main () from /usr/lib/libxpcom.so #13 0x4020f69c in _pt_root () from /usr/lib/libnspr4.so #14 0x40221c6f in pthread_start_thread (arg=0x40b16be0) at manager.c:284 Thread 3 (Thread 1026 (LWP 12749)): #0 0x40225a9b in __pthread_lock (lock=0x447fbd1c, self=0x0) at ../linuxthreads/sysdeps/i386/i686/pt-machine.h:57 #1 0x40222d0e in __pthread_mutex_lock (mutex=0x447fbd0c) at mutex.c:101 #2 0x40209fbc in PR_Lock () from /usr/lib/libnspr4.so #3 0x4020a57e in PR_EnterMonitor () from /usr/lib/libnspr4.so #4 0x40171659 in nsPipe::nsPipeOutputStream::WriteSegments () from /usr/lib/libxpcom.so #5 0x40171b1e in nsPipe::nsPipeOutputStream::WriteFrom () from /usr/lib/libxpcom.so #6 0x407ceec9 in nsHttpTransaction::OnDataWritable () from /usr/lib/mozilla/components/libnecko.so #7 0x407e0448 in nsHttpPipeline::OnDataWritable () from /usr/lib/mozilla/components/libnecko.so #8 0x407ce3a1 in nsHttpConnection::OnDataWritable () from /usr/lib/mozilla/components/libnecko.so #9 0x4079cb11 in nsSocketWriteRequest::OnWrite () from /usr/lib/mozilla/components/libnecko.so #10 0x407994a3 in nsSocketTransport::doReadWrite () from /usr/lib/mozilla/components/libnecko.so #11 0x4079857d in nsSocketTransport::Process () from /usr/lib/mozilla/components/libnecko.so #12 0x4079d538 in nsSocketTransportService::Run () from /usr/lib/mozilla/components/libnecko.so #13 0x4018d1b3 in nsThread::Main () from /usr/lib/libxpcom.so #14 0x4020f69c in _pt_root () from /usr/lib/libnspr4.so #15 0x40221c6f in pthread_start_thread (arg=0x40b16be0) at manager.c:284
Keywords: hang
P.S. BuildID 2002052314 (trunk) on RedHat Linux 7.2
maybe darin can look at this.
Assignee: dougt → darin
reporter: looks like you have HTTP pipelining enabled. does this ever happen when pipelining is disabled?
Status: NEW → ASSIGNED
Priority: -- → P1
Whiteboard: [pipelining]
Target Milestone: --- → mozilla1.0.1
Yes, I probably do have it enabled. I am not sure what happens if it is not enabled - the only way of reproducing the bug that I know of involves messing with my internet banking in some weird ways, which is something I'd rather not do.
aleksey: have you encountered this bug with recent nightly trunk builds? starting with 1.1alpha?
downgrading for now since i don't really have enough information to do anything about this bug and it doesn't seem like it is happening that often or to many folks. it's also possible that one or more of the recent pipelining bug fixes also fixed this.
Severity: critical → major
Target Milestone: mozilla1.0.1 → ---
I can still reproduce this using BuildID 2002061717 on RedHat Linux 7.2 It now looks somewhat different. 1) In the offending window I see the "Please wait" message (that would normally be replaced with a message telling me the transaction is successful) and *no indicators* of network activity - both the "M" icon and status/progress bar look as if the page have finished loading. 2) Mozilla does not hang, but goes into a state where any further network activity (I tried HTTP and IMAP) is impossible: - I tried to view a message (from IMAP) - the progress indicators looked as if it started loading the message, but nothing ever happens. - Next I tried to middle-click on a link (e.g. "open link in the new window"). I did get a new window with all the progress indicators being "alive", but again, nothing was ever loaded. As before, gdb showed that the thread consuming most of the CPU was in WriteSegments function all the time. When I tried to quit Mozilla, it hanged for good.
aleksey: wow! ok are there any steps i might be able to use try to reproduce the problem? are you able to reproduce it readily simply by visiting a particular site? are you connecting to the internet using a proxy server? if so, do you happen to know what kind of proxy server? thx!
I use squid on localhost, but only for http/ftp, while the site in question is https. I can reproduce this simply by trying to transfer money between savings and checking on a particular online banking site (small local bank). BTW, the reason it looked differently yesterday may be simply because I did something slightly different - when I saw "Please wait" I didn't try to follow a link in the offending window, but went to another window (trying to read mail). Possibly it hangs only when I attempt to do something in the offending window (including trying to close it) and if I leave that window alone, it blocks network activity in other windows. Yeah, I wish I had a test case that didn't involve messing with my bank account...
I'm also seeing this bug with current trunk builds on win2k. No proxy. Steps to reproduce (sometimes) 1. Go to http://www.anandtech.com 2. Click on one of the articles in the main column 3. Browse through the article 4. Try & stop some of the pages in the article before they load completely I was able to reproduce this bug 2/3 of the time following the steps above.
OS: Linux → All
walk84: what kind of internet connection do you have? modem? gagan, dave: this is that pipelining bug that i mentioned to you today.
Keywords: topembed
darin, yes. 56k.
Blocks: 144480
No longer blocks: 144480
*** Bug 153772 has been marked as a duplicate of this bug. ***
just saw this at http://oneandone.co.uk/xml/init?k_id=3978017 , with trunk 2002070604 on win2k. pipelining enabled, using 64k ISDN. I also reproduced it 2 times out of 2 with the steps in comment 10 (in fact, i didn't have to fiddle with anything, it happened while loading an article). disabled pipelining and it worked 2 times out of 2.
Keywords: mozilla1.1
*** Bug 154373 has been marked as a duplicate of this bug. ***
*** Bug 152643 has been marked as a duplicate of this bug. ***
*** Bug 158473 has been marked as a duplicate of this bug. ***
*** Bug 156142 has been marked as a duplicate of this bug. ***
*** Bug 159212 has been marked as a duplicate of this bug. ***
*** Bug 159327 has been marked as a duplicate of this bug. ***
Target Milestone: --- → mozilla1.1beta
Attached patch v1 patch (obsolete) — Splinter Review
i was able to consistently reproduce this bug on http://www.battle.net/war3/nightelf (from one of the dupes), and the problem turned out to be in the socket transport. we simply weren't clearing the socket error state before calling OnDataAvailable. in some cases, nsHttpPipeline::OnDataReadable would return NS_OK without ever trying to read from the socket. as a result, the socket transport would sit in a loop trying to call OnDataAvailable because it erroneously believed that the socket read resulted in a WOULD_BLOCK error despite the fact that read was never called on the socket. whew! anyways, this simple patch fixes the problem. however, it doesn't seem to address the stack trace initially reported. i'm still thinking about that one.
fixing the component and summary
Component: XPCOM → Networking
Summary: Mozilla hangs in nsPipe::nsPipeOutputStream::WriteSegments with 100% CPU usage... → Mozilla sometimes consumes 100% CPU when pipelining is enabled...
QA -> tever
QA Contact: scc → tever
Not sure about the new summary - for me it indeed used to hang Mozilla or at least make all feature network activity in all windows (including Mail) impossible.
aleksey: hmm... have you seen that problem recently?
No, but I have not tried reproducing it recently. I might try once your patch goes in.
Site mentioned in comment 21 will completely hang my computer (not just Mozilla but the OS) for 20-30 seconds - although it does still buffer some keystrokes for later processsing. It does this about once every 5 minutes, starting about 5 minutes after loading the site. 100% reproducible for me. (Without pipelining everything is fine.)
Aleksey: sounds to me as if the problem you were seeing might be explained by this bit of code inside nsHttpPipeline::OnDataWritable(): while (1) { rv = mTransactionQ[i]->OnDataWritable(outputStream); if (rv == NS_BASE_STREAM_CLOSED) break; // advance to next transaction if (NS_FAILED(rv)) return rv; // something bad happened!! // else, there's more to write (the transaction may be // writing in small chunks). } if mTransactionQ[i]->OnDataWritable(...) returned NS_OK without actually writing anything to outputStream, then we could get stuck in this loop. moreover, this loop runs on the socket transport thread (a background thread), meaning that the UI would still function, but you'd be unable to use the network (as you mentioned). i'm thinking of adding an additional check to this loop to verify that data was actually written to the stream. if no data is written and NS_OK is returned, then i'll assume the transaction is done writing and advance to the next transaction. that should make this bit of code more robust, but it does seem like a bit of a hack. nsHttpTransaction::OnDataWritable is designed to return NS_BASE_STREAM_CLOSED when it cannot write any data. for some reason, it is not doing that.
Attached patch v2 patch (obsolete) — Splinter Review
ok, this patch combines both fixes.
Attachment #92856 - Attachment is obsolete: true
Comment on attachment 92856 [details] [diff] [review] v1 patch nit: In the comment: not GetWouldBlock(), but GotWouldBlock. other than that, seams reasonable. r=dougt
Attachment #92856 - Flags: review+
Comment on attachment 92941 [details] [diff] [review] v2 patch r=dougt
Attachment #92941 - Flags: review+
Attachment #92941 - Attachment is obsolete: true
Comment on attachment 92946 [details] [diff] [review] v2.1 patch - revised spelling mistake forwarding r=dougt
Attachment #92946 - Flags: review+
Comment on attachment 92946 [details] [diff] [review] v2.1 patch - revised spelling mistake sr=rpotts@netscape.com
Attachment #92946 - Flags: superreview+
Comment on attachment 92946 [details] [diff] [review] v2.1 patch - revised spelling mistake a=asa (on behalf of drivers) for checkin to 1.1
Attachment #92946 - Flags: approval+
fixed-on-trunk
Whiteboard: [pipelining] → [pipelining][adt2 RTM]
-> moz 1.0.1
Target Milestone: mozilla1.1beta → mozilla1.0.1
*** Bug 156405 has been marked as a duplicate of this bug. ***
resolving FIXED to get QA's attention.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Darin, aren't you looking for the fixed1.0.1 keyword?
*** Bug 143131 has been marked as a duplicate of this bug. ***
Could everyone who is able to reproduce this check that it is fixed in the nightly and add their comments? I've seen a few of the 100% cpu hangs but they usually occur after I leave my machine running all night (with mail open I think). I haven't been able come up with a reliable test case. All of the sites in the dupes work for me. Thanks Anthony, the fixed1.0.1 is used to signify that the fix is in the branch and ready for testing.
It's not happened for me since this was checked in. Bug 160222 was reported today, but that was with an older build. Seems to me it was clobbered fine.
I've not experience it as well since it was checked in to the trunk.
*** Bug 153772 has been marked as a duplicate of this bug. ***
OK thanks. I haven't seen this for a few days either so marking as verified.
Status: RESOLVED → VERIFIED
Whiteboard: [pipelining][adt2 RTM] → [pipelining][adt2 RTM][verified-trunk]
adt1.0.1- per the ADT. let's get it in the next release. should you disagree with the minus, pls remove the minus, leaving adt1.0.1 in the keywords, then send an email to Drivers' and ADT for approval.
Keywords: adt1.0.1adt1.0.1-
*** Bug 160222 has been marked as a duplicate of this bug. ***
I really disagree with ADT's decision.. This fix should be included in the next Netscape release. Just look at all the dupes.
The should be only adt+ if NS has also the preference to enable pipelining..
please checkin to the MOZILLA_1_0_BRANCH branch. once there, remove the "mozilla1.0.1+" keyword and add the "fixed1.0.1" keyword.
Adding topembed+ per email from Jud Valeski. Removing minus from adt1.0.1, and renominating this one for 1.0.1, as it is needed by a major embeddor.
Adding adt1.0.1+ on behalf of the adt for checkin to the 1.0 branch. When you check this into the branch, please change the mozilla1.0.1+ keyword to fixed1.0.1
Keywords: adt1.0.1adt1.0.1+
fixed on the 1.0 branch
tever: tom, can you pls verify this as fixed on the 1.0 branch, then replace "fixed1.0.1" with "verified1.0.1"? thanks!
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: