Closed Bug 309424 Opened 20 years ago Closed 20 years ago

Firefox 1.5beta1 freezes & goes into 95+% cpu usage browsing blackisha.com

Categories

(Core :: XML, defect)

defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla1.8beta5

People

(Reporter: mozilla, Assigned: darin.moz)

References

()

Details

(Keywords: fixed1.8, hang, regression)

Attachments

(3 files, 1 obsolete file)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b4) Gecko/20050908 Firefox/1.4 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b4) Gecko/20050908 Firefox/1.4 Goto blackisha.com & hover over the second comments link. it will come up with a list of comment titles, hover over a title to get the msg body. then move the mouse over to somewhere else in the page. Mozilla freezes & starts consuming cpu. Reproducible: Always Steps to Reproduce: 1. blackisha.com 2. hover over the second comments link (or a link with comments), a menu will popup 3.hover over a comment title, so that another div pops ip. 4. move the mouse to somewhere else of the page, it crashes at this point for me. Actual Results: crash & 95+% cpu usage. Expected Results: continue normally & the divs (ur <ul>'s, cant quite remember) should dissapear I will test in safe mode, but cant test yet as it will crash browser.
OK, I tried it in safe mode. same problem.
Summary: mozilla 1.5beta1 freezes & goes into 95+% cpu usage → mozilla 1.5beta1 freezes & goes into 95+% cpu usage browsing blackisha.com
Version: unspecified → 1.5 Branch
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20050920 Firefox/1.6a1 ID:2005092021 Sadly I can't get it to crash, but the site becomes unresponsive, the toolbars disappear and I have to force a shutdown of the browser.
Reproduced hang with 2005-09-18 branch build on OS X.
Severity: normal → critical
Keywords: hang
OS: Windows XP → All
Hardware: PC → All
Debugging on a several days old home-made trunk build, the hang seems to be in nsXMLHttpRequest::Send(), in the following loop, which appears to run forever: // If we're synchronous, spin an event loop here and wait if (!(mState & XML_HTTP_REQUEST_ASYNC)) { while (mState & XML_HTTP_REQUEST_SYNCLOOPING) { modalEventQueue->ProcessPendingEvents(); } Confirming.
Assignee: nobody → xml
Status: UNCONFIRMED → NEW
Component: General → XML
Ever confirmed: true
Product: Firefox → Core
QA Contact: general → ashshbhatt
Version: 1.5 Branch → Trunk
Can't reproduce on Firefox 1.0 => regression
Keywords: regression
SOrry, I should have said hang & cpu usage, not crash.
Since this is a regression and a hang, requesting blocking1.8b5. I have no idea how to make a testcase for this, and I'm also not sure what the correct component is.
Flags: blocking1.8b5?
(In reply to comment #4) > Debugging on a several days old home-made trunk build, the hang seems to be in > nsXMLHttpRequest::Send(), in the following loop, which appears to run forever: > > // If we're synchronous, spin an event loop here and wait > if (!(mState & XML_HTTP_REQUEST_ASYNC)) { > while (mState & XML_HTTP_REQUEST_SYNCLOOPING) { > modalEventQueue->ProcessPendingEvents(); > } > > Confirming. The site should switch to an async xmlhttprequest to stop the UI thread from not refreshing.
I can reproduce this with a 2005-09-22 Windows 1.8-branch build.
I noticed that this page appears to be sync loading via XMLHttpRequest from a mousemove handler. The sync load causes XMLHttpRequest to push a child event queue, which suppresses events from existing requests from being processed. I noticed in my log file that the same URL was being loaded twice when the hang happens. The second request for the URL is stuck waiting to access the cache while the first one is responsible for populating the cache. The first never has any of its events processed, however, which prevents it from ever writing to the cache, and thus we're in a dead locked situation where nothing can proceed. I'm not sure what is causing the event queue to spin so madly though.
Attached file HTTP log
Here's a HTTP log that shows two different XMLHttpRequest objects each attempting to sync load the same resource. Trouble occurs because of the modal event queues.
This stack trace shows recursive calls to XMLHttpRequest::Send. Matching those up to the HTTP log, we see that they are both for the same URL. The second never finishes because it is waiting for the first to complete, but the first never completes because its events are never processed. Yay! :-(
OK, the problem is that PL_ProcessPendingEvents does not actually process its events if called recursively. See: http://lxr.mozilla.org/mozilla/source/xpcom/threads/plevent.c#602 NOTE: It turns out that the statement I made above is not true: "The sync load causes XMLHttpRequest to push a child event queue, which suppresses events from existing requests from being processed." The elder event queue is processed, but because of the code pointed to above, it turns out that the events simply aren't processed. This explains why we loop madly through ProcessPendingEvents.
As a test, I commented out this code at the top of PL_ProcessPendingEvents: /* if (self->processingEvents) { _pl_AcknowledgeNativeNotify(self); self->notified = PR_FALSE; PR_ExitMonitor(self->monitor); return; } */ With that code removed, I can no longer reproduce this bug. I'm not sure that it is the correct change. Unfortunately, there is near zero documentation in plevent.c about the purpose of the processingEvents field and the cvs logs for this chunk of code are barren. It seems to me that our nested event queue implementation will only suppress elder event queues when the elder event queue's PL_ProcessPendingEvents call is on the stack. But, then one has to wonder why a nsEventQueue bothers to call ProcessPendingEvents on its mElderQueue at all... /Sigh/
i think this block of code was added so that we don't overflow the Windows native event queue with a bunch of notification MSG's. Yes, our stacked event queue impl. is not consistent, poorly documented, and unowned. Simon has recently has experience much pain because of this.
I just had a very similar problem in Camino with a PLEvent used by an nsProxyObject; PLEvent processing was already on the stack, which meant that the PLEvent for the proxied XPCOM call (it was an observer notification from nsKeygenThread) was never processed (bug 308700). I fixed this by pushing another event queue, but I'd rather this was fixed at a more general level.
Unfortunately, pushing another event queue won't help with this bug. We are already pushing an event queue per XMLHttpRequest::Send invocation.
Doesn't the new nsIEventQueue have a different PLEventQueue, though? So I'd expect that we would process events on the child queue ok. Also, is the regression range in comment 7 correct? I don't see any event queue or XMLHttpRequest checkins in there...
(In reply to comment #19) > Also, is the regression range in comment 7 correct? I don't see any event queue > or XMLHttpRequest checkins in there... I double-checked with Firefox builds (on a different machine) and I still see the regression between those two builds (except that the 2005-03-28 build is 2005-03-28-08, and not 2005-03-28-09 as I wrote). I also checked with Mozilla (suite) builds: Build ID 2005032810 works, Build ID 2005032909 hangs. Of course, it would be nice if someone else could confirm this as well.
(In reply to comment #19) > Doesn't the new nsIEventQueue have a different PLEventQueue, though? So I'd > expect that we would process events on the child queue ok. Yes, and we do. But, in this situation the second request (run by the child queue) will not finish until the first request (run by the elder queue) finishes. There's no way for the first request to finish if its events are never processed.
Hmm... Why can't the second request finish until the first does? Maybe I'm confused about the call order here... I thought we were calling send() again while waiting for a response to the first send(), which caused us to push a second nested event queue for the new send(). Is that not what's going on?
bz: See comment #11. The problem is that the second request is stuck waiting to access the cache entry that is being written to by the first request.
Ah, I see... So what we somehow need is to prevent all interaction with the page while that first sync XMLHttpRequest is in flight?
> So what we somehow need is to prevent all interaction with the page while that > first sync XMLHttpRequest is in flight? Right. The problem is the second mousemove event. I suspect that something changed between 1.0 and 1.5 that causes that second mousemove event to fire.
> I suspect that something > changed between 1.0 and 1.5 that causes that second mousemove event to fire. Bug 284664 has to do with mouse events, and was fixed in the regression window of this bug.
ROC, can you look into this for us?
Assignee: xml → roc
The problem is that the new behavior is basically correct. While a "sync" XMLHttpRequest is in flight, we are delivering events to the page, and that would include mouse move events. The bug roc fixed just fixed cases where we didn't delive mouse move events at all (independent of whether there was XMLHttpRequest involved). So the real issue is that our impl of "sync" XMLHttpRequest is really not very sync. This has caused problems before; this is just another example.
Darin, could we possibly switch XMLHttpRequest to just using open() for sync sends? Then we could perhaps deal on the HTTP end by throwing from an open() call if the cache entry is already in use....
(In reply to comment #29) > Darin, could we possibly switch XMLHttpRequest to just using open() for sync > sends? Then we could perhaps deal on the HTTP end by throwing from an open() > call if the cache entry is already in use.... That's an interesting idea. We could also make the second HTTP request simply bypass the cache if it cannot get exclusive access. Or, would it be better to throw?
I'd prefer bypassing the cache to throwing, yeah. Do we need to switch to open() to do this, or could we do it in today's asyncOpen() world? I assume the latter would affect far too many other things, since we would have no idea that the caller is evil?
(In reply to comment #31) > I'd prefer bypassing the cache to throwing, yeah. > > Do we need to switch to open() to do this, or could we do it in today's > asyncOpen() world? I assume the latter would affect far too many other things, > since we would have no idea that the caller is evil? Well, we could implement it as a load flag. DONT_BLOCK_ON_CACHE or something like that. I can't think of a good name for it right now.
That might be easiest and safest for 1.8, with possibly a followup bug to consider using open() for trunk?
Yeah, that's probably a reasonable solution for the impending release.
Darin, could you put that together? Let's call the flag BYPASS_CACHE_IF_LOCKED perhaps?
Yeah, I could do that. But, I'd like to think about this a bit more. I don't really like exposing details of the cache implementation like this if it can be helped. Hmm... -> me
Assignee: roc → darin
Target Milestone: --- → mozilla1.8beta5
Perhaps a better option would be to treat the current Window as modal when we are sync loading. Then it would not respond to events, and we would avoid this nasty re-entrancy problem. I think I've suggested this before, but I don't know what it would take exactly to simulate a modal dialog. Perhaps there is code in WindowWatcher::OpenWindowJS that we could rip.
Flags: blocking1.8b5? → blocking1.8b5+
Darin, how common / prevalent do you think this issue will be out in the wild?
> Darin, how common / prevalent do you think this issue will be out in the wild? I don't know. You'd hope that most people would know to use the asynchronous mode instead. However, I wouldn't be surprised if we hear about this causing compatibility problems with some legacy web apps developed against IE, which has a much better synchronous XMLHTTP implementation.
Flags: blocking1.8b5+ → blocking1.8b5-
Flags: blocking1.8rc1?
our inclination is to minus this bug for 1.5, we don't think this is a critical stop ship bug and we decided it wasn't important enough for beta 2. Care to build a case to the contrary Darin? Leaving the nomination alone for now.
This is a regression from Firefox 1.0; however, people can workaround the problem. I don't know how much trouble this will cause in the wild.
Flags: blocking1.8rc1? → blocking1.8rc1-
Attached patch v1 patch (obsolete) — Splinter Review
Here's a workaround that does the trick. I think this is a good fix for 1.5rc1.
Attachment #199374 - Flags: superreview?(jst)
Attachment #199374 - Flags: review?(bzbarsky)
This patch also implies that we can simply eliminate the pushing and popping of event queues in XMLHttpRequest, but I was trying to minimize the changes.
(In reply to comment #43) > This patch also implies that we can simply eliminate the pushing and popping of > event queues in XMLHttpRequest, but I was trying to minimize the changes. To add to that comments, even if we did away with the PushThreadEventQueue call in XMLHttpRequest, we'd still need to walk the event queue chain in order to deal with the fact that someone else might have pushed a modal event queue.
Hmm... So any events that got posted before the send() call will also get processed, basically?
> Hmm... So any events that got posted before the send() call will also get > processed, basically? Yes.
Does that change our behavior with regard to interaction with timeouts and intervals? Or are we firing those into the XMLHttpRequest queue anyway?
(In reply to comment #47) > Does that change our behavior with regard to interaction with timeouts and > intervals? Or are we firing those into the XMLHttpRequest queue anyway? Right, we fire those into the current event queue... same with input events. Given that the current event loop may end up executing arbitrary events, I see no reason not to process all event queues from this code.
Comment on attachment 199374 [details] [diff] [review] v1 patch Yeah, ok. Seems reasonable. I guess as long as we can't end up in an XMLHttpRequest that started _after_ a modal dialog that we really want to block necko, eg, came up, we're ok. That assumption isn't true for Linux Firefox, but I don't see what we can do about it here.
Attachment #199374 - Flags: review?(bzbarsky) → review+
> Yeah, ok. Seems reasonable. I guess as long as we can't end up in an > XMLHttpRequest that started _after_ a modal dialog that we really want to block > necko, eg, came up, we're ok. That assumption isn't true for Linux Firefox, > but I don't see what we can do about it here. Well... with this change, an async XMLHttpRequest might fire onload while a script is inside a call to alert(). Similarly, an async XMLHttpRequest might fire onload while a script is performing a sync XMLHttpRequest send.
I'm more worried about necko events being processed while something like the HTTP password prompt is up. I guess that can happen if the send() is done first, then in a different window the user tries to load something that uses basic auth, right? If it does happen, will HTTP deal?
> I'm more worried about necko events being processed while something like the > HTTP password prompt is up. I guess that can happen if the send() is done > first, then in a different window the user tries to load something that uses > basic auth, right? If it does happen, will HTTP deal? HTTP should be fine in that case because the same nsHttpChannel instance shouldn't be re-entered while it is making that call to PromptUsernameAndPassword.
So we have something preventing that other than the fact that the dialog is modal?
> So we have something preventing that other than the fact that the dialog is modal? No, not really.... I suppose it may be possible for OnDataAvailable and OnStopRequest events to be processed while we are inside the prompt call if the event queue is processed, and that might not be so good. The problem, I think, is that we do not suspend those events while making the prompt call. So, yeah... this patch might be trouble :-(
Comment on attachment 199374 [details] [diff] [review] v1 patch OK, this patch is no good. A cheap solution to this bug is to simply bypass the cache when making a sync load. I tried to avoid that solution because it introduces bad performance on the site in question, but maybe that's just going to have to be the solution for Gecko 1.8.
Attachment #199374 - Attachment is obsolete: true
Attached patch v2 patchSplinter Review
This patch implements what we came up with earlier: a load flag for causing the channel to bypass the local cache instead of waiting for a cache entry.
Attachment #199664 - Flags: superreview?(bzbarsky)
Attachment #199664 - Flags: review?(cbiesinger)
Attachment #199374 - Flags: superreview?(jst)
Blocks: 312534
Attachment #199664 - Flags: superreview?(bzbarsky) → superreview+
Comment on attachment 199664 [details] [diff] [review] v2 patch + * request. Unlike LOAD_BYPASS_CACHE, it does not force an end-to-end load + * (i.e., it does not affect proxy caches), and it does not prevent the + * channel from writing to the cache once it receives a response from the + * server. hmm... this comment sounds to me like LOAD_BYPASS_CACHE prevents writing to the cache, which isn't the case... r=biesi
Attachment #199664 - Flags: review?(cbiesinger) → review+
Attachment #199664 - Flags: approval1.8rc1?
> hmm... this comment sounds to me like LOAD_BYPASS_CACHE prevents writing to the > cache, which isn't the case... Good point. I'll clean that up before check-in.
Darin, can you tell us what kind of risk is involved in taking this and how many sites are going to be broken or developers prevented from doing good and obvious coding by thid problem? Also, this would need to land and be verified to work and not break anything on the trunk before we'd consider for the branch.
> Darin, can you tell us what kind of risk is involved in taking this and how > many sites are going to be broken or developers prevented from doing good and > obvious coding by thid problem? Asa: This bug is a regression. While websites can workaround this problem, I suspect that many will be shocked to find that they have to. The fix is very low-risk IMO, and I think we should definitely take it for this release. I'm landing this on the trunk today.
we'll evaluate again after this has been landed and verified on the trunk.
fixed-on-trunk
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
I still see issues using today's Windows trunk build Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20051021 Firefox/1.6a1. I load the page, and when I hover over the links in the comments section, I have difficult escaping out of that mode. At one point the browser page just became blank, and then eventually the browser comes back to life. I will have Tracy try this out too to see what he sees.
seemed to work ok for me in a cvs build and my nightly trunk build on winxp updated to today. I couldn't reproduce the issues Marcia is seeing.
Marcia, Yeah that is working as expected. The good part is that your browser did eventually come back to life. Without the patch in this bug, it would never come back to life. Fixing the other part of the bug is well beyond the scope of FF 1.5.
Verifying based on Darin's comment using Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20051021 Firefox/1.6a1. I checked CPU usage while the page was loading and there are no issues there, as well as no crash. Marking verified.
Status: RESOLVED → VERIFIED
Also looks good on Linux trunk build from 1021
Attachment #199664 - Flags: approval1.8rc1? → approval1.8rc1+
fixed1.8
Keywords: fixed1.8
Summary: mozilla 1.5beta1 freezes & goes into 95+% cpu usage browsing blackisha.com → Firefox 1.5beta1 freezes & goes into 95+% cpu usage browsing blackisha.com
*** Bug 300631 has been marked as a duplicate of this bug. ***
Just a note that I can still reproduce this with 1.5 (Gecko/20051111 Firefox/1.5) on OSX -- talkback http://talkback-public.mozilla.org/talkback/fastfind.jsp?search=2&type=iid&id=12086427 , I had to SEGV the process manually to get tb to kick in. I only left it running for about 10-15 minutes, but in that time it never came back. I can also trivially reproduce this using an AJAX "skin" for the web interface for my SlimServer (music server thingy, http://www.slimdevices.com/su_downloads.html), which seems to use sync xmlhttprequest calls throughout =/
Vlad: I think it is a different problem on OSX. My FF gets hung as well sometimes -- not always -- and when it does get hung, it is not consuming the CPU. So, I think you found a different bug.
Vlad: In my testing, I found that if I waited long enough, my browser would always become unfrozen. With some of the very large comment blocks on blackisha.com, it would sometimes take several minutes to fetch the data. Also, if you happened to trigger multiple mouseover events on that site, then you would trigger multiple synchronous fetchs of the data, which would surely increase the time it takes to unblock the browser (note: the disk cache doesn't help). I filed bug 317596 for this issue.
Blocks: 757042
No longer blocks: 757042
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: