Closed Bug 309424 Opened 19 years ago Closed 19 years ago

Firefox 1.5beta1 freezes & goes into 95+% cpu usage browsing blackisha.com

Categories

(Core :: XML, defect)

defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla1.8beta5

People

(Reporter: mozilla, Assigned: darin.moz)

References

()

Details

(Keywords: fixed1.8, hang, regression)

Attachments

(3 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b4) Gecko/20050908 Firefox/1.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b4) Gecko/20050908 Firefox/1.4

Goto blackisha.com & hover over the second comments link. it will come up with a
list of comment titles, hover over a title to get the msg body. then move the
mouse over to somewhere else in the page. Mozilla freezes & starts consuming cpu.

Reproducible: Always

Steps to Reproduce:
1. blackisha.com
2. hover over the second comments link (or a link with comments), a menu will popup
3.hover over a comment title, so that another div pops ip.
4. move the mouse to somewhere else of the page, it crashes at this point for me.

Actual Results:  
crash & 95+% cpu usage.

Expected Results:  
continue normally & the divs (ur <ul>'s, cant quite remember) should dissapear

I will test in safe mode, but cant test yet as it will crash browser.
OK, I tried it in safe mode. same problem.
Summary: mozilla 1.5beta1 freezes & goes into 95+% cpu usage → mozilla 1.5beta1 freezes & goes into 95+% cpu usage browsing blackisha.com
Version: unspecified → 1.5 Branch
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20050920
Firefox/1.6a1 ID:2005092021

Sadly I can't get it to crash, but the site becomes unresponsive, the toolbars
disappear and I have to force a shutdown of the browser.
Reproduced hang with 2005-09-18 branch build on OS X.
Severity: normal → critical
Keywords: hang
OS: Windows XP → All
Hardware: PC → All
Debugging on a several days old home-made trunk build, the hang seems to be in
nsXMLHttpRequest::Send(), in the following loop, which appears to run forever:

  // If we're synchronous, spin an event loop here and wait
  if (!(mState & XML_HTTP_REQUEST_ASYNC)) {
    while (mState & XML_HTTP_REQUEST_SYNCLOOPING) {
      modalEventQueue->ProcessPendingEvents();
    }

Confirming.
Assignee: nobody → xml
Status: UNCONFIRMED → NEW
Component: General → XML
Ever confirmed: true
Product: Firefox → Core
QA Contact: general → ashshbhatt
Version: 1.5 Branch → Trunk
Can't reproduce on Firefox 1.0 => regression
Keywords: regression
SOrry, I should have said hang & cpu usage, not crash.
Since this is a regression and a hang, requesting blocking1.8b5. I have no idea
how to make a testcase for this, and I'm also not sure what the correct
component is.
Flags: blocking1.8b5?
(In reply to comment #4)
> Debugging on a several days old home-made trunk build, the hang seems to be in
> nsXMLHttpRequest::Send(), in the following loop, which appears to run forever:
> 
>   // If we're synchronous, spin an event loop here and wait
>   if (!(mState & XML_HTTP_REQUEST_ASYNC)) {
>     while (mState & XML_HTTP_REQUEST_SYNCLOOPING) {
>       modalEventQueue->ProcessPendingEvents();
>     }
> 
> Confirming.

The site should switch to an async xmlhttprequest to stop the UI thread from not
refreshing.
I can reproduce this with a 2005-09-22 Windows 1.8-branch build.
I noticed that this page appears to be sync loading via XMLHttpRequest from a
mousemove handler.  The sync load causes XMLHttpRequest to push a child event
queue, which suppresses events from existing requests from being processed.  I
noticed in my log file that the same URL was being loaded twice when the hang
happens.  The second request for the URL is stuck waiting to access the cache
while the first one is responsible for populating the cache.  The first never
has any of its events processed, however, which prevents it from ever writing to
the cache, and thus we're in a dead locked situation where nothing can proceed.
 I'm not sure what is causing the event queue to spin so madly though.
Attached file HTTP log
Here's a HTTP log that shows two different XMLHttpRequest objects each
attempting to sync load the same resource.  Trouble occurs because of the modal
event queues.
This stack trace shows recursive calls to XMLHttpRequest::Send.  Matching those
up to the HTTP log, we see that they are both for the same URL.  The second
never finishes because it is waiting for the first to complete, but the first
never completes because its events are never processed.  Yay! :-(
OK, the problem is that PL_ProcessPendingEvents does not actually process its
events if called recursively.  See:
http://lxr.mozilla.org/mozilla/source/xpcom/threads/plevent.c#602

NOTE:  It turns out that the statement I made above is not true:

  "The sync load causes XMLHttpRequest to push a child event queue, which 
   suppresses events from existing requests from being processed."

The elder event queue is processed, but because of the code pointed to above, it
turns out that the events simply aren't processed.

This explains why we loop madly through ProcessPendingEvents.
As a test, I commented out this code at the top of PL_ProcessPendingEvents:

 /* if (self->processingEvents) {
        _pl_AcknowledgeNativeNotify(self);
        self->notified = PR_FALSE;
        PR_ExitMonitor(self->monitor);
        return;
    } */

With that code removed, I can no longer reproduce this bug.  I'm not sure that
it is the correct change.  Unfortunately, there is near zero documentation in
plevent.c about the purpose of the processingEvents field and the cvs logs for
this chunk of code are barren.

It seems to me that our nested event queue implementation will only suppress
elder event queues when the elder event queue's PL_ProcessPendingEvents call is
on the stack.  But, then one has to wonder why a nsEventQueue bothers to call
ProcessPendingEvents on its mElderQueue at all...  /Sigh/
i think this block of code was added so that we don't overflow the Windows
native event queue with a bunch of notification MSG's.  

Yes, our stacked event queue impl. is not consistent, poorly documented, and
unowned.  Simon has recently has experience much pain because of this.
I just had a very similar problem in Camino with a PLEvent used by an
nsProxyObject; PLEvent processing was already on the stack, which meant that the
PLEvent for the proxied XPCOM call (it was an observer notification from
nsKeygenThread) was never processed (bug 308700). I fixed this by pushing
another event queue, but I'd rather this was fixed at a more general level.
Unfortunately, pushing another event queue won't help with this bug.  We are
already pushing an event queue per XMLHttpRequest::Send invocation.
Doesn't the new nsIEventQueue have a different PLEventQueue, though?  So I'd
expect that we would process events on the child queue ok.

Also, is the regression range in comment 7 correct?  I don't see any event queue
or XMLHttpRequest checkins in there...
(In reply to comment #19)
> Also, is the regression range in comment 7 correct?  I don't see any event queue
> or XMLHttpRequest checkins in there...

I double-checked with Firefox builds (on a different machine) and I still see
the regression between those two builds (except that the 2005-03-28 build is
2005-03-28-08, and not 2005-03-28-09 as I wrote).
I also checked with Mozilla (suite) builds: Build ID 2005032810 works, Build ID
2005032909 hangs.

Of course, it would be nice if someone else could confirm this as well.
(In reply to comment #19)
> Doesn't the new nsIEventQueue have a different PLEventQueue, though?  So I'd
> expect that we would process events on the child queue ok.

Yes, and we do.  But, in this situation the second request (run by the child
queue) will not finish until the first request (run by the elder queue)
finishes.  There's no way for the first request to finish if its events are
never processed.
Hmm... Why can't the second request finish until the first does?  Maybe I'm
confused about the call order here... I thought we were calling send() again
while waiting for a response to the first send(), which caused us to push a
second nested event queue for the new send().  Is that not what's going on?
bz: See comment #11.  The problem is that the second request is stuck waiting to
access the cache entry that is being written to by the first request.
Ah, I see...

So what we somehow need is to prevent all interaction with the page while that
first sync XMLHttpRequest is in flight?
> So what we somehow need is to prevent all interaction with the page while that
> first sync XMLHttpRequest is in flight?

Right.  The problem is the second mousemove event.  I suspect that something
changed between 1.0 and 1.5 that causes that second mousemove event to fire.
> I suspect that something
> changed between 1.0 and 1.5 that causes that second mousemove event to fire.

Bug 284664 has to do with mouse events, and was fixed in the regression window
of this bug.
ROC, can you look into this for us?
Assignee: xml → roc
The problem is that the new behavior is basically correct.  While a "sync"
XMLHttpRequest is in flight, we are delivering events to the page, and that
would include mouse move events.  The bug roc fixed just fixed cases where we
didn't delive mouse move events at all (independent of whether there was
XMLHttpRequest involved).

So the real issue is that our impl of "sync" XMLHttpRequest is really not very
sync.  This has caused problems before; this is just another example.
Darin, could we possibly switch XMLHttpRequest to just using open() for sync
sends?  Then we could perhaps deal on the HTTP end by throwing from an open()
call if the cache entry is already in use....
(In reply to comment #29)
> Darin, could we possibly switch XMLHttpRequest to just using open() for sync
> sends?  Then we could perhaps deal on the HTTP end by throwing from an open()
> call if the cache entry is already in use....

That's an interesting idea.  We could also make the second HTTP request simply
bypass the cache if it cannot get exclusive access.  Or, would it be better to
throw?
I'd prefer bypassing the cache to throwing, yeah.

Do we need to switch to open() to do this, or could we do it in today's
asyncOpen() world?  I assume the latter would affect far too many other things,
since we would have no idea that the caller is evil?
(In reply to comment #31)
> I'd prefer bypassing the cache to throwing, yeah.
> 
> Do we need to switch to open() to do this, or could we do it in today's
> asyncOpen() world?  I assume the latter would affect far too many other things,
> since we would have no idea that the caller is evil?

Well, we could implement it as a load flag.  DONT_BLOCK_ON_CACHE or something
like that.  I can't think of a good name for it right now.
That might be easiest and safest for 1.8, with possibly a followup bug to
consider using open() for trunk?
Yeah, that's probably a reasonable solution for the impending release.
Darin, could you put that together?  Let's call the flag BYPASS_CACHE_IF_LOCKED
perhaps?
Yeah, I could do that.  But, I'd like to think about this a bit more.  I don't
really like exposing details of the cache implementation like this if it can be
helped.  Hmm...

-> me
Assignee: roc → darin
Target Milestone: --- → mozilla1.8beta5
Perhaps a better option would be to treat the current Window as modal when we
are sync loading.  Then it would not respond to events, and we would avoid this
nasty re-entrancy problem.  I think I've suggested this before, but I don't know
what it would take exactly to simulate a modal dialog.  Perhaps there is code in
WindowWatcher::OpenWindowJS that we could rip.
Flags: blocking1.8b5? → blocking1.8b5+
Darin, how common / prevalent do you think this issue will be out in the wild? 

> Darin, how common / prevalent do you think this issue will be out in the wild? 

I don't know.  You'd hope that most people would know to use the asynchronous
mode instead.  However, I wouldn't be surprised if we hear about this causing
compatibility problems with some legacy web apps developed against IE, which has
a much better synchronous XMLHTTP implementation. 
Flags: blocking1.8b5+ → blocking1.8b5-
Flags: blocking1.8rc1?
our inclination is to minus this bug for 1.5, we don't think this is a critical
stop ship bug and we decided it wasn't important enough for beta 2. Care to
build a case to the contrary Darin? Leaving the nomination alone for now. 
This is a regression from Firefox 1.0; however, people can workaround the
problem.   I don't know how much trouble this will cause in the wild.
Flags: blocking1.8rc1? → blocking1.8rc1-
Attached patch v1 patch (obsolete) — Splinter Review
Here's a workaround that does the trick.  I think this is a good fix for
1.5rc1.
Attachment #199374 - Flags: superreview?(jst)
Attachment #199374 - Flags: review?(bzbarsky)
This patch also implies that we can simply eliminate the pushing and popping of
event queues in XMLHttpRequest, but I was trying to minimize the changes.
(In reply to comment #43)
> This patch also implies that we can simply eliminate the pushing and popping of
> event queues in XMLHttpRequest, but I was trying to minimize the changes.

To add to that comments, even if we did away with the PushThreadEventQueue call
in XMLHttpRequest, we'd still need to walk the event queue chain in order to
deal with the fact that someone else might have pushed a modal event queue.
Hmm...  So any events that got posted before the send() call will also get
processed, basically?
> Hmm...  So any events that got posted before the send() call will also get
> processed, basically?

Yes.
Does that change our behavior with regard to interaction with timeouts and
intervals?  Or are we firing those into the XMLHttpRequest queue anyway?
(In reply to comment #47)
> Does that change our behavior with regard to interaction with timeouts and
> intervals?  Or are we firing those into the XMLHttpRequest queue anyway?

Right, we fire those into the current event queue... same with input events. 
Given that the current event loop may end up executing arbitrary events, I see
no reason not to process all event queues from this code.
Comment on attachment 199374 [details] [diff] [review]
v1 patch

Yeah, ok.  Seems reasonable.  I guess as long as we can't end up in an
XMLHttpRequest that started _after_ a modal dialog that we really want to block
necko, eg, came up, we're ok.  That assumption isn't true for Linux Firefox,
but I don't see what we can do about it here.
Attachment #199374 - Flags: review?(bzbarsky) → review+
> Yeah, ok.  Seems reasonable.  I guess as long as we can't end up in an
> XMLHttpRequest that started _after_ a modal dialog that we really want to block
> necko, eg, came up, we're ok.  That assumption isn't true for Linux Firefox,
> but I don't see what we can do about it here.

Well... with this change, an async XMLHttpRequest might fire onload while a
script is inside a call to alert().  Similarly, an async XMLHttpRequest might
fire onload while a script is performing a sync XMLHttpRequest send.
I'm more worried about necko events being processed while something like the
HTTP password prompt is up.  I guess that can happen if the send() is done
first, then in a different window the user tries to load something that uses
basic auth, right?  If it does happen, will HTTP deal?
> I'm more worried about necko events being processed while something like the
> HTTP password prompt is up.  I guess that can happen if the send() is done
> first, then in a different window the user tries to load something that uses
> basic auth, right?  If it does happen, will HTTP deal?

HTTP should be fine in that case because the same nsHttpChannel instance
shouldn't be re-entered while it is making that call to PromptUsernameAndPassword.
So we have something preventing that other than the fact that the dialog is modal?
> So we have something preventing that other than the fact that the dialog is modal?

No, not really....  I suppose it may be possible for OnDataAvailable and
OnStopRequest events to be processed while we are inside the prompt call if the
event queue is processed, and that might not be so good.  The problem, I think,
is that we do not suspend those events while making the prompt call.  So,
yeah... this patch might be trouble :-(
Comment on attachment 199374 [details] [diff] [review]
v1 patch

OK, this patch is no good.  A cheap solution to this bug is to simply bypass
the cache when making a sync load.  I tried to avoid that solution because it
introduces bad performance on the site in question, but maybe that's just going
to have to be the solution for Gecko 1.8.
Attachment #199374 - Attachment is obsolete: true
Attached patch v2 patchSplinter Review
This patch implements what we came up with earlier: a load flag for causing the
channel to bypass the local cache instead of waiting for a cache entry.
Attachment #199664 - Flags: superreview?(bzbarsky)
Attachment #199664 - Flags: review?(cbiesinger)
Attachment #199374 - Flags: superreview?(jst)
Blocks: 312534
Attachment #199664 - Flags: superreview?(bzbarsky) → superreview+
Comment on attachment 199664 [details] [diff] [review]
v2 patch

+     * request.  Unlike LOAD_BYPASS_CACHE, it does not force an end-to-end
load
+     * (i.e., it does not affect proxy caches), and it does not prevent the
+     * channel from writing to the cache once it receives a response from the
+     * server.

hmm... this comment sounds to me like LOAD_BYPASS_CACHE prevents writing to the
cache, which isn't the case...

r=biesi
Attachment #199664 - Flags: review?(cbiesinger) → review+
Attachment #199664 - Flags: approval1.8rc1?
> hmm... this comment sounds to me like LOAD_BYPASS_CACHE prevents writing to the
> cache, which isn't the case...

Good point.  I'll clean that up before check-in.
Darin, can you tell us what kind of risk is involved in taking this and how many
sites are going to be broken or developers prevented from doing good and obvious
coding by thid problem? 

Also, this would need to land and be verified to work and not break anything on
the trunk before we'd consider for the branch.
> Darin, can you tell us what kind of risk is involved in taking this and how 
> many sites are going to be broken or developers prevented from doing good and 
> obvious coding by thid problem?

Asa: This bug is a regression.  While websites can workaround this problem, I
suspect that many will be shocked to find that they have to.  The fix is very
low-risk IMO, and I think we should definitely take it for this release.

I'm landing this on the trunk today.
we'll evaluate again after this has been landed and verified on the trunk.
fixed-on-trunk
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
I still see issues using today's Windows trunk build Mozilla/5.0 (Windows; U;
Windows NT 5.1; en-US; rv:1.9a1) Gecko/20051021 Firefox/1.6a1.  I load the page,
and when I hover over the links in the comments section, I have difficult
escaping out of that mode. At one point the browser page just became blank, and
then eventually the browser comes back to life. I will have Tracy try this out
too to see what he sees.
seemed to work ok for me in a cvs build and my nightly trunk build on winxp
updated to today. I couldn't reproduce the issues Marcia is seeing.
Marcia, Yeah that is working as expected.  The good part is that your browser
did eventually come back to life.  Without the patch in this bug, it would never
come back to life.  Fixing the other part of the bug is well beyond the scope of
FF 1.5.
Verifying based on Darin's comment using Mozilla/5.0 (Windows; U; Windows NT
5.1; en-US; rv:1.9a1) Gecko/20051021 Firefox/1.6a1. I checked CPU usage while
the page was loading and there are no issues there, as well as no crash. Marking
verified.
Status: RESOLVED → VERIFIED
Also looks good on Linux trunk build from 1021
Attachment #199664 - Flags: approval1.8rc1? → approval1.8rc1+
fixed1.8
Keywords: fixed1.8
Summary: mozilla 1.5beta1 freezes & goes into 95+% cpu usage browsing blackisha.com → Firefox 1.5beta1 freezes & goes into 95+% cpu usage browsing blackisha.com
*** Bug 300631 has been marked as a duplicate of this bug. ***
Just a note that I can still reproduce this with 1.5 (Gecko/20051111 Firefox/1.5) on OSX -- talkback http://talkback-public.mozilla.org/talkback/fastfind.jsp?search=2&type=iid&id=12086427 , I had to SEGV the process manually to get tb to kick in.  I only left it running for about 10-15 minutes, but in that time it never came back.  I can also trivially reproduce this using an AJAX "skin" for the web interface for my SlimServer (music server thingy, http://www.slimdevices.com/su_downloads.html), which seems to use sync xmlhttprequest calls throughout =/
Vlad: I think it is a different problem on OSX.  My FF gets hung as well sometimes -- not always -- and when it does get hung, it is not consuming the CPU.  So, I think you found a different bug.
Vlad: In my testing, I found that if I waited long enough, my browser would always become unfrozen.  With some of the very large comment blocks on blackisha.com, it would sometimes take several minutes to fetch the data.  Also, if you happened to trigger multiple mouseover events on that site, then you would trigger multiple synchronous fetchs of the data, which would surely increase the time it takes to unblock the browser (note: the disk cache doesn't help).  I filed bug 317596 for this issue.
Blocks: 757042
No longer blocks: 757042
You need to log in before you can comment on or make changes to this bug.