Closed Bug 622326 Opened 9 years ago Closed 9 years ago

Tab-modal confirm(): browser freeze

Categories

(Core :: DOM: Core & HTML, defect)

defect
Not set

Tracking

()

VERIFIED FIXED
mozilla2.0b11
Tracking Status
blocking2.0 --- final+

People

(Reporter: kdevel, Assigned: Paolo)

References

(Blocks 1 open bug)

Details

(Keywords: regression, Whiteboard: [softblocker][fx4-fixed-bugday])

Attachments

(1 file, 5 obsolete files)

User-Agent:       
Build Identifier: Mozilla/5.0 (X11; Linux x86_64; rv:2.0b9pre) Gecko/20100101 Firefox/4.0b9pre

Follow-up to Bug 621308.

Reproducible: Always

Steps to Reproduce:
1. open reduced testcase https://bug619644.bugzilla.mozilla.org/attachment.cgi?id=498236
2. press stop-button with mouse (stop loading page)
3. create second tab open reduced testcase https://bug619644.bugzilla.mozilla.org
Actual Results:  
Browser freezes.

Expected Results:  
Keep running.

Built from http://hg.mozilla.org/mozilla-central/rev/b2275b45a33d
STR #3 should read:

3. create second tab, open reduced testcase https://bug619644.bugzilla.mozilla.org/attachment.cgi?id=498236
WFM using Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b9pre) Gecko/20101231 Firefox/4.0b9pre ID:20101231030358
Blocks: 59314
Keywords: regression
Product: Firefox → Toolkit
QA Contact: general → general
Version: unspecified → Trunk
Confirmed on
http://hg.mozilla.org/mozilla-central/rev/b2275b45a33d
Mozilla/5.0 (X11; Linux i686; rv:2.0b9pre) Gecko/20110101 Firefox/4.0b9pre ID:20110101030401

And this happens since 
http://hg.mozilla.org/mozilla-central/rev/baa6cc2f72e4
Mozilla/5.0 (X11; Linux i686; rv:2.0b8pre) Gecko/20101120 Firefox/4.0b8pre ID:20101120030848
Status: UNCONFIRMED → NEW
blocking2.0: --- → ?
Ever confirmed: true
Browser occupies a core of CPU(i.e 25% on C2Q) on windows7,
However, the browser does not freeze.
I filed a Bug 622331,(I am not sure whether it to be duplication of this or not)
Edge case so this doesn't block for now, would be nice to figure out whether there is something more sinister going on with these though.
blocking2.0: ? → -
nsThread logging revealed that the high CPU usage observed here is caused
by an event handler, ClearScopeEvent, that continues to re-post itself to
the event queue until a certain condition is met:

http://hg.mozilla.org/mozilla-central/file/4a3866321a14/dom/base/nsGlobalWindow.cpp#l1153

The event is raised for the first time just before a DOM inner window is
destroyed ("--DOMWINDOW" console message) to prevent memory leaks by
clearing the inner window's global JavaScript object. This operation
cannot be done safely until all the scripts in the window's scripting
context stopped their execution (I guess that this is because any one of
them might hold a reference to data stored in the global object of the
window being destroyed?).

The fact is that the scripting context is actually stored in the outer
window and is thus shared between all the inner windows, either visible
or invisible. This means that the condition the event handler is looking
for (no scripts executing) will never be met if *any* inner window is
executing a script, even a window that is not currently being destroyed.
This also means that no inner window can be destroyed if there is at
least one of them that, for example, is stuck in a modal event loop.

To reproduce this condition, and the high CPU usage, reliably:

1. Type in the address bar:
    data:text/html,<body onload="alert(0);">
    Leave the tab-modal message box open.
2. Open a new tab and repeat step 1.
3. Close the first tab.
4. Wait a few seconds. This is often necessary because the destruction of
    DOM windows when they are closed is not immediate, but delayed a bit.

Can also be reproduced on FF 3.6 using two different windows instead of tabs.
This issue existed for a long time, but the new tab-modal dialogs will make
it happen more often, because they start inner modal event loops.

The attached patch replaces the event approach with a polling timer,
and solves the CPU usage issue for me. Later I noticed that there might
be a better approach based on nsIScriptContext::SetTerminationFunction,
I think I'll try and implement it later. Meanwhile I'm interested in
feedback on the issue in general.

I'm not sure whether termination functions are executed when any script on
the stack returns to C++ or when the first script posted to the JS context
returns to C++. If the case is the latter, in presence of modal loops,
termination functions in general might be invoked much later than expected,
however in my case that would be better because I should not need to check
jsscx->GetExecutingScript() again.
Assignee: nobody → paolo.02.prg
Status: NEW → ASSIGNED
Attachment #502676 - Flags: feedback?(bzbarsky)
Blocks: 622331
Blocks: 615931
Attached patch The patch (obsolete) — Splinter Review
This should fix the CPU usage issue and related event starvation.

Just a bit of XPCOM / C++ advice required:

+nsGlobalWindow::ClearWindowScopeOnDestruction(nsISupports *aWindow)
+{
+  nsCOMPtr<nsIDOMWindow> xpcom_window(do_QueryInterface(aWindow));
+  nsIDOMWindow *raw_window = xpcom_window;
+  nsGlobalWindow *window = static_cast<nsGlobalWindow *>(raw_window);

I suppose there is a saner way of doing this cast?
Attachment #502676 - Attachment is obsolete: true
Attachment #502906 - Flags: review?(bzbarsky)
Attachment #502676 - Flags: feedback?(bzbarsky)
For testing, I don't think an automated one can check CPU usage reliably.
The manual testing steps are however quite easy:

1. Open a CPU usage monitor (for example, Task Manager on Windows).
2. Check that CPU usage is low to begin with.
3. Go to this address:
   data:text/html,<body onload="alert(0);">
   Leave the tab-modal message box open.
4. Open a new tab, go to this address:
   data:text/html,<body onload="alert(0);">
   Leave the tab-modal message box open.
5. Close the first tab.
6. Wait 10 seconds.
7. Check that CPU usage is still low and Firefox is responsive.
(In reply to comment #5)
> Edge case so this doesn't block for now, would be nice to figure out whether
> there is something more sinister going on with these though.

Further investigation revealed that this is not such an edge case, it may
happen very often when tab-modal dialogs in more than one tab are ignored.
When this problem occurs, starvation of UI refresh events make the interface
unresponsive, and Firefox becomes unusable.

There is a patch and a manual test case awaiting review.

It may also be appropriate to backport the fix to 1.9.2 and earlier branches.
blocking2.0: - → ?
Whiteboard: [has patch] [needs review bz]
Reproducible regression causing hang/resource-starvation.

This seems like the kind of scenario that we just can't accurately predict how edge-case-y it is, so I think we should block on fixing it.
blocking2.0: ? → final+
Whiteboard: [has patch] [needs review bz] → [has patch] [needs review bz] [softblocker]
> I'm not sure whether termination functions are executed when any script on
> the stack returns to C++

Sadly, yes.

jst, Blake, peterv do we even need this scope-clearing jazz anymore?  We don't need it to actually clear the inner window scope, and with cycle collector we shouldn't need it for leak prevention....
(In reply to comment #11)
> jst, Blake, peterv do we even need this scope-clearing jazz anymore?  We don't
> need it to actually clear the inner window scope, and with cycle collector we
> shouldn't need it for leak prevention....

It also occurred to me that this might not be needed anymore because of the
cycle collector. If this is true and there's no reason to wait until all
scripts stopped to call NotifyWindowIDDestroyed, then all this code can be
removed.
(In reply to comment #11)
> jst, Blake, peterv do we even need this scope-clearing jazz anymore?  We don't
> need it to actually clear the inner window scope, and with cycle collector we
> shouldn't need it for leak prevention....

See bug 470510. We're not quite there yet.
Comment on attachment 502906 [details] [diff] [review]
The patch

>+++ b/dom/base/nsGlobalWindow.cpp
>+nsGlobalWindow::ClearWindowScopeOnDestruction(nsISupports *aWindow)

I'd call this function something more like TryClearWindowScope.

>+  nsCOMPtr<nsIDOMWindow> xpcom_window(do_QueryInterface(aWindow));
>+  nsIDOMWindow *raw_window = xpcom_window;
>+  nsGlobalWindow *window = static_cast<nsGlobalWindow *>(raw_window);

This can just be:

  nsGlobalWindow *window = 
    static_cast<nsGlobalWindow *>(static_cast<nsIDOMWindow*>(this));

r=me with those two nits.
Attachment #502906 - Flags: review?(bzbarsky) → review+
Attached patch The patch (obsolete) — Splinter Review
Attachment #502906 - Attachment is obsolete: true
Setting checkin-needed, manual test case is in comment 8.
Keywords: checkin-needed
OS: Linux → All
Hardware: x86_64 → All
Whiteboard: [has patch] [needs review bz] [softblocker] → [has patch] [softblocker]
Attached patch The patch (obsolete) — Splinter Review
Uhm... latest patch contained garbage due to automatic line-ending conversions.
Sorry for the bugspam.
Attachment #504003 - Attachment is obsolete: true
Pushed http://hg.mozilla.org/mozilla-central/rev/0f1fd87b570d

Paolo, thanks for the fix!
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Flags: in-testsuite?
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla2.0b10
new Bug 626015
I backed this out on suspicion of causing bug 626100, an intermittent large leak in a11y mochitests.

http://hg.mozilla.org/mozilla-central/rev/9c32afba1189
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Component: General → DOM
Product: Toolkit → Core
QA Contact: general → general
It looks like the leak happens when SetDocShell(null) is called before all
scripts in the outer window's scripting context have stopped. Here's the code:

http://mxr.mozilla.org/comm-central/source/mozilla/dom/base/nsGlobalWindow.cpp#2319

In this condition and with the patch applied, the call to FreeInnerObjects
sets one or more termination functions on the context, but the context is
then destroyed in SetDocShell, a few lines of code later, and this
assertion is hit:

http://mxr.mozilla.org/comm-central/source/mozilla/dom/base/nsJSEnvironment.cpp#1158

Without the patch applied, no leak occurs apparently, but still
mContext->ClearScope is never called on mJSObject for inner windows.

By the way, the following lines mitigate the problem by solving the leak in
case the assertion is hit. Maybe it's worth including them anyway?

   NS_PRECONDITION(!mTerminations, "Shouldn't have termination funcs by now");
+  if (mTerminations)
+    delete mTerminations;

To avoid the assertion in this specific case, I see two possibilities:

1) If we don't need to call ClearScope if we are going to destroy the
    context, we could just call FreeInnerObjects(PR_FALSE) instead of
    FreeInnerObjects(PR_TRUE).

2) If we need to call ClearScope even if we are going to destroy
    the context, I'm open to suggestions :-)

However, I can also see a scenario where we set a termination function on
the context during a previous scope clearing request operation, and we
never got the opportunity of running it. The same goes for termination
functions that other windows set on us. Any idea on how to handle this
case? We might either cancel the execution of termination functions, or
forcibly run them even if scripts are still executing, but I don't know
this area of the code enough to understand the implications.
jst or mrbkap are probably the best people to answer those questions.  I can dig through the code to figure that stuff out, but they might recall off the top of their heads.
   NS_PRECONDITION(!mTerminations, "Shouldn't have termination funcs by now");
+  if (mTerminations)
+    delete mTerminations;

With this change only, we get the current behavior, without the leak. So in
case we can't figure out a better behavior, we could just fix this and re-land,
what do you think?

I'm not sure what to do with assertions in cases like this one, where we know
that the condition might occur during the normal flow of the program, even if
in rare cases. See also bug 398103 for the assertion in LeaveModalState. Should
these be removed, turned into warnings, or left in the code as they are?
> I'm not sure what to do with assertions in cases like this one

Remove them; they're clearly bogus if they're asserting something that just isn't true.
Attached patch The patch, fixing memory leak (obsolete) — Splinter Review
Attachment #504008 - Attachment is obsolete: true
Attachment #506036 - Flags: review?(bzbarsky)
Comment on attachment 506036 [details] [diff] [review]
The patch, fixing memory leak

This looks fine to me, but I'd like jst to take a look at least at the mTerminations changes too...
Attachment #506036 - Flags: review?(jst)
Attachment #506036 - Flags: review?(bzbarsky)
Attachment #506036 - Flags: review+
Duplicate of this bug: 588219
Comment on attachment 506036 [details] [diff] [review]
The patch, fixing memory leak

- In nsJSContext::~nsJSContext():

+  if (mTerminations)
+    delete mTerminations;

No need for the if check here, delete is null safe.

r=jst

The one thing that I see here that's a difference in behavior from before, other than fixing this bug, is that this makes us potentially clear scope synchronously in cases where we used to go back to the event loop before doing so. This should make no difference to any JS code, but it could in theory cause objects to be deleted earlier than they do before this change. I can't think of any cases where that matters, but it could affect C++ callers that don't hold proper strong refs to stuff or what not. If we feel it's worth worrying about that we could keep ClearScopeEvent code in addition to the new termination function code, but I'm not convinced it is.
Attachment #506036 - Flags: review?(jst) → review+
(In reply to comment #28)
> If we feel it's worth worrying about
> that we could keep ClearScopeEvent code in addition to the new termination
> function code, but I'm not convinced it is.

As soon as we're convinced it isn't, here is the patch ready to land :-)

Thanks for the review!
Attachment #506036 - Attachment is obsolete: true
I decided to go ahead and push this fix.

http://hg.mozilla.org/mozilla-central/rev/5e0bba79c052

Thanks for your work her Paolo!
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
(In reply to comment #30)
> I decided to go ahead and push this fix.
> Thanks for your work her Paolo!

Thanks to you! Curiously, I was going to use this patch to test my brand-new
tryserver access exactly now, this saved some cycles for other devs I guess :-)
Whiteboard: [has patch] [softblocker] → [softblocker]
Target Milestone: mozilla2.0b10 → mozilla2.0b11
Sorry to spoil your try server test run plans, I guess you're going to have to fix some other bugs now so you can test your try server access :)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla2.0b11 → mozilla2.0b10
jst, did you mean to reopen this?
I'm assuming he did not.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Target Milestone: mozilla2.0b10 → mozilla2.0b11
Status: RESOLVED → VERIFIED
Whiteboard: [softblocker] → [softblocker][fx4-fixed-bugday]
Verified with Mozilla/5.0 (X11; Linux i686; rv:2.0b12pre) Gecko/20110204 Firefox/4.0b12pre
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.