Closed Bug 905909 Opened 9 years ago Closed 9 years ago

test_tree.xul causes infinite recursion

Categories

(Core :: Disability Access APIs, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla27

People

(Reporter: bholley, Assigned: roc)

References

Details

Attachments

(2 files, 1 obsolete file)

Bug 845095 describes intermittent timeout oranges in states/test_tree.xul. The eventual solution was to disable the test on OSX 10.6, which wasn't very satisfying. I'm getting something similar with names/test_tree.xul, so I suspect this is an issue that is shared between the different test_tree.xul instances.

I'm now getting consistent macosx64-debug oranges (on try) with my patches in bug 732665, where I effectively triple the stack limit on that configuration. I believe that the issue is that these tests trigger infinite recursion, causing us to race with the timeout. If we win the race, we through an over-recursion error and move on. If we lose, the timeout fires, and we get an intermittent orange.

On my fast local machine, the test still passes with the deeper recursion limit, but takes significantly longer. But the tinderbox machines are significantly slower, so they probably lose the race. The solution is probably to fix the infinite recursion in the test.

I've got a stack for the overflow, which I'll attach shortly.
Attached file stack.txt
Flags: needinfo?(trev.saunders)
Roc, it looks like this is pres shell code for reflows infinitely recursing, does that seem reasonable to you?
Flags: needinfo?(trev.saunders) → needinfo?(roc)
Set a breakpoint on PresShell::PostReflowCallback, enable it while we're infinitely-recursing and get a stack for that --- including the JS stack via DumpJSStack.

Trees run script during post-reflow callbacks and it looks like the test's script is causing another reflow.
Flags: needinfo?(roc)
So a call stack for PresShell::FrameNeedsReflow during the infinitely-recursing part would be helpful too.
Bobby, can you please try what Robert suggest?
Flags: needinfo?(bobbyholley+bmo)
So here's what happens:

nsTreeBodyFrame adds itself as a post-reflow callback because it needs to update various scrolling state after the reflow:

http://pastebin.mozilla.org/2987764

When handling the post-reflow, it ends up invoking FullScrollbarsUpdate, which invokes a script runner for nsTreeBodyFrame::CheckOverflow. This, in turn, decides that there is an overflow, and synchronously dispatches an underflow event, which gets picked up in the tree XBL binding here: 

http://mxr.mozilla.org/mozilla-central/source/toolkit/content/widgets/tree.xml#954

This does |tree.setAttribute("hidevscroll", "true");|, which ends up translating to the |collapsed| attribute on the underlying element. This invokes nsNodeUtils::AttributeChanged, which ends up adding a pending restyle here:

http://pastebin.mozilla.org/2987753

I see a number of ways this cycle could potentially be broken - it seems like someone somewhere should do something asynchronously (perhaps we should make it as unsafe to run script so that the ScriptRunner machinery will do the right thing?). Anyway, I'll let roc tell us what to do here.
Flags: needinfo?(bobbyholley+bmo) → needinfo?(roc)
If hidevscroll is already true, setting it to true again should not generate a new pending restyle.
Flags: needinfo?(roc)
We appear to be bouncing between the "overflow" and "underflow" events, causing us to alternately set hidevscroll to true and false: http://mxr.mozilla.org/mozilla-central/source/toolkit/content/widgets/tree.xml#954
Flags: needinfo?(roc)
It's weird for hiding a scrollbar to trigger an overflow event, but OK...

Here's what I think we should do: nsTreeBodyFrame::CheckOverflow, after dispatching an event, should set a flag on nsTreeBodyFrame indicating that we're in CheckOverflow while calling FlushPendingNotifications(Flush_Layout). In nsTreeBodyFrame::FullScrollbarsUpdate, if we're already inside a CheckOverflow, we should dispatch nsOverflowChecker as an async runnable instead of using AddScriptRunner. That means a single overflow or underflow event still happens synchronously but nested ones will be deferred to the event loop.
Flags: needinfo?(roc)
Attachment #801861 - Flags: review?(roc)
Attachment #801861 - Flags: review?(roc) → review+
So, the crashtest failure in layout/xul/tree/crashtests/382444-1.html stems from a situation in which |this| points to poisoned memory at the end of nsTreeBodyFrame::CheckOverflow, so invoking PresContext() causes us to dereference 0x5a5a5a5a5a5a5a5a. That seems bad...
Flags: needinfo?(roc)
Instead of "return weakFrame.IsAlive()", do "if (!weakFrame.IsAlive()) { return false; }" before reading mCheckingOverflow.
Flags: needinfo?(roc)
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #13)
> Instead of "return weakFrame.IsAlive()", do "if (!weakFrame.IsAlive()) {
> return false; }" before reading mCheckingOverflow.

But there's already an NS_ENSURE_TRUE(weakFrame.IsAlive(), false); right before that, no?

This isn't the most efficient process in the world. Is there someone with more layout fu that I can hand this patch off to? Or maybe trevor could drive it?
Flags: needinfo?(roc)
Oops, I read your comment incorrectly.

+  mCheckingOverflow = true;
+  PresContext()->PresShell()->FlushPendingNotifications(Flush_Layout);
+  mCheckingOverflow = false;

We need a weakFrame check before the two mCheckingOverflow assignments. We should also acquire the presshell ahead of time and keep it strongly referenced.

(In reply to Bobby Holley (:bholley) from comment #14)
> This isn't the most efficient process in the world. Is there someone with
> more layout fu that I can hand this patch off to? Or maybe trevor could
> drive it?

Fair. I've updated the patch here:
https://tbpl.mozilla.org/?tree=Try&rev=3f3a851b3434
Flags: needinfo?(roc)
The assertion failure is probably easy to fix but the test_tree.xul failures don't look so easy...
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #16)
> The assertion failure is probably easy to fix but the test_tree.xul failures
> don't look so easy...

Maybe trevor can fix those?
Flags: needinfo?(trev.saunders)
I doubt I know more than you do about this stuff, but I can take a look
Flags: needinfo?(trev.saunders)
I believe this patch exposes a longstanding bug in XUL trees: dynamic changes to the "rows" attribute of the base <tree> element don't trigger reflow of the nsTreeBodyFrame. nsTreeBodyFrame::GetMinSize does "baseElement->GetAttr(kNameSpaceID_None, nsGkAtoms::rows, rows)" but as far as I can tell there is nothing to trigger reflow of the nsTreeBodyFrame when that "rows" attribute changes.

All the test_tree.xul tests set rows=4 on the <tree> element markup, and don't change it, except for test_tree_view.xul, which has this:

function init()
{
  var tree = document.getElementById("tree-view");
  tree.view = view;
  tree.treeBoxObject.ensureRowIsVisible(0);
  is(tree.treeBoxObject.getFirstVisibleRow(), 0, "first visible after ensureRowIsVisible on load");
  tree.setAttribute("rows", "4");

  setTimeout(testtag_tree, 0, "tree-view", "treechildren-view", "multiple", "simple", "tree view");
}

]]>
</script>

<tree id="tree-view">
  <treecols>
    <treecol id="name" label="Name" sort="label" flex="1"/>
    <treecol id="address" label="Address" flex="1"/>
  </treecols>
  <treechildren id="treechildren-view"/>
</tree>

Now without this patch, nothing in init() flushes layout before setting "rows" to 4. In particular nsTreeBoxObject::EnsureRowIsVisible and nsTreeBoxObject::GetFirstVisibleRow do not flush layout (which is really a bug; if there's a pending style change to the containing XUL that will change the height of the tree, it's wrong to just ignore that change). There's a pending layout before entering init() which is not flushed until after init() finishes, at which point the new "rows" value is taken into account.

But with this patch, CheckOverflow gets called during "tree.view = view" (which calls nsTreeBodyFrame::SetView which calls nsTreeBodyFrame::FullScrollbarsUpdate which runs CheckOverflow synchronously since we're safe to run script). That flushes the pending layout we had coming into init(), and setting "rows" to 4 doesn't make the layout dirty, so we keep going with a a tree body that's 0 rows high, and the synthetic double-click on a cell to make it editable doesn't work (the click misses the tree entirely).
Assignee: nobody → roc
This version of the patch makes dynamic changes to "rows" reflow the tree:
https://tbpl.mozilla.org/?tree=Try&rev=cfc156b7b898
Attached patch fix v2Splinter Review
Attachment #801861 - Attachment is obsolete: true
Attachment #805230 - Flags: review?(dholbert)
Comment on attachment 805230 [details] [diff] [review]
fix v2

>+++ b/layout/xul/base/src/nsBoxFrame.cpp
>@@ -1232,16 +1232,23 @@ nsBoxFrame::AttributeChanged(int32_t aNa
>+  else if (aAttribute == nsGkAtoms::rows &&
>+           tag == nsGkAtoms::tree) {
>+    // Reflow ourselves and all our children if "rows" changes, since
>+    // nsTreeBodyFrame's layout reads this from its parent (this frame).
>+    PresContext()->PresShell()->
>+      FrameNeedsReflow(this, nsIPresShell::eStyleChange, NS_FRAME_IS_DIRTY);
>+  }

Seems like we should put this code in a (new) nsTreeBodyFrame::AttributeChanged method, rather than appending a tree-specific chunk to the general box-frame ::AttributeChanged method, right?

Something like this:
  http://mxr.mozilla.org/mozilla-central/source/layout/xul/base/src/nsDeckFrame.cpp?rev=6a5bf6934ba8#65

>--- a/layout/xul/tree/nsTreeBodyFrame.cpp
>+  if (!weakFrame.IsAlive()) {
>+    fprintf(stderr, "FRAME DEAD 1!\n");
>+    return;
>+  }
>+  mCheckingOverflow = true;
>+  fprintf(stderr, "BEGIN FlushPendingNotifications\n");

Is this fprintf logging code intended to be checked in, or was this just for debugging?
Comment on attachment 805230 [details] [diff] [review]
fix v2

>@@ -911,34 +912,54 @@ nsTreeBodyFrame::CheckOverflow(const Scr
[...]
>+  mCheckingOverflow = true;
>+  fprintf(stderr, "BEGIN FlushPendingNotifications\n");
>+  presShell->FlushPendingNotifications(Flush_Layout);
>+  fprintf(stderr, "END FlushPendingNotifications\n");
>+  if (!weakFrame.IsAlive()) {
>+    fprintf(stderr, "FRAME DEAD 2!\n");
>+    return;
>+  }
>+  mCheckingOverflow = false;

Two things:
 (1) Probably worth asserting that mCheckingOverflow is false right before you set it to true, right? If that doesn't hold, seems like things could get weird.

 (2) I was *going* to suggest using AutoRestore to clean up mCheckingOverflow, but then I realized that it'd be dangerous if AutoRestore tried to modify this->mCheckingOverflow on a deleted "this" pointer, in the !IsAlive() codepath.
So: maybe worth adding a comment saying something like...
  // NOTE: can't use AutoRestore to clean up mCheckingOverflow, because
  // |this| might be deleted when we return!
...near where you set mCheckingOverflow, to stop someone from coming along and "helpfully" switching this to AutoRestore at some point in the future?

>+  // Overflow checking dispatches synchronous events, which can cause infinite
>+  // recursion during reflow. Do the first overflow check synchronously, but
>+  // force any nested checks to round-trip through the event loop. See bug
>+  // 905909.
>+  nsRefPtr<nsOverflowChecker> checker = new nsOverflowChecker(this);
>+  if (mCheckingOverflow) {
>+    NS_DispatchToCurrentThread(checker);
>+  } else {
>+    nsContentUtils::AddScriptRunner(checker);
>+  }

Readability nit: The code is backwards with respect to the prose there. (The bottom clause is the sync one, which the prose says we do "first".)

Maybe swap the clauses (and negate the "if" check), to make the code be in more obvious temporal order & match the prose better?

>+++ b/layout/xul/tree/nsTreeBodyFrame.h
>@@ -612,15 +612,17 @@ protected: // Data Members
>   // Do we have a fixed number of onscreen rows?
>   bool mHasFixedRowCount;
> 
>   bool mVerticalOverflow;
>   bool mHorizontalOverflow;
> 
>   bool mReflowCallbackPosted;
> 
>+  bool mCheckingOverflow;

Probably worth a one-liner comment explaining this bool, like:
  // Are we currently inside of ::CheckOverflow? (to prevent infinite recursion)

r=me with that and comment 22 addressed.
Attachment #805230 - Flags: review?(dholbert) → review+
Version: unspecified → Trunk
(In reply to Daniel Holbert [:dholbert] from comment #22)
> Comment on attachment 805230 [details] [diff] [review]
> fix v2
> 
> >+++ b/layout/xul/base/src/nsBoxFrame.cpp
> >@@ -1232,16 +1232,23 @@ nsBoxFrame::AttributeChanged(int32_t aNa
> >+  else if (aAttribute == nsGkAtoms::rows &&
> >+           tag == nsGkAtoms::tree) {
> >+    // Reflow ourselves and all our children if "rows" changes, since
> >+    // nsTreeBodyFrame's layout reads this from its parent (this frame).
> >+    PresContext()->PresShell()->
> >+      FrameNeedsReflow(this, nsIPresShell::eStyleChange, NS_FRAME_IS_DIRTY);
> >+  }
> 
> Seems like we should put this code in a (new)
> nsTreeBodyFrame::AttributeChanged method, rather than appending a
> tree-specific chunk to the general box-frame ::AttributeChanged method,
> right?

No, because this the "rows" attribute on the <tree>, which is a parent of the <treebody> which gets the nsTreeBodyFrame. So this is just a generic nsBoxFrame.

> Is this fprintf logging code intended to be checked in, or was this just for
> debugging?

I'll take it out.

(In reply to Daniel Holbert [:dholbert] from comment #23)
> Two things:
>  (1) Probably worth asserting that mCheckingOverflow is false right before
> you set it to true, right? If that doesn't hold, seems like things could get
> weird.

Sure.

>  (2) I was *going* to suggest using AutoRestore to clean up
> mCheckingOverflow, but then I realized that it'd be dangerous if AutoRestore
> tried to modify this->mCheckingOverflow on a deleted "this" pointer, in the
> !IsAlive() codepath.
> So: maybe worth adding a comment saying something like...
>   // NOTE: can't use AutoRestore to clean up mCheckingOverflow, because
>   // |this| might be deleted when we return!
> ...near where you set mCheckingOverflow, to stop someone from coming along
> and "helpfully" switching this to AutoRestore at some point in the future?

Sure.

> Readability nit: The code is backwards with respect to the prose there. (The
> bottom clause is the sync one, which the prose says we do "first".)
> 
> Maybe swap the clauses (and negate the "if" check), to make the code be in
> more obvious temporal order & match the prose better?

Sure.

> >+++ b/layout/xul/tree/nsTreeBodyFrame.h
> >@@ -612,15 +612,17 @@ protected: // Data Members
> >   // Do we have a fixed number of onscreen rows?
> >   bool mHasFixedRowCount;
> > 
> >   bool mVerticalOverflow;
> >   bool mHorizontalOverflow;
> > 
> >   bool mReflowCallbackPosted;
> > 
> >+  bool mCheckingOverflow;
> 
> Probably worth a one-liner comment explaining this bool, like:
>   // Are we currently inside of ::CheckOverflow? (to prevent infinite
> recursion)

Sure.
https://hg.mozilla.org/mozilla-central/rev/fb4ac527994e
https://hg.mozilla.org/mozilla-central/rev/394579caa310
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla27
You need to log in before you can comment on or make changes to this bug.