Closed Bug 724235 Opened 12 years ago Closed 10 years ago

crash in nsStyleSet::ReparentStyleContext

Categories

(Core :: XBL, defect)

17 Branch
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla36

People

(Reporter: cbook, Assigned: MatsPalmgren_bugz)

References

()

Details

(Keywords: crash, regression, reproducible)

Crash Data

Attachments

(5 files, 1 obsolete file)

noticed on crashstats for Firefox 11a2 - Crash Report [@ nsStyleSet::ReparentStyleContext(nsStyleContext*, nsStyleContext*, mozilla::dom::Element*) ]

overview report: https://crash-stats.mozilla.com/report/list?signature=nsStyleSet%3A%3AReparentStyleContext%28nsStyleContext%2A%2C+nsStyleContext%2A%2C+mozilla%3A%3Adom%3A%3AElement%2A%29

example crash report: 
https://crash-stats.mozilla.com/report/index/d0390db3-3011-4b1b-a65b-926ba2120203


Crashing Thread
Frame 	Module 	Signature [Expand] 	Source
0 		@0xe 	
1 	xul.dll 	nsStyleSet::ReparentStyleContext 	layout/style/nsStyleSet.cpp:1454
2 	xul.dll 	nsFrameManager::ReResolveStyleContext 	layout/base/nsFrameManager.cpp:1249
3 	xul.dll 	nsFrameManager::ReResolveStyleContext 	layout/base/nsFrameManager.cpp:1598
4 	xul.dll 	nsFrameManager::ReResolveStyleContext 	layout/base/nsFrameManager.cpp:1598
5 	xul.dll 	nsFrameManager::ReResolveStyleContext 	layout/base/nsFrameManager.cpp:1598
6 	xul.dll 	nsFrameManager::ReResolveStyleContext 	layout/base/nsFrameManager.cpp:1598
7 	xul.dll 	nsFrameManager::ReResolveStyleContext 	layout/base/nsFrameManager.cpp:1598
8 	xul.dll 	nsFrameManager::ReResolveStyleContext 	layout/base/nsFrameManager.cpp:1598
9 	xul.dll 	nsFrameManager::ReResolveStyleContext 	layout/base/nsFrameManager.cpp:1598
10 	xul.dll 	nsFrameManager::ComputeStyleChangeFor 	layout/base/nsFrameManager.cpp:1684
11 	xul.dll 	mozilla::css::RestyleTracker::DoProcessRestyles 	layout/base/RestyleTracker.cpp:242
12 	xul.dll 	PresShell::FlushPendingNotifications 	layout/base/nsPresShell.cpp:4073
13 	xul.dll 	nsRefreshDriver::Notify 	layout/base/nsRefreshDriver.cpp:396
14 	xul.dll 	nsTimerImpl::Fire 	xpcom/threads/nsTimerImpl.cpp:431
15 	xul.dll 	nsTimerEvent::Run 	xpcom/threads/nsTimerImpl.cpp:524
16 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:660
17 	nspr4.dll 	PR_Unlock 	nsprpub/pr/src/threads/combined/prulock.c:347
18 	xul.dll 	MessageLoop::RunHandler 	ipc/chromium/src/base/message_loop.cc:201
19 	xul.dll 	_SEH_epilog4 	
20 	xul.dll 	MessageLoop::Run 	ipc/chromium/src/base/message_loop.cc:175
21 	xul.dll 	nsImageBoxFrame::OnStopDecode 	layout/xul/base/src/nsImageBoxFrame.cpp:608
22 	xul.dll 	nsBaseAppShell::Run 	widget/src/xpwidgets/nsBaseAppShell.cpp:189
23 		@0x2991d5f
Depends on: 698469
There's a spike in crashes (about 50 crashes per hour!) starting in 16.0a1/20120613. The regression range for the spike is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=131961e5e0d1&tochange=964b11fea7f1
Crash Signature: [@ nsStyleSet::ReparentStyleContext(nsStyleContext*, nsStyleContext*, mozilla::dom::Element*) ] → [@ nsStyleSet::ReparentStyleContext(nsStyleContext*, nsStyleContext*, mozilla::dom::Element*) ] [@ nsStyleSet::ReparentStyleContext ]
Keywords: topcrash
OS: Windows 7 → All
Hardware: x86 → All
Summary: Crash Report [@ nsStyleSet::ReparentStyleContext(nsStyleContext*, nsStyleContext*, mozilla::dom::Element*) ] → crash in nsStyleSet::ReparentStyleContext
Steps to reproduce that work reliably for me:

- Have NoScript installed, start with the default start page.
- View -> Toolbars -> Addon bar.
- Click the NoScript icon in the addon bar. 

Boom.
Might it be a fallout of bug 539356 or bug 736695?
Comment 1 through Comment 3 are a new regression in today's nightly, which just happens to crash in the function this bug was filed on.

I filed bug 764591 on the new regression -- let's take the discussion over there.
Depends on: 764591
Keywords: topcrash
bp-98a866da-4ac3-4488-a7f7-309392120616
http://hg.mozilla.org/mozilla-central/rev/4e3362864fbd
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko/16.0 Firefox/16.0a1 ID:20120616030538

STR
1. Start Nightly nightly with clean profile
2. Install NoScript https://addons.mozilla.org/ja/firefox/addon/noscript/ And Restart
3. Enable Add-on Bar
4. Open Customize Toolbar by right click toolbar
5. Drag and Drop NoScript Icon from Navigation Toolbar to Add-on Bar And Done
6. Mouse hover over the NoScript Icon on the Add-on Bar.
Comment 5 and Comment 2 are both bug 764591 (which should hopefully be fixed in tomorrow's nightly).
(In reply to Alice0775 White from comment #5)
> bp-98a866da-4ac3-4488-a7f7-309392120616
> http://hg.mozilla.org/mozilla-central/rev/4e3362864fbd
> Mozilla/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko/16.0 Firefox/16.0a1
> ID:20120616030538
> 
> STR
> 1. Start Nightly nightly with clean profile
> 2. Install NoScript https://addons.mozilla.org/ja/firefox/addon/noscript/
> And Restart
> 3. Enable Add-on Bar
> 4. Open Customize Toolbar by right click toolbar
> 5. Drag and Drop NoScript Icon from Navigation Toolbar to Add-on Bar And Done
> 6. Mouse hover over the NoScript Icon on the Add-on Bar.

This is bug 765502
which itself is a dupe of bug 765502.
Removing topcrash as bug 764591 is fixed.
Keywords: topcrash
(In reply to Bob Clary [:bc:] from comment #12)
> http://www.rozmari.blogfa.com/category/8
I doesn't crash for me.
(In reply to Scoobidiver from comment #13)
> (In reply to Bob Clary [:bc:] from comment #12)
> > http://www.rozmari.blogfa.com/category/8
> I doesn't crash for me.

using? How long did you let it sit there? May take a little while.
(In reply to Bob Clary [:bc:] from comment #14)
> using? How long did you let it sit there? May take a little while.
On 64-bit Windows 7 with 22.0a1/20130228. Enough long.
hmm. pretty reproducible for me but I'm using vms. including the app notes from the crash reports.

win61i64 (esxi vm) bp-5e6bd1e2-bedf-425d-9554-1cd2a2130301

AdapterVendorID: 0x0000, AdapterDeviceID: 0x0000, AdapterSubsysID: 00000000, AdapterDriverVersion: 

Has dual GPUs. GPU #2: AdapterVendorID2: 0x15ad, AdapterDeviceID2: 0x0405, AdapterSubsysID2: 040515ad, AdapterDriverVersion2: 7.14.1.1070D3

win61i32 (esxi vm) bp-ea496b2a-162c-467f-beb6-939972130301

AdapterVendorID: 0x0000, AdapterDeviceID: 0x0000, AdapterSubsysID: 00000000, AdapterDriverVersion: 

Has dual GPUs. GPU #2: AdapterVendorID2: 0x15ad, AdapterDeviceID2: 0x0405, AdapterSubsysID2: 040515ad, AdapterDriverVersion2: 7.14.1.1070D3
The following steps are 100% reproducible on Linux64:
1. load http://www.abhinethri.com/Actress/T/Taapsee_Pannu/Tapsee_01.html
2. click the "Photosession" link

bp-93b3cab1-4b31-4cba-b6e5-deec92130528
Attached file stacks
It's also reproducible in a debug build -- there's an assertion before
the crash (stack included):
###!!! ASSERTION: Unexpected document; this will lead to incorrect behavior!: 'aElement->GetCurrentDoc() == Document()', file layout/base/RestyleTracker.cpp, line 260
Regression window for the STR in comment 17: 
2012-08-14-03-05-21 -- 2012-08-15-03-05-56
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=22288130fea2&tochange=86ee4deea55b

So I guess that might be a separate bug.  Or possibly that on Linux there was
a second change required to trigger the crash.
Keywords: regression
STR of comment 12: bp-bc163701-65bc-4c3c-857e-903cd2130528
STR of comment 17: bp-c5bb9de2-3215-4ced-b9c0-0ac012130528
They seem similar with mozilla::css::RestyleTracker::DoProcessRestyles in the stack trace.
Version: 11 Branch → 17 Branch
(In reply to Mats Palmgren [:mats] from comment #19)
> Regression window for the STR in comment 17: 
> 2012-08-14-03-05-21 -- 2012-08-15-03-05-56
> http://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=22288130fea2&tochange=86ee4deea55b
> 
> So I guess that might be a separate bug.  Or possibly that on Linux there was
> a second change required to trigger the crash.

bug 775965 is the most suspicious thing I see in that range.

(In reply to Mats Palmgren [:mats] from comment #18)
> Created attachment 754639 [details]
> stacks
> 
> It's also reproducible in a debug build -- there's an assertion before
> the crash (stack included):
> ###!!! ASSERTION: Unexpected document; this will lead to incorrect
> behavior!: 'aElement->GetCurrentDoc() == Document()', file
> layout/base/RestyleTracker.cpp, line 260

That's because aElement->GetCurrentDoc() is null.
And, in particular, we're looking at an nsXULScrollFrame (display:box) that's still in the frame tree but whose mContent is an HTMLDivElement that's not IsInDoc(), and whose mSubtreeRoot is itself.  Its frame parent is display:inline-block, and has a normal content node.

(gdb) up
#2  0x00007fd19c5cad12 in nsFrameManager::ReResolveStyleContext (this=0x36a2050, 
    aPresContext=0x42485c0, aFrame=0x44be170, aParentContent=0x4115950, 
    aChangeList=0x7fffd721e560, aMinChange=nsChangeHint_RepaintFrame, 
    aParentFrameHintsNotHandledForDescendants=0, aRestyleHint=0, 
    aRestyleTracker=..., 
    aDesiredA11yNotifications=nsFrameManager::eSendAllNotifications, 
    aVisibleKidsOfHiddenElement=..., aTreeMatchContext=...)
    at /home/dbaron/builds/ssd/mozilla-central/mozilla/layout/base/nsFrameManager.cpp:1068
1068	      if (aRestyleTracker.GetRestyleData(content->AsElement(), &restyleData)) {
(gdb) p aFrame
$19 = (nsXULScrollFrame *) 0x44be170
(gdb) p aFrame->mParent
$20 = (nsBlockFrame *) 0x1ea0a70
(gdb) p $20->mContent
$21 = (mozilla::dom::HTMLDivElement *) 0x4115950
(gdb) p $->mParent
$22 = (mozilla::dom::HTMLAnchorElement *) 0x51c95c0
(gdb) p $->mParent
$23 = (mozilla::dom::HTMLElement *) 0x4115dd0
(gdb) p $->mParent
$24 = (mozilla::dom::HTMLBodyElement *) 0x3695d40
(gdb) p $->mParent
$25 = (mozilla::dom::HTMLSharedElement *) 0x278f340
(gdb) p $->mParent
$26 = (nsHTMLDocument *) 0x40fb210
(gdb) p aFrame->mContent
$27 = (mozilla::dom::HTMLDivElement *) 0x48bf820
(gdb) p $->mSubtreeRoot
$28 = (mozilla::dom::HTMLDivElement *) 0x48bf820
(gdb) p $20->mContent->mPrimaryFrame
$29 = (nsBlockFrame *) 0x1ea0a70

(note that mSubtreeRoot and mPrimaryFrame are a union)


We crash because we try to reparent the style context of that nsXULScrollFrame to have a null parent, and we dereference null.
(I'm not planning to do any more with this bug right now.)
For the STR in comment 17, I get this bad changeset:

changeset:   102269:bcac58cbf328
user:        Chris Pearce <cpearce@mozilla.com>
date:        Tue Aug 14 16:06:44 2012 +1200
summary:     Bug 775965 - Ensure presentation persists across nsSubDocumentFrame reframes. r=roc

Specifically, if I comment out this flush the crash does not occur:
http://hg.mozilla.org/mozilla-central/annotate/18fc62fd8dcc/layout/generic/nsSubDocumentFrame.cpp#l788

I don't see anything wrong with flushing there per se, I think it just
triggers a bug that already existed.
Attached file stacks + frame dumps
Restyling the <marquee> creates a new style context for it the
"Block(marquee)(1)@0x7fffc0277c90" frame.  Its child, the XULscroll
needs to have its style context reparented, but
nsFrame::DoGetParentStyleContextFrame mistakenly thinks this is
the root frame and returns null:

7214   if (mContent && !mContent->GetParent() &&
7215       !StyleContext()->GetPseudo()) {
7216     // we're a frame for the root.  We have no style context parent.
7217     return nullptr;
7218   }

http://hg.mozilla.org/mozilla-central/annotate/18fc62fd8dcc/layout/generic/nsFrame.cpp#l7212

because the XULscroll content has mParent == null (it's the root
of an anonymous subtree) and does not have a pseudo.
So it's a (safe) null-pointer crash, at least for this testcase.
Attached patch Like so? (obsolete) — Splinter Review
I did verify in a debugger that the XULscroll style context's parent
was indeed changed to the new style context of its parent (green in
the attached frame dump)
https://tbpl.mozilla.org/?tree=Try&rev=bed97bdddca3
Attachment #757041 - Flags: review?(bzbarsky)
Comment on attachment 757041 [details] [diff] [review]
Like so?

Being root of anonymous subtree should not mean null mParent.  Why is the parent actually null here?
Attached file more stacks
> Being root of anonymous subtree should not mean null mParent.

OK, it's non-null at the time the frame tree is created by the
flush in nsHideViewer::Run.

> Why is the parent actually null here?

It's nulled out by UnbindFromTree from a RemoveFromBindingManagerRunnable
(first stack) that was posted as a result of the parser doing
a RemoveChildAt (second stack).  I guess as a result of some "adoption"
operation due to the unclosed <a> ?
Attachment #757041 - Attachment is obsolete: true
Attachment #757041 - Flags: review?(bzbarsky)
The nsHideViewer and RemoveFromBindingManagerRunnable are created as
a result of the same nsINode::doRemoveChildAt call.
First we do nsNodeUtils::ContentRemoved which destroys the frame tree
which posts the nsHideViewer.  Then we call UnbindFromTree which
posts the RemoveFromBindingManagerRunnable.
http://hg.mozilla.org/mozilla-central/annotate/295a11e1efef/content/base/src/nsINode.cpp#l1432

So I guess we could delay the nsHideViewer so it occurs after all other
script runners.  Maybe just use NS_DispatchToCurrentThread instead?
The NS_DispatchToCurrentThread idea didn't work:
https://tbpl.mozilla.org/?tree=Try&rev=ecbcb0a1e683

Trying a "delayed script runner" approach instead:
https://tbpl.mozilla.org/?tree=Try&rev=9d7ec2ca7d38
Attached patch fix+testSplinter Review
Use a dummy script runner that appends the nsHideViewer runner at the end,
thereby "stepping over" any other runners that was added after it,
e.g. RemoveFromBindingManagerRunnables.

https://tbpl.mozilla.org/?tree=Try&rev=e88b26425fc9
Assignee: nobody → matspal
Attachment #757355 - Flags: review?(bzbarsky)
Comment on attachment 757355 [details] [diff] [review]
fix+test

> It's nulled out by UnbindFromTree from a RemoveFromBindingManagerRunnable

Why is this not killing the frame?  That seems like the real bug here: unbound things should not have frames, ever.  We should add an assert to that effect in UnbindFromTree!

I don't think messing with scriptrunners is the right way to go here, at first glance...
Attachment #757355 - Flags: review?(bzbarsky) → review-
It does! as a result of nsNodeUtils::ContentRemoved
http://hg.mozilla.org/mozilla-central/annotate/295a11e1efef/content/base/src/nsINode.cpp#l1432
during which nsSubDocumentFrame::DestroyFrom posts a nsHideViewer.
Then we do the UnbindFromTree which posts the RemoveFromBindingManagerRunnable.

After all that is done, with no frame tree, we start running script runners,
first nsHideViewer which *creates a new frame tree* with its flush,
then RemoveFromBindingManagerRunnable...
The point is RemoveFromBindingManagerRunnable is calling UnbindFromTree on a node that has a frame, no?  That's not an OK thing to do.  It needs to kill off that frame first.
(In reply to Boris Zbarsky (:bz) from comment #35)
> The point is RemoveFromBindingManagerRunnable is calling UnbindFromTree on a
> node that has a frame, no?

Yes, but that frame did not exist when the RemoveFromBindingManagerRunnable
instance was created.  It was created later, by the flush in nsHideViewer.

>  That's not an OK thing to do.

Yes, of course.  That's what my patch is trying to avoid.

> It needs to kill off that frame first.

I really don't think we want an ongoing UnbindFromTree operation to mess
with the *new* frame tree that did not exist at the time UnbindFromTree
began.  It seems to me that "kill off that frame first" will lead to the
wrong result (a Frankenstein frame tree).

We should instead makes sure the UnbindFromTree has completed before
creating a new frame tree.  Yes, my patch mostly just wallpapers the
problem by delaying the nsHideViewer to create the separation.

Alternatively, I can fix the underlying problem in UnbindFromTree if you
want but that's a bit more work.  Instead of using script runners to do
the RemovedFromDocumentInternal part, I imagine we could collect those
in an array and run them at the end of the top-most UnbindFromTree.
Would that be an acceptable solution to you?
(In reply to Boris Zbarsky (:bz) from comment #33)
> unbound things should not have frames, ever.  We should add an assert to
> that effect in UnbindFromTree!

I tried that,

MOZ_ASSERT(!mPrimaryFrame) fails because of the way
nsXBLBinding::InstallAnonymousContent works -- it always calls UnbindFromTree:
(gdb) bt
#0  nsIContent::UnbindFromTree
#1  nsXBLBinding::InstallAnonymousContent
#2  nsXBLBinding::GenerateAnonymousContent
#3  nsXBLService::LoadBindings
#4  nsCSSFrameConstructor::AddFrameConstructionItemsInternal
...

Relaxing it to MOZ_ASSERT(!GetPrimaryFrame()) fixed that, but it fails
in AnonymousContentDestroyer::Run for the <input>.setAttribute call in
content/html/content/crashtests/571428-1.html
(gdb) bt
#0  nsIContent::UnbindFromTree
#1  AnonymousContentDestroyer::Run
#2  nsContentUtils::RemoveScriptBlocker
#3  nsAutoScriptBlocker::~nsAutoScriptBlocker
#5  mozilla::dom::Element::SetAttr
...

Let me know if you want bugs filed for those issues.
> It was created later, by the flush in nsHideViewer.

Sure.

> That's what my patch is trying to avoid.

But it only avoids it for the one case here.  Nothing prevents other script runners having the same problem...

> I really don't think we want an ongoing UnbindFromTree operation to mess
> with the *new* frame tree that did not exist at the time UnbindFromTree
> began.

There are two different UnbindFromTree here, no?  There's the one on the non-anonymous content and the one one the anonymous content.  The latter is not making sure that the anon content has no frames first, as far as I can tell, which is the bug.

> MOZ_ASSERT(!mPrimaryFrame) fails 

Because mPrimaryFrame is in a union, sure.

> but it fails in AnonymousContentDestroyer::Run 

Yes, because imo AnonymousContentDestroyer is buggy, no?
(In reply to Boris Zbarsky (:bz) from comment #38)
> > That's what my patch is trying to avoid.
> 
> But it only avoids it for the one case here.  Nothing prevents other script
> runners having the same problem...

Which I already admitted to, and suggested an alternative solution
which would fix all such cases since the top-most UnbindFromTree
would be fully complete when it returns.  Did you see that part?
(comment 36, last two paragraphs)

> There are two different UnbindFromTree here, no?  There's the one on the
> non-anonymous content and the one one the anonymous content.

Correct.  The latter runs from a RemoveFromBindingManagerRunnable.

> The latter is not making sure that the anon content has no frames first,
> as far as I can tell, which is the bug.

No, there should not be any frames there in the first place since they
were already destroyed earlier.  What happens is:

The parser decides it wants to move a node from one place to another
so it does a Remove followed by an Append:
  1. nsINode::doRemoveChildAt
       1a. destroys the frame tree (fully), one of the frames posts
           a nsHideViewer
       1b. calls UnbindFromTree, one of the nodes posts a 
           RemoveFromBindingManagerRunnable
  2. nsINode::AppendChildTo
       2a. calls BindToTree

then we execute the script runners:
  3. nsHideViewer flushes which creates frames for all the content,
     including anon content.
  4. RemoveFromBindingManagerRunnable calls UnbindFromTree
     to complete the operation in 1b, but since 2a and 3 occurred
     it's messing up stuff in the new tree.

I've suggested a wallpaper and a alternative fix that would solve
the problem completely.  Not sure what more I can do here...
(In reply to Boris Zbarsky (:bz) from comment #38)
> Because mPrimaryFrame is in a union, sure.

Oh, OK.  I didn't get the memo on that ;-)

> Yes, because imo AnonymousContentDestroyer is buggy, no?

Probably, but I didn't dive into the details on how it's supposed to work.
I did see comment 36 last part.  I guess it's not obvious to me that this problem is limited to the RemoveFromBindingManagerRunnable as posted from UnbindFromTree, but maybe it is....

> then we execute the script runners:

Hrm.  The fact that #4 happens after #2 is just wrong; now I seem to recall it coming up before... :(

Blake, Jonas, can we just fix that ordering?
It does indeed seem like it would be nice if we could call RemovedFromDocumentInternal synchronously rather than off a script runner.

That looks possible if we change nsXBLBinding::ChangeDocument to do all of it's script jazz of a script runner rather than synchronously. We would also need to call nsXBLPrototypeBinding::BindingDetached from a scriptrunner rather than synchronously.

It'd be nice to change this teardown stuff so that all script-running is located in a single function, and that that function is called from a scriptrunner. And all other teardown is done synchronously.

We should probably wait until the other xbl refactoring has landed before making such a big change though.
The underlying problem is in content/ code so reassigning.
Assignee: matspal → nobody
Component: Layout → XBL
I can't reproduce this crash in any build.  It seems to have been fixed in bug 930250;
if I revert that change then I can reproduce the crash loading the attached testcase.
Depends on: 930250
Flags: in-testsuite?
Landed a crashtest:
https://hg.mozilla.org/integration/mozilla-inbound/rev/49adb7e3b03b
Flags: in-testsuite? → in-testsuite+
https://hg.mozilla.org/mozilla-central/rev/49adb7e3b03b
Assignee: nobody → mats
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: