Closed Bug 858937 Opened 7 years ago Closed 7 years ago

crash in nsCSSFrameConstructor::MaybeRecreateContainerForFrameRemoval

Categories

(Core :: CSS Parsing and Computation, defect, critical)

18 Branch
ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla25
blocking-b2g leo+
Tracking Status
firefox23 --- wontfix
firefox24 --- wontfix
firefox25 --- fixed
b2g18 --- fixed
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- wontfix
b2g-v1.1hd --- fixed

People

(Reporter: scoobidiver, Assigned: dbaron)

References

()

Details

(Keywords: crash, reproducible, topcrash, Whiteboard: [b2g-crash])

Crash Data

Attachments

(2 files)

It has been hit by two users in B2G/18.0.

Here is a crash report: bp-22e4c10c-2ca4-4bbd-9da8-124e22130402.

Frame 	Module 	Signature 	Source
0 	libxul.so 	nsCSSFrameConstructor::MaybeRecreateContainerForFrameRemoval 	nsIFrame.h:838
1 	libxul.so 	nsCSSFrameConstructor::RecreateFramesForContent 	nsCSSFrameConstructor.cpp:9348
2 	libxul.so 	nsCSSFrameConstructor::MaybeRecreateContainerForFrameRemoval 	nsCSSFrameConstructor.cpp:9255
3 	libxul.so 	nsCSSFrameConstructor::ContentRemoved 	nsCSSFrameConstructor.cpp:7472
4 	libxul.so 	PresShell::ContentRemoved 	nsPresShell.cpp:4145
5 	libxul.so 	nsNodeUtils::ContentRemoved 	nsNodeUtils.cpp:177
6 	libxul.so 	nsINode::doRemoveChildAt 	nsINode.cpp:1360
7 	libxul.so 	mozilla::dom::FragmentOrElement::RemoveChildAt 	FragmentOrElement.cpp:1037
8 	libxul.so 	nsGenericHTMLElement::SetInnerHTML 	nsGenericHTMLElement.cpp:1345
9 	libxul.so 	nsIDOMHTMLElement_SetDOMInnerHTML 	nsGenericHTMLElement.h:171
10 	libxul.so 	js::baseops::SetPropertyHelper 	jscntxtinlines.h:450
11 	libxul.so 	js::Interpret 	jsinterpinlines.h:360
12 	libxul.so 	js::RunScript 	jsinterp.cpp:324
13 	libxul.so 	js::InvokeKernel 	jsinterp.cpp:378
14 	libxul.so 	js_fun_call 	jsinterp.h:109
15 	libxul.so 	js::InvokeKernel 	jscntxtinlines.h:364
16 	libxul.so 	js::Interpret 	jsinterp.cpp:2475
17 	libxul.so 	js::RunScript 	jsinterp.cpp:324
18 	libxul.so 	js::Invoke 	jsinterp.cpp:378
19 	libxul.so 	JS_CallFunctionValue 	jsapi.cpp:5893
20 	libxul.so 	nsXPCWrappedJSClass::CallMethod 	XPCWrappedJSClass.cpp:1432 
...

More reports at:
https://crash-stats.mozilla.com/report/list?signature=nsCSSFrameConstructor%3A%3AMaybeRecreateContainerForFrameRemoval
So frame 2 is at:

9249   // Reconstruct if inflowFrame is parent's only child, and parent is, or has,
9250   // a non-fluid continuation, i.e. it was split by bidi resolution
9251   if (!inFlowFrame->GetPrevSibling() &&
9252       !inFlowFrame->GetNextSibling() &&
9253       ((parent->GetPrevContinuation() && !parent->GetPrevInFlow()) ||
9254        (parent->GetNextContinuation() && !parent->GetNextInFlow()))) {
9255     *aResult = RecreateFramesForContent(parent->GetContent(), true);
9256     return true;

and frame 1 is at:

9348   if (frame && MaybeRecreateContainerForFrameRemoval(frame, &rv)) {
9349     return rv;
9350   }

and frame 0 is at:

838   nsIFrame* GetParent() const { return mParent; }

and the crash is a null-deref.

The best guess is that somehow the placeholder for the frame ended up null or something?  That's the only way I can see a GetParent() on null in MaybeRecreateContainerForFrameRemoval, apart from parent->GetParent() being null down in the IsAnonymousFlexItem(parent) case.
Do we have anything like urls here?
Keywords: needURLs
crashing in homescreen app - app://homescreen.gaiamobile.org/manifest.webapp
Keywords: needURLs
It still happens in B2G 18.0/20130506.
It's #3 top crasher in B2G 18.0.
blocking-b2g: --- → leo?
Keywords: topcrash
I agree that this looks like the frame has a null placeholder, as bz suggested in comment 1. 

(It'd be pretty bizarre for this to be the IsAnonymousFlexItem(parent) case -- anonymous flex items have a parent by definition, and more importantly, flexbox support was still preffed off in the gecko 18 timeframe, so we shouldn't have any flex containers / anonymous flex items.)

At this point I wonder if we should just add a band-aid to make us return true if we have a null placeholder, instead of crashing.  (where "true" would indicate "yes, recreate frames", to be on the safe side, since we can't really tell)
(In reply to Daniel Holbert [:dholbert] from comment #6)
> At this point I wonder if we should just add a band-aid to make us return
> true if we have a null placeholder, instead of crashing.  (where "true"
> would indicate "yes, recreate frames", to be on the safe side, since we
> can't really tell)

Oh, never mind -- a "true" return value from this function would indicate "yes, we _did_ recreate frames", not "yes please recreate frames".  So that band-aid suggestion isn't quite right.

Plus, if we return without reframing sufficiently, we could end up in an inconsistent state which could open us up to something scarier than a null crash...
leo+ since this is a top crash, but since we'll always ship with a top 10 list, this may not end up being a blocker later in the game.
blocking-b2g: leo? → leo+
(In reply to Alex Keybl [:akeybl] from comment #8)
> since we'll always ship with a top 10 list
Not necessarily for B2G if no bugs comply with criteria (all below the noise). See https://wiki.mozilla.org/CrashKill/Topcrash
Note that a working about:crashes and the capability of commenting would help reproduce crashes.
dbaron, can you help us find somebody to take a look at the stack? Hopefully we find the issue quickly, if not, we can ask QA for their help reproducing.
Assignee: nobody → dbaron
Is there a way to find crash reports for this crash other than the one directly referenced in comment 0?  The query in comment 0 doesn't show anything, nor do reasonable modifications to it that I made (like switching product to B2G).
Flags: needinfo?(scoobidiver)
(In reply to David Baron [:dbaron] (don't cc:, use needinfo? instead) from comment #11)
> Is there a way to find crash reports for this crash other than the one
> directly referenced in comment 0?  The query in comment 0 doesn't show
> anything, nor do reasonable modifications to it that I made (like switching
> product to B2G).
It works for me whatever the UI product. Try https://crash-stats.mozilla.com/report/list?product=B2G&signature=nsCSSFrameConstructor%3A%3AMaybeRecreateContainerForFrameRemoval
Flags: needinfo?(scoobidiver)
(In reply to David Baron [:dbaron] (don't cc:, use needinfo? instead) from comment #11)
>The query in comment 0 doesn't show
> anything, nor do reasonable modifications to it that I made (like switching
> product to B2G).

Strange, it works fine for me.
Oh, I guess the problem was it showed me the "Signature Summary" tab, which was empty, and I need to switch to the "Reports" tab to see anything useful.
The dominant URLs I'm seeing in the crash reports are these:
  http://forecast.io/
  http://taratatach.github.com/FFDoku/manifest.webapp
plus variations with stuff stuck on the end of them (though it's common to see those bare URLs as the URL field as well)
Oh, and the forecast.io crashes have stacks going through the flush associated with setting CanvasRenderingContext2D.strokeStyle or fillStyle and only have one MaybeRecreateContainerForFrameRemoval on the stack, whereas the FFDoku crashes have stacks that go through ContentRemoved and are in the second recursive call to MaybeRecreateContainerForFrameRemoval.
(ContentRemoved via the HTMLElement.innerHTML setter, I should have said)
It's pretty easy for me to get http://forecast.io/ to crash on a B2G phone; I can go to the selector icon in the upper left and then choose "Go to current location".  (Though it gives me the crash UI that I'd associate with the hang detector, except there was no pause before it came up.)  I can try to debug when I'm in the office.
(In reply to David Baron [:dbaron] (don't cc:, use needinfo? instead) from comment #14)
> Oh, I guess the problem was it showed me the "Signature Summary" tab, which
> was empty, and I need to switch to the "Reports" tab to see anything useful.

Oh, yes, there might be a bug with the Signature Summary there, I hope it will be solved when we switch over to the new Socorro UI (but that page is completely broken there right now for different reasons).
Spent some time debugging on v1-train (i.e., with b2g18 gecko).  We're trying to reframe an nsTableOuterFrame that has the NS_FRAME_OUT_OF_FLOW bit set, but whose (old, presumably) style data say it's display:inline-block (!!), float:none, and position: relative.  Not sure how we got in that situation, but probably the next step is to check if it also crashes on master (mozilla-central gecko).
The nsTableOuterFrame's child nsTableFrame (more sensibly) has display:table and position:absolute, though.
The nsTableOuterFrame's style context is for ::-moz-table-outer.  Worth noting, though, that its parent style context is not the nsTableFrame's style context.  Its parent style context is also display:inline-block.

In the case where we reresolve a style context for a reframe, we shouldn't set the new style context on the frame (to avoid violating invariants like these).  And it does look like we do this correctly when the style context parent comes from a child (see assumeDifferenceHint and the way it's passed to CaptureChange, which puts it into aMinChange, and the call to SetStyleContext right after CaptureChange).

Though the fact that NS_FRAME_OUT_OF_FLOW isn't set also makes it seem like something that's been wrong for a while rather than something that went wrong during this restyle.
And it also crashes on master (i.e., with current gecko).
So the crash on mozilla-central is a bit different, but seems to have the same underlying cause (an nsTableOuterFrame whose style doesn't at all match the frame).

If I add an nsTableOuterFrame::DidSetStyleContext:
+/* virtual */ void
+nsTableOuterFrame::DidSetStyleContext(nsStyleContext* aOldStyleContext)
+{
+  nsContainerFrame::DidSetStyleContext(aOldStyleContext);
+
+  MOZ_ASSERT(GetStyleDisplay()->mDisplay == NS_STYLE_DISPLAY_TABLE ||
+             GetStyleDisplay()->mDisplay == NS_STYLE_DISPLAY_INLINE_TABLE);
+}

(and put the same MOZ_ASSERT in the constructor, but that's not relevant), ***and*** change every occurrence of SetStyleContextWithoutNotification in nsTransitionManager.cpp to SetStyleContext, then I hit my MOZ_ASSERT during nsTransitionManager::UpdateThrottledStylesForSubtree.
I think the underlying problem here is that nsTransitionManager::UpdateThrottledStylesForSubtree is assuming that the primary frame for a content node is the frame associated with that content node's styles.  I think nsTableOuterFrame is the only exception to this (see, say, nsComputedDOMStyle::GetPropertyCSSValue).
Component: Layout → CSS Parsing and Computation
I have a patch series that I've confirmed fixes the crash on forecast.io on master.

I tried (not all that hard) to repro the crash on the FFDoku app, and couldn't.
The fixes to the miniflush code
(nsTransitionManager::UpdateThrottledStyle and UpdateAllThrottledStyles)
fix the case where we constructed totally incorrect style contexts for
outer table frames (which have special style contexts inheriting from
the table frame) during the miniflush, leading to inconsistent style
data and other bad things, when we should have been touching the style
on the table frame instead.

The fixes to the other OMTA codepaths lead to layer tests being
performed on the same frame that the styles will be applied to, and
probably fix real bugs (which would occur when animating opacity or
transform on a table).
Attachment #766979 - Flags: review?(ncameron)
Attachment #766978 - Flags: review?(ncameron) → review+
Comment on attachment 766979 [details] [diff] [review]
patch 2:  Make transitions code that should be using nsLayoutUtils::GetStyleFrame do so.

Review of attachment 766979 [details] [diff] [review]:
-----------------------------------------------------------------

Should we also use GetStyleFrame in other places that the primary frame is used without checking for table outer frames? (A quick grep found one in nsComputedDOMStyle.cpp and one in nsStyleStructInlines.h).
Attachment #766979 - Flags: review?(ncameron) → review+
(In reply to Nick Cameron [:nrc] from comment #29)
> Should we also use GetStyleFrame in other places that the primary frame is
> used without checking for table outer frames? (A quick grep found one in
> nsComputedDOMStyle.cpp and one in nsStyleStructInlines.h).

I looked at those two; the nsComputedDOMStyle.cpp one already has a more complicated special case for exactly the same thing (it needs to track both frames); the nsStyleStructInlines.h is specific to the root... although, on second thought, perhaps it does need adjusting.  But probaly in a separate bug.
And I really should add an automated test for this case, even though it wouldn't be hit by any of our current test suites, though it would be something that they could hit when we turn on OMTA for desktop platforms.
Flags: in-testsuite?
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/f94cd48c8907 because I couldn't tell whether this or bug 886635 was the cause of the sPagePrintTimer warnings-as-errors bustage of https://tbpl.mozilla.org/php/getParsedLog.php?id=24554210&tree=Mozilla-Inbound
(In reply to David Baron [:dbaron] (don't cc:, use needinfo? instead) from comment #30)
> (In reply to Nick Cameron [:nrc] from comment #29)
> > Should we also use GetStyleFrame in other places that the primary frame is
> > used without checking for table outer frames? (A quick grep found one in
> > nsComputedDOMStyle.cpp and one in nsStyleStructInlines.h).
> 
> I looked at those two; the nsComputedDOMStyle.cpp one already has a more
> complicated special case for exactly the same thing (it needs to track both
> frames); the nsStyleStructInlines.h is specific to the root... although, on
> second thought, perhaps it does need adjusting.  But probaly in a separate
> bug.

I saw two uses in nsComputedDOMStyle.cpp - I think the one in nsComputedDOMStyle::GetPropertyCSSValue is the one you refer to, but there is also a use in nsComputedDOMStyle::GetStyleContextForElementNoFlush which looks to me like a simple case like the ones in animation/transitions. I don't understand the code well enough to say for sure it should be changed though.
(In reply to Nick Cameron [:nrc] from comment #35)
> I saw two uses in nsComputedDOMStyle.cpp - I think the one in
> nsComputedDOMStyle::GetPropertyCSSValue is the one you refer to, but there
> is also a use in nsComputedDOMStyle::GetStyleContextForElementNoFlush which
> looks to me like a simple case like the ones in animation/transitions. I
> don't understand the code well enough to say for sure it should be changed
> though.

That one is already fixed in patch 1.
Merged patches to b2g18, tested (in a user build of v1-train+b2g18) that the crash on forecast.io is in fact fixed by them there as well, and pushed per the rules in https://wiki.mozilla.org/Release_Management/B2G_Landing#Landing_leo.2B_bugs_for_v1.1.0_.28updated_2.2F13.29 :

remote:   https://hg.mozilla.org/releases/mozilla-b2g18/rev/566203eec96d
remote:   https://hg.mozilla.org/releases/mozilla-b2g18/rev/b158e7aa6cd5
(In reply to David Baron [:dbaron] (don't cc:, use needinfo? instead) from comment #36)
> (In reply to Nick Cameron [:nrc] from comment #35)
> > I saw two uses in nsComputedDOMStyle.cpp - I think the one in
> > nsComputedDOMStyle::GetPropertyCSSValue is the one you refer to, but there
> > is also a use in nsComputedDOMStyle::GetStyleContextForElementNoFlush which
> > looks to me like a simple case like the ones in animation/transitions. I
> > don't understand the code well enough to say for sure it should be changed
> > though.
> 
> That one is already fixed in patch 1.

AH, of course, sorry.
https://hg.mozilla.org/mozilla-central/rev/4cae950f6c35
https://hg.mozilla.org/mozilla-central/rev/20ae43c44de6
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla25
It would be good to confirm that this has actually fixed what we were seeing in crash-stats.  (I suspect it has, but I've only confirmed that it fixes the crashes on forecast.io.  It's possible the FFDoku and/or other crashes were separate bugs, which, if still present, should be filed as separate bug reports.)

I'm not sure when we will get sufficient crash data to check that, though.
Keywords: qawanted
Keywords: qawantedverifyme
(In reply to David Baron [:dbaron] (don't cc:, use needinfo? instead) from comment #41)
> I'm not sure when we will get sufficient crash data to check that, though.

Yes, won't be easy as most phones used (nearly) by users will be FxOS 1.0.1 for a while (first-wave release phones and Geeksphones), and we don't have the fix in there. OTOH, only the very small collection of our internal test phones run 1.1, which has the fix, and a small collection of Geeksphones and internal test phones run m-c/master, which also has the fix.
mass remove verifyme requests greater than 4 months old
Keywords: verifyme
You need to log in before you can comment on or make changes to this bug.