Closed Bug 1384824 Opened 2 years ago Closed 2 years ago

stylo: Crash in PLDHashTable::Add | mozilla::ArenaRefPtr<T>::assignFrom<T>

Categories

(Core :: CSS Parsing and Computation, defect, P1, critical)

Unspecified
All
defect

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox-esr52 --- unaffected
firefox55 --- wontfix
firefox56 --- fixed
firefox57 blocking verified

People

(Reporter: jchen, Assigned: heycam)

References

(Blocks 1 open bug, )

Details

(Keywords: crash)

Crash Data

Attachments

(6 files)

This bug was filed from the Socorro interface and is 
report bp-25eee233-5388-4902-9db7-b9f240170727.
=============================================================

Recent crash with first crashes from the 07-21 Nightly. Seems like the frame after `mozilla::ArenaRefPtr<T>::assignFrom<T>` is always `nsComputedDOMStyle::SetResolvedStyleContext(RefPtr<nsStyleContext>&&, unsigned __int64)`.

Manish, do you think this is related to the recent stylo work in nsComputedDOMStyle (e.g. bug 1367904)?
Flags: needinfo?(manishearth)
The Mac crash signature is slightly different, but the stack trace is nearly identical as the crashes on Windows.

bp-68c728bc-e213-46f4-bae8-ba41b0170726

[@ PLDHashTable::Add | void mozilla::ArenaRefPtr<T>::assignFrom<T> ]
Crash Signature: [@ PLDHashTable::Add | mozilla::ArenaRefPtr<T>::assignFrom<T>] → [@ PLDHashTable::Add | mozilla::ArenaRefPtr<T>::assignFrom<T>] [@ PLDHashTable::Add | void mozilla::ArenaRefPtr<T>::assignFrom<T>]
OS: Windows 10 → All
Priority: -- → P2
Summary: Crash in PLDHashTable::Add | mozilla::ArenaRefPtr<T>::assignFrom<T> → stylo: Crash in PLDHashTable::Add | mozilla::ArenaRefPtr<T>::assignFrom<T>
Yes, it's related, but I'm not sure why it's happening. Might be a null pointer dereference but ArenaRefPtr seems to have code that will handle null values.

 Will need to look into it.
If we could get an example of the URL on which this happens it would be great.
Flags: needinfo?(manishearth)
I can consistently reproduce this crash on Windows 10 by loading https://10gbps.io/
Interestingly, the crash does not occur on my build from Tuesday (only functional build I have right now, others are rebuilding) that is on https://hg.mozilla.org/mozilla-central/rev/388d81ed93fa

It does occur for me on my up to date nightly. And only on the Starling Bank site (need to move the mouse around and wait for it to trigger). I don't think it was occurring on yesterday's nightly (just before I updated today), but I didn't try too much.

However, the changes that we're thinking of are from a week before https://hg.mozilla.org/mozilla-central/rev/388d81ed93fa .

Might mozregression this.
Mozregression gives me https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=ac03688d438263275bd4c1b5dd801a41cc786847&tochange=0b3e4b5b8e4d6b195d9db08d034e99575e0a7c90 (xidorn, any idea why that would cause this?)

When running on a debug build I get an assert on https://dxr.mozilla.org/mozilla-central/rev/36f95aeb4c77f7cf3b3366583008cd6e4b6b1dba/layout/base/ArenaRefPtr.h#152

Arena() on nsStyleContext will yield PresContext->PresShell(). IIRC we're allowed to outlive the pres shell, so this is probably wrong.

I think that code should be doing `if (mPtr->Arena())` instead? bholley, wdyt?
Flags: needinfo?(xidorn+moz)
Flags: needinfo?(bobbyholley)
That bug changes whether we use stylo for blank document (the document of about:blank, which exists before we navigate to anything in a new browser container like a new tab or new iframe). Before that, we were probably not using stylo for about:blank effectively. So if this bug can only appear in about:blank, then that change may uncover this bug.
Flags: needinfo?(xidorn+moz)
This is the #4 Windows topcrash in Nightly 20170728100358.
I can reliably reproduce this issue with https://10gbps.io/ as well, and it seems that the document in question is a sub-document of the document but with the same url. It doesn't have PresShell, but is marked visible.
Hmmm, so it seems the crash on https://10gbps.io/ happens on an element inside a <iframe src="about:blank"> in a "display:none" tree. This can explain why bug 1384162 triggers this on this website, because before that bug, we effectively don't apply stylo on about:blank page.
Attached file simplified testcase
This is a simplified testcase which has the same crash stack as the crash happens on https://10gbps.io/ but it is constructed by guessing, not by reducing the page.
I'm not familiar with all the arena stuff, so I'd leave the rest to manish and bholley. I suppose this simplified testcase is enough for investigating what's happening there.
Assignee: nobody → manishearth
(In reply to Manish Goregaokar [:manishearth] from comment #7)
> Mozregression gives me
> https://hg.mozilla.org/integration/autoland/
> pushloghtml?fromchange=ac03688d438263275bd4c1b5dd801a41cc786847&tochange=0b3e
> 4b5b8e4d6b195d9db08d034e99575e0a7c90 (xidorn, any idea why that would cause
> this?)
> 
> When running on a debug build I get an assert on
> https://dxr.mozilla.org/mozilla-central/rev/
> 36f95aeb4c77f7cf3b3366583008cd6e4b6b1dba/layout/base/ArenaRefPtr.h#152
> 
> Arena() on nsStyleContext will yield PresContext->PresShell(). IIRC we're
> allowed to outlive the pres shell, so this is probably wrong.

The whole point of ArenaRefPtr is so that pointers to style contexts do not outlive the presshell (i.e. the presshell can clear those pointers out when it goes away).

> 
> I think that code should be doing `if (mPtr->Arena())` instead? bholley,
> wdyt?

I don't see how that would be helpful.
Flags: needinfo?(bobbyholley)
> The whole point of ArenaRefPtr is so that pointers to style contexts do not outlive the presshell (i.e. the presshell can clear those pointers out when it goes away).

So it seems like these pointers are still being cleared -- but we may be holding on to a non-arena RefPtr to ServoStyleContext somewhere else? Because the error is not with the *existing* style context, it's with the new one that we obtained from nsComputedDOMStyle::GetStyleContext()
This might be related to GetBaseComputedValuesForElement
We seem to be returning the result of ResolveTransientStyle in this case, which probably was constructed on the spot.

https://dxr.mozilla.org/mozilla-central/rev/36f95aeb4c77f7cf3b3366583008cd6e4b6b1dba/layout/style/nsComputedDOMStyle.cpp#664-672
Also, the display:none stuff in the testcase isn't necessary, you can still get this crash on the iframe via

<!DOCTYPE html>
<style>
  textarea {
    min-height: 100px;
  }
</style>
<div>
  <iframe src="about:blank"></iframe>
</div>
<script>
  let div = document.querySelector('div');
  let iframe = document.querySelector('iframe');
  iframe.onload = function() {
    let doc = iframe.contentDocument;
    let e = doc.createElement('textarea');
    doc.body.appendChild(e);
    alert(getComputedStyle(e).minHeight);

  };
</script>
Manish, Bobby and I discussed this a bit today.  There are two cases that we need to worry about:

1. The pres shell has gone away, and the existing style data in the tree should not be re-used from getComputedStyle.
2. The pres shell has gone away, and another new pres shell has been created, and the existing style data in the tree shouldn't be used for regular styling.  (I didn't check whether this will actually happen, but it seems plausible.)

For both cases we'll solve this by dropping all style data in the tree, but only doing it at times when we know we need to.  We'll do this by giving a pres shell a unique serial number, and storing on the document what the serial number of the last pres shell we styled with was.  For #1, for all of the ServoStyleSet entry points that nsComputedDOMStyle calls, we make it check whether the serial number on the document matches its current pres shell's serial number, and if not, it starts by calling ClearServoDataFromSubtree on the document.  For #2, we also explicitly do this check and clearing if we ever create a new pres shell for a document.  We should avoid doing this check for the common calls like ResolveStyleFor, to avoid adding any overhead there.

For all the other ServoStyleSet entry points that can resolve style, we can do the serial number updating/check in an #ifdef DEBUG block.  It would be nice to partition the API into callers that know the style data in the tree must be up to date (e.g. when called from nsCSSFrameConstructor, or ServoRestyleManager, or random frame classes), but we can think about that in a followup.
Actually we might not even need a serial number, just a flag to say whether the document previously had a pres shell and might have stale data in it.
Attached patch WIPSplinter Review
Like this.  Not sure about the case where the document has gone in the bfcache, and GetShell() returns null but we could still show the document again later.  Maybe Brian already handled bfcache weirdness in another bug, I'm not sure...
Attachment #8892366 - Flags: feedback?(bobbyholley)
Cameron is looking at this bug. This is our top Stylo crash report.
Assignee: manishearth → cam
Priority: P2 → P1
Comment on attachment 8892366 [details] [diff] [review]
WIP

Review of attachment 8892366 [details] [diff] [review]:
-----------------------------------------------------------------

A few potential issues:
* It seems to me that, contrary to our discussion last night, the code path for ResolveTransientStyle does leave styles in the tree. So we need to handle that somehow.
* Looks like we're missing handling for the other callsites in dom/.
Attachment #8892366 - Flags: feedback?(bobbyholley)
Priority: P1 → --
Priority: -- → P1
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #23)
> * It seems to me that, contrary to our discussion last night, the code path
> for ResolveTransientStyle does leave styles in the tree. So we need to
> handle that somehow.

Where?  (I'll add assertions to ResolveTransientStyle etc. that after resolving the style that we didn't leave ElementData in the tree, if the document has no pres shell.)

> * Looks like we're missing handling for the other callsites in dom/.

Yeah, I'll handle those next.
(In reply to Cameron McCormack (:heycam) from comment #24)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #23)
> > * It seems to me that, contrary to our discussion last night, the code path
> > for ResolveTransientStyle does leave styles in the tree. So we need to
> > handle that somehow.
> 
> Where?  (I'll add assertions to ResolveTransientStyle etc. that after
> resolving the style that we didn't leave ElementData in the tree, if the
> document has no pres shell.)

Maybe I'm missing it? Looks to me that ResolveTransientStyle just calls into Servo_ResolveStyleLazily, which doesn't appear to do any special handling to drop the styles if there's no presshell.
Crash Signature: [@ PLDHashTable::Add | mozilla::ArenaRefPtr<T>::assignFrom<T>] [@ PLDHashTable::Add | void mozilla::ArenaRefPtr<T>::assignFrom<T>] → [@ PLDHashTable::Add | mozilla::ArenaRefPtr<T>::assignFrom<T>] [@ PLDHashTable::Add | void mozilla::ArenaRefPtr<T>::assignFrom<T>] [@ PLDHashTable::Add | nsBaseHashtable<T>::Put | mozilla::ArenaRefPtr<T>::assignFrom<T>]
I'd like to track this as a blocking 57 bug. This issue is also gating a broader experiment on Beta56.
That try run has a single failure in test_computed_style_bfcache_display_none.html, which I guess answers my question from comment 21...
The one thing I'm unsure of here is what to do when the document is in the bfcache.  The patch as it stands is probably not right, although it shouldn't be unsafe or crash.  If the document is in the bfcache, and we try to resolve style on some unstyled element, then those unstyled elements will have their selector matching performed with the rules from the window getComputedStyle was called on, but it will still re-use element data stored in the tree.  The bfcache'd document's pres shell is still alive, so these styles are safe to use, but conceptually we probably don't want to use them since we should really be treating the document as undisplayed / pres-shell-less.

But just clearing those styles when it's brought out of the cache doesn't seem right, since don't we want those styles to be preserved to avoid doing that work again?  We could clear the styles, then restyle when it's brought out of the bfcache again, although we'd probably need to do that to the frame tree too, since it's not going to be simple to restyle if we've discarded the existing styles.  Do you have any good suggestions Bobby?

(This is assuming that the styles and frame tree are indeed preserved and re-used once the document comes out of the bfcache, which I haven't checked.  Is that true?)
Flags: needinfo?(bobbyholley)
Per IRC discussion, I think we're probably going to need to introduce a mode to ResolveTransientStyle that ignores styles already in the tree.
Flags: needinfo?(bobbyholley)
Comment on attachment 8893219 [details]
Bug 1384824 - Part 1: Lazily clear stale Servo element data from a document when its pres shell changes.

https://reviewboard.mozilla.org/r/164260/#review170126

::: dom/base/nsDocument.cpp:3889
(Diff revision 3)
>  
>    NS_ENSURE_FALSE(GetBFCacheEntry(), nullptr);
>  
>    FillStyleSet(aStyleSet);
>  
> +  if (aStyleSet->IsServo()) {

Nit: I would move the conditional to the part where we set mMightHaveStaleServoData.

::: dom/base/nsDocument.cpp:4014
(Diff revision 3)
> +  // Record that the tree might have stale Servo element data in it
> +  // that would need to be cleared if we ever get a new pres shell
> +  // or if we call ServoStyleSet style resolving functions on
> +  // elements in the document.
> +  mMightHaveStaleServoData = true;

Make a note of the performance reasons for which we set this bit rather than clearing style from the tree explicitly in PresShell::Destroy (and how the lazy thing works 99% of the time)?

::: dom/html/nsGenericHTMLElement.cpp:2981
(Diff revision 3)
> +  // Let's pretend everything is display:none if we're in the bfcache.
> +  if (aElement->OwnerDoc()->GetBFCacheEntry()) {
> +    return true;
> +  }

I think this will become unnecessary if we go through ResolveTransientStyle.

::: layout/style/ServoStyleSet.h:191
(Diff revision 3)
>    // Resolves style for a (possibly-pseudo) Element without assuming that the
>    // style has been resolved, and without worrying about setting the style
>    // context up to live in the style context tree (a null parent is used).
>    // |aPeudoTag| and |aPseudoType| must match.

There's really no need for ResolveTransientStyle anymore, because we no longer have the tree state in style context. As such, can you add a comment that this function can probably be merged with ResolveStyleLazily?

::: layout/style/ServoStyleSet.h:205
(Diff revision 3)
> +  // XXXheycam There's no longer any difference between ResolveTransientStyle
> +  // and ResolveTransientServoStyle so we should merge them.

And update this comment to consider that.

::: layout/style/ServoStyleSet.cpp:220
(Diff revision 3)
> +already_AddRefed<ServoStyleContext>
> +ServoStyleSet::ResolveCleanStyleFor(Element* aElement,

Per IRC discussion, I think we can nix this additional entry point, and should add a cleanup patch underneath this one to rationalize the other entry points.

::: layout/style/ServoStyleSet.cpp:488
(Diff revision 3)
>      Element* aElement,
>      CSSPseudoElementType aPseudoType,
>      nsIAtom* aPseudoTag,
>      StyleRuleInclusion aRuleInclusion)
>  {
> +  bool ignoreExistingStyles = aElement->OwnerDoc()->GetBFCacheEntry();

This is probably a good place to explain why the bfcache matters.

::: layout/style/ServoStyleSet.cpp:956
(Diff revision 3)
> +  // Servo_GetComputedKeyframeValues below won't handle ignoring existing
> +  // element data for bfcached documents.
> +  MOZ_ASSERT(!aElement->OwnerDoc()->GetBFCacheEntry());

MOZ_RELEASE_ASSERT.

::: layout/style/ServoStyleSet.cpp:983
(Diff revision 3)
>    const ServoStyleContext* aStyleContext,
>    nsTArray<RefPtr<RawServoAnimationValue>>& aAnimationValues)
>  {
> +  // Servo_GetAnimationValues below won't handle ignoring existing element
> +  // data for bfcached documents.
> +  MOZ_ASSERT(!aElement->OwnerDoc()->GetBFCacheEntry());

MOZ_RELEASE_ASSERT.

::: layout/style/ServoStyleSet.cpp:1002
(Diff revision 3)
>    nsPresContext* aPresContext,
>    nsIAtom* aPseudoTag,
>    CSSPseudoElementType aPseudoType,
>    const ServoStyleContext* aStyle)
>  {
> +  MOZ_ASSERT(!aElement->OwnerDoc()->GetBFCacheEntry(),

MOZ_RELEASE_ASSERT.

::: layout/style/ServoStyleSet.cpp:1020
(Diff revision 3)
> +  // Servo_AnimationValue_Compute below won't handle ignoring existing element
> +  // data for bfcached documents.
> +  MOZ_ASSERT(!aElement->OwnerDoc()->GetBFCacheEntry());

MOZ_RELEASE_ASSERT.

Also, might be worth adding a comment to these sites explaining why the bfcache matters, or at least pointing to some common place that explains it.

::: layout/style/nsComputedDOMStyle.cpp:594
(Diff revision 3)
>    }
>  
> +  // We do this check to avoid having to add too much special casing of
> +  // Servo functions we call to explicitly ignore any element data in
> +  // the tree.
> +  MOZ_ASSERT((aStyleType == eAll && aAnimationFlag == eWithAnimation) ||

Given that bfcache-ness is a pretty finicky dynamic thing, maybe make this MOZ_RELEASE_ASSERT?

::: servo/components/style/traversal.rs:430
(Diff revision 3)
> -    let mut layout_parent_style = style.clone();
> +    let mut layout_parent_style = None;
> +
> +    if !ignore_existing_style {
> +        layout_parent_style = style.clone();
> -    while let Some(style) = layout_parent_style.take() {
> +        while let Some(style) = layout_parent_style.take() {
> -        if !style.is_display_contents() {
> +            if !style.is_display_contents() {
> -            layout_parent_style = Some(style);
> +                layout_parent_style = Some(style);
> -            break;
> +                break;
> -        }
> +            }
>  
> -        ancestor = ancestor.unwrap().traversal_parent();
> +            ancestor = ancestor.unwrap().traversal_parent();
> -        layout_parent_style = ancestor.map(|a| {
> +            layout_parent_style = ancestor.map(|a| {
> -            a.borrow_data().unwrap().styles.primary().clone()
> +                a.borrow_data().unwrap().styles.primary().clone()
> -        });
> +            });
> -    }
> +        }
> +    }

Hm, why is this change necessary? It seems like |style| will remain |None| given the check above, at which point the old code will work fine here.
Attachment #8893219 - Flags: review?(bobbyholley) → review-
Comment on attachment 8893220 [details]
Bug 1384824 - Part 2: Crashtests.

https://reviewboard.mozilla.org/r/164262/#review170156
Attachment #8893220 - Flags: review?(bobbyholley) → review+
Comment on attachment 8893665 [details] [diff] [review]
Part 0: Consolidate lazy style resolution entrypoints.

Review of attachment 8893665 [details] [diff] [review]:
-----------------------------------------------------------------

This is so much better, thank you!

::: layout/style/ServoStyleSet.h
@@ +174,5 @@
>                              dom::Element* aPseudoElement);
>  
>    // Resolves style for a (possibly-pseudo) Element without assuming that the
>    // style has been resolved, and without worrying about setting the style
>    // context up to live in the style context tree (a null parent is used).

This comment needs updating.
Attachment #8893665 - Flags: review?(bobbyholley) → review+
Comment on attachment 8893666 [details] [diff] [review]
Part 1: Lazily clear stale Servo element data from a document when its pres shell changes. (v2)

Review of attachment 8893666 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/style/ServoStyleSet.cpp
@@ +215,5 @@
> +  ~AutoClearStaleData()
> +  {
> +#ifdef DEBUG
> +    // Assert that the element and its ancestors are all unstyled, if the
> +    // document has no pres shell.

Maybe note here that we need to check the bfcache case in addition to the presshell because being in the bfcache basically means having a "hidden" presshell that might become unhidden at some point.

@@ +478,5 @@
> +  // Lazy style computation avoid storing any new data in the tree.
> +  // If the tree has stale data in it, then the AutoClearStaleData below
> +  // will ensure it's cleared so we don't use it. But if the document is
> +  // in the bfcache, then we will have valid, usable data in the tree,
> +  // but we don't want to use it. Instead we want to pretend as if the

Make it clear that we don't want to use it because we're using the caller's prescontext for style resolution in that case, which is a totally different style set.
Attachment #8893666 - Flags: review?(bobbyholley) → review+
Pushed by cmccormack@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8cba9d56c357
Part 0: Consolidate lazy style resolution entrypoints. r=bholley
https://hg.mozilla.org/integration/autoland/rev/caed54430cf0
Part 1: Lazily clear stale Servo element data from a document when its pres shell changes. r=bholley
https://hg.mozilla.org/integration/autoland/rev/d3f55d822e4e
Part 2: Crashtests. r=bholley
(In reply to Chris Peterson [:cpeterson] from comment #5)
> I can consistently reproduce this crash on Windows 10 by loading https://10gbps.io/

Reliably crashed after ca. 2 seconds with 20170804193726 (previous build).
Does not crash anymore in my Nightly 57 x64 20170805100334 @ Debian Testing.
Let's hope the crash stats will confirm this.
Version: unspecified → Trunk
Darkspirit, thanks for confirming the fix!

Cameron, this bug was one of our top Stylo crashes from Nightly 56. We should probably uplift your fix to Beta 56 before we start our Stylo experiment on Beta.
Flags: needinfo?(cam)
Yes, good idea.  Crash stats looks good too, no instances of the crash after the 2017-08-05 build.
Flags: needinfo?(cam)
Comment on attachment 8893665 [details] [diff] [review]
Part 0: Consolidate lazy style resolution entrypoints.

Approval request applies to to parts 0, 1, and 2.  What landed had some small fixes, so please see the actual patches that landed on autoland/mozilla-central for what I'm requesting approval for.

This uplift is a prerequisite for starting a stylo pref experiment on beta.

Approval Request Comment
[Feature/Bug causing the regression]: Unsure of a specific bug, but it's when stylo is enabled.
[User impact if declined]: Content can crash with certain getComputedStyle operations when stylo is enabled.
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: yes
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: low risk
[Why is the change risky/not risky?]: stylo-only change (though this is decreasingly relevant for risk); the two main changes -- ignoring styles in the tree, and clearing styles in the tree -- are relatively self contained changes, and we've added a bunch of release assertions to ensure we're doing the right thing
[String changes made/needed]: none
Attachment #8893665 - Flags: approval-mozilla-beta?
Duplicate of this bug: 1387531
Duplicate of this bug: 1387619
Duplicate of this bug: 1387136
https://hg.mozilla.org/projects/date/rev/8cba9d56c3571429f2a8aff2b1ba0e6c6d6df668
Bug 1384824 - Part 0: Consolidate lazy style resolution entrypoints. r=bholley

https://hg.mozilla.org/projects/date/rev/caed54430cf0394b66646337c8d1831c78740b4e
Bug 1384824 - Part 1: Lazily clear stale Servo element data from a document when its pres shell changes. r=bholley

https://hg.mozilla.org/projects/date/rev/d3f55d822e4e1374a7be8ae928777118c98a0058
Bug 1384824 - Part 2: Crashtests. r=bholley
Comment on attachment 8893665 [details] [diff] [review]
Part 0: Consolidate lazy style resolution entrypoints.

Fix for stylo crashes in 56. Let's uplift them for beta 2. 

heycam, can you link to the correct patches? (Or, maybe the sheriffs will understand from comment 51 what they need. )
Flags: needinfo?(cam)
Attachment #8893665 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: needinfo?(cam)
(In reply to Cameron McCormack (:heycam) from comment #51)
> [Is this code covered by automated tests?]: yes
> [Has the fix been verified in Nightly?]: yes
> [Needs manual test from QE? If yes, steps to reproduce]: no

Setting qe-verify- based on Cameron's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
You need to log in before you can comment on or make changes to this bug.