Closed Bug 1382568 Opened 2 years ago Closed 2 years ago

stylo: crash in Servo_ResolveStyle loading Twitter

Categories

(Core :: CSS Parsing and Computation, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox-esr52 --- unaffected
firefox55 --- unaffected
firefox56 --- fixed
firefox57 --- fixed

People

(Reporter: heycam, Assigned: emilio)

References

(Blocks 1 open bug, )

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(3 files, 2 obsolete files)

Sometimes when loading twitter.com I get a crash in Servo_ResolveStyle, like:

  https://crash-stats.mozilla.com/report/index/f64a1d0e-dedc-4264-8a3c-cb5030170720

This may just be bug 1380957 with this assertion currently raised to a release assertion:

  http://searchfox.org/mozilla-central/rev/3a3af33f513071ea829debdfbc628caebcdf6996/servo/ports/geckolib/glue.rs#2796
Keywords: crash
Priority: -- → P1
Crash Signature: [@ mozalloc_abort | abort | core::option::expect_failed]
Duplicate of this bug: 1382816
Add more crash signatures this ServoStyleSet::GetContext() crash:

[@ alloc::oom::default_oom_handler | core::option::expect_failed | geckoservo::glue::Servo_ResolveStyle ]
[@ core::option::expect_failed | geckoservo::glue::Servo_ResolveStyle ]
Crash Signature: [@ mozalloc_abort | abort | core::option::expect_failed] → [@ alloc::oom::default_oom_handler | core::option::expect_failed | geckoservo::glue::Servo_ResolveStyle ] [@ core::option::expect_failed | geckoservo::glue::Servo_ResolveStyle ] [@ mozalloc_abort | abort | core::option::expect_failed]
(In reply to Chris Peterson [:cpeterson] from comment #2)
> Add more crash signatures this ServoStyleSet::GetContext() crash:
> 
> [@ alloc::oom::default_oom_handler | core::option::expect_failed |
> geckoservo::glue::Servo_ResolveStyle ]

All of the crashes with this signature are on the 2017-07-21 build, and seem like bug 1382357 since they have nsGenericHTMLElement::GetInnerText on the stack.  That bug's fix landed in the 2017-07-21 build though so I'm not sure what's going on there.

> [@ core::option::expect_failed | geckoservo::glue::Servo_ResolveStyle ]

This one's different.  Just one crash, on the 2017-07-19 build, during frame construction in response to a throttled style flush.  (Maybe Hiro knows if that specific crash would have been fixed by any of the animation-related fixes since then.)
(In reply to Cameron McCormack (:heycam) from comment #3)

> > [@ core::option::expect_failed | geckoservo::glue::Servo_ResolveStyle ]
> 
> This one's different.  Just one crash, on the 2017-07-19 build, during frame
> construction in response to a throttled style flush.  (Maybe Hiro knows if
> that specific crash would have been fixed by any of the animation-related
> fixes since then.)

Yes, the build (buildid: 20170719030206) did not include bug 1381431 yet), so this crash has presumably been fixed by bug 1381431.
Still persists on 20170725030209.

All of the reports are the same install time and have IsOrHasAncestorWithDisplayNone().

An example;
https://crash-stats.mozilla.com/report/index/6788e2c5-dab8-4c50-bfa3-9e4840170725

Cameron any thought?
Flags: needinfo?(cam)
We always flush styles just before calling IsOrHasAncestorWithDisplayNone, so our styles should be up to date.  I'm struggling to come up with a test that could result in that first ResolveStyleFor call passing an unstyled element.  I guess we should just re-write that code per Bobby's comment anyway.
Flags: needinfo?(cam)
Well, I'm not sure.  Re-writing the code to call into Rust to do that check isn't trivial, if we want to keep the current performance behavior of only resolving down to an element that is display:none.  If we don't care about that, then we can just replace the whole loop with a single ResolveTransientStyle() call, and then call IsInDisplayNoneSubtree() on the resulting style context.

I still can't think of a situation where we should hit this assertion.  We know the parent element has a frame and the child element has no frame.  If the parent has a frame, then it surely can't be in a display:none subtree.  And if that's the case, the child should be styled (it's either a display:none root or a normal displayed element).  And I don't think display:contents affects things here.

Re-writing the code might just be papering over the issue, and if we want to paper over it, we may as well just make it LazyComputeBehavior::Allow.
Cameron, may this be because that function doesn't use the flattened tree, but the normal DOM tree? That looks fishy to me (and it's wrong, I'm pretty sure).
Flags: needinfo?(cam)
It's possible, if it's one of the elements in the flattened tree path is the display:none one.  I would be a bit surprised if content could be calling innerText on an element that is shuffled around in the tree due to a binding, but maybe there is some case where that can happen I'm not thinking of.  Do you want to write up a quick patch to fix that?
Flags: needinfo?(cam) → needinfo?(emilio+bugs)
Sure thing.
Flags: needinfo?(emilio+bugs)
But actually, I wonder (In reply to Cameron McCormack (:heycam) from comment #6)
> We always flush styles just before calling IsOrHasAncestorWithDisplayNone,
> so our styles should be up to date.  I'm struggling to come up with a test
> that could result in that first ResolveStyleFor call passing an unstyled
> element.  I guess we should just re-write that code per Bobby's comment
> anyway.

I think that comment was just regarding us storing the styles on the element instead of on the frames, but with the display: none thing that's no longer true, so I'm removing and doing some spring cleanup there too.
(In reply to Cameron McCormack (:heycam) from comment #9)
> It's possible, if it's one of the elements in the flattened tree path is the
> display:none one.  I would be a bit surprised if content could be calling
> innerText on an element that is shuffled around in the tree due to a
> binding, but maybe there is some case where that can happen I'm not thinking
> of.  Do you want to write up a quick patch to fix that?

I think this can be possible with addons? IDK, let's leave the bug open and close if the patches fix the issue.
Keywords: leave-open
Comment on attachment 8891712 [details]
Bug 1382568: Avoid trying to resolve styles in an uninitialized presshell in GetInnerText.

https://reviewboard.mozilla.org/r/162774/#review168106
Attachment #8891712 - Flags: review?(cam) → review+
Comment on attachment 8891713 [details]
Bug 1382568: Use range-based for loop instead of array indexing in IsOrHasAncestorWithDisplayNone.

https://reviewboard.mozilla.org/r/162776/#review168108
Attachment #8891713 - Flags: review?(cam) → review+
Comment on attachment 8891714 [details]
Bug 1382568: Use the flattened tree instead of the DOM tree in IsOrHasAncestorWithDisplayNone.

https://reviewboard.mozilla.org/r/162778/#review168110

::: dom/html/nsGenericHTMLElement.cpp:2981
(Diff revision 1)
>    AutoTArray<Element*, 10> elementsToCheck;
> -  for (Element* e = aElement; e; e = e->GetParentElement()) {
> +  for (Element* e = aElement; e; e = e->GetFlattenedTreeParentElement()) {

Maybe add a comment in here saying something like that its the flattened tree descendants of a display:none element that have no frame or style data?
Attachment #8891714 - Flags: review?(cam) → review+
Pushed by ecoal95@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/6c55b6e879d3
Use AutoTArray in IsOrHasAncestorWithDisplayNone. r=heycam
https://hg.mozilla.org/integration/autoland/rev/874935d5a813
Use range-based for loop instead of array indexing in IsOrHasAncestorWithDisplayNone. r=heycam
https://hg.mozilla.org/integration/autoland/rev/30e92e587ea6
Use the flattened tree instead of the DOM tree in IsOrHasAncestorWithDisplayNone. r=heycam
Assigning to Emilio since he wrote those patches.  needinfo myself to check back in a few days to see if the crashes are gone.
Assignee: nobody → emilio+bugs
Status: NEW → ASSIGNED
Flags: needinfo?(cam)
Priority: P1 → --
Priority: -- → P1
Sadly, there is a new crash report which has the same call stack.  The buildid 20170801100311 actually includes the Emilio fix.
https://crash-stats.mozilla.com/report/index/2abaaa8d-55b1-4f57-acdd-455250170801

Another sad thing is that there is another crash report that was caused by throttled animation flush again. I guess Bobby's fix (bug 1384769) revealed a veiled bug.
https://crash-stats.mozilla.com/report/index/f3ab89cf-f23c-47d7-8e4f-504c90170801
There are 64 crashes in nightly 56 with buildid >=20170730100307 and the signature is ranked #2 in nightly top-crashers.
Keywords: topcrash
My bad, there are 198 crashes (not 64) with buildid >=20170730100307.
Duplicate of this bug: 1386488
If that helps, this website makes it very reproducible https://web.security.plumbing/hackability/
That is great, thanks Freddy!  Here's a crash report from that: https://crash-stats.mozilla.com/report/index/95612f97-0c5a-4db0-ac26-4f9a70170804

This particular assertion is going to get papered over with the patches in bug 1384824, but it would still be good to understand why this is happening.
following the STR from comment 28:
bp-580b0759-7559-4242-ba47-79eee0170804
Crash Signature: [@ alloc::oom::default_oom_handler | core::option::expect_failed | geckoservo::glue::Servo_ResolveStyle ] [@ core::option::expect_failed | geckoservo::glue::Servo_ResolveStyle ] [@ mozalloc_abort | abort | core::option::expect_failed] → [@ alloc::oom::default_oom_handler | core::option::expect_failed | geckoservo::glue::Servo_ResolveStyle ] [@ core::option::expect_failed | geckoservo::glue::Servo_ResolveStyle ] [@ mozalloc_abort | abort | core::option::expect_failed] [@ mozalloc_abort…
Version: unspecified → Trunk
So I went through this in the debugger, and there are no frames nor styles on the page. When we call ProcessPendingRestyles, we're arriving to the !DidInitialize() path at [1].

I'm going to post a patch that fixes the crash and I think it's equally correct (it changes the codepath we go through when the shell isn't initialized, but for the "no frames in the document" case it should be pretty equivalent to the GetTextContentInternal call).

If we don't we'd need to switch to lazy resolution in IsOrHasAncestorWithDisplayNone, which would be somewhat unfortunate.

[1]: http://searchfox.org/mozilla-central/rev/30a47c4339bd397b937abdb2305f99d3bb537ba6/layout/base/ServoRestyleManager.cpp#871
This particular instance of the problem is probably a regression from bug 1384162, which made us use the Servo style system on about:blank.
Blocks: 1384162
Attachment #8891713 - Attachment is obsolete: true
Attachment #8891714 - Attachment is obsolete: true
Comment on attachment 8891712 [details]
Bug 1382568: Avoid trying to resolve styles in an uninitialized presshell in GetInnerText.

MozReview doesn't handle pretty well leave-open bugs :)
Attachment #8891712 - Flags: review+ → review?(cam)
Comment on attachment 8891712 [details]
Bug 1382568: Avoid trying to resolve styles in an uninitialized presshell in GetInnerText.

https://reviewboard.mozilla.org/r/162774/#review170480

r=me (in case it's not clear from MozReview, and I guess we probably should have used a separate bug for this followup work).

::: dom/html/nsGenericHTMLElement.cpp:3054
(Diff revision 2)
>      nsIPresShell* presShell = nsComputedDOMStyle::GetPresShellForContent(this);
> -    if (!presShell || IsOrHasAncestorWithDisplayNone(this, presShell)) {
> +    if (!presShell || !presShell->DidInitialize() ||
> +        IsOrHasAncestorWithDisplayNone(this, presShell)) {
>        GetTextContentInternal(aValue, aError);
>        return;
>      }

Please add a comment in here to say why we need to check if the pres shell is initialized (as it's not a common condition to need to check).
Attachment #8891712 - Flags: review?(cam) → review+
Quick question... How do I add a test for this? Doesn't seem the kind of stuff it's trivial to test. Cameron, do you know if we can ensure the test loads in a secure context which isn't localhost?
(In reply to Emilio Cobos Álvarez [:emilio] from comment #37)
> Quick question... How do I add a test for this? Doesn't seem the kind of
> stuff it's trivial to test. Cameron, do you know if we can ensure the test
> loads in a secure context which isn't localhost?

I'm pretty sure you can take the existing http://mochi.test:8888/... URLs that the mochitests run from and use https:// URLs too.  For robustness you should probably generate that https:// URL from script by taking the test's window.location and replacing the http with https.
Flags: needinfo?(cam)
Xidorn found and pointed out to me that we support loading example.com/test/* from both secure and insecure origins, so I wrote a test based on that, and verified that:

 (1) It crashes without this patch.
 (2) It times out if it fails to load the iframe used for the test.
 (3) It passes now :-)
Comment on attachment 8894051 [details]
Bug 1382568: Test.

https://reviewboard.mozilla.org/r/165150/#review170502

Thanks, yeah looks like there are plenty of tests that hard code http://example.com/tests/blah URLs.
Attachment #8894051 - Flags: review?(cam) → review+
Pushed by ecoal95@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/14219b6f563e
Avoid trying to resolve styles in an uninitialized presshell in GetInnerText. r=heycam
https://hg.mozilla.org/integration/autoland/rev/cd4eb75e85ce
Test. r=heycam
Pushed by ecoal95@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/fe13b9b9e3ce
followup: Fix comment wording. r=me
Emilio, this bug was one of our top Stylo crashes from Nightly 56. We should probably uplift your fix to Beta 56 before we start our Stylo experiment on Beta.
Flags: needinfo?(emilio+bugs)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Keywords: leave-open
Resolution: --- → FIXED
Comment on attachment 8891712 [details]
Bug 1382568: Avoid trying to resolve styles in an uninitialized presshell in GetInnerText.

Approval Request Comment
[Feature/Bug causing the regression]: none
[User impact if declined]: Crashes with Stylo enabled.
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: No
[Why is the change risky/not risky?]: One-liner that handles a pretty hard to reach edge-case.
[String changes made/needed]: none
Flags: needinfo?(emilio+bugs)
Attachment #8891712 - Flags: approval-mozilla-beta?
Blocks: 1388031
That looks like a totally different stack trace than the one fixed here. Let's open a new bug for that one, since that's animation-related (all of those have FlushThrottledStyles on the stack trace).
(In reply to Calixte Denizet (:calixte) from comment #50)
> There still are 14 crashes in 57 with buildid >= 20170806100257 ([1]) for
> signature 'core::option::expect_failed |
> geckoservo::glue::Servo_ResolveStyle'.
> 
> [1]
> https://crash-stats.mozilla.com/search/
> ?signature=%3Dcore%3A%3Aoption%3A%3Aexpect_failed%20%7C%20geckoservo%3A%3Aglu
> e%3A%3AServo_ResolveStyle&build_id=%3E%3D20170806100257&product=Firefox&versi
> on=57.0a1&date=%3E%3D2017-07-31T11%3A16%3A46.000Z&date=%3C2017-08-
> 07T11%3A16%3A46.000Z&_sort=-
> date&_facets=url&_facets=install_time&_facets=version&_facets=build_id&_facet
> s=platform_pretty_version&_columns=date&_columns=signature&_columns=product&_
> columns=version&_columns=build_id&_columns=platform#crash-reports

I did check all 14 reports, all of them have FlushThrottledStyles symbol that I mentioned in comment 24, they are different stacks.

Filed bug 1388031 for it.
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
No longer blocks: 1388031
Target Milestone: --- → mozilla57
Flags: in-testsuite+
Comment on attachment 8891712 [details]
Bug 1382568: Avoid trying to resolve styles in an uninitialized presshell in GetInnerText.

Mitigates a crash with twitter, let's uplift for 56 beta 1.
Attachment #8891712 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Emilio Cobos Álvarez [:emilio] from comment #49)
> [Is this code covered by automated tests?]: Yes
> [Has the fix been verified in Nightly?]: Yes
> [Needs manual test from QE? If yes, steps to reproduce]: No

Setting qe-verify- based on Emilio's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
You need to log in before you can comment on or make changes to this bug.