Closed Bug 1405878 Opened 2 years ago Closed 2 years ago

stylo: Tab Crashes in geckoservo::glue::Servo_ResolveStyle

Categories

(Core :: CSS Parsing and Computation, defect, P2)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox57 + fixed
firefox58 + fixed

People

(Reporter: nachtigall, Assigned: bzbarsky)

References

Details

Crash Data

Attachments

(5 files)

I can see a reproducable crash on my network's router admin page. Here are a few crash reports from this:
* https://crash-stats.mozilla.com/report/index/d2aed79f-6c5f-499f-ac6f-1c3ca0171004
* https://crash-stats.mozilla.com/signature/?product=Firefox&signature=mozalloc_abort%20%7C%20abort%20%7C%20core%3A%3Aoption%3A%3Aexpect_failed%20%7C%20geckoservo%3A%3Aglue%3A%3AServo_ResolveStyle#summary
* https://crash-stats.mozilla.com/report/index/63090648-9762-4732-a88e-eff9f0170930#tab-details

I could add more this this happens reproducable.

mozregression says:

64:23.40 INFO: Last good revision: ebd299dd19d6f53ad7a22b55bd283911f77e0a88
64:23.40 INFO: First bad revision: 3265cc2501b6d23f2ece3cc27eaa6d87401bc7bc
64:23.40 INFO: Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=ebd299dd19d6f53ad7a22b55bd283911f77e0a88&tochange=3265cc2501b6d23f2ece3cc27eaa6d87401bc7bc


This is on a private LAN's router webconfig page so I cannot share it. FWIW, this is on table that's get rendered very slowly by the router row by row. If I just save the page (including assets like CSS/JS) with Chrome or an older working Firefox and then open this page with the crashing Firefox then it works. So it seems to me that this is caused by the the slow way in which the page is sent to the browser by the not so cpu-strong router.

Due to the pushlog I set NI and bug dependency - please correct in case I might be wrong.
Blocks: 1388466
Stylo panic. bholley, can you find somebody to take a quick look?
Flags: needinfo?(bobbyholley)
CCing some people who might be able to help. We're crashing trying to resolve styles on an unstyled element in nsCSSFrameConstructor::InitializeSelectFrame, in case that rings any bells.

Jens - can you reproduce in a debug build of firefox? If so, setting the env vars: |STYLO_THREADS=1 RUST_LOG=geckoservo=debug,style::traversal=debug| will produce a log that very often will help us debug the issue.

Alternatively, you could try taking a recording of the traffic using mitmproxy, which we could then play back on your end.
Flags: needinfo?(bobbyholley)
Priority: -- → P2
Yeah, maybe something funny about how select frames work. They have definitely some custom paths in the FC, but not sure without a test-case, or an idea of one...

Seeing the HTML & CSS of the page or something would be very valuable.
Component: Panning and Zooming → CSS Parsing and Computation
I think it would be easiest if I would dig a hole into our firewall and send someone (or two people) a temporary login to the local page. So you can get access to this page in order to easily reproduce yourself. Maybe someone can send me a private mail and then I can communicate the login via an encrypted channel (PGP, Signal or WhatsApp would be fine). Drop me a line at nachtigall__AT__web__DOT__de
Ok. Assuming you're in Germany, working with emilio probably makes sense, since he's in the Berlin office and is our most prolific stylo regression fixer in any case.
Flags: needinfo?(emilio)
On it.

Cool part is I can't repro on Nightly. I can repro very consistently in current dev-edition 57.0b5. So time to find a regression range I guess...
Assignee: nobody → emilio
Flags: needinfo?(emilio)
Err, I'm wrong, it does still repro on nightly (took a bit more), investigating a bit more...
Ok, so we're creating an HTML <option> element from the XML parser in:

 * http://searchfox.org/mozilla-central/rev/8efd128b48cdae3a6c6862795ce618aa1ca1c0b4/dom/xml/nsXMLContentSink.cpp#981

By the time we get there, the parent <select> element is already styled and laid out. That call doesn't notify at all so there's no way for the style system to notice that it has to style that element.

It's not clear to me how this is supposed to work, I'll try to build a test-case.
Summary: Tab Crashes in geckoservo::glue::Servo_ResolveStyle → stylo: Tab Crashes in geckoservo::glue::Servo_ResolveStyle
(In reply to Emilio Cobos Álvarez [:emilio] from comment #8)
> Ok, so we're creating an HTML <option> element from the XML parser in:
> 
>  *
> http://searchfox.org/mozilla-central/rev/
> 8efd128b48cdae3a6c6862795ce618aa1ca1c0b4/dom/xml/nsXMLContentSink.cpp#981
> 
> By the time we get there, the parent <select> element is already styled and
> laid out. That call doesn't notify at all so there's no way for the style
> system to notice that it has to style that element.

Pretty similar to what I saw in bug 1393637.
There's no notification, and layout and style have definitely run (on the parent) by then. Indeed the only reason we then (later) try to construct frames for the child is because we reframe the whole table that contains it.
Henri, can you help here? I'm not familiar with how the parser operates.

Once we've run layout on an element, what guarantees that layout will notice that a children has been inserted in that subtree, given the parser doesn't notify[1]?

[1]: http://searchfox.org/mozilla-central/rev/8efd128b48cdae3a6c6862795ce618aa1ca1c0b4/dom/xml/nsXMLContentSink.cpp#981
Flags: needinfo?(hsivonen)
So, the patch above fixes the problem for me, as expected, I guess. But I suspect it's not the right fix, given nothing else except the doc element insertion in that file seems to notify?

In any case, Jens, can you confirm (when builds are ready) that a build with this patch applied like:

  https://treeherder.mozilla.org/#/jobs?repo=try&revision=caa513c47f968c62bd04a7902521e5959c6f650f

Fixes the crash for you too?

I'm still trying to construct a test-case. Trying something naive with PHP + usleep isn't working as well as I'd hoped...
Flags: needinfo?(nachtigall)
FWIW, I don't suggest you run a custom build directly, because that may harm your profile.

It would probably be better to use mozregression to run single build with the given commit on try.
So boris pointed at me at how this is supposed to work, and indeed my patch is wrong. But knowing a bit more the setup (the parser notifying in CloseTag() and FlushTags()), made me find the bug.

In particular, we're flushing content _after_ popping the select element from the stack, but before checking mNotifyLevel for the children.

That FlushTags overrides mNotifyLevel, which means we never notify the children.

This is an nsXMLContentSink bug per se, not a stylo bug, fwiw...

ni? Boris since he wanted to take a closer look at this.
Flags: needinfo?(hsivonen)
Err, fail
Flags: needinfo?(bzbarsky)
(In reply to Emilio Cobos Álvarez [:emilio] from comment #13)
> In any case, Jens, can you confirm (when builds are ready) that a build with
> this patch applied like:
> 
>  
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=caa513c47f968c62bd04a7902521e5959c6f650f
> 
> Fixes the crash for you too?

Yes, using this build (I've done `mozregression --repo try --launch caa513c47f968c62bd04a7902521e5959c6f650f`) the crash is fixed.
Flags: needinfo?(nachtigall)
So ideally we would fix bug 363688 or something.  The HTML sink doesn't have this deferred-notification setup at all; see the minimal work nsHtml5TreeOpExecutor::FlushPendingNotifications does.

In the meantime, the XML sink is buggy as heck with <select>; this bug is just triggering some of that.  For example:

  <html xmlns="http://www.w3.org/1999/xhtml">
    <select>
      <option>Hello there</option>
    </select>
  </html>

does not show the option when loaded as XML, precisely because of Emilio's stack and failing to notify on the option...

Anyway, I think the right fix here is to make sure we do that whole "notify on the kids if needed" bit after we pop the stack, but before we do anything else that can cause reentry and mess up our state.  I just did some historical digging, and this is what the HTML sink used to do too, which is pretty hopeful risk-wise.  ;)  See http://searchfox.org/mozilla-central/rev/836aa0b2844d25a457699424800efcbb3e4b53d8/content/html/document/src/nsHTMLContentSink.cpp#1325-1362 which does the notification bits right after popping the stack and before it starts calling DoneAddingChildren on anything.  

I'll take this, since I'd want a second opinion on the review anyway and Emilio has enough on his plate right now.
Assignee: emilio → bzbarsky
Flags: needinfo?(bzbarsky)
Once we call DoneAddingChildren, random code of various sorts will run, which
can flush our notification state.  If that happens before we've notified on our
kids, but after we've popped the element we're closing off the element stack,
we will fail to ever notify on the kids.

MozReview-Commit-ID: Ei7v5OobX8R
Attachment #8916123 - Flags: review?(hsivonen)
[Tracking Requested - why for this release]: Stylo crash regression.  This is a longstanding XML-parsing bug (e.g. the reftest fails pre-stylo), but with stylo it violates even more invariants.
Attachment #8916123 - Flags: review?(hsivonen) → review+
Based on Comment 22 I think we should + this for 57/58. Please nominate for uplift to beta after the fix lands on central.
Pushed by bzbarsky@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f144e1434312
Make sure to notify for our kids, if needed, before calling DoneAddingChildren in the XML content sink.  r=hsivonen
Comment on attachment 8916123 [details] [diff] [review]
Make sure to notify for our kids, if needed, before calling DoneAddingChildren in the XML content sink

Approval Request Comment
[Feature/Bug causing the regression]: Stylo, sort of, but the actual bug has
   been around for a long time.
[User impact if declined]: (Safe) crashes able to be triggered by XML involving
   <html:select>.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: Yes.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: Yes.
[Why is the change risky/not risky?]: This is changing the XML parser, which is
   not exactly a simple area of the code.  I'm somewhat confident about the fix,
   in the sense that it aligns the behavior better with what the HTML parser
   used to have, which was much more extensively tested than the XML parser. 
   But I do not have super-high confidence that I am not introducing other bugs
   (including correctness and security bugs) in the process...  If it were not
   for the stylo dependency where this causes crashes on code that used to not
   crash, I would not even consider suggesting this for uplift.
[String changes made/needed]:
Attachment #8916123 - Flags: approval-mozilla-beta?
https://hg.mozilla.org/mozilla-central/rev/f144e1434312
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Crash Signature: [@ mozalloc_abort | abort | core::option::expect_failed | geckoservo::glue::Servo_ResolveStyle]
Comment on attachment 8916123 [details] [diff] [review]
Make sure to notify for our kids, if needed, before calling DoneAddingChildren in the XML content sink

Fix a crash, taking it.
Attachment #8916123 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Should be in 57b8
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #25)
> [Is this code covered by automated tests?]: Yes.
> [Has the fix been verified in Nightly?]: Yes.
> [Needs manual test from QE? If yes, steps to reproduce]: No.

Setting qe-verify- based on Boris's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
You need to log in before you can comment on or make changes to this bug.