Closed Bug 1699844 Opened 3 years ago Closed 3 years ago

promiseDocumentFlushed's refresh observer can stick around and cause crashes in nsGlobalWindowInner::DidRefresh

Categories

(Core :: Layout, defect)

Desktop
All
defect

Tracking

()

RESOLVED FIXED
89 Branch
Tracking Status
firefox88 --- wontfix
firefox89 --- fixed

People

(Reporter: Gijs, Assigned: emilio)

References

Details

Crash Data

Attachments

(2 files)

I ran into this in bug 1693008.

The code under test I'm adding is using promiseDocumentFlushed to react to layout changes when the title changes, so that it can add/remove an "overflown" attribute which adds a mask to the title of a dialog.

The test exercises this code, and uses a mutation observer to await the new attribute and text content, and then immediately closes the prompt.

This reliably crashed my opt builds with this stack:

 0  xul.dll!nsGlobalWindowInner::DidRefresh() [nsGlobalWindowInner.cpp:092ee6b0c9f2b093ae808df6ee8c9d49ef0d3b5d : 7060 + 0x0]
    rax = 0x0000000000000000   rdx = 0x0000024fdece2ea0
    rcx = 0x000000000000068f   rbx = 0x0000024fe060d568
    rsi = 0x0000024fe060d580   rdi = 0x0000024fe060d400
    rbp = 0x00007ff8c456a0a0   rsp = 0x000000ea2dbfcc00
     r8 = 0x7efefefefefefe00    r9 = 0x000000007ffee001
    r10 = 0x00000fff10b0ceee   r11 = 0x0000400000000000
    r12 = 0x0000024fd9261000   r13 = 0x00007ff888f69198
    r14 = 0x000000000242e350   r15 = 0x00007ff888f69101
    rip = 0x00007ff8858677ba
    Found by: given as instruction pointer in context
 1  xul.dll!nsRefreshDriver::Tick(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp) [nsRefreshDriver.cpp:092ee6b0c9f2b093ae808df6ee8c9d49ef0d3b5d : 2388 + 0xc]
    rbx = 0x0000024fe060d568   rbp = 0x00007ff8c456a0a0
    rsp = 0x000000ea2dbfcc50   r12 = 0x0000024fd9261000
    r13 = 0x00007ff888f69198   r14 = 0x000000000242e350
    r15 = 0x00007ff888f69101   rip = 0x00007ff886d5fedb
    Found by: call frame info
 2  xul.dll!mozilla::RefreshDriverTimer::TickRefreshDrivers(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp, nsTArray<RefPtr<nsRefreshDriver> >&) [nsRefreshDriver.cpp:092ee6b0c9f2b093ae808df6ee8c9d49ef0d3b5d : 324 + 0x1f]
    rbx = 0x0000024fe060d568   rbp = 0x00007ff8c456a0a0
    rsp = 0x000000ea2dbfe170   r12 = 0x0000024fd9261000
    r13 = 0x00007ff888f69198   r14 = 0x000000000242e350
    r15 = 0x00007ff888f69101   rip = 0x00007ff886d64980
    Found by: call frame info
 3  xul.dll!mozilla::RefreshDriverTimer::Tick(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp) [nsRefreshDriver.cpp:092ee6b0c9f2b093ae808df6ee8c9d49ef0d3b5d : 339 + 0x1f]
    rbx = 0x0000024fe060d568   rbp = 0x00007ff8c456a0a0
    rsp = 0x000000ea2dbfe200   r12 = 0x0000024fd9261000
    r13 = 0x00007ff888f69198   r14 = 0x000000000242e350
    r15 = 0x00007ff888f69101   rip = 0x00007ff886d647e1
    Found by: call frame info
 4  xul.dll!mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::TickRefreshDriver(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp) [nsRefreshDriver.cpp:092ee6b0c9f2b093ae808df6ee8c9d49ef0d3b5d : 699 + 0x27]
    rbx = 0x0000024fe060d568   rbp = 0x00007ff8c456a0a0
    rsp = 0x000000ea2dbfe2a0   r12 = 0x0000024fd9261000
    r13 = 0x00007ff888f69198   r14 = 0x000000000242e350
    r15 = 0x00007ff888f69101   rip = 0x00007ff886d64357
    Found by: call frame info
 5  xul.dll!mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::NotifyParentProcessVsync() [nsRefreshDriver.cpp:092ee6b0c9f2b093ae808df6ee8c9d49ef0d3b5d : 612 + 0x23]
    rbx = 0x0000024fe060d568   rbp = 0x00007ff8c456a0a0
    rsp = 0x000000ea2dbfe3a0   r12 = 0x0000024fd9261000
    r13 = 0x00007ff888f69198   r14 = 0x000000000242e350
    r15 = 0x00007ff888f69101   rip = 0x00007ff886d63dca
    Found by: call frame info
 6  xul.dll!mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::ParentProcessVsyncNotifier::Run() [nsRefreshDriver.cpp:092ee6b0c9f2b093ae808df6ee8c9d49ef0d3b5d : 502 + 0x5]
    rbx = 0x0000024fe060d568   rbp = 0x00007ff8c456a0a0
    rsp = 0x000000ea2dbfe460   r12 = 0x0000024fd9261000
    r13 = 0x00007ff888f69198   r14 = 0x000000000242e350
    r15 = 0x00007ff888f69101   rip = 0x00007ff886d64aff
    Found by: call frame info

When debugging, it becomes clear that presShell is null (so a debug build would hit the relevant debug assert) - the dialog (which is really just a same-process <browser> inside the browser.xhtml window) closing destroys its pres shell, and the document flushed code uses PresShell::AddPostRefreshObserver. Those don't get removed properly when the document goes away and things go ka-boom when they try to fire after the presshell is gone. Emilio suggested the PromiseDocumentFlushed code should add itself as a one-shot observer instead.

For the moment I can work around in the test by using await new Promise(requestAnimationFrame) before closing the dialog, but we should fix the implementation here.

So, just to confirm -- should these work as STR here? Or am I missing anything?

(1) Change browser/base/content/test/tabPrompts/browser_contentOrigins.js to look the way it did in the first uploaded version of your bug 1693008 patch -- i.e. this version: https://phabricator.services.mozilla.com/D109152?id=415810
(2) Run ./mach test browser/base/content/test/tabPrompts/browser_contentOrigins.js

I tried this on Mac in a debug build and didn't hit any crashes or assertions, FWIW (but I do see you were using an opt build, and in phabricator you mentioned Windows, so either of those could be a relevant difference between the env I've tested so far & your affected environment).

Flags: needinfo?(gijskruitbosch+bugs)

Yeah, those STR (with a debug+opt build on linux) were basically sufficient to trigger the bug.

I didn't hit the crash the first time, but I did when I ran it with --run-until-failure --repeat=1000 (and it only took ~2 minutes to crash).

I captured it in rr; I'll poke around a little and post what I find.

Flags: needinfo?(gijskruitbosch+bugs)

So it looks like this is what happens, roughly:

  • During a refresh driver tick, ProcessRestyledFrames calls RecreateFramesForContent
  • This causes a nsSubDocumentFrame (for an iframe or similar) to be destroyed.
  • its DestroyFrom method queues up an nsHideViewer runnable to be invoked as a script runner.

Then:

  • Near the end of that refresh driver tick, when the script blocker in DoFlushPendingNotifications() goes out of scope, our nsHideViewer runnable gets run, which (via a few intermediate layers) calls nsDocumentViewer::DestroyPresShell, which triggers Document::DeletePresShell, which nulls out the Document's mPresShell pointer. Here's that backtrace:
#0  mozilla::dom::Document::DeletePresShell() (this=0x7fb51d61e000) at $SRC/dom/base/Document.cpp:6776
#1  0x00007fb54adc65f0 in mozilla::PresShell::Destroy() (this=0x7fb50cc97000) at $SRC/layout/base/PresShell.cpp:1423
#2  0x00007fb54ae8118e in nsDocumentViewer::DestroyPresShell() (this=0x7fb52e38a580) at $SRC/layout/base/nsDocumentViewer.cpp:3647
#3  0x00007fb54ae7d699 in nsDocumentViewer::Hide() (this=0x7fb52e38a580) at $SRC/layout/base/nsDocumentViewer.cpp:2198
#4  0x00007fb54d2ad79f in nsDocShell::SetVisibility(bool) (this=0x7fb511a32c00, aVisibility=false) at $SRC/docshell/base/nsDocShell.cpp:5077
#5  0x00007fb546de3365 in nsFrameLoader::Hide() (this=0x7fb50e59aa00) at $SRC/dom/base/nsFrameLoader.cpp:1147
#6  0x00007fb54b153405 in nsHideViewer::Run() (this=0x7fb50cd0fe20) at $SRC/layout/generic/nsSubDocumentFrame.cpp:882
#7  0x00007fb5469556e9 in nsContentUtils::RemoveScriptBlocker() () at $SRC/dom/base/nsContentUtils.cpp:5569
#8  0x00007fb545e882d1 in nsAutoScriptBlocker::~nsAutoScriptBlocker() (this=0x7ffe34bfecf8) at $SRC/dom/base/nsContentUtils.h:3476
#9  0x00007fb54add6f9f in mozilla::PresShell::DoFlushPendingNotifications(mozilla::ChangesToFlush) (this=0x7fb522705000, aFlush=...) at $SRC/layout/base/PresShell.cpp:4203
#10 0x00007fb5469b1753 in mozilla::PresShell::FlushPendingNotifications(mozilla::ChangesToFlush) (this=0x7fb522705000, aType=...) at $OBJ/dist/include/mozilla/PresShell.h:1408
#11 0x00007fb54ad85d5b in nsRefreshDriver::Tick(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp) (this=0x7fb522751000, aId=..., aNowTime=...) at $SRC/layout/base/nsRefreshDriver.cpp:2190
#12 0x00007fb54ad93a92 in mozilla::RefreshDriverTimer::TickDriver(nsRefreshDriver*, mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp) (driver=0x7fb522751000, aId=..., now=...) at $SRC/layout/base/nsRefreshDriver.cpp:347
#13 0x00007fb54ad938b4 in mozilla::RefreshDriverTimer::TickRefreshDrivers(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp, nsTArray<RefPtr<nsRefreshDriver> >&) (this=0x7fb52f3bbd60, aId=..., aNow=..., aDrivers=nsTArray<RefPtr<nsRefreshDriver> > & = {...}) at $SRC/layout/base/nsRefreshDriver.cpp:325
#14 0x00007fb54ad936fe in mozilla::RefreshDriverTimer::Tick(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp) (this=0x7fb52f3bbd60, aId=..., now=...) at $SRC/layout/base/nsRefreshDriver.cpp:341

Then, later on in the same call to nsRefreshDriver::Tick (stack level 11), we hit the offending call to nsGlobalWindowInner::DidRefresh and assert/crash, because the nsGlobalWindowInner's mDoc is pointing to the inner document that has just been hidden and has had its mRefreshDriver pointer nulled out.

More investigation to be done, but that's what I've got so far

I'm seeing these crashes turn up some more on various jobs on https://treeherder.mozilla.org/jobs?repo=try&revision=cabf5e40b9404ba3c80570c6195ef0f11f9cba3b&selectedTaskRun=VAuc4_ZCS2WQLYdzV5Msiw.0 , e.g.:

https://treeherder.mozilla.org/jobs?repo=try&revision=cabf5e40b9404ba3c80570c6195ef0f11f9cba3b&selectedTaskRun=XRt5nJ34Tmqbp8jQTW9Qug.0
https://treeherder.mozilla.org/jobs?repo=try&revision=cabf5e40b9404ba3c80570c6195ef0f11f9cba3b&selectedTaskRun=bfJAcD9SRmSDSus6TZ9uJg.0
https://treeherder.mozilla.org/jobs?repo=try&revision=cabf5e40b9404ba3c80570c6195ef0f11f9cba3b&selectedTaskRun=FdJqX4pkQu28_unI5fF7Kw.0
https://treeherder.mozilla.org/jobs?repo=try&revision=cabf5e40b9404ba3c80570c6195ef0f11f9cba3b&selectedTaskRun=EmyJ1KxrQZeDIod5AqZANA.0

This includes Linux debug (last one). Fixing this (or finding a workaround) is likely going to block bug 1704616, which we'd like to land this week prior to 89 branching. To reproduce locally, it should be sufficient to run the test locally with the browser.proton.modals.enabled pref flipped to true (which you can also do with the mochitest framework).

Dan, did you get any closer to a solution here? At the time, Emilio suggested the PromiseDocumentFlushed code should add itself as a one-shot observer instead - would that be a possible solution?

Blocks: 1704616
Flags: needinfo?(dholbert)

I didn't get any closer beyond what I've posted above - I was mostly trying to dump some info about what's going on when this crash happens, since I did happen to catch it in rr (and I was intending to circle back to do some more poking around, but haven't been able to yet).

I'm not familiar with PromiseDocumentFlushed or one-shot observers, so I can't comment on emilio's suggestion at this point. emilio, do you have any cycles this week to investigate the approach that you suggested?

Flags: needinfo?(dholbert) → needinfo?(emilio)

Sure, will look into this.

Assignee: nobody → emilio
Flags: needinfo?(emilio)

Allow using the MOZ_KnownLive function to get around it.

Use case is the following: I have an std::function member variable, and I want
that member to be able to capture this.

Using a strong reference creates a cycle and thus would leak. I know this to
always outlive the member, so it is fine to use a weak capture there.

By rejecting the relevant promises, instead of crashing (and if we didn't crash
we'd leave the window registered as a refresh driver observer, which would be
bad).

For this, we modify the OneShotPostRefreshObserver API to deal with cancelation
due to shell destruction.

We fix APZ's usage of this API, which was doing something extremely weird
(returning a refcounted object in a UniquePtr).

Depends on D111850

Attachment #9215480 - Attachment description: Bug 1699844 - Make promiseDocumentFlushed handle presshell destruction correctly. r=smaug → Bug 1699844 - Make promiseDocumentFlushed handle presshell destruction correctly. r=smaug!,botond!
Blocks: 1705101
Blocks: 1705118
Pushed by ealvarez@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ee4b9934e074
Add an escape hatch for the refcounted inside lambda checker. r=andi
Pushed by ealvarez@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/472077cb1e38
Make promiseDocumentFlushed handle presshell destruction correctly. r=smaug,botond
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 89 Branch
Flags: in-testsuite+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: