Closed Bug 1342823 Opened 3 years ago Closed 3 years ago

Crash in RefPtr<T>::RefPtr<T> | TakeFrameRequestCallbacksFrom

Categories

(Core :: DOM: Animation, defect, critical)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox-esr45 53+ fixed
firefox52 --- wontfix
firefox-esr52 53+ fixed
firefox53 + fixed
firefox54 + fixed
firefox55 + fixed

People

(Reporter: marcia, Assigned: bzbarsky)

References

Details

(4 keywords, Whiteboard: [adv-main53+][adv-esr52.1+][adv-esr45.9+][post-critsmash-triage])

Crash Data

Attachments

(6 files, 2 obsolete files)

2.57 KB, patch
Details | Diff | Splinter Review
3.17 KB, patch
farre
: review+
Details | Diff | Splinter Review
12.86 KB, patch
farre
: review+
abillings
: sec-approval+
Details | Diff | Splinter Review
14.78 KB, patch
farre
: review+
Details | Diff | Splinter Review
14.57 KB, patch
farre
: review+
Details | Diff | Splinter Review
14.12 KB, patch
farre
: review+
Details | Diff | Splinter Review
This bug was filed from the Socorro interface and is 
report bp-74aa902b-d29f-4f91-bc08-73ba42170226.
=============================================================

Seen while looking at nightly crash stats - crashes started using the 20170218030212 build: http://bit.ly/2lK81v3. Majority of crashes are on Windows 10.
Andreas probably knows what's up here.
Flags: needinfo?(afarre)
Oh, I misread the summary. Olli, any thoughts on this (or, if not, where it should live?)?
Component: DOM → DOM: Animation
Flags: needinfo?(afarre) → needinfo?(bugs)
Is this a new crash? There was similar fixed very recently I think.
Searching...
I'd say this is bug 1335450.
Marcia, are there more recent crashes with the same stack?
Flags: needinfo?(bugs) → needinfo?(mozillamarcia.knous)
Oh, or is this a regression from that bug, if this started 20170218030212.
Flags: needinfo?(bzbarsky)
(In reply to Olli Pettay [:smaug] (review request overload, closing the queue for a day or two) from comment #4)
> I'd say this is bug 1335450.
> Marcia, are there more recent crashes with the same stack?

looks as if there are few with Build 20170227030203
Flags: needinfo?(mozillamarcia.knous)
Did the crashes start then, or just continue?  Per discussion in bug 1335450 the patch for it should be more or less a no-op on trunk... but that still leaves the question of what's causing the crashes on trunk open.  :(

Maybe we should just give up and refcount these pointers in the refresh driver.  Then failure to remove ourselves will be a leak instead of a crash, of course...
Group: dom-core-security
Flags: needinfo?(bzbarsky)
See Also: → 1335450
Would use of nsWeakPtr be to slow?
But, if this is a regression, regression window would be good.
Duplicate of this bug: 1342411
Crash Signature: [@ RefPtr<T>::RefPtr<T> | TakeFrameRequestCallbacksFrom] → [@ RefPtr<T>::RefPtr<T> | TakeFrameRequestCallbacksFrom] [@ RefPtr<T>::RefPtr<T> | nsTArray_Impl<T>::AppendElement<T> | TakeFrameRequestCallbacksFrom]
Quoting myself from bug 1342411 comment 3:

The spike seems to start around Dec 14 2016, which is very close to the release
of 50.1.0 on 2016-12-13.  I can't find any crashes in 50.0* for example.

I think this is the list of changes that comprise 50.1.0:
https://hg.mozilla.org/releases/mozilla-release/pushloghtml?fromchange=FIREFOX_50_0_2_RELEASE&tochange=FIREFOX_50_1_0_RELEASE
See Also: → 1342374
Fwiw, the vast majority of crashes are on 32-bit:
414 (64-bit) vs 5563 (32-bit) in the past 6 months.
100% of crashes are on Windows for the first two signatures.

I did find a similar signature for Linux though which is probably this bug:
bp-12243d70-2761-4dc1-a7d3-7532b2170301
OS: Windows 8 → All
Hardware: Unspecified → All
Crash Signature: [@ RefPtr<T>::RefPtr<T> | TakeFrameRequestCallbacksFrom] [@ RefPtr<T>::RefPtr<T> | nsTArray_Impl<T>::AppendElement<T> | TakeFrameRequestCallbacksFrom] → [@ RefPtr<T>::RefPtr<T> | TakeFrameRequestCallbacksFrom] [@ RefPtr<T>::RefPtr<T> | nsTArray_Impl<T>::AppendElement<T> | TakeFrameRequestCallbacksFrom] [@ nsCOMPtr<T>::nsCOMPtr | TakeFrameRequestCallbacksFrom ]
I strongly suspect that one of the nsIDocument raw pointers in
mThrottledFrameRequestCallbackDocs have been destroyed:
http://searchfox.org/mozilla-central/rev/9c1c7106eef137e3413fd867fc1ddfb1d3f6728c/layout/base/nsRefreshDriver.h#480
When create a nsCOMPtr<nsIDocument> for it, we probably crash trying to increment
its ref count:
http://searchfox.org/mozilla-central/rev/9c1c7106eef137e3413fd867fc1ddfb1d3f6728c/layout/base/nsRefreshDriver.cpp#1552
Yeah, I don't see anything that easily guarantees that the raw pointer is removed from that array when the document is destroyed. Would it be so bad to just remove it in the dtor for nsIDocument? Or failing that, make those arrays strong references?
Can we please avoid dumping multiple issues together?  50.1 suffers from the obvious issue we fixed in bug 1335450.  But the crashes on nightly are much more confusing...  This bug is specifically about the nightly crashes, or at least I would prefer it to be.

> I don't see anything that easily guarantees that the raw pointer is removed from that array
> when the document is destroyed.

Well, the invariants around when we add/remove ourselves should guarantee it.  That is, by the time a document is being destroyed it should have no script global object, no presshell, etc, and any of those things going away should generally remove from the list.

That said, the logic around that is a bit complicated, and we're clearly screwing it up somewhere, given the nightly crashes.  :(

> Would it be so bad to just remove it in the dtor for nsIDocument?

Remove from where?  By the time ~nsIDocument is running mPresShell had better be null (because the presshell holds a strong ref to the document; see <http://searchfox.org/mozilla-central/rev/9c1c7106eef137e3413fd867fc1ddfb1d3f6728c/layout/base/nsIPresShell.h#1783>).  That means we can't reach the object to remove ourselves from.

> Or failing that, make those arrays strong references?

That's certainly doable.  Then this crash will turn into a leak...
And I strongly urge anyone willing to spend time on this to read bug 1335450 comment 7.  That describes the invariants we are trying to maintain and my audit of whether we are maintaining those invariants.  Again, clearly I'm missing _something_ if we still have crashes on trunk.  But I can't figure out what, and really would appreciate a second set of eyes on this...
> Can we please avoid dumping multiple issues together?  50.1 suffers from the obvious issue we fixed in bug 1335450.

Sorry about that.  I came in from the duped bug and didn't immediately notice
this was already fixed for the most part.

> This bug is specifically about the nightly crashes

Absolutely.
> Would use of nsWeakPtr be to slow?

Probably not, though we should measure....
Andreas, can you take a look here? See comment 16 and we're *hoping* this just requires another person to take a careful look.
Flags: needinfo?(afarre)
Assignee: nobody → afarre
Flags: needinfo?(afarre)
If you can't figure out what is going wrong, you could add a bool flag that tracks if the nsIDocument is in the callback list and crash if it still is in the dtor. That would at least defang the security issue, and might provide some insight into the actual problem by giving an earlier stack.
I need access to bug 1335450, can anyone help me there?
Flags: needinfo?(overholt)
(In reply to Andreas Farre [:farre] from comment #21)
> I need access to bug 1335450, can anyone help me there?

You are now cc'd.
Flags: needinfo?(overholt)
Duplicate of this bug: 1349110
Crash Signature: [@ RefPtr<T>::RefPtr<T> | TakeFrameRequestCallbacksFrom] [@ RefPtr<T>::RefPtr<T> | nsTArray_Impl<T>::AppendElement<T> | TakeFrameRequestCallbacksFrom] [@ nsCOMPtr<T>::nsCOMPtr | TakeFrameRequestCallbacksFrom ] → [@ RefPtr<T>::RefPtr<T> | TakeFrameRequestCallbacksFrom] [@ RefPtr<T>::RefPtr<T> | nsTArray_Impl<T>::AppendElement<T> | TakeFrameRequestCallbacksFrom] [@ nsCOMPtr<T>::nsCOMPtr | TakeFrameRequestCallbacksFrom ] [@ nsTArray_Impl<T>::AppendElement<T> | TakeF…
Note that the additional signature from bug 1349110 is all EXEC crashes.... :-(
Ok, found something that might be a lead. After sampling some crash reports it turns out that everything seems to come from ticking throttled rAFs.
(In reply to Andreas Farre [:farre] from comment #25)
> Ok, found something that might be a lead. After sampling some crash reports
> it turns out that everything seems to come from ticking throttled rAFs.

Well, I found some coming from the vsync path as well, but sofar only on linux.
I've been banging my head against this for a couple of days now, and I cannot for the life of me see how this happens. Boris comment about invariants holds up to my scrutiny, so it feels like something else is going on.

I've attached the patch that turns the arrays into strong references, and it is very small and the period where we'd leak a document is also limited, since actually running a rAF callback will remove the document.

Any thoughts about this?
Flags: needinfo?(continuation)
Flags: needinfo?(bzbarsky)
Attachment #8850916 - Flags: feedback?
> and the period where we'd leak a document is also limited, since actually running
> a rAF callback will remove the document

Yes, but what guarantees that we'll run an rAF callback?  What is the scope of a single refresh driver, especially with e10s?  I'd think it's a single tab, which means in a background tab with e10s we never run rAF on that refresh driver at all, unless I'm missing something.
Flags: needinfo?(bzbarsky)
Flags: needinfo?(continuation)
So I just went through the invariants bits again, and here are some variously-far-fetched hypotheses for what might be happening:

1)  Maybe we're unlinking the document while it's still registered with the refresh driver.  This will clear mFrameRequestCallbacks, but not unregister.  After that nothing will unregister, because it will see that we have no callbacks anyway, so are not registered.  This seems extremely unlikely to me, because presshell holds a strong ref to the doc, and it's _not_ cycle-collected, so I can't see any reason the document could get unlinked while the presshell is still alive.

2)  Maybe the refresh driver is not correctly removing the doc from its lists after calling TakeFrameRequestCallbacksFrom.  I checked that just now, and it looks ok.

3)  Maybe something under nsDocument::CreateShell causes rescheduling after we set mPresShell but before we call MaybeRescheduleAnimationFrameNotifications so we reschedule the same document twice.  I don't see how any of those things would, but maybe.

4)  Maybe the value of mPresShell->GetPresContext() or mPresShell->GetPresContext()->RefreshDriver() can change somehow during a presshell's lifetime.  I don't _think_ it can, but haven't managed to prove it yet.

5)  Maybe something in SetScriptGlobalObject runs between when we revoke/reschedule stuff and when mScriptGlobalObject changes.  This is actually not seeming implausible.  In particular, consider the case when SetScriptGlobalObject(nullptr) is called.  We will call RevokeAnimationFrameNotifications().  Then if we have a nonzero mOnloadBlockCount, we will RemoveRequest(mOnloadBlocker).  This will run our load event, which can call requestAnimationFrame.  At this point mScriptGlobalObject is not null yet, so that will schedule us.

#5 might be the most likely problem here, actually....

I don't know what we could do about #4 offhand, but for the other problems, and especially for #5... let me try something here.
Assignee: afarre → bzbarsky
Status: NEW → ASSIGNED
Try run (without mention of this bug#, and with the patch hidden in the try-syntax changeset, on top of some other stuff I wanted to test anyway), looks greenish.
[Tracking Requested - why for this release]:
Several of the signatures listed here affect 53 and 52.0.1/ESR, and we're seeing upwards of 400 crashes per day.

Note also that bug 1230817 covers almost the same set of signatures, and is open.  It was filed a year ago...  Al, should we dup and close it?  Let it sit idle to avoid suggesting attention (at least until this fix lands/ships)?
Flags: needinfo?(abillings)
the esr status flag was set by accident i think
Attachment #8851152 - Flags: review?(afarre) → review+
Attachment #8851153 - Flags: review?(afarre) → review+
Comment on attachment 8851153 [details] [diff] [review]
part 2.  Simplify the management of whether our document's frame request callbacks are scheduled

[Security approval request comment]
How easily could an exploit be constructed based on the patch?  It'll take some work.  I haven't managed to get a crash out of it yet, though I haven't tried too hard.  You have to figure out where the actual failure is (which we don't know yet, though comment 29 item 5 is a likely option).  If that's the one, you have to navigate away from a page in such a way that the navigation completes while there is a pending requestAnimationFrame _and_ before the load event for the page.  I haev some ideas for how to do that, with a bit of work.  Then you have to ensure the document gets gced before the next refresh driver tick.  Again, probably doable.  Then you have to figure out how to make use of your dead object.  Importantly, this is not the first time we're changing this code in a security bug in the last few months.  :(

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?  To some extent yes, in that it's clear it's to do with rAF callback scheduling.  But not what the problem with that scheduling is, no.

Which older supported branches are affected by this flaw? All of them, pretty much.

If not all supported branches, which bug introduced the flaw?

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?  Not yet.  Backporting to branches before 54 will require a bit of work because those branches don't have bug 1340086 fixed.

How likely is this patch to cause regressions; how much testing does it need?  I think this is fairly safe in terms of regression risk.
Attachment #8851153 - Flags: sec-approval?
(In reply to Randell Jesup [:jesup] from comment #33)
 
> Note also that bug 1230817 covers almost the same set of signatures, and is
> open.  It was filed a year ago...  Al, should we dup and close it?  Let it
> sit idle to avoid suggesting attention (at least until this fix lands/ships)?

I'd wait until we ship the fix and then dupe it to this bug.

Are we wanting to backport this to 53 and ESR52 or ESR45?

I'm giving sec-approval+ for trunk. We should at least take this there and Aurora.
Flags: needinfo?(abillings)
Attachment #8851153 - Flags: sec-approval? → sec-approval+
> Are we wanting to backport this to 53 and ESR52 or ESR45?

It's probably a good idea, yes.  I'll work on writing a backported patch.
Attachment #8850916 - Flags: feedback?
Comment on attachment 8851152 [details] [diff] [review]
part 1.  Move some one-bit booleans in nsIDocument next to all the other one-bit booleans

Approval Request Comment
[Feature/Bug causing the regression]: Probably requestAnimationFrame when we
   initially implemented it.
[User impact if declined]: Random crashes, probably exploitable if one tries
   hard enough.
[Is this code covered by automated tests?]: To some extent.  Clearly not for the
   crashing scenario.
[Has the fix been verified in Nightly?]:  Not yet.
[Needs manual test from QE? If yes, steps to reproduce]: We don't know; that's
   the problem.
[List of other uplifts needed for the feature/fix]:  None.
[Is the change risky?]: Somewhat.
[Why is the change risky/not risky?]:  It's refactoring some code and changing
   the flow of logic that is not very clear.  That said, I think this is
   something that is safe enough to backport and the regression risk is not
   too bad.
[String changes made/needed]:  None.
Attachment #8851152 - Flags: approval-mozilla-aurora?
Comment on attachment 8851153 [details] [diff] [review]
part 2.  Simplify the management of whether our document's frame request callbacks are scheduled

This and the previous change applies as-is on Aurora.
Attachment #8851153 - Flags: approval-mozilla-aurora?
Andreas, the main changes here are the IsAnimationsPaused() check in UpdateFrameRequestCallbackSchedulingState and the corresponding changes to SuppressEventHandling/ResumeAnimations.  But there were various other merge conflicts, so please look this over somewhat carefully.  My initial merge lost the UpdateFrameRequestCallbackSchedulingState() call in SuppressEventHandling, for example...
Attachment #8851744 - Flags: review?(afarre)
Attachment #8851744 - Flags: approval-mozilla-beta?
Comment on attachment 8851750 [details] [diff] [review]
Rollup backport for esr52; this had various merge conflicts (even starting with the 53 patch as a base), but not substantive changes..

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:
   It's sec-crit.
User impact if declined: The crashes will continue until morale improves. Then
   again, they may even with this patch.
Fix Landed on Version: 55.
Risk to taking this patch (and alternatives if risky): See comment 38.  Risk is
   moderate, but I'm not sure there are less-risky options.
String or UUID changes made by this patch: None.

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Attachment #8851750 - Flags: approval-mozilla-esr52?
Attachment #8851754 - Flags: approval-mozilla-esr45?
Attachment #8851152 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8851153 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8851744 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-aurora/rev/99da1ab37c37
https://hg.mozilla.org/releases/mozilla-aurora/rev/ae2b433951e6

NOTE: This can't be uplifted to Beta yet despite the approval since review on the rebased patch is still pending.
Attachment #8851750 - Attachment is obsolete: true
Attachment #8851750 - Flags: review?(afarre)
Attachment #8851750 - Flags: approval-mozilla-esr52?
Attachment #8851754 - Attachment is obsolete: true
Attachment #8851754 - Flags: review?(afarre)
Attachment #8851754 - Flags: approval-mozilla-esr45?
Attachment #8851838 - Flags: approval-mozilla-esr52?
Attachment #8851839 - Flags: approval-mozilla-esr45?
Attachment #8851744 - Flags: review?(afarre) → review+
Attachment #8851838 - Flags: review?(afarre) → review+
Attachment #8851839 - Flags: review?(afarre) → review+
Group: dom-core-security → core-security-release
Comment on attachment 8851839 [details] [diff] [review]
ESR45 patch merged a bit better

sec-critical fix for esr45
Attachment #8851839 - Flags: approval-mozilla-esr45? → approval-mozilla-esr45+
Comment on attachment 8851838 [details] [diff] [review]
ESR52 patch merged a bit better

sec-critical fix for esr52
Attachment #8851838 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52+
Duplicate of this bug: 1230817
How do things look here since the checkins?  Or do we not have enough data yet?
Flags: needinfo?(rjesup)
No crashes on any of the signatures with a build date after 2017-03-29-00:00:00
:-)
Flags: needinfo?(rjesup)
Whiteboard: [adv-main53+][adv-esr52.1+][adv-esr48.9+]
Whiteboard: [adv-main53+][adv-esr52.1+][adv-esr48.9+] → [adv-main53+][adv-esr52.1+][adv-esr45.9+]
Flags: qe-verify-
Whiteboard: [adv-main53+][adv-esr52.1+][adv-esr45.9+] → [adv-main53+][adv-esr52.1+][adv-esr45.9+][post-critsmash-triage]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.