Closed Bug 1404297 Opened 2 years ago Closed 2 years ago

Crash in nsIDocument::FlushPendingLinkUpdates

Categories

(Core :: DOM: Core & HTML, defect, P1, critical)

56 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox-esr52 - unaffected
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 + fixed
firefox60 + fixed

People

(Reporter: philipp, Assigned: mrbkap)

Details

(4 keywords, Whiteboard: [adv-main59+])

Crash Data

Attachments

(2 files, 3 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-55e48e11-4d66-492a-8087-0867a0170928.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	nsIDocument::FlushPendingLinkUpdates() 	dom/base/nsDocument.cpp:10330
1 	xul.dll 	nsCSSFrameConstructor::ResolveStyleContext(nsStyleContext*, nsIContent*, nsFrameConstructorState*, mozilla::dom::Element*) 	layout/base/nsCSSFrameConstructor.cpp:5194
2 	xul.dll 	nsCSSFrameConstructor::AddFCItemsForAnonymousContent(nsFrameConstructorState&, nsContainerFrame*, nsTArray<nsIAnonymousContentCreator::ContentInfo>&, nsCSSFrameConstructor::FrameConstructionItemList&, unsigned int) 	layout/base/nsCSSFrameConstructor.cpp:11200
3 	xul.dll 	nsCSSFrameConstructor::AddFCItemsForAnonymousContent(nsFrameConstructorState&, nsContainerFrame*, nsTArray<nsIAnonymousContentCreator::ContentInfo>&, nsCSSFrameConstructor::FrameConstructionItemList&, unsigned int) 	layout/base/nsCSSFrameConstructor.cpp:11199
4 	xul.dll 	nsCSSFrameConstructor::BeginBuildingScrollFrame(nsFrameConstructorState&, nsIContent*, nsStyleContext*, nsContainerFrame*, nsIAtom*, bool, nsContainerFrame*&) 	layout/base/nsCSSFrameConstructor.cpp:4729
5 	xul.dll 	nsCSSFrameConstructor::SetUpDocElementContainingBlock(nsIContent*) 	layout/base/nsCSSFrameConstructor.cpp:3025

this security sensitive crash has been around for a while- i've noticed it now because it is accounting for 0.46% of crashes in very early data for the rollout of 56.0 to release users.
Group: core-security → layout-core-security
Component: DOM → Layout
We seem to be crashing (with EXCEPTION_ACCESS_VIOLATION_READ) on the link->GetElement() deref here in FlushPendingLinkUpdates.

>  for (auto iter = mLinksToUpdate.Iter(); !iter.Done(); iter.Next()) {
>    Link* link = iter.Get();
>    Element* element = link->GetElement();

GetElement() is just a member-var accessor. So I think the "link" local variable must contain a bogus pointer here.  (In crash reports, it looks like sometimes it's null or near-null, sometimes it's 0xffffffffffffffff, sometimes it's 0xffffffffe5e5e5ed, and sometimes it's a random value like 0x74736a7e) 

Smaug, it looks like you touched this code ~recently in bug 1355540 (which landed in Firefox 55). Do you know how we might end up with a bogus pointer here?
Flags: needinfo?(bugs)
mLinksToUpdate keeps strong reference to the Link (which is an Element), so I don't quite see how that could be bogus. So far doesn't ring any bells what could cause this.
Flags: needinfo?(bugs)
My current guess is that we have a bug elsewhere and do extra release on some Element (which is a Link) and the cycle collector and snow white killer runs and the element is deleted. But there is this reference in mLinksToUpdate and we crash when we access it.
If that were what happened, though, then we'd our "link->GetElement()" call would still succeed just fine (because "link" would just be a pointer to a deleted object, and its non-virtual trivial GetElement() getter would just pull the bogus member-value out of that deleted object).

So I'm suspicious that mLinksToUpdate itself has been corrupted somehow...   Adding some small credence to this theory, we do have some crash reports on hashtable & iterator operations (on mLinksToUpdate & its iterators) inside FlushPendingLinkUpdates:
https://crash-stats.mozilla.com/search/?signature=~nsIDocument%3A%3AFlushPendingLinkUpdates&date=%3E%3D2017-08-01T21%3A09%3A00.000Z&date=%3C2017-09-29T21%3A09%3A00.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

(Note it was a hashtable before it was migrated to SegmentedVector in bug 1355540 -- so the PLDHash signatures are from older builds before bug 1355540.)
The only other clue I'm seeing at this point: these crashes (recent ones at least) seem to always be inside of mozilla::PresShell::Initialize().  So this is happening *very* early on in document lifetime -- which means this might be the first time FlushPendingLinkUpdates has ever been called on this document.

That fact made me suspect that we'd left some variable uninitialized here.  But the prime suspects seem to be initialized correctly -- the mLinksToUpdate's default constructor (for SegmentedVector) looks fine, and we do correctly initialize the mHasLinksToUpdate boolean "guard flag" to false in the nsIDocument constructor.
(moving back to DOM, since it looks like nsIDocument::FlushPendingLinkUpdates() is the problematic thing here, and nsCSSFrameConstructor seems to just be the unlucky thing that is calling into it [while resolving style for the root scrollframe as part of doing the first layout].)
Component: Layout → DOM
smaug, can you look into this further perhaps?
Flags: needinfo?(bugs)
I tried to look at this, but didn't really ring a bell, except a guess in comment 3.
link->GetElement() should always just work fine.
Flags: needinfo?(bugs)
Hi Andrew:

I have assigned these security bugs to you to reassign them to appropriate developers in your team to investigate and fix them.

Thanks!

Wennie
Assignee: nobody → overholt
Group: layout-core-security → dom-core-security
(FWIW I couldn't see this bug until it was assigned to me :)
Blake, please take a look here.
Assignee: overholt → mrbkap
Flags: needinfo?(mrbkap)
I'm still investigating. Based on some of the the more recent crashes being under the JS version of spinEventLoopUntil, my current theory is that this has to do with devtools being opened.
Priority: -- → P1
I've been staring at this for a while now and can't find anything. My best guess is that this has to do with the --wait-for-jsdebugger flag, but I haven't found anything that would explain this. I feel like I'm getting diminishing returns from my investigations; this really does seem like it could be a refcounting error, maybe in a failure case...
Flags: needinfo?(mrbkap)
May jryans has thoughts given he implemented --wait-for-jsbdebugger in bug 1275942?
Flags: needinfo?(jryans)
I don't see an obvious line that would connect wait for JS debugger to this specific crash...

Yes, it uses `spinEventLoopUntil` via JS, but there are various non-DevTools paths that also call this close to startup, so I don't believe we have enough info to pinpoint which it would be from the current stack.  Also, there are still quite a lot of recent reports which don't have this frame at all.

Most of the reports appear to be startup crashes (uptime less than 30 secs), but not all.

Not sure what else to offer at the moment, but I'll try to think about other angles on what's happening here.
Flags: needinfo?(jryans)
dholbert, given that this seems at a dead end, is it worthwhile to try to release-assert some in-variants and try to get closer to the actual source of the crash/corruption?
Flags: needinfo?(dholbert)
I don't know FlushPendingLinkUpdates or its invariants, nor have I dug around as much as smaug/mrbkap here, so I'll redirect comment 16's needinfo to them since they're more likely to have an answer.
Flags: needinfo?(dholbert) → needinfo?(bugs)
Flags: needinfo?(mrbkap)
Attached patch Debugging patch (obsolete) — Splinter Review
I can think of two ways for us to crash here: a refcounting/CC problem in Element code and us somehow re-entering this method and changing mLinksToUpdate out from under us. Let's try to catch that if it's happening.

We might need to uplift this to Beta to see it in action, though. I don't see any recent reports for this bug on Nightly.

MozReview-Commit-ID: FoQGVDWtwN1
Attachment #8941262 - Flags: review?(bzbarsky)
Flags: needinfo?(mrbkap)
Attached patch Debugging patch (obsolete) — Splinter Review
Sorry, missed a file in that last patch. bz, please see comment 18.

MozReview-Commit-ID: FoQGVDWtwN1
Attachment #8941265 - Flags: review?(bzbarsky)
Attachment #8941262 - Attachment is obsolete: true
Attachment #8941262 - Flags: review?(bzbarsky)
Comment on attachment 8941265 [details] [diff] [review]
Debugging patch

Since I was looking at this... r+
Flags: needinfo?(bugs)
Attachment #8941265 - Flags: review?(bzbarsky) → review+
Fwiw, this looks reasonable to me too.
Comment on attachment 8941265 [details] [diff] [review]
Debugging patch

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

Given that we don't have a testcase for this yet, it should be close to impossible.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

This might cause someone to look into link elements.

Which older supported branches are affected by this flaw?

Probably Beta and Release. It isn't clear if Nightly is affected.

If not all supported branches, which bug introduced the flaw?

Unknown.

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

This patch applies cleanly to Beta.

How likely is this patch to cause regressions; how much testing does it need?

This is a debugging patch. It should be extremely safe.
Attachment #8941265 - Flags: sec-approval?
Comment on attachment 8941265 [details] [diff] [review]
Debugging patch

sec-approval+ for trunk. I don't know if we need to take it into beta, given the caveats about difficulty in reproducing and the late stage of the cycle.
Attachment #8941265 - Flags: sec-approval? → sec-approval+
Let's land this on Nightly, then, and see if we get any reports there.
there are a handful of new reports from nightly: https://crash-stats.mozilla.com/signature/?product=Firefox&version=60.0a1&signature=nsIDocument%3A%3AFlushPendingLinkUpdates#reports

(also the volume for this crash seems to jump up after 58 went to release)
ni?me to look at the new stacks (which look much more helpful).
Flags: needinfo?(mrbkap)
This looks pretty bad. The new stacks show that we're spinning the event loop during frame construction, which is never a good sign. I independently wrote the same patch as found in bug 1361618, so I've been trying to figure out what might be causing the accessibility test to fail. Thus far, I haven't made too much progress on that front (I haven't seen the test fail, either on try or locally), but I hope to have something early next week. In the worst case, I'm hoping to reland the patch and see if intervening changes in Gecko have fixed the test failure (in particular, I'm hoping that chaos mode + rr might make it fail locally).
Chaos mode didn't reveal anything about the other bug. I'm going to try pushing that patch to central again to see if it sticks this time.

I think we can fix the specific crash in this bug with a targeted patch allowing modification of mLinksToUpdate.
Flags: needinfo?(mrbkap)
One more note: it appears that we're only spinning the event loop on startup if the places database is corrupted. That might make it harder to exploit this (and related problems) since I think that can only happen on startup and it would be harder for an attacker to control the environment enough to actually exploit it (they would also have to have a way to force us to think the places database is corrupt and I don't know if that's even possible).
Given that we have a SegmentedVector of nsCOMPtrs, it's probably worth
avoiding copying it.

MozReview-Commit-ID: GHyfVLrdnlQ
Attachment #8951450 - Flags: review?(bugs)
Some thoughts:
* It is very surprising that Link::LinkState can spin the event loop. It's probably worth filing a followup bug on that.
* bug 1361618 covers avoiding spinning the event loop during frame construction.
* I initially simply copied mLinksToUpdate into a local temporary, but then got scared that the History machinery wasn't re-entrant safe. Reading through the code didn't convince me otherwise. This patch ensures that nsIDocument::FlushPendingLinkUpdates won't re-entrantly call into it.

MozReview-Commit-ID: A31dMVw0yDx
Attachment #8951451 - Flags: review?(bugs)
Attachment #8951450 - Flags: review?(bugs) → review+
So bug 1361618 has possibly fixed the actual security bug?
Comment on attachment 8951451 [details] [diff] [review]
Change the way we iterate over our links to update.

> 
>-  AutoRestore<bool> saved(mIsLinkUpdateRegistrationsForbidden);
>-  mIsLinkUpdateRegistrationsForbidden = true;
>-  for (auto iter = mLinksToUpdate.Iter(); !iter.Done(); iter.Next()) {
>-    Link* link = iter.Get();
>-    Element* element = link->GetElement();
>-    if (element->OwnerDoc() == this) {
>-      link->ClearHasPendingLinkUpdate();
>-      if (element->IsInComposedDoc()) {
>-        element->UpdateLinkState(link->LinkState());
>+  auto restore = MakeScopeExit([&] { mFlushingPendingLinkUpdates = false; });
why not AutoRestore<bool> ?

>+  mFlushingPendingLinkUpdates = true;
>+
>+  while (!mLinksToUpdate.IsEmpty()) {
>+    auto links(Move(mLinksToUpdate));
don't use auto here. It is unclear from this code what the type of |links| is
Attachment #8951451 - Flags: review?(bugs) → review+
(In reply to Olli Pettay [:smaug] from comment #35)
> So bug 1361618 has possibly fixed the actual security bug?

It fixed a related security bug (spinning the event loop in frame construction). This patch is also necessary because we're still spinning the event loop while iterating over mLinksToUpdate, so could theoretically still recurse into FlushPendingLinkUpdates.

(In reply to Olli Pettay [:smaug] from comment #36)
> why not AutoRestore<bool> ?

In this patch, I made mFlushingPendingLinkUpdates a bitfield (bool:1). C++ doesn't allow references to bitfields. I could make the member a regular bool, if you think it's worth it.
I added a typedef for the SegmentedVector type because it's pretty complicated.
Attachment #8952549 - Flags: review?(bugs)
Attachment #8951451 - Attachment is obsolete: true
ahaa, it was that bitfield.
Attachment #8952549 - Flags: review?(bugs) → review+
Comment on attachment 8951450 [details] [diff] [review]
Add a move constructor to SegmentedVector

I don't think I need to request sec-approval for this patch. It's simply adding a move constructor so SegmentedVector users don't need to copy as many things.
Comment on attachment 8941265 [details] [diff] [review]
Debugging patch

Obsoleting as this has already landed and is being overwritten by the other patches here.
Attachment #8941265 - Attachment is obsolete: true
Comment on attachment 8952549 [details] [diff] [review]
Change the way we iterate over our links to update.

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

I think it would be difficult. as far as I've seen, this should only be possible very early in startup, so it would be difficult to even have a payload in memory or to control it.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

I think the patch pretty clearly points to a problem recursively calling into FlushPendingLinkUpdates (and with the crash reports on crash-stats one can even find *how* that happens), but again, with the only known way to cause this crash happening so early in startup, I don't know how to go about attacking it. Also, an attacker would need to find a way to corrupt Place's history database remotely.

Which older supported branches are affected by this flaw?

All.

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

Not yet, but they should be easy to make and not significantly more risky.

How likely is this patch to cause regressions; how much testing does it need?

I think this patch is low-risk. The core of the code is the same and most of the added code should be idle when we wouldn't have been crashing.
Attachment #8952549 - Flags: sec-approval?
sec-approval+ for trunk.
We'll want Beta and ESR52 patches nominated as well.
Attachment #8952549 - Flags: sec-approval? → sec-approval+
https://hg.mozilla.org/integration/mozilla-inbound/rev/b235338be4d99c2341184c39e96d46eb78342dae

Please request Beta approval on this when you get a chance - it grafts cleanly as-landed. It'll also need a rebased patch for ESR52 with approval request.
Flags: needinfo?(mrbkap)
Keywords: leave-open
Backed out for build bustages at build/src/dom/base/nsDocument.cpp

https://hg.mozilla.org/integration/mozilla-inbound/rev/8e44cbf5be82d0b949544ad5a40eced6c690b385

Push with bustage: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=b235338be4d99c2341184c39e96d46eb78342dae&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable

Build log: https://treeherder.mozilla.org/logviewer.html#?job_id=164719283&repo=mozilla-inbound

/builds/worker/workspace/build/src/dom/base/nsDocument.cpp:9658:49: error: use of deleted function 'mozilla::SegmentedVector<nsCOMPtr<mozilla::dom::Link>, 128ul, InfallibleAllocPolicy>::SegmentedVector(const mozilla::SegmentedVector<nsCOMPtr<mozilla::dom::Link>, 128ul, InfallibleAllocPolicy>&)'
/builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/SegmentedVector.h:54:7: error: 'mozilla::LinkedList<T>::LinkedList(const mozilla::LinkedList<T>&) [with T = mozilla::SegmentedVector<nsCOMPtr<mozilla::dom::Link>, 128ul, InfallibleAllocPolicy>::SegmentImpl<13ul>]' is private within this context
/builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/SegmentedVector.h:54:7: error: use of deleted function 'mozilla::LinkedList<T>::LinkedList(const mozilla::LinkedList<T>&) [with T = mozilla::SegmentedVector<nsCOMPtr<mozilla::dom::Link>, 128ul, InfallibleAllocPolicy>::SegmentImpl<13ul>]'
As a note: there aren't any instances of this crash on crash-stats for ESR. Looking at the immediate code (the nsDocument.cpp code), it isn't re-entrant safe, so there's no obvious reason that this wouldn't crash there too. It might be that ESR users deal with the error path (corrupted places database) much more rarely than non-ESR users.

I'm pushing a patch to try now and will ask for approval for it once I have the results from that push.
Flags: needinfo?(mrbkap)
Comment on attachment 8951450 [details] [diff] [review]
Add a move constructor to SegmentedVector

Approval Request Comment
[Feature/Bug causing the regression]: n/a
[User impact if declined]: Crashes on startup.
[Is this code covered by automated tests?]: It's exercised, though the crashing path is not.
[Has the fix been verified in Nightly?]: Not yet.
[Needs manual test from QE? If yes, steps to reproduce]: n/a
[List of other uplifts needed for the feature/fix]: only the two patches in this bug.
[Is the change risky?]: Low risk.
[Why is the change risky/not risky?]: The fix is a pretty standard transformation of code to iterate over a copy of an array and to not recur into the function doing the iteration. It isn't tricky in any way.
[String changes made/needed]: n/a
Attachment #8951450 - Flags: approval-mozilla-beta?
Comment on attachment 8952549 [details] [diff] [review]
Change the way we iterate over our links to update.

Approval Request Comment: see comment 48.
Attachment #8952549 - Flags: approval-mozilla-beta?
Comment on attachment 8951450 [details] [diff] [review]
Add a move constructor to SegmentedVector

Approved for Fx59RC1 and Fennec 59b15.
Attachment #8951450 - Flags: approval-mozilla-beta? → approval-mozilla-release+
Attachment #8952549 - Flags: approval-mozilla-beta? → approval-mozilla-release+
The lack of crashes on esr was bothering me so I looked into it more. The patch that added the SpinEventLoopUntil was landed for bug 1355561 in 56. Before that, it appears that the failure patch would do thing semi-synchronously, but wouldn't ever actually spin the event loop. Therefore, I don't think we need this patch on ESR52.
Flags: needinfo?(mrbkap)
Whiteboard: [adv-main59+]
Group: dom-core-security → core-security-release
Group: core-security-release
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.