Closed Bug 1023758 Opened 10 years ago Closed 10 years ago

Incremental cycle collection does not properly handle dead traversed nodes, leading to CSS use-after-free

Categories

(Core :: XPCOM, defect)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla33
Tracking Status
firefox31 --- unaffected
firefox32 + unaffected
firefox33 + fixed
firefox-esr24 --- unaffected
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: gkw, Assigned: mccr8)

References

Details

(Keywords: csectype-uaf, regression, sec-critical)

Attachments

(5 files, 1 obsolete file)

Attached file stack (obsolete) —
+++ This bug was initially created as a clone of Bug #1021612 +++

I tested Steven's asan build ( http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg ) from Jun 9 (bug 997908 comment 106) and when I was loading a Wordpress admin/edit page, the asan build crashed with a heap buffer overflow at ContentEnumFunc.

( cc'ing :peterv, :bz, :dbaron and :khuey as randomly picked from searching "CSS" at https://wiki.mozilla.org/Modules/All - needinfo? selectively from :peterv and :bz, please look at it whoever gets to it first)

Not sure if DOM: CSS Object Model is the right component - please also feel free to assign this to the right one.

Setting s-s and sec-critical until we know what is going on.
Flags: needinfo?(peterv)
Flags: needinfo?(bzbarsky)
Component: DOM: CSS Object Model → CSS Parsing and Computation
Flags: needinfo?(peterv)
Interesting.  Could this be what's behind the ReleaseSliceNow() crashes at bug 997908?
Note possibly related bug 1011391 and bug 1013603.

Gary, can we possibly get line numbers out of ASAN here?
Flags: needinfo?(gary)
Flags: needinfo?(bzbarsky)
Steven says "Probably not, but I'm not 100% sure." as per bug 1021612 comment 5.
Flags: needinfo?(gary) → needinfo?(bzbarsky)
That makes it pretty hard to tell what we're overflowing, exactly.  :(
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] from comment #4)
> That makes it pretty hard to tell what we're overflowing, exactly.  :(

Any way around this, Steven? Does this mean I *have* to do my own ASan builds?

If so, I'd think I have to get your mozconfigs, as well as your local ASan changes. Also, I'm on LLVM rev 200213 as per bug 957865 and https://developer.mozilla.org/en-US/docs/Mozilla/Testing/Firefox_and_Address_Sanitizer#LLVM.2FClang
Flags: needinfo?(smichaud)
I've now documented exactly how I do my Mac ASan builds, here:
http://people.mozilla.org/~stmichaud/bmo/firefox-asan-howto.txt

Let me know if you have any questions.
Flags: needinfo?(smichaud)
Most of what I've learned about ASan builds I learned in the last week or two.  I do remember seeing something about line numbers in ASan logs.  Let me see if I can find it again.
The best info I can find on the line numbers issue is this:
http://stackoverflow.com/questions/23554398/get-line-numbers-with-addresssanitizer-output

Which, needless to say, is pretty inconclusive.  But other things I've seen imply that it *is* possible -- for example:
http://www.chromium.org/developers/testing/addresssanitizer

At the very least, though, I'll probably have to change one or more build parameters and redo my build.
This is probably the answer -- the issue is probably a clang/llvm-symbolizer bug:
https://code.google.com/p/address-sanitizer/issues/detail?id=207

But running dsymutil on the XUL in my current Mac ASan builds doesn't fix the problem.  So I'll probably need to rebuild with the "-g" flag (which I'm currently not using), *then* run dsymutil manually on XUL (at least).

It may be a while before you see line numbers, Boris :-(
(In reply to Steven Michaud from comment #9)
> But running dsymutil on the XUL in my current Mac ASan builds doesn't fix
> the problem.  So I'll probably need to rebuild with the "-g" flag (which I'm
> currently not using), *then* run dsymutil manually on XUL (at least).

Steven, I'll await your new build to try this out again.
Flags: needinfo?(smichaud)
I've spent several more hours on this, without any luck.  Currently I have no idea when or if I'll be able to get line numbers into ASan logs.  So Boris, please try to make do without them.
Flags: needinfo?(smichaud)
(In reply to Steven Michaud from comment #6)
> I've now documented exactly how I do my Mac ASan builds, here:
> http://people.mozilla.org/~stmichaud/bmo/firefox-asan-howto.txt
> 
> Let me know if you have any questions.

Just curious, Steven - your instructions are for Lion 10.7, have you tried on a 10.9 machine? Or are you keeping 10.7 for a reason?
Flags: needinfo?(smichaud)
I haven't tried on a 10.9 machine (or a 10.8 one).  I'm sticking to 10.7 (and the 10.6 SDK) because I want these builds to work on all our supported versions of OS X.
Flags: needinfo?(smichaud)
I think I've got the line numbers problem fixed ... at least mostly.  New build coming up shortly.

I had to fix a couple of bugs to stop llvm-symbolizer (and the code that invokes it) from crashing while processing the Dwarf debugging info for such an enormous (and complex) file as XUL.
I spoke too soon.  I've found another stupid llvm crash bug.

With luck I'll have a new build in the next few days, though.
What do you expect to learn from the line numbers?  It seems like a testcase would be vastly more informative, but I'm not sure what line numbers would add to what we know from the existing stacks (which are somewhat useful but don't point to a problem).
Well, we'd know what exact thing we're overflowing...

But yes, a testcase would help even more.
Also, the ASAN output here is not a heap buffer overflow -- it's a use after free.

And, in particular, it's a use after free of part of the selector data structure (I think more likely an nsCSSSelector than an nsCSSSelectorList, though I didn't look too closely, and don't think it matters much) of a style rule that's inside an @media rule which is inside a style sheet, which we're somehow trying to use after the @media rule has been cycle-collected.  (I don't see how we'd know from the stacks whether or not the sheet is still alive, although there might be a way.)
Summary: AddressSanitizer: heap-buffer-overflow [@ ContentEnumFunc] → AddressSanitizer: use after free [@ ContentEnumFunc]
Blocks: 911246
As it turns out, the latest llvm bug wasn't hard to fix.  So I've created a new Mac ASan build (using current m-c code), and have uploaded it to the usual place:

http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg

I've also updated my readme:
http://people.mozilla.org/~stmichaud/bmo/firefox-asan-howto.txt

You should now see line numbers for XUL ... or at least for most of it.  I followed the general strategy in comment #6 -- I ran dsymutil on XUL to create a separate directory (XUL.dSYM) containing Dwarf debug information, readable by llvm-symbolizer.  But I also had to fix three separate llvm crash bugs, and I'm not sure I've got them all.  So please let me know if you see any llvm crashes.

The DMG file is now *huge* -- more than 500 MB, and about five times the size of the previous build's DMG.  You will need lots of RAM to run this build :-)

You'll also need to use the embedded llvm-symbolizer (in Nightly.app/Contents/MacOS), instead of whatever llvm-symbolizer you may have built yourself.  Only mine has the fixes to stop it from choking on XUL.dSYM.
(In reply to Steven Michaud from comment #19)
> http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg

I'm now testing this build, but when I re-tried going to the same place (Wordpress admin page) where I hit the bug, this issue wouldn't occur again. Will keep trying.

(In reply to Steven Michaud from comment #19)
> You'll also need to use the embedded llvm-symbolizer (in
> Nightly.app/Contents/MacOS), instead of whatever llvm-symbolizer you may
> have built yourself.  Only mine has the fixes to stop it from choking on
> XUL.dSYM.

Yep, using the embedded llvm-symbolizer too.
Depends on: 1011391
Boris, here's the stack with line numbers as hoped for.

Hit this while surfing Twitter. I noticed prior to crashing that it was substantially slower (e.g. trudging along) than after restart.

Steven, I'm not sure what the m-c rev is for this build, did I miss the changeset rev from which you compiled your ASan build from? (so the line numbers can match up with source)
Attachment #8438283 - Attachment is obsolete: true
Flags: needinfo?(smichaud)
Flags: needinfo?(bzbarsky)
#10 0x1091fb7d2 in mozilla::css::GroupRule::cycleCollection::Unlink(void*) /usr/local/src/Mozilla/asan/mozilla-central/layout/style/nsCSSRules.cpp:568

Cycle collection at / involving GroupRule?
> Steven, I'm not sure what the m-c rev is for this build

It's http://hg.mozilla.org/mozilla-central/rev/f00616c14171.  I got that by running the build and doing about:buildconfig :-)
Flags: needinfo?(smichaud)
(In reply to Steven Michaud from comment #23)
> It's http://hg.mozilla.org/mozilla-central/rev/f00616c14171.  I got that by
> running the build and doing about:buildconfig :-)

Oh yeah, I should have remembered that. Nonetheless, this stack shows your build with line numbers worked. :)

I'd suggest filing a new bug to get some of the workarounds to LLVM(?) landed and see if we can get regular (nightly?) ASan builds with line numbers, if there isn't any yet. Judging by the 2 s-s bugs found recently by my real-world testing, I would be surprised if we feel it isn't worth it.
Upstreaming my LLVM patches will take some time.  It's complicated by the fact that I'm not using the latest version of LLVM, since I haven't (yet) gotten that to work properly on OS X 10.7.  Also by the fact that I don't want to spend much time on it, now that I've got working builds.  I'd be very happy if someone else picked that task up ... but I'm not sure when/if that'll happen.

Even more complicated is the automation of Mac ASan builds.  I imagine that will take months ... if not longer.

I'll at least open some bmo bugs on these subjects, probably sometime next week.

In the meantime I'm happy to keep doing the builds by hand (about once a week), and to fix occasional LLVM bugs as they show up.  That will presumably be a lot less work than the alternative.

And yes, I think these builds have been a big success.  You've found two security bugs in just days.  And I found one bug myself, though it's not a security bug (bug 125078).
OK, so there are two obvious possibilities.

Either we're screwing up CC somehow and destroying a GroupRule that should be live, or we're failing to notify the style set when a sheet is removed from a document.

Gary, can you reproduce this reliably enough to try bisecting, by any chance?  We may need to fall back to that if we don't come up with any better ideas...

GroupRule itself is pretty simple in terms of CC stuff; it's DOMCSSStyleRule and DOMCSSDeclarationImpl that are complicated....
Flags: needinfo?(bzbarsky) → needinfo?(gary)
(In reply to Boris Zbarsky [:bz] from comment #27)
> Gary, can you reproduce this reliably enough to try bisecting, by any
> chance?  We may need to fall back to that if we don't come up with any
> better ideas...

Nope. (1) It is not reliable, and (2) I rely on builds produced by Steven, which tend to be once a week, and which only started these couple of days.

For the crash in bug 997908 (which is what led me to use Steven's ASan builds), I noticed ReleaseSliceNow crashes happening from when Fx31 was nightly, so that might be a start (assuming the ReleaseSliceNow crashes are actually this bug).

If the 2 bugs are completely unrelated, then I really have no idea when it started.
Flags: needinfo?(gary)
> Nope. (1) It is not reliable

My browser usage is such that I have upwards of 250-300 tabs open, and I almost always never restart my browser - sometimes for a week. During this period, a lot of tabs get opened and closed, in all sorts of languages (incl. CJK), types (light vs video-heavy) and age (I sometimes close old tabs, not necessarily only new recently-opened ones).

Plus, I also mentioned in comment 20 that my initial steps for causing this issue no longer worked, and I triggered this again later via a separate STR, that was also unreproducible.
This is almost certainly somehow due to ICC, given that the top crash started on the day ICC landed, and the free is happening in the CC.  ICC can be disabled by setting dom.cycle_collector.incremental to false and see if it feels like it doesn't happen then, or something.  I should try out these ASAN builds, too.

I'm not really sure how this could be happening.  I'd think that even if the CC is destroying stuff that should be alive, that the destruction sequence we are creating would go and destroy all of the weak references to the objects.  But I don't really understand the relationship between all of these CSS style things.

One thing I noticed is that the unlink has an optimization where it doesn't clear out weak refs if tmp->GetStyleSheet() is null, which is a difference from what happens in the dtor.

I have a patch sitting around to cycle collect nsCSSRuleProcessor (bug 990160) and maybe that will magically fix this...
> that the destruction sequence we are creating would go and destroy all of the weak
> references to the objects.

In what sense?

The rule processor data has pointers into style rule internal bits, effectively, and relies on being notified when rules are removed from sheets or sheets are removed from the document.  If a rule or sheet is destroyed without those notifications getting sent, the rule processor data will have dangling pointers.
Is incremental cycle collection about doing the traversal incrementally or about doing the freeing incrementally?  If it's the latter, there could be interesting dependencies related to destruction order.
It just does graph building/traversal incrementally.  There were too many horrible things happening with incremental unlinking, like weak refs becoming strong refs after we'd decided an object was dead.

The main problem is that the CC can become confused about whether an object is alive or not, because, say, its refcount could increase after we traverse it, but before it traverses the things that point to it, so the CC thinks it sees all of the references to an object.  To counter this, any object that is addrefed or released during the graph building phase is treated as being live.  (This avoids leaks because any such object is in the purple buffer, so it will be looked at again in the next CC.)

(Technically, if you are doing a swap() of a cycle collected object's pointer into another cycle collected object, this same confusion could occur, because there is no addref/release, but I think it would be unlikely that we do 0 addref/release of an object during an entire graph building phase, due to holding things in stack vars.)
Summary: AddressSanitizer: use after free [@ ContentEnumFunc] → AddressSanitizer: use after free [@ ContentEnumFunc] or [@ nsCOMPtr<nsIAtom>::get]
Here's another stack, which now shows additional SEGV info about IPC in a second stack.
The first stack is actually bug 1011391 -- but that's very closely related to this bug.

The second stack is just a plugin killing itself after the main process dies.
(In reply to David Baron [:dbaron] (busy May 31-June 15) (UTC-4) from comment #32)
> Is incremental cycle collection about doing the traversal incrementally or
> about doing the freeing incrementally?  If it's the latter, there could be
> interesting dependencies related to destruction order.
Well, snowwhite already made objects deletion to happen possibly incrementally, and nothing guarantees
the order. ICC may have changed when snowwhite killer does its job.

So GroupRule is the only CCable Rule? Other rules are deleted immediately when refcnt drops to 0,
but GroupRule stays alive for awhile...
I've just updated my Mac ASan build to current trunk code:

http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg
http://people.mozilla.org/~stmichaud/bmo/firefox-asan-howto.txt

Gary, I suggest you test with this for a few days with ICC turned off (with dom.cycle_collector.incremental set to false).  Then if you don't see any CC-related crashes (and if the rest of you haven't yet figured out how to fix this bug), I suggest we turn ICC off by default on the trunk for at least a week, to see what effect it has on our crash stats.
(In reply to Steven Michaud from comment #37)
> Gary, I suggest you test with this for a few days with ICC turned off (with
> dom.cycle_collector.incremental set to false).

Testing now, thanks!
> So GroupRule is the only CCable Rule?

No, page rules, font face rules, and DOMCSSStyleRule are ccable.

So I've been reading through our code for removing/etc rules again.  Removal (or addition) of a rule (or sheet) calls PresShell::RecordStyleSheetChange; the following PresShell::EndUpdate then does a ReconstructStyleData().  But that just posts an event, afaict...  So now I'm not quite sure how this is supposed to work.
Oh, I see.  Removing rules from a sheet is supposed to ClearRuleCascades(), which clears the rule cascades on all the rule processors that sheet is involved in...

And adding/removing sheets ends up marking the style set dirty, which drops the old rule processors altogether.
I hit this crash when on youtube in gdb in an opt build yesterday, but we weren't able to get much useful info.

I wasn't able to reproduce, but I do get a lot of the following assertions in debug builds using my profile on youtube and resizing the window:

> ###!!! ASSERTION: style context has old rule node: 'n == mRuleTree', file layout/style/nsStyleSet.cpp, line 238
> ###!!! ASSERTION: old rule tree still referenced: 'Not Reached', file layout/style/nsStyleSet.cpp, line 1779

If that's at all relevant, I can try to bisect my profile/tabs to a test case
Assignee: nobody → continuation
I've updated my Mac ASan build again to current trunk code:

http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg
http://people.mozilla.org/~stmichaud/bmo/firefox-asan-howto.txt

Gary, how has your testing gone over the last week?  Did you see any CC-related crashes with ICC turned off?
Flags: needinfo?(gary)
(In reply to comment #41)

John, I don't know if the assertions you saw are relevant.  But if you can easily reproduce them, it'd be worthwhile finding a regression range for them in m-c debug nightlies.
(In reply to Steven Michaud from comment #42)
> Gary, how has your testing gone over the last week?  Did you see any
> CC-related crashes with ICC turned off?

Fortunately/Unfortunately, I have not seen any crashes at all in the past week since I've turned off ICC.

Is there a very slight chance that the ReleaseSliceNow crash is related to this in any way?
Flags: needinfo?(gary)
> Is there a very slight chance that the ReleaseSliceNow crash is related to this in any way?

I think there's a pretty good chance it's related, and that your results increase that chance.

In a few days I'll open a bug to turn off ICC on the trunk.  I want to wait a bit, to see how the latest patch for bug 1011391 (attachment 8442347 [details] [diff] [review]) plays out.  Current results suggest it makes that bug's crashes happen three times more often!
Summary: AddressSanitizer: use after free [@ ContentEnumFunc] or [@ nsCOMPtr<nsIAtom>::get] → AddressSanitizer: use after free with incremental cycle collection [@ ContentEnumFunc] or [@ nsCOMPtr<nsIAtom>::get]
johns managed to get a cycle collector log from right before the UAF using a modified version of Martijn's test case in bug 1011391.  There's a style sheet that has some group rules in it, and we unlink the group rule but not the sheet.  I don't 100% know that this is the problem, but that would certainly cause this crash.  Both the style sheet and group rules are alive at the time we hit the use-after-free, judging by poking around with GDB.

This is weird because the sheet (which has a refcount of 1 and no known references in the graph) should prevent the rules from being unlinked.  Also, the sheet is not listed as being a root. So, something is messed up in the CC graph (probably in the state of the CC node for the sheet), or in the flooding algorithm.  I'm not sure exactly what is wrong, so I'll have to add some more logging code and try again.
Component: CSS Parsing and Computation → XPCOM
Does removing the mInner->mSheets.Length() != 1 optimizations at the beginning of CSSStyleSheet::TraverseInner and CSSStyleSheet::UnlinkInner change anything?
That does look a little weird, and I'll try removing it, but fundamentally the CC itself seems to be doing something incorrectly.
Another thing is that the sheet of the rule at the time of the UAF seemed to be different than the sheet it was in at the time of the CC, but maybe one is the inner and one is the outer sheet and so that's expected?
The "inner sheet" is not an nsCSSStyleSheet.  Rules do not have references to it.
Past that, the sheet of a rule can in fact change.  Say you have two CSStyleSheet instances that have the same inner.  The rules will point to one of them as parent.  Now say that one of the CSSStyleSheets goes away.  It'll call CSSStyleSheetInner::RemoveSheet, which will reparent the rules to the other sheet.
(In reply to Boris Zbarsky [:bz] from comment #47)
> Does removing the mInner->mSheets.Length() != 1 optimizations at the
> beginning of CSSStyleSheet::TraverseInner and CSSStyleSheet::UnlinkInner
> change anything?

Looking at the code some more, it is not ok to remove that check.  If I understand this correctly, when the inner has more than one thing in mSheets, then it is the inner of multiple sheets.  So without that check, multiple sheets will traverse the mOrderedRules (etc.) of the inner, claiming they own it, when in fact there is only a single reference.
I think I've figured out what is happening. My theory is that the sequence of steps goes something like this:
- Initially, we have a style sheet S that has an inner I, and the inner owns GroupRule R.
- We start a CC, then traverse S and R, which adds nodes N_S and N_R to the CC graph, where N_S points to N_R, and both have a refcount of 1.
- We create a new style sheet S', add S' to I and remove S from I. (S' is new, so we probably won't end up traversing it during the CC.) (No refcount traffic occurs on R, so the write barrier doesn't fire for it.)
- S dies.  This causes us to set the mParticipant field of N_S to null.
- We start to finish the CC.  The CC looks at N_S, but because its participant is null, it ignores it completely. Germane to our behavior here, it does not unlink it (good!), and it does not mark it black and flood black from it (bad!).
- Because the number of known references to N_R (1, from the dead N_S) is equal to the refcount of N_R, and no other object that reaches N_R has flooded black, the CC thinks it is dead and unlinks R.
- The unlinking of R causes a style rule to be destroyed without being removed from the sheet, causing the dangling pointer, which leads to our crash.

Anyways, that's what I inferred from the CC log in comment 46.  I add logging which confirmed that N_S was neutered in the graph.  I can't be sure this is causing the crash, but even if it isn't, it is a bug in ICC that must be fixed.

The basic problem is that we need to distinguish dead objects that were not traversed from those that were. For the former, we want the current behavior.  For those that were traversed but died, I think we can treat them as being alive, though that sounds contradictory.  I need to think about whether this might cause leaks.
> I can't be sure this is causing the crash

I'm pretty sure the sequence of events you describe would cause exactly this crash.

So the fundamental issue is that object A is holding a ref to object B, and then between the graph construction and the finish stage A hands over the reference to someone else without changing B's refcount in the process, and then A dies, right?
(In reply to Boris Zbarsky [:bz] from comment #54)
> So the fundamental issue is that object A is holding a ref to object B, and
> then between the graph construction and the finish stage A hands over the
> reference to someone else without changing B's refcount in the process, and
> then A dies, right?
Yup.

I have a patch that makes any node in the graph that was traversed but is dead be treated as alive to the CC, and it seems to fix the crash in limited testing.  With a test case running on John's machine, it was consistently crashing after 2-5 minutes, but with the patch, it wasn't crashing after 10 or so.  I need to clean up the patch and test it some more.
(In reply to Andrew McCreight [:mccr8] from comment #55)
> I have a patch that makes any node in the graph that was traversed but is
> dead be treated as alive to the CC, and it seems to fix the crash in limited
> testing. 
That sounds good to me.
Should the objects which the now-dead object had edges to be put to the purple buffer to emulate
normal addref/release?
The first version of my patch (not uploaded) did two things

1. Color traversed nodes, even if they are dead.
2. If a dead node was traversed, treat it as being always alive.

In this new patch I have attached, I just do the first thing. In any situation where we need 2, I think we could have a slightly different situation where the node does not die, and thus we're in trouble anyways. I wasn't able to come up with a situation where 1 is insufficient (assuming people don't manipulate fields of objects held by a weak pointer without taking a strong reference to the object).

I think this should more clearly eliminate the need to add anything to the purple buffer, because if we decide the children of the dead node should be freed, with this version, we'll just free it, so there should be no additional danger of leaks.

Anyways, the patch works by introducing a WasTraversed() method on PtrInfo, that returns true if the refcount has been set to something aside from the initial bogus value.  Then, we basically replace all null-checks on mParticipant with calls to WasTraversed(), except for places where we are going to actually call a method on mParticipant.  In practice, this means that the mParticipant null checks in the graph coloring phase, ScanRoots(), are changed to WasTraversed() calls.

This version has been running on the PDF test case for 31 minutes without crashing, so it looks good in terms of fixing the crash.

Hopefully the try run will look okay:
  https://tbpl.mozilla.org/?tree=Try&rev=262dec3e2f9f
Comment on attachment 8447402 [details] [diff] [review]
The cycle collector should color traversed dead nodes.

This should be ready to review now, but feel free to wait until after the try results come in.
Attachment #8447402 - Flags: review?(bugs)
Comment on attachment 8447402 [details] [diff] [review]
The cycle collector should color traversed dead nodes.

I'm asking for approval now, because it would be nice to get it landed today or over the weekend to fix the crashes.  I obviously will not land until it is reviewed. :)

[Security approval request comment]
How easily could an exploit be constructed based on the patch? Not easily. I think the bigger giveaway is the top crash.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? No.

Which older supported branches are affected by this flaw? 32 Aurora.

If not all supported branches, which bug introduced the flaw? bug 911246

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? This code has not changed in the last month, so it should be easy to backport, and trunk should reveal any problems.

How likely is this patch to cause regressions; how much testing does it need?  This is affecting behavior not tested on TBPL, so it would be good to get it plenty of testing.
Attachment #8447402 - Flags: sec-approval?
(In reply to Andrew McCreight [:mccr8] from comment #59)
> Which older supported branches are affected by this flaw? 32 Aurora.

(In reply to Steven Michaud from comment #45)
> > Is there a very slight chance that the ReleaseSliceNow crash is related to this in any way?
> 
> I think there's a pretty good chance it's related, and that your results
> increase that chance.

Steven, this is strange. My ReleaseSliceNow crashes seemed to happen on 31 as well, but if it was related to this bug, then something seems amiss because Andrew mentions that only Fx32 Aurora is affected.
Andrew seems well on his way to fixing this bug.  Once a patch has landed on trunk, we should know within a few days whether or not the ReleaseSliceNow() crash stats are effected.
Comment on attachment 8447402 [details] [diff] [review]
The cycle collector should color traversed dead nodes.

Sec-approval+ for trunk. Let's get this on Aurora as well.
Attachment #8447402 - Flags: sec-approval? → sec-approval+
Comment on attachment 8447402 [details] [diff] [review]
The cycle collector should color traversed dead nodes.

https://tbpl.mozilla.org/php/getParsedLog.php?id=42663471&tree=Try&full=1#error0
doesn't look quite right.
Attachment #8447402 - Flags: review?(bugs)
Oh, I bet what is happening is that we add an object to the graph, then it gets addrefed and we add whatever is holding it to the graph, then the object dies (so it isn't in the purple buffer any more to get added as an incremental root).  The easiest thing to do would be to ignore problems in that case.  The alternative is to track purple objects that died and treat them as incremental roots.  I'll have to think about that.
This will make it easier to scan for things besides gray JS nodes.  It slightly
reorders the tests we do, but it shouldn't really matter.

This is kind of gross because sometimes we run the CC when there is no JS runtime,
in some test harness, I believe. So, we have to confirm that we have a JS runtime
anytime we have a grey object. Otherwise, we'd somehow have to worry about dead
objects (that have a null participant) when there is no JS runtime, and that's
annoying.

The else continue looks silly, but in the next patch I insert another test in there.
Comment on attachment 8447402 [details] [diff] [review]
The cycle collector should color traversed dead nodes.

I'll keep this unobsolete, because it has the sec-approval+, but this version isn't quite right, so I'm not going to use it.
Attachment #8447402 - Flags: checkin-
Attachment #8447742 - Flags: review?(bugs)
This patch does two things. First, as in the previous version, it adds a new notion of a WasTraversed() node, and uses that instead of checking mParticipant to see if we should look at a node, aside from when we really are trying to use the participant.  The effect of this is that we will properly color a node that was traversed, but since died.

The second thing this patch does is treat traversed dead nodes as incremental roots. If a node was traversed, it must have been alive at the start of the CC. If it has since died, then its refcount must have decreased during the CC, and thus it would have been in the purple buffer if it hadn't died. So it should be okay to treat it as an incremental root, as we do for anything in the purple buffer.

Treating dead objects as incremental roots also handles the case that was causing the previous patch to fail, where an object gets addrefed during a CC after we traverse it, then we traverse the things that hold onto it, then it dies. In that case, we are treating it as an incremental root, and thus do not check its ref count.

I also tried a version of this patch that does not do the incremental root thing, but just doesn't check any dead node, but that does not feel as internally consistent as this, so I think this version is better, if a little more complex.

try run: https://tbpl.mozilla.org/?tree=Try&rev=2928acf74f78

Theres's a lot of failures in one dt1 test, but that is happening quite frequently right now, so I think I just got unlucky in my first four dt1 runs.

This patch is a bit more complicated than the previous version, and it does not make me feel good that the first version was broken, so my current plan is to disable ICC on Aurora, so these two patches will only land on Nightly.
Attachment #8447743 - Flags: review?(bugs)
Summary: AddressSanitizer: use after free with incremental cycle collection [@ ContentEnumFunc] or [@ nsCOMPtr<nsIAtom>::get] → Incremental cycle collection does not properly handle dead traversed nodes, leading to CSS use-after-free
Blocks: 1011391
No longer depends on: 1011391
OS: Mac OS X → All
Hardware: x86_64 → All
Attachment #8447742 - Flags: review?(bugs) → review+
Comment on attachment 8447743 [details] [diff] [review]
part 2 - Dead traversed objects should be treated as incremental roots and colored.

Tiny bit hard to review

>+    } else if (!pi->mParticipant && pi->WasTraversed()) {
>+      // Dead traversed refcounted objects:
>+      // If the object was traversed, it must have been alive at the start of
>+      // the CC, and thus had a positive refcount. It is dead now, so its
>+      // refcount must have decreased at some point during the CC. Therefore,
>+      // it would be in the purple buffer if it wasn't dead, so treat it as an
>+      // incremental root.
This could use a comment why we don't leak here (the dead object should have released member variables and those should be now in pb).
Attachment #8447743 - Flags: review?(bugs) → review+
> This could use a comment why we don't leak here (the dead object should have released member variables and those should be now in pb).
I added a comment.

Thanks for the reviews!

remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/01aca028ba9d
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/c213049465be
Oh, and the latest version of the patch has been running the crashing test case one johns's machine for 2 hours and 17 minutes without crashing.
and now on m-c https://hg.mozilla.org/mozilla-central/rev/01aca028ba9d and
https://hg.mozilla.org/mozilla-central/rev/c213049465be
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
ICC has been disabled on 33 so that should fix this issue there.
Olli pointed out I meant 32, not 33.
This patch did not land on 32, but the crash should be fixed by disabling.
Group: core-security
Is there anywhere where I can download Steven's asan build from June 9th so I can try reproducing the original issue before going through verification?
Flags: needinfo?(gary)
I think you should be able to reproduce this on a regular Linux ASan build with the test case in bug 1011391 comment 47.  The original issue in this bug is something like "Be Gary and browse the web for a day" which will be hard to confirm. :)
(In reply to Kamil Jozwiak [:kjozwiak] from comment #75)
> Is there anywhere where I can download Steven's asan build from June 9th so
> I can try reproducing the original issue before going through verification?

Nope, I don't think so. I think Steven updates his builds weekly, and I don't save them locally either.
Flags: needinfo?(gary)
Marking this [qe-verify-] due to difficulty in reproducing, without a test case or access to build.
Flags: qe-verify-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: