Closed
Bug 1023758
Opened 10 years ago
Closed 10 years ago
Incremental cycle collection does not properly handle dead traversed nodes, leading to CSS use-after-free
Categories
(Core :: XPCOM, defect)
Core
XPCOM
Tracking
()
RESOLVED
FIXED
mozilla33
Tracking | Status | |
---|---|---|
firefox31 | --- | unaffected |
firefox32 | + | unaffected |
firefox33 | + | fixed |
firefox-esr24 | --- | unaffected |
b2g-v2.0 | --- | fixed |
b2g-v2.1 | --- | fixed |
People
(Reporter: gkw, Assigned: mccr8)
References
Details
(Keywords: csectype-uaf, regression, sec-critical)
Attachments
(5 files, 1 obsolete file)
18.44 KB,
text/plain
|
Details | |
27.16 KB,
text/plain
|
Details | |
6.03 KB,
patch
|
abillings
:
sec-approval+
mccr8
:
checkin-
|
Details | Diff | Splinter Review |
5.30 KB,
patch
|
smaug
:
review+
|
Details | Diff | Splinter Review |
7.29 KB,
patch
|
smaug
:
review+
|
Details | Diff | Splinter Review |
+++ This bug was initially created as a clone of Bug #1021612 +++
I tested Steven's asan build ( http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg ) from Jun 9 (bug 997908 comment 106) and when I was loading a Wordpress admin/edit page, the asan build crashed with a heap buffer overflow at ContentEnumFunc.
( cc'ing :peterv, :bz, :dbaron and :khuey as randomly picked from searching "CSS" at https://wiki.mozilla.org/Modules/All - needinfo? selectively from :peterv and :bz, please look at it whoever gets to it first)
Not sure if DOM: CSS Object Model is the right component - please also feel free to assign this to the right one.
Setting s-s and sec-critical until we know what is going on.
Flags: needinfo?(peterv)
Flags: needinfo?(bzbarsky)
Updated•10 years ago
|
Component: DOM: CSS Object Model → CSS Parsing and Computation
Flags: needinfo?(peterv)
Comment 1•10 years ago
|
||
Interesting. Could this be what's behind the ReleaseSliceNow() crashes at bug 997908?
Comment 2•10 years ago
|
||
Note possibly related bug 1011391 and bug 1013603.
Gary, can we possibly get line numbers out of ASAN here?
Updated•10 years ago
|
Flags: needinfo?(gary)
Updated•10 years ago
|
Flags: needinfo?(bzbarsky)
Reporter | ||
Comment 3•10 years ago
|
||
Steven says "Probably not, but I'm not 100% sure." as per bug 1021612 comment 5.
Flags: needinfo?(gary) → needinfo?(bzbarsky)
Comment 4•10 years ago
|
||
That makes it pretty hard to tell what we're overflowing, exactly. :(
Flags: needinfo?(bzbarsky)
Reporter | ||
Comment 5•10 years ago
|
||
(In reply to Boris Zbarsky [:bz] from comment #4)
> That makes it pretty hard to tell what we're overflowing, exactly. :(
Any way around this, Steven? Does this mean I *have* to do my own ASan builds?
If so, I'd think I have to get your mozconfigs, as well as your local ASan changes. Also, I'm on LLVM rev 200213 as per bug 957865 and https://developer.mozilla.org/en-US/docs/Mozilla/Testing/Firefox_and_Address_Sanitizer#LLVM.2FClang
Flags: needinfo?(smichaud)
Comment 6•10 years ago
|
||
I've now documented exactly how I do my Mac ASan builds, here:
http://people.mozilla.org/~stmichaud/bmo/firefox-asan-howto.txt
Let me know if you have any questions.
Flags: needinfo?(smichaud)
Comment 7•10 years ago
|
||
Most of what I've learned about ASan builds I learned in the last week or two. I do remember seeing something about line numbers in ASan logs. Let me see if I can find it again.
Comment 8•10 years ago
|
||
The best info I can find on the line numbers issue is this:
http://stackoverflow.com/questions/23554398/get-line-numbers-with-addresssanitizer-output
Which, needless to say, is pretty inconclusive. But other things I've seen imply that it *is* possible -- for example:
http://www.chromium.org/developers/testing/addresssanitizer
At the very least, though, I'll probably have to change one or more build parameters and redo my build.
Comment 9•10 years ago
|
||
This is probably the answer -- the issue is probably a clang/llvm-symbolizer bug:
https://code.google.com/p/address-sanitizer/issues/detail?id=207
But running dsymutil on the XUL in my current Mac ASan builds doesn't fix the problem. So I'll probably need to rebuild with the "-g" flag (which I'm currently not using), *then* run dsymutil manually on XUL (at least).
It may be a while before you see line numbers, Boris :-(
Updated•10 years ago
|
status-firefox33:
--- → affected
tracking-firefox33:
--- → +
Reporter | ||
Comment 10•10 years ago
|
||
(In reply to Steven Michaud from comment #9)
> But running dsymutil on the XUL in my current Mac ASan builds doesn't fix
> the problem. So I'll probably need to rebuild with the "-g" flag (which I'm
> currently not using), *then* run dsymutil manually on XUL (at least).
Steven, I'll await your new build to try this out again.
Flags: needinfo?(smichaud)
Comment 11•10 years ago
|
||
I've spent several more hours on this, without any luck. Currently I have no idea when or if I'll be able to get line numbers into ASan logs. So Boris, please try to make do without them.
Flags: needinfo?(smichaud)
Reporter | ||
Comment 12•10 years ago
|
||
(In reply to Steven Michaud from comment #6)
> I've now documented exactly how I do my Mac ASan builds, here:
> http://people.mozilla.org/~stmichaud/bmo/firefox-asan-howto.txt
>
> Let me know if you have any questions.
Just curious, Steven - your instructions are for Lion 10.7, have you tried on a 10.9 machine? Or are you keeping 10.7 for a reason?
Flags: needinfo?(smichaud)
Comment 13•10 years ago
|
||
I haven't tried on a 10.9 machine (or a 10.8 one). I'm sticking to 10.7 (and the 10.6 SDK) because I want these builds to work on all our supported versions of OS X.
Flags: needinfo?(smichaud)
Comment 14•10 years ago
|
||
I think I've got the line numbers problem fixed ... at least mostly. New build coming up shortly.
I had to fix a couple of bugs to stop llvm-symbolizer (and the code that invokes it) from crashing while processing the Dwarf debugging info for such an enormous (and complex) file as XUL.
Comment 15•10 years ago
|
||
I spoke too soon. I've found another stupid llvm crash bug.
With luck I'll have a new build in the next few days, though.
What do you expect to learn from the line numbers? It seems like a testcase would be vastly more informative, but I'm not sure what line numbers would add to what we know from the existing stacks (which are somewhat useful but don't point to a problem).
Comment 17•10 years ago
|
||
Well, we'd know what exact thing we're overflowing...
But yes, a testcase would help even more.
Also, the ASAN output here is not a heap buffer overflow -- it's a use after free.
And, in particular, it's a use after free of part of the selector data structure (I think more likely an nsCSSSelector than an nsCSSSelectorList, though I didn't look too closely, and don't think it matters much) of a style rule that's inside an @media rule which is inside a style sheet, which we're somehow trying to use after the @media rule has been cycle-collected. (I don't see how we'd know from the stacks whether or not the sheet is still alive, although there might be a way.)
Summary: AddressSanitizer: heap-buffer-overflow [@ ContentEnumFunc] → AddressSanitizer: use after free [@ ContentEnumFunc]
Assignee | ||
Updated•10 years ago
|
Blocks: 911246
Keywords: csectype-uaf,
regression
Assignee | ||
Updated•10 years ago
|
status-firefox32:
--- → affected
Comment 19•10 years ago
|
||
As it turns out, the latest llvm bug wasn't hard to fix. So I've created a new Mac ASan build (using current m-c code), and have uploaded it to the usual place:
http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg
I've also updated my readme:
http://people.mozilla.org/~stmichaud/bmo/firefox-asan-howto.txt
You should now see line numbers for XUL ... or at least for most of it. I followed the general strategy in comment #6 -- I ran dsymutil on XUL to create a separate directory (XUL.dSYM) containing Dwarf debug information, readable by llvm-symbolizer. But I also had to fix three separate llvm crash bugs, and I'm not sure I've got them all. So please let me know if you see any llvm crashes.
The DMG file is now *huge* -- more than 500 MB, and about five times the size of the previous build's DMG. You will need lots of RAM to run this build :-)
You'll also need to use the embedded llvm-symbolizer (in Nightly.app/Contents/MacOS), instead of whatever llvm-symbolizer you may have built yourself. Only mine has the fixes to stop it from choking on XUL.dSYM.
Reporter | ||
Comment 20•10 years ago
|
||
(In reply to Steven Michaud from comment #19)
> http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg
I'm now testing this build, but when I re-tried going to the same place (Wordpress admin page) where I hit the bug, this issue wouldn't occur again. Will keep trying.
(In reply to Steven Michaud from comment #19)
> You'll also need to use the embedded llvm-symbolizer (in
> Nightly.app/Contents/MacOS), instead of whatever llvm-symbolizer you may
> have built yourself. Only mine has the fixes to stop it from choking on
> XUL.dSYM.
Yep, using the embedded llvm-symbolizer too.
Reporter | ||
Comment 21•10 years ago
|
||
Boris, here's the stack with line numbers as hoped for.
Hit this while surfing Twitter. I noticed prior to crashing that it was substantially slower (e.g. trudging along) than after restart.
Steven, I'm not sure what the m-c rev is for this build, did I miss the changeset rev from which you compiled your ASan build from? (so the line numbers can match up with source)
Attachment #8438283 -
Attachment is obsolete: true
Flags: needinfo?(smichaud)
Flags: needinfo?(bzbarsky)
Reporter | ||
Comment 22•10 years ago
|
||
#10 0x1091fb7d2 in mozilla::css::GroupRule::cycleCollection::Unlink(void*) /usr/local/src/Mozilla/asan/mozilla-central/layout/style/nsCSSRules.cpp:568
Cycle collection at / involving GroupRule?
Comment 23•10 years ago
|
||
> Steven, I'm not sure what the m-c rev is for this build
It's http://hg.mozilla.org/mozilla-central/rev/f00616c14171. I got that by running the build and doing about:buildconfig :-)
Flags: needinfo?(smichaud)
Reporter | ||
Comment 24•10 years ago
|
||
(In reply to Steven Michaud from comment #23)
> It's http://hg.mozilla.org/mozilla-central/rev/f00616c14171. I got that by
> running the build and doing about:buildconfig :-)
Oh yeah, I should have remembered that. Nonetheless, this stack shows your build with line numbers worked. :)
I'd suggest filing a new bug to get some of the workarounds to LLVM(?) landed and see if we can get regular (nightly?) ASan builds with line numbers, if there isn't any yet. Judging by the 2 s-s bugs found recently by my real-world testing, I would be surprised if we feel it isn't worth it.
Comment 25•10 years ago
|
||
Upstreaming my LLVM patches will take some time. It's complicated by the fact that I'm not using the latest version of LLVM, since I haven't (yet) gotten that to work properly on OS X 10.7. Also by the fact that I don't want to spend much time on it, now that I've got working builds. I'd be very happy if someone else picked that task up ... but I'm not sure when/if that'll happen.
Even more complicated is the automation of Mac ASan builds. I imagine that will take months ... if not longer.
I'll at least open some bmo bugs on these subjects, probably sometime next week.
In the meantime I'm happy to keep doing the builds by hand (about once a week), and to fix occasional LLVM bugs as they show up. That will presumably be a lot less work than the alternative.
And yes, I think these builds have been a big success. You've found two security bugs in just days. And I found one bug myself, though it's not a security bug (bug 125078).
Comment 26•10 years ago
|
||
Comment 27•10 years ago
|
||
OK, so there are two obvious possibilities.
Either we're screwing up CC somehow and destroying a GroupRule that should be live, or we're failing to notify the style set when a sheet is removed from a document.
Gary, can you reproduce this reliably enough to try bisecting, by any chance? We may need to fall back to that if we don't come up with any better ideas...
GroupRule itself is pretty simple in terms of CC stuff; it's DOMCSSStyleRule and DOMCSSDeclarationImpl that are complicated....
Flags: needinfo?(bzbarsky) → needinfo?(gary)
Reporter | ||
Comment 28•10 years ago
|
||
(In reply to Boris Zbarsky [:bz] from comment #27)
> Gary, can you reproduce this reliably enough to try bisecting, by any
> chance? We may need to fall back to that if we don't come up with any
> better ideas...
Nope. (1) It is not reliable, and (2) I rely on builds produced by Steven, which tend to be once a week, and which only started these couple of days.
For the crash in bug 997908 (which is what led me to use Steven's ASan builds), I noticed ReleaseSliceNow crashes happening from when Fx31 was nightly, so that might be a start (assuming the ReleaseSliceNow crashes are actually this bug).
If the 2 bugs are completely unrelated, then I really have no idea when it started.
Flags: needinfo?(gary)
Reporter | ||
Comment 29•10 years ago
|
||
> Nope. (1) It is not reliable
My browser usage is such that I have upwards of 250-300 tabs open, and I almost always never restart my browser - sometimes for a week. During this period, a lot of tabs get opened and closed, in all sorts of languages (incl. CJK), types (light vs video-heavy) and age (I sometimes close old tabs, not necessarily only new recently-opened ones).
Plus, I also mentioned in comment 20 that my initial steps for causing this issue no longer worked, and I triggered this again later via a separate STR, that was also unreproducible.
Assignee | ||
Comment 30•10 years ago
|
||
This is almost certainly somehow due to ICC, given that the top crash started on the day ICC landed, and the free is happening in the CC. ICC can be disabled by setting dom.cycle_collector.incremental to false and see if it feels like it doesn't happen then, or something. I should try out these ASAN builds, too.
I'm not really sure how this could be happening. I'd think that even if the CC is destroying stuff that should be alive, that the destruction sequence we are creating would go and destroy all of the weak references to the objects. But I don't really understand the relationship between all of these CSS style things.
One thing I noticed is that the unlink has an optimization where it doesn't clear out weak refs if tmp->GetStyleSheet() is null, which is a difference from what happens in the dtor.
I have a patch sitting around to cycle collect nsCSSRuleProcessor (bug 990160) and maybe that will magically fix this...
Comment 31•10 years ago
|
||
> that the destruction sequence we are creating would go and destroy all of the weak
> references to the objects.
In what sense?
The rule processor data has pointers into style rule internal bits, effectively, and relies on being notified when rules are removed from sheets or sheets are removed from the document. If a rule or sheet is destroyed without those notifications getting sent, the rule processor data will have dangling pointers.
Is incremental cycle collection about doing the traversal incrementally or about doing the freeing incrementally? If it's the latter, there could be interesting dependencies related to destruction order.
Assignee | ||
Comment 33•10 years ago
|
||
It just does graph building/traversal incrementally. There were too many horrible things happening with incremental unlinking, like weak refs becoming strong refs after we'd decided an object was dead.
The main problem is that the CC can become confused about whether an object is alive or not, because, say, its refcount could increase after we traverse it, but before it traverses the things that point to it, so the CC thinks it sees all of the references to an object. To counter this, any object that is addrefed or released during the graph building phase is treated as being live. (This avoids leaks because any such object is in the purple buffer, so it will be looked at again in the next CC.)
(Technically, if you are doing a swap() of a cycle collected object's pointer into another cycle collected object, this same confusion could occur, because there is no addref/release, but I think it would be unlikely that we do 0 addref/release of an object during an entire graph building phase, due to holding things in stack vars.)
Reporter | ||
Updated•10 years ago
|
Summary: AddressSanitizer: use after free [@ ContentEnumFunc] → AddressSanitizer: use after free [@ ContentEnumFunc] or [@ nsCOMPtr<nsIAtom>::get]
Reporter | ||
Comment 34•10 years ago
|
||
Here's another stack, which now shows additional SEGV info about IPC in a second stack.
Comment 35•10 years ago
|
||
The first stack is actually bug 1011391 -- but that's very closely related to this bug.
The second stack is just a plugin killing itself after the main process dies.
Comment 36•10 years ago
|
||
(In reply to David Baron [:dbaron] (busy May 31-June 15) (UTC-4) from comment #32)
> Is incremental cycle collection about doing the traversal incrementally or
> about doing the freeing incrementally? If it's the latter, there could be
> interesting dependencies related to destruction order.
Well, snowwhite already made objects deletion to happen possibly incrementally, and nothing guarantees
the order. ICC may have changed when snowwhite killer does its job.
So GroupRule is the only CCable Rule? Other rules are deleted immediately when refcnt drops to 0,
but GroupRule stays alive for awhile...
Comment 37•10 years ago
|
||
I've just updated my Mac ASan build to current trunk code:
http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg
http://people.mozilla.org/~stmichaud/bmo/firefox-asan-howto.txt
Gary, I suggest you test with this for a few days with ICC turned off (with dom.cycle_collector.incremental set to false). Then if you don't see any CC-related crashes (and if the rest of you haven't yet figured out how to fix this bug), I suggest we turn ICC off by default on the trunk for at least a week, to see what effect it has on our crash stats.
Reporter | ||
Comment 38•10 years ago
|
||
(In reply to Steven Michaud from comment #37)
> Gary, I suggest you test with this for a few days with ICC turned off (with
> dom.cycle_collector.incremental set to false).
Testing now, thanks!
Comment 39•10 years ago
|
||
> So GroupRule is the only CCable Rule?
No, page rules, font face rules, and DOMCSSStyleRule are ccable.
So I've been reading through our code for removing/etc rules again. Removal (or addition) of a rule (or sheet) calls PresShell::RecordStyleSheetChange; the following PresShell::EndUpdate then does a ReconstructStyleData(). But that just posts an event, afaict... So now I'm not quite sure how this is supposed to work.
Comment 40•10 years ago
|
||
Oh, I see. Removing rules from a sheet is supposed to ClearRuleCascades(), which clears the rule cascades on all the rule processors that sheet is involved in...
And adding/removing sheets ends up marking the style set dirty, which drops the old rule processors altogether.
Comment 41•10 years ago
|
||
I hit this crash when on youtube in gdb in an opt build yesterday, but we weren't able to get much useful info.
I wasn't able to reproduce, but I do get a lot of the following assertions in debug builds using my profile on youtube and resizing the window:
> ###!!! ASSERTION: style context has old rule node: 'n == mRuleTree', file layout/style/nsStyleSet.cpp, line 238
> ###!!! ASSERTION: old rule tree still referenced: 'Not Reached', file layout/style/nsStyleSet.cpp, line 1779
If that's at all relevant, I can try to bisect my profile/tabs to a test case
Assignee | ||
Updated•10 years ago
|
Assignee: nobody → continuation
status-firefox31:
--- → unaffected
Comment 42•10 years ago
|
||
I've updated my Mac ASan build again to current trunk code:
http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg
http://people.mozilla.org/~stmichaud/bmo/firefox-asan-howto.txt
Gary, how has your testing gone over the last week? Did you see any CC-related crashes with ICC turned off?
Flags: needinfo?(gary)
Comment 43•10 years ago
|
||
(In reply to comment #41)
John, I don't know if the assertions you saw are relevant. But if you can easily reproduce them, it'd be worthwhile finding a regression range for them in m-c debug nightlies.
Reporter | ||
Comment 44•10 years ago
|
||
(In reply to Steven Michaud from comment #42)
> Gary, how has your testing gone over the last week? Did you see any
> CC-related crashes with ICC turned off?
Fortunately/Unfortunately, I have not seen any crashes at all in the past week since I've turned off ICC.
Is there a very slight chance that the ReleaseSliceNow crash is related to this in any way?
Flags: needinfo?(gary)
Comment 45•10 years ago
|
||
> Is there a very slight chance that the ReleaseSliceNow crash is related to this in any way?
I think there's a pretty good chance it's related, and that your results increase that chance.
In a few days I'll open a bug to turn off ICC on the trunk. I want to wait a bit, to see how the latest patch for bug 1011391 (attachment 8442347 [details] [diff] [review]) plays out. Current results suggest it makes that bug's crashes happen three times more often!
Updated•10 years ago
|
Summary: AddressSanitizer: use after free [@ ContentEnumFunc] or [@ nsCOMPtr<nsIAtom>::get] → AddressSanitizer: use after free with incremental cycle collection [@ ContentEnumFunc] or [@ nsCOMPtr<nsIAtom>::get]
Assignee | ||
Comment 46•10 years ago
|
||
johns managed to get a cycle collector log from right before the UAF using a modified version of Martijn's test case in bug 1011391. There's a style sheet that has some group rules in it, and we unlink the group rule but not the sheet. I don't 100% know that this is the problem, but that would certainly cause this crash. Both the style sheet and group rules are alive at the time we hit the use-after-free, judging by poking around with GDB.
This is weird because the sheet (which has a refcount of 1 and no known references in the graph) should prevent the rules from being unlinked. Also, the sheet is not listed as being a root. So, something is messed up in the CC graph (probably in the state of the CC node for the sheet), or in the flooding algorithm. I'm not sure exactly what is wrong, so I'll have to add some more logging code and try again.
Component: CSS Parsing and Computation → XPCOM
Comment 47•10 years ago
|
||
Does removing the mInner->mSheets.Length() != 1 optimizations at the beginning of CSSStyleSheet::TraverseInner and CSSStyleSheet::UnlinkInner change anything?
Assignee | ||
Comment 48•10 years ago
|
||
That does look a little weird, and I'll try removing it, but fundamentally the CC itself seems to be doing something incorrectly.
Assignee | ||
Comment 49•10 years ago
|
||
Another thing is that the sheet of the rule at the time of the UAF seemed to be different than the sheet it was in at the time of the CC, but maybe one is the inner and one is the outer sheet and so that's expected?
Comment 50•10 years ago
|
||
The "inner sheet" is not an nsCSSStyleSheet. Rules do not have references to it.
Comment 51•10 years ago
|
||
Past that, the sheet of a rule can in fact change. Say you have two CSStyleSheet instances that have the same inner. The rules will point to one of them as parent. Now say that one of the CSSStyleSheets goes away. It'll call CSSStyleSheetInner::RemoveSheet, which will reparent the rules to the other sheet.
Assignee | ||
Comment 52•10 years ago
|
||
(In reply to Boris Zbarsky [:bz] from comment #47)
> Does removing the mInner->mSheets.Length() != 1 optimizations at the
> beginning of CSSStyleSheet::TraverseInner and CSSStyleSheet::UnlinkInner
> change anything?
Looking at the code some more, it is not ok to remove that check. If I understand this correctly, when the inner has more than one thing in mSheets, then it is the inner of multiple sheets. So without that check, multiple sheets will traverse the mOrderedRules (etc.) of the inner, claiming they own it, when in fact there is only a single reference.
Assignee | ||
Comment 53•10 years ago
|
||
I think I've figured out what is happening. My theory is that the sequence of steps goes something like this:
- Initially, we have a style sheet S that has an inner I, and the inner owns GroupRule R.
- We start a CC, then traverse S and R, which adds nodes N_S and N_R to the CC graph, where N_S points to N_R, and both have a refcount of 1.
- We create a new style sheet S', add S' to I and remove S from I. (S' is new, so we probably won't end up traversing it during the CC.) (No refcount traffic occurs on R, so the write barrier doesn't fire for it.)
- S dies. This causes us to set the mParticipant field of N_S to null.
- We start to finish the CC. The CC looks at N_S, but because its participant is null, it ignores it completely. Germane to our behavior here, it does not unlink it (good!), and it does not mark it black and flood black from it (bad!).
- Because the number of known references to N_R (1, from the dead N_S) is equal to the refcount of N_R, and no other object that reaches N_R has flooded black, the CC thinks it is dead and unlinks R.
- The unlinking of R causes a style rule to be destroyed without being removed from the sheet, causing the dangling pointer, which leads to our crash.
Anyways, that's what I inferred from the CC log in comment 46. I add logging which confirmed that N_S was neutered in the graph. I can't be sure this is causing the crash, but even if it isn't, it is a bug in ICC that must be fixed.
The basic problem is that we need to distinguish dead objects that were not traversed from those that were. For the former, we want the current behavior. For those that were traversed but died, I think we can treat them as being alive, though that sounds contradictory. I need to think about whether this might cause leaks.
Comment 54•10 years ago
|
||
> I can't be sure this is causing the crash
I'm pretty sure the sequence of events you describe would cause exactly this crash.
So the fundamental issue is that object A is holding a ref to object B, and then between the graph construction and the finish stage A hands over the reference to someone else without changing B's refcount in the process, and then A dies, right?
Assignee | ||
Comment 55•10 years ago
|
||
(In reply to Boris Zbarsky [:bz] from comment #54)
> So the fundamental issue is that object A is holding a ref to object B, and
> then between the graph construction and the finish stage A hands over the
> reference to someone else without changing B's refcount in the process, and
> then A dies, right?
Yup.
I have a patch that makes any node in the graph that was traversed but is dead be treated as alive to the CC, and it seems to fix the crash in limited testing. With a test case running on John's machine, it was consistently crashing after 2-5 minutes, but with the patch, it wasn't crashing after 10 or so. I need to clean up the patch and test it some more.
Comment 56•10 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #55)
> I have a patch that makes any node in the graph that was traversed but is
> dead be treated as alive to the CC, and it seems to fix the crash in limited
> testing.
That sounds good to me.
Should the objects which the now-dead object had edges to be put to the purple buffer to emulate
normal addref/release?
Assignee | ||
Comment 57•10 years ago
|
||
The first version of my patch (not uploaded) did two things
1. Color traversed nodes, even if they are dead.
2. If a dead node was traversed, treat it as being always alive.
In this new patch I have attached, I just do the first thing. In any situation where we need 2, I think we could have a slightly different situation where the node does not die, and thus we're in trouble anyways. I wasn't able to come up with a situation where 1 is insufficient (assuming people don't manipulate fields of objects held by a weak pointer without taking a strong reference to the object).
I think this should more clearly eliminate the need to add anything to the purple buffer, because if we decide the children of the dead node should be freed, with this version, we'll just free it, so there should be no additional danger of leaks.
Anyways, the patch works by introducing a WasTraversed() method on PtrInfo, that returns true if the refcount has been set to something aside from the initial bogus value. Then, we basically replace all null-checks on mParticipant with calls to WasTraversed(), except for places where we are going to actually call a method on mParticipant. In practice, this means that the mParticipant null checks in the graph coloring phase, ScanRoots(), are changed to WasTraversed() calls.
This version has been running on the PDF test case for 31 minutes without crashing, so it looks good in terms of fixing the crash.
Hopefully the try run will look okay:
https://tbpl.mozilla.org/?tree=Try&rev=262dec3e2f9f
Assignee | ||
Comment 58•10 years ago
|
||
Comment on attachment 8447402 [details] [diff] [review]
The cycle collector should color traversed dead nodes.
This should be ready to review now, but feel free to wait until after the try results come in.
Attachment #8447402 -
Flags: review?(bugs)
Assignee | ||
Comment 59•10 years ago
|
||
Comment on attachment 8447402 [details] [diff] [review]
The cycle collector should color traversed dead nodes.
I'm asking for approval now, because it would be nice to get it landed today or over the weekend to fix the crashes. I obviously will not land until it is reviewed. :)
[Security approval request comment]
How easily could an exploit be constructed based on the patch? Not easily. I think the bigger giveaway is the top crash.
Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? No.
Which older supported branches are affected by this flaw? 32 Aurora.
If not all supported branches, which bug introduced the flaw? bug 911246
Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? This code has not changed in the last month, so it should be easy to backport, and trunk should reveal any problems.
How likely is this patch to cause regressions; how much testing does it need? This is affecting behavior not tested on TBPL, so it would be good to get it plenty of testing.
Attachment #8447402 -
Flags: sec-approval?
Reporter | ||
Comment 60•10 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #59)
> Which older supported branches are affected by this flaw? 32 Aurora.
(In reply to Steven Michaud from comment #45)
> > Is there a very slight chance that the ReleaseSliceNow crash is related to this in any way?
>
> I think there's a pretty good chance it's related, and that your results
> increase that chance.
Steven, this is strange. My ReleaseSliceNow crashes seemed to happen on 31 as well, but if it was related to this bug, then something seems amiss because Andrew mentions that only Fx32 Aurora is affected.
Comment 61•10 years ago
|
||
Andrew seems well on his way to fixing this bug. Once a patch has landed on trunk, we should know within a few days whether or not the ReleaseSliceNow() crash stats are effected.
Updated•10 years ago
|
tracking-firefox32:
--- → +
Comment 62•10 years ago
|
||
Comment on attachment 8447402 [details] [diff] [review]
The cycle collector should color traversed dead nodes.
Sec-approval+ for trunk. Let's get this on Aurora as well.
Attachment #8447402 -
Flags: sec-approval? → sec-approval+
Comment 63•10 years ago
|
||
Comment on attachment 8447402 [details] [diff] [review]
The cycle collector should color traversed dead nodes.
https://tbpl.mozilla.org/php/getParsedLog.php?id=42663471&tree=Try&full=1#error0
doesn't look quite right.
Attachment #8447402 -
Flags: review?(bugs)
Assignee | ||
Comment 64•10 years ago
|
||
Oh, I bet what is happening is that we add an object to the graph, then it gets addrefed and we add whatever is holding it to the graph, then the object dies (so it isn't in the purple buffer any more to get added as an incremental root). The easiest thing to do would be to ignore problems in that case. The alternative is to track purple objects that died and treat them as incremental roots. I'll have to think about that.
Assignee | ||
Comment 65•10 years ago
|
||
This will make it easier to scan for things besides gray JS nodes. It slightly
reorders the tests we do, but it shouldn't really matter.
This is kind of gross because sometimes we run the CC when there is no JS runtime,
in some test harness, I believe. So, we have to confirm that we have a JS runtime
anytime we have a grey object. Otherwise, we'd somehow have to worry about dead
objects (that have a null participant) when there is no JS runtime, and that's
annoying.
The else continue looks silly, but in the next patch I insert another test in there.
Assignee | ||
Comment 66•10 years ago
|
||
Comment on attachment 8447402 [details] [diff] [review]
The cycle collector should color traversed dead nodes.
I'll keep this unobsolete, because it has the sec-approval+, but this version isn't quite right, so I'm not going to use it.
Attachment #8447402 -
Flags: checkin-
Assignee | ||
Updated•10 years ago
|
Attachment #8447742 -
Flags: review?(bugs)
Assignee | ||
Comment 67•10 years ago
|
||
This patch does two things. First, as in the previous version, it adds a new notion of a WasTraversed() node, and uses that instead of checking mParticipant to see if we should look at a node, aside from when we really are trying to use the participant. The effect of this is that we will properly color a node that was traversed, but since died.
The second thing this patch does is treat traversed dead nodes as incremental roots. If a node was traversed, it must have been alive at the start of the CC. If it has since died, then its refcount must have decreased during the CC, and thus it would have been in the purple buffer if it hadn't died. So it should be okay to treat it as an incremental root, as we do for anything in the purple buffer.
Treating dead objects as incremental roots also handles the case that was causing the previous patch to fail, where an object gets addrefed during a CC after we traverse it, then we traverse the things that hold onto it, then it dies. In that case, we are treating it as an incremental root, and thus do not check its ref count.
I also tried a version of this patch that does not do the incremental root thing, but just doesn't check any dead node, but that does not feel as internally consistent as this, so I think this version is better, if a little more complex.
try run: https://tbpl.mozilla.org/?tree=Try&rev=2928acf74f78
Theres's a lot of failures in one dt1 test, but that is happening quite frequently right now, so I think I just got unlucky in my first four dt1 runs.
This patch is a bit more complicated than the previous version, and it does not make me feel good that the first version was broken, so my current plan is to disable ICC on Aurora, so these two patches will only land on Nightly.
Attachment #8447743 -
Flags: review?(bugs)
Assignee | ||
Updated•10 years ago
|
Summary: AddressSanitizer: use after free with incremental cycle collection [@ ContentEnumFunc] or [@ nsCOMPtr<nsIAtom>::get] → Incremental cycle collection does not properly handle dead traversed nodes, leading to CSS use-after-free
Assignee | ||
Updated•10 years ago
|
Assignee | ||
Updated•10 years ago
|
OS: Mac OS X → All
Hardware: x86_64 → All
Updated•10 years ago
|
Attachment #8447742 -
Flags: review?(bugs) → review+
Comment 68•10 years ago
|
||
Comment on attachment 8447743 [details] [diff] [review]
part 2 - Dead traversed objects should be treated as incremental roots and colored.
Tiny bit hard to review
>+ } else if (!pi->mParticipant && pi->WasTraversed()) {
>+ // Dead traversed refcounted objects:
>+ // If the object was traversed, it must have been alive at the start of
>+ // the CC, and thus had a positive refcount. It is dead now, so its
>+ // refcount must have decreased at some point during the CC. Therefore,
>+ // it would be in the purple buffer if it wasn't dead, so treat it as an
>+ // incremental root.
This could use a comment why we don't leak here (the dead object should have released member variables and those should be now in pb).
Attachment #8447743 -
Flags: review?(bugs) → review+
Assignee | ||
Comment 69•10 years ago
|
||
> This could use a comment why we don't leak here (the dead object should have released member variables and those should be now in pb).
I added a comment.
Thanks for the reviews!
remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/01aca028ba9d
remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/c213049465be
Assignee | ||
Comment 70•10 years ago
|
||
Oh, and the latest version of the patch has been running the crashing test case one johns's machine for 2 hours and 17 minutes without crashing.
Comment 71•10 years ago
|
||
and now on m-c https://hg.mozilla.org/mozilla-central/rev/01aca028ba9d and
https://hg.mozilla.org/mozilla-central/rev/c213049465be
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Assignee | ||
Comment 72•10 years ago
|
||
ICC has been disabled on 33 so that should fix this issue there.
Assignee | ||
Comment 73•10 years ago
|
||
Olli pointed out I meant 32, not 33.
Updated•10 years ago
|
status-firefox-esr24:
--- → unaffected
Updated•10 years ago
|
Assignee | ||
Comment 74•10 years ago
|
||
This patch did not land on 32, but the crash should be fixed by disabling.
Assignee | ||
Updated•10 years ago
|
Group: core-security
Comment 75•10 years ago
|
||
Is there anywhere where I can download Steven's asan build from June 9th so I can try reproducing the original issue before going through verification?
Flags: needinfo?(gary)
Assignee | ||
Comment 76•10 years ago
|
||
I think you should be able to reproduce this on a regular Linux ASan build with the test case in bug 1011391 comment 47. The original issue in this bug is something like "Be Gary and browse the web for a day" which will be hard to confirm. :)
Reporter | ||
Comment 77•10 years ago
|
||
(In reply to Kamil Jozwiak [:kjozwiak] from comment #75)
> Is there anywhere where I can download Steven's asan build from June 9th so
> I can try reproducing the original issue before going through verification?
Nope, I don't think so. I think Steven updates his builds weekly, and I don't save them locally either.
Flags: needinfo?(gary)
Comment 78•10 years ago
|
||
Marking this [qe-verify-] due to difficulty in reproducing, without a test case or access to build.
Flags: qe-verify-
You need to log in
before you can comment on or make changes to this bug.
Description
•