Closed Bug 719450 Opened 13 years ago Closed 13 years ago

StackMemory is not same as native stack; cannot skip in ThreadEdgeWork

Categories

(Tamarin Graveyard :: Garbage Collection (mmGC), defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pnkfelix, Assigned: pnkfelix)

References

Details

(Whiteboard: WE:3083702)

Attachments

(2 files, 2 obsolete files)

Spawned from WE #3083702 fguinan: GC::Collect() calls GC::ReapZCT(), which properly “pins” all the objects in the stack root object. However, I don’t think the GC objects referenced within the stack root are “marked” in the ensuing call to GC::FinishIncrementalMark(), because the call to MarkQueueAndStack() does not mark stack roots if ‘scanStack’ is false. Since the objects within the stack root aren’t marked, I think GCAlloc::FinalizationPass() still deems them collectable and goes ahead and destroys them even though they are still in use on the main thread. lars: It seems to me that the bug is that the root marker treats these suspended stack segments as stack memory - they really are not stack memory in the conventional sense. It could be that the reason it does that is an attempt at optimization, to wit, only scanning the segments once, at the end of the gc cycle and not also at the beginning as we do for normal roots. (If so then there's an implicit contract that we not muck with the suspended stacks while they're suspended. Whether that's safe or not...) lars: maybe the underlying bug is that the "stack" optimization is important for AS2 stacks, and we just reused the infrastructure for these suspended-thread stack segments. in fact that's highly plausible. it may be sufficient for DoCreateRootFromCurrentStack to pass a flag to AutoRCRootSegment that states "not actually stack memory", but that's just a first guess. felix: having DoCreateRootFromCurrentStack pass such a flag certainly would band-aid over this instance of the problem are there other cases we can think of where this scenario would arise? (Where a stack segment is promoted to a root and needs to be treated a root, not as stack?) lars: i know some of the mobile people were toying with allocating the stack in the heap, because the stacks on the mobile devices were very limited i suspect they did not use this mechanism so we should be in the clear it may not matter anymore since we don't support those devices
Summary: CreateRootFromCurrentStack yields roots that should not be treated as stack, but are → roots from CreateRootFromCurrentStack should not be treated as stack (i.e. dont skip them)
hack outlined by Lars. At this point, I think it builds. :)
Assignee: nobody → fklockii
Status: NEW → ASSIGNED
Attachment #589879 - Flags: feedback?(lhansen)
Comment on attachment 589879 [details] [diff] [review] patch B v1: stack-isnt-stack band-aid Yes, what I had in mind.
Attachment #589879 - Flags: feedback?(lhansen) → feedback+
Whiteboard: WE:3083702
(In reply to Lars T Hansen from comment #2) > Comment on attachment 589879 [details] [diff] [review] > patch B v1: stack-isnt-stack band-aid > > Yes, what I had in mind. It was "slightly" flawed, unfortunately. This part of the change has a bug: - GC::RCRootSegment::RCRootSegment(GC* gc, void* mem, size_t size) - : StackMemory(gc, mem, size) + GC::RCRootSegment::RCRootSegment(GC* gc, void* mem, size_t size, bool suspendedStackRoot) + : StackMemory(gc, mem, size, suspendedStackRoot) See it? Here's a hint: - StackMemory(GC *gc, const void *object, size_t size, bool isExactlyTraced=false); + StackMemory(GC *gc, const void *object, size_t size, bool isExactlyTraced=false, bool suspendedStackRoot=false);
This time I actually tested the patch; it appears to address the bug in question.
Attachment #589879 - Attachment is obsolete: true
Attachment #590143 - Flags: review?(lhansen)
Comment on attachment 590143 [details] [diff] [review] patch B v2: stack-is-stack band-aid (removing review request; I anticipate being asked to add some comments that at least link to the appropriate tickets, so I will take care of that first.)
Attachment #590143 - Flags: review?(lhansen)
Okay now I have added more explanatory comments. R?
Attachment #590143 - Attachment is obsolete: true
Attachment #590155 - Flags: review?(lhansen)
Group: tamarin-security
Comment on attachment 590155 [details] [diff] [review] patch B v3: stack-is-stack band-aid (verbal R+ from Lars.)
Attachment #590155 - Flags: review?(lhansen) → review+
changeset: 7171:040cdd2789b3 user: Felix S Klock II <fklockii@adobe.com> summary: Bug 719450: stack-isnt-stack bandaid (r=lhansen). http://hg.mozilla.org/tamarin-redux/rev/040cdd2789b3
The code CreateRootFromCurrentStack needs more (as in, any!) testing within Tamarin itself. Filed that as followup work item in Bug 719784.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
(In reply to Tamarin Bot from comment #8) > changeset: 7171:040cdd2789b3 > user: Felix S Klock II <fklockii@adobe.com> > summary: Bug 719450: stack-isnt-stack bandaid (r=lhansen). > > http://hg.mozilla.org/tamarin-redux/rev/040cdd2789b3 Landed this same changeset in FR main @CL 10236276
The analysis on this bug is not quite right. The problem is that the destructor of GCAutoEnter is trying to do the "we've returned to a quiescent point and the stack is empty so we don't have to worry about stack references to ZCT objects" work. But when there is a suspended stack, it just is not true. The solution is to not to do extra work when leaving a nested GCAutoEnter. The solution is to not do the quiescent work at all until you exit when there are no suspended stacks. Again, the essential issue here is that unlike normal roots, stacks can contain references to objects in the ZCT. The thing we want to do whenever possible is to poof the ZCT when there are no stacks. That was what the code was trying to do, but it had overlooked that it can only do that when exiting GCAutoEnter AND there are no suspended stacks. Whether to do deferred System.gc() work is a separate question, but I would argue that that shouldn't be done either until a complete quiesce point.
(The problem and thus solution need further investigation. Stan had pointed out an alternative solution over email earlier in the week, i.e. to disable collection entirely while there exist suspended stacks. I was and reamin hesitant to adopt such a solution without better understanding the possible threading scenarios at play here. But yes, it bears investigation and will get such when I return to work next week.)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Just to clarify: it's not that you can't run the GC. Of course you can: the GC can run any time it needs to. It's that you can't run the special "we're quiesced and it's a great time to poof the ZCT because no stack variables reference its elements" collection.
(In reply to Stan Switzer from comment #13) > Just to clarify: it's not that you can't run the GC. Of course you can: the > GC can run any time it needs to. It's that you can't run the special "we're > quiesced and it's a great time to poof the ZCT because no stack variables > reference its elements" collection. Thanks for the clarification; yes, I did misinterpret your suggestion when I saw your first e-mail, and did not realize your comment was restricted to ZCT reaping. I'm not convinced that the ZCT is the sole explanation for the crash at this point, because I have disabled DRC in the mms.cfg and still see the crash. (It could well be that the ZCT-handling *is* busted here; I'm just saying that its not the whole story.) I am trying to disentangle the different kinds of stacks in play here, and understand how work is split between the scan in StartIncrementalMark and the scan in FinishIncrementalMark. (It is odd that the scanStack flag is not passed to StartIncrementalMark.) Some changesets of potential interest: * TR 1975 (Bug 489345, comment 15) landed "stack-free" scanning. * Before TR 5953 (Bug 634651, comment 7) landed, we would initially scan all stack and non-stack roots in StartIncrementalMark, regardless of what value one passed for scanStack to Collect; TR 5953 changed things so that the StartIncrementalMark only marked non-stack roots, and saved marking the stack for FinishIncrementalMark. That final stack-mark could be skipped if scanStack is false; I assume that this was assumed sound, but I also suspect the reviewers overlooked the potential for some stack-memory to get scanned before and now not get marked at all.
* TR 2088 (changeset 4c40f2edffbf from Bug 486504, comment 39) was where we switched to not passing scanStack to StartIncrementalMark.
(In reply to Felix S Klock II from comment #15) > * TR 2088 (changeset 4c40f2edffbf from Bug 486504, comment 39) was where we > switched to not passing scanStack to StartIncrementalMark. (but even when we did pass scanStack to StartIncrementalMark, we unconditionally did MarkAllRoots, without filtering out the roots for e.g. the AS2 stack; so TR 2088 may be a little bit of a red herring here; TR 5953 is probably the real point of injection, as fpguinan has said previously.) I still need to wrap my head around this a bit more. (I've confirmed on a private build that passing scanStack=true within ThreadEdgeWork seems to make the original problem go away, but that's applying a sledgehammer when something more nuanced is required.)
(In reply to Felix S Klock II from comment #16) > (I've confirmed on a > private build that passing scanStack=true within ThreadEdgeWork seems to > make the original problem go away, but that's applying a sledgehammer when > something more nuanced is required.) Nuanced as in something like what Stan suggested, perhaps. The fact that the sledgehammer helped at all is a sign that something may indeed have gone wrong in how the two notions of "stack" have been conflated in the code base. Patch B (attachment 590155 [details] [diff] [review]) does not remove the crash in all cases; that shows that the suspended stacks are not solely the problem here (and likewise, my disabling of DRC shows that ZCT not solely the problem either). As noted in comment 14, Bug 634651 is relevant: it divided the GCRoots into initial roots, which are scanned at both the beginning and the end, and stack-like memory, which can be scanned solely at the end of the collection. (In principle any subset of the roots could be set aside for scanning solely at the end; the issue is how much redundant work you want to do up-front in order to give the collector enough roots to trace nearly all of the heap.) From reading the spec for OOM: [FlashPlayer:Out of Memory Handling] https://zerowing.corp.adobe.com/display/FlashPlayer/Out+of+Memory+Handling in particular the section MMGC_GCENTER, one sees the assertion that the point to stack-free scanning is that no GC references can exist beneath the MMGC_GCENTER macro invocation. So far stepping through the code, I wonder about the StackMemory objects hooked onto the ActionScriptStack member of the CorePlayer object. I will look more at this tomorrow.
Summary: roots from CreateRootFromCurrentStack should not be treated as stack (i.e. dont skip them) → StackMemory is not same as native stack; cannot skip in ThreadEdgeWork
The effect of patch C is a generalization of that of patch B (though it takes a much different form than patch B), and seems to properly fix the problem since it handles all non-native stack the same way (rather than treating the suspended stack segments as a special case). This brings the behavior back in line with what we saw before Bug 634651 landed. I believe the root of the problem is the following (natural) mistake in the change to MarkQueueAndStack: http://hg.mozilla.org/tamarin-redux/rev/fb4760cc5b5f#l1.102 to put the non-native stack scan underneath the scanStack condition. Patch C fixes this by lifting the non-native stack scan outside of the scanStack condition, and attempts to clarify the code by alpha-renaming scanStack to scanNativeStack in several places. (Orthogonally: I am not commited to the name 'native stack' for the cpu/linkage/c stack. It was just what I used for the initial work; it won't take much prodding to get me to change it to something else.)
Attachment #591168 - Flags: review?(rulohani)
The thing I would like you to verify is that the code now respects the intent of MMGC_ENTER's (GCAutoEnter's) clean-up work: to find a safe and efficient place (where it is known that no ActionScript is executing and there are no stack references) to blast away zero-ref-count objects. And conversely, when an exit to MMGC_ENTER is not such a place (because of other suspended stacks) then that work almost certainly should NOT be done, but deferred until the suspended stacks are exited as well. Specifically: it would be better not to do inappropriate clean-up work than to make it safe to do the inappropriate thing. I was not able to follow the entire discussion but it seems to have danced around the design intent.
(In reply to Stan Switzer from comment #19) > The thing I would like you to verify is that the code now respects the > intent of MMGC_ENTER's (GCAutoEnter's) clean-up work: to find a safe and > efficient place (where it is known that no ActionScript is executing and > there are no stack references) to blast away zero-ref-count objects. And > conversely, when an exit to MMGC_ENTER is not such a place (because of other > suspended stacks) then that work almost certainly should NOT be done, but > deferred until the suspended stacks are exited as well. Okay. My intution is that such a task is orthogonal to the issue on this ticket (since for me, the crash here is not affected by the presence/absence ZCT reaping), but it would nonetheless be a good thing to check the property you describe. (All I have done so far is isolate a change in behavior introduced by TR 5953 (Bug 634651, comment 7), argue that the change was accidental, and provide a way to bring back something closer to the original intent when the Stack/NonStack root-scanning was introduced in Bug 634651.)
(In reply to Felix S Klock II from comment #18) > Created attachment 591168 [details] [diff] [review] > patch C v1: all nonnative stack roots should not be skipped R+. This looks fine to me. Though I don't understand the need of the 2nd paragraph in the comment. Do you mean to say we should not take the call to MarkStackRoots() out of MarkQueueAndStack()? I am not sure why will there be a need to have a separate loop. At the same time keeping MarkStackRoots() in MarkQueueAndStack() sounds appropriate to me as Stack refers to both native-stack and "stackroots" and based on the flag(scanNativeStack) passed, we mark one or both. May be I did not understand your comment altogether.
Attachment #591168 - Flags: review?(rulohani) → review+
(In reply to Ruchi Lohani from comment #21) > Though I don't understand the need of the 2nd > paragraph in the comment. Do you mean to say we should not take the call to > MarkStackRoots() out of MarkQueueAndStack()? I am not sure why will there be > a need to have a separate loop. That comment is hard to understand in that context. I will move the second paragraph up into FinishIncrementalMark (and then probably revise it once I have moved it there). (Explanation of the comment follows for those who care.) The second paragraph is there because in an earlier draft of the code, I said to myself: "Felix, isn't it silly to restart scanning the whole set of StackMemory+NativeStack on every overflow? If we manage to scan the entirety of the StackMemory without overflow, and subsequently have overflow on just the NativeStack, then we could focus the overflow loop on just rescanning the NativeStack alone." So I earlier split the single loop in FinishIncrementalMark of the abstract form: A(); B(); // (these correspond to the call to MarkQueueAndStack) while (overflow) { overflow = false; HandleOflow; A(); B(); } into two successive loops of the abstract form: A(); while (overflow) { overflow = false; HandleOflow; A(); } B(); while (overflow) { overflow = false; HandleOflow; B(); } (it would still need the first loop over A() because we don't have an upfront bound on the amount of StackMemory, and thus cannot assume that A() will not cause its own stack overflow). My point was to avoid making repeated calls to A() when we have reached the second loop. Other variants on this theme are available, e.g. encoding a state machine here to track progress through A(); B(); I abandoned this variant when I remembered/reread the operation of HandleMarkStackOverflow; it takes time O(heap-size). Trying to optimize this (*last-ditch*) recovery loop is silly. The comment is a reminder to myself others with similar internal dialogues. (But the comment I am now posting here may well serve in its place as the reminder.)
changeset: 7172:bddc7be69e7c user: Felix S Klock II <fklockii@adobe.com> summary: Bug 719450: nonnative stack roots should not be skipped (r=rulohani). http://hg.mozilla.org/tamarin-redux/rev/bddc7be69e7c
(I should now back out patch B, since patch C's effect is a generalization of B's effect.)
changeset: 7173:9b3fe3ac257c user: Felix S Klock II <fklockii@adobe.com> summary: Bug 719450: backout rev 7171:040cdd2789b3 as it is obsoleted by 7172:bddc7be69e7c (r=fklockii). http://hg.mozilla.org/tamarin-redux/rev/9b3fe3ac257c
(adding a regression test for this is probably not reasonable; or rather, I would lump it in with more general testing of the StackMemory functionality; that's related to the sentiment from comment 9.)
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Blocks: 634651
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: