Last Comment Bug 707051 - Failure to mark some shapes during delayed marking
: Failure to mark some shapes during delayed marking
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla11
Assigned To: [PTO to Dec5] Bill McCloskey (:billm)
: Jason Orendorff [:jorendorff]
Depends on:
  Show dependency treegraph
Reported: 2011-12-01 18:13 PST by [PTO to Dec5] Bill McCloskey (:billm)
Modified: 2012-03-23 13:30 PDT (History)
11 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---

fix (2.73 KB, patch)
2011-12-01 18:13 PST, [PTO to Dec5] Bill McCloskey (:billm)
igor: review+
akeybl: approval‑mozilla‑aurora-
akeybl: approval‑mozilla‑beta-
Details | Diff | Splinter Review
fix (6.74 KB, patch)
2011-12-06 15:29 PST, [PTO to Dec5] Bill McCloskey (:billm)
igor: review+
akeybl: approval‑mozilla‑aurora+
akeybl: approval‑mozilla‑beta-
Details | Diff | Splinter Review

Description [PTO to Dec5] Bill McCloskey (:billm) 2011-12-01 18:13:52 PST
Created attachment 578462 [details] [diff] [review]

When we do delayed marking, we call JS_TraceChildren on each marked object in the arena. It calls MarkChildren. When given a shape, MarkChildren fails to set the marked bit for its parent shape--it just walks up the shape tree. I stumbled upon this while doing incremental GC, but it suggests that we need to do better testing of delayed marking--probably via bug 673551.

In the patch, I had to change the return type of previous() so that the write barrier checks don't complain.
Comment 1 Igor Bukanov 2011-12-02 02:47:13 PST
Comment on attachment 578462 [details] [diff] [review]

Review of attachment 578462 [details] [diff] [review]:

Comment in ScanShape before shape = shape->previous(); why the loop in that function is essential to stop unbounded recursion.
Comment 2 David Mandelin [:dmandelin] 2011-12-02 11:35:47 PST
Could this have been the cause of some of our GC topcrashes?
Comment 3 [PTO to Dec5] Bill McCloskey (:billm) 2011-12-02 14:31:32 PST
Comment on attachment 578462 [details] [diff] [review]

I think I broke this in Firefox 5. So I don't think it's causing the topcrashes that we examined a long time ago, but it may be causing some current topcrashes.

It would be nice to take this patch, since it seems pretty exploitable. I'm a bit worried about regression risks, though. This is the same path that the cycle collector takes. So we were probably missing a bunch of shapes from the CC graph, and now we won't be. Andrew, do you know if that's likely to be a performance problem? We often have a lot of shapes.
Comment 4 Andrew McCreight [:mccr8] 2011-12-02 20:02:42 PST
Shapes are not added to the CC graph.  See AddToCCKind.  Instead of adding the graph, the CC traces through the children to find non-shape nodes.  Potentially this could cause the CC to blow the stack, but if this only came into play in 5 maybe it is okay.  It could also cause slowdown.

I don't know much about shapes.  Do they point from parent to child as well as from child to parent?  Are there likely to be a lot of shapes that are reachable from gray nodes but not black ones?
Comment 5 Brendan Eich [:brendan] 2011-12-02 21:10:55 PST
Shapes point to parent (previous) and child or children, but the latter linkage should not matter here.

Comment 6 Andrew McCreight [:mccr8] 2011-12-03 05:02:29 PST
Well, my main worry is that the combination of two creates doubly-linked trees, which tend to embiggen the CC graph, in cases like the DOM (with the addition of strong parent pointer) and the JS global.

If you want to measure the effect on the graph, bill, you can either count how many shapes are being marked gray, or, more directly, go into NoteJSRoot in nsXPConnect.cpp and add something like

if (kind == JSTRACE_SHAPE)

in the else if (kind != JSTRACE_STRING) case.  Initialize myNewGlobalCounter in BeginCycleCollection and print it out in EndCycleCollection in that file.
Comment 7 Brendan Eich [:brendan] 2011-12-03 21:17:57 PST
Shapes can participate in cycles only if they are for accessor (getter/setter) properties or methods (function-valued properties memoized/shared via shapes). These are relatively uncommon across all object populations, more common among prototype objects. Can you take advantage of this sparseness? You shouldn't have to reify peer CC data structures for all shapes, just the ones that might be suspects.

Comment 8 Andrew McCreight [:mccr8] 2011-12-04 06:34:53 PST
No shapes are reified, they are just recursively scanned for non-shapes.  Only non-shapes are added to the graph.  But yes, it sounds like we could avoid even doing that a lot of the time.  The cycle collector does not take advantage of any preknowledge about the shape of the JS heap, except that strings (even compound ones) are never cyclic.

Ultimately, I'd like to push the CC's traversal of the GC heap inside the GC heap, if only to avoid the virtual call in XPCOM->XPConnect->JS->XPConnect->XPCOM dance we currently have to do to take single step in the CC (bug 653013).  Once that is done we could be smarter about stuff.

Peterv has some thoughts on shapes in the CC graph in bug 651445, mainly concerns about not adding them.
Comment 9 [PTO to Dec5] Bill McCloskey (:billm) 2011-12-04 17:20:26 PST

It doesn't sound like this will screw anything up. As Brendan said, we don't trace the children of a shape. And since shapes aren't actually put in the CC graph, we only have to worry about shapes further up the parent chain that point to objects via accessors. It seems pretty unlikely that this would have a noticeable performance impact. So I think the regression risk is low. And, as I said, the bug is pretty serious.
Comment 11 Alex Keybl [:akeybl] 2011-12-05 13:34:44 PST
Comment on attachment 578462 [details] [diff] [review]

[Triage Comment]
Please re-nominate once this has landed on m-c.
Comment 12 [PTO to Dec5] Bill McCloskey (:billm) 2011-12-05 18:59:51 PST
This causes a stack overflow in the cycle collector. Here is what NoteJSChild in nsXPConnect.cpp looks like:

static void
NoteJSChild(JSTracer *trc, void *thing, JSGCTraceKind kind)
    if (AddToCCKind(kind)) {
    } else if (kind != JSTRACE_STRING) {
        JS_TraceChildren(trc, thing, kind);

Since AddToCCKind(JSTRACE_SHAPE) is false, we call JS_TraceChildren. This can blow up the stack when traversing long parent chains.

It looks like the previous version of the code was doing exactly what the cycle collector needed: it was skipping over the parent shape itself, and just tracing through the shape's children. However, this is not what we need for delayed marking (this is the aspect I forgot in our conversation today, Andrew; it's still a serious bug).

I'm thinking of adding a flag to JSTracer to say which behavior is desired. This is pretty ugly, but there aren't a lot of good alternatives. We can detect a stack overflow in NoteJSChild, but what do we do then? Aborting the CC seems like a bad idea, since it would cause us to leak. Andrew and I talked about adding shapes to the CC graph selectively, when we run out of stack space for them. This seems really dangerous, though. There may be other parts of the CC that break if some shapes are in the graph and not others.

Peterv, Igor, Andreas, do you guys have any good ideas? It would be nice if whatever we come up with is safe enough to land in aurora and beta. That's why I like the flag idea.
Comment 13 Andrew McCreight [:mccr8] 2011-12-05 19:20:25 PST
I think the safest thing to do in the short term is to add a flag (or repurpose the one I already added with some kind of enum) that leaves the behavior alone except for the delayed marker.  There's a delicate symbiosis between the current behavior of JS_TraceChildren and the CC that seems to work.  In the long term, of course, having even more ways to trace through JS seems very future-unproof.
Comment 14 Igor Bukanov 2011-12-06 04:24:15 PST
We need to eliminate the tail recursion over the shape.parent. A simplest solution for that would be to provide a function like 

Shape *TraceShapeChildren(JSTracer *trc, Shape *shape)

that would not call the tracer on the shape but rather return it so NoteJSChild can have:

    } else if (kind == JSTRACE_SHAPE) {
        while (!!(thing = JS_TraceChildren(trc, (Shape *) thing));
Comment 15 [PTO to Dec5] Bill McCloskey (:billm) 2011-12-06 15:29:13 PST
Created attachment 579485 [details] [diff] [review]

I like Igor's idea. This patch doesn't die on tinderbox the way the previous one did.
Comment 16 Igor Bukanov 2011-12-06 16:10:05 PST
Comment on attachment 579485 [details] [diff] [review]

Review of attachment 579485 [details] [diff] [review]:

::: js/src/jsgcmark.cpp
@@ +983,5 @@
>  MarkChildren(JSTracer *trc, const Shape *shape)
>  {
> +    MarkShapeChildrenAcyclic(trc, shape);
> +    if (shape->previous())
> +        MarkShape(trc, shape->previous(), "parent");

This should use MarkShapeChildrenAcyclic result instead of accessing shape->previous() one more time.
Comment 17 [PTO to Dec5] Bill McCloskey (:billm) 2011-12-06 16:57:47 PST
(In reply to Igor Bukanov from comment #16)
> This should use MarkShapeChildrenAcyclic result instead of accessing
> shape->previous() one more time.

The problem is that MarkShape wants a MarkablePointer and MarkShapeChildrenAcyclic just returns a Shape. I could use MarkShapeUnbarriered, but I'm trying to limit the number of those.
Comment 18 Igor Bukanov 2011-12-06 17:08:54 PST
(In reply to Bill McCloskey (:billm) from comment #17)
> The problem is that MarkShape wants a MarkablePointer and
> MarkShapeChildrenAcyclic just returns a Shape. I could use
> MarkShapeUnbarriered, but I'm trying to limit the number of those.

Then comment why the result of MarkShapeChildrenAcyclic is ignored.

Also, we need comments in MarkShapeChildrenAcyclic that tracing via BaseShape link shape->base()->baseUnowned() stops right at that point so MarkShapeChildrenAcyclic is truly O(1) in space.
Comment 19 [PTO to Dec5] Bill McCloskey (:billm) 2011-12-07 09:54:39 PST

I did my best to write a good comment.
Comment 20 Igor Bukanov 2011-12-07 10:43:14 PST
(In reply to Bill McCloskey (:billm) from comment #19)
> I did my best to write a good comment.

The comment is very clear, thanks!
Comment 21 Ian Melven :imelven 2011-12-08 11:51:39 PST
Via ed morley:

marking fixed with target milestone of mozilla11
Comment 22 [PTO to Dec5] Bill McCloskey (:billm) 2011-12-12 11:09:58 PST
Comment on attachment 579485 [details] [diff] [review]

This seems to have stuck. This patch is more conservative than the last one in that it shouldn't alter the cycle collector behavior at all. I think the regression risk should be low.
Comment 23 Alex Keybl [:akeybl] 2011-12-12 14:58:47 PST
Comment on attachment 579485 [details] [diff] [review]

[Triage Comment]
Having discussed with curtisk and dveditz, it sounds like this issue would not cause us to chemspill FF9 or otherwise mitigate (the bar we're setting for beta at this point), and the code that's already landed on m-c does not point directly at an exploit.

Let's take this on aurora only.
Comment 24 [PTO to Dec5] Bill McCloskey (:billm) 2011-12-12 16:12:53 PST
Seems reasonable.
Comment 25 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-12-28 13:28:55 PST
Is this something QA can verify?
Comment 26 [PTO to Dec5] Bill McCloskey (:billm) 2011-12-28 13:31:58 PST
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #25)
> Is this something QA can verify?

No. We might be able to write a test if we could artificially induce delayed marking. That work in going on in bug 673551.

Note You need to log in before you can comment on or make changes to this bug.