Closed Bug 600139 Opened 14 years ago Closed 14 years ago

Delayed marking can skip marking live objects

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
critical

Tracking

()

VERIFIED FIXED
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: gkw, Assigned: gwagner)

References

Details

(Keywords: assertion, regression, testcase, Whiteboard: [sg:critical?] fixed-in-tracemonkey)

Attachments

(2 files, 1 obsolete file)

var count = 0;

var x = Proxy.create(
  {
    has: function (name) {
      ++count;
      if (count == 320) gczeal(2);
      print(count);
      return name in x;
    }
  }, 
  {}
);

3 in x;

asserts js debug shell on TM changeset 54700fad8cf9 with -m at Assertion failure: addr % sizeof(FreeCell) == 0, when passed as a CLI argument.

s-s because this concerns gczeal.

Many thanks go out to Jesse for helping to reduce this.

autoBisect shows this is probably related to the following changeset: (not sure if it really is related)

The first bad revision is:
changeset:   54483:1c913526c597
user:        Gregor Wagner
date:        Fri Sep 24 10:54:39 2010 -0700
summary:     Bug 558861 - Compartmental GC (r=gal)
blocking2.0: --- → ?
This testcase is fragile.  Small changes can cause it to switch between these two assertions or not happen at all.

* Assertion failure: !JSID_IS_VOID(lastProp->id), at jsscope.h:678
* Assertion failure: addr % sizeof(FreeCell) == 0, at jsgc.h:312

Valgrind sheds no light.
Blocks: JaegerFuzz
Attached file stack trace
It's crashing while trying to create an exception object to report the too-much-recursion condition.  I see some 0xdadadada near the top of the stack.
Summary: JM: "Assertion failure: addr % sizeof(FreeCell) == 0," with gczeal → JM: "Assertion failure: addr % sizeof(FreeCell) == 0," or "Assertion failure: !JSID_IS_VOID(lastProp->id)," with gczeal
Assignee: general → anygregor
What configuration/os did you test with?
Definitely a blocker for final. This is a GC hazard.
(In reply to comment #3)
> What configuration/os did you test with?

32-bit js debug shell on Ubuntu 10.04 is what I tested with..
I also tested with a 32-bit shell (Mac OS X 10.5).
blocking2.0: ? → betaN+
I can reproduce it now on linux. The expected output is a "too much recursion" error right?
Maybe the stack size or loopcount is a hint that something goes wrong in the jit?

On mac the loop count goes always up to 68 (interpreter, -j, -m) and returns.
On linux the loopcount is between 139-142 for interpreter and -j and returns normally. It's about 467 when it crashes with -m.

Jesse I don't see the output numbers for your stack. Are they the same?
The reason why we crash is that we are marking an object that was finalized during the previous GC run. We also hit the delay marking children point very often for each GC right before we crash but the amount of actual objects is also very limited. There is not much stack left for the GC.
We perform a few GC's way above the stack limit and go into the delayMarking mode with the very first object to mark.
Limit: bf87cee1, we start the GC at: 0xbf87c5ec
Maybe we should avoid this and set a minimum amount of stackspace that is needed for a GC. A small check but big security improvement!
Igor what do you think?

But this has not much todo with this bug. I don't know why the methodjit doesn't stop earlier. Maybe somebody else wants to take a look?
(In reply to comment #9)
> We perform a few GC's way above the stack limit and go into the delayMarking
> mode with the very first object to mark.
> Limit: bf87cee1, we start the GC at: 0xbf87c5ec
> Maybe we should avoid this and set a minimum amount of stackspace that is
> needed for a GC.

If delayMarking starts immediately, then we have problems elsewhere as some code allowed to breach the stack limit.

> A small check but big security improvement!

Could you clarify what exactly do you mean by the check and the security improvement?
> > A small check but big security improvement!
> 
> Could you clarify what exactly do you mean by the check and the security
> improvement?

I don't know what the right thing is (perform no GC, report too much recursion...) but we should definitely not start a GC outside of the stack boundaries. We should check this when we start the GC.
(In reply to comment #9)

How much of the contiguous slab is needed for GC? (Why is any needed at all?)
(In reply to comment #11)
> I don't know what the right thing is (perform no GC, report too much
> recursion...) but we should definitely not start a GC outside of the stack
> boundaries. We should check this when we start the GC.

If we call the GC outside the quota, it implies that we have bigger problems to worry about as we could not bound the stack growth and the stack overflow is exploitable on some platforms. So we should just assert at js_GC and a few other places that quota is OK so we can find bugs faster.
Whiteboard: [sg:critical?]
Filed bug 600310 for the GC outside of stack quota problem but the over recursion seems to be another problem that is not GC related.
Assignee: anygregor → general
Spun of the GC-related part as 600310. Need a new owner since this is not GC related and JM-only (stack overflow).
Assignee: general → dvander
(In reply to comment #14)
What exactly is the over recursion problem? JM uses much less stack than js_Interpret, so it makes sense that it'd take longer to reach the limit. And it's not going over the limit.
we're hitting this on the tinderbox in jquery tests.
(In reply to comment #16)
> (In reply to comment #14)
> What exactly is the over recursion problem? JM uses much less stack than
> js_Interpret, so it makes sense that it'd take longer to reach the limit. And
> it's not going over the limit.

Where do you see it's not over the limit? I think its over the limit.
(In reply to comment #17)
> we're hitting this on the tinderbox in jquery tests.

Do you mean bug 583554? They are not related.
The jquery orange is bug 600622.
(In reply to comment #20)
> The jquery orange is bug 600622.

bug 600622 seems to be bug 583554 but this bug is still different.
Depends on: 600310
David you should hit now the assertion from bug 600310. I guess the title is no longer valid?
Yup, thanks.
Summary: JM: "Assertion failure: addr % sizeof(FreeCell) == 0," or "Assertion failure: !JSID_IS_VOID(lastProp->id)," with gczeal → JM: Assertion failure, JS_CHECK_STACK_SIZE in jsgc.cpp
Assertion failure: JS_CHECK_STACK_SIZE(cx->stackLimit, &stackDummy), 
at js/src/jsgc.cpp:2680
Summary: JM: Assertion failure, JS_CHECK_STACK_SIZE in jsgc.cpp → JM: "Assertion failure: JS_CHECK_STACK_SIZE(cx->stackLimit, &stackDummy)" in jsgc.cpp
Summarizing of findings so far:

1. The new assert is not a JM bug. It's somewhat bogus but I don't know how to fix it. The problem is that js_ReportOverRecursed wants to allocate an exception object, and this can cause a GC. JM is correctly throwing an error, and the GC doesn't account for this edge case.

2. The original assert actually went away when bug 589398 landed.

3. Gregor points out that *some* bug still exists, with an optimized, gc-zeal enabled build, we iloop marking a shape which is its own parent.

I've rolled my tree back before bug 589398 and I'm now debugging the original assert. I don't think that bug ever really went away, it's just hiding.
Summary: JM: "Assertion failure: JS_CHECK_STACK_SIZE(cx->stackLimit, &stackDummy)" in jsgc.cpp → JM: GC hazard with gczeal, proxies, recursion
Okay, bug 589398 "fixed" the problem by accidentally disabling the empty script optimization, which made some method JIT where it previously didn't. Will file another bug on this shortly.

It looks like we miss marking a funobj so its slot values get collected, and the next GC *does* mark it and crashes trying to dereference the values.
Aha. The bug is in the delayed marking code, best visualized with a call stack:

#0  js::GCMarker::delayMarkingChildren (this=0xfff83708, thing=0xf7503050) at ../jsgc.cpp:1410
#1  0x080dd196 in TypedMarker (trc=0xfff83708, thing=0xf7503050) at ../jsgcinlines.h:285
#2  0x080e11e2 in Mark<JSObject> (trc=0xfff83708, thing=0xf7503050) at ../jsgcinlines.h:148
#3  0x080dcda8 in MarkObject (trc=0xfff83708, obj=..., name=0x8303fc9 "proto") at ../jsgcinlines.h:184
#4  0x080dcedc in MarkChildren (trc=0xfff83708, fun=0xf7503eb0) at ../jsgcinlines.h:214
#5  0x080e6fa2 in js::gc::Arena<JSFunction>::markDelayedChildren (this=0xf7503000, trc=0xfff83708) at ../jsgc.cpp:1438

We're marking delayed children in arena X, and this recursively causes a wider range of X to need delayed marking. However, markDelayedChildren() clears the |link| field, so the range is reset.

Gregor, I'm not entirely comfortable attempting a fix here, are you willing to take this?
Summary: JM: GC hazard with gczeal, proxies, recursion → JM: Delayed marking can skip marking live objects
Summary: JM: Delayed marking can skip marking live objects → Delayed marking can skip marking live objects
(In reply to comment #27)
> Aha. The bug is in the delayed marking code, best visualized with a call stack:
> We're marking delayed children in arena X, and this recursively causes a wider
> range of X to need delayed marking.

Nice find - the arena should be popped of the delayed marking stack before doing any marking in it.
> 
> We're marking delayed children in arena X, and this recursively causes a wider
> range of X to need delayed marking. However, markDelayedChildren() clears the
> |link| field, so the range is reset.
> 
> Gregor, I'm not entirely comfortable attempting a fix here, are you willing to
> take this?

Uh that's an evil one! Nice work!
Assignee: dvander → anygregor
Attached patch patch (obsolete) — Splinter Review
Comment on attachment 481700 [details] [diff] [review]
patch

>diff -r a4383c16a9b5 js/src/jsgc.cpp
>--- a/js/src/jsgc.cpp	Thu Oct 07 15:35:08 2010 -0500
>+++ b/js/src/jsgc.cpp	Thu Oct 07 16:22:25 2010 -0700
>@@ -1434,23 +1434,26 @@ Arena<T>::markDelayedChildren(JSTracer *
>         thing++;
>     }
> }
> 
> void
> GCMarker::markDelayedChildren()
> {
>     while (Arena<Cell> *a = unmarkedArenaStackTop) {
>+        MarkingDelay *markingDelay = a->getMarkingDelay();
>+        unmarkedArenaStackTop = (markingDelay->link != a)
>+            ? markingDelay->link
>+            : NULL;

markingDelay->link reset must also be moved here. The code should also comment on exactly what is going here noting that Arena<T>::markDelayedChildren reads getMarkingDelay()->start before it can be affected by the following marking.
Attached patch patchSplinter Review
I forgot that we are not pushing the arena on the stack if markingdelay->link is still set. That's some tricky code there.
Thx!
Attachment #481700 - Attachment is obsolete: true
Attachment #481904 - Flags: review?(igor)
Comment on attachment 481904 [details] [diff] [review]
patch

I realized that another possibility to fix this bug is to change the loop in Arena<T>::markDelayedChildren into something like:

    while ((T *)getMarkingDelay()->start <= thingsEnd) {
        if ((T *)getMarkingDelay()->start->asCell()->isMarked())
            MarkChildren(trc, thing);

        (T *)getMarkingDelay()->start++;
    }

This way if the MarkChildren would need to delay marking for the arena, it would just reset the start. This would roughly correspond to what the code was doing before the compartment changes. But it was hard to follow the older logic, so lets stick to a simple solution. 

>diff -r 7b037f68bccf js/src/jsgc.cpp
>--- a/js/src/jsgc.cpp	Thu Oct 07 18:59:18 2010 -0700
>+++ b/js/src/jsgc.cpp	Fri Oct 08 12:56:37 2010 -0700
>@@ -1435,22 +1435,31 @@ Arena<T>::markDelayedChildren(JSTracer *
>     }
> }
> 
> void
> GCMarker::markDelayedChildren()
> {
>     while (Arena<Cell> *a = unmarkedArenaStackTop) {
>         /*
>-         * The following assert verifies that the current arena belongs to the
>-         * unmarked stack, since DelayMarkingChildren ensures that even for
>-         * the stack's bottom, prevUnmarked != 0 but rather points to
>-         * itself.
>-         */
>+        * Get next arena from delayed marking stack. 

Nit: drop this sentence as it just rephrases the code.

>+        * markingDelay->link == current arena indicates last arena on stack.
>+        * If marking gets delayed at the same arena again, the arena is pushed 
>+        * again in delayMarkingChildren. markingDelay->link has to be cleared, 
>+        * otherwise the arena is not pushed again.
>+        */

Nit: it looks like the comment is not properly aligned. 

I also think we should simplify the code and replace that markingDelay->link == current hack with a special sentinel address like 1 to denote the bottom of the stack. But this is for another bug.
Attachment #481904 - Flags: review?(igor) → review+
http://hg.mozilla.org/tracemonkey/rev/187dba1da035
Whiteboard: [sg:critical?] → fixed-in-tracemonkey
Whiteboard: fixed-in-tracemonkey → [sg:critical?] fixed-in-tracemonkey
http://hg.mozilla.org/mozilla-central/rev/187dba1da035
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
JSBugMon: This bug has been automatically verified fixed.
JSBugMon: This bug has been automatically verified fixed.
Wonderful, found another bug in the script ;)
Status: RESOLVED → VERIFIED
Group: core-security
You need to log in before you can comment on or make changes to this bug.