Last Comment Bug 351602 - crash when GCThing exists in both rt and thread local freeLists
: crash when GCThing exists in both rt and thread local freeLists
Status: RESOLVED FIXED
: crash, fixed1.8.1
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: x86 Windows 2000
: -- critical (vote)
: ---
Assigned To: Igor Bukanov
:
Mentors:
Depends on:
Blocks: 348798 js1.7
  Show dependency treegraph
 
Reported: 2006-09-06 15:13 PDT by Mike Moening
Modified: 2006-09-13 22:56 PDT (History)
4 users (show)
mbeltzner: blocking1.8.1+
bob: in‑testsuite-
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Fix v1 (4.68 KB, patch)
2006-09-08 05:19 PDT, Igor Bukanov
brendan: review+
feng.qian.moz: review+
mtschrep: approval1.8.1+
Details | Diff | Splinter Review

Description Mike Moening 2006-09-06 15:13:20 PDT
To test the integrity of the local freeList and against the runtime's freeList simply insert code toward the end of js_NewGCThing() like this:

------------
  checkedGCThing = rt->gcArenaList[flindex].freeList;
  while(thing && checkedGCThing )
  {
     JS_ASSERT(checkedGCThing != thing);
     checkedGCThing = checkedGCThing ->next;
  }
------------

The ASSERT will fire in this scenario.

1) Thread 1 executes and re-fills thread local freeList from runtime freeList
2) Thread 1 pauses for a slight amount of time.
2) Thread 2 executes and call JS_GC()
3) Thread 1 (still active) tries to use GCThings from its local freeList
   which now also exist on the runtime freelist again.
4) Bad things happen...lists are tangled and thread blows chunks.


The problem seems to be that the js_GC() function doesn't take GCThings that have already been allocated to thread local freelists into account OR there is a  small logic bug in this code somewhere.

See code in jc_GC() starting here:
-------------
    /*
     * Free phase.
     * Free any unused arenas and rebuild the JSGCThing freelist.
     */
    for (i = 0; i < GC_NUM_FREELISTS; i++) {
        arenaList = &rt->gcArenaList[i];
   .
   .
---------

If you set a data breakpoint on the GCThing* which is being corrupted in the localfree list it fires in JS_GC(),indicating something isn't quite right in there. Maybe an off by one bug???  Any help or explaination of the logic above would help me greatly in fashioning a patch.

Thanks
Comment 1 Brendan Eich [:brendan] 2006-09-06 18:40:32 PDT
Sorry, this is going to have to go to a gc buddy -- igor, feng, can one of you help?

/be
Comment 2 Mike Moening 2006-09-06 21:23:05 PDT
More info:

My understanding of the GC Free phase is dim, please bear with me.

It seems that the free phase is looping through all free things and using the flagp variable to determine if the thing is really free or not.

flagp is set like this:

   flagp = a->base + offset / sizeof(JSGCThing);

in the case i'm debugging this results in *flagp = 32

Then a few lines later this code is executed:
   thing = (JSGCThing *)(firstPage + offset);

When I look at thing->flagp it is set to NULL. In my case because the GCThing is currently in use (as slots for an object)

Would it be better to look at each individual thing->flagp and check against GCF_FINAL?

Either the variable flagp is wrong or the logic is busted.
It should not be putting things back into the freelist that are still in use.
Comment 3 Igor Bukanov 2006-09-07 01:00:00 PDT
(In reply to comment #0)
> 1) Thread 1 executes and re-fills thread local freeList from runtime freeList
> 2) Thread 1 pauses for a slight amount of time.
> 2) Thread 2 executes and call JS_GC()

Thread2 can not trigger execution of the GC when Thread1 re-fills the free list. This is not only because the refill is done with the GC lock held, but also due to the begin/end request model [1]. So something else corrupts the free list.

(In reply to comment #2)
> When I look at thing->flagp it is set to NULL. In my case because the GCThing
> is currently in use (as slots for an object)

If a live currently-in-use objects end-up on the free list, this indicates a GC hazard bug when the embedding or even SpiderMonkey code itself does not use properly various API to tell the GC about the roots. 

> Would it be better to look at each individual thing->flagp and check against
> GCF_FINAL?

thing->flagp is set to the calculated flagp pointer when the thing is marked as final and free for re-use. At this point GC is free to use the released cell for what-ever internal needs like, for example, optimizing away calculations of the flag pointer when allocating new things. Such recalculation according to bug 333236 comment 15 are expensive. 

[1] http://developer.mozilla.org/en/docs/JS_BeginRequest
Comment 4 Mike Moening 2006-09-07 06:31:04 PDT
>Thread2 can not trigger execution of the GC when Thread1 re-fills the free
>list. This is not only because the refill is done with the GC lock held, but
>also due to the begin/end request model [1]. So something else corrupts the
>free list.

There isn't a thread syncronization problem. The threads are running in serial.
We are also using proper begin/end request model.

Please answer this question.  Should a GCThing be able to exist in the runtime freeList and the thread local freeList at the same time?

I'm assuming no.

Here's a transcript showing the problem inside an HTTP server.
The 1st column is the thread id.
It shows the JS requests as well as GC running. In this example the GC thing at address 0x00eb7310 is the problem.

-----------------------------------------------
2136 Starting New HTTP Request:
2136 starting JS Request
2136 ********** Refilling the local free list with GCThing 0x00eb7310
2136 ending JS Request
2136 Http Request Done
(Thread 2136 does NOT end here. ThreadDestructor callback is NOT run)

1772 Starting New HTTP Request:
1772 starting JS Request
1772 js_gc is running!
1772 Returning GCThing 0x00eb7310 to runtime freelist
1772 js_gc is DONE
1772 ending JS Request
1772 Http Request Done

2136 Starting New HTTP Request:
2136 starting JS Request
2136 ********** Popped thing 0x00eb7310 from LOCAL freeList
2136 js_NewGCThing about to return 0x00eb7310
2136 Assertion failure: THING 0x00eb7310 is in runtime free list too!
-----------------------------------------------

It appears to me that when thread 1772 returns GCThing 0x00eb7310 to runtime freelist it never bothers to remove it from the thread local free list kept by thread 2136. 

There doesn't seem to be anything preventing GC from yanking free things back into the runtime freelist and leaving them orphaned in a thread local freeList for any thread currently not running a JS request.

In otherwords each thread can service multiple HTTP requests and execute many JSScripts (each using a proper begin/end request model).
Comment 5 Igor Bukanov 2006-09-07 07:21:22 PDT
(In reply to comment #4)
> Please answer this question.  Should a GCThing be able to exist in the runtime
> freeList and the thread local freeList at the same time?

No, it should not.

> 
> I'm assuming no.
> 
> Here's a transcript showing the problem inside an HTTP server.
> The 1st column is the thread id.
> It shows the JS requests as well as GC running. In this example the GC thing at
> address 0x00eb7310 is the problem.
> 
> -----------------------------------------------
> 2136 Starting New HTTP Request:
> 2136 starting JS Request
> 2136 ********** Refilling the local free list with GCThing 0x00eb7310
> 2136 ending JS Request
> 2136 Http Request Done
> (Thread 2136 does NOT end here. ThreadDestructor callback is NOT run)
> 
> 1772 Starting New HTTP Request:
> 1772 starting JS Request
> 1772 js_gc is running!
> 1772 Returning GCThing 0x00eb7310 to runtime freelist
> 1772 js_gc is DONE
> 1772 ending JS Request
> 1772 Http Request Done

So far so fine, GC runs, assemble the new global free list and empty all thread local free lists. 

> 
> 2136 Starting New HTTP Request:
> 2136 starting JS Request
> 2136 ********** Popped thing 0x00eb7310 from LOCAL freeList
> 2136 js_NewGCThing about to return 0x00eb7310
> 2136 Assertion failure: THING 0x00eb7310 is in runtime free list too!
> -----------------------------------------------

AFAICS this can happen only if GC does not enumerate all the threads when it empties the free lists. Could you print the value of cx->thread pointer in each of your test cases. Plus do you use context pooling?


> There doesn't seem to be anything preventing GC from yanking free things back
> into the runtime freelist and leaving them orphaned in a thread local freeList
> for any thread currently not running a JS request.

If a thread does not run a request, then GC should empty the free list before JS_BeginRequest returns.
Comment 6 Mike Moening 2006-09-07 07:34:37 PDT
>So far so fine, GC runs, assemble the new global free list and empty all thread
>local free lists. 
I was just investigating this very same thing before your post.
This is where the problem lies.

Bug 345365 added a "Fix to null-defend acx->thread in js_GC"

In short JS_ClearContextThread sets cx->thread to NULL.
So the line of code:

   memset(acx->thread->gcFreeLists, 0, sizeof acx->thread->gcFreeLists);

is not being run and the thread local freeList remains intact.

As a general question...Why could you want to to this? Wouldn't it be better to wipe the thread local free list in the threads destructor callback so that a thread running many scripts won't have to continually "re-fill" its local freelist?   The extra trashing moving GCThings between runtime and thread local freeLists seems sub-optimal.  Then you just have to make JS_GC smarter so as not to mess with thread local freelists.  Iterating over ALL the GCThings seems sub-optimal too.
Comment 7 Mike Moening 2006-09-07 08:11:42 PDT
A simple fix would be to add:

    memset(cx->thread->gcFreeLists, 0, sizeof cx->thread->gcFreeLists);

At the top of js_ClearContextThread()

However, this doesn't address the sub-optimal behavior I spoke about..
Comment 8 Igor Bukanov 2006-09-07 08:18:27 PDT
(In reply to comment #6)
> In short JS_ClearContextThread sets cx->thread to NULL.
> So the line of code:
> 
>    memset(acx->thread->gcFreeLists, 0, sizeof acx->thread->gcFreeLists);
> 
> is not being run and the thread local freeList remains intact.

Good, the bug is clear.

> As a general question...Why could you want to to this? Wouldn't it be better to
> wipe the thread local free list in the threads destructor callback so that a
> thread running many scripts won't have to continually "re-fill" its local
> freelist?   The extra trashing moving GCThings between runtime and thread local
> freeLists seems sub-optimal.  Then you just have to make JS_GC smarter so as
> not to mess with thread local freelists. 

GC wants to mess with local free list when it detects that the whole GC arena can be released as it consists of only free things. AFAICS to support this GC would need to scan the free list to check if its members belong to the arena. That is too much complexity with less and less returns. 

> Iterating over ALL the GCThings seems sub-optimal too.

Yes, but GC in SpiderMonkey is not known to be the fastest on the planet ;)
Comment 9 Mike Moening 2006-09-07 08:27:30 PDT
>GC wants to mess with local free list when it detects that the whole GC arena
>can be released as it consists of only free things. AFAICS to support this GC
>would need to scan the free list to check if its members belong to the arena.
>That is too much complexity with less and less returns. 

Not really.   Can't we just use another bit in the flagp mask to indicate that the GCThing is part of of a local free list?

>> Iterating over ALL the GCThings seems sub-optimal too.
>Yes, but GC in SpiderMonkey is not known to be the fastest on the planet ;)

I thought that's why we switched over to thread local freeLists in the first place...for speed!  I guess its only faster within a begin/end request pair since it doesn't have to got back to the runtime.
So how much faster is this model. Do you have benchmarks?  I'm just curious.

Comment 10 Mike Moening 2006-09-07 08:31:51 PDT
In addition to my patch idea...I think it would be necessary to call     JS_ClearContextThread() within the beginRequest endRequest pair.

Do you agree?
Comment 11 Igor Bukanov 2006-09-07 08:37:20 PDT
(In reply to comment #9)
> Can't we just use another bit in the flagp mask to indicate that
> the GCThing is part of of a local free list?

There is no flags left there to use! Plus a flag would just indicate that the thing is on some free list, but it would not tell on which one preventing fast release of the arena. And you do want to release the memory back to the system as soon as possible. 
 
> I thought that's why we switched over to thread local freeLists in the first
> place...for speed!  

See bug 312238 for benchmarks.

> I guess its only faster within a begin/end request pair
> since it doesn't have to got back to the runtime.
> So how much faster is this model. Do you have benchmarks?  I'm just curious.

You mean benchmarks for begin/end request model? You have to ask Brendam ;)

Comment 12 Igor Bukanov 2006-09-07 08:48:30 PDT
(In reply to comment #10)
> In addition to my patch idea...I think it would be necessary to call    
> JS_ClearContextThread() within the beginRequest endRequest pair.
> 
> Do you agree?

I  am not sure about js_ClearContextThread(), but the thread-local list has to be cleared when the thread can no longer be reachable via any context.
Comment 13 Brendan Eich [:brendan] 2006-09-07 09:44:43 PDT
(In reply to comment #12)
> (In reply to comment #10)
> > In addition to my patch idea...I think it would be necessary to call    
> > JS_ClearContextThread() within the beginRequest endRequest pair.

No, that should not be necessary.

It's only when the GC can't find the JSThread, because you've returned all contexts that point to it to the context pool, that we have this problem.

It should suffice to have js_ClearContextThread zero thread->gcFreeLists when when the reference count (implicit right now, needs to be explicit; but doesn't need locking) tallying all cx->thread references to this JSThread goes to zero.

> I  am not sure about js_ClearContextThread(), but the thread-local list has to
> be cleared when the thread can no longer be reachable via any context.

What Igor said, IOW.

/be
Comment 14 Brendan Eich [:brendan] 2006-09-07 09:50:22 PDT
(In reply to comment #11)
> > I guess its only faster within a begin/end request pair
> > since it doesn't have to got back to the runtime.
> > So how much faster is this model. Do you have benchmarks?  I'm just curious.
> 
> You mean benchmarks for begin/end request model? You have to ask Brendam ;)

See bug 54743.

/be

Comment 15 Brendan Eich [:brendan] 2006-09-07 09:56:32 PDT
(In reply to comment #14)
> See bug 54743.

Long bug, so skip to bug 54743 comment 94 for speedup results.

/be

Comment 16 Mike Moening 2006-09-07 11:29:16 PDT
Igor,
Do you want to attempt a patch based on Brendan's comment #13?
I'm sure you would do a better job then me.

My JS source is pretty messed up from all the debugging I've been doing.
My CVS access is through a 3rd party too.
Thanks! 
Comment 17 Igor Bukanov 2006-09-07 13:36:44 PDT
(In reply to comment #16)
> Igor,
> Do you want to attempt a patch based on Brendan's comment #13?

Yes, I will do it either today or tomorrow.
Comment 18 Igor Bukanov 2006-09-08 05:19:37 PDT
Created attachment 237322 [details] [diff] [review]
Fix v1

The patch cleanups the thread-local gc free lists each time js_SetContextThread is executed on thread without active contexts. The patch also cleanups jsgc.c under assumption that cx->thread must be valid for any context that executes GC.
Comment 19 Brendan Eich [:brendan] 2006-09-08 09:24:10 PDT
Comment on attachment 237322 [details] [diff] [review]
Fix v1

Looks good, thanks.

/be
Comment 20 Feng Qian (Google) 2006-09-11 15:37:38 PDT
(In reply to comment #18)
> Created an attachment (id=237322) [edit]
> Fix v1
> 
> The patch cleanups the thread-local gc free lists each time js_SetContextThread
> is executed on thread without active contexts. The patch also cleanups jsgc.c
> under assumption that cx->thread must be valid for any context that executes
> GC. 

Thanks, Igor and Mike. I was away from my computer in last two weeks.

Why not always clear gcFreeList in js_ClearContextThread (as in the DEBUG mode) when a thread's context list becomes empty? Although it may do some unnecessary work when a thread is not reused. It will make the code easier to read.

 


> 
Comment 21 Igor Bukanov 2006-09-11 16:10:04 PDT
(In reply to comment #20)
> Why not always clear gcFreeList in js_ClearContextThread (as in the DEBUG mode)
> when a thread's context list becomes empty? Although it may do some unnecessary
> work when a thread is not reused. It will make the code easier to read.

Then you would need to clean it twice, first during the initialization and second during the cleanup. The patch and that DEBUG-only code emphasis that contextless thread should not ever access the free lists. Yes, it is ugly, but with the forthcoming valgrind annotations from bug 348798 the ugliness is inevitable AFAICS.
Comment 22 Mike Moening 2006-09-11 19:23:59 PDT
>Why not always clear gcFreeList in js_ClearContextThread (as in the DEBUG mode)
>when a thread's context list becomes empty? Although it may do some unnecessary
>work when a thread is not reused. It will make the code easier to read.

I tested both the suggestion above and the patch.  Both work fine.
Let's land this thing.
Comment 23 Feng Qian (Google) 2006-09-12 10:18:54 PDT
Comment on attachment 237322 [details] [diff] [review]
Fix v1

Looks good, thanks.
Comment 24 Brendan Eich [:brendan] 2006-09-12 10:52:58 PDT
Comment on attachment 237322 [details] [diff] [review]
Fix v1

Igor, are you able to land this today?  I can do it if necessary.

/be
Comment 25 Feng Qian (Google) 2006-09-12 11:05:07 PDT
(In reply to comment #21)
> (In reply to comment #20)
> > Why not always clear gcFreeList in js_ClearContextThread (as in the DEBUG mode)
> > when a thread's context list becomes empty? Although it may do some unnecessary
> > work when a thread is not reused. It will make the code easier to read.
> 
> Then you would need to clean it twice, first during the initialization and
> second during the cleanup. The patch and that DEBUG-only code emphasis that
> contextless thread should not ever access the free lists. Yes, it is ugly, but
> with the forthcoming valgrind annotations from bug 348798 the ugliness is
> inevitable AFAICS.

Igor & Brendan, can I have access to bug 348798? I am curious what the bug is.
Comment 26 Igor Bukanov 2006-09-12 13:33:12 PDT
I committed the patch from comment 18 to the trunk:

Checking in jscntxt.c;
/cvsroot/mozilla/js/src/jscntxt.c,v  <--  jscntxt.c
new revision: 3.91; previous revision: 3.90
done
Checking in jscntxt.h;
/cvsroot/mozilla/js/src/jscntxt.h,v  <--  jscntxt.h
new revision: 3.117; previous revision: 3.116
done
Checking in jsgc.c;
/cvsroot/mozilla/js/src/jsgc.c,v  <--  jsgc.c
new revision: 3.173; previous revision: 3.172
done
Comment 27 Mike Schroepfer 2006-09-12 16:20:30 PDT
Comment on attachment 237322 [details] [diff] [review]
Fix v1

a=schrep for 18drivers.
Comment 28 Igor Bukanov 2006-09-13 03:02:01 PDT
I committed the patch from comment 18 to MOZILLA_1_8_BRANCH:

Checking in jscntxt.c;
/cvsroot/mozilla/js/src/jscntxt.c,v  <--  jscntxt.c
new revision: 3.60.4.8; previous revision: 3.60.4.7
done
Checking in jscntxt.h;
/cvsroot/mozilla/js/src/jscntxt.h,v  <--  jscntxt.h
new revision: 3.80.4.12; previous revision: 3.80.4.11
done
Checking in jsgc.c;
/cvsroot/mozilla/js/src/jsgc.c,v  <--  jsgc.c
new revision: 3.104.2.19; previous revision: 3.104.2.18
done

Note You need to log in before you can comment on or make changes to this bug.