Closed Bug 345365 Opened 17 years ago Closed 17 years ago

JS_ClearContextThread leaves dangling pointers in JSContext

Categories

(Core :: JavaScript Engine, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla1.8.1

People

(Reporter: MikeM, Assigned: brendan)

Details

(Keywords: fixed1.8.1, Whiteboard: [schrep-181approval pending])

Attachments

(3 files, 2 obsolete files)

js_GC() crashes on following line when acx->thread is null.
    memset(acx->thread->gcFreeLists, 0, sizeof acx->thread->gcFreeLists);

js_NewGCThing() corrupts memory on the following line if flagp is not initialized (by default it contains a random memory address)
    /* We can't fail now, so update flags and rt->gc{,Private}Bytes. */
    *flagp = (uint8)flags;
The flagp member needs to be initialized to 0 and checked before setting here.

Both the crash on the corruption happen when the engine is thrashed with multiple threads on a dual CPU machine.  Simple testing for null fixes the problems. Memory checking with boundschecker comes up clean after the fixes are applied.

A patch is on the way.

Oh yeah...also a simple fix for VC 6.0 build problem with JS_STATIC_ASSERT() inside a static function (this change may not be acceptable to all since it doesn't address the root "typedef" problem with JS_STATIC_ASSERT)
Attached patch patch for corruption and crash (obsolete) — Splinter Review
Please review
Attachment #229995 - Flags: review?(brendan)
(In reply to comment #0)
> js_GC() crashes on following line when acx->thread is null.
>     memset(acx->thread->gcFreeLists, 0, sizeof acx->thread->gcFreeLists);

An initialized context shouldn't have a null thread field, I doubt somewhere else causes the problem. 

> 
> js_NewGCThing() corrupts memory on the following line if flagp is not
> initialized (by default it contains a random memory address)
>     /* We can't fail now, so update flags and rt->gc{,Private}Bytes. */
>     *flagp = (uint8)flags;
> The flagp member needs to be initialized to 0 and checked before setting here.

I couldn't find a path leading to this dereference without initializing flagp :(. 

> Both the crash on the corruption happen when the engine is thrashed with
> multiple threads on a dual CPU machine.  Simple testing for null fixes the
> problems. Memory checking with boundschecker comes up clean after the fixes are
> applied.
> 
> A patch is on the way.
> 
> Oh yeah...also a simple fix for VC 6.0 build problem with JS_STATIC_ASSERT()
> inside a static function (this change may not be acceptable to all since it
> doesn't address the root "typedef" problem with JS_STATIC_ASSERT)
> 
Comment on attachment 229995 [details] [diff] [review]
patch for corruption and crash

As feng said, there is no path reaching label success: that fails to set flagp.  And cx->thread is set by js_NewContext or js_SetContextThread without fail.  If you find it null, then the context is corrupt.

Mike, before patching with null checks we need reproducible test cases or at least a theory of how the tested pointer became null.  There's no way for flagp to become null under control of js_NewGCThing, so unless the local variables are being corrupted (stack overwrite from something called from js_NewGCThing), we have no theory of the crime to test.

/be
Attachment #229995 - Flags: review?(brendan) → review-
Mike: without the patch, what does boundschecker say?  Can you run purify?

/be
mike: are you using the js request model (JS_BeginRequest/JS_EndRequest)?
>Mike, before patching with null checks we need reproducible test cases or at
>least a theory of how the tested pointer became null.  There's no way for flagp
>to become null under control of js_NewGCThing, so unless the local variables
>are being corrupted (stack overwrite from something called from js_NewGCThing),
>we have no theory of the crime to test.

I understand.  I'll try to find some evidence.  I realize it makes no sense to fix the symtoms!  Do you have any good advice in finding corruption in spidermonkeys memory allocation scheme? My lack of understanding with JS certaining going to hamper me here.

>mike: are you using the js request model (JS_BeginRequest/JS_EndRequest)?
Yes.  I went over the model I'm using with Brendan in another test app. He said it was good.

>As feng said, there is no path reaching label success: that fails to set flagp.
> And cx->thread is set by js_NewContext or js_SetContextThread without fail. 
>If you find it null, then the context is corrupt.

I think you are wrong here.
We maintain a pool of contexts. Upon thread destruction, after GC/MaybeGC and just before returning the context to the pool, we call JS_ClearContextThread().
When the app terminates we destroy the runtime. 
At that point each cx should have a NULL for the thread member.
In that case the final GC call will crash.

(In reply to comment #7)
> >As feng said, there is no path reaching label success: that fails to set flagp.
> > And cx->thread is set by js_NewContext or js_SetContextThread without fail. 
> >If you find it null, then the context is corrupt.
> 
> I think you are wrong here.
> We maintain a pool of contexts. Upon thread destruction, after GC/MaybeGC and
> just before returning the context to the pool, we call JS_ClearContextThread().
> When the app terminates we destroy the runtime. 
> At that point each cx should have a NULL for the thread member.
> In that case the final GC call will crash.

I am a little bit confused here. JS_ClearContextThread is called nowhere. js_CleanContextThread is only called in js_DestroyContext, and right before freeing the JSContext object, code snippet here:

void
js_DestroyContext(JSContext *cx, JSGCMode gcmode)
{
....

#ifdef JS_THREADSAFE
  js_ClearContextThread(cx);
#endif
  
  /* Finally, free cx itself. */
  free(cx);
}

Unless GC happens between these two statements, a valid context should not have a null thread field. 

Mike, as Brenden said, a reproducible testcase would be very helpful to diagnose the problem.

Sorry my comments weren't 100% accurate or clear.

We are doing the JS_ClearContextThread() before we return each context back into the pool of "free" contexts (available for use on some new thread).
When all threads are stopped each context in the free pool will have cx->thread set to NULL.

Then as part of shutting down we execute the following code:
---------------
oCxIter = g_oFreeContextsList.begin();
while(oCxIter != g_oFreeContextsList.end())
{
   pContext = (*oCxIter);
   JS_SetContextThread(pContext);
   JS_SetGlobalObject(pContext, NULL);
   JS_GC(pContext);
   JS_DestroyContext(pContext);
   ++oCxIter;
}
-------------------

In this case when we do JS_GC() on the first context it apparently iterates ALL the contexts now (see below).  If there are two contexts still alive then the second context will have NULL for its thread member since we do GC for each context one at a time.

while ((acx = js_ContextIterator(rt, JS_FALSE, &iter)) != NULL) {
   if (acx->thread == cx->thread)
        continue;
   memset(acx->thread->gcFreeLists, 0, sizeof acx->thread->gcFreeLists);
}
Also if there was 1 thread running GC (outside of a request) and 1 or more in the free pool (with cx->thread NULL) couldn't this problem crop up then too?
This seems possible.
OK. I tested this scenario and it DOES cause a crash too.

I'd like to venture out on the branch a bit further.
I know this is a stretch but here goes...

If every time you run JS_GC() it zeros gcFreeLitss for for every other context:
   memset(acx->thread->gcFreeLists, 0, sizeof acx->thread->gcFreeLists);

Wouldn't that also cause my null pointer problem with the flagp member in js_newGCThing()?
See code snipit from js_NewGCThing() below:

------------------------------------
flbase = cx->thread->gcFreeLists;
JS_ASSERT(flbase);
thing = flbase[flindex];
localMallocBytes = cx->thread->gcMallocBytes;
if (thing && rt->gcMaxMallocBytes - rt->gcMallocBytes > localMallocBytes) {
   flagp = thing->flagp;
-------------------------------------
Based on what is happeing inside JS_GC() it seems possible that thread->gcFreeLists may be null in this code?
Let me know if I've pushed this theory too far. :-)
I probably have!
Mike: I'm a big dummy -- JS_ClearContextThread of course can leave a context in the runtime's list with null thread member.  Thanks for pointing this out.

On the other question, clearing thread->gcFreeLists is harmless, because js_NewGCThing will see the null pointer to a JSGCThing and go to the runtime's arena list.  Again, the control flow in js_NewGCThing ensures that flagp is set before used.  If you think a freelist entry (JSGCThing struct) has a null flagp, that would be a different bug -- some kind of memory corruption -- and not what memset'ing thread->gcFreeLists to zero could cause.  Different storage.

/be
Assignee: general → brendan
Attachment #229995 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #230169 - Flags: review?(feng.qian.moz)
Attachment #230169 - Flags: review?(feng.qian.moz) → review+
>Mike: I'm a big dummy -- JS_ClearContextThread of course can leave a context in
>the runtime's list with null thread member.  Thanks for pointing this out.

No problem.  BTW..Your probably one of the smarter guys I know.
FYI  Your patch works great. That crash is gone now.

>If you think a freelist entry (JSGCThing struct) has a null
>flagp, that would be a different bug -- some kind of memory corruption -- and
>not what memset'ing thread->gcFreeLists to zero could cause.  Different
>storage.

With regards to corruption... I'm starting to think my flagp problem and thread deadlock are related. If memory is being corrupted it may manifest itself in several different ways. Did you ever get that corruption tester I sent to compile?
OS: Windows 2000 → All
Priority: -- → P1
Hardware: PC → All
Target Milestone: --- → mozilla1.8.1beta2
(In reply to comment #14)
> With regards to corruption... I'm starting to think my flagp problem and thread
> deadlock are related. If memory is being corrupted it may manifest itself in
> several different ways. Did you ever get that corruption tester I sent to
> compile?

I did, but I haven't had time to try building it on Windows.  Why don't you attach it here so others can test it too?  Thanks.

/be
Attached file Tester app for crash (obsolete) —
This attachment can be used for debugging this crash as well as other general stability/corruption issues I'm seeing.  For debugging the crash just uncomment line 420 so it stops generating new threads.

The files **SHOULD** work on WIN32 or Linux.  I've attached a VC 6.0 project file for WIN32.
The exe will create a bunch of threads and run the a static script over once for each thread until you hit a key on the keyboard. Then it shuts down.
In WIN32, if the process hangs the main thread will dump a stack trace for each thread still active.  For linux builds just ignore the StackWalk.* files.
For WIN32 builds it also requires the dbghelp.dll. Please use version 6.4.7.1 or later. (I couldn't upload it because it made the zip file too big)

Dependancies for WIN32 build include NSPR build for WINNT target (NOT WIN95)
The following enviroment variable should be created too:
NSPR_NATIVE_THREADS_ONLY = 1
Note you must logout/login for the env var change to take affect on windows.

Build JS with the latest CVS and obviously JS_THREADSAFE defined.
Please let me know if you see anything weird.

Note:  defining JS_USE_ONLY_NSPR_LOCKS when building JS seems to help quite a bit on the hanging problem I've been seeing.

If you have a debugging tool such as boundschecker/purify please use it.

Thanks!
JS 1.7 is still corrupting itself.

The line below is from the VC 8.0 compiler in debug mode.

HEAP[CorruptionTester.exe]: HEAP: Free Heap block 231bac8 modified at 231baf0 after it was freed

I noticed that if you just keep creating new JSContexts and executing it works fine (but dies on shutdown).  The heap corruption occurs once you start re-using contexts from the pool.

I've included a VC 8.0 project.  
FYI:  I linked in a debug build of JS under the name js32d.dll.

The GC heap corruption is way over my head.  I need your help here guys!
Attachment #230196 - Attachment is obsolete: true
We don't know who is corrupting "itself" yet -- could be anything, including the thread pool code, although it looked ok last time I read it.

I don't have Windows handy, and I'm even shorter on time.  Suggestion: identify who is setting acx->thread null while the GC is running over the runtime's context list.

/be
>I don't have Windows handy
You don't need windows. You should be able to compile this and run on Linux too.  I put in pthreads for linux.  Leave out the call stackdump files and try it...

>We don't know who is corrupting "itself" yet
There is very little code in there other than JS calls. The pooling is sound.
I've tested on other JS versions without issues.

Mike, did you miss the part where I said I'm even shorter on time?

/be
So we know that the thread pool code is clearing acx->thread via JS_ClearContextThread, and that's ok -- and js_GC now defends with a null test.

So what is the remaining problem?  It can't be flagp being badly maintained in js_NewGCThing, since it is well-maintained.  If it seems corrupted, then the stack is corrupted and you need a memory safety tool (purify or valgrind).

/be
(In reply to comment #17)
> Created an attachment (id=235000) [edit]
> Updated tester app for heap corruption
> 
> JS 1.7 is still corrupting itself.
> 
> The line below is from the VC 8.0 compiler in debug mode.
> 
> HEAP[CorruptionTester.exe]: HEAP: Free Heap block 231bac8 modified at 231baf0
> after it was freed
> 
> I noticed that if you just keep creating new JSContexts and executing it works
> fine (but dies on shutdown).  The heap corruption occurs once you start
> re-using contexts from the pool.
> 
> I've included a VC 8.0 project.  
> FYI:  I linked in a debug build of JS under the name js32d.dll.
> 
> The GC heap corruption is way over my head.  I need your help here guys!
> 

Mike, I compiled and run it in VC8 debug mode, it prints out some messages and then triggers an assertion error:

3876 Started new thread. Active thread count(1)
1588 Started new thread, Active thread count(3)
....
.... Started new thread, Active thread count(2)

--------------
About to iterate...
registering testClass
Done iterating.

1588 MAYBE GC... Hit count(0)
Pool Size(1) Active Threads(4) Total Executions(1)

--------------
About to iterate...
1540 Started new thread. Active thread count(5)
registering testClass
Done iterating.

1904 MAYBE GC... Hit Count(1)
Pool Size(1) Active Threads(4) Total Executions(2)
Assertion failure: !OBJ_GET_PROTO(cx, ctor), at .../js/src/jsapi.c:2285

If I comment out this ASSERTION, I did see the error message you pointed out. If this assertion is necessary, something was wrong before heap corruption.

Brendan, do you have first-hand knowledge of what's wrong here?
Please note. The newest tester application is compiling the scripts under a different global than they are being executed. This global is rooted and saved in order to allow "Anonymous Functions" to remain properly initialized in later executions.  This was done per recommendation from Brendan.  I don't know if this is contributing...

>If I comment out this ASSERTION, I did see the error message you pointed out.
>If this assertion is necessary, something was wrong before heap corruption.
The call stack should be coming from JS_InitStandardClasses.  However there shouldn't be any standard classes defined in the global since we JS_ClearScoped it.  So apparently JS_ClearScope isn't working or I'm misunderstanding how JS_ClearScope works.
JS_ClearScope works fine, try debugging harder to diagnose the cause of the assertion.

The problem (as I've said in private email) indicated by that assertion botching is that the Function class is being re-initialized on a global, which is not only unnecessary, but due to the way standard class initialization works, will lead to an extra layer of prototype in the chain leading to Object.prototype.

There's no known memory corruption bug associated with this assertion botching. Any memory bug is probably separate.

/be
I downloaded and ran an evaluation copy of Rational Purify.
It says it found a "free memory write"
on a call to  JS_REMOVE_LINK(&cx->threadLinks);

Here's the purify Dump (I suggest viewing in a better tool):
------------------------
[I] Starting main
[I] Terminating thread 0x668
        Call location
            ExitThread     [C:\WINNT\system32\KERNEL32.dll]
            endthread      [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:364]
            callthreadstart [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:295]
            threadstart    [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:275]
[E] FMW: Free memory write in js_SetContextThread {126 occurrences}
        Writing 4 bytes to 0x02c2ff28 (4 bytes at 0x02c2ff28 illegal)
        Address 0x02c2ff28 is at the beginning of a 60 byte block
        Address 0x02c2ff28 points to a malloc'd block in heap 0x02c20000
        Thread ID: 0x668
        Error location
            js_SetContextThread [c:\dev\common\spidermonkey\js\src\jscntxt.c:134]
                        return JS_FALSE;
                    }
                    cx->thread = thread;
             =>     JS_REMOVE_LINK(&cx->threadLinks);
                    JS_APPEND_LINK(&cx->threadLinks, &thread->contextList);
                    return JS_TRUE;
                }
    JS_SetContextThread [c:\dev\common\spidermonkey\js\src\jsapi.c:5008]

            JS_SetContextThread [c:\dev\common\spidermonkey\js\src\jsapi.c:5008]
            callthreadstart [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:293]
            threadstart    [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:275]
        Allocation location
            calloc         [f:\rtm\vctools\crt_bld\self_x86\crt\src\dbgheap.c:505]
            js_GetCurrentThread [c:\dev\common\spidermonkey\js\src\jscntxt.c:104]
                    thread = (JSThread *)PR_GetThreadPrivate(rt->threadTPIndex);
                    if (!thread) {
                        /* New memory is set to 0 so that elements in gcFreeLists are NULL. */
             =>         thread = (JSThread *) calloc(1, sizeof(JSThread));
                        if (!thread)
                            return NULL;
                
            js_SetContextThread [c:\dev\common\spidermonkey\js\src\jscntxt.c:127]
            js_NewContext  [c:\dev\common\spidermonkey\js\src\jscntxt.c:184]
            JS_NewContext  [c:\dev\common\spidermonkey\js\src\jsapi.c:967]
            callthreadstart [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:293]
            threadstart    [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:275]
        Free location
            free           [f:\rtm\vctools\crt_bld\self_x86\crt\src\dbgheap.c:1151]
            js_ThreadDestructorCB [c:\dev\common\spidermonkey\js\src\jscntxt.c:82]
            PR_DestroyThreadPrivate [c:\dev\common\nspr-4.6.2\mozilla\nsprpub\pr\src\threads\prtpd.c:265]
            PR_CleanupThread [c:\dev\common\nspr-4.6.2\mozilla\nsprpub\pr\src\threads\prcthr.c:62]
            PRI_DetachThread [c:\dev\common\nspr-4.6.2\mozilla\nsprpub\pr\src\threads\combined\pruthr.c:1542]
            DllMain        [c:\dev\common\nspr-4.6.2\mozilla\nsprpub\pr\src\md\windows\ntdllmn.c:81]
            _DllMainCRTStartup [f:\rtm\vctools\crt_bld\self_x86\crt\src\crtdll.c:495]
            DllMainCRTStartup [f:\rtm\vctools\crt_bld\self_x86\crt\src\crtdll.c:459]
            GetThreadPriorityBoost [C:\WINNT\system32\KERNEL32.dll]
            endthread      [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:364]
            callthreadstart [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:295]
            threadstart    [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:275]
[E] FMW: Free memory write in js_SetContextThread {126 occurrences}
        Writing 4 bytes to 0x02c2ff2c (4 bytes at 0x02c2ff2c illegal)
        Address 0x02c2ff2c is 4 bytes into a 60 byte block at 0x02c2ff28
        Address 0x02c2ff2c points to a malloc'd block in heap 0x02c20000
        Thread ID: 0x668
        Error location
            js_SetContextThread [c:\dev\common\spidermonkey\js\src\jscntxt.c:134]
            JS_SetContextThread [c:\dev\common\spidermonkey\js\src\jsapi.c:5008]
            callthreadstart [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:293]
            threadstart    [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:275]
        Allocation location
            calloc         [f:\rtm\vctools\crt_bld\self_x86\crt\src\dbgheap.c:505]
            js_GetCurrentThread [c:\dev\common\spidermonkey\js\src\jscntxt.c:104]
            js_SetContextThread [c:\dev\common\spidermonkey\js\src\jscntxt.c:127]
            js_NewContext  [c:\dev\common\spidermonkey\js\src\jscntxt.c:184]
            JS_NewContext  [c:\dev\common\spidermonkey\js\src\jsapi.c:967]
            callthreadstart [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:293]
            threadstart    [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:275]
        Free location
            free           [f:\rtm\vctools\crt_bld\self_x86\crt\src\dbgheap.c:1151]
            js_ThreadDestructorCB [c:\dev\common\spidermonkey\js\src\jscntxt.c:82]
            PR_DestroyThreadPrivate [c:\dev\common\nspr-4.6.2\mozilla\nsprpub\pr\src\threads\prtpd.c:265]
            PR_CleanupThread [c:\dev\common\nspr-4.6.2\mozilla\nsprpub\pr\src\threads\prcthr.c:62]
            PRI_DetachThread [c:\dev\common\nspr-4.6.2\mozilla\nsprpub\pr\src\threads\combined\pruthr.c:1542]
            DllMain        [c:\dev\common\nspr-4.6.2\mozilla\nsprpub\pr\src\md\windows\ntdllmn.c:81]
            _DllMainCRTStartup [f:\rtm\vctools\crt_bld\self_x86\crt\src\crtdll.c:495]
            DllMainCRTStartup [f:\rtm\vctools\crt_bld\self_x86\crt\src\crtdll.c:459]
            GetThreadPriorityBoost [C:\WINNT\system32\KERNEL32.dll]
            endthread      [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:364]
            callthreadstart [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:295]
            threadstart    [f:\rtm\vctools\crt_bld\self_x86\crt\src\thread.c:275]
--------------------
(In reply to comment #25)
> I downloaded and ran an evaluation copy of Rational Purify.
> It says it found a "free memory write"
> on a call to  JS_REMOVE_LINK(&cx->threadLinks);

Purify says you exited or otherwise "detached" the thread owning that JSThread, but somehow js_ThreadDestructorCB failed to unlink all the JSContexts linked on thread->contextList, in spite of the code to do just that:

    while (!JS_CLIST_IS_EMPTY(&thread->contextList))
        JS_REMOVE_AND_INIT_LINK(thread->contextList.next);

which comes just before the free(thread) call that Purify indeed saw.

This means either:

A.  There was a context whose threadLinks member was not on that thread's contextList when js_ThreadDestructorCB ran, yet somehow, later the context's threadLinks member was put on the thread's contextList.  Since the thread was being detached, it should have gone away.  Is there some kind of thread pooling or recycling going on?

Or:

B.  There was some kind of race to set cx->thread to point at a JSThread and link cx->threadLinks onto the thread's contextList, where two threads are trying to set the same context as affine at the same time.

My money is on B.

/be
>This means either:
>
>A.  There was a context whose threadLinks member was not on that thread's
>contextList when js_ThreadDestructorCB ran, yet somehow, later the context's
>threadLinks member was put on the thread's contextList.  Since the thread was
>being detached, it should have gone away.  Is there some kind of thread pooling
>or recycling going on?
>
>Or:
>
>B.  There was some kind of race to set cx->thread to point at a JSThread and
>link cx->threadLinks onto the thread's contextList, where two threads are
>trying to set the same context as affine at the same time.
>
>My money is on B.

I can rule out B.  If I only allow 1 thread to execute at a time I still have corruption.

>being detached, it should have gone away.  Is there some kind of thread pooling
>or recycling going on?
I'm not pooling theads...just calling beginThread() each time and executing a script and letting the thread exit.  What IS weird is that windows seems to be giving me the same thread id for the 2nd thread.
GetCurrenThreadId() returns the same id for two threads started back to back.  I don't know if that matters. Probably just some windows weirdness.

With regards to the following code in js_ThreadDestructorCB()
--------
    while (!JS_CLIST_IS_EMPTY(&thread->contextList))
        JS_REMOVE_AND_INIT_LINK(thread->contextList.next);
--------
JS_REMOVE_AND_INIT_LINK() is NOT being called after the 1st thread terminates and the function is called from NSPR's DllMain with reason DLL_THREAD_DETACH.

Is this bad? Apparently the CLIST is empty at that point. next & prev are the same pointers.
I ***THINK*** I know what the problem is.

The code in js_SetContextThread() calls:
    JS_REMOVE_LINK(&cx->threadLinks);

just before calling
    JS_APPEND_LINK(&cx->threadLinks, &thread->contextList);

If I comment out the JS_REMOVE_LINK(),  the heap corruption goes away!
Brendan, Why is the call to REMOVE_LINK necessary?

FYI.  Switching back to multiple threads is still causing the Assertion failure: !OBJ_GET_PROTO(cx, ctor), at .../js/src/jsapi.c:2285
to occur again.  

But, as you said Brendan, that probably is another bug in and of itself and not related to the heap corruption.
(In reply to comment #28)
> I ***THINK*** I know what the problem is.
> 
> The code in js_SetContextThread() calls:
>     JS_REMOVE_LINK(&cx->threadLinks);
> 
> just before calling
>     JS_APPEND_LINK(&cx->threadLinks, &thread->contextList);
> 
> If I comment out the JS_REMOVE_LINK(),  the heap corruption goes away!
> Brendan, Why is the call to REMOVE_LINK necessary?

I think JS_REMOVE_LINK(&cx->threadLinks) is supposed to remove the context from the its old owning thread's contextList. JS_APPEND_LINK(&cx->threadLinks, &thread->contextList) adds the context to the new owning thread's context list.

Could it be because cx->threadLinks is empty? Do we need a check before JS_REMOVE_LINK? e.g.,

  if (!JS_LIST_IS_EMPTY(&cx->threadLinks))
    JS_REMOVE_LINK(&cx->threadLinks);
  JS_APPEND_LINK(&cx->threadLinks, &thread->contextList);

That's just my guess.

> 
> FYI.  Switching back to multiple threads is still causing the Assertion
> failure: !OBJ_GET_PROTO(cx, ctor), at .../js/src/jsapi.c:2285
> to occur again.  
> 
> But, as you said Brendan, that probably is another bug in and of itself and not
> related to the heap corruption.
> 

(In reply to comment #28)
> I ***THINK*** I know what the problem is.
> 
> The code in js_SetContextThread() calls:
>     JS_REMOVE_LINK(&cx->threadLinks);

Purify already flagged a FMW there, we knew that.

> 
> just before calling
>     JS_APPEND_LINK(&cx->threadLinks, &thread->contextList);
> 
> If I comment out the JS_REMOVE_LINK(),  the heap corruption goes away!

That only avoids the FMW in your cacse, and it breaks other cases.

> Brendan, Why is the call to REMOVE_LINK necessary?

Because the context could be associated with another thread, and when some context pooling API client that failed to call JS_ClearContextThread when it put the context into the pool now calls JS_SetContextThread, the context must be disassociated from that other thread before the JS_APPEND_LINK associates it with the new thread.

If you really only have one thread running, then the first time that js_SetContextThread runs, it is called from js_NewContext with cx->threadLinks initialized just before the call to link to itself.  The JS_REMOVE_LINK will in effect do nothing (unlink on an empty JSCList just leaves it empty).

The second time js_SetContextThread is called on a given context must be from your pooling code calling JS_SetContextThread.  In this case, the context will be on the same thread list that you want to append it to.  So it's absolutely necessary to remove it from that list before appending it, or you'd get a double-insertion bug.

The code is correct as far as I can tell, and cutting out pieces of it that are necessary just to dodge a memory operation that fails in your embedding is not the right approach.  The debugging problem to solve is: what left that cx's threadLinks referring to memory associated with a thread other than the current thread, if there is only ever one thread?

You said there's only one thread.  If so, why is NSPR running the thread-private storage destructor callbacks?

/be
(In reply to comment #29)
> Could it be because cx->threadLinks is empty? Do we need a check before
> JS_REMOVE_LINK? e.g.,

No.  See jsclist.h, specifically

#define JS_REMOVE_LINK(_e)             \
    JS_BEGIN_MACRO                     \
        (_e)->prev->next = (_e)->next; \
        (_e)->next->prev = (_e)->prev; \
    JS_END_MACRO

and

#define JS_INIT_CLIST(_l)  \
    JS_BEGIN_MACRO         \
        (_l)->next = (_l); \
        (_l)->prev = (_l); \
    JS_END_MACRO

Linked list primitives must be empty-list safe, and they are.  This is not the bug, and I don't think it's my bug.  Is it an NSPR bug?

/be
Assignee: brendan → MikeM
Status: ASSIGNED → NEW
Why would putting in this check hurt?

if (!JS_LIST_IS_EMPTY(&cx->threadLinks))
    JS_REMOVE_LINK(&cx->threadLinks);
  JS_APPEND_LINK(&cx->threadLinks, &thread->contextList);

The first time in &cx->threadLinks, prev and next all point to the same thing.

>You said there's only one thread.  If so, why is NSPR running the
>thread-private storage destructor callbacks?

There are many threads created and destroyed.  I'm just doing them in serial. i.e waiting for 1 thread to die before starting another.
(In reply to comment #32)
> Why would putting in this check hurt?
> 
> if (!JS_LIST_IS_EMPTY(&cx->threadLinks))
>     JS_REMOVE_LINK(&cx->threadLinks);

Read the macros and mentally execute JS_REMOVE_LINK on an empty list.  Yes, it would hurt -- it would waste time and obscure the code.

Let's focus on the problem, which is that the thread destructor callback runs and frees the JSThread, yet somehow a context still refers to the thread's contextList member via the context's threadLinks.

/be
I've uncovered the root problem.
You basically have two circularly linked lists, cx->threadLinks and thread->contextList, which are pointing to each other at the end of the 1st script execution.

Inside JS_ClearContextThread() you execute:

   JS_REMOVE_LINK(&cx->threadLinks);

That call actually removes the context from the thread->contextList but leaves a pointer to "thread" in the cx->threadLinks list.
Then when the destructor callback is run it does a free(thread), which now leaves a dangling pointer inside the threadLinks list.

When JS_REMOVE_LINK(&cx->threadLinks) is again called on the next call to js_SetContextThread() you are operating on the dangling pointer.
(This is where purify wailed)

The answer is to make sure that when you call js_ClearContextThread() that both circular lists are valid and no longer refer to each other.

How to do this???  I tried once and royally broke it.
It needs to work in the simple case of 1 thread and 1 context but also when lists are longer.

CLISTS are elegant....but REMOVE_LINK is causing you to operate on both lists at the same time when you really don't want to.  Part of the problem is that you initializing each CLIST with the address of itself, then adding a link which is "crossing the streams".

Question:  Why does a context need a CLIST of threads?
Mike, thanks for getting to the bug -- sorry I didn't see it.

To answer your question, the thread's contextList is the head of a circular list of contexts.  Each context uses the same struct JSCList for linkage in that list, but that does not make the context's threadLinks a "list of threads".  It's part of the doubly-linked list whose head is the thread's contextList.

The fix is simple: js_ClearContextThread must use JS_REMOVE_AND_INIT_LINK.

/be
Assignee: MikeM → brendan
Attached patch fixSplinter Review
Feng, could you review?  Thanks,

/be
Attachment #235525 - Flags: review?(feng.qian.moz)
This is a must-fix bug with a no-brainer, simple and safe patch ;-).

/be
Severity: major → critical
Status: NEW → ASSIGNED
Flags: blocking1.8.1?
Summary: crash & corruption in js_GC() and js_NewGCThing() → JS_ClearContextThread leaves dangling pointers in JSContext
Target Milestone: mozilla1.8.1beta2 → mozilla1.8.1
>>Mike, thanks for getting to the bug -- sorry I didn't see it.

No problem.  I'm happy we FINALLY found the root cause of the corruption! I've been stuck on this for way too many weeks.
An you thought I was crazy.... :-)

What really pisses me off is that boundchecker 8.x didn't find this.  I must have instrumented and checked JS 50 times and couldn't find it.  Purify got it on the 1st run!
I think I'll ask for a refund from CA.

Question1:  Does JS_REMOVE_AND_INIT_LINK() initialze the CSLIST and wipe it out? If so do we really want to do this? My brain hurts too much to figure this out.

Question2:  How long has this bug been there? I'm just suprised I didn't run into it before. I've been doing context pooling since JS 1.5.  I suppose we just got lucky in that it stomped on 4 bytes of something that didn't hurt.
(In reply to comment #38)
> >>Mike, thanks for getting to the bug -- sorry I didn't see it.
> 
> No problem.  I'm happy we FINALLY found the root cause of the corruption! I've
> been stuck on this for way too many weeks.
> An you thought I was crazy.... :-)

No, but I hoped you'd find it in your setup sooner than I'd make time for it.

> Question1:  Does JS_REMOVE_AND_INIT_LINK() initialze the CSLIST and wipe it
> out? If so do we really want to do this? My brain hurts too much to figure this
> out.

Repeat after me: a self-linked JSCList can be remove-linked an infinite number of times without harm.  next->prev = prev and prev->next = next won't change next or prev when they point to self.

> Question2:  How long has this bug been there? I'm just suprised I didn't run
> into it before. I've been doing context pooling since JS 1.5.  I suppose we
> just got lucky in that it stomped on 4 bytes of something that didn't hurt.

This bug is new in JS1.7, it came in with the patch for bug 312238.

/be
Attachment #235525 - Flags: review?(feng.qian.moz) → review+
Mike & Brendan, sorry for troubles and thanks for fixing it. I should get a copy of purify too.

(In reply to comment #39)
> (In reply to comment #38)
> > >>Mike, thanks for getting to the bug -- sorry I didn't see it.
> > 
> > No problem.  I'm happy we FINALLY found the root cause of the corruption! I've
> > been stuck on this for way too many weeks.
> > An you thought I was crazy.... :-)
> 
> No, but I hoped you'd find it in your setup sooner than I'd make time for it.
> 
> > Question1:  Does JS_REMOVE_AND_INIT_LINK() initialze the CSLIST and wipe it
> > out? If so do we really want to do this? My brain hurts too much to figure this
> > out.
> 
> Repeat after me: a self-linked JSCList can be remove-linked an infinite number
> of times without harm.  next->prev = prev and prev->next = next won't change
> next or prev when they point to self.
> 
> > Question2:  How long has this bug been there? I'm just suprised I didn't run
> > into it before. I've been doing context pooling since JS 1.5.  I suppose we
> > just got lucky in that it stomped on 4 bytes of something that didn't hurt.
> 
> This bug is new in JS1.7, it came in with the patch for bug 312238.
> 
> /be
> 

Comment on attachment 235525 [details] [diff] [review]
fix

Thanks, feng -- sorry I didn't spot this back when I should have.

/be
Attachment #235525 - Flags: approval1.8.1?
Fixed on trunk.

/be
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Flags: in-testsuite-
Whiteboard: [schrep-181approval pending]
Comment on attachment 235525 [details] [diff] [review]
fix

a=schrep/beltnzer for drivers.
Attachment #235525 - Flags: approval1.8.1? → approval1.8.1+
Flags: blocking1.8.1? → blocking1.8.1+
Fixed on the 1.8 branch.

/be
Keywords: fixed1.8.1
You need to log in before you can comment on or make changes to this bug.