Last Comment Bug 664647 - Work around MSVC 2005 PGO bug in GetJSObjectSlotsCallback and friends
: Work around MSVC 2005 PGO bug in GetJSObjectSlotsCallback and friends
Status: RESOLVED FIXED
[see comment 16][fixed-in-tracemonkey]
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
: -- normal (vote)
: ---
Assigned To: Nicholas Nethercote [:njn]
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-06-15 19:39 PDT by Nicholas Nethercote [:njn]
Modified: 2011-06-20 17:16 PDT (History)
7 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch (3.38 KB, patch)
2011-06-17 00:01 PDT, Nicholas Nethercote [:njn]
khuey: review+
Details | Diff | Review

Description Nicholas Nethercote [:njn] 2011-06-15 19:39:15 PDT
Bug 571249 is the first user of IterateCells.  I'm getting crashes on Windows PGO builds.  khuey helped me debug them a bit, the crashes are reliable -- about:memory crashes every time it's loaded in PGO builds -- but the crash location varies from run to run.  It's always in IterateCells or just afterwards, and usually near some GC locking, sometimes relating to condvars.

Here's the code that uses IterateCells:


struct PRInt64Data {
    PRInt64Data() : n(0) { }
    PRInt64 n;
};

void
GetJSObjectSlotsCallback(JSContext *cx, void *v, size_t traceKind, void *thing)
{
    JS_ASSERT(traceKind == JSTRACE_OBJECT);
    JSObject *obj = (JSObject *)thing;
    if (obj->hasSlotsArray()) {
        PRInt64Data *data = (PRInt64Data *) v;
        data->n += obj->numSlots() * sizeof(js::Value);
    }
}

static PRInt64
GetJSObjectSlots(void *dummy)
{
    JSRuntime *rt = nsXPConnect::GetRuntimeInstance()->GetJSRuntime();
    JSContext *cx = JS_NewContext(rt, 0);
    if (!cx) {
        NS_ERROR("couldn't create context for memory tracing");
        return (PRInt64) -1;
    }

    PRInt64Data data;
    js::IterateCells(cx, NULL, js::TraceKindMask(JSTRACE_OBJECT), &data,
                     *GetJSObjectSlotsCallback);

    JS_DestroyContextNoGC(cx);

    return data.n;
}


It's pretty innocuous.  A lot of the crashes happen in one of JS_DestroyContextNoGC's children.

billm, I'd really appreciate it if you double-check all the locking and related stuff in IterateCells and the above code.  I don't understand the GC locking enough to work this out myself.
Comment 1 Bill McCloskey (:billm) 2011-06-15 20:46:36 PDT
I'd be kinda surprised if this is the problem, but you should be in a request when you call IterateCells. We should probably assert that.

Are there any other threads doing anything (i.e., with JS code on the stack) when it crashes? What sort of memory is it trying to access when it crashes?

If nothing else works, I would suggest cutting out code until the crashes go away. Maybe first try commenting out the entire IterateCells call. If that doesn't crash, comment out the main loop in IterateCells.
Comment 2 Nicholas Nethercote [:njn] 2011-06-15 21:09:23 PDT
(In reply to comment #1)
> I'd be kinda surprised if this is the problem, but you should be in a
> request when you call IterateCells. We should probably assert that.

What does the assertion look like?

> Are there any other threads doing anything (i.e., with JS code on the stack)
> when it crashes? What sort of memory is it trying to access when it crashes?

Hard to say, and it varies.  The crash always happens when about:memory is loaded.  about:memory uses JS to generate it's contents, but there's not a lot else going on.

> If nothing else works, I would suggest cutting out code until the crashes go
> away. Maybe first try commenting out the entire IterateCells call. If that
> doesn't crash, comment out the main loop in IterateCells.

I'm in the process of doing that :)
Comment 3 Bill McCloskey (:billm) 2011-06-15 21:17:40 PDT
(In reply to comment #2)
> (In reply to comment #1)
> > I'd be kinda surprised if this is the problem, but you should be in a
> > request when you call IterateCells. We should probably assert that.
> 
> What does the assertion look like?

Sorry, this was unclear. You're definitely not in a request, and I think you should be. I just don't think it would fix this crash. Just wrap the whole thing in JS_BeginRequest/JS_EndRequest for the context you create.

The assertion would just check that cx->requestDepth > 0.
Comment 4 Nicholas Nethercote [:njn] 2011-06-15 21:21:20 PDT
Some edited excerpts from the IRC conversation between me and khuey, in case it's any use:

	khuey	and now I crash on the JS_LOCK_GC bit just down from there
	khuey	that seems weird
	njn	where's the JS_LOCK_GC bit?
	khuey	I'm also crashing on a completely random thread
	khuey	jscntxt.cpp line 575
	khuey	the debugger tells me the main thread is blocked there and I crashed on some random thread waiting on a condvar ...
	khuey	that's ... odd
	njn	I wonder if I need some locking in GetJSObjectSlots
	khuey	stepping through things seems to change where we're crashing, which sucks
	njn	and makes it sound like a threading bug
	khuey	ok this time I made it through the JS_LOCK_GC call
	khuey	and crashed randomly again in some condvar stuff


	khuey	so the main thread is in js::IterateCells
	njn	interesting
	khuey	and it's trying to relock the GC lock
	khuey	at the end of the function
	njn	at the end?
	khuey	yes
	njn	how/why?
	khuey	I assume the AutoUnlockGC has expired
	khuey	so we're relocking
	khuey	sucks that the compiler can't just combine the two
	khuey	and realize that relocking is dumb
	khuey	anyways, the relevant disassembly at the end of that function is http://khuey.pastebin.mozilla.org/1250553
	khuey	and this function is in the PR_Lock call
	khuey	so it's waiting on the lock
	njn	ok
	khuey	finds the crashing thread again
	khuey	ok, so we're crashing on the gc thread
	khuey	at jsgc.cpp line 2063
	njn	definitely sounds like locking problems
	khuey	on the PR_WaitCondVar call
	khuey	and it's the same lock
	njn	how does that crash?
	khuey	has no idea
Comment 5 Gregor Wagner [:gwagner] 2011-06-15 21:29:29 PDT
Seems like a problem with the order of deconstructors for AutoLockGC and AutoUnlockGC?
Comment 6 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-06-15 21:35:03 PDT
The disassembly isn't obviously wrong there; we destroy AutoUnlockGC (relock), AutoGCSession (the movs to memory and the notify) and AutoLockGC (the unlock).
Comment 7 Nicholas Nethercote [:njn] 2011-06-15 21:39:30 PDT
(In reply to comment #3)
> 
> The assertion would just check that cx->requestDepth > 0.

Do you mean cx->checkRequestDepth?  Should CHECK_REQUEST(cx) be used instead?  Should this assertion be inserted at the start of IterateCells()?
Comment 8 Bill McCloskey (:billm) 2011-06-15 22:42:05 PDT
Could you post a link to the build? I'll look at it in the morning.
Comment 9 Nicholas Nethercote [:njn] 2011-06-15 22:50:38 PDT
Here's one I just ran (actually, it's still running;  I'll let you know if the crash doesn't reproduce in it):

https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-f78aa2b1d49d/
Comment 10 Gregor Wagner [:gwagner] 2011-06-15 22:56:39 PDT
Can you try to replace the Auto(Un)LockGC with JS_LOCK_GC(rt) and JS_UNLOCK_GC(rt);?
Comment 11 Nicholas Nethercote [:njn] 2011-06-15 22:59:24 PDT
(In reply to comment #10)
> Can you try to replace the Auto(Un)LockGC with JS_LOCK_GC(rt) and
> JS_UNLOCK_GC(rt);?

I can, but I can only reproduce by building on the try server, which takes a couple of hours, and I already have two builds in flight.  So I'll try it tomorrow :)
Comment 12 Nicholas Nethercote [:njn] 2011-06-16 03:10:41 PDT
(In reply to comment #9)
> Here's one I just ran (actually, it's still running;  I'll let you know if
> the crash doesn't reproduce in it):
> 
> https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/
> nnethercote@mozilla.com-f78aa2b1d49d/

The crashes happened in js_DestroyContext() as expected.  In the following build I commented out the IterateCell() calls in GetJSObjectSlots() and GetJSStringChars() and it didn't crash:

https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-8e8d271eb1eb/

So that's good evidence that there's a problem with IterateCells() and/or the way it's being called.  I'll try Gregor's suggestion from comment 10 next.
Comment 13 Nicholas Nethercote [:njn] 2011-06-16 03:23:09 PDT
(In reply to comment #12)
> I'll try Gregor's suggestion from comment 10 next.

I tried that -- in IterateCells, I replaced |AutoLockGC lock(rt);| with |JS_LOCK_GC(rt)|, and replaced |AutoUnlockGC unlock(rt);| with |JS_UNLOCK_GC(rt);|.  I got this:

Assertion failure: PR_TRUE == cvar->lock->locked, at /home/njn/moz/tm1/nsprpub/pr/src/pthreads/ptsynch.c:314

Program d64/dist/bin/firefox-bin (pid = 22973) received signal 6.
Stack:
UNKNOWN [/lib/x86_64-linux-gnu/libpthread.so.0 +0x0000FC60]
gsignal+0x00000035 [/lib/x86_64-linux-gnu/libc.so.6 +0x00033D05]
abort+0x00000186 [/lib/x86_64-linux-gnu/libc.so.6 +0x00037AB6]
PR_CreateFileMap+0x00000000 [d64/dist/bin/libnspr4.so +0x00012248]
UNKNOWN [d64/dist/bin/libnspr4.so +0x000245B0]
PR_NotifyAllCondVar+0x0000002E [d64/dist/bin/libnspr4.so +0x00025048]
UNKNOWN [d64/dist/bin/libxul.so +0x01E82BAF]
js::IterateCells(JSContext*, JSCompartment*, unsigned long, void*, void (*)(JSContext*, void*, unsigned long, void*))+0x00000139 [d64/dist/bin/libxul.so +0x01E82CEA]


Is there any documentation on how all this GC locking works?  It's all voodoo to me at the moment -- locks, sessions, requests, etc -- and I'd like to understand it.
Comment 14 Nicholas Nethercote [:njn] 2011-06-16 03:24:39 PDT
BTW, this code is in TM now, but the crashing memory reporters are commented out.  Look for "bug 664647" in xpcjsruntime.cpp to uncomment them.
Comment 15 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-06-16 06:25:39 PDT
AutoGCSession notifies on a condvar in the dtor, so we have to relock at the end of the function.  That's what that assert is.
Comment 16 Bill McCloskey (:billm) 2011-06-16 14:11:24 PDT
This appears to be a compiler bug, which I guess isn't that surprising given that it's a PGO build. The bad code is generated for GetJSObjectSlotsCallback. Here's the disassembly:

void
GetJSObjectSlotsCallback(JSContext *cx, void *v, size_t traceKind, void *thing)
{
64606845  push        esi  
64606846  mov         esi,dword ptr [esp+8]  
6460684A  push        edi  
6460684B  mov         edi,dword ptr [esp+10h]  
    JS_ASSERT(traceKind == JSTRACE_OBJECT);
    JSObject *obj = (JSObject *)thing;
    if (obj->hasSlotsArray()) {
6460684F  push        edi  
64606850  call        645A987D  
64606855  test        al,al  
64606857  je          64606865  
        PRInt64Data *data = (PRInt64Data *) v;
        data->n += obj->numSlots() * sizeof(js::Value);
64606859  mov         eax,dword ptr [edi+20h]  
6460685C  shl         eax,3  
6460685F  add         dword ptr [esi],eax  
64606861  adc         dword ptr [esi+4],0  
64606865  pop         edi  
64606866  pop         esi  
    }
}
64606867  ret  

When we enter the function, the stack looks as follows (where greater addresses are at the top, so pushes happen to the bottom):
  thing
  traceKind
  v
  cx
  return address

First we push esi. Then we move [esp+8], which is now cx, into esi. Then we push edi and move [esp+16], which is now v, into edi.

Then we call JSObject::hasSlotsArray, passing it edi (= v) for |this|. This is totally wrong; we really wanted |thing|. But we don't crash here because it's stack memory, so it's valid. Then we do the add to esi (= cx). Again, this is totally wrong; we wanted |v|.

I think the bug here is that the compiler is computing the addresses for the moves to esi and edi without accounting for the stack pushes it just did. Were it not for these pushes, the addresses would be correct. I expect that this is some sort of optimization that's trying to eliminate uses of the frame pointer.

The reason we crash is that the first 64 bits of cx (which got incremented instead of v) hold the doubly-linked list of contexts. When we mess this up, we cause a crash later when destroying the context.

You might be able to "fix" this by perturbing the code a little bit. But given all the problems we've had with PGO, I think it would be much better if we just turn it off (or maybe upgrade the compiler, if that's possible). Tracking down bugs like this is awful.
Comment 17 Nicholas Nethercote [:njn] 2011-06-16 15:28:01 PDT
(In reply to comment #16)
>
> Tracking down bugs like this is awful.

Indeed it is, thanks for working it out.
Comment 18 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-06-16 21:25:17 PDT
I built the broken revision locally with PGO using MSVC 2010 and could not reproduce the crashes.
Comment 19 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-06-16 21:42:32 PDT
The disassembly from the 2010 build:

GetJSObjectSlotsCallback(JSContext *cx, void *v, size_t traceKind, void *thing)
{
60D55F40  push        esi  
    JS_ASSERT(traceKind == JSTRACE_OBJECT);
    JSObject *obj = (JSObject *)thing;
    if (obj->hasSlotsArray()) {
60D55F41  mov         esi,dword ptr [esp+14h]  
60D55F45  push        esi  
60D55F46  call        JSObject::hasSlotsArray (60CF8EA7h)  
60D55F4B  test        al,al  
60D55F4D  je          GetJSObjectSlotsCallback+1Fh (60D55F5Fh)  
        PRInt64Data *data = (PRInt64Data *) v;
        data->n += obj->numSlots() * sizeof(js::Value);
60D55F4F  mov         ecx,dword ptr [esi+20h]  
60D55F52  mov         eax,dword ptr [esp+0Ch]  
60D55F56  shl         ecx,3  
60D55F59  add         dword ptr [eax],ecx  
60D55F5B  adc         dword ptr [eax+4],0  
60D55F5F  pop         esi  
    }
}
60D55F60  ret  

One less push and 4 more bytes of offset get us to the right place.
Comment 20 Nicholas Nethercote [:njn] 2011-06-17 00:01:39 PDT
Created attachment 540001 [details] [diff] [review]
patch

billm, khuey, thanks ever so much for diagnosing this.  I would have had an awful time working it out by myself.

khuey, here's an obvious patch that de-optimizes the problematic functions.
Comment 21 Nicholas Nethercote [:njn] 2011-06-17 00:43:07 PDT
(In reply to comment #20)
> 
> khuey, here's an obvious patch that de-optimizes the problematic functions.

I forgot to mention:  the try server run was all green :)
Comment 22 Nicholas Nethercote [:njn] 2011-06-19 16:53:49 PDT
http://hg.mozilla.org/tracemonkey/rev/cecc5adfc440
Comment 23 Chris Leary [:cdleary] (not checking bugmail) 2011-06-20 17:16:51 PDT
cdleary-bot mozilla-central merge info:
http://hg.mozilla.org/mozilla-central/rev/cecc5adfc440

Note You need to log in before you can comment on or make changes to this bug.