Work around MSVC 2005 PGO bug in GetJSObjectSlotsCallback and friends

RESOLVED FIXED

Status

()

Core
JavaScript Engine
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: njn, Assigned: njn)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [see comment 16][fixed-in-tracemonkey])

Attachments

(1 attachment)

(Assignee)

Description

6 years ago
Bug 571249 is the first user of IterateCells.  I'm getting crashes on Windows PGO builds.  khuey helped me debug them a bit, the crashes are reliable -- about:memory crashes every time it's loaded in PGO builds -- but the crash location varies from run to run.  It's always in IterateCells or just afterwards, and usually near some GC locking, sometimes relating to condvars.

Here's the code that uses IterateCells:


struct PRInt64Data {
    PRInt64Data() : n(0) { }
    PRInt64 n;
};

void
GetJSObjectSlotsCallback(JSContext *cx, void *v, size_t traceKind, void *thing)
{
    JS_ASSERT(traceKind == JSTRACE_OBJECT);
    JSObject *obj = (JSObject *)thing;
    if (obj->hasSlotsArray()) {
        PRInt64Data *data = (PRInt64Data *) v;
        data->n += obj->numSlots() * sizeof(js::Value);
    }
}

static PRInt64
GetJSObjectSlots(void *dummy)
{
    JSRuntime *rt = nsXPConnect::GetRuntimeInstance()->GetJSRuntime();
    JSContext *cx = JS_NewContext(rt, 0);
    if (!cx) {
        NS_ERROR("couldn't create context for memory tracing");
        return (PRInt64) -1;
    }

    PRInt64Data data;
    js::IterateCells(cx, NULL, js::TraceKindMask(JSTRACE_OBJECT), &data,
                     *GetJSObjectSlotsCallback);

    JS_DestroyContextNoGC(cx);

    return data.n;
}


It's pretty innocuous.  A lot of the crashes happen in one of JS_DestroyContextNoGC's children.

billm, I'd really appreciate it if you double-check all the locking and related stuff in IterateCells and the above code.  I don't understand the GC locking enough to work this out myself.
I'd be kinda surprised if this is the problem, but you should be in a request when you call IterateCells. We should probably assert that.

Are there any other threads doing anything (i.e., with JS code on the stack) when it crashes? What sort of memory is it trying to access when it crashes?

If nothing else works, I would suggest cutting out code until the crashes go away. Maybe first try commenting out the entire IterateCells call. If that doesn't crash, comment out the main loop in IterateCells.
(Assignee)

Comment 2

6 years ago
(In reply to comment #1)
> I'd be kinda surprised if this is the problem, but you should be in a
> request when you call IterateCells. We should probably assert that.

What does the assertion look like?

> Are there any other threads doing anything (i.e., with JS code on the stack)
> when it crashes? What sort of memory is it trying to access when it crashes?

Hard to say, and it varies.  The crash always happens when about:memory is loaded.  about:memory uses JS to generate it's contents, but there's not a lot else going on.

> If nothing else works, I would suggest cutting out code until the crashes go
> away. Maybe first try commenting out the entire IterateCells call. If that
> doesn't crash, comment out the main loop in IterateCells.

I'm in the process of doing that :)
(In reply to comment #2)
> (In reply to comment #1)
> > I'd be kinda surprised if this is the problem, but you should be in a
> > request when you call IterateCells. We should probably assert that.
> 
> What does the assertion look like?

Sorry, this was unclear. You're definitely not in a request, and I think you should be. I just don't think it would fix this crash. Just wrap the whole thing in JS_BeginRequest/JS_EndRequest for the context you create.

The assertion would just check that cx->requestDepth > 0.
(Assignee)

Comment 4

6 years ago
Some edited excerpts from the IRC conversation between me and khuey, in case it's any use:

	khuey	and now I crash on the JS_LOCK_GC bit just down from there
	khuey	that seems weird
	njn	where's the JS_LOCK_GC bit?
	khuey	I'm also crashing on a completely random thread
	khuey	jscntxt.cpp line 575
	khuey	the debugger tells me the main thread is blocked there and I crashed on some random thread waiting on a condvar ...
	khuey	that's ... odd
	njn	I wonder if I need some locking in GetJSObjectSlots
	khuey	stepping through things seems to change where we're crashing, which sucks
	njn	and makes it sound like a threading bug
	khuey	ok this time I made it through the JS_LOCK_GC call
	khuey	and crashed randomly again in some condvar stuff


	khuey	so the main thread is in js::IterateCells
	njn	interesting
	khuey	and it's trying to relock the GC lock
	khuey	at the end of the function
	njn	at the end?
	khuey	yes
	njn	how/why?
	khuey	I assume the AutoUnlockGC has expired
	khuey	so we're relocking
	khuey	sucks that the compiler can't just combine the two
	khuey	and realize that relocking is dumb
	khuey	anyways, the relevant disassembly at the end of that function is http://khuey.pastebin.mozilla.org/1250553
	khuey	and this function is in the PR_Lock call
	khuey	so it's waiting on the lock
	njn	ok
	khuey	finds the crashing thread again
	khuey	ok, so we're crashing on the gc thread
	khuey	at jsgc.cpp line 2063
	njn	definitely sounds like locking problems
	khuey	on the PR_WaitCondVar call
	khuey	and it's the same lock
	njn	how does that crash?
	khuey	has no idea
Seems like a problem with the order of deconstructors for AutoLockGC and AutoUnlockGC?
The disassembly isn't obviously wrong there; we destroy AutoUnlockGC (relock), AutoGCSession (the movs to memory and the notify) and AutoLockGC (the unlock).
(Assignee)

Comment 7

6 years ago
(In reply to comment #3)
> 
> The assertion would just check that cx->requestDepth > 0.

Do you mean cx->checkRequestDepth?  Should CHECK_REQUEST(cx) be used instead?  Should this assertion be inserted at the start of IterateCells()?
Could you post a link to the build? I'll look at it in the morning.
(Assignee)

Comment 9

6 years ago
Here's one I just ran (actually, it's still running;  I'll let you know if the crash doesn't reproduce in it):

https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-f78aa2b1d49d/
Can you try to replace the Auto(Un)LockGC with JS_LOCK_GC(rt) and JS_UNLOCK_GC(rt);?
(Assignee)

Comment 11

6 years ago
(In reply to comment #10)
> Can you try to replace the Auto(Un)LockGC with JS_LOCK_GC(rt) and
> JS_UNLOCK_GC(rt);?

I can, but I can only reproduce by building on the try server, which takes a couple of hours, and I already have two builds in flight.  So I'll try it tomorrow :)
(Assignee)

Comment 12

6 years ago
(In reply to comment #9)
> Here's one I just ran (actually, it's still running;  I'll let you know if
> the crash doesn't reproduce in it):
> 
> https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/
> nnethercote@mozilla.com-f78aa2b1d49d/

The crashes happened in js_DestroyContext() as expected.  In the following build I commented out the IterateCell() calls in GetJSObjectSlots() and GetJSStringChars() and it didn't crash:

https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-8e8d271eb1eb/

So that's good evidence that there's a problem with IterateCells() and/or the way it's being called.  I'll try Gregor's suggestion from comment 10 next.
(Assignee)

Comment 13

6 years ago
(In reply to comment #12)
> I'll try Gregor's suggestion from comment 10 next.

I tried that -- in IterateCells, I replaced |AutoLockGC lock(rt);| with |JS_LOCK_GC(rt)|, and replaced |AutoUnlockGC unlock(rt);| with |JS_UNLOCK_GC(rt);|.  I got this:

Assertion failure: PR_TRUE == cvar->lock->locked, at /home/njn/moz/tm1/nsprpub/pr/src/pthreads/ptsynch.c:314

Program d64/dist/bin/firefox-bin (pid = 22973) received signal 6.
Stack:
UNKNOWN [/lib/x86_64-linux-gnu/libpthread.so.0 +0x0000FC60]
gsignal+0x00000035 [/lib/x86_64-linux-gnu/libc.so.6 +0x00033D05]
abort+0x00000186 [/lib/x86_64-linux-gnu/libc.so.6 +0x00037AB6]
PR_CreateFileMap+0x00000000 [d64/dist/bin/libnspr4.so +0x00012248]
UNKNOWN [d64/dist/bin/libnspr4.so +0x000245B0]
PR_NotifyAllCondVar+0x0000002E [d64/dist/bin/libnspr4.so +0x00025048]
UNKNOWN [d64/dist/bin/libxul.so +0x01E82BAF]
js::IterateCells(JSContext*, JSCompartment*, unsigned long, void*, void (*)(JSContext*, void*, unsigned long, void*))+0x00000139 [d64/dist/bin/libxul.so +0x01E82CEA]


Is there any documentation on how all this GC locking works?  It's all voodoo to me at the moment -- locks, sessions, requests, etc -- and I'd like to understand it.
(Assignee)

Comment 14

6 years ago
BTW, this code is in TM now, but the crashing memory reporters are commented out.  Look for "bug 664647" in xpcjsruntime.cpp to uncomment them.
AutoGCSession notifies on a condvar in the dtor, so we have to relock at the end of the function.  That's what that assert is.
This appears to be a compiler bug, which I guess isn't that surprising given that it's a PGO build. The bad code is generated for GetJSObjectSlotsCallback. Here's the disassembly:

void
GetJSObjectSlotsCallback(JSContext *cx, void *v, size_t traceKind, void *thing)
{
64606845  push        esi  
64606846  mov         esi,dword ptr [esp+8]  
6460684A  push        edi  
6460684B  mov         edi,dword ptr [esp+10h]  
    JS_ASSERT(traceKind == JSTRACE_OBJECT);
    JSObject *obj = (JSObject *)thing;
    if (obj->hasSlotsArray()) {
6460684F  push        edi  
64606850  call        645A987D  
64606855  test        al,al  
64606857  je          64606865  
        PRInt64Data *data = (PRInt64Data *) v;
        data->n += obj->numSlots() * sizeof(js::Value);
64606859  mov         eax,dword ptr [edi+20h]  
6460685C  shl         eax,3  
6460685F  add         dword ptr [esi],eax  
64606861  adc         dword ptr [esi+4],0  
64606865  pop         edi  
64606866  pop         esi  
    }
}
64606867  ret  

When we enter the function, the stack looks as follows (where greater addresses are at the top, so pushes happen to the bottom):
  thing
  traceKind
  v
  cx
  return address

First we push esi. Then we move [esp+8], which is now cx, into esi. Then we push edi and move [esp+16], which is now v, into edi.

Then we call JSObject::hasSlotsArray, passing it edi (= v) for |this|. This is totally wrong; we really wanted |thing|. But we don't crash here because it's stack memory, so it's valid. Then we do the add to esi (= cx). Again, this is totally wrong; we wanted |v|.

I think the bug here is that the compiler is computing the addresses for the moves to esi and edi without accounting for the stack pushes it just did. Were it not for these pushes, the addresses would be correct. I expect that this is some sort of optimization that's trying to eliminate uses of the frame pointer.

The reason we crash is that the first 64 bits of cx (which got incremented instead of v) hold the doubly-linked list of contexts. When we mess this up, we cause a crash later when destroying the context.

You might be able to "fix" this by perturbing the code a little bit. But given all the problems we've had with PGO, I think it would be much better if we just turn it off (or maybe upgrade the compiler, if that's possible). Tracking down bugs like this is awful.
(Assignee)

Comment 17

6 years ago
(In reply to comment #16)
>
> Tracking down bugs like this is awful.

Indeed it is, thanks for working it out.
(Assignee)

Updated

6 years ago
Summary: IterateCells is buggy? → Work around Windows PGO compiler bug in GetJSObjectSlotsCallback and friends
Whiteboard: [see comment 16]
I built the broken revision locally with PGO using MSVC 2010 and could not reproduce the crashes.
The disassembly from the 2010 build:

GetJSObjectSlotsCallback(JSContext *cx, void *v, size_t traceKind, void *thing)
{
60D55F40  push        esi  
    JS_ASSERT(traceKind == JSTRACE_OBJECT);
    JSObject *obj = (JSObject *)thing;
    if (obj->hasSlotsArray()) {
60D55F41  mov         esi,dword ptr [esp+14h]  
60D55F45  push        esi  
60D55F46  call        JSObject::hasSlotsArray (60CF8EA7h)  
60D55F4B  test        al,al  
60D55F4D  je          GetJSObjectSlotsCallback+1Fh (60D55F5Fh)  
        PRInt64Data *data = (PRInt64Data *) v;
        data->n += obj->numSlots() * sizeof(js::Value);
60D55F4F  mov         ecx,dword ptr [esi+20h]  
60D55F52  mov         eax,dword ptr [esp+0Ch]  
60D55F56  shl         ecx,3  
60D55F59  add         dword ptr [eax],ecx  
60D55F5B  adc         dword ptr [eax+4],0  
60D55F5F  pop         esi  
    }
}
60D55F60  ret  

One less push and 4 more bytes of offset get us to the right place.
(Assignee)

Comment 20

6 years ago
Created attachment 540001 [details] [diff] [review]
patch

billm, khuey, thanks ever so much for diagnosing this.  I would have had an awful time working it out by myself.

khuey, here's an obvious patch that de-optimizes the problematic functions.
Assignee: general → nnethercote
Attachment #540001 - Flags: review?(khuey)
(Assignee)

Updated

6 years ago
Summary: Work around Windows PGO compiler bug in GetJSObjectSlotsCallback and friends → Work around MSVC 2005 PGO bug in GetJSObjectSlotsCallback and friends
(Assignee)

Comment 21

6 years ago
(In reply to comment #20)
> 
> khuey, here's an obvious patch that de-optimizes the problematic functions.

I forgot to mention:  the try server run was all green :)
Attachment #540001 - Flags: review?(khuey) → review+
(Assignee)

Comment 22

6 years ago
http://hg.mozilla.org/tracemonkey/rev/cecc5adfc440
Whiteboard: [see comment 16] → [see comment 16][fixed-in-tracemonkey]
cdleary-bot mozilla-central merge info:
http://hg.mozilla.org/mozilla-central/rev/cecc5adfc440
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.