Closed
Bug 664647
Opened 14 years ago
Closed 14 years ago
Work around MSVC 2005 PGO bug in GetJSObjectSlotsCallback and friends
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
People
(Reporter: n.nethercote, Assigned: n.nethercote)
Details
(Whiteboard: [see comment 16][fixed-in-tracemonkey])
Attachments
(1 file)
3.38 KB,
patch
|
khuey
:
review+
|
Details | Diff | Splinter Review |
Bug 571249 is the first user of IterateCells. I'm getting crashes on Windows PGO builds. khuey helped me debug them a bit, the crashes are reliable -- about:memory crashes every time it's loaded in PGO builds -- but the crash location varies from run to run. It's always in IterateCells or just afterwards, and usually near some GC locking, sometimes relating to condvars.
Here's the code that uses IterateCells:
struct PRInt64Data {
PRInt64Data() : n(0) { }
PRInt64 n;
};
void
GetJSObjectSlotsCallback(JSContext *cx, void *v, size_t traceKind, void *thing)
{
JS_ASSERT(traceKind == JSTRACE_OBJECT);
JSObject *obj = (JSObject *)thing;
if (obj->hasSlotsArray()) {
PRInt64Data *data = (PRInt64Data *) v;
data->n += obj->numSlots() * sizeof(js::Value);
}
}
static PRInt64
GetJSObjectSlots(void *dummy)
{
JSRuntime *rt = nsXPConnect::GetRuntimeInstance()->GetJSRuntime();
JSContext *cx = JS_NewContext(rt, 0);
if (!cx) {
NS_ERROR("couldn't create context for memory tracing");
return (PRInt64) -1;
}
PRInt64Data data;
js::IterateCells(cx, NULL, js::TraceKindMask(JSTRACE_OBJECT), &data,
*GetJSObjectSlotsCallback);
JS_DestroyContextNoGC(cx);
return data.n;
}
It's pretty innocuous. A lot of the crashes happen in one of JS_DestroyContextNoGC's children.
billm, I'd really appreciate it if you double-check all the locking and related stuff in IterateCells and the above code. I don't understand the GC locking enough to work this out myself.
I'd be kinda surprised if this is the problem, but you should be in a request when you call IterateCells. We should probably assert that.
Are there any other threads doing anything (i.e., with JS code on the stack) when it crashes? What sort of memory is it trying to access when it crashes?
If nothing else works, I would suggest cutting out code until the crashes go away. Maybe first try commenting out the entire IterateCells call. If that doesn't crash, comment out the main loop in IterateCells.
![]() |
Assignee | |
Comment 2•14 years ago
|
||
(In reply to comment #1)
> I'd be kinda surprised if this is the problem, but you should be in a
> request when you call IterateCells. We should probably assert that.
What does the assertion look like?
> Are there any other threads doing anything (i.e., with JS code on the stack)
> when it crashes? What sort of memory is it trying to access when it crashes?
Hard to say, and it varies. The crash always happens when about:memory is loaded. about:memory uses JS to generate it's contents, but there's not a lot else going on.
> If nothing else works, I would suggest cutting out code until the crashes go
> away. Maybe first try commenting out the entire IterateCells call. If that
> doesn't crash, comment out the main loop in IterateCells.
I'm in the process of doing that :)
(In reply to comment #2)
> (In reply to comment #1)
> > I'd be kinda surprised if this is the problem, but you should be in a
> > request when you call IterateCells. We should probably assert that.
>
> What does the assertion look like?
Sorry, this was unclear. You're definitely not in a request, and I think you should be. I just don't think it would fix this crash. Just wrap the whole thing in JS_BeginRequest/JS_EndRequest for the context you create.
The assertion would just check that cx->requestDepth > 0.
![]() |
Assignee | |
Comment 4•14 years ago
|
||
Some edited excerpts from the IRC conversation between me and khuey, in case it's any use:
khuey and now I crash on the JS_LOCK_GC bit just down from there
khuey that seems weird
njn where's the JS_LOCK_GC bit?
khuey I'm also crashing on a completely random thread
khuey jscntxt.cpp line 575
khuey the debugger tells me the main thread is blocked there and I crashed on some random thread waiting on a condvar ...
khuey that's ... odd
njn I wonder if I need some locking in GetJSObjectSlots
khuey stepping through things seems to change where we're crashing, which sucks
njn and makes it sound like a threading bug
khuey ok this time I made it through the JS_LOCK_GC call
khuey and crashed randomly again in some condvar stuff
khuey so the main thread is in js::IterateCells
njn interesting
khuey and it's trying to relock the GC lock
khuey at the end of the function
njn at the end?
khuey yes
njn how/why?
khuey I assume the AutoUnlockGC has expired
khuey so we're relocking
khuey sucks that the compiler can't just combine the two
khuey and realize that relocking is dumb
khuey anyways, the relevant disassembly at the end of that function is http://khuey.pastebin.mozilla.org/1250553
khuey and this function is in the PR_Lock call
khuey so it's waiting on the lock
njn ok
khuey finds the crashing thread again
khuey ok, so we're crashing on the gc thread
khuey at jsgc.cpp line 2063
njn definitely sounds like locking problems
khuey on the PR_WaitCondVar call
khuey and it's the same lock
njn how does that crash?
khuey has no idea
Comment 5•14 years ago
|
||
Seems like a problem with the order of deconstructors for AutoLockGC and AutoUnlockGC?
The disassembly isn't obviously wrong there; we destroy AutoUnlockGC (relock), AutoGCSession (the movs to memory and the notify) and AutoLockGC (the unlock).
![]() |
Assignee | |
Comment 7•14 years ago
|
||
(In reply to comment #3)
>
> The assertion would just check that cx->requestDepth > 0.
Do you mean cx->checkRequestDepth? Should CHECK_REQUEST(cx) be used instead? Should this assertion be inserted at the start of IterateCells()?
Could you post a link to the build? I'll look at it in the morning.
![]() |
Assignee | |
Comment 9•14 years ago
|
||
Here's one I just ran (actually, it's still running; I'll let you know if the crash doesn't reproduce in it):
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-f78aa2b1d49d/
Comment 10•14 years ago
|
||
Can you try to replace the Auto(Un)LockGC with JS_LOCK_GC(rt) and JS_UNLOCK_GC(rt);?
![]() |
Assignee | |
Comment 11•14 years ago
|
||
(In reply to comment #10)
> Can you try to replace the Auto(Un)LockGC with JS_LOCK_GC(rt) and
> JS_UNLOCK_GC(rt);?
I can, but I can only reproduce by building on the try server, which takes a couple of hours, and I already have two builds in flight. So I'll try it tomorrow :)
![]() |
Assignee | |
Comment 12•14 years ago
|
||
(In reply to comment #9)
> Here's one I just ran (actually, it's still running; I'll let you know if
> the crash doesn't reproduce in it):
>
> https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/
> nnethercote@mozilla.com-f78aa2b1d49d/
The crashes happened in js_DestroyContext() as expected. In the following build I commented out the IterateCell() calls in GetJSObjectSlots() and GetJSStringChars() and it didn't crash:
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-8e8d271eb1eb/
So that's good evidence that there's a problem with IterateCells() and/or the way it's being called. I'll try Gregor's suggestion from comment 10 next.
![]() |
Assignee | |
Comment 13•14 years ago
|
||
(In reply to comment #12)
> I'll try Gregor's suggestion from comment 10 next.
I tried that -- in IterateCells, I replaced |AutoLockGC lock(rt);| with |JS_LOCK_GC(rt)|, and replaced |AutoUnlockGC unlock(rt);| with |JS_UNLOCK_GC(rt);|. I got this:
Assertion failure: PR_TRUE == cvar->lock->locked, at /home/njn/moz/tm1/nsprpub/pr/src/pthreads/ptsynch.c:314
Program d64/dist/bin/firefox-bin (pid = 22973) received signal 6.
Stack:
UNKNOWN [/lib/x86_64-linux-gnu/libpthread.so.0 +0x0000FC60]
gsignal+0x00000035 [/lib/x86_64-linux-gnu/libc.so.6 +0x00033D05]
abort+0x00000186 [/lib/x86_64-linux-gnu/libc.so.6 +0x00037AB6]
PR_CreateFileMap+0x00000000 [d64/dist/bin/libnspr4.so +0x00012248]
UNKNOWN [d64/dist/bin/libnspr4.so +0x000245B0]
PR_NotifyAllCondVar+0x0000002E [d64/dist/bin/libnspr4.so +0x00025048]
UNKNOWN [d64/dist/bin/libxul.so +0x01E82BAF]
js::IterateCells(JSContext*, JSCompartment*, unsigned long, void*, void (*)(JSContext*, void*, unsigned long, void*))+0x00000139 [d64/dist/bin/libxul.so +0x01E82CEA]
Is there any documentation on how all this GC locking works? It's all voodoo to me at the moment -- locks, sessions, requests, etc -- and I'd like to understand it.
![]() |
Assignee | |
Comment 14•14 years ago
|
||
BTW, this code is in TM now, but the crashing memory reporters are commented out. Look for "bug 664647" in xpcjsruntime.cpp to uncomment them.
AutoGCSession notifies on a condvar in the dtor, so we have to relock at the end of the function. That's what that assert is.
This appears to be a compiler bug, which I guess isn't that surprising given that it's a PGO build. The bad code is generated for GetJSObjectSlotsCallback. Here's the disassembly:
void
GetJSObjectSlotsCallback(JSContext *cx, void *v, size_t traceKind, void *thing)
{
64606845 push esi
64606846 mov esi,dword ptr [esp+8]
6460684A push edi
6460684B mov edi,dword ptr [esp+10h]
JS_ASSERT(traceKind == JSTRACE_OBJECT);
JSObject *obj = (JSObject *)thing;
if (obj->hasSlotsArray()) {
6460684F push edi
64606850 call 645A987D
64606855 test al,al
64606857 je 64606865
PRInt64Data *data = (PRInt64Data *) v;
data->n += obj->numSlots() * sizeof(js::Value);
64606859 mov eax,dword ptr [edi+20h]
6460685C shl eax,3
6460685F add dword ptr [esi],eax
64606861 adc dword ptr [esi+4],0
64606865 pop edi
64606866 pop esi
}
}
64606867 ret
When we enter the function, the stack looks as follows (where greater addresses are at the top, so pushes happen to the bottom):
thing
traceKind
v
cx
return address
First we push esi. Then we move [esp+8], which is now cx, into esi. Then we push edi and move [esp+16], which is now v, into edi.
Then we call JSObject::hasSlotsArray, passing it edi (= v) for |this|. This is totally wrong; we really wanted |thing|. But we don't crash here because it's stack memory, so it's valid. Then we do the add to esi (= cx). Again, this is totally wrong; we wanted |v|.
I think the bug here is that the compiler is computing the addresses for the moves to esi and edi without accounting for the stack pushes it just did. Were it not for these pushes, the addresses would be correct. I expect that this is some sort of optimization that's trying to eliminate uses of the frame pointer.
The reason we crash is that the first 64 bits of cx (which got incremented instead of v) hold the doubly-linked list of contexts. When we mess this up, we cause a crash later when destroying the context.
You might be able to "fix" this by perturbing the code a little bit. But given all the problems we've had with PGO, I think it would be much better if we just turn it off (or maybe upgrade the compiler, if that's possible). Tracking down bugs like this is awful.
![]() |
Assignee | |
Comment 17•14 years ago
|
||
(In reply to comment #16)
>
> Tracking down bugs like this is awful.
Indeed it is, thanks for working it out.
![]() |
Assignee | |
Updated•14 years ago
|
Summary: IterateCells is buggy? → Work around Windows PGO compiler bug in GetJSObjectSlotsCallback and friends
Whiteboard: [see comment 16]
I built the broken revision locally with PGO using MSVC 2010 and could not reproduce the crashes.
The disassembly from the 2010 build:
GetJSObjectSlotsCallback(JSContext *cx, void *v, size_t traceKind, void *thing)
{
60D55F40 push esi
JS_ASSERT(traceKind == JSTRACE_OBJECT);
JSObject *obj = (JSObject *)thing;
if (obj->hasSlotsArray()) {
60D55F41 mov esi,dword ptr [esp+14h]
60D55F45 push esi
60D55F46 call JSObject::hasSlotsArray (60CF8EA7h)
60D55F4B test al,al
60D55F4D je GetJSObjectSlotsCallback+1Fh (60D55F5Fh)
PRInt64Data *data = (PRInt64Data *) v;
data->n += obj->numSlots() * sizeof(js::Value);
60D55F4F mov ecx,dword ptr [esi+20h]
60D55F52 mov eax,dword ptr [esp+0Ch]
60D55F56 shl ecx,3
60D55F59 add dword ptr [eax],ecx
60D55F5B adc dword ptr [eax+4],0
60D55F5F pop esi
}
}
60D55F60 ret
One less push and 4 more bytes of offset get us to the right place.
![]() |
Assignee | |
Comment 20•14 years ago
|
||
billm, khuey, thanks ever so much for diagnosing this. I would have had an awful time working it out by myself.
khuey, here's an obvious patch that de-optimizes the problematic functions.
Assignee: general → nnethercote
Attachment #540001 -
Flags: review?(khuey)
![]() |
Assignee | |
Updated•14 years ago
|
Summary: Work around Windows PGO compiler bug in GetJSObjectSlotsCallback and friends → Work around MSVC 2005 PGO bug in GetJSObjectSlotsCallback and friends
![]() |
Assignee | |
Comment 21•14 years ago
|
||
(In reply to comment #20)
>
> khuey, here's an obvious patch that de-optimizes the problematic functions.
I forgot to mention: the try server run was all green :)
Attachment #540001 -
Flags: review?(khuey) → review+
![]() |
Assignee | |
Comment 22•14 years ago
|
||
Whiteboard: [see comment 16] → [see comment 16][fixed-in-tracemonkey]
Comment 23•14 years ago
|
||
cdleary-bot mozilla-central merge info:
http://hg.mozilla.org/mozilla-central/rev/cecc5adfc440
Updated•14 years ago
|
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•