664647 - Work around MSVC 2005 PGO bug in GetJSObjectSlotsCallback and friends

Assignee

Description

•

14 years ago

Bug 571249 is the first user of IterateCells. I'm getting crashes on Windows PGO builds. khuey helped me debug them a bit, the crashes are reliable -- about:memory crashes every time it's loaded in PGO builds -- but the crash location varies from run to run. It's always in IterateCells or just afterwards, and usually near some GC locking, sometimes relating to condvars. Here's the code that uses IterateCells: struct PRInt64Data { PRInt64Data() : n(0) { } PRInt64 n; }; void GetJSObjectSlotsCallback(JSContext *cx, void *v, size_t traceKind, void *thing) { JS_ASSERT(traceKind == JSTRACE_OBJECT); JSObject *obj = (JSObject *)thing; if (obj->hasSlotsArray()) { PRInt64Data *data = (PRInt64Data *) v; data->n += obj->numSlots() * sizeof(js::Value); } } static PRInt64 GetJSObjectSlots(void *dummy) { JSRuntime *rt = nsXPConnect::GetRuntimeInstance()->GetJSRuntime(); JSContext *cx = JS_NewContext(rt, 0); if (!cx) { NS_ERROR("couldn't create context for memory tracing"); return (PRInt64) -1; } PRInt64Data data; js::IterateCells(cx, NULL, js::TraceKindMask(JSTRACE_OBJECT), &data, *GetJSObjectSlotsCallback); JS_DestroyContextNoGC(cx); return data.n; } It's pretty innocuous. A lot of the crashes happen in one of JS_DestroyContextNoGC's children. billm, I'd really appreciate it if you double-check all the locking and related stuff in IterateCells and the above code. I don't understand the GC locking enough to work this out myself.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 1

•

14 years ago

I'd be kinda surprised if this is the problem, but you should be in a request when you call IterateCells. We should probably assert that. Are there any other threads doing anything (i.e., with JS code on the stack) when it crashes? What sort of memory is it trying to access when it crashes? If nothing else works, I would suggest cutting out code until the crashes go away. Maybe first try commenting out the entire IterateCells call. If that doesn't crash, comment out the main loop in IterateCells.

Nicholas Nethercote [inactive]

Assignee

Comment 2

•

14 years ago

(In reply to comment #1) > I'd be kinda surprised if this is the problem, but you should be in a > request when you call IterateCells. We should probably assert that. What does the assertion look like? > Are there any other threads doing anything (i.e., with JS code on the stack) > when it crashes? What sort of memory is it trying to access when it crashes? Hard to say, and it varies. The crash always happens when about:memory is loaded. about:memory uses JS to generate it's contents, but there's not a lot else going on. > If nothing else works, I would suggest cutting out code until the crashes go > away. Maybe first try commenting out the entire IterateCells call. If that > doesn't crash, comment out the main loop in IterateCells. I'm in the process of doing that :)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 3

•

14 years ago

(In reply to comment #2) > (In reply to comment #1) > > I'd be kinda surprised if this is the problem, but you should be in a > > request when you call IterateCells. We should probably assert that. > > What does the assertion look like? Sorry, this was unclear. You're definitely not in a request, and I think you should be. I just don't think it would fix this crash. Just wrap the whole thing in JS_BeginRequest/JS_EndRequest for the context you create. The assertion would just check that cx->requestDepth > 0.

Nicholas Nethercote [inactive]

Assignee

Comment 4

•

14 years ago

Some edited excerpts from the IRC conversation between me and khuey, in case it's any use: khuey and now I crash on the JS_LOCK_GC bit just down from there khuey that seems weird njn where's the JS_LOCK_GC bit? khuey I'm also crashing on a completely random thread khuey jscntxt.cpp line 575 khuey the debugger tells me the main thread is blocked there and I crashed on some random thread waiting on a condvar ... khuey that's ... odd njn I wonder if I need some locking in GetJSObjectSlots khuey stepping through things seems to change where we're crashing, which sucks njn and makes it sound like a threading bug khuey ok this time I made it through the JS_LOCK_GC call khuey and crashed randomly again in some condvar stuff khuey so the main thread is in js::IterateCells njn interesting khuey and it's trying to relock the GC lock khuey at the end of the function njn at the end? khuey yes njn how/why? khuey I assume the AutoUnlockGC has expired khuey so we're relocking khuey sucks that the compiler can't just combine the two khuey and realize that relocking is dumb khuey anyways, the relevant disassembly at the end of that function is http://khuey.pastebin.mozilla.org/1250553 khuey and this function is in the PR_Lock call khuey so it's waiting on the lock njn ok khuey finds the crashing thread again khuey ok, so we're crashing on the gc thread khuey at jsgc.cpp line 2063 njn definitely sounds like locking problems khuey on the PR_WaitCondVar call khuey and it's the same lock njn how does that crash? khuey has no idea

Gregor Wagner [:gwagner]

Comment 5

•

14 years ago

Seems like a problem with the order of deconstructors for AutoLockGC and AutoUnlockGC?

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 6

•

14 years ago

The disassembly isn't obviously wrong there; we destroy AutoUnlockGC (relock), AutoGCSession (the movs to memory and the notify) and AutoLockGC (the unlock).

Nicholas Nethercote [inactive]

Assignee

Comment 7

•

14 years ago

(In reply to comment #3) > > The assertion would just check that cx->requestDepth > 0. Do you mean cx->checkRequestDepth? Should CHECK_REQUEST(cx) be used instead? Should this assertion be inserted at the start of IterateCells()?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 8

•

14 years ago

Could you post a link to the build? I'll look at it in the morning.

Nicholas Nethercote [inactive]

Assignee

Comment 9

•

14 years ago

Here's one I just ran (actually, it's still running; I'll let you know if the crash doesn't reproduce in it): https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-f78aa2b1d49d/

Gregor Wagner [:gwagner]

Comment 10

•

14 years ago

Can you try to replace the Auto(Un)LockGC with JS_LOCK_GC(rt) and JS_UNLOCK_GC(rt);?

Nicholas Nethercote [inactive]

Assignee

Comment 11

•

14 years ago

(In reply to comment #10) > Can you try to replace the Auto(Un)LockGC with JS_LOCK_GC(rt) and > JS_UNLOCK_GC(rt);? I can, but I can only reproduce by building on the try server, which takes a couple of hours, and I already have two builds in flight. So I'll try it tomorrow :)

Nicholas Nethercote [inactive]

Assignee

Comment 12

•

14 years ago

(In reply to comment #9) > Here's one I just ran (actually, it's still running; I'll let you know if > the crash doesn't reproduce in it): > > https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/ > nnethercote@mozilla.com-f78aa2b1d49d/ The crashes happened in js_DestroyContext() as expected. In the following build I commented out the IterateCell() calls in GetJSObjectSlots() and GetJSStringChars() and it didn't crash: https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-8e8d271eb1eb/ So that's good evidence that there's a problem with IterateCells() and/or the way it's being called. I'll try Gregor's suggestion from comment 10 next.

Nicholas Nethercote [inactive]

Assignee

Comment 13

•

14 years ago

(In reply to comment #12) > I'll try Gregor's suggestion from comment 10 next. I tried that -- in IterateCells, I replaced |AutoLockGC lock(rt);| with |JS_LOCK_GC(rt)|, and replaced |AutoUnlockGC unlock(rt);| with |JS_UNLOCK_GC(rt);|. I got this: Assertion failure: PR_TRUE == cvar->lock->locked, at /home/njn/moz/tm1/nsprpub/pr/src/pthreads/ptsynch.c:314 Program d64/dist/bin/firefox-bin (pid = 22973) received signal 6. Stack: UNKNOWN [/lib/x86_64-linux-gnu/libpthread.so.0 +0x0000FC60] gsignal+0x00000035 [/lib/x86_64-linux-gnu/libc.so.6 +0x00033D05] abort+0x00000186 [/lib/x86_64-linux-gnu/libc.so.6 +0x00037AB6] PR_CreateFileMap+0x00000000 [d64/dist/bin/libnspr4.so +0x00012248] UNKNOWN [d64/dist/bin/libnspr4.so +0x000245B0] PR_NotifyAllCondVar+0x0000002E [d64/dist/bin/libnspr4.so +0x00025048] UNKNOWN [d64/dist/bin/libxul.so +0x01E82BAF] js::IterateCells(JSContext*, JSCompartment*, unsigned long, void*, void (*)(JSContext*, void*, unsigned long, void*))+0x00000139 [d64/dist/bin/libxul.so +0x01E82CEA] Is there any documentation on how all this GC locking works? It's all voodoo to me at the moment -- locks, sessions, requests, etc -- and I'd like to understand it.

Nicholas Nethercote [inactive]

Assignee

Comment 14

•

14 years ago

BTW, this code is in TM now, but the crashing memory reporters are commented out. Look for "bug 664647" in xpcjsruntime.cpp to uncomment them.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 15

•

14 years ago

AutoGCSession notifies on a condvar in the dtor, so we have to relock at the end of the function. That's what that assert is.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 16

•

14 years ago

This appears to be a compiler bug, which I guess isn't that surprising given that it's a PGO build. The bad code is generated for GetJSObjectSlotsCallback. Here's the disassembly: void GetJSObjectSlotsCallback(JSContext *cx, void *v, size_t traceKind, void *thing) { 64606845 push esi 64606846 mov esi,dword ptr [esp+8] 6460684A push edi 6460684B mov edi,dword ptr [esp+10h] JS_ASSERT(traceKind == JSTRACE_OBJECT); JSObject *obj = (JSObject *)thing; if (obj->hasSlotsArray()) { 6460684F push edi 64606850 call 645A987D 64606855 test al,al 64606857 je 64606865 PRInt64Data *data = (PRInt64Data *) v; data->n += obj->numSlots() * sizeof(js::Value); 64606859 mov eax,dword ptr [edi+20h] 6460685C shl eax,3 6460685F add dword ptr [esi],eax 64606861 adc dword ptr [esi+4],0 64606865 pop edi 64606866 pop esi } } 64606867 ret When we enter the function, the stack looks as follows (where greater addresses are at the top, so pushes happen to the bottom): thing traceKind v cx return address First we push esi. Then we move [esp+8], which is now cx, into esi. Then we push edi and move [esp+16], which is now v, into edi. Then we call JSObject::hasSlotsArray, passing it edi (= v) for |this|. This is totally wrong; we really wanted |thing|. But we don't crash here because it's stack memory, so it's valid. Then we do the add to esi (= cx). Again, this is totally wrong; we wanted |v|. I think the bug here is that the compiler is computing the addresses for the moves to esi and edi without accounting for the stack pushes it just did. Were it not for these pushes, the addresses would be correct. I expect that this is some sort of optimization that's trying to eliminate uses of the frame pointer. The reason we crash is that the first 64 bits of cx (which got incremented instead of v) hold the doubly-linked list of contexts. When we mess this up, we cause a crash later when destroying the context. You might be able to "fix" this by perturbing the code a little bit. But given all the problems we've had with PGO, I think it would be much better if we just turn it off (or maybe upgrade the compiler, if that's possible). Tracking down bugs like this is awful.

Nicholas Nethercote [inactive]

Assignee

Comment 17

•

14 years ago

(In reply to comment #16) > > Tracking down bugs like this is awful. Indeed it is, thanks for working it out.

Nicholas Nethercote [inactive]

Assignee

Updated

•

14 years ago

Summary: IterateCells is buggy? → Work around Windows PGO compiler bug in GetJSObjectSlotsCallback and friends

Whiteboard: [see comment 16]

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 18

•

14 years ago

I built the broken revision locally with PGO using MSVC 2010 and could not reproduce the crashes.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 19

•

14 years ago

The disassembly from the 2010 build: GetJSObjectSlotsCallback(JSContext *cx, void *v, size_t traceKind, void *thing) { 60D55F40 push esi JS_ASSERT(traceKind == JSTRACE_OBJECT); JSObject *obj = (JSObject *)thing; if (obj->hasSlotsArray()) { 60D55F41 mov esi,dword ptr [esp+14h] 60D55F45 push esi 60D55F46 call JSObject::hasSlotsArray (60CF8EA7h) 60D55F4B test al,al 60D55F4D je GetJSObjectSlotsCallback+1Fh (60D55F5Fh) PRInt64Data *data = (PRInt64Data *) v; data->n += obj->numSlots() * sizeof(js::Value); 60D55F4F mov ecx,dword ptr [esi+20h] 60D55F52 mov eax,dword ptr [esp+0Ch] 60D55F56 shl ecx,3 60D55F59 add dword ptr [eax],ecx 60D55F5B adc dword ptr [eax+4],0 60D55F5F pop esi } } 60D55F60 ret One less push and 4 more bytes of offset get us to the right place.

Nicholas Nethercote [inactive]

Assignee

Comment 20

•

14 years ago

Attached patch patch — Details — Splinter Review

billm, khuey, thanks ever so much for diagnosing this. I would have had an awful time working it out by myself. khuey, here's an obvious patch that de-optimizes the problematic functions.

Assignee: general → nnethercote

Attachment #540001 - Flags: review?(khuey)

Nicholas Nethercote [inactive]

Assignee

Updated

•

14 years ago

Summary: Work around Windows PGO compiler bug in GetJSObjectSlotsCallback and friends → Work around MSVC 2005 PGO bug in GetJSObjectSlotsCallback and friends

Nicholas Nethercote [inactive]

Assignee

Comment 21

•

14 years ago

(In reply to comment #20) > > khuey, here's an obvious patch that de-optimizes the problematic functions. I forgot to mention: the try server run was all green :)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Updated

•

14 years ago

Attachment #540001 - Flags: review?(khuey) → review+

Nicholas Nethercote [inactive]

Assignee

Comment 22

•

14 years ago

http://hg.mozilla.org/tracemonkey/rev/cecc5adfc440

Whiteboard: [see comment 16] → [see comment 16][fixed-in-tracemonkey]

Chris Leary [:cdleary] (not checking bugmail)

Comment 23

•

14 years ago

cdleary-bot mozilla-central merge info: http://hg.mozilla.org/mozilla-central/rev/cecc5adfc440

Chris Leary [:cdleary] (not checking bugmail)

Updated

•

14 years ago

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED