Closed Bug 133773 Opened 23 years ago Closed 23 years ago

malign race to use and destroy runtime state in js_DestroyContext

Categories

(Core :: JavaScript Engine, defect, P1)

defect

Tracking

()

VERIFIED FIXED
mozilla1.0

People

(Reporter: brendan, Assigned: brendan)

Details

(Keywords: crash, js1.5)

Attachments

(4 files, 6 obsolete files)

See the thread running through this message: news://news.mozilla.org:119/a7o6cb$k4t2@ripley.netscape.com The js_NewContext/js_DestroyContext synchronization mechanism (the JSRTS_DOWN, LAUNCHING, UP, and LANDING state variable, in conjunction with the stateChange condition) works fine except for one bad bug: js_DestroyContext removes cx from rt->contextList first, before transitioning from UP to LANDING if cx is "last". This means a next-to-last thread may still be racing through js_DestroyContext, having already removed its cx from rt->contextList (necessary for the first thread to be "last"), but still using rt->atomState, etc. The race after that is wide open, and Michel Drapeau indeed found one thread (the last one) freeing atom state while the other was trying to mark it from its default-forced GC called near the bottom of js_DestroyContext in the not-last case. The workaround Michel found, calling JS_DestroyContextNoGC (with a JS_GC call before it, but that's immaterial to the workaround, even if a good idea) just dodges the symptom-bullet by not forcing a GC from the next-to-last call to js_DestroyContext. /be
This needs to be fixed ASAP. Patch coming up. /be
Status: NEW → ASSIGNED
Keywords: js1.5, mozilla1.0
Target Milestone: --- → mozilla1.0
Attached patch proposed fix (obsolete) — Splinter Review
Please test, r=, and sr= ASAP. It tests well for me, notably shutdown leak testing -- but more testing is welcome. /be
This patch passes the JS testsuite in the debug/optimized JS shell on WinNT.
Keywords: mozilla1.0mozilla1.0+
Because we must make the next-to-last context empty rt->contextList only after it has (possibly, maybe) GC'd, and because we must have one code path through the top of js_DestroyContext before we can divince last from next-to-last from other cases, *all* cases must defer removing cx from the list. This may sound obvious now, it was missed for quite a while. An alternative is discussed with reasons for its rejection in the jscntxt.h comment. /be
Attachment #76429 - Attachment is obsolete: true
The tweak is important to API users who call JS_BeginRequest and keep that request open till JS_DestroyContext implicitly ends the request. Without this change from the last patch (moving cx->destroyingSelf = JS_TRUE; above the while (cx->requestDepth != 0) JS_EndRequest(cx); loop), such API users could not ensure that a racing GC would never mark cx's roots; this might result in too much garbage, or worse: shutdown leaks. /be
Attachment #76450 - Attachment is obsolete: true
+ * in that case, if the GC should see cx before out store to this boolean s/out store/our store/ I didn't want to attach another patch just for this comment typo fix. Ok, *now* I'm ready for r= and sr=! /be
Attached patch one more time... (obsolete) — Splinter Review
jband wisely waited to review, and guilted me into iterating in public yet again (I don't mind, I do it as a public service :-). Apart from the comment changes, which I hope are worth their weight (I forgot to update comments shown in the last cvs diff -u20 -- too many comments and some will rust and lie, but I think I got all relevant ones here), the main change is to make JS_ContextIterator safe. It calls js_ContextIterator, as does the GC. All context iterations should skip contexts being destroyed. This patch therefore doesn't touch jsgc.c, it modifies only jscntxt.[ch]. The cx->destroyingSelf = JS_TRUE; assignment now happens under the protection of the rt lock, which also protects context iterator steps. /be
Attachment #76648 - Attachment is obsolete: true
brendan: My head is starting to throb. It seems to me that you now risk a failure to detect 'last'. If the last two context destructions race, then the last one into the lock may not see itself as the only member of the context list at the point where 'last' is detected because the first one is still in the list. I wonder how bad that really is. I also wonder if this is all worth it. I sugested this once before... Would it really be so bad to not do the stuff we do for the 'last' context destruction when we do it. What if we waited until runtime destuction? It would be ugly to create a new context at that point just for the purpose of being able to cleanup the last stuff in the runtime. But, it would be safe and sane. Thoughts?
>It seems to me that you now risk a failure to detect 'last'. If the last two >context destructions race, then the last one into the lock may not see itself as >the only member of the context list at the point where 'last' is detected >because the first one is still in the list. I wonder how bad that really is. Right, and there's another problem: js_LiveContext. More work needed, I'm invalidating this patch. However, I don't believe we can defer other things till runtime destruction. That's an API incompatibility that some embedders I know of will notice, and call a memory bloat bug. /be
Attachment #76690 - Attachment is obsolete: true
Attached patch a better approach (obsolete) — Splinter Review
This patch touches more files, but changes fewer lines (66 -/+ lines in the diff instead of 89). Thanks to jband for buddying me here, and to timeless for reminding me to get more sleep. The last patch series attempted to leave a context being destroyed on rt->contextList until GC had been run on it, but that created new states that weren't handled properly by the logic of the code. It overcomplicated things, too, causing jband's head to hurt :-(. This patch keeps the old logic whereby js_DestroyContext removes cx from rt->contextList first, possibly changing rt->state to JSRTS_LANDING if cx is the last context in rt. But it makes a crucial fix, changing the lock that protects contextList and stateChange from rt->rtLock to rt->gcLock. This is needed for two reasons: 1. js_LiveContext, called while rt->gcLock is held by ClaimScope in jslock.c, without this patch could race badly with js_DestroyContext. The race was rare, and its bad outcome even rarer, I'm sure. But in the worst case, if ClaimScope found a context pointed at by a scope->ownercx that was about to be destroyed, but which js_DestroyContext had not yet removed from rt->contextList, its call to js_LiveContext would return true -- but if immediately after that return, the thread on which ClaimScope was running were preempted until after that cx had been destroyed, then ClaimScope would resume and fondle free memory. 2. To fix the present bug, we need to interlock contextList and state updates using gcLock, and teach the GC to bail if it is racing with the forced GC called on the last context in the runtime (we can't use rtLock because the GC's logic is protected by gcLock). That last-racing-with-next-to-last-context is the case that Michel Drapeau showed, and now, with this patch, it can't happen. js_DestroyContext calls js_ForceGC in the last-context case with the GC_LAST_CONTEXT parameter flag set, which distinguishes the call as a necessary one, compared to the potentially racing js_ForceGC(cx, 0) made in the not-last case of js_DestroyContext. Any call to js_GC without GC_LAST_CONTEXT in the gcflags arg now returns early if rt->state != JSRTS_UP. This tests well for me, although I was never able to reproduce Michel's crash. I'll mail him to make sure he tries this patch. Jband, I promise this won't hurt (as much :-). /be
Keywords: crash
Priority: -- → P1
Brendan: I think I buy this as addressing the stated problem. I see one issue though... You changed the JS_FRIEND_API js_ForceGC. lxr shows that there are callers outside of js/src. I doubt we want to expose the flags. Perhaps JS_ForceGC should be added (without the flags param) and js_ForceGC be made a non-friend? Either way, outside callers need to be fixed.
Otherwise, no code change from last patch's files, just more files. The js.c code was copied more often than I had thought. /be
Attachment #76907 - Attachment is obsolete: true
Comment on attachment 77163 [details] [diff] [review] complete fix, removes js_ForceGC as a FRIEND API sr=jband. And a change to shamu too!
Attachment #77163 - Flags: superreview+
Comment on attachment 77163 [details] [diff] [review] complete fix, removes js_ForceGC as a FRIEND API r/sr=shaver (back from space!)
Attachment #77163 - Flags: review+
Comment on attachment 77163 [details] [diff] [review] complete fix, removes js_ForceGC as a FRIEND API a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #77163 - Flags: approval+
Fix is in, thanks. /be
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
I'm an idiot -- the (last) call to js_UnpinPinnedAtoms can race badly with a next-to-last GC in progress. D'oh! Thanks to Michel for testing the patch, sorry I jumped the gun on checking in. This last window of vulnerability will be easy to close, I think. /be
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Yes, this means keeping the last request on a cx you want to destroy open and letting JS_DestroyContext (js_DestroyContext, actually) end that request for you is the way to go (workaround for the crash Michel found that this patch fixes). As noted in the comment in this patch, this simple change complements rather than replaces the previous patch, which is still needed in case the "not-last" js_DestroyContext races into the GC after the last one has ended its request. I don't believe there's any way for a GC to race with the dangerous parts of the last-context-destruction case now. Review appreciated, I'm waiting on Michel's testing thumbs-up. /be
Comment on attachment 77311 [details] [diff] [review] simple fix to avoid racing js_UnpinPinnedAtoms vs. js_{Mark,Sweep}AtomState sr=jband. I buy this. Let's hope the tests don't turn out to be smarter than we are (again).
Attachment #77311 - Flags: superreview+
Comment on attachment 77311 [details] [diff] [review] simple fix to avoid racing js_UnpinPinnedAtoms vs. js_{Mark,Sweep}AtomState r=shaver.
Attachment #77311 - Flags: review+
Michel wrote: >Congratulation ! >It seems to work fine. > >Thanks a lot ! So I'm going for a= from drivers. /be
Status: REOPENED → ASSIGNED
Comment on attachment 77311 [details] [diff] [review] simple fix to avoid racing js_UnpinPinnedAtoms vs. js_{Mark,Sweep}AtomState a=rjesup@wgate.com for drivers
Attachment #77311 - Flags: approval+
Fixed for good. /be
Status: ASSIGNED → RESOLVED
Closed: 23 years ago23 years ago
Resolution: --- → FIXED
Marking Verified -
Status: RESOLVED → VERIFIED
Conclusion: From the gdb backtrace we can see that it is because the state->table is NULL, the _HashTableEnumerateEntries triggers a segmentation fault. The reason of segmentation fault is we call js_FreeAtomState first (free the table of JSAtomState), and then we call js_MarkAtomState (access the table of JSAtomState). This order is possible under the fix patches in this page. This backtrace symptom is the same as that of the original bug. Please note that in my patch, I have commented the “if (!state->table)” check in js_MarkAtomState(). This check is added by the bug #131246, a different bug, and this check will hide the segmentation fault in firefox 133773 bug. In this bug, developers added lots of code in order to avoid the race of js_FreeAtomState vs. js_MarkAtomState (we want js_FreeAtomState is called after all js_MarkAtomState are called), but failed. I recommend that we reconsider mechanisms to achieve our goals (although the segmentation fault can be hidden by the check added in bug #131246).
Attachment #399019 - Attachment is obsolete: true
Attachment #399027 - Flags: review+
Sorry, there is a typo in my pseudo code in attachment 399027 [details]. If(last) { Wait until gcLevel == 1; // here is a typo UnpinAtom(); ... A correct form should be: If(last) { Wait until gcLevel == 0; UnpinAtom(); ...
Heming, thanks for the comments and analysis. It looks like you've independently found bug 477021. In general it's best to file a new bug rather than append to an old and resolved bug. Please see bug 477021 and let us know if anything still is broken. Thanks again, /be
See also bug 478336. /be
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: