747 bytes, text/plain
9.18 KB, text/plain
17.19 KB, patch
|Details | Diff | Splinter Review|
2.42 KB, text/plain
1.71 KB, patch
|Details | Diff | Splinter Review|
1.79 KB, patch
|Details | Diff | Splinter Review|
1.75 KB, text/plain
I've been working on the LDAP datasource (which is implemented in JS). With my newest version (patch available in 83360), I always get a crash on quitting after I've loaded our XUL example. I can't tell what causes this crash, my new code for the datasource or that I switched the example to outliner. I (or dmose) can provide instructions on how to build the datasource and example if you let me know what platform you want to debug this on. I'm not too sure about the right component for this bug, it seems related to XPConnect and to the JS GC. I'll attach a stack trace for the crash.
jband, I think this is the one that you helped me track down some months ago but I was too lame to actually file the bug on. You and Brendan had a rather involved discussion about finalization, and (aybe) XBL, and came to the conclusion that the fix would be some work, but should be done eventually. Hope this rings a bell...
peterv: David Bradley is the primary problem solver for xpconnect now. Like me he prefers windows. It would be good to see what line this crashes on - and which pointer(?) is bad. I'll help him if I can. dmose: I remember working with you then. The conversation with brendan was about the problems with code paths that go from gc to a finalizer to code that ends up calling js_AllocGCThing - and asserting. What I can't remember is what your specific problem was that got us talking about that.
The specific problem was exactly this: quitting the browser while the LDAP datasource was being held by XUL.
http://lxr.mozilla.org/seamonkey/source/js/src/xpconnect/src/xpcwrappednative.cpp#842 is the line where it fails, obj seems to be bad. I can't really tell you much about obj though (Codewarrior can't handle pointers to interfaces :().
I can't try this out right now (no win box at home), but these should be the steps to reproduce: apply my patch from bug 83360 (attachment 36531 [details] [diff] [review]) set ENABLE_LDAP_EXPERIMENTAL=1 set DISABLE_TESTS= cd mozilla\directory\xpcom nmake /f makefile.win start the browser open chrome://communicator/content/ldapviewer/example.xul wait a bit for the data to load quit the browser, this is where you should crash dmose, please correct me if I'm wrong.
hmm. It is certainly possible that xpconnect is messing up its reference counting here. But, I'd bet that some other code has released a reference that xpconnect owns; i.e. some other code has more Release calls than AddRef calls on the same object that xpconnect is referencing so that the object has been deleted even though xpconnect thinks it is holding a reference. Inspection in a more helpful debugger can clear this up.
Sorry for the delay, thanks jband for jumping in. jband, did you attempt his test? Just curious. I would think if xpconnect had some kind of ref' counting issue it would be rearing its head in many other places. Not to discount the possibility. jband, if you haven't run the test, I can give it a try. Maybe I'll be able to get more information that will point to source. I've got to do a full build before I can do any of this though, so it will take a couple of hours before I see any results. jband, if you've already tried his patch let me know, and I won't bother.
I was able to duplicate the problem and I'll attach a stack trace. Hopefully this will provide more clues.
Ah, so it is coming back to call js_AllocGCThing during a gc. This is just not allowed. Perhaps a *big* chunk of the stack was just missing in the Mac stacks from peterv? Peter can this be true? Or is there somehow sometimes a problem with a bad pointer in that same location? Seems unlikely. I have not tried the test case. But the windows stack says a lot. I guess this was the thing dmose and I looked at. I don't see what can be done off hand. This is why brendan and I got into a 'thowing up hands' discussion back then. This code is doing something pretty darn reasonable in calling an observer in its Release handling. It can't be expected to know that the call will result in a call via xpconnect requiring an JS object allocation in order to build a wrapper. I see lots of things wrong with every solution I can think of to this general problem. This is where I tried to convince Brendan that the JS engine *should* allow this - even if it means restarting gc. Yet, we came up with plenty of problems with that.
jst and I were talking about this. Though the general solution for all embeddings of the js engine might be elusive, the paths to this problem through xpconnect could be dealt with by queueing up and deferring the release calls in xpconnect until the JSGC_END event. The downside of this is that it would require the allocation of memory to track the queued object pointers during gc - which might sometimes have been triggered because of a low memory condition. This seems unlikely to be a big problem in the browser though.
jband, I guess we could hope that GC already reclaimed memory before it got to this point, but that's like roling the dice. Would it be safe to preallocate a small amount of memory to hold the list? I doubt this list would need to be very big, would it? How much data would we need to store per object, just a pointer?
Over a full GC cycle I guess it could grow quite large, how likely is that to happen vs being out of memory?
Weird. I did get that other stack trace twice, but now I (again) always get the one I posted.
Peter, you running release or debug?
Oh, I guess that's pretty self explanatory from the info provided in the stack trace. In any case, I find it interesting that the stack you posted terminated in the XPCWrappedNative::FlatJSObjectFinalized, while mine terminated within the JS engine. Could we be seeing a combination of two problems here?
Ok, to sum things up, we have it crashing in one place on Windows and another on Mac. The Mac version is crashing because the obj pointer returned from to->GetNative is invalid. The Windows version is crashing because of the JS_ASSERT being fired because an object is being allocated during a GC cycle. Peter, you said that you're consistantly getting the bomb on the Mac at the location indicated by the stack you posted, but had gotten a bomb in another location, but can't reproduce it. Is that correct? Also, my current code is almost a week old. I'm going to update it and see if the result changes. If that allows me to reproduce the fault at the same point as you, that might indicate a reference counting issue outside of xpconnect, as I believe I have most of the recent xpconnect changes. Under Windows I might be able to get more information and determine the source.
I have a patch to support deferring the Release calls on the native xpcom objects of gc'd wrappednatives until after the gc is done. This can be enabled at runtime via a method on nsIXPConnect. In my patch I turn this option on in a call from the nsJSEnvironment ctor. I also added support for telling xpconnect that gc is only being run on the main thread. This allows xpconnect to omit some of its locking. Actually, the flag tells xpconnect to itself force gc to happen on the main thread. So, we could remove the now redundant GCCallback in nsJSEnvironment if we choose. This code is working for me. The instrumentation show that most gcs have only tens of deferred Releases. Some have up to 500 or so. The most I've seen was 1700 on shutdown. I left in code that will do an immediate Release if we fail to grow the nsVoidArray that holds the deferred Release pointers when adding a pointer for later Release. I have not tried this code on the specific test case here. I really wonder if there are in fact two separate crashes here. I doubt it. Can someone please try my patch and we can decide if we want it checked in.
Created attachment 36978 [details] [diff] [review] patch to defer Release calls on natives of wrappednatives
John, I had a look through your patch (didn't try it yet, I'll leave that to peterv :-) and your changes look good to me, the only change you might consider is to change: + JSBool mMainThreadOnlyGC; + JSBool mDeferReleases; + JSBool mDoingFinalization; to use PRPackedBool (JSPackedBool?) to save a whopping 8 byets in the singleton XPCRuntime :-) sr=jst should you need it (and assuming it actually fixes the problem ;-).
Thanks for looking at this Johnny. If this were not a singleton I'd opt to go with your suggestion about the PRPackedBool. But, I'm betting that any given compiler is going to add more than 8 bytes of additional code just to mask access to the data in the various inlined accessors/mutators.
I installed your patch jband. I'm now getting an access violation in the NS_RELEASE(obj); statement in GCCallback. The memory obj is pointing at, is hex DD which is what VC++ usually fills delete memory with. I'll look further into this in the morning, and check back here for any further information. I'll attach a stack trace.
OK. So there were two bugs. As it happens both were associated with the same code in RDF. The first bug was the indirect call into js_AllocGCThing during gc. (see dbradley's first stack trace). My xpconnect deferred Release stuff deals with that. The other bug was the one peterv originally saw and which shows up in dbradley's second stack trace. I had to insert instrumentation in xpconnect to find it. (Note: I think we want the xpconnect deferred Release code, but it does make it a little harder to find the xpconnect wrapper data associated with a bad native pointer). Anyway, the problem is in CompositeDataSourceImpl::Release http://lxr.mozilla.org/seamonkey/search?string=CompositeDataSourceImpl%3A%3ARel In the "else if" case the code makes a call out to ds->RemoveObserver(this) and then immediately does a NS_DELETEXPCOM(this). This pattern ASSUMES that no one will AddRef the 'this' during the interim. Now, I know what was intended, but this just ain't right. It crashes because the ds is implemented in JS and so we automatically build an xpconnect wrapper around the passed 'this' and that wrapper quite reasonably does an AddRef and expects that AddRef to pin the object down. But this code deletes the object anyway. So we later crash when we dereference that pointer. Ironically, it was this same code requiring wrapper creation during a Release that caused the other problem during GC. I should have found this faster since I'd seen this code :) So what should be done to fix this? It looks to me like insead of calling NS_DELETEXPCOM(this) you could base a solution on a "return Release();" Since you've already pumped up refcount then this would normally go through the "decrement to zero" case. But if an outsider had added a ref then you'd just decrement the count and stay alive. However, you have the problem that you'd then also need to be clearing the mDataSources array after iterating it and before returning. But this brings up the ugly question: what if the mDataSources array is modified as you are calling out to ds->RemoveObserver(this) in the loop? What is the underlying model here? Also, you are doing a bunch of Atomic manipulation. This implies you think you'll be accessed on other threads. But, if so, how can you get away with iterating the array without any sort of locking? I'll attach a patch for consideration. It ignores the multi-threaded risk, but does deal with the case of changes to the mDataSources array during the call out. I'm not convinced it does not leak :( But I think it is right. I'm at a loss to know what to do about the NS_LOG_RELEASE call in this case. Anyway, it does work for me using the test cited above (and with my xpconnect patch in place to avoid the JS_ASSERT for the alloc during gc problem). I'm reassigning to waterson for comment and whatever.
Created attachment 36999 [details] [diff] [review] patch to make CompositeDataSourceImpl::Release friendlier
With this patch the adding of count+1 should no longer be needed since the members of mDataSources are removed from the array inside the loop, so I think the PR_AtomicAdd at the top could just be a PR_AtomicIncrement and the PR_AtomicDecrement added could be removed. I don't think the mDataSources manipulation should need to be threadsafe since all the remaining pointers to the composite datasource should be to the nsIRDFObserver interface, and none of the methods on that interface (except for the Release method!) manipulate mDataSources. (I don't think anyone should ever QI from nsIRDFObserver to nsIRDFDataSource, should they? Even so, should it be prevented by implementing the two on different objects?) Looking at this patch, I can't think of any way it could leak or reenter the loop in the Release method.
jband, your patch looks fine. r=waterson
New patch does as dbaron suggests. So I guess the refcount instrumentation actually balances out. waterson: still look good to you? Or did I make a mistake I'm not seeing?
Yup, looks okay. (/me is wishing that weak references had been invented when I wrote this the first time.)
OK. I'll take this one back and try to do the legwork to get both patches in 0.9.2. peterv: Do these patches fix the problem for you?
After applying the patch I'm getting another similar crash (accessing deleted memory) at a different location. I'll attach a stack trace and investigate further.
What activities did you perform before the crash? Did you rebuild the rdf dll? (this requires rebuilding in rdf/build too).
My tree has no code at xpcjsruntime.cpp line 638. Your stack implies that mDyingWrappedNativeProtoMap is whacked. Are you sure xpconnect was fully rebuilt?
r=dbradley It's on my end, a patch botched the xpcjsruntime.cpp file, the deletion of mDyingWrappedNativeProtoMap got duplicated when I applied the patch to defer release calls on natives of wrappednatives. It is working fine now and changes look good
Did you intend to include that code with this patch?
Oops. Sorry. That would be your patch to bug 82274. Yes, we might as well check it in along with this. It is reviewed and 100% safe and sane.
if the super review in the first patch applies to the better fix then a= email@example.com for checkin to the trunk. (on behalf of drivers)
Both patched were checked into the trunk.
peterv, does this fix things? If so, would you be able to mark this bug VERIFIED, thanks -
Verified fixed. Thanks all for the very quick turnaround :).