Closed Bug 83367 Opened 23 years ago Closed 23 years ago

Crash [@ XPCWrappedNative::FlatJSObjectFinalized] with LDAP datasource

Categories

(Core :: XPConnect, defect)

All
Mac System 9.x
defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla0.9.2

People

(Reporter: peterv, Assigned: jband_mozilla)

References

Details

Attachments

(7 files)

I've been working on the LDAP datasource (which is implemented in JS). With my
newest version (patch available in 83360), I always get a crash on quitting
after I've loaded our XUL example. I can't tell what causes this crash, my new
code for the datasource or that I switched the example to outliner. I (or dmose)
can provide instructions on how to build the datasource and example if you let
me know what platform you want to debug this on.
I'm not too sure about the right component for this bug, it seems related to
XPConnect and to the JS GC. I'll attach a stack trace for the crash.
Attached file Stack trace
jband, I think this is the one that you helped me track down some months ago but
I was too lame to actually file the bug on.  You and Brendan had a rather
involved discussion about finalization, and (aybe) XBL, and came to the
conclusion that the fix would be some work, but should be done eventually.  Hope
this rings a bell...
peterv: David Bradley is the primary problem solver for xpconnect now. Like me 
he prefers windows. It would be good to see what line this crashes on - and 
which pointer(?) is bad. I'll help him if I can.

dmose: I remember working with you then. The conversation with brendan was about 
the problems with code paths that go from gc to a finalizer to code that ends up 
calling js_AllocGCThing - and asserting. What I can't remember is what your 
specific problem was that got us talking about that.
Assignee: jband → dbradley
The specific problem was exactly this: quitting the browser while the LDAP
datasource was being held by XUL.
http://lxr.mozilla.org/seamonkey/source/js/src/xpconnect/src/xpcwrappednative.cpp#842
is the line where it fails, obj seems to be bad. I can't really tell you much
about obj though (Codewarrior can't handle pointers to interfaces :().
I can't try this out right now (no win box at home), but these should be the
steps to reproduce:

apply my patch from bug 83360 (attachment 36531 [details] [diff] [review])
set ENABLE_LDAP_EXPERIMENTAL=1
set DISABLE_TESTS=
cd mozilla\directory\xpcom
nmake /f makefile.win

start the browser
open chrome://communicator/content/ldapviewer/example.xul
wait a bit for the data to load
quit the browser, this is where you should crash

dmose, please correct me if I'm wrong.
hmm. It is certainly possible that xpconnect is messing up its reference 
counting here. But, I'd bet that some other code has released a reference that 
xpconnect owns; i.e. some other code has more Release calls than AddRef calls on 
the same object that xpconnect is referencing so that the object has been 
deleted even though xpconnect thinks it is holding a reference. Inspection in a 
more helpful debugger can clear this up.
Sorry for the delay, thanks jband for jumping in. jband, did you attempt his 
test? Just curious. I would think if xpconnect had some kind of ref' counting 
issue it would be rearing its head in many other places. Not to discount the 
possibility. jband, if you haven't run the test, I can give it a try. Maybe I'll 
be able to get more information that will point to source. I've got to do a full 
build before I can do any of this though, so it will take a couple of hours 
before I see any results. jband, if you've already tried his patch let me know, 
and I won't bother.
Status: NEW → ASSIGNED
I was able to duplicate the problem and I'll attach a stack trace. Hopefully 
this will provide more clues.
Ah, so it is coming back to call js_AllocGCThing during a gc. This is just not 
allowed. Perhaps a *big* chunk of the stack was just missing in the Mac stacks 
from peterv? Peter can this be true? Or is there somehow sometimes a problem 
with a bad pointer in that same location? Seems unlikely.

I have not tried the test case. But the windows stack says a lot.

I guess this was the thing dmose and I looked at. I don't see what can be done 
off hand. This is why brendan and I got into a 'thowing up hands' discussion 
back then. This code is doing something pretty darn reasonable in calling an 
observer in its Release handling. It can't be expected to know that the call 
will result in a call via xpconnect requiring an JS object allocation in order 
to build a wrapper. I see lots of things wrong with every solution I can think 
of to this general problem. This is where I tried to convince Brendan that the 
JS engine *should* allow this - even if it means restarting gc. Yet, we came up 
with plenty of problems with that.
jst and I were talking about this. Though the general solution for all 
embeddings of the js engine might be elusive, the paths to this problem through 
xpconnect could be dealt with by queueing up and deferring the release calls in 
xpconnect until the JSGC_END event. The downside of this is that it would 
require the allocation of memory to track the queued object pointers during gc - 
which might sometimes have been triggered because of a low memory condition. 
This seems unlikely to be a big problem in the browser though.
jband, I guess we could hope that GC already reclaimed memory before it got to 
this point, but that's like roling the dice. Would it be safe to preallocate a 
small amount of memory to hold the list? I doubt this list would need to be very 
big, would it? How much data would we need to store per object, just a pointer?
Over a full GC cycle I guess it could grow quite large, how likely is that to 
happen vs being out of memory?
Weird. I did get that other stack trace twice, but now I (again) always get the
one I posted.
Peter, you running release or debug?
Oh, I guess that's pretty self explanatory from the info provided in the stack 
trace. In any case, I find it interesting that the stack you posted terminated 
in the XPCWrappedNative::FlatJSObjectFinalized, while mine terminated within the 
JS engine.

Could we be seeing a combination of two problems here?
Ok, to sum things up, we have it crashing in one place on Windows and another on 
Mac.

The Mac version is crashing because the obj pointer returned from to->GetNative 
is invalid.

The Windows version is crashing because of the JS_ASSERT being fired because an 
object is being allocated during a GC cycle.

Peter, you said that you're consistantly getting the bomb on the Mac at the 
location indicated by the stack you posted, but had gotten a bomb in another 
location, but can't reproduce it. Is that correct?

Also, my current code is almost a week old. I'm going to update it and see if 
the result changes. If that allows me to reproduce the fault at the same point 
as you, that might indicate a reference counting issue outside of xpconnect, as 
I believe I have most of the recent xpconnect changes. Under Windows I might be 
able to get more information and determine the source.
I have a patch to support deferring the Release calls on the native xpcom 
objects of gc'd wrappednatives until after the gc is done. This can be enabled 
at runtime via a method on nsIXPConnect. In my patch I turn this option on in a 
call from the nsJSEnvironment ctor. I also added support for telling xpconnect 
that gc is only being run on the main thread. This allows xpconnect to omit some 
of its locking. Actually, the flag tells xpconnect to itself force gc to happen 
on the main thread. So, we could remove the now redundant GCCallback in 
nsJSEnvironment if we choose.

This code is working for me. The instrumentation show that most gcs have only 
tens of deferred Releases. Some have up to 500 or so. The most I've seen was 
1700 on shutdown. I left in code that will do an immediate Release if we fail to 
grow the nsVoidArray that holds the deferred Release pointers when adding a 
pointer for later Release.

I have not tried this code on the specific test case here. I really wonder if 
there are in fact two separate crashes here. I doubt it.

Can someone please try my patch and we can decide if we want it checked in.
John, I had a look through your patch (didn't try it yet, I'll leave that to
peterv :-) and your changes look good to me, the only change you might consider
is to change:

+    JSBool mMainThreadOnlyGC;
+    JSBool mDeferReleases;
+    JSBool mDoingFinalization;

to use PRPackedBool (JSPackedBool?) to save a whopping 8 byets in the singleton
XPCRuntime :-)

sr=jst should you need it (and assuming it actually fixes the problem ;-).
Thanks for looking at this Johnny. If this were not a singleton I'd opt to go 
with your suggestion about the PRPackedBool. But, I'm betting that any given 
compiler is going to add more than 8 bytes of additional code just to mask 
access to the data in the various inlined accessors/mutators.
I installed your patch jband. I'm now getting an access violation in the 
NS_RELEASE(obj); statement in GCCallback. The memory obj is pointing at, is hex 
DD which is what VC++ usually fills delete memory with. I'll look further into 
this in the morning, and check back here for any further information. I'll 
attach a stack trace.
OK. So there were two bugs. As it happens both were associated with the same 
code in RDF.

The first bug was the indirect call into js_AllocGCThing during gc. (see 
dbradley's first stack trace). My xpconnect deferred Release stuff deals with 
that.

The other bug was the one peterv originally saw and which shows up in dbradley's 
second stack trace. I had to insert instrumentation in xpconnect to find it. 
(Note: I think we want the xpconnect deferred Release code, but it does make it 
a little harder to find the xpconnect wrapper data associated with a bad native 
pointer).

Anyway, the problem is in CompositeDataSourceImpl::Release

http://lxr.mozilla.org/seamonkey/search?string=CompositeDataSourceImpl%3A%3ARel

In the "else if" case the code makes a call out to ds->RemoveObserver(this) and 
then immediately does a NS_DELETEXPCOM(this). This pattern ASSUMES that no one 
will AddRef the 'this' during the interim.

Now, I know what was intended, but this just ain't right. It crashes because the 
ds is implemented in JS and so we automatically build an xpconnect wrapper 
around the passed 'this' and that wrapper quite reasonably does an AddRef and 
expects that AddRef to pin the object down. But this code deletes the object 
anyway. So we later crash when we dereference that pointer.

Ironically, it was this same code requiring wrapper creation during a Release 
that caused the other problem during GC. I should have found this faster since 
I'd seen this code :)

So what should be done to fix this?

It looks to me like insead of calling NS_DELETEXPCOM(this) you could base a 
solution on a "return Release();" Since you've already pumped up refcount then 
this would normally go through the "decrement to zero" case. But if an outsider 
had added a ref then you'd just decrement the count and stay alive.

However, you have the problem that you'd then also need to be clearing the 
mDataSources array after iterating it and before returning. But this brings up 
the ugly question: what if the mDataSources array is modified as you are calling 
out to ds->RemoveObserver(this) in the loop? What is the underlying model here? 
Also, you are doing a bunch of Atomic manipulation. This implies you think 
you'll be accessed on other threads. But, if so, how can you get away with 
iterating the array without any sort of locking?

I'll attach a patch for consideration. It ignores the multi-threaded risk, 
but does deal with the case of changes to the mDataSources array during the 
call out. I'm not convinced it does not leak :( But I think it is right. I'm at 
a loss to know what to do about the NS_LOG_RELEASE call in this case.

Anyway, it does work for me using the test cited above (and with my xpconnect 
patch in place to avoid the JS_ASSERT for the alloc during gc problem).

I'm reassigning to waterson for comment and whatever.
Assignee: dbradley → waterson
Status: ASSIGNED → NEW
With this patch the adding of count+1 should no longer be needed since the
members of mDataSources are removed from the array inside the loop, so I think
the PR_AtomicAdd at the top could just be a PR_AtomicIncrement and the
PR_AtomicDecrement added could be removed.

I don't think the mDataSources manipulation should need to be threadsafe since
all the remaining pointers to the composite datasource should be to the
nsIRDFObserver interface, and none of the methods on that interface (except for
the Release method!) manipulate mDataSources.  (I don't think anyone should ever
QI from nsIRDFObserver to nsIRDFDataSource, should they?  Even so, should it be
prevented by implementing the two on different objects?)

Looking at this patch, I can't think of any way it could leak or reenter the
loop in the Release method.
jband, your patch looks fine. r=waterson
New patch does as dbaron suggests. So I guess the refcount instrumentation 
actually balances out.

waterson: still look good to you? Or did I make a mistake I'm not seeing?
Yup, looks okay. (/me is wishing that weak references had been invented when I
wrote this the first time.)
OK. I'll take this one back and try to do the legwork to get both patches in 
0.9.2.

peterv: Do these patches fix the problem for you?
Assignee: waterson → jband
Target Milestone: --- → mozilla0.9.2
Hardware: Macintosh → All
After applying the patch I'm getting another similar crash (accessing deleted 
memory) at a different location.

I'll attach a stack trace and investigate further.
What activities did you perform before the crash? Did you rebuild the rdf dll? 
(this requires rebuilding in rdf/build too).
Status: NEW → ASSIGNED
My tree has no code at xpcjsruntime.cpp line 638. Your stack implies that 
mDyingWrappedNativeProtoMap is whacked. Are you sure xpconnect was fully 
rebuilt?
r=dbradley
It's on my end, a patch botched the xpcjsruntime.cpp file, the deletion of 
mDyingWrappedNativeProtoMap got duplicated when I applied the patch to defer 
release calls on natives of wrappednatives.

It is working fine now and changes look good
Did you intend to include that code with this patch?
Oops. Sorry. That would be your patch to bug 82274. Yes, we might as well check 
it in along with this. It is reviewed and 100% safe and sane.
Blocks: 83989
if the super review in the first patch applies to the better fix then a=
asa@mozilla.org for checkin to the trunk.
(on behalf of drivers)
Both patched were checked into the trunk.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
peterv, does this fix things? If so, would you be able to mark this bug 
VERIFIED, thanks - 
Verified fixed. Thanks all for the very quick turnaround :).
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: