Closed Bug 368869 Opened 18 years ago Closed 17 years ago

Make cycle collection suspect all native wrapper roots

Categories

(Core :: XPCOM, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9alpha8

People

(Reporter: graydon, Assigned: graydon)

References

Details

Attachments

(3 files, 1 obsolete file)

Eliminate a form of cycle-collection failure induced by a race (not related to threads) between the collector and dropping edges in script code, as described in the bug 368530 comment 21.
Attachment #253502 - Flags: review?(brendan)
Comment on attachment 253502 [details] [diff] [review]
simple implementation of strategy to suspect all native wrappers

Looks good, only request is a comment containing the description of the race from bug 368530 comment 18 that is being addressed here.

Any runtime effects noticeable?

/be
Attachment #253502 - Flags: review?(brendan) → review+
Did this order-dependency fix go in? Any runtime hit?

/be
Haven't measured. Still trying to knit all peterv's patches together.
Comment on attachment 253502 [details] [diff] [review]
simple implementation of strategy to suspect all native wrappers

sr=jst fwiw.
Attachment #253502 - Flags: superreview+
Comment on attachment 253502 [details] [diff] [review]
simple implementation of strategy to suspect all native wrappers


>-    void Suspect(nsISupports *n);
>+    void Suspect(nsISupports *n, bool current=false);

PRBool maybe?
Ah good catch. Alas, after I committed it!

I'll fix presently.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
The net effect here on leak stats seems to be that the trace-malloc leaks (on nye/seamonkey) increased from 4MB to 7MB.
Huh. Odd. This patch should have made leaking *decrease* if anything. The only thing I can imagine going on is that the aggressive scanning this patch causes might be triggering a pathological form of bug 374096, by making all the wrappers purple early, and then forgetting about them before any cycles they're involved in close.

Given that we're planning on landing 374096 ASAP, can we defer addressing this until after it lands? My hunch is it'll close whatever sore spot this patch might have aggravated. If it doesn't, naturally, this patch has to back out; but that won't be hard in any case since it's really only a few isolated lines.

(I'll reopen this bug for the time being, however)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
nye is a debug build, I'm seeing the same on my debug build, with this patch I leak more than without it. I haven't tried an optimized build yet, so not sure if that is a clue or not :-).
Yeah, I guess this does more harm than good. I've left most of the patch in
place but commented out the call in xpcwrappednativescope.cpp (and adding a
comment concerning why).
Status: REOPENED → RESOLVED
Closed: 17 years ago17 years ago
Resolution: --- → INVALID
So did we figure out why this patch caused a leak increase instead of a decrease?  Without this patch we'll still leak in the situation described in bug 368530 comment 6 last paragraph, right?  That situation won't be hard to come by if we remove nsDocument::Destroy, which is why I'm concerned about it.
Reopening pending answer to comment 11...
Status: RESOLVED → REOPENED
Flags: blocking1.9?
Resolution: INVALID → ---
Flags: blocking1.9? → blocking1.9+
As far as "why": my only guess, again, is that this patch was magnifying the problem in bug 374096 (failure to de-purple nodes after forgetting then in an unsuccessful scan). Since that patch has now landed, IMO we ought to try this one again to see if it causes the same regression.

It's trivial to turn on: I left the code in place with an explanatory comment. Uncomment lines 280 and 281 of xpcwrappednativescope.cpp and it'll start working again. I can do it if nobody minds seeing nye bloat jump again.
Target Milestone: --- → mozilla1.9alpha6
Let's do it, yes.  That way we'll at least know what things look like now.
Looks like it did not cause any further regressions on nye/seamonkey (or elsewhere that I can see). Guess the hunch was right. Or something. Closing for now.
Status: REOPENED → RESOLVED
Closed: 17 years ago17 years ago
Resolution: --- → FIXED
Damn. Crashing nightlies. Backing out again and reopening.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Looks like it was landed again, 2007-06-05 16:50 PDT.
Ah, I guess that explains the random crashes I get with 2007-06-06 builds and later, often after closing a tab and so. Although I haven't been able to reproduce with a clean profile or a debug build.
For instance, talkback ID: TB32923412M
Backed out again. See bug 383763 for an incriminating stack. Still no idea how it's failing. Obviously some native wrapper doesn't behave the way I think it does.
This is a correction of the latter (currently backed-out) part of the suspect-all-native-wrappers patch. I was able to reproduce some of the crashes people were seeing if I did a lot of very heavy concurrent pageloads (50+ tabs loading simultaneously in multiple windows). With this patch, I cannot make the browser crash.

The patch makes two obvious (in hindsight) correctness changes. First, it only suspects wrapped natives that are marked as main-thread-only, and only when tracing from the main thread. Second, most importantly, it *forgets* wrapped natives from the cycle collection purple buffer when they are destroyed. Since wrapped natives do not implement the cycle collecting addref/release helpers, this is a necessary piece I forgot in the first patch.

(Before you ask, forgetting is both idempotent and ignored for non-suspected inputs)
Attachment #268283 - Flags: superreview?(jst)
Attachment #268283 - Flags: review?(brendan)
Comment on attachment 268283 [details] [diff] [review]
correction to the wrappednative / wrappednativescope side

Nits only: local xpconnect style is if(...) and { on its own line underhanging the i in if.

/be
Attachment #268283 - Flags: review?(brendan) → review+
Comment on attachment 268283 [details] [diff] [review]
correction to the wrappednative / wrappednativescope side

I'm not convinced the NS_IsMainThread() call is needed in the tracer code as GC only runs on the main thread in mozilla anyways. Either way, sr=jst
Attachment #268283 - Flags: superreview?(jst) → superreview+
(In reply to comment #22)
> (From update of attachment 268283 [details] [diff] [review])
> I'm not convinced the NS_IsMainThread() call is needed in the tracer code as GC
> only runs on the main thread in mozilla anyways.

Oh, good point. That code could just assert NS_IsMainThread() and be simpler.

/be
Comment on attachment 268283 [details] [diff] [review]
correction to the wrappednative / wrappednativescope side

>diff -r cf6d91fe2651 js/src/xpconnect/src/xpcwrappednative.cpp
>--- a/js/src/xpconnect/src/xpcwrappednative.cpp	Wed Jun 13 04:01:58 2007 -0700
>+++ b/js/src/xpconnect/src/xpcwrappednative.cpp	Wed Jun 13 13:25:39 2007 -0700
>@@ -632,6 +632,13 @@ XPCWrappedNative::~XPCWrappedNative()
>     DEBUG_TrackDeleteWrapper(this);
> 
>     XPCWrappedNativeProto* proto = GetProto();
>+
>+    if (proto && 
>+        proto->ClassIsMainThreadOnly() && 
>+        NS_IsMainThread()) 
>+    {
>+        nsCycleCollector_forget(this);

Will this work? nsCycleCollector_forget removes from mPurpleBuf, but nsCycleCollector_suspectCurrent adds to mBuf (which is a nsDeque, not sure we can easily remove from that).
Committed again and backed out again. Peterv is right: forget can't really be working here. I was seeing phantom working-ness. Will dig deeper.
Why do you need forget at all?  You're only adding them to the buffer at the beginning of a collection?  (Shouldn't the Root/Unroot around the Unlink be sufficient?)
Dbaron: yes, I'd imagine this is true. On the other hand, it crashes, so something decidedly wrong is occurring. Chalk this up to "me randomly trying things that look like they might help". Not very convincingly.
Removal was broken in your patch, I'd say we should reland just the xpcwrappednativescope.cpp chunk. That seemed to improve things for you and seems like the right thing to do.

(My problem with exceeded refs turned out to be unrelated to this patch).
(In reply to comment #28)
> Removal was broken in your patch, I'd say we should reland just the
> xpcwrappednativescope.cpp chunk. That seemed to improve things for you and
> seems like the right thing to do.

Sounds good to me.
Ahem -- I think we know who should have done the last round of reviews... ;-)

/be
Depends on: 383552
I'm not convinced this is actually correct. I recall seeing a fault with the first half of the patch alone (threadsafety). Since the second half (forgetting) is clearly incorrect, this means the threadsafety part wasn't enough to make it right.

I'm trying to provoke an illustrative failure with my little graph-writing assistant (bug 384942) turned on, so I can see what's happening.
Well, after several hours of running it through highly parallel workloads I can't get it to crash. I'm not sure whether I was mis-remembering or what. 

I'll be traveling tomorrow and in ES4 meetings all week, so can't really play chase-the-crash-reports this week. If someone wants to land the xpcwrappednativescope.cpp chunk again and see if it makes the nightlies mysteriously begin crashing or leaking again, be my guest.
I'm not going to be able to land this before a6, sorry. Haven't discovered what (if anything) is going wrong with it.
Target Milestone: mozilla1.9alpha6 → mozilla1.9beta1
Just relanded the xpcwrappednativescope.cpp changes on the trunk.
This causes extremely frequent crashes (bug 383552) (again) and I think it should be backed out (again).
I backed it out again.

I think the problems with the patch are the following:

(1) the suspect calls shouldn't be inside the wrapper->HasExternalReference() check.  We want to suspect anything that could potentially be a cycle through JS, which means we need to suspect either all edges in to JS or all edges out of JS.  The goal here seems to be suspecting all edges out (which makes sense, given that they're more concentrated), so we shouldn't check for this being an edge in to JS (which wrapped natives can be used for as well, in a somewhat obscure way, just to avoid root/unroot traffic).  Fixing this makes the crashes much more frequent.

(2) The suspectCurrent calls happen before a potential restart of GC.  JS GC can potentially restart if the collection recurs into something that calls JS GC, so just because something is marked the first time through the GC in BeginCycleCollection doesn't mean it's marked the last time.  This also doesn't seem sound in the presence of multiple language runtimes doing what JS does in BeginCycleCollection.  (Then again, neither does the reference count accumulation.)

I think the simple fix here would be to use something a lot more like the standard suspect/forget mechanism for this code, so that we can actually do the forgets when they're needed, which they sometimes are.

That doesn't solve the problem of the reference count inference being unsound in the presence of multiple language runtimes doing a collection at the start of the cycle collection.  But we can worry about that later.
Attached patch possible alternative (obsolete) — Splinter Review
This is a simpler version of what I described -- instead, just wait until the end of however many iterations of GC we do, then suspect all the native wrappers still alive.  I don't see the crashes that I saw with the previous patch, although I'm also still seeing leaks with my mScanDelay patch.
This is just the previous patch with a few comments updated.  See the explanation in comment 36 and comment 37.
Attachment #269349 - Attachment is obsolete: true
Attachment #269465 - Flags: superreview?(jst)
Attachment #269465 - Flags: review?(graydon)
Comment on attachment 269465 [details] [diff] [review]
possible alternative

On the surface it looks fine. I have no idea if the larger affects of it will be positive or crash/leak inducing.
Attachment #269465 - Flags: review?(graydon) → review+
For what it's worth, this also significantly increases cycle collection times -- enough that I needed to disable it in my build.  With my crazy 70-windows-and-hundreds-of-tabs-open main profile, without the patch a JS_GC takes about 2 seconds and the cycle collection (which includes the JS_GC) takes about 3.5 seconds.  This patch increases the typical cycle collection times (including the JS_GC) in that situation to about 6.5 seconds.

I wonder if we could change this so it only marks wrappers in scopes where we've executed JS.  Probably we don't have the right information to do that...
Comment on attachment 269465 [details] [diff] [review]
possible alternative

sr=jst for this whether it's something that we can use or not...
Attachment #269465 - Flags: superreview?(jst) → superreview+
For what it's worth, leak-gauge logging of my browsing sessions shows that this patch makes a pretty significant difference in how much I leak.  I was using it for a few days, then not for a few days, and then again for a few days, and the difference seemed pretty clear.  With the patch I'm only leaking a small number of documents (and they're almost all bugzilla pages).

I'm actually surprised it makes that much of a difference, since I'd have thought the last link being broken on the JS side would be unusual.


In any case, I checked the patch into the trunk.   Let's see if it sticks...
Status: REOPENED → RESOLVED
Closed: 17 years ago17 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: