Closed Bug 774388 Opened 8 years ago Closed 6 years ago

Work out the shutdown protocol for the Compositor thread and IPDL actors

Categories

(Core :: Graphics: Layers, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla33

People

(Reporter: cjones, Assigned: bjacob)

References

Details

Attachments

(14 files, 7 obsolete files)

3.22 KB, patch
nical
: review+
romaxa
: feedback+
Details | Diff | Splinter Review
6.79 KB, patch
nical
: review+
mattwoodrow
: review+
Details | Diff | Splinter Review
2.96 KB, patch
nical
: review+
mattwoodrow
: review+
Details | Diff | Splinter Review
23.61 KB, patch
nical
: review+
sotaro
: review+
Details | Diff | Splinter Review
827 bytes, patch
nical
: review+
Details | Diff | Splinter Review
1.55 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
2.70 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
1.62 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
4.22 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
3.83 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
14.84 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
1.40 KB, patch
nical
: review+
Details | Diff | Splinter Review
4.42 KB, patch
nical
: review+
Details | Diff | Splinter Review
7.37 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
In bug 745148 we leave this undefined because the emergency cleanup code for IPDL will take care of cleaning things up when content processes exit.  But the problem of shutting down CrossProcessCompositorParent along with CompositorParents is considerably trickier, because there are three threads, 3+ layers trees, and 2+ widgets involved.
Blocks: 860463
I don't know well the b2g situation that Chris described above, but we need to fix this as soon as possible so that the layers refactoring does not get backed out. My first attempt will be to implement it the way we do on the non-cross-process CompositorParent. If anybody knows about the complexity of the b2g setup, a bit of context would be very helpful.
Assignee: nobody → nical.bugzilla
Assignee: nical.bugzilla → nobody
Blocks: 1007284
This blocks e10s (bug 1007284) and blocks turning on B2G debug tests with the shutdown sequence run to completion. Taking.

I have a partial patch queue that tightens various aspects of this code, as the current code is too scary to touch as-is. I haven't implemented the actual shutdown sequence yet.
Assignee: nobody → bjacob
For everything but the most trivial patches here, I'm going to get 2 reviews. The point is partly to make sure that I'm not breaking things, and partly to spread knowledge of this code. It's crazy that we got in a situation where large parts of CompositorParent are understood by 1 person or even 0 person currently working on this part of Gecko.
Attachment #8433422 - Flags: review?(nical.bugzilla)
Attachment #8433422 - Flags: review?(matt.woodrow)
Attachment #8433423 - Flags: review?(nical.bugzilla)
Attachment #8433423 - Flags: review?(matt.woodrow)
For this patch the reviews are Nical and Sotaro because this code was last changed in bug 924622.
Attachment #8433427 - Flags: review?(sotaro.ikeda.g)
Attachment #8433427 - Flags: review?(nical.bugzilla)
Attachment #8433427 - Attachment description: Patch 4: Flatten the way that gfxPlatform tracks whether we are using OMTC, and move Layers IPC shutdown code back to gfxPlatform (but still triggered by ShutdownXPCOM) → Patch 4: Flatten the way that we track whether we are using OMTC, and move Layers IPC shutdown code back to gfxPlatform (but still triggered by ShutdownXPCOM)
Attachment #8433432 - Flags: review?(nical.bugzilla)
[leave open] : this patch queue is incomplete.
Whiteboard: [leave open]
Attachment #8433427 - Flags: review?(sotaro.ikeda.g) → review+
Attachment #8433422 - Flags: review?(matt.woodrow) → review+
Attachment #8433423 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8433429 [details] [diff] [review]
Patch 5: Wrap the global raw compositor thread pointer, and global raw refcount integer, into a proper refcounted singleton class

Review of attachment 8433429 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +119,5 @@
> +  }
> +
> +  // FIXME/bug 774386: we're assuming that there's only one
> +  // CompositorParent, but that's not always true.  This assumption only
> +  // affects CrossProcessCompositorParent below.

Any idea how we are relying on this assumption? We'd be broken with multiple e10s windows on desktop if we are actually doing this.
Attachment #8433429 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8433419 [details] [diff] [review]
Patch 1: remove StartUpWithExistingThread, it's unused

Review of attachment 8433419 [details] [diff] [review]:
-----------------------------------------------------------------

IIRC, this was added for romaxa's embedding work, so I'll wait for him to confirm whether it is still useful or not.
Attachment #8433419 - Flags: feedback?(romaxa)
Comment on attachment 8433422 [details] [diff] [review]
Patch 2: remove sCompositorThreadID, it's redundant

Review of attachment 8433422 [details] [diff] [review]:
-----------------------------------------------------------------

Same here
Attachment #8433422 - Flags: feedback?(romaxa)
Attachment #8433423 - Flags: feedback?(romaxa)
Attachment #8433427 - Flags: review?(nical.bugzilla) → review+
Attachment #8433429 - Flags: review?(nical.bugzilla) → review+
Attachment #8433432 - Flags: review?(nical.bugzilla) → review+
Ah, interesting.  I was involved in the gfx/gl side of that work, but didn't know about that side. Romaxa: I believe that the new CompositorThreadHolder being introduced in Patch 5 here would give you a better, less intrusive way of doing what you want. Your StartUpWithExisting path would presumably become a new CompositorThreadHolder constructor, or something like that.
Comment on attachment 8433419 [details] [diff] [review]
Patch 1: remove StartUpWithExistingThread, it's unused

I think it is ok to remove StartUpWithExistingThread because we agreed to run Compositor in it's own thread and push Textures via GLScreenBuffer into Embedding context
Attachment #8433419 - Flags: feedback?(romaxa) → feedback+
Attachment #8433419 - Flags: review?(nical.bugzilla) → review+
Attachment #8433422 - Flags: review?(nical.bugzilla)
Attachment #8433422 - Flags: review+
Attachment #8433422 - Flags: feedback?(romaxa)
Attachment #8433423 - Flags: review?(nical.bugzilla)
Attachment #8433423 - Flags: review+
Attachment #8433423 - Flags: feedback?(romaxa)
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/7a0d8feb1575
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/6f2e001c5f39
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/0f81ceab808a

(In reply to Matt Woodrow (:mattwoodrow) from comment #10)
> Comment on attachment 8433429 [details] [diff] [review]
> Patch 5: Wrap the global raw compositor thread pointer, and global raw
> refcount integer, into a proper refcounted singleton class
> 
> Review of attachment 8433429 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: gfx/layers/ipc/CompositorParent.cpp
> @@ +119,5 @@
> > +  }
> > +
> > +  // FIXME/bug 774386: we're assuming that there's only one
> > +  // CompositorParent, but that's not always true.  This assumption only
> > +  // affects CrossProcessCompositorParent below.
> 
> Any idea how we are relying on this assumption? We'd be broken with multiple
> e10s windows on desktop if we are actually doing this.

No idea at all, was just moving that old comment around. Removed it.
(In reply to Benoit Jacob [:bjacob] from comment #17)
> Bustage fix for the b2g xpcshell crashes here:
> https://tbpl.mozilla.org/?tree=Try&rev=286cf58d68a8
> 
> https://hg.mozilla.org/integration/mozilla-inbound/rev/4cfb80678174

I couldn't tell is this was supposed to also fix the build failures (could quite possibly be crashes, but the output is unhelpful) during startup cache precompilation:
https://tbpl.mozilla.org/php/getParsedLog.php?id=41209921&tree=Mozilla-Inbound

And also seg faults during xpcom tests on B2G desktop:
https://tbpl.mozilla.org/php/getParsedLog.php?id=41211357&tree=Mozilla-Inbound

The IRC conversation only mentioned xpcshell & I can't quite remember what is involved during startup cache precompilation - so I've backed out for now to avoid the risk that the followup didn't fix it (I couldn't find you in #developers):
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/779ef393fd2e
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/13b0d48c6fa3
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/7d8281d3913a
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/fc02eb625f9e
This occurred with the followup landed too:
https://tbpl.mozilla.org/php/getParsedLog.php?id=41211577&tree=Mozilla-Inbound
{
Assertion failure: gPlatform (gfxPlatform already down!), at /builds/slave/m-in-l64-d-0000000000000000000/build/gfx/thebes/gfxPlatform.cpp:445
UNKNOWN [/builds/slave/m-in-l64-d-0000000000000000000/build/obj-firefox/dist/bin/libxul.so +0x014CDCE2]
UNKNOWN [/builds/slave/m-in-l64-d-0000000000000000000/build/obj-firefox/dist/bin/libxul.so +0x00980130]
UNKNOWN [/builds/slave/m-in-l64-d-0000000000000000000/build/obj-firefox/dist/bin/libxul.so +0x00984D5E]
UNKNOWN [/builds/slave/m-in-l64-d-0000000000000000000/build/obj-firefox/dist/bin/libxul.so +0x00984DF7]
UNKNOWN [/builds/slave/m-in-l64-d-0000000000000000000/build/obj-firefox/dist/bin/libxul.so +0x00984FD2]
UNKNOWN [/builds/slave/m-in-l64-d-0000000000000000000/build/obj-firefox/dist/bin/libxul.so +0x0092BF01]
XRE_XPCShellMain+0x000013A4 [/builds/slave/m-in-l64-d-0000000000000000000/build/obj-firefox/dist/bin/libxul.so +0x01536F12]
__libc_start_main+0x000000FD [/lib64/libc.so.6 +0x0001ECDD]
UNKNOWN [/builds/slave/m-in-l64-d-0000000000000000000/build/obj-firefox/dist/bin/xpcshell +0x000035B1]
Traceback (most recent call last):
  File "/builds/slave/m-in-l64-d-0000000000000000000/build/toolkit/mozapps/installer/packager.py", line 402, in <module>
    main()
  File "/builds/slave/m-in-l64-d-0000000000000000000/build/toolkit/mozapps/installer/packager.py", line 394, in main
    args.source, gre_path, base)
  File "/builds/slave/m-in-l64-d-0000000000000000000/build/toolkit/mozapps/installer/packager.py", line 158, in precompile_cache
    errors.fatal('Error while running startup cache precompilation')
  File "/builds/slave/m-in-l64-d-0000000000000000000/build/python/mozbuild/mozpack/errors.py", line 101, in fatal
    self._handle(self.FATAL, msg)
  File "/builds/slave/m-in-l64-d-0000000000000000000/build/python/mozbuild/mozpack/errors.py", line 96, in _handle
    raise ErrorMessage(msg)
mozpack.errors.ErrorMessage: Error: Error while running startup cache precompilation
make[3]: *** [stage-package] Error 1
make[3]: Leaving directory `/builds/slave/m-in-l64-d-0000000000000000000/build/obj-firefox/browser/installer'
make[2]: *** [make-package] Error 2
make[2]: Leaving directory `/builds/slave/m-in-l64-d-0000000000000000000/build/obj-firefox/browser/installer'
make[1]: *** [default] Error 2
make[1]: Leaving directory `/builds/slave/m-in-l64-d-0000000000000000000/build/obj-firefox/browser/installer'
make: *** [package] Error 2
State Changed: unlock buildroot
program finished with exit code 2
elapsedTime=20.303491
========= Finished 'mock_mozilla -r ...' failed (results: 2, elapsed: 20 secs) (at 2014-06-06 07:51:29.533904) =========
}
Since I was pinged on irc and nobody involved is there:
Startup cache precompilation is only running xpcshell, so whatever makes that fail means there's a problem in libxul, most of the time. Looking at the log in comment 19 makes it pretty obvious:
Assertion failure: gPlatform (gfxPlatform already down!), at /builds/slave/m-in-l64-d-0000000000000000000/build/gfx/thebes/gfxPlatform.cpp:445
Thanks Mike.

Update on the status of re-landing this:

After several iterations I've arrived at a version of these patches that's green on try, except for one thing: it now consistently hits bug 1007284, i.e. late Compositor messages hitting the already dead MessageLoop of the compositor thread, which prompted work here. That's both good and bad: it's good to be hitting this consistently instead of intermittently on tryserver, but it's bad that it prevents me from proceeding with incremental landings. As things currently look, I can't land much until I've got the shutdown sequence implemented.
Made some more attempts but still hitting the same race condition. Things are particularly clear cut on this try push:

https://tbpl.mozilla.org/?tree=Try&rev=e79a6054f8a9
Got an imperfect, incomplete implementation of enough of a shutdown sequence that we're now green on tryserver,

https://tbpl.mozilla.org/?tree=Try&rev=b6876daa639f

and more to the point, we're green not by accident but by organizing the shutdown sequence just enough that the race condition (bug 1007284) that prompted us to look at this for e10s, should be explicitly fixed.

Since this changes much of what was "patch 5" here, I've folded this into it.

The concrete race condition that we were hitting (bug 1007284) was that during shutdown a CrossProcessCompositorParent's MessageChannel was trying to post a task on the compositor thread's MessageLoop, to which it had a raw pointer, but the compositor thread was gone already. The more general problem there was that we had no synchronization at all between the lifetime of the compositor thread and the lifetime of CrossProcessCompositorParent objects.

As a first step toward fixing this, the old (obsolete) Patch 5 here did introduce a CompositorThreadHolder reference-counted object so that the problem of tracking the lifetime of the compositor thread would boil down to the problem of managing one refcounted object. That old (obsolete) patch 5 did not try to synchronize that lifetime with that of CrossProcessCompositorParent objects.

The new Patch 5 does that, by having both CrossProcessCompositorParent and CompositorParent hold strong references to the CompositorThreadHolder, so that in effect the compositor thread isn't torn down until these objects are gone.

The nontrivial aspect of this is that we don't currently want to (or we are not currently ready to) let the entire shutdown sequence determined by refcounting. The way that much of our shutdown sequence currently works (see e.g. ShutdownXPCOM), we execute a series of operations that rely on strict ordering. So in particular here, we need to have our ShutDownCompositorThread function guarantee that by the time that it returns, the compositor thread really is shut down.

So, ShutDownCompositorThread, which runs on the main thread, must wait for the compositor thread to be actually gone. The compositor thread can only be destroyed on the main thread. Since (in this naive, imperfect patch) we are managing its lifetime by refcounting, that means that we want to drop our last reference to it on the main thread. That's why ShutDownCompositorThread's code is:

void CompositorParent::ShutDownCompositorThread()
{
  MOZ_ASSERT(NS_IsMainThread(), "Should be on the main Thread!");
  MOZ_ASSERT(gCompositorThreadHolder, "The compositor thread has already been shut down!");

  gCompositorThreadHolder->WaitForRefcountToBe1();
  gCompositorThreadHolder = nullptr;
}

The gCompositorThreadHolder->WaitForRefcountToBe1(); is where we wait for [CrossProcess]CompositorParent objects to be destroyed on other threads. Once the refcount is 1, we know that that last reference is gCompositorThreadHolder, so we null it, causing the compositor thread to be immediately torn down.

I know that using the refcount and watching for it to hit the value 1, is a hack. Suggestions welcome for what would be the right way to implement such logic! (While I think that the implementation is a hack, it seems that the basic logic is the only possibility).
Attachment #8433429 - Attachment is obsolete: true
Attachment #8438146 - Flags: review?(nical.bugzilla)
Attachment #8438146 - Flags: review?(matt.woodrow)
Comment on attachment 8438146 [details] [diff] [review]
Patch 5 (take 2): Properly manage the lifetime of the compositor thread

Realistically we're at the stage of "feedback" not "review" :-)

Possible idea: instead of doing this custom refcouncting to watch when the refcount hits one, we could have a refcounted object only for the off-main-thread references from [CrossProcess]CompositorParent objects, so that all what we would have to watch is the _lifetime_ of that object, not its refcount value.
Attachment #8438146 - Flags: review?(nical.bugzilla)
Attachment #8438146 - Flags: review?(matt.woodrow)
Attachment #8438146 - Flags: feedback?(nical.bugzilla)
Attachment #8438146 - Flags: feedback?(matt.woodrow)
Comment on attachment 8438146 [details] [diff] [review]
Patch 5 (take 2): Properly manage the lifetime of the compositor thread

Review of attachment 8438146 [details] [diff] [review]:
-----------------------------------------------------------------

Benoit and I talked about this over a coffee, waiting for the next version.
Attachment #8438146 - Flags: feedback?(nical.bugzilla)
Comment on attachment 8438146 [details] [diff] [review]
Patch 5 (take 2): Properly manage the lifetime of the compositor thread

I'll wait for the updated patch too then
Attachment #8438146 - Flags: feedback?(matt.woodrow)
Finally figured why Patch 4 was causing B2G xpcshell crashes. On B2G, GetGonkDisplay is a weak symbol, and in xpcshell it's not loaded. gfxAndroidPlatform initialization calls it. So if we initialize gfxPlatform on B2G xpcshell, we crash. My patch caused that to happen, because of the ContentParent.cpp hunk of it, which was just bad, so I removed it.

Landed:

https://hg.mozilla.org/integration/mozilla-inbound/rev/bb0c777636e9

For sheriff reference, green try push:

https://tbpl.mozilla.org/?tree=Try&rev=6187c849695d
Hm, there was a similar hunk in TabParent.cpp which in all likeliness wants to be reverted too.

https://hg.mozilla.org/integration/mozilla-inbound/rev/407e66c2e91d

(Although that wasn't hit by any b2g xpcshell test yet, it seems that that hunk was bad for the same reason as above).
This is ready for review!
Attachment #8438146 - Attachment is obsolete: true
Attachment #8440740 - Flags: review?(nical.bugzilla)
Attachment #8440740 - Flags: review?(matt.woodrow)
Revert an unnecessarily change to how we schedule the compositor thread destruction, after talking with nical on IRC. Also fixed an unused variable warning that was causing this to be red on tryserver opt builds.
Attachment #8440740 - Attachment is obsolete: true
Attachment #8440740 - Flags: review?(nical.bugzilla)
Attachment #8440740 - Flags: review?(matt.woodrow)
Attachment #8440754 - Flags: review?(nical.bugzilla)
Attachment #8440754 - Flags: review?(matt.woodrow)
Comment on attachment 8440754 [details] [diff] [review]
Patch 5 (take 2): Properly manage the lifetime of the compositor thread

Review of attachment 8440754 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +171,5 @@
> +private:
> +
> +  Thread* const mCompositorThread;
> +  Monitor mCPCPReferencesMonitor;
> +  int mCPCPRefCnt;

There are a lot of mentions to CPCP in here and I am not certain I'll always remember that it stands for CrossProcessCompositorParent, please add a small comment somewhere here saying that CPCP stands for CrossProcessCompositorParent (I know it's not too hard to find out but it I'm sure it'll save us some brain energy occasionally).

::: gfx/layers/ipc/CompositorParent.h
@@ +175,2 @@
>     */
> +  static void ShutDownCompositorThread();

nit: We should find a better name for this.
Attachment #8440754 - Flags: review?(nical.bugzilla) → review+
New try with a trivial null deref fixed:
https://tbpl.mozilla.org/?tree=Try&rev=53b8e97aba74

(In reply to Nicolas Silva [:nical] from comment #35)
> ::: gfx/layers/ipc/CompositorParent.h
> @@ +175,2 @@
> >     */
> > +  static void ShutDownCompositorThread();
> 
> nit: We should find a better name for this.

Would anyone have a suggestion? This function waits for CPCPs to be gone, and nulls the global reference to the Compositor thread, but if there are remaining CPs they can still hold the compositor thread alive; the compositor thread is destroyed as soon as all CPs are gone.
Attachment #8440754 - Flags: review?(matt.woodrow) → review+
Landed patch 5 --- the main part with the CPCP shutdown sequence.

https://hg.mozilla.org/integration/mozilla-inbound/rev/66f9a0038c67
Matt / Nical: do you think that we want to keep this bug open to do anything else here?

What we have is still unnecessarily complex, so there is definitely work that could be done around here, but maybe now is a good time to close this particular bug --- unless I missed something important that falls in "the shutdown protocol for CPCP".
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(matt.woodrow)
Aye, this was intermittent enough that it didn't happen on my try push before we retriggered jobs... looking... my working theory is that this might have something to do with the fact that we're still leaking the CompositorChild object, and Compositor is a top-level protocol...
Flags: needinfo?(bjacob)
I think that you can close the bug and open new ones whenever you manage to land what's already there, but I don't feel very strongly either way. Long bugs usually make me spend a lot of time scrolling down, that's all.
Flags: needinfo?(nical.bugzilla)
Depends on: 1021149
Summary: Work out the shutdown protocol for CrossProcessCompositorParent → Work out the shutdown protocol for the Compositor thread and IPDL actors
Renamed the bug because the problems were far more generalized, and best fixed all together.
This is basically the fix for bug 1007284 were we were using-after-free the compositor thread's MessageLoop. Similar to earlier versions that you already r+'d, but significantly different and simpler now that 1) I realized that top-level protocol actors must be destroyed on the main thread (a requirement that we're going to guard by an assertion, bug 1028383),  and 2) other aspects of the shutdown sequence are now split into other patches, see next attachments.
Attachment #8440754 - Attachment is obsolete: true
Attachment #8449397 - Flags: review?(matt.woodrow)
This handles another key piece of synchronization that patch 5 doesn't try to handle anymore: while patch 5 ensures that the teardown of the compositor thread does not happen too early, it does not in turn make the main thread wait on that teardown to complete.

This patch does that.
Attachment #8449400 - Flags: review?(matt.woodrow)
This fixes a deadlock that we were running into in patch 7 --- but I kept it a separate patch for review clarity.
Attachment #8449408 - Flags: review?(matt.woodrow)
This is the same as NS_INLINE_DECL_THREADSAFE_REFCOUNTED (compare in nsISupportsImpl.h) but with one difference: if the refcount falls to 0 off-main-thread, instead of destroying the object immediately off-main-thread, it will post a task on the main thread to destroy it there.

Expected question:
 "Why are we getting into abstract nonsense work here now? Don't we have enough work to do?"

Answer:
 Because at patch 8 we still had intermittent failures on TBPL as CompositorParent's were *still* often destroyed off-main-thread. How could that be, given all the care that CompositorParent.cpp puts in ensuring that destruction happens on main thread? That's because nsBaseWidget was also holding a strong reference, and released it off the main thread. Seeing how difficult this is to debug, and how likely this is to regress, giving again intermittent failures whose rate could be initially low enough that we don't notice, it seemed that we needed a more systematic solution to this problem.

Ideally, all refcounted toplevel actors would use this special refcounting. But unfortunately, ImageBridgeParent is already participating in a complex class inheritance scheme from which it inherits another refcounting implementation already (AtomicRefcountedWithFinalize -> ISurfaceAllocator -> CompositableParentManager -> ImageBridgeParent) which means it would be a nontrivial project to make it use NS_INLINE_DECL_THREADSAFE_REFCOUNTING_WITH_MAIN_THREAD_OWNERSHIP instead. It would have to involve decoupling ImageBridgeParent from ISurfaceAllocator (have it not inherit from that), similar to what we did in other gfx/layers/ipc classes in bug 933082.
Attachment #8449417 - Flags: review?(matt.woodrow)
Now we get to ImageBridge, and, for the reason explained above, we can't just yet use the nice refcounting mechanism introduced in patch 9, so we are back to having to care about tiny details of where we null strong references.
Attachment #8449420 - Flags: review?(matt.woodrow)
And likewise, another occasion to regret not being able to just rely on the nice refcounting mechanism of patch 9 for ImageBridge.

We were posting a RunnableMethod to the main thread to do the final Release() of the ImageBridgeParent singleton there. What could go wrong? Turns out that RunnableMethod's hold a strong reference to the object, so the Release() that we were posting on the main thread actually wasn't the last release. Instead, the last release was now... the destruction of the RunnableMethod! Which, of course, happens on the IO thread. So we were still destroying our ImageBridgeParent off the main thread.
Attachment #8449421 - Flags: review?(matt.woodrow)
Here's a completely green tryserver run. Complete with a fix for a near-perma-orange bug that turned out to be unrelated (bug 1008254) but was preventing getting nice green try runs and scarily had error messages involving CompositorChild::ActorDestroy.

https://tbpl.mozilla.org/?tree=Try&rev=3c30cfba8f99
Attachment #8449397 - Flags: review?(matt.woodrow) → review+
Attachment #8449400 - Flags: review?(matt.woodrow) → review+
Attachment #8449408 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8449417 [details] [diff] [review]
Patch 9: Introduce NS_INLINE_DECL_THREADSAFE_REFCOUNTING_WITH_MAIN_THREAD_OWNERSHIP

Review of attachment 8449417 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +333,5 @@
>    // after this function returns while some ipdl code still needs to run on
>    // this thread.
>    // We must keep the compositor parent alive untill the code handling message
>    // reception is finished on this thread.
> +  this->AddRef(); // Corresponds to DeferredDeleteCompositorParentOnIOThread's Release

Comment should reference 'DeferredDestroy', right?

::: gfx/layers/ipc/ThreadSafeRefcountingWithMainThreadOwnership.h
@@ +40,5 @@
> +
> +} // namespace layers
> +} // namespace mozilla
> +
> +#define NS_INLINE_DECL_THREADSAFE_REFCOUNTING_WITH_MAIN_THREAD_OWNERSHIP(_class) \

I think I'd prefer WITH_MAIN_THREAD_DESTRUCTION, but not too fussed.

@@ +58,5 @@
> +      if (NS_IsMainThread()) {                                                \
> +        delete this;                                                        \
> +      } else {                                                                \
> +        MessageLoop *l = ::mozilla::layers::GetMainLoop();                    \
> +        AddRef();                                                             \

We've already called NS_LOG_RELEASE with a count of 0 here, that might confuse the leak tracking tools.

http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsTraceRefcnt.cpp#1084

I think we should move the NS_LOG_RELEASE into the first branch of this if, and just bump count back up instead of calling AddRef() here.
Attachment #8449420 - Flags: review?(matt.woodrow) → review+
All great points, thanks.

On top of that, I noticed that this was using a RunnableMethod! Not sure how that didn't blow up on me in the way described above, but that was a bad idea. New version uses a RunnableFunction to avoid the hidden AddRef(this).
Attachment #8449417 - Attachment is obsolete: true
Attachment #8449417 - Flags: review?(matt.woodrow)
Attachment #8449919 - Flags: review?(matt.woodrow)
Attachment #8449919 - Attachment is obsolete: true
Attachment #8449919 - Flags: review?(matt.woodrow)
Attachment #8449924 - Flags: review?(matt.woodrow)
Attachment #8449924 - Flags: review?(matt.woodrow) → review+
Attachment #8449421 - Flags: review?(matt.woodrow) → review+
Regarding patch 11, I double checked that it's really needed: here's a try push without it, with the typical android mochitest-5 crash:
https://tbpl.mozilla.org/?tree=Try&rev=c9596e6f081f
https://tbpl.mozilla.org/php/getParsedLog.php?id=42997816&tree=Try

21:00:08  WARNING -  PROCESS-CRASH | /tests/dom/datastore/tests/test_sync.html | application crashed [@ mozilla::layers::ImageBridgeParent::~ImageBridgeParent()]
21:00:08     INFO -  Crash dump filename: /tmp/tmpMFCK4F/7b33c575-7efa-7535-4b19ebff-48749fde.dmp
21:00:08     INFO -  Operating system: Android
21:00:08     INFO -                    0.0.0 Linux 3.2.0+ #2 SMP PREEMPT Thu Nov 29 08:06:57 EST 2012 armv7l pandaboard/pandaboard/pandaboard:4.0.4/IMM76I/5:eng/test-keys
21:00:08     INFO -  CPU: arm
21:00:08     INFO -       2 CPUs
21:00:08     INFO -  Crash reason:  SIGSEGV
21:00:08     INFO -  Crash address: 0x0
21:00:08     INFO -  Thread 35 (crashed)
21:00:08     INFO -   0  libxul.so!mozilla::layers::ImageBridgeParent::~ImageBridgeParent() [ImageBridgeParent.cpp:c9596e6f081f : 67 + 0x10]
21:00:08     INFO -       r4 = 0x6e051200    r5 = 0x00000000    r6 = 0x69365ef0    r7 = 0x6abffe04
21:00:08     INFO -       r8 = 0x63706062    r9 = 0x6a89f7f0   r10 = 0x00100000    fp = 0x00000001
21:00:08     INFO -       sp = 0x6abffcc0    lr = 0x6231842f    pc = 0x62324582
21:00:08     INFO -      Found by: given as instruction pointer in context
21:00:08     INFO -   1  libxul.so!mozilla::layers::ImageBridgeParent::~ImageBridgeParent() [ImageBridgeParent.cpp:c9596e6f081f : 73 + 0x3]
21:00:08     INFO -       r4 = 0x6e051200    r5 = 0x6e051398    r6 = 0x69365ef0    r7 = 0x6abffe04
21:00:08     INFO -       r8 = 0x63706062    r9 = 0x6a89f7f0   r10 = 0x00100000    fp = 0x00000001
21:00:08     INFO -       sp = 0x6abffce8    pc = 0x62324651
21:00:08     INFO -      Found by: call frame info
21:00:08     INFO -   2  libxul.so!mozilla::AtomicRefCountedWithFinalize<mozilla::layers::ISurfaceAllocator>::Release() [AtomicRefCountedWithFinalize.h:c9596e6f081f : 46 + 0x9]
21:00:08     INFO -       r4 = 0x6e05139c    r5 = 0x6e051398    r6 = 0x69365ef0    r7 = 0x6abffe04
21:00:08     INFO -       r8 = 0x63706062    r9 = 0x6a89f7f0   r10 = 0x00100000    fp = 0x00000001
21:00:08     INFO -       sp = 0x6abffcf0    pc = 0x622ae221
21:00:08     INFO -      Found by: call frame info
21:00:08     INFO -   3  libxul.so!RunnableMethod<mozilla::layers::ImageBridgeParent, void (mozilla::layers::ImageBridgeParent::*)(), Tuple0>::~RunnableMethod() [task.h:c9596e6f081f : 263 + 0x7]
21:00:08     INFO -       r4 = 0x6648e5a0    r5 = 0x6abffdf8    r6 = 0x69365ef0    r7 = 0x6abffe04
21:00:08     INFO -       r8 = 0x63706062    r9 = 0x6a89f7f0   r10 = 0x00100000    fp = 0x00000001
21:00:08     INFO -       sp = 0x6abffd00    pc = 0x6231f791
21:00:08     INFO -      Found by: call frame info
21:00:08     INFO -   4  libxul.so!RunnableMethod<mozilla::layers::ImageBridgeParent, void (mozilla::layers::ImageBridgeParent::*)(), Tuple0>::~RunnableMethod() [task.h:c9596e6f081f : 303 + 0x3]
21:00:08     INFO -       r4 = 0x6648e5a0    r5 = 0x6abffdf8    r6 = 0x69365ef0    r7 = 0x6abffe04
21:00:08     INFO -       r8 = 0x63706062    r9 = 0x6a89f7f0   r10 = 0x00100000    fp = 0x00000001
21:00:08     INFO -       sp = 0x6abffd08    pc = 0x6231f7ad
Flags: needinfo?(matt.woodrow)
Last try push before landing - had last minute changes on Bug 1033358 - https://tbpl.mozilla.org/?tree=Try&rev=766153861079
Unfortunately, I had to backout this entire push for a significant spike in shutdown crashes across all platforms.
https://hg.mozilla.org/integration/mozilla-inbound/rev/1fd5a864e81d

On Windows, we saw the emergence of bug 1034340. Additionally, we began hitting CxxStackFrame crashes with high frequency that were never hit before:
https://tbpl.mozilla.org/php/getParsedLog.php?id=43080399&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=43083954&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=43083716&tree=Mozilla-Inbound

On OSX, we saw new shutdown crashes in ImageBridgeParent:
https://tbpl.mozilla.org/php/getParsedLog.php?id=43081534&tree=Mozilla-Inbound

On Linux, we were hitting Bug 1007284 on a nearly once per push pace between linux32/linux64. I believe the below LSAN failure might also be related:
https://tbpl.mozilla.org/php/getParsedLog.php?id=43082203&tree=Mozilla-Inbound

On Android, CxxStackFrame crashes persisted:
https://tbpl.mozilla.org/php/getParsedLog.php?id=43080921&tree=Mozilla-Inbound

On B2G debug, I'm not sure if these are actually related or not, but mochitest-9 was near perma-fail with the following:
https://tbpl.mozilla.org/php/getParsedLog.php?id=43071168&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=43076887&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=43084064&tree=Mozilla-Inbound
(The tree's still closed for these until I get a clear picture of whether this backout actually improved things or not).

So unfortunately, all of the above basically left inbound in an unmergeable state. I'm sorry for backing these patches out, but under the circumstances, I feel it was my only choice.
Sorry for the trouble and many thanks for taking the time to write this detailed summary. Looking into it now; won't reland until all of that is solid green... now at least I know what to retrigger many times on tryserver.
The B2G debug M9 failures are still happening post-backout.
Yes, and it was also nearly perma-orange already before the landing. Looking into the other failures.

It looks like we won't be able to get away with less work than to:
 1) make ImageBridgeParent use NS_INLINE_DECL_THREADSAFE_REFCOUNTING_WITH_MAIN_THREAD_DESTRUCTION
 2) As a prerequisite for 1), and that is what is a lot of work, we need to make ImageBridgeParent not inherit from ISurfaceAllocator anymore, as that is dragging in another custom refcounting scheme that is not doing what we need here. Scumbag inheritance... NOT having inheritance might be the single best feature of Rust over C++ for large-scale project maintenability...
 3) Properly handle the Stop message in PImageBridge. (That is what those CxxStackFrame assertions are complaining about)
Argh, no, making ImageBridgeParent not inherit from ISurfaceAllocator (via CompositableTransactionParent) anymore looks like it would be a really big change :-( will have to shoot for an ugly hack instead :-(
13 seems like an apt number of patches for this bug.

https://tbpl.mozilla.org/?tree=Try&rev=e2386267e82b
Comment on attachment 8450928 [details] [diff] [review]
Patch 12: make ImageBridgeParent 's refcounting implementation automatically defer destruction to the main thread

Review of attachment 8450928 [details] [diff] [review]:
-----------------------------------------------------------------

I am not found of adding main-thread specific stuff in AtomicRefCountedWidthFinalize, which is also used a lot outside of the main thread (for instance textures with ImageBridge).
If you generalize this boolean into a pointer to a thread (say, mOwningThread) and make it so that if mOwnigThread != null && mOwningThread != CurrentThread, forward the destruction to mOwningThread, this will be actually useful for TextureClient which uses a dirty hack to have part of the destruction be done in the ImageBridgeChild thread specifically (see bug 1002451).
Attachment #8450928 - Flags: feedback-
We can't use nsThreads here because this may have to run late during shutdown, after nsThreadManager is already down, so functions like NS_GetCurrentThread() are no longer available.

We can use IPC MessageLoop's though. Would that be OK? It seems strange to tie lifetime to messageloops instead of threads, but that idiom is already spread all over the place :-/ messageloops are how ipc/glue/ references threads.
We could technically use chromium IPC Thread's but it seems that current code all over the place is referencing messageloops not threads.
(In reply to Benoit Jacob [:bjacob] from comment #67)
> We can use IPC MessageLoop's though. Would that be OK? It seems strange to
> tie lifetime to messageloops instead of threads, but that idiom is already
> spread all over the place :-/ messageloops are how ipc/glue/ references
> threads.

Using chromium MessageLoops here sounds good to me (I mean, a bit hacky, sure, but not more than special casing the main thread (I really hate having even just the notion of a "main thread" to be honest), and it would help simplify some already hacky code).
Attachment #8450929 - Flags: review?(nical.bugzilla) → review+
Attachment #8450928 - Attachment is obsolete: true
Attachment #8450928 - Flags: review?(nical.bugzilla)
Attachment #8450928 - Flags: review?(matt.woodrow)
Attachment #8450974 - Flags: review?(nical.bugzilla)
Attachment #8450974 - Flags: review?(nical.bugzilla) → review+
Attachment #8450929 - Flags: review?(matt.woodrow)
If you want to see the most epic TBPL push of all times, it's over here:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=d2e7bd70dd95

I've retriggered lots of test runs many times to make sure that if there are still issues then we find about them. It's 100% green so far, cross fingers...

Oh and I know I'm hogging resources, but our US colleagues are away barbecuing today.
backed out again,

https://hg.mozilla.org/integration/mozilla-inbound/rev/1370a39b329a

because we realized that this STILL doesn't prevent the compositor thread's message loop from dying too early, see bug 1007284 comment 52.
See bug 1007284 comment 55: the problem is that in patch 5, we only block the destruction of the compositor thread on the death of CompositorParent and CrossProcessCompositorParent, but we also need to block it on the death of ImageBridgeParent, because the MessageChannel of ImageBridgeParent also uses the compositor thread's MessageLoop.
see above comment. The other thing that this patch does is to release the ImageBridgeParent singleton at the beginning of gfx ipc shutdown, otherwise we deadlock as we now wait on it being gone.
Attachment #8451259 - Flags: review?(nical.bugzilla)
Attachment #8451259 - Flags: review?(matt.woodrow)
https://tbpl.mozilla.org/?tree=Try&rev=caee2cd3ff6c is solid green. I'm going to make a full try push now. But basically, its looking like this was the last issue and we could re-land with this.
https://tbpl.mozilla.org/?tree=Try&rev=81a08416e5a5 is solid green after a lot of retriggers. Can land as soon as I get a r+.
Attachment #8451259 - Flags: review?(matt.woodrow) → review+
Attachment #8451259 - Flags: review?(nical.bugzilla)
Blocks: 1021149
No longer depends on: 1021149
Depends on: 1035833
Duplicate of this bug: 1034523
Blocks: 924622
You need to log in before you can comment on or make changes to this bug.