Closed Bug 1134385 Opened 5 years ago Closed 5 years ago

Intermittent leakcheck | default process: 7736 bytes leaked (CameraPreferences, Composer2D, CompositorParent, CondVar, MessagePump, ...) after "Assertion failure: NS_IsMainThread(), at ../../gecko/widget/VsyncDispatcher.cpp:40"

Categories

(Core :: Graphics, defect)

ARM
Gonk (Firefox OS)
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox38 --- wontfix
firefox39 --- wontfix
firefox40 --- fixed
firefox-esr31 --- unaffected
b2g-v2.2 --- affected
b2g-master --- fixed

People

(Reporter: RyanVM, Assigned: mchang)

References

Details

(4 keywords)

Attachments

(2 files, 2 obsolete files)

Not really sure what to make of this. Logs point to camera stuff, but the assert is pointing to the Vsync stuff.

08:49:10 INFO - WARNING | leakcheck | default process: multiple BloatView byte totals found
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 1 CameraPreferences (12 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 1 Composer2D (8 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 2 CompositorParent (1280 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 8 CondVar (192 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 2 MessagePump (16 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 19 Mutex (380 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 1 PCompositorParent (432 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 1 PProcLoaderParent (424 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 1 PSharedBufferManagerChild (424 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 1 PSharedBufferManagerParent (424 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 2 PerThreadData (24 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 7 ReentrantMonitor (168 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 3 RefCountedMonitor (144 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 8 RefCountedTask (64 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 5 SocketBase (100 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 1 StoreRef (8 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 2 WaitableEventKernel (40 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 4 WeakReference<MessageListener> (32 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 4 ipc::MessageChannel (1184 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsAStreamCopier (88 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsPipe (136 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsPipeInputStream (72 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsSocketTransport (368 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsSocketTransportService (176 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 136 nsStringBuffer (1088 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 29 nsTArray_base (116 bytes)
08:49:10 INFO - TEST-INFO | leakcheck | default process: leaked 2 nsThread (336 bytes)
08:49:10 WARNING - TEST-UNEXPECTED-FAIL | leakcheck | default process: 7736 bytes leaked (CameraPreferences, Composer2D, CompositorParent, CondVar, MessagePump, ...) 

08:50:53 INFO - 02-18 16:49:06.593 F/MOZ_Assert( 701): Assertion failure: NS_IsMainThread(), at ../../gecko/widget/VsyncDispatcher.cpp:40
08:50:53 ERROR - 02-18 16:49:06.593 F/libc ( 701): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1)
08:50:53 ERROR - This usually indicates the B2G process has crashed
mchang, do you know something about the assert?
Flags: needinfo?(mchang)
Yeah, it means that the CompositorVsyncDispatcher is being cleaned up on something other than the main thread, probably the Compositor thread. This probably means that ClearOnShutdown is executing before the Compositor shuts down, which is odd. It's a safe assert to delete, I've just been debugging other things. Are you hitting this often?
Flags: needinfo?(mchang)
No, I do not hit it by myself. From the comment 0, system might become abnormal state when it happens.
Since the CompositorVsyncDispatcher is only NS_INLINE_DECL_REFCOUNTING, we have no guarantee that the object will be destroyed on the main thread. It's accessed by both the GeckoTouchDispatcher and the CompositorParent. The GeckoTouchDispatcher shuts down with ClearOnShutdown which executes on the main thread wheras the CompositorParent shuts down either on the main thread or Compositor thread. Delete this assertion as we can't guarantee which thread or order shutdown occurs. I assumed CleraOnShutdown is last, but that's not really the case then.
Attachment #8575419 - Flags: review?(bugmail.mozilla)
Probably won't properly review this until tomorrow, but if the CompositorVsyncDispatcher is getting release'd from the non-main thread then it shouldn't be NS_INLINE_DECL_REFCOUNTING, it should be _THREADSAFE_REFCOUNTING. So something here seems fishy.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #7)
> Probably won't properly review this until tomorrow, but if the
> CompositorVsyncDispatcher is getting release'd from the non-main thread then
> it shouldn't be NS_INLINE_DECL_REFCOUNTING, it should be
> _THREADSAFE_REFCOUNTING. So something here seems fishy.

Err sorry, yeah it's THREADSAFE_REFCOUNTING - https://dxr.mozilla.org/mozilla-central/source/widget/VsyncDispatcher.h?from=VsyncDispatcher.h&case=true#37
(In reply to Mason Chang [:mchang] from comment #3)
> This
> probably means that ClearOnShutdown is executing before the Compositor shuts
> down, which is odd.

Which ClearOnShutdown are you referring to here?
Comment on attachment 8575419 [details] [diff] [review]
Delete assertion in CompositorVsyncDispatcher

Review of attachment 8575419 [details] [diff] [review]:
-----------------------------------------------------------------

As far as I can tell this assert should never be failing. If it is failing we should figure out why because it represents other assumptions being violated as well.
Attachment #8575419 - Flags: review?(bugmail.mozilla) → review-
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #10)
> Comment on attachment 8575419 [details] [diff] [review]
> Delete assertion in CompositorVsyncDispatcher
> 
> Review of attachment 8575419 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> As far as I can tell this assert should never be failing. If it is failing
> we should figure out why because it represents other assumptions being
> violated as well.

We have two nsRefPtrs to the CompositorVsyncDispatcher:

1) In the CompositorParent (https://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.h?from=CompositorParent.h&case=true#125)
2) In the nsBaseWidget (https://dxr.mozilla.org/mozilla-central/source/widget/nsBaseWidget.h#451) 

The nsBaseWidget always shuts down on the main thread and the refpointer is nulled out after the nsBaseWidget shuts down. The reference in the CompositorVsyncObserver is nulled out during CompositorVsyncObserver's destructor which sometimes occurs on the Compositor thread. We don't have any explicit guarantee that the nsBaseWidget's reference will be nulled one after the CompositorVsyncObservers' destructor runs. Also, we don't guarantee MAIN_THREAD destruction on the CompositorVsyncObserver, so we don't have a strong guarantee that the CompositorVsyncDispatcher will be cleaned up on the main thread. Does that make sense?
Flags: needinfo?(bugmail.mozilla)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #9)
> (In reply to Mason Chang [:mchang] from comment #3)
> > This
> > probably means that ClearOnShutdown is executing before the Compositor shuts
> > down, which is odd.
> 
> Which ClearOnShutdown are you referring to here?

My fault, I mixed it up with GeckoTouchDispatcher's reference to the CompositorVsyncObserver.
Thanks, that makes sense. Please update https://github.com/changm/SilkDocs/blob/master/silk.md and land it into gfx/docs as well as part of this bug.
Flags: needinfo?(bugmail.mozilla)
Assignee: nobody → mchang
Status: NEW → ASSIGNED
Attached patch Part 2: Silk Docs (obsolete) — Splinter Review
Easier version to read - https://github.com/changm/SilkDocs/blob/master/silk.md
Attachment #8588079 - Flags: review?(bugmail.mozilla)
Comment on attachment 8575419 [details] [diff] [review]
Delete assertion in CompositorVsyncDispatcher

Landing with silk docs per comment 13.
Attachment #8575419 - Flags: review- → review?(bugmail.mozilla)
Comment on attachment 8588079 [details] [diff] [review]
Part 2: Silk Docs

Review of attachment 8588079 [details] [diff] [review]:
-----------------------------------------------------------------

The compositor shutdown info is outdated so I think that at least needs updating before this can land. Rest are mostly nits and clarifications here and there.

::: gfx/doc/silk.md
@@ +12,5 @@
> +
> +1. Hardware Vsync event occurs on an OS specific *Hardware Vsync Thread* on a per monitor basis.
> +2. For every Firefox window on the specific monitor, notify a **CompositorVsyncDispatcher**. The **CompositorVsyncDispatcher** is specific to one window.
> +3. The **CompositorVsyncDispatcher** will notify the **Compositor** that a vsync has occured.
> +4. The **RefreshTimerVsyncDispatcher** will then notify the Chrome **RefreshTimer** that a vsync has occured.

Who notifies the RefreshTimerVsyncDispatcher?

@@ +14,5 @@
> +2. For every Firefox window on the specific monitor, notify a **CompositorVsyncDispatcher**. The **CompositorVsyncDispatcher** is specific to one window.
> +3. The **CompositorVsyncDispatcher** will notify the **Compositor** that a vsync has occured.
> +4. The **RefreshTimerVsyncDispatcher** will then notify the Chrome **RefreshTimer** that a vsync has occured.
> +5. The **RefreshTimerVsyncDispatcher** will send IPC messages to all content processes to tick their respective active **RefreshTimer**.
> +6. The **Compositor** dispatches input events on the *Compositor Thread*, then composites.

input events are only dispatched on the compositor thread on B2G

@@ +40,5 @@
> +Each **CompositorVsyncDispatcher** is notified that a vsync has occured with the vsync's timestamp.
> +It is the responsibility of the **CompositorVsyncDispatcher** to notify the **Compositor** that is awaiting vsync notifications.
> +The **Display** will then notify the associated **RefreshTimerVsyncDispatcher**, which should notify all active **RefreshDrivers** to tick.
> +
> +All **Display** objects are encapsulated in a **Vsync Source** object.

remove space in Vsync Source

@@ +41,5 @@
> +It is the responsibility of the **CompositorVsyncDispatcher** to notify the **Compositor** that is awaiting vsync notifications.
> +The **Display** will then notify the associated **RefreshTimerVsyncDispatcher**, which should notify all active **RefreshDrivers** to tick.
> +
> +All **Display** objects are encapsulated in a **Vsync Source** object.
> +The **VsyncSource** object lives in **gfxPlatform** and is instantited only on the parent process when **gfxPlatform** is created.

typo: instantiated

@@ +53,5 @@
> +#Compositor
> +When the **CompositorVsyncDispatcher** is notified of the vsync event, the **CompositorVsyncObserver** associated with the **CompositorVsyncDispatcher** begins execution.
> +Since the **CompositorVsyncDispatcher** executes on the *Hardware Vsync Thread* and the **Compositor** composites on the *CompositorThread*, the **CompositorVsyncObserver** posts a task to the *CompositorThread*.
> +The **CompositorParent** then composites.
> +Thus the **CompositorVsyncDispatcher** notifies the **CompositorVsyncObserver**, which then schedules the task on the appropriate thread.

I think this sentence is unnecessary, it seems redundant given what's explained a couple of lines above.

@@ +54,5 @@
> +When the **CompositorVsyncDispatcher** is notified of the vsync event, the **CompositorVsyncObserver** associated with the **CompositorVsyncDispatcher** begins execution.
> +Since the **CompositorVsyncDispatcher** executes on the *Hardware Vsync Thread* and the **Compositor** composites on the *CompositorThread*, the **CompositorVsyncObserver** posts a task to the *CompositorThread*.
> +The **CompositorParent** then composites.
> +Thus the **CompositorVsyncDispatcher** notifies the **CompositorVsyncObserver**, which then schedules the task on the appropriate thread.
> +The model where the **CompositorVsyncDispatcher** notifies components on the *Hardware Vsync Thread*, and the component schedules the task on the appropriarate thread is used everywhere.

typo: appropriated

@@ +92,5 @@
> +###Input Events
> +One large goal of Silk is to align touch events with vsync events.
> +On Firefox OS, touchscreens often have different touch scan rates than the display refreshes.
> +A Flame device has a touch refresh rate of 75 HZ, while a Nexus 4 has a touch refresh rate of 100 HZ, while the device's display refresh rate is 60HZ.
> +When a vsync event occurs, we resample touch events then dispatch the resampled touch event to APZ.

s/events then/events, and then/

@@ +98,5 @@
> +We use [Google Android's touch resampling](http://www.masonchang.com/blog/2014/8/25/androids-touch-resampling-algorithm) algorithm to resample touch events.
> +
> +Currently, we have a strict ordering between Composites and touch events.
> +When a touch event occurs on the *Touch Input Thread*, we store the touch event in a queue.
> +When a vsync event occurs, the **CompositorVsyncDispatcher** notifies the **Compositor** of a vsync event.

, which notifies the GeckoTouchDispatcher.

@@ +110,5 @@
> +If touch events were not dispatched, and since the **Compositor** is not listening to vsync events, the touch events would never be dispatched.
> +The **GeckoTouchDispatcher** handles this case by always forcing the **Compositor** to listen to vsync events while touch events are occurring.
> +
> +###Widget, Compositor, CompositorVsyncDispatcher, GeckoTouchDispatcher Shutdown Procedure
> +Shutdown Process

Remove "Shutdown Process", you already have a big title on the previous line.

@@ +112,5 @@
> +
> +###Widget, Compositor, CompositorVsyncDispatcher, GeckoTouchDispatcher Shutdown Procedure
> +Shutdown Process
> +
> +When the [nsBaseWidget's destructor runs](http://dxr.mozilla.org/mozilla-central/source/widget/nsBaseWidget.cpp?from=nsBaseWidget.cpp#221) - It calls nsBaseWidget::DestroyCompositor on the *Gecko Main Thread*. The main issue is that we destroy the Compositor through the nsBaseWidget, and the widget will not be kept alive by the nsRefPtr on the CompositorVsyncObserver.

Use links to specific revisions of the file. What's at line 221 now may not be at line 221 in a few days. Same goes for all the dxr links you have in this document. Also the second sentence here ("The main issue...") doesn't make sense without more context. Issue with what? Why is the widget not kept alive? Why is that important?

Try to start each sentence on a new line (as you've done in most of this document) for easier review/diffing.

@@ +116,5 @@
> +When the [nsBaseWidget's destructor runs](http://dxr.mozilla.org/mozilla-central/source/widget/nsBaseWidget.cpp?from=nsBaseWidget.cpp#221) - It calls nsBaseWidget::DestroyCompositor on the *Gecko Main Thread*. The main issue is that we destroy the Compositor through the nsBaseWidget, and the widget will not be kept alive by the nsRefPtr on the CompositorVsyncObserver.
> +
> +During nsBaseWidget::DestroyCompositor, we first destroy the CompositorChild. This sends a sync IPC call to CompositorParent::RecvStop, which calls [CompositorParent::Destroy](http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?from=CompositorParent.cpp#474). During this time, the *main thread* is blocked on the parent process. CompositorParent::Destroy runs on the *Compositor thread* and cleans up some resources, including setting the **CompositorVsyncObserver** to nullptr. CompositorParent::Destroy also explicitly keeps the CompositorParent alive and posts another task to run CompositorParent::DeferredDestroy on the Compositor loop so that all ipdl code can finish executing. The **CompositorVsyncObserver** removes itself as a reference to the **GeckoTouchDispatcher** and the **Compositor** removes the reference to the **CompositorVsyncObserver**.
> +
> +Once CompositorParent::RecvStop finishes, the *main thread* in the parent process continues destroying nsBaseWidget. nsBaseWidget posts another task to [DeferedDestroyCompositor on the main thread](http://dxr.mozilla.org/mozilla-central/source/widget/nsBaseWidget.cpp#168). At the same time, the *Compositor thread* is executing tasks until CompositorParent::DeferredDestroy runs. Now we have a two tasks as both the nsBaseWidget::DeferredDestroyCompositor releases a reference to the Compositor on the *main thread* and the CompositorParent::DeferredDestroy releases a reference to the Compositor on the *compositor thread*. Finally, the CompositorParent itself is destroyed on the *main thread* once both deferred destroy's execute.

A lot of this is outdated now, as of bug 1125848.

@@ +118,5 @@
> +During nsBaseWidget::DestroyCompositor, we first destroy the CompositorChild. This sends a sync IPC call to CompositorParent::RecvStop, which calls [CompositorParent::Destroy](http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?from=CompositorParent.cpp#474). During this time, the *main thread* is blocked on the parent process. CompositorParent::Destroy runs on the *Compositor thread* and cleans up some resources, including setting the **CompositorVsyncObserver** to nullptr. CompositorParent::Destroy also explicitly keeps the CompositorParent alive and posts another task to run CompositorParent::DeferredDestroy on the Compositor loop so that all ipdl code can finish executing. The **CompositorVsyncObserver** removes itself as a reference to the **GeckoTouchDispatcher** and the **Compositor** removes the reference to the **CompositorVsyncObserver**.
> +
> +Once CompositorParent::RecvStop finishes, the *main thread* in the parent process continues destroying nsBaseWidget. nsBaseWidget posts another task to [DeferedDestroyCompositor on the main thread](http://dxr.mozilla.org/mozilla-central/source/widget/nsBaseWidget.cpp#168). At the same time, the *Compositor thread* is executing tasks until CompositorParent::DeferredDestroy runs. Now we have a two tasks as both the nsBaseWidget::DeferredDestroyCompositor releases a reference to the Compositor on the *main thread* and the CompositorParent::DeferredDestroy releases a reference to the Compositor on the *compositor thread*. Finally, the CompositorParent itself is destroyed on the *main thread* once both deferred destroy's execute.
> +
> +With the **CompositorVsyncObserver**, any accesses to the widget after nsBaseWidget::~nsBaseWidget executes are invalid. While the sync call to CompositorParent::RecvStop executes, we set the CompositorVsyncObserver to null. Since the CompositorVsyncObserver's vsync notification executes on the *hardware vsync thread*, it will post a task to the Compositor loop and reference an invalid widget. Posting a task to the CompositorLoop is invalid as we could destroy the Compositor before the Vsync's tasks executes. Any accesses to the widget between the time the nsBaseWidget's destructor runs and the CompositorVsyncObserver's destructor runs aren't safe yet a hardware vsync event could occur between these times. Thus, we explicitly shut down vsync events in the **CompositorVsyncDispatcher** during nsBaseWidget destruction.

First sentence here seems to be missing a word, I'm not sure what it's trying to say.

@@ +123,5 @@
> +
> +The **CompositorVsyncDispatcher** may be destroyed on either the *main thread* or *Compositor Thread*, since both the nsBaseWidget and CompositorVsyncObserver race to destroy on different threads. Whichever thread finishes destroying last will hold the last reference to the **CompositorVsyncDispatcher**, which destroys the object.
> +
> +#Refresh Driver
> +The Refresh Driver is ticked from a [single active timer](http://dxr.mozilla.org/mozilla-central/source/layout/base/nsRefreshDriver.cpp?from=nsRefreshDriver.cpp#11). The assumption is that there are multiple **RefreshDrivers** connected to a single **RefreshTimer**. There are multiple **RefreshTimers**: an active and an inactive **RefreshTimer**. Each Tab has its own **RefreshDriver**, which connects to one of the global **RefreshTimers**. The **RefreshTimers** execute on the *Main Thread* and tick their connected **RefreshDrivers**. We do not want to break this model of multiple **RefreshDrivers** per a limited set of global **RefreshTimers**. Each **RefreshDriver** switches between the active and inactive **RefreshTimer**, which already occurs before Silk.

s/multiple **RefreshTimers**/two **RefreshTimers**/
s/, which already occurs before Silk//

@@ +160,5 @@
> +Each **RefreshTimer** is associated with a specific **Display** via an id and tick when it's respective **Display** vsync occurs.
> +We have **N RefreshTimers**, where N is the number of connected displays.
> +Each **RefreshTimer** still has multiple **RefreshDrivers**.
> +
> +When a tab or window change monitors, the **nsIWidget** receives a display changed notification.

s/change/changes/

@@ +162,5 @@
> +Each **RefreshTimer** still has multiple **RefreshDrivers**.
> +
> +When a tab or window change monitors, the **nsIWidget** receives a display changed notification.
> +Based on which display the window is on, the window switches to the correct **RefreshTimerVsyncDispatcher** and **CompositorVsyncDispatcher** on the parent process based on the display id.
> +Each **TabParent** should also sendsa notification to their child.

s/sendsa/send a/
Attachment #8588079 - Flags: review?(bugmail.mozilla) → review-
Attachment #8575419 - Flags: review?(bugmail.mozilla) → review+
Attached patch Part 2: Silk Docs (obsolete) — Splinter Review
Attachment #8588079 - Attachment is obsolete: true
Attachment #8594122 - Flags: review?(bugmail.mozilla)
Comment on attachment 8594122 [details] [diff] [review]
Part 2: Silk Docs

Review of attachment 8594122 [details] [diff] [review]:
-----------------------------------------------------------------

See questions below (sorry for the back-and-forth but I really want to make sure we understand this well because multithreading lifetime issues can drag on for a long time if we don't have a clean model - just ask nical!)

::: gfx/doc/silk.md
@@ +111,5 @@
> +The **GeckoTouchDispatcher** handles this case by always forcing the **Compositor** to listen to vsync events while touch events are occurring.
> +
> +###Widget, Compositor, CompositorVsyncDispatcher, GeckoTouchDispatcher Shutdown Procedure
> +When the [nsBaseWidget shuts down](http://hg.mozilla.org/mozilla-central/file/0df249a0e4d3/widget/nsBaseWidget.cpp#l182) - It calls nsBaseWidget::DestroyCompositor on the *Gecko Main Thread*.
> +During nsBaseWidget::DestroyCompositor, it first destroy's the CompositorChild.

destroys (no apostrophe)

@@ +115,5 @@
> +During nsBaseWidget::DestroyCompositor, it first destroy's the CompositorChild.
> +CompositorChild sends a sync IPC call to CompositorParent::RecvStop, which calls [CompositorParent::Destroy](http://hg.mozilla.org/mozilla-central/file/ab0490972e1e/gfx/layers/ipc/CompositorParent.cpp#l509).
> +During this time, the *main thread* is blocked on the parent process.
> +CompositorParent::Destroy runs on the *Compositor thread* and cleans up some resources, including setting the **CompositorVsyncObserver** to nullptr.
> +CompositorParent::Destroy also explicitly keeps the CompositorParent alive and posts another task to run CompositorParent::DeferredDestroy on the Compositor loop so that all ipdl code can finish executing.

s/::Destroy/::RecvStop/

@@ +120,5 @@
> +The **CompositorVsyncObserver** also unobserves from vsync and cancels any pending composite tasks.
> +Once CompositorParent::RecvStop finishes, the *main thread* in the parent process continues shutting down the nsBaseWidget.
> +
> +At the same time, the *Compositor thread* is executing tasks until CompositorParent::DeferredDestroy runs, which flushes the compositor message loop.
> +Now we have a two tasks as both the nsBaseWidget releases a reference to the Compositor on the *main thread* during destruction and the CompositorParent::DeferredDestroy releases a reference to the Compositor on the *compositor thread*.

/have a two/have two/
Also when you say "Compositor" here, I'm assuming you mean "CompositorParent"

@@ +121,5 @@
> +Once CompositorParent::RecvStop finishes, the *main thread* in the parent process continues shutting down the nsBaseWidget.
> +
> +At the same time, the *Compositor thread* is executing tasks until CompositorParent::DeferredDestroy runs, which flushes the compositor message loop.
> +Now we have a two tasks as both the nsBaseWidget releases a reference to the Compositor on the *main thread* during destruction and the CompositorParent::DeferredDestroy releases a reference to the Compositor on the *compositor thread*.
> +Finally, the CompositorParent itself is destroyed on the *main thread* once both references are gone.

I don't see how this sentence follows from the previous one. The previous one says that there are two tasks on different threads racing to release a reference to the Compositor. So how can you conclude the main thread always "loses" the race and always does the last destroy? Is it not possible that the DeferredDestroy in the parent gets stuck for a while, so the widget finishes destruction and releases all references, and then the compositor thread runs DeferredDestroy and actually releases the last reference? I see that that there is a MOZ_ASSERT(NS_IsMainThread()) in CompositorParent::~CompositorParent so either I must be missing something or the code is wrong.

@@ +125,5 @@
> +Finally, the CompositorParent itself is destroyed on the *main thread* once both references are gone.
> +
> +With the **CompositorVsyncObserver**, any accesses to the widget after nsBaseWidget::DestroyCompositor executes are invalid.
> +When the sync call to CompositorParent::RecvStop executes, we set the CompositorVsyncObserver to null.
> +Since the **CompositorVsyncObserver**'s vsync notification executes on the *hardware vsync thread*, it will post a task to the Compositor loop and may execute after CompositorParent::DeferredDestroy.

I don't see how this task could ever execute after DeferredDestroy. The task in question is posted to the compositor loop at [1]. If we walk up the stack from here, this can only happen if mCompositorVsyncObserver is non-null at [2]. Another way of saying this is that mCompositorVsyncObserver must be nulled out at a later point in time from the task at [1] getting posted to the compositor loop. We know that the nulling out happens at [3] (which happens inside the call at [4]), and it is only after that point that the DeferredDestroy is posted to the compositor loop. Therefore the DeferredDestroy must be posted to the compositor loop after the last vsync notification gets posted, and they are executed in FIFO order.

[1] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?rev=11077895df62#310
[2] http://mxr.mozilla.org/mozilla-central/source/widget/VsyncDispatcher.cpp?rev=2e051c2c7cf7#51
[3] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?rev=11077895df62#379
[4] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?rev=11077895df62#581

@@ +130,5 @@
> +Posting a task to the CompositorLoop is invalid as we could destroy the Compositor before the vsync's tasks executes.
> +Any accesses to the compositor between the time the nsBaseWidget::DestroyCompositor runs and the CompositorVsyncObserver's destructor runs aren't safe yet a hardware vsync event could occur between these times.
> +Thus, we explicitly shut down vsync events in the **CompositorVsyncDispatcher** and **CompositorVsyncObserver** during nsBaseWidget::Shutdown.
> +
> +The **CompositorVsyncDispatcher** may be destroyed on either the *main thread* or *Compositor Thread*, since both the nsBaseWidget and **CompositorVsyncObserver** race to destroy on different threads.

I don't get this either. The compositor ref to CompositorVsyncDispatcher is cleared out at [1] which is called from [2] which is running while the main thread is blocked on the sync ipc Destroy() message. So there's no race, the compositor side ref must be cleared first.

[1] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?rev=11077895df62#244
[2] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?rev=11077895df62#534
Attachment #8594122 - Flags: review?(bugmail.mozilla)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #23)
> Comment on attachment 8594122 [details] [diff] [review]
> Part 2: Silk Docs
> 
> Review of attachment 8594122 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> See questions below (sorry for the back-and-forth but I really want to make
> sure we understand this well because multithreading lifetime issues can drag
> on for a long time if we don't have a clean model - just ask nical!)

No problem, better to get this right and this isn't a very frequent crash or valid on release builds. Thanks for being thorough! 

Fixed the other minor nits.

> @@ +121,5 @@
> > +Once CompositorParent::RecvStop finishes, the *main thread* in the parent process continues shutting down the nsBaseWidget.
> > +
> > +At the same time, the *Compositor thread* is executing tasks until CompositorParent::DeferredDestroy runs, which flushes the compositor message loop.
> > +Now we have a two tasks as both the nsBaseWidget releases a reference to the Compositor on the *main thread* during destruction and the CompositorParent::DeferredDestroy releases a reference to the Compositor on the *compositor thread*.
> > +Finally, the CompositorParent itself is destroyed on the *main thread* once both references are gone.
> 
> I don't see how this sentence follows from the previous one. The previous
> one says that there are two tasks on different threads racing to release a
> reference to the Compositor. So how can you conclude the main thread always
> "loses" the race and always does the last destroy? Is it not possible that
> the DeferredDestroy in the parent gets stuck for a while, so the widget
> finishes destruction and releases all references, and then the compositor
> thread runs DeferredDestroy and actually releases the last reference? I see
> that that there is a MOZ_ASSERT(NS_IsMainThread()) in
> CompositorParent::~CompositorParent so either I must be missing something or
> the code is wrong.

It's not that the main thread always loses, versus the CompositorParent has explicit THREADSAFE_REFCOUNTING_WITH_MAIN_THREAD_DESTRUCTION.

https://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.h?from=CompositorParent.h&case=true#148

I updated the docs with a link to this reference.

> 
> @@ +125,5 @@
> > +Finally, the CompositorParent itself is destroyed on the *main thread* once both references are gone.
> > +
> > +With the **CompositorVsyncObserver**, any accesses to the widget after nsBaseWidget::DestroyCompositor executes are invalid.
> > +When the sync call to CompositorParent::RecvStop executes, we set the CompositorVsyncObserver to null.
> > +Since the **CompositorVsyncObserver**'s vsync notification executes on the *hardware vsync thread*, it will post a task to the Compositor loop and may execute after CompositorParent::DeferredDestroy.
> 
> I don't see how this task could ever execute after DeferredDestroy. The task
> in question is posted to the compositor loop at [1]. If we walk up the stack
> from here, this can only happen if mCompositorVsyncObserver is non-null at
> [2]. Another way of saying this is that mCompositorVsyncObserver must be
> nulled out at a later point in time from the task at [1] getting posted to
> the compositor loop. We know that the nulling out happens at [3] (which
> happens inside the call at [4]), and it is only after that point that the
> DeferredDestroy is posted to the compositor loop. Therefore the
> DeferredDestroy must be posted to the compositor loop after the last vsync
> notification gets posted, and they are executed in FIFO order.

I modified the text a bit. It should read more like "This is why we HAVE to make sure no more tasks get posted to the compositor thread, otherwise it could execute after DeferredDestroy".
E.g., we explicitly set things to null to prevent any tasks from being posted on the compositor loop so that the DeferredDestroy is really the last task on the Compositor loop." IF we did not set things to null during RecvStop, a vsync task could be posted AFTER DeferredDestroy, which is invalid. Therefore, we set everything to null to prevent this from happening. Does that make more sense?

> @@ +130,5 @@
> > +Posting a task to the CompositorLoop is invalid as we could destroy the Compositor before the vsync's tasks executes.
> > +Any accesses to the compositor between the time the nsBaseWidget::DestroyCompositor runs and the CompositorVsyncObserver's destructor runs aren't safe yet a hardware vsync event could occur between these times.
> > +Thus, we explicitly shut down vsync events in the **CompositorVsyncDispatcher** and **CompositorVsyncObserver** during nsBaseWidget::Shutdown.
> > +
> > +The **CompositorVsyncDispatcher** may be destroyed on either the *main thread* or *Compositor Thread*, since both the nsBaseWidget and **CompositorVsyncObserver** race to destroy on different threads.
> 
> I don't get this either. The compositor ref to CompositorVsyncDispatcher is
> cleared out at [1] which is called from [2] which is running while the main
> thread is blocked on the sync ipc Destroy() message. So there's no race, the
> compositor side ref must be cleared first.
> 
> [1]
> http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/
> CompositorParent.cpp?rev=11077895df62#244
> [2]
> http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/
> CompositorParent.cpp?rev=11077895df62#534

I thought that at [2], we set the CompositorVsyncObserver to null on the Compositor thread, which is thread safe refcounted. The reference to the CompositorVsyncDispatcher is nulled out during the CompositorVsyncObserver's destructor. Just because the reference is nulled out does not mean that the actual CompositorVsyncObserver is cleaned up at this point. Or when set a reference to null, after that line, is the object actually destroyed at that exact point in time or during GC? 

If the object is destroyed during GC, then we have a race since one reference is destroyed on the nsBaseWidget on the main thread and another on the Compositor thread. Or if the object is explicitly destroyed at that line [2], then yes we don't have a race, which would then be confusing as to why this intermittent occurs in the first place. 

Do these answer all of your questions?
Flags: needinfo?(bugmail.mozilla)
(In reply to Mason Chang [:mchang] from comment #24)
> It's not that the main thread always loses, versus the CompositorParent has
> explicit THREADSAFE_REFCOUNTING_WITH_MAIN_THREAD_DESTRUCTION.
> 
> https://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/
> CompositorParent.h?from=CompositorParent.h&case=true#148
> 
> I updated the docs with a link to this reference.

Ah, that makes sense, thanks.

> I modified the text a bit. It should read more like "This is why we HAVE to
> make sure no more tasks get posted to the compositor thread, otherwise it
> could execute after DeferredDestroy".
> E.g., we explicitly set things to null to prevent any tasks from being
> posted on the compositor loop so that the DeferredDestroy is really the last
> task on the Compositor loop." IF we did not set things to null during
> RecvStop, a vsync task could be posted AFTER DeferredDestroy, which is
> invalid. Therefore, we set everything to null to prevent this from
> happening. Does that make more sense?

Yup, makes sense. I guess the text was just unclear as to cause and effect there.

> > [1]
> > http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/
> > CompositorParent.cpp?rev=11077895df62#244
> > [2]
> > http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/
> > CompositorParent.cpp?rev=11077895df62#534
> 
> I thought that at [2], we set the CompositorVsyncObserver to null on the
> Compositor thread, which is thread safe refcounted. The reference to the
> CompositorVsyncDispatcher is nulled out during the CompositorVsyncObserver's
> destructor. Just because the reference is nulled out does not mean that the
> actual CompositorVsyncObserver is cleaned up at this point. Or when set a
> reference to null, after that line, is the object actually destroyed at that
> exact point in time or during GC? 

It should be getting destroyed inside the Release() method when the last reference is nulled out. I missed/forgot the fact that GeckoTouchDispatcher holds a ref to it also. So of those two refs whichever is the last to get nulled out will be the one that actually destroys the CompositorVsyncObserver and therefore the CompositorVsyncDispatcher. I'm not sure exactly when the ClearOnShutdown for GeckoTouchDispatcher runs relative to the widget destruction though - I guess there might be a race of some sort there, but I would be surprised if that were the case. The reason I would be surprised is because I would expect the ClearOnShutdown to deterministically happen before or after the widget shutdown, and since the widget shutdown blocks on the CompositorParent::Destroy, the order in which those references are released should be deterministic. Nonetheless I agree that this is probably not deterministic in gecko because gecko is so nondeterministic in general :p. Also there might be future changes in far-away code that affect this ordering and so to be robust we should either force it one way or properly handle both. So removing the assertion is fine, but please update the documentation with this information.

> 
> If the object is destroyed during GC, then we have a race since one
> reference is destroyed on the nsBaseWidget on the main thread and another on
> the Compositor thread. Or if the object is explicitly destroyed at that line
> [2], then yes we don't have a race, which would then be confusing as to why
> this intermittent occurs in the first place. 
> 
> Do these answer all of your questions?

Yup, thanks!
Flags: needinfo?(bugmail.mozilla)
Updated with feedback from comments 23 and 25. Easier version 

https://github.com/changm/SilkDocs/blob/master/silk.md
Attachment #8594122 - Attachment is obsolete: true
Attachment #8594953 - Flags: review?(bugmail.mozilla)
Comment on attachment 8594953 [details] [diff] [review]
Part 2: Silk Docs

Review of attachment 8594953 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8594953 - Flags: review?(bugmail.mozilla) → review+
https://hg.mozilla.org/mozilla-central/rev/cc294803b7d9
https://hg.mozilla.org/mozilla-central/rev/0e2c0966c3df
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Duplicate of this bug: 1188213
Can we safely backport this to b2g37? If so, please nominate :)
Flags: needinfo?(mchang)
Hmm, I'm not sure. I forget what exactly has been uplifted to 2.2 and what hasn't now since it's been a while. I'd rather wait and see if bug 1188213 becomes a big issue on b2g 2.2. If so, then we can nominate. It looks like it's the first one on b2g 2.2 in a long while, so I'd prefer to not uplift for now.
Flags: needinfo?(mchang)
You need to log in before you can comment on or make changes to this bug.