Closed Bug 1119981 Opened 6 years ago Closed 6 years ago

test_deferred_start.html failing on b2g ics emulator opt with silk compositor

Categories

(Core Graveyard :: Widget: Gonk, defect)

26 Branch
x86
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla38

People

(Reporter: mchang, Assigned: mchang)

References

Details

Attachments

(1 file, 3 obsolete files)

From https://treeherder.mozilla.org/#/jobs?repo=try&revision=d9d7fd0b2a60

B2g ICS emulator opt Mochitest suite 1 failing with:

INFO TEST-UNEXPECTED-FAIL | dom/animation/test/mozilla/test_deferred_start.html | 
Starting an animation with a delay starts from the correct point - Starting an animation with a delay starts from the correct point: assert_equals: Got a valid transform matrix on the compositor (got: "") expected 6 but got 0
At least locally, the test also fails with just the software vsync enabled and the vsync compositor disabled.
I can reproduce on OSX by force enabling OMTA and enabling hardware vsync + vsync compositor if anyone else needs to debug.
I think I've found what's going on. The problem is that the Layers.mAnimation data structure isn't actually thread safe, but is accessed on both the main thread and compositor thread. Animations are initially added on the main thread at [1]. During composition, the same animation array is accessed to start animations [2]. From my reading of the code, the AnimationArray is an InfallableArray[3], without any implicit locking.

What the test case assumes is that on the main thread, we append an animation to the layer and on the next composite, the animation starts and the matrix is initialized. So if we try to read the compositor transform matrix between when the animation is appended to the layer and the next composite, since the animation hasn't been initialized, there is no animation matrix and thus the test case fails.

I've also seen that we get CompositorParent::RecvFlushRendering which forces a composite. But this can occur BEFORE the animation or while the Compositor composites, in which case the animation data will still not be initialized. 

Seems to me like we have a couple of problems here.
1) The animation data is not thread safe.
2) There is a timing issue to ensure that the animation has been added to the dom and (b) the compositor has composited at least once to initialize the animation.

This would explain why just enabling vsync on the emulator could cause a thread execution reordering, causing the error. However a few things are still off. Disabling vsync seems to reliably pass the test. But as I added printfs with vsync enabled, the failure rate would fall, which makes me think it's still a thread ordering. I am still inclined to think this is a threading issue where the Compositor composites before the animation is added on the main thread. Once the main thread adds the animation, the player.ready executes on the main thread before the next composite, so there is no transform matrix and the test case fails.

Brian, does any of this make sense? 

Thanks

[1] https://dxr.mozilla.org/mozilla-central/source/gfx/layers/Layers.cpp?from=Layers.cpp&case=true#225
[2] https://dxr.mozilla.org/mozilla-central/source/gfx/layers/composite/AsyncCompositionManager.cpp?from=SampleAnimations&case=true#436
[3] https://dxr.mozilla.org/mozilla-central/source/gfx/layers/Layers.h#724
Flags: needinfo?(bbirtles)
I think I'm completely misunderstanding the question here but I'm surprised that we'd be accessing the same mAnimations from both threads/processes.

My understanding of the layers setup is that first we add the animations to the client layer on the main thread, e.g. http://mxr.mozilla.org/mozilla-central/source/gfx/layers/Layers.cpp#219

These changes are recorded in a layer transaction and then we use our IPC infrastructure to send an update to the host layer tree on the compositor thread.

(I don't really know the details of how this works but http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/ShadowLayers.h has a good overview.)

In LayerTransactionParent::RecvUpdate we update the copy of mAnimations on the host layer by calling SetAnimations (http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/LayerTransactionParent.cpp#350).

As a result, we should have two separate copies of mAnimations. Note that for e10s (and Firefox OS) the compositor runs in a separate process, not just a separate thread.

So I don't expect (1) is a problem.

As for (2), it's possible we don't run the animation on the compositor before the main thread tests its value. I'd suggest adding some logging to SampleAnimations on the compositor to check if that's what's happenning.

When debugging these sorts of OMTA bugs I often put printfs in both AddAnimation (and AddAnimationForNextTransaction) and SetAnimations as well to check the layer transaction is happening at the expected time. We should probably set up some generic animation logging for this sort of thing.

Note that the compositor thread notifies the main thread when it has finished compositing via the DidComposite message:
  http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorChild.cpp#138

I guess I assumed the main thread was waiting on that but I don't think it is so maybe those tests aren't valid and they should really wait for the compositor thread do finish compositing before querying the animated value.

Matt, is my understanding of this setup anywhere near correct?
Flags: needinfo?(bbirtles) → needinfo?(matt.woodrow)
Yes, that all sounds correct.

DidComposite results in the MozAfterPaint event being sent to content, if we're waiting on that then we should know that compositing has completed.
Flags: needinfo?(matt.woodrow)
Thanks Matt. Sounds like test_deferred_start.html needs to be tweaked to wait for paints to finish, probably using paint_listener.js. I can have a look at doing that on Tuesday.
Attached patch Animation Logging Patch (obsolete) — Splinter Review
(In reply to Brian Birtles (:birtles) from comment #4)
> As a result, we should have two separate copies of mAnimations. Note that
> for e10s (and Firefox OS) the compositor runs in a separate process, not
> just a separate thread.
> 
> So I don't expect (1) is a problem.

Yes, sorry you're right. There are two separate copies of mAnimations.

> As for (2), it's possible we don't run the animation on the compositor
> before the main thread tests its value. I'd suggest adding some logging to
> SampleAnimations on the compositor to check if that's what's happenning.
> 
> When debugging these sorts of OMTA bugs I often put printfs in both
> AddAnimation (and AddAnimationForNextTransaction) and SetAnimations as well
> to check the layer transaction is happening at the expected time. We should
> probably set up some generic animation logging for this sort of thing.

So I did this, and here's the order which seems to always work:

1) Content Process, Main Thread - Adds an animation[1]
2) Parent Process, Compositor Thread - Receives IPC Layer Transaction and updates the animation data. 
3) Parent Process, Compositor Thread - Composites and starts the animation
4) Content Process, Main Thread, the JS in the test calls GetOMTAStyle[2] 
5) Parent Process, Compositor Thread -  The call from (4) ends up in LayerTransactionParent::RecvGetAnimationTransform and returns the proper matrix from (3).

[1] https://dxr.mozilla.org/mozilla-central/source/gfx/layers/Layers.cpp?from=AddAnimation&case=true#219
[2] https://dxr.mozilla.org/mozilla-central/source/dom/base/nsDOMWindowUtils.cpp?from=GetOMTAStyle&case=true#3770

The problem arises if (5) occurs before (3). Just adding printfs everywhere while still disabling software vsync and the vsync aligned compositor is enough to trigger this bug, so it's not being effected by the software vsync stuff. What I see when the test fails is also kind of strange though.

1) Content Process, Main Thread - Add Animation
2) Parent Process, Compositor Thread - LayerTransaction, set animations
3) Parent Process, Compositor Thread - Start composite / animation
4) Content Process, Main Thread - Clear Animation
5) Content Process, Main Thread - Add Animation
6) Content Process, Main Thread - JS Calls to GetOMTAStyle
7) Parent Process, Compositor Thread - LayerTransaction, set Animations
8) Parent Process, Compositor Thread - LayerTrasnactionParent::RecvGetAnimationTransform. Since the animation didn't start yet, this returns the empty string, failing the test.
9) Parent Process, Compositor Thread - Animation Starts

It seems like there is a timing issue where if the call to GetOMTAStyle executes on the compositor thread between the LayerTransaction and the next composite, the test will fail. The curious part that I still don't understand is why we would clear the animation then re-add it. Since all the Animation code seems to occur in nsDisplayListBuilder, this must mean we painted twice. Any idea why we remove then re-add the animation?

BTW, after we start an animation, I don't see CompositorChild::RecvDidComposite get called either. However, in the broken case I can sometimes see this order:

Add animation -> Recv Layer Transaction -> Start Composite -> CompositorChild::RecvDidComposite -> Clear Animations.
Flags: needinfo?(bbirtles)
(In reply to Mason Chang [:mchang] from comment #9)
> > As for (2), it's possible we don't run the animation on the compositor
> > before the main thread tests its value. I'd suggest adding some logging to
> > SampleAnimations on the compositor to check if that's what's happenning.
> > 
> > When debugging these sorts of OMTA bugs I often put printfs in both
> > AddAnimation (and AddAnimationForNextTransaction) and SetAnimations as well
> > to check the layer transaction is happening at the expected time. We should
> > probably set up some generic animation logging for this sort of thing.
> 
> So I did this, and here's the order which seems to always work:
> 
> 1) Content Process, Main Thread - Adds an animation[1]
> 2) Parent Process, Compositor Thread - Receives IPC Layer Transaction and
> updates the animation data. 
> 3) Parent Process, Compositor Thread - Composites and starts the animation
> 4) Content Process, Main Thread, the JS in the test calls GetOMTAStyle[2] 
> 5) Parent Process, Compositor Thread -  The call from (4) ends up in
> LayerTransactionParent::RecvGetAnimationTransform and returns the proper
> matrix from (3).
> 
> [1]
> https://dxr.mozilla.org/mozilla-central/source/gfx/layers/Layers.
> cpp?from=AddAnimation&case=true#219
> [2]
> https://dxr.mozilla.org/mozilla-central/source/dom/base/nsDOMWindowUtils.
> cpp?from=GetOMTAStyle&case=true#3770
> 
> The problem arises if (5) occurs before (3). Just adding printfs everywhere
> while still disabling software vsync and the vsync aligned compositor is
> enough to trigger this bug, so it's not being effected by the software vsync
> stuff. What I see when the test fails is also kind of strange though.
> 
> 1) Content Process, Main Thread - Add Animation
> 2) Parent Process, Compositor Thread - LayerTransaction, set animations
> 3) Parent Process, Compositor Thread - Start composite / animation
> 4) Content Process, Main Thread - Clear Animation
> 5) Content Process, Main Thread - Add Animation
> 6) Content Process, Main Thread - JS Calls to GetOMTAStyle
> 7) Parent Process, Compositor Thread - LayerTransaction, set Animations
> 8) Parent Process, Compositor Thread -
> LayerTrasnactionParent::RecvGetAnimationTransform. Since the animation
> didn't start yet, this returns the empty string, failing the test.
> 9) Parent Process, Compositor Thread - Animation Starts
> 
> It seems like there is a timing issue where if the call to GetOMTAStyle
> executes on the compositor thread between the LayerTransaction and the next
> composite, the test will fail. The curious part that I still don't
> understand is why we would clear the animation then re-add it. Since all the
> Animation code seems to occur in nsDisplayListBuilder, this must mean we
> painted twice. Any idea why we remove then re-add the animation?

I'm not sure why we're removing the animation. Are you sure it's the same animation? Not, for example, an animation from a previous test?

Where did you stick the printf that displays ClearAnimation? We regularly create animations and delete them when recalculating style so if that printf is inside, e.g. AnimationPlayer for example, then it's definitely expected.

> BTW, after we start an animation, I don't see
> CompositorChild::RecvDidComposite get called either. However, in the broken
> case I can sometimes see this order:
> 
> Add animation -> Recv Layer Transaction -> Start Composite ->
> CompositorChild::RecvDidComposite -> Clear Animations.

I think we just need to make sure the composite happens before querying the OMTA value.

I'm running a patch through try now that does that:

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=f0ac03455fdd

(But I forgot to check the paint_listener.js include path was correct before pushing so that push may be bogus and I may need to run it again. And my local build is still running so I can't check there either.)
Flags: needinfo?(bbirtles) → needinfo?(mchang)
(In reply to Brian Birtles (:birtles) from comment #10)
> I'm running a patch through try now that does that:
> 
> https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=f0ac03455fdd
> 
> (But I forgot to check the paint_listener.js include path was correct before
> pushing so that push may be bogus and I may need to run it again. And my
> local build is still running so I can't check there either.)

That patch works locally so the try run should be ok. I expect it might fail on Mulet though, but that's a different bug.
(In reply to Brian Birtles (:birtles) from comment #10
> 
> I'm not sure why we're removing the animation. Are you sure it's the same
> animation? Not, for example, an animation from a previous test?
> 
> Where did you stick the printf that displays ClearAnimation? We regularly
> create animations and delete them when recalculating style so if that printf
> is inside, e.g. AnimationPlayer for example, then it's definitely expected.

I stuck it in here - https://dxr.mozilla.org/mozilla-central/source/gfx/layers/Layers.cpp?from=ClearAnimations&case=true#234

Hmm, I deleted all the other test cases in test_deferred_start.html other than the "// Test that compositor animations with delays get synced correctly" and am running the mochitest only on test_deferred_start.html. I will double check again tomorrow that it is the same animation.

> 
> > BTW, after we start an animation, I don't see
> > CompositorChild::RecvDidComposite get called either. However, in the broken
> > case I can sometimes see this order:
> > 
> > Add animation -> Recv Layer Transaction -> Start Composite ->
> > CompositorChild::RecvDidComposite -> Clear Animations.
> 
> I think we just need to make sure the composite happens before querying the
> OMTA value.
> 
> I'm running a patch through try now that does that:
> 
> https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=f0ac03455fdd
> 
> (But I forgot to check the paint_listener.js include path was correct before
> pushing so that push may be bogus and I may need to run it again. And my
> local build is still running so I can't check there either.)

Awesome thanks for testing! I will try locally tomorrow.
Flags: needinfo?(mchang)
(In reply to Mason Chang [:mchang] from comment #12)
> (In reply to Brian Birtles (:birtles) from comment #10
> > Where did you stick the printf that displays ClearAnimation? We regularly
> > create animations and delete them when recalculating style so if that printf
> > is inside, e.g. AnimationPlayer for example, then it's definitely expected.
> 
> I stuck it in here -
> https://dxr.mozilla.org/mozilla-central/source/gfx/layers/Layers.
> cpp?from=ClearAnimations&case=true#234

Ok, that makes sense then. When building display lists we clear all the animations before re-adding them. So, I'm pretty sure that call to ClearAnimations is to be expected.

> > > BTW, after we start an animation, I don't see
> > > CompositorChild::RecvDidComposite get called either. However, in the broken
> > > case I can sometimes see this order:
> > > 
> > > Add animation -> Recv Layer Transaction -> Start Composite ->
> > > CompositorChild::RecvDidComposite -> Clear Animations.
> > 
> > I think we just need to make sure the composite happens before querying the
> > OMTA value.
> > 
> > I'm running a patch through try now that does that:
> > 
> > https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=f0ac03455fdd

Unfortunately that patch didn't work at all. :)

I'm going to try and look into this as well (I need to fix bug 1113425 which I suspect is actually a dupe of this bug) but it's going to take me a while to set up an environment to reproduce this since I've all but run out of space on my Linux VM.

If I don't get to it by the time you start tomorrow, then I think the next step is to take the changes from here:

  https://hg.mozilla.org/try/rev/f0ac03455fdd

merge that with your debugging stuff, and add some extra dump() statements in paint_listener.js and test_deferred_start.html to see why the paint listening is not causing us to wait for that composite to happen.
BTW, is there a reason player.ready fires before we composite? Would it make more sense to make player.ready fire only after the animation has started? I glanced at the spec and docs and couldn't see if this is by design. Thanks!

Ok, if it will take some time to fix this, can we disable this test for now to reduce intermittents?
See Also: → 1113425
(In reply to Mason Chang [:mchang] from comment #14)
> BTW, is there a reason player.ready fires before we composite? Would it make
> more sense to make player.ready fire only after the animation has started? I
> glanced at the spec and docs and couldn't see if this is by design. Thanks!

player.ready just resolves on the next refresh driver tick after we hand the animation off to the compositor. Note that we run the animation in parallel, so to speak, on the main thread and compositor thread. That is, if script calls window.getComputedStyle() to inspect the animated property, we'll evaluate the animation on the main thread. So from that point of view, the animation is already running.

Outside of our mochitests that have access to DOMWindowUtils, script can't normally inspect the values being produced on the compositor so they can't observe if the compositor has yet to evaluate the animation or not.

Most of the time, by the time any promise-resolution callbacks run the compositor will be running the animation already and it's only sometimes that the compositor might lag behind. I think trying to guarantee that the compositor was up to date might be not be worth the cost for something that's typically not observable. e.g. we'd probably need to hold up the next refresh driver tick until we got DidComposite, or (more likely), not resolve the ready promise until two ticks after the animation was triggered for cases when the compositor was lagging behind (which would be an annoying inconsistency--but having said that, we might encounter something like this in bug 1109390).

> Ok, if it will take some time to fix this, can we disable this test for now
> to reduce intermittents?

Where are you seeing intermittent failures? This should be disabled already on all platforms where we know it can fail.
Forcing an extra style flush when waiting for paints seems to work (or at least, so far, Mulet has passed which has never happened before):

  https://treeherder.mozilla.org/#/jobs?repo=try&revision=08bf683d26e1

But we shouldn't need to do that. I'll try to get an emulator build running tomorrow to repro this.
Hmm, maybe we can do something even easier. I see that only test cases call nsDomWindowUtils::GetOMTAStyle [1], which calls LayerTransactionParentRecv::RecvGetAnimationTransform. And from comment 15, script can't normally inspect these values. Can we manually force a composite in LayerTransactionParent::RecvGetAnimationTransform here - https://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/LayerTransactionParent.cpp?from=RecvGetAnimationTransform#669. We'd already have access to the CompositorParent through the LayerTransactionParent::mShadowLayersManager. This would block the main thread during GetOMTAStyle, but since it's for tests only, I think it's ok. Thus, it would be easy to guarantee that a composite ocurred before we return from GetOMTAStyle. After 5 runs locally with the b2g ics emulator, I haven't been able to reproduce the crash! What do you think?

[1] https://dxr.mozilla.org/mozilla-central/source/dom/base/nsDOMWindowUtils.cpp?from=GetOMTAStyle&case=true#3770
Flags: needinfo?(bbirtles)
(In reply to Brian Birtles (:birtles) from comment #15)
> 
> > Ok, if it will take some time to fix this, can we disable this test for now
> > to reduce intermittents?
> 
> Where are you seeing intermittent failures? This should be disabled already
> on all platforms where we know it can fail.

I only see the failures when enabling silk on by default, from comment 0. Silk lets us use hardware vsync to driver the compositor. You can enable it by flipping these two prefs to true:

gfx.vsync.hw-vsync.enabled
gfx.vsync.compositor
Force a composite before returning from GetOMTAStyle to ensure that animations have started.

I'm not sure this is the exact right route to go, but seems right. If you'd prefer to do another route within the actual test itself, then by all means please do :). Thanks for helping me investigate this!
Attachment #8548376 - Flags: feedback?(bbirtles)
(In reply to Mason Chang [:mchang] from comment #17)
> Hmm, maybe we can do something even easier. I see that only test cases call
> nsDomWindowUtils::GetOMTAStyle [1], which calls
> LayerTransactionParentRecv::RecvGetAnimationTransform. And from comment 15,
> script can't normally inspect these values. Can we manually force a
> composite in LayerTransactionParent::RecvGetAnimationTransform here -
> https://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/
> LayerTransactionParent.cpp?from=RecvGetAnimationTransform#669. We'd already
> have access to the CompositorParent through the
> LayerTransactionParent::mShadowLayersManager. This would block the main
> thread during GetOMTAStyle, but since it's for tests only, I think it's ok.
> Thus, it would be easy to guarantee that a composite ocurred before we
> return from GetOMTAStyle. After 5 runs locally with the b2g ics emulator, I
> haven't been able to reproduce the crash! What do you think?
> 
> [1]
> https://dxr.mozilla.org/mozilla-central/source/dom/base/nsDOMWindowUtils.
> cpp?from=GetOMTAStyle&case=true#3770

I think this makes sense, but I'd definitely want to get Matt Woodrow or someone more familiar with this setup to review this.

First I want to try to understand why using the paint listener didn't work since that may reveal an underlying bug or provide a simpler way of fixing this.

(I finally got the emulator build running only to notice this test passes on emulator--it's the desktop b2g opt build where it fails! Trying again...)
Flags: needinfo?(bbirtles)
You can make the tests fail pretty reliably on the emulator if you enable the two prefs in comment 18. At least I can on b2g ics emulator running on ubuntu 64 bit.
(In reply to Mason Chang [:mchang] from comment #21)
> You can make the tests fail pretty reliably on the emulator if you enable
> the two prefs in comment 18. At least I can on b2g ics emulator running on
> ubuntu 64 bit.

Thanks for that. I gave it a try but it still passes for me (and I can never seem to get mochitests to run on B2G desktop). Since you are able to reproduce this locally, do you mind applying the changes from https://hg.mozilla.org/try/rev/f0ac03455fdd and sticking a few dumps() in paint_listener.js to see if we end up waiting for paints at all?

Otherwise, feel free to request review from Matt Woodrow.
Didn't have time to try this today, will tomorrow. You can also enable the two prefs on OSX along with forcing OMTA which will expose the bug, but it's still racy.
I got to dig a bit. I enabled debug=true in paint_listener.js - https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/tests/SimpleTest/paint_listener.js?from=paint_listener.js&case=true#4. The failure still occurs whether or not we waitForPaints, even if I use the waitForAllPaintsFlushed option, although the failure rate seems to go down a bit. From the logs it seems like this is generally occuring:

1) Content - Main Thread - Add Animation
2) Parent - Compositor - SetAnimations
3) Parent - Compositor - Start Animation
4) Content - Main Thread - Clear Animation
5) Content - Main Thread - Add Animation
6) Content - Main Thread - Wait for Paint
7) Content - Main Thread - RecvComposite - Moz After Paint finishes. Probably from (3)?
8) Content - Main Thread - nsDomWindowUtils::GetOMTAStyle
9) Parent - Compositor - Composites
10) Parent - Compositor - Layer Transaction, SetAnimations
11) Parent - Compositor - LayerTransaction::RecvGetOMTAStyle, returns null due to the SetAnimation

It looks like wait for paint is working, but it's assuming paints flushed because of the paint from the first add animation. The second add animation clears the animation matrix, so we still receive a null string for the animation matrix because the OMTA style is processed before the second added animation starts. Does that make sense?

I'm still curious as to why we're adding/clearing the animation, which also seems to be flaky. Since you said from comment 13, we clear all animations before re-adding them. So it looks like due to some weird timing issues, it means we painted twice before the content process received the CompositorChild::RecvDidComposite, which is causing the issue. Should we always be clearing / readding animations? Does this make sense?

I still want to land silk in the next week or two, if this is still blocking, I'd like to either (a) disable the test or (b) force composites on calls to LayerTransactionParent::RecvGetAnimationTransforms.
Flags: needinfo?(bbirtles)
(In reply to Mason Chang [:mchang] from comment #24)
> I got to dig a bit. I enabled debug=true in paint_listener.js -
> https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/tests/
> SimpleTest/paint_listener.js?from=paint_listener.js&case=true#4. The failure
> still occurs whether or not we waitForPaints, even if I use the
> waitForAllPaintsFlushed option, although the failure rate seems to go down a
> bit. From the logs it seems like this is generally occuring:
> 
> 1) Content - Main Thread - Add Animation
> 2) Parent - Compositor - SetAnimations
> 3) Parent - Compositor - Start Animation
> 4) Content - Main Thread - Clear Animation
> 5) Content - Main Thread - Add Animation
> 6) Content - Main Thread - Wait for Paint
> 7) Content - Main Thread - RecvComposite - Moz After Paint finishes.
> Probably from (3)?
> 8) Content - Main Thread - nsDomWindowUtils::GetOMTAStyle
> 9) Parent - Compositor - Composites
> 10) Parent - Compositor - Layer Transaction, SetAnimations
> 11) Parent - Compositor - LayerTransaction::RecvGetOMTAStyle, returns null
> due to the SetAnimation
> 
> It looks like wait for paint is working, but it's assuming paints flushed
> because of the paint from the first add animation. The second add animation
> clears the animation matrix, so we still receive a null string for the
> animation matrix because the OMTA style is processed before the second added
> animation starts. Does that make sense?

Not really. The "Clear Animation" should be invisible to the compositor thread.

Basically, during display list building on the main thread, we clear the set of animations and re-add them. All that should happen before the layer transaction so we shouldn't be clearing the animations on the compositor thread.

How does the second add animation clear the animation matrix when, according to the sequence above, that second add animation hasn't been sent to the compositor yet?

> So it looks like due to some weird timing issues, it
> means we painted twice before the content process received the
> CompositorChild::RecvDidComposite, which is causing the issue. Should we
> always be clearing / readding animations? Does this make sense?

Yes, the clearing shouldn't have any effect since we should re-add the same animations before the layer transaction is executed.

It's possible there's a bug and we're not re-adding those animations. Unfortunately, those kind of bugs tend to take a long time to track down. :(

I wish I could reproduce this so I could help looking into it.

> I still want to land silk in the next week or two, if this is still
> blocking, I'd like to either (a) disable the test or (b) force composites on
> calls to LayerTransactionParent::RecvGetAnimationTransforms.

I don't mind either too much. For (b), it's probably worth putting the patch up for review to see what Matt Woodrow thinks about all this. My main concern is that we'd be doing special handling for transforms that we're not doing to opacity so if we were to switch the test to opacity, it would start failing.
Flags: needinfo?(bbirtles)
(In reply to Brian Birtles (:birtles) from comment #25)
> (In reply to Mason Chang [:mchang] from comment #24)
> > I got to dig a bit. I enabled debug=true in paint_listener.js -
> > https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/tests/
> > SimpleTest/paint_listener.js?from=paint_listener.js&case=true#4. The failure
> > still occurs whether or not we waitForPaints, even if I use the
> > waitForAllPaintsFlushed option, although the failure rate seems to go down a
> > bit. From the logs it seems like this is generally occuring:
> > 
> > 1) Content - Main Thread - Add Animation
> > 2) Parent - Compositor - SetAnimations
> > 3) Parent - Compositor - Start Animation
> > 4) Content - Main Thread - Clear Animation
> > 5) Content - Main Thread - Add Animation
> > 6) Content - Main Thread - Wait for Paint
> > 7) Content - Main Thread - RecvComposite - Moz After Paint finishes.
> > Probably from (3)?
> > 8) Content - Main Thread - nsDomWindowUtils::GetOMTAStyle
> > 9) Parent - Compositor - Composites
> > 10) Parent - Compositor - Layer Transaction, SetAnimations
> > 11) Parent - Compositor - LayerTransaction::RecvGetOMTAStyle, returns null
> > due to the SetAnimation
> > 
> > It looks like wait for paint is working, but it's assuming paints flushed
> > because of the paint from the first add animation. The second add animation
> > clears the animation matrix, so we still receive a null string for the
> > animation matrix because the OMTA style is processed before the second added
> > animation starts. Does that make sense?
> 
> Not really. The "Clear Animation" should be invisible to the compositor
> thread.
> 
> Basically, during display list building on the main thread, we clear the set
> of animations and re-add them. All that should happen before the layer
> transaction so we shouldn't be clearing the animations on the compositor
> thread.
> 
> How does the second add animation clear the animation matrix when, according
> to the sequence above, that second add animation hasn't been sent to the
> compositor yet?

Hmm, I think I'm confused about something. We clear and re-add the animation in steps 4-5. At step 9, we composite, which initializes the animation. Then at 10, this is the layer transaction from 4-5, which resets the animation right? Then since 11 executes, it looks like a new animation? In this case, the second add animation hit the compositor, then the RecvGetOMTA, then we composite.
(In reply to Mason Chang [:mchang] from comment #26)
> (In reply to Brian Birtles (:birtles) from comment #25)
> > (In reply to Mason Chang [:mchang] from comment #24)
> > > I got to dig a bit. I enabled debug=true in paint_listener.js -
> > > https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/tests/
> > > SimpleTest/paint_listener.js?from=paint_listener.js&case=true#4. The failure
> > > still occurs whether or not we waitForPaints, even if I use the
> > > waitForAllPaintsFlushed option, although the failure rate seems to go down a
> > > bit. From the logs it seems like this is generally occuring:
> > > 
> > > 1) Content - Main Thread - Add Animation
> > > 2) Parent - Compositor - SetAnimations
> > > 3) Parent - Compositor - Start Animation
> > > 4) Content - Main Thread - Clear Animation
> > > 5) Content - Main Thread - Add Animation
> > > 6) Content - Main Thread - Wait for Paint
> > > 7) Content - Main Thread - RecvComposite - Moz After Paint finishes.
> > > Probably from (3)?
> > > 8) Content - Main Thread - nsDomWindowUtils::GetOMTAStyle
> > > 9) Parent - Compositor - Composites
> > > 10) Parent - Compositor - Layer Transaction, SetAnimations
> > > 11) Parent - Compositor - LayerTransaction::RecvGetOMTAStyle, returns null
> > > due to the SetAnimation
> > > 
> > > It looks like wait for paint is working, but it's assuming paints flushed
> > > because of the paint from the first add animation. The second add animation
> > > clears the animation matrix, so we still receive a null string for the
> > > animation matrix because the OMTA style is processed before the second added
> > > animation starts. Does that make sense?
> > 
> > Not really. The "Clear Animation" should be invisible to the compositor
> > thread.
> > 
> > Basically, during display list building on the main thread, we clear the set
> > of animations and re-add them. All that should happen before the layer
> > transaction so we shouldn't be clearing the animations on the compositor
> > thread.
> > 
> > How does the second add animation clear the animation matrix when, according
> > to the sequence above, that second add animation hasn't been sent to the
> > compositor yet?
> 
> Hmm, I think I'm confused about something. We clear and re-add the animation
> in steps 4-5.

At least in theory, that shouldn't make any difference.

> At step 9, we composite, which initializes the animation. Then
> at 10, this is the layer transaction from 4-5, which resets the animation
> right?

What resetting is happening? In SetAnimations we overwrite the animation data structures but, as far as I know, no state should be changed. The whole operation should effectively be a no-op. But maybe I'm overlooking something since I'm more familiar with the main thread side of things?

> Then since 11 executes, it looks like a new animation?

I don't think there's a concept of a new animation anywhere but, again, maybe I'm missing something?

I don't know of any state that changes depending on if we had an animation on the last sample or not (except for the wantNextFrame handling in https://dxr.mozilla.org/mozilla-central/source/gfx/layers/composite/AsyncCompositionManager.cpp#1017 but I don't see how that would change anything).

> In this case,
> the second add animation hit the compositor, then the RecvGetOMTA, then we
> composite.

Perhaps what I don't understand is, what does "(3) Start animation" refer to? Are we actually executing AsyncCompositionManager::TransformShadowTree?
After discussing on IRC, the problem appears to be that we're hitting the problem described here:

http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?rev=5680121a4909#1047

Basically, there is a race condition that can produce the following sequence:

1. Layer update temporarily resets the animation transform
2. Query animated transform
3. Perform composite and set animation transform

We've worked around this for tests by forcing a composite when the layers are updated. However, this only applies in "test mode". That test mode flag gets triggered when we use DOMWindowUtils.advanceTimeAndRefresh.

However, for test_deferred_start.html explicitly avoids using DOMWindowUtils.advanceTimeAndRefresh so mIsTestMode will not be set and we won't perform the extra composite--hence the race.

Clearly that test needs to be reworked. I'm not sure yet how best to do that, but for now I think it's reasonable to disable it.
Disable test_deferred_start.html on b2g platforms.
Attachment #8547727 - Attachment is obsolete: true
Attachment #8547728 - Attachment is obsolete: true
Attachment #8548376 - Attachment is obsolete: true
Attachment #8548376 - Flags: feedback?(bbirtles)
Attachment #8550034 - Flags: review?(bbirtles)
Attachment #8550034 - Flags: review?(bbirtles) → review+
https://hg.mozilla.org/mozilla-central/rev/30e7e56618d3
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
See Also: → 1150351
See Also: → 1168759
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.