Closed Bug 1379203 Opened 7 years ago Closed 7 years ago

stylo: Google Inbox messages overlap each other after marking a message as done

Categories

(Core :: CSS Parsing and Computation, defect, P1)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox57 --- fixed

People

(Reporter: mossop, Assigned: hiro)

References

(Blocks 1 open bug, )

Details

Attachments

(6 files)

Attached image Capture.JPG
I'm using Google Inbox with stylo enabled on Windows 10. With a long list of messages in my inbox if I open one near the bottom then mark it as done it closes and then a bunch of the messages at the bottom start to overlap each other.
Has STR: --- → yes
OS: Unspecified → Windows 10
Version: unspecified → Trunk
I can repro, and I see that flushing styles with the following bookmark:

javascript:(function() { var s = document.createElement('style'); s.innerHTML = '* { color: red }'; document.body.appendChild(s); document.body.offsetTop; s.parentNode.removeChild(s); }())

Makes the issue disappear. However, the same doesn't happen with just a recascade of the styles (resizing the window, for example).

So we end up with some element having invalid styles... Still no clue why off-hand. Maybe it's due to something going wrong with transitions/animations, or maybe we really miss a restyle and fail to invalidate a given element...
See Also: → 1379425
Hmm... So just disabling OMTA doesn't fix this, so there's something else going on...

Unfortunately there's a whole bunch of stuff going on, as the log shows, so it'll take me a bit to diagnose, I suppose.

In particular, they add and remove <script>'s, and <style> elements during that interaction, which is silly.

On a side note, that explains why FF, not only stylo, is slower than Chrome on Inbox, since we restyle the whole document when removing a <style> element... There's a pretty easy fix on stylo for that though, since we have the machinery I added in bug 1357583, and we just have to reuse it for removals, which is easy.

Anyway, will dig more...
However, disabling transitions, I can no longer repro the issue.

Here's the patch I used (on top of current autoland).

Hiro, any chance you can take a look into this? Do you think it could be due to the issues re. snapshots + animation-only restyle outlined at bug 1379425 and related?
Flags: needinfo?(hikezoe)
See Also: → 1379433
Yes, very likely. I also believe this happens on events (mouse movement, click, etc.), I did confirm it with dropping UpdateOnlyAnimationStyles() call in FlushThrottledStyles() using attachment 8884588 [details] in bug 1379425.  (Thanks for the great test cases)
Note that, as I commented in other bug, UpdateOnlyAnimationStyles() is slightly misleading, 
 
Honestly I don't still understand how snapshots interact with animation-only restyles, but yes, I think we don't need to clear snapshots for animation-only restyles. One thing I am concerned (don't quite understand) is SMIL. Class attribute SMIL seems to need snapshot [1].  Anyway I am going to care these snapshot and animation issues.  

[1] https://hg.mozilla.org/mozilla-central/file/a418121d4625/layout/base/ServoRestyleManager.cpp#l1012
Assignee: nobody → hikezoe
Status: NEW → ASSIGNED
Flags: needinfo?(hikezoe)
A solution I can think of is that we don't return the default computed values from Servo_ResolveStyle() when it's called from UpdateOnlyAnimationStyles (I will rename it to UpdateThrottledAnimationStyles in bug 1379534 though) even if the target element has snapshot.  This is what we talked about bug 1371450 in the last all hands, IIRC.
I start thinking we might need animation-only *post* traversal. that's because when we kick animation-only restyle for event handling, we don't need to update any style contexts for normal restyle at all, we just need style contexts for animating elements.
(In reply to Hiroyuki Ikezoe (:hiro) from comment #6)
> I start thinking we might need animation-only *post* traversal. that's
> because when we kick animation-only restyle for event handling, we don't
> need to update any style contexts for normal restyle at all, we just need
> style contexts for animating elements.

That makes sense, off-hand. Following both dirty bits on the post traversal is one of the things that doesn't make sense to me :)
Priority: -- → P1
See Also: → 1380289
As far as I can tell this has been fixed by bug 1371450. Please re-open if not fixed yet.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Attached image Capture.PNG
I'm still seeing strange effects with stylo enabled on Inbox
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I can't reproduce at all.  Can someone provide a reduced test case?
Another suspicious bug (bug 1381420) has been fixed and shipped in the latest nightly (buildid: 20170726100241). Still persists?
I still see this today, no good steps to reproduce though it seems to happen more often when marking a message as done that is inside a bundle in the inbox. When it happens everything gets super spaced out.
Thanks for the quick feedback. Considering the fact that this issue triggered by mouse events (i.e. opening/closing a message), it's related to throttling animation flush.  There must be something I am missing..
Priority: P1 → --
The final suspicious bug (bug 1387956) has been fixed. :mossop, does the problem still persist on buildid: 20170813183258?  If it still persists, it's not related to animation, I guess the problem is lurking somewhere in reframing.
Flags: needinfo?(dtownsend)
(In reply to Hiroyuki Ikezoe (:hiro) from comment #15)
> The final suspicious bug (bug 1387956) has been fixed.

bug 1388031 I meant.
Tentatively fixed. This dcoesn't reproduce all that often so I can't be sure but we may as well close this and reopen if I see it again.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Flags: needinfo?(dtownsend)
Resolution: --- → FIXED
Thank you :mossop!
Still seeing this unfortunately
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This seems to crop up most commonly when marking a message as done in the middle of a bundle of messages in the inbox.
I've tried reproducing this but haven't succeeded yet. I did, however, just a moment ago see a list in Google Keep fail to collapse after ticking (removing) an item. I suppose both properties could be using some common library/approach to list management and trip upon the same underlying bug.

In terms of animations, Inbox is generating a bunch of CSS animations (transform) and transitions (transform, opacity, visibility). I think the arrangement is:

* Transition transform and opacity on the element that is disappearing (which, at the end of the animation is
  made 'display:none')
* A separate transform CSS animation (the before-var-xx one) is applied to each element below the removed item
  to move it up.
* There are some other (presumably unrelated) animations on the toast element etc.

Adding some logging to nsAnimationManager at the point after we create the new animations, then marking an item as Done in the middle of the list I get  alot such as the following:

Got new animation remove-after-var-11 for element DIV (id=)
  0% { transform : translate(0px, 0px) }
  20% { transform : translate(0px, 0px) }
  100% { transform : translate(0px, -45px) }

Got new animation remove-before-var-11 for element DIV (id=)
  0% { transform : translate(0px, 0px) }
  20% { transform : translate(0px, 0px) }
  100% { transform : translate(0px, 0px) }

Got new animation remove-before-var-11 for element DIV (id=)
  0% { transform : translate(0px, 0px) }
  20% { transform : translate(0px, 0px) }
  100% { transform : translate(0px, 0px) }

(And so on, 3 more times with the translate(0px, 0px) version of the 100% keyframe.)

Got new animation remove-item for element DIV (id=)
  0% { opacity : 0.9999 }
  20% { opacity : 0.0001 }
  100% { opacity : 0.0001 }

Got new animation remove-after-var-11 for element DIV (id=)
  0% { transform : translate(0px, 0px) }
  20% { transform : translate(0px, 0px) }
  100% { transform : translate(0px, -45px) }

(And so on, 2 more times with the translate(0px, -45px) version of the 100% keyframe, 3 more times with the translate(0px, 0px) version, then another 2 times with the translate(0px, -45px) version.)

The interesting thing is that the 100% keyframe changes. That suggests that it was not specified and so we are filling it in using the computed style.  (Unfortunately I added the logging a little too late in the pipeline since adding it earlier would have been very noisy.)

It would also suggest they're using FLIP animation -- i.e. translate the element first, then animate from its previous position to its current position. (Then, presumably, after the animations have finished, covering up the deleted item, drop it from the DOM, drop the translations, and relayout.) The DevTools inspector doesn't update quickly enough to catch it however, so I'm going to try to slow it down and confirm.

If that is what's happening then it would point to us failing to restyle before calculating the underlying value for the 100% keyframe.

Regarding events, I couldn't find any instances of Inbox waiting on transitionend events (they're notoriously unreliable so I'm not surprised) but there was at least one instance of waiting on an animationend event but I didn't dig into that yet.
Oh, I totally missed that there is remove-before and remove-after! That means my theory is completely wrong :/
It looks like we're not filling in keyframes with computed values and it's not FLIP animation either. It's just calculating where the final position is (I suspect it keeps track of item heights on the JS side) then doing an animation: 0% translate 0px -> 20% translate 0px -> 100% translate to calculated position. It uses a forwards fill mode to hold the element at that target location until it gets around to updating the DOM.

Perhaps the step where it calculates the final position depends on style/layout and we're producing stale information there?
Another thought is that this is similar to bug 1316764 (Graphic glitch persists after animation before repaint) which was caused by faulty handling of the animation generation. I know Boris looked into Stylo's animation generation handling but I wonder if it's possible we still have a bug there.
FWIW, I am now suspecting precision issue. Given that this happens with a long list message and animations in question is 'translate' that uses 'Au' (app unit).
Anyway, Brian, thanks for looking this! It would be very helpful to see this problem through different eyes.
Ok I'll leave it with you then. You're right, I hadn't read comment 0 so I didn't realize it was particular to long lists. I just saw attachment 8884325 [details] and I thought that was a bit too big a difference for a precision issue.

Anyway, I've yet to reproduce it once so I hope you can!
I can now reproduce this locally, but no clue yet.
Brian told me that if this is precision issue it would not be fixed by flushing style as commented in comment 1. That's right, I did actually try a dirty hack to convert translate (which has app unit values) function to matrix (which has float value), nothing changed.
I am mostly confident this issue is not related to animations. I made a tweak to make any animations use the last keyframe value like this.

--- a/layout/style/ServoBindings.cpp
+++ b/layout/style/ServoBindings.cpp
@@ -801,6 +801,7 @@ Gecko_ElementTransitions_EndValueAt(RawG
 double
 Gecko_GetProgressFromComputedTiming(RawGeckoComputedTimingBorrowed aComputedTiming)
 {
+  return 1.0;
   return aComputedTiming->mProgress.Value();
 }
 
@@ -812,6 +813,7 @@ Gecko_GetPositionInSegment(RawGeckoAnima
   MOZ_ASSERT(aSegment->mFromKey < aSegment->mToKey,
              "The segment from key should be less than to key");
 
+  return 1.0;
   double positionInSegment =
     (aProgress - aSegment->mFromKey) / (aSegment->mToKey - aSegment->mFromKey)

Even with this tweak and with disabling OMTA (we use different code in the compositor), the issue still happens.  I suspect the translate Y value which is calculated with some style values is different from gecko.
(In reply to Hiroyuki Ikezoe (:hiro) from comment #28)
> I am mostly confident this issue is not related to animations. I made a
> tweak to make any animations use the last keyframe value like this.

That seems to go against the observations in comment 1, comment 2, and comment 3, though...
(In reply to Emilio Cobos Álvarez [:emilio] from comment #29)
> (In reply to Hiroyuki Ikezoe (:hiro) from comment #28)
> > I am mostly confident this issue is not related to animations. I made a
> > tweak to make any animations use the last keyframe value like this.
> 
> That seems to go against the observations in comment 1, comment 2, and
> comment 3, though...

Yeah, I think the problem we are seeing today has been changed since then because I can see the problem with disabling transitions.
I was wrong as usual. :-/  It's an animation problem.  This is a case that we need to post RequestRestyle even if the target element has no animation.
It seems that stylo fails to remove fill-forwards animation style in the case. If I forced to set fill mode to auto, the problem disappears.  Another thing I noticed is that CSS animations which are created when marking 'done' for a message are different from CSS animation created on gecko.  On stylo, I see lots of animations named 'gb__a', but on gecko there is no such named animations.
I don't have any clue yet, but I should note what I know currently:

1) there are two kind of animations;
  remove-before-var-xx: duration: 150ms, fill: forwards
    0% : transform translate(0px, 0px)
   20% : transform translate(0px, 0px)
  100% : transform translate(0px, 94px) // this value seems to depend on opened column height?
  remove-after-var-xx : duration: 150ms, fill: forwards
    0% : transform translate(0px, 0px)
   20% : transform translate(0px, 0px)
  100% : transform translate(0px, -839px) // this value depends on opened message size.
2) both type animations are initially created as 'paused' state, and changed to 'running'.
3) both animations start at the same tick.
4) remove-before-var-xx is for messages above opened message, whereas remove-after-var--xx is for messages below opened message.
5) remove-before-var-xx animations are destroyed by removing animation name property, whereas remove-after-var-xx are destroyed by Element::UnbindFromTree() at the same tick. (I don't know yet what the trigger for this UnbindFromTree() is)
6) if I changed fill property for remove-before-var-xx to 'auto', this problem is gone.
7) if I made the duration for remove-before-var-xx to a bit longer (say, 200ms), this problem is gone.

What I don't quite understand is that, as far as I can see,  the problem (i.e. incorrect position of elements) happen on 'remove-after-var-xx' elements, but as per 5), tweaking against 'remove-before-var-xx' fixes the problem.

Another thing I am wondering is when script destroys (i.e what triggers the destruction) those animations.  Both type of animations have fill:forwards so they persist until destroying explicitly.  I might need to check this gigantic uglified script.
Those symptoms feel a little similar to the animation generation bug (i.e. some combination of content means that we sometimes do updates of the animation generation that cancel each other and fail to update the compositor at the end). However, the fill:forwards thing suggests this is something different.

I noticed there was at least one animationend callback in the JS so perhaps that's where the animations are being destroyed?
(In reply to Brian Birtles (:birtles) from comment #34)
> Those symptoms feel a little similar to the animation generation bug (i.e.
> some combination of content means that we sometimes do updates of the
> animation generation that cancel each other and fail to update the
> compositor at the end). However, the fill:forwards thing suggests this is
> something different.
> 
> I noticed there was at least one animationend callback in the JS so perhaps
> that's where the animations are being destroyed?

yeah, I guess so.  One more thing I should note.  When I checked the elapsed time between starting the animations and destroying animations, it's about 200ms for stylo, but it's about 250ms for gecko. This thinks me that this problem is an underlying issue both on gecko and stylo, but in case of gecko styling is slower than stylo, then the problem does not appear at all because, if the animation duration is 200ms, the problem does not happen on stylo either.
Here is a continuation from comment 33.

8) The trigger of the animation destruction is animationend event for remove-item animation.
9) remove-item animation: duration: 150ms, fill: forwards
    0%: 	 opacity 0.9999
   20%: 	 opacity 0.0001
  100%: 	 opacity 0.0001
10) There seems to be another trigger of the destruction since even if I stopped firing the animationend event for remove-item the animations are destroyed, but it's about 1400ms later, so I guess there is a setTimeout call.
11) With stopping firing animationend event for remove-item solves this problem.  I guess the setTimeout call does something.
12) If I changed the duration for remove-item to a shorter one (say, 100ms), all animations are destroyed before its original end (150ms), even so this problem happens. More interestingly with this tweak, changing fill mode does not solve the problem.
Finally I understand what the problem is.  It's related to style cache.  In failure case, we cache style data which contains still animating style values that the animation is about to remove (i.e. there is no specified animation-name property, and the animation will be removed in a sequential task).  Patch is coming, I will write a test case apart from the patch.
(In reply to Hiroyuki Ikezoe (:hiro) from comment #37)
> Finally I understand what the problem is.  It's related to style cache.  In
> failure case, we cache style data which contains still animating style
> values that the animation is about to remove (i.e. there is no specified
> animation-name property, and the animation will be removed in a sequential
> task).  Patch is coming, I will write a test case apart from the patch.

Well, I said this over IRC, but this is a pretty nice find, thanks for catching it Hiro!

Also, I should've tested it with DISABLE_STYLE_SHARING_CACHE=1, sorry for not having done it before :(
(In reply to Emilio Cobos Álvarez [:emilio] from comment #38)
> (In reply to Hiroyuki Ikezoe (:hiro) from comment #37)
> > Finally I understand what the problem is.  It's related to style cache.  In
> > failure case, we cache style data which contains still animating style
> > values that the animation is about to remove (i.e. there is no specified
> > animation-name property, and the animation will be removed in a sequential
> > task).  Patch is coming, I will write a test case apart from the patch.
> 
> Well, I said this over IRC, but this is a pretty nice find, thanks for
> catching it Hiro!
> 
> Also, I should've tested it with DISABLE_STYLE_SHARING_CACHE=1, sorry for
> not having done it before :(

No problem at all. It was a good chance to know about style cache, now I know just a little bit though. :)
Comment on attachment 8900019 [details]
Bug 1379203 - Don't cache style data if the element has running animations.

https://reviewboard.mozilla.org/r/171350/#review176540

::: servo/components/style/sharing/mod.rs:512
(Diff revision 1)
>          if element.is_native_anonymous() {
>              debug!("Failing to insert into the cache: NAC");
>              return;
>          }
>  
> +        // If the element has any kind of running animations, we don't cache

// If the element has any kind of running animations, we don't cache
        // the style since it can be manipulate its state by script (i.e. Web
        // Animations APIs) without changing its styles. To check all the state
        // is not pragmatic.
        // Also this check needs for CSS animations/transitions since even if
        // there is no specified style for CSS animations/transitions at this
        // moment, there might be still animating style values since it might
        // be just about to remove them in sequential tasks which will be run

Let's say this:

// If the element has running animations, we can't share style.
//
// This is distinct from the specifies_{animations,transitions} check below,
// because:
//   * Animations can be triggered directly via the Web Animations API.
//   * Our computed style can still be affected by animations after we no
//     longer match any animation rules, since removing animations involves
//     a sequential task and an additional traversal.
Attachment #8900019 - Flags: review?(bobbyholley) → review+
Comment on attachment 8900019 [details]
Bug 1379203 - Don't cache style data if the element has running animations.

https://reviewboard.mozilla.org/r/171350/#review176538

Nice!

::: servo/components/style/sharing/mod.rs:512
(Diff revision 1)
> +        // If the element has any kind of running animations, we don't cache
> +        // the style since it can be manipulate its state by script (i.e. Web
> +        // Animations APIs) without changing its styles. To check all the state
> +        // is not pragmatic.

It took me little while to understand this comment. Does this summary make sense?

"If the element has running or filling animations we don't cache the style since the animated style can be invalidated by any number of changes made by script including using the Web Animations API and it's not practical to handle each of these cases.

Furthermore, in the case where a CSS animation or transition has just been removed, the checks for CSS animations and transitions later in this function will report that we have no animations or transitions but in fact the animated style will not be removed until we drop the animation in a subsequent sequential task. The check here for running animations ensures we prevent caching style until that happens."
Attachment #8900019 - Flags: review?(bbirtles) → review+
Oh, I didn't see Bobby's comment until now. Choose whichever version you like!
Thank you both!  I will take the Bobby's comment.  Also note that the test case might need bug 1392851 (as far as the test case I am writing), I will post the test case in bug 1392851.  Thank you!
Attached file Servo PR
https://hg.mozilla.org/integration/autoland/rev/7b70fa3015c7

NI hiro to land a testcase if appropriate.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Flags: needinfo?(hikezoe)
Resolution: --- → FIXED
https://hg.mozilla.org/mozilla-central/rev/b7e692446499
Landed the test case.
Flags: needinfo?(hikezoe)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: