Closed Bug 1342220 Opened 7 years ago Closed 7 years ago

Janky Twitter thread-dismissing animation

Categories

(Core :: Layout, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
Performance Impact high
Tracking Status
platform-rel --- +
firefox54 --- affected

People

(Reporter: overholt, Assigned: dholbert)

References

(Blocks 2 open bugs, )

Details

(Whiteboard: [platform-rel-Twitter])

I clicked on https://twitter.com/RickByers/status/834539954948698112, looked at the full thread, then pressed Escape to dismiss the pseudo-modal thread view and the sort of fade-out animation of it disappearing was really janky.

Profile: https://perf-html.io/public/943a67b68790b07c3e03952721015687e6ed5d19/flameChart/?range=35.0302_53.2190&thread=1

(This isn't really UI jank but I'm not sure what the other meta bug is we're using to track this, Ehsan.)
Flags: needinfo?(ehsan)
Found the right bug to block.
Blocks: QuantumFlow
No longer blocks: UIJank
Flags: needinfo?(ehsan)
I think this got better when smaug's high priority refresh driver changes were landed.
There's still a really long layout flush though. I went to https://twitter.com/RickByers/with_replies , scrolled to the bottom for a while to make more tweets load, and then scrolled back up and clicked one tweet from that conversation. Then I took a profile for just the dismissal animation (which pauses just before opacity of the overlay reaches zero): https://perfht.ml/2kRZ9EP
Yeah I can also reproduce the super long reflows: https://perfht.ml/2lrlKo3

Reading this bug reminded me that Twitter is an example of a web site where I have seen super expensive reflows on many times in the past.  I don't use it daily so I hadn't really run into it when profiling for the QF project.  I'm very glad this bug reminded me of this.  :-)
Blocks: FastReflows
Flags: needinfo?(dholbert)
I used mstange's STR (comment 3), and got a 1.82 second event-handling delay at the end of the dismissal animation.

 * There's a 459ms period at the beginning of the delay which is mostly in RestyleTracker::ProcessRestyles: https://perfht.ml/2lltvLr 
   - And really most of that -- 415ms -- is in nsCSSFrameConstructor: https://perfht.ml/2llAjIS
 * After that, we spend 1051ms in DoReflow: https://perfht.ml/2llGlJq

Given all the nsCSSFrameConstructor work that we're doing, I'm not surprised that we spend so long in reflow. I suspect the nsCSSFrameConstructor work is what we need to chip away at here -- hopefully there are some cases where we don't actually need to recreate frames, or we can ask twitter to do something slightly different that doesn't trigger frame reconstruction.
This reminds me of bug 1256706 comment 5, which is another case in which twitter was causing technically-unnecessary frame reconstruction when they dynamically added "position:relative" to a subtree.  They ended up adjusting their website to do something different (bug 1256706 comment 16), but I filed bug 1261484 on the optimization that would've helped with the original site.

It's possible that the frame reconstruction here has a similar cause & can similarly be optimized away...
Yeah -- with comment 3 STR, when the dialog is dismissed, we get a RecreateFramesForContent invocation _for the <body> element itself_. Our devtools show that Twitter adds "position:relative" to the body when the overlay appears. (Technically they add class="overlay-enabled" to the <body>, and there's a .overlay-enabled CSS rule which has "position:relative".)

So we jank when the dialog appears & disappears because we're recreating the entire page's frames (which is quite a lot, if you've scrolled down lots and lots as called for in comment 3). And we have to reflow them all, sine they're freshly recreated.
Here's a little benchmark script that can be used to simulate (part of) the thread-dismissing animation from the Web Console.

STR:
0. Be logged into Twitter.
1. Visit https://twitter.com/RickByers/with_replies
2. Scroll down [triggering infinite scroll] until you see tweets from October 2016.  (Hit "end" key on your keyboard 15-20 times.)
3. Hit F12 and enter the following into your Web Console:

var start = new Date(); document.body.style.overflow = "hidden"; document.documentElement.offsetHeight; document.body.style.overflow = ""; document.documentElement.offsetHeight; alert(new Date() - start);

ACTUAL RESULTS (time duration in ms, reported in the alert):
 - Firefox: 2800-4800
 - Edge: 130-200
 - Chrome: 15-30
Flags: needinfo?(dholbert)
Depends on: 1344377
Depends on: 1344398
(Sorry, I was partly mistaken in comment 8 - I marked that comment as 'obsolete'.)

My profiling & testing shows that the frame reconstruction (and resulting slow reflow) here is entirely due to the "overflow" & "position" changes on the <body> element.
Specifically, if I apply this stylish rule...
/************/
@namespace url(http://www.w3.org/1999/xhtml);

@-moz-document domain("twitter.com") {
  body {
    overflow: scroll !important;
    position: static !important;
  }
}
/************/
...then Reflow completely disappears from my profile of the "overlay-dismissed" usecase (with mstange's STR from comment 3).

I still get 300ms of restyling (due to other restyles that are part of the overlay dismissal), but the ~1000ms of reflow almost entirely disappears (drops to 1ms).

Here's a profile of the (shorter) pause for overlay dismissal, with that Stylish rule applied, BTW (viewing https://twitter.com/RickByers/with_replies with tweets loaded back to Oct 2016): https://perfht.ml/2lnXcjT


SO. To get a serious win here, we'd need to figure out how to avoid reconstructing the world for the "overflow" & "position" changes.  I filed bug  1344377 on "position" and bug 1344398 on "overflow", with reduced benchmarks/testcases.
Do other engines manage to get away without reconstructing their CSS box objects in these cases?
(In reply to :Ehsan Akhgari from comment #11)
> Do other engines manage to get away without reconstructing their CSS box
> objects in these cases?

I can reply to that. At least blink can get away without it in the test case for bug 1344398.

Concretely, if I run a debug build of Blink with the following patch applied:

diff --git a/third_party/WebKit/Source/core/dom/Element.cpp b/third_party/WebKit/Source/core/dom/Element.cpp
index e32818b8d8cc..23b508a98f79 100644
--- a/third_party/WebKit/Source/core/dom/Element.cpp
+++ b/third_party/WebKit/Source/core/dom/Element.cpp
@@ -1714,6 +1714,9 @@ void Element::removedFrom(ContainerNode* insertionPoint) {
 void Element::attachLayoutTree(const AttachContext& context) {
   DCHECK(document().inStyleRecalc());
 
+  CString str = toString().utf8();
+  printf("Rebuilding layout for: %s\n", str.data());
+
   // We've already been through detach when doing an attach, but we might
   // need to clear any state that's been added since then.
   if (hasRareData() && needsAttach()) {

(Which is the main entry point for Blink's layout code), and with console messages at the relevant places in that test case, the output once the test starts is:

> [16049:16049:0308/200525.664015:40590405339:INFO:CONSOLE(32)] "begin", source: file:///home/emilio/.../_test.html
> [16049:16049:0308/200526.188134:40590929408:INFO:CONSOLE(36)] "once", source: file:///home/emilio/.../_test.html (36)
> [16049:16049:0308/200526.710444:40591451717:INFO:CONSOLE(39)] "Tweaking overflow (twice) took: 1045ms", source: file:///home/emilio/.../_test.html

(That is, no reconstruction happens).

I can try to provide more insight in how do they handle it if you're interested.
In particular, Blink only rebuilds the layout tree under the conditions in stylePropagationDiff[1].

The Reattach bit is the "rebuild the layout tree for this subtree" bit, and it's only done when display changes, a first-letter is created, there are different "content" properties (for ::before and ::after), or has different text-combine values.

The rest of the changes seem to be handled by layout, believe it or not, huh.

[1]: https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/style/ComputedStyle.cpp?l=203&rcl=ea7c380cd11563d17535465c740cc574f4accda9
Very interesting, thanks Emilio!  You and dholbert should probably sync up more often around this if you aren't already.  :-)

Another question about Blink, based on your experience on it, are you aware of optimization strategies that they use which may be beneficial for us to adopt?  We're looking at our reflow performance so if you are familiar with ideas that we can borrow from what Blink does it would be super helpful to know.  Thanks!!
(In reply to :Ehsan Akhgari from comment #14)
> Another question about Blink, based on your experience on it, are you aware
> of optimization strategies that they use which may be beneficial for us to
> adopt?  We're looking at our reflow performance so if you are familiar with
> ideas that we can borrow from what Blink does it would be super helpful to
> know.  Thanks!!

I can't think of anything off-hand (I know more about the CSS engine side of things), but if something comes up that could be interesting I'll remember to comment here :)
Whiteboard: [qf:p1]
platform-rel: --- → ?
Whiteboard: [qf:p1] → [qf:p1][platform-rel-Twitter]
Assignee: nobody → bugs
BTW: twitter's Direct Message modal-popup seems to trigger similar UI, and similar jank when it appears/disappears.  So hopefully the same optimizations will help with that, too.
platform-rel: ? → +
Some funny thing, seems like chrome has a fast-path for toggling overflow on body, see https://bugs.chromium.org/p/chromium/issues/detail?id=422162.

Perhaps that's an easier solution? :P
(In reply to Emilio Cobos Álvarez [:emilio] from comment #17)
> Some funny thing, seems like chrome has a fast-path for toggling overflow on
> body, see https://bugs.chromium.org/p/chromium/issues/detail?id=422162.
> 
> Perhaps that's an easier solution? :P

That's still worth a look. 

Daniel: since the work here is happening in bug 1344377 and bug 1344398, do you see where we might hang this fast-path?
Flags: needinfo?(dholbert)
(In reply to Jet Villegas (:jet) from comment #18)
> (In reply to Emilio Cobos Álvarez [:emilio] from comment #17)
> > Some funny thing, seems like chrome has a fast-path for toggling overflow on
> > body, see https://bugs.chromium.org/p/chromium/issues/detail?id=422162.
> 
> Daniel: since the work here is happening in bug 1344377 and bug 1344398, do
> you see where we might hang this fast-path?

This is basically what I ended up going with in bug 1344398, per bug 1344398 comment 12. (fast-path for "overflow" changes on <body> and <html>)

I'm hoping to have a patch up ready for review this afternoon.
Flags: needinfo?(dholbert)
(In reply to Daniel Holbert [:dholbert] from comment #6)
> This reminds me of bug 1256706 comment 5, which is another case in which
> twitter was causing technically-unnecessary frame reconstruction when they
> dynamically added "position:relative" to a subtree.  They ended up adjusting
> their website to do something different (bug 1256706 comment 16), but I
> filed bug 1261484 on the optimization that would've helped with the original
> site.
> 
> It's possible that the frame reconstruction here has a similar cause & can
> similarly be optimized away...

Sad news on this front -- the optimization in bug 1261484 will not be helpful here, because it seems Twitter does have some abspos content that's positioned with respect to the root scrollport (based on watching for nsAbsoluteContainingBlock::Reflow calls in GDB, with non-empty mAbsoluteFrames and with aDelegatingFrame being a root-ish frame).  I see calls to that function with aDelegatingFrame = the <html> element's nsCanvasFrame, as well as for the root ViewportFrame -- with the abspos frame being for some <div> somwhere in the page.

That means that when Twitter promotes <body> to be position:relative (for the overlay), the <body> element's frame likely needs to take over as being the abspos containing block for the abspos-children that were previously positioned with respect to the nsCanvasFrame/ViewportFrame (the ones I saw in my GDB session noted in previous paragraph).

And that means these Twitter STR (showing the overlay) would not fall into the ~trivially-optimize-away-able scenario which I described in bug 1261484.  So, we need a more subtle optimization to help with the "position" changes here.
I'm pretty sure Twitter doesn't really need to set "position:relative" on the body here -- so for that part of this issue, I don't know that it's really worth implementing an optimization [and taking some security risk from doing frametree surgery].

I plan to reach out to them on our partner list, to request that they consider removing that style tweak.  I might wait on that until bug 1344398 has landed, though, because then they'll actually be able to test Nightly and verify that this style tweak helps.  (We have to address BOTH their "position" tweak [possibly via this outreach] AND the "overflow" tweak [via bug 1344398] in order to see a perf impatct here, per end of comment 10.)
Assignee: bugs → dholbert
Blocks: 1351016
Depends on: 1367568
OK, I have some happy results here -- with my patch on bug 1367568 and my stress

First: I'll switch to using my Twitter feed as the "canonical" testcase, so as not to give Rick Byers too much unexpected attention. :)  I tweet a bit less frequently, so you have to scroll to a date that's further back in time to pull in enough content to make this painful.

So, current STR are in comment 9 but feel free to use https://twitter.com/CodingExon/with_replies and scroll back to December 2014. With that setup: in current Nightly, clicking a tweet at the top of the feed triggers a ~4 seconds hang before the modal tweet appears.

With bug 1367568 and with the body set to always be "position:relative" (or always "position:static !important") in devtools, clicking a tweet at the top of the feed instead triggers a ~1 second hang.  And that 1 second is pretty much entirely in restyling[1], with negligible time spent in reflow (50ms or less, spread out over time).

Still not perfect (Chrome takes a decent fraction of a second but not as much time).  But, it's much better than it was.

[1] dbaron helped me look into why we spend ~1 second restyling here -- basically, it's because they add a class to the body ("overlay-enabled") *and* they have various descendant selectors that involve that class, so they end up having to redo selector matching for all the descendants (which is a lot of descendants in this stress test).
Depends on: 1368070
(In reply to Daniel Holbert [:dholbert] from comment #22)
> OK, I have some happy results here -- with my patch on bug 1367568 and my stress

(Just noted I trailed off there. :) I was going to say "stress-test of Twitter", i.e. the steps described further down in the comment.)
Twitter landed the fix for the <body> "position" property on their end (it's always position:relative now) -- see bug 1368070 -- and I verified that this is much snappier in Nightly now!

Calling this FIXED by bug 1344398, bug 1367568, and bug 1368070.  Also, this no longer needs to depend on bug 1344377, so I'm removing that dependency.
Status: NEW → RESOLVED
Closed: 7 years ago
No longer depends on: 1344377
Resolution: --- → FIXED
(Further note: with Stylo enabled, we seem to avoid the still-remaining 1 second of restyling that I hinted at in comment 22 -- hooray!)
(In reply to Daniel Holbert [:dholbert] from comment #25)
> (Further note: with Stylo enabled, we seem to avoid the still-remaining 1
> second of restyling that I hinted at in comment 22 -- hooray!)

I wrote bug 1368240 just to avoid that kind of cases, so yay! :)
Great! I'll mark this bug as depends-on that one, then, to keep track of the fact that that's one of the things that helped here (since at this point, this is a metabug whose dependencies all chipped away at the original jank in some way or another).
Depends on: 1368240
Performance Impact: --- → P1
Whiteboard: [qf:p1][platform-rel-Twitter] → [platform-rel-Twitter]
You need to log in before you can comment on or make changes to this bug.