long repeated reflows (200-300ms+) on newsblur.com, due to repeated YouTube iframe removals (with layout flush in youtube iframe's pagehide handler, which makes us greedily flush *outer* doc's layout)

RESOLVED FIXED in Firefox 68

Status

()

defect
P3
normal
RESOLVED FIXED
a month ago
a month ago

People

(Reporter: froydnj, Assigned: emilio)

Tracking

({perf})

unspecified
mozilla68
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox68 fixed)

Details

(Whiteboard: [qf], )

Attachments

(3 attachments)

Reporter

Description

a month ago

Experienced some nasty jank this morning and captured the profile below:

https://perfht.ml/2DoO9WW

This was on newsblur.com, but I think it's also triggering some bits of embedded youtube layout. It's possible this is just a site issue, and they could be more efficient in manipulating the DOM, rather than triggering a bunch of layout flushes?

Profile zoomed to the relevant track:
https://perfht.ml/2Un87Ha

(The blue bars there are largely ~300ms reflows)

In each case, the stack looks like this (I've removed some intervening levels for clarity):

 [newsblur JS]
   map   // probably operating on an array
     Node.removeChild
       nsDocShell::FirePageHideNotificationInternal
         [youtube JS]
           mozilla::dom::Window_Binding::get_innerWidth
             mozilla::dom::Document::FlushPendingNotifications
               Reflow http://www.newsblur.com/site/2022987/carryology-exploring-better-ways-to-carry

I think this may be a pathological "modifying DOM + flushing layout, in a loop" scenario...

The map call is probably iterating over a collection of nodes/objects, and it seems we're calling removeChild() on a bunch of youtube iframes perhaps, and each one of those youtube iframes has a pagehide listener (which gets invoked during the removeChild operation), which flushes layout (on the outer document, it seems).

STR (no account needed):

  1. Visit https://www.newsblur.com/site/2022987/carryology-exploring-better-ways-to-carry

  2. Scroll down the middle pane, clicking on every thumbnail that has a "play" icon on it. Do this for ~10 items. (This should automatically scroll the right pane, and populate it with youtube embeds, I think.)

  3. Start profiling.

  4. Click some other item (e.g. "BLOGS > xkcd.com") in the left pane.

  5. Stop profiling.

Here's my profile from these STR, zoomed to 1.6s of jank that seems to be the same stuff quoted in comment 1:
https://perfht.ml/2UolW85

One "outer" relevant JS function here is called "reset_feed". Here's another profile I captured, with screenshots, filtered for "reset_feed":
https://perfht.ml/2UvrH4c
time spent in reset_feed: 1,371ms

And here's a profile I captured in Chrome 75 dev, filtered for "reset_feed":
https://perfht.ml/2Uu98Nr
time spent in reset_feed: 140ms

So, they do seem to be spending an order of magnitude less time in this reset_feed function.

Keywords: perf
Whiteboard: [qf]

So one likely problem here: I'm assuming that the innerWidth call here (in the youtube pagehide handler) is being called from JS that's inside of the youtube iframe -- but nonetheless, it's flushing layout in the outer page (" Reflow http://www.newsblur.com/"), and that's expensive because the page is in the process of rebuilding its entire feed.

If that is indeed what's happening, we should try to prevent that flush from leaking outside of the iframe.

This testcase confirms comment 4 -- a layout flush from inside of an iframe's pagehide handler does flush layout outside of the iframe, in Firefox (but not in Chrome).

testcase 1 is a simplified version of the newsblur scenario, with only one iframe. It queues up a bunch of layout work in the outer document, and then it removes a youtube iframe from the DOM, and then it does an explicit layout flush (with offsetHeight) in the context of the outer page. It reports the duration of the iframe removal and the outer page's explicit layout flush.

CHROME:
remove() took 0ms
offsetHeight took 431ms

FIREFOX:
remove() took 295ms
offsetHeight took 0ms

This indicates that we're flushing the outer page's slow layout during the iframe remove() operation.

Summary: long repeated reflows (200-300ms+) on newsblur.com → long repeated reflows (200-300ms+) on newsblur.com, due to repeated YouTube iframe removals (with layout flush in youtube iframe's pagehide handler)

Here's a more severe testcase, with more iframes and where the expensive-to-flush layout work only exists during the frame.remove() operation (and then the DOM nodes are cleaned up after the frame.remove() operation).

This makes the contrast more stark, because Chrome never flushes the expensive layout work here.

With testcase 2, Chrome gives me "iframe removal took 0ms", whereas Firefox gives me "iframe removal took 1743ms"

For comparison:

  • Edge 18 behaves like Firefox (remove() is slow in the testcases here)
  • Safari 12 behaves like Chrome (remove() is fast in the testcases here, due to not flushing outer doc's layout)

So for now we're not the only ones who do this (flush the outer doc in response to a flush in the inner doc), but we will be once Edge switches over to chromium.

Summary: long repeated reflows (200-300ms+) on newsblur.com, due to repeated YouTube iframe removals (with layout flush in youtube iframe's pagehide handler) → long repeated reflows (200-300ms+) on newsblur.com, due to repeated YouTube iframe removals (with layout flush in youtube iframe's pagehide handler, which makes us greedily flush *outer* doc's layout)

So this flush is generally needed, but I think we may be able to skip it in this case, since the owner content is no longer connected. This may be trivial to fix, let me take a look.

Flags: needinfo?(emilio)

Oh, and the other thing we may be able to do is to not propagate the flush to cross-origin iframes. Not sure how cheap that check is.

I have both a non-controversial fix and a potentially-more-controversial optimization on top.

Assignee: nobody → emilio
Assignee

Updated

a month ago
Flags: needinfo?(emilio)

And add a test for the same not happening already for normal flushes.

The patch above fixes this and I think it's straight-forward. A further potential improvement is to not flush parent document layout if the parent / child document can't observe each other.

Boris, wdyt about that? It's mostly an optimization, but it can also be a pesimization in some particular cases (when the subdocument has pending style or layout changes and the layout / style of the subdocument is actually affected by the parent document, in which case we'd end up doing more work in total, though a chunk of the work will happen async). I think it's worth it, and it's closer to what Fission will end up doing anyway (presumably we're not doing sync IPC for this stuff).

If you think it's worth it, is NS_SUCCEEDED(nsContentUtils::CheckSameOrigin(this, mParentDocument)) the right check for that? That's what I used on my WIP.

Flags: needinfo?(bzbarsky)

Boris, wdyt about that

It seems like a good idea to me, generally, and yes, the fission argument is pretty compelling.

CheckSameOrigin isn't the right check, because that can return false for things that can observe each other sync
(due to document.domain).

The right check, and one which will match the fission behavior, as far as I can tell, would be to flush the parent only if parent->GetDocGroup() == GetDocGroup(). While we're there, I think we should inline GetDocGroup and out-of-line just the debug checks it does, at which point in an opt build the GetDocGroup() check should be quite cheap.

Flags: needinfo?(bzbarsky)
Assignee

Updated

a month ago
Blocks: 1440537

Turns out we had bug 1440537 already on file for that, I'll send the patch there.

Comment 17

a month ago
Pushed by ealvarez@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bbdaf05c9d16
Don't flush parent document layout for detached frames from EnsureSizeAndPositionUpToDate. r=dholbert,bzbarsky
Priority: -- → P3

Comment 18

a month ago
bugherder
Status: NEW → RESOLVED
Last Resolved: a month ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
You need to log in before you can comment on or make changes to this bug.