Closed Bug 1579929 Opened 1 year ago Closed 1 year ago

Contents area seems to hang up after resize browser/navigate between other twitter on twitter.com

Categories

(Core :: Layout: Flexbox, defect, P2)

Desktop
Windows 10
defect

Tracking

()

VERIFIED FIXED
mozilla72
Tracking Status
firefox-esr68 --- unaffected
firefox70 --- unaffected
firefox71 --- unaffected
firefox72 --- verified

People

(Reporter: alice0775, Assigned: dholbert)

References

(Regression)

Details

(Keywords: hang, regression)

Attachments

(6 files)

Reproducible: always on Nightly71.0a1.
not reproduced on Firefox69.0.

Steps To Reproduce:
0. Start Nightly with newly created profile

  1. Open https://twitter.com/ and logged in
  2. Wait page loading
  3. Resize browser width, repeat step3 1-4 times

Actual Results:
Browser UI is working. But contents has the following problem.
Mouse left, right and middle click does not work.
Mouse hover effects and mouse cursor are not changed.
Back/Reload/Home button stops working.
However, mouse wheel scroll is working as expected though page would not be added.

Expected Results:
Page should be successfully resized without hang up

Attached file about:support
Regressed by: 1572216
Has Regression Range: --- → yes
Has STR: --- → yes
Keywords: regression
See Also: → 1579732
Attached video screencast

Hi @Alice0775 White, tested in Windows 10 machine using nightly 71.0a1 - cannot reproduce it. Can you send please an about:support report?
I will set a component, if ins't the proper one please fell free to change it.
Regards,
Liviu

Component: Untriaged → Layout
Product: Firefox → Core

(In reply to Liviu Seplecan from comment #5)

Hi @Alice0775 White, tested in Windows 10 machine using nightly 71.0a1 - cannot reproduce it. Can you send please an about:support report?
I will set a component, if ins't the proper one please fell free to change it.
Regards,
Liviu

I already attached the about:support. See attachment above.

Hi, sorry for that, could be your GPU. I saw that you've reported another bug regarding the resize of the window's browser. None of those bugs can be reproduced in my end.
Regards,
Liviu

Different way of STR

  1. Open https://twitter.com/home and Logged in
  2. Wait for page loading
  3. Click one of twitter in "Trends for you"
  4. Wait for page loading
  5. Click Back button in Nav-Bar
    --- abserve, contents hang up for a long period(more than 1 min)
    Browser UI is working. But contents has the following problem.
    Mouse left, right and middle click does not work.
    Mouse hover effects and mouse cursor are not changed.
    Back/Reload/Home button stops working.
    However, mouse wheel scroll is working as expected though page would not be added.
  6. if not reproduce repeat from step 3

I got a same regression window of comment#3

Summary: Contents area seems to hang up after resize browser on twitter.com → Contents area seems to hang up after resize browser/navigate between other twitter on twitter.com

Hi @Alice0775 White, retest in nightly 71.0a1 - based on the STR from comment 8 => cannot reproduce.

(In reply to Liviu Seplecan from comment #9)

Hi @Alice0775 White, retest in nightly 71.0a1 - based on the STR from comment 8 => cannot reproduce.

Liviu -
I note Alice is not running webrender (device-too-old per about:support). Can you retry with webrender off (in win10)? Thanks

Flags: needinfo?(liviu.seplecan)

Hi @Randell Jesup, re-tested in latest nightly 71.0a1 (30-09-2019) using Direct3D 11 (Advanced Layers) instead of Webrender => same result - cannot reproduce. The hang issue in Alice's end couldn't be related to the Webrender, other behavior could be the cause of the issue.
Regards,
Liviu

Flags: needinfo?(liviu.seplecan)

The priority flag is not set for this bug.
:dbaron, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(dbaron)

https://perfht.ml/2mnFLSe

Nightly is unusable... hangup

This is totally not GPU related, this is just hanging up on layout during a resize reflow, which is weird because I can't repro either...

Alice, is there any chance to send me (via email or using the "Make attachment and comment private" in bugzilla) a version of the page that reproduces the issue for you?

Also, which viewport dimensions are you using when it happens?

I really don't understand how bug 1572216 could possibly cause this though, that's a build system change... :(

Any idea Mike (Nathan is away)?

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(alice0775)

Umm, I can reproduce the hang with the probability of almost 100%.

STR

  1. new profile
  2. Open https://twitter.com/FxSiteCompat
  3. Logged in
    --- now 3 columns(menus/contents/search twitter)
  4. Reduce browser width so that the right side pane will be out of viewport(i.e, menus/contents)
    --- hang
    (if not hang)
  5. Enlarge browser window so that twitter has 3 columns
  6. go to step 4
Flags: needinfo?(alice0775)
Attached image Mozregression

(In reply to Alice0775 White from comment #15)

Umm, I can reproduce the hang with the probability of almost 100%.

STR

  1. new profile
  2. Open https://twitter.com/FxSiteCompat
  3. Logged in
    --- now 3 columns(menus/contents/search twitter)
  4. Reduce browser width so that the right side pane will be out of viewport(i.e, menus/contents)
    --- hang
    (if not hang)
  5. Enlarge browser window so that twitter has 3 columns
  6. go to step 4

Mozregression shows (this is same as comment#3)
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=36c3240e5cafd7b57146bab3b177bfa47f42bcfa&tochange=2909b0a1eb06cc34ce0a11544e5e6826aba87c06

finally I got same regression window....
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=e07c0fe7189bd39dd52625006a29edbcac2e9750&tochange=c2141e08383873f70201cc84613ef5904cb27d07

Well, those STR still look pretty snappy here, no trace of hangs or whatever. But I'm not too surprised since the regression seems to point to a windows-only build change.

I still don't know why people can't reproduce it on Windows either though :(

Interesting,
setting layout.interruptible-reflow.enabled to false mitigates the hangup.

Being related to interruptible reflow makes it quite plausible that the bug might be reproducible on slower machines but not on faster ones. Changing the GECKO_REFLOW_MIN_NOINTERRUPT_DURATION environment variable to a shorter/longer time may affect whether it's reproducable. (It defaults to 100, which is a time in milliseconds.)

It seems possible that there might be something in the flex (or other) layout code that doesn't handle reflow interrupts correctly and thus leads to extreme performance problems in some cases. Is it normal for the bulk of flex layout to happen within nsFlexContainerFrame::MeasureAscentAndBSizeForFlexItem as it does here?

(Does the hang end after some reasonable period of time, or does it hang forever?)

It's not clear to me why this would be a regression from cross-language LTO changes. But it's also plausible.

Component: Layout → Layout: Flexbox
Flags: needinfo?(dbaron) → needinfo?(dholbert)
Priority: -- → P2

(Does the hang end after some reasonable period of time, or does it hang forever?)

The hang is for very long time(for 1-2min and more)

(In reply to David Baron :dbaron: 🏴󠁵󠁳󠁣󠁡󠁿 ⌚UTC-7 from comment #19)

It seems possible that there might be something in the flex (or other) layout code that doesn't handle reflow interrupts correctly and thus leads to extreme performance problems in some cases. Is it normal for the bulk of flex layout to happen within nsFlexContainerFrame::MeasureAscentAndBSizeForFlexItem as it does here?

I think that is often normal, yeah. That's our first attempt at reflowing a flex item, and often we don't need a second attempt (during the same reflow of its flex container), so it may show up as the bulk of the reflow time.

Note that we do try to cache the result of that function, but interrupted-reflow makes us get rid of the cached measurement, because we didn't finish measuring:
https://searchfox.org/mozilla-central/rev/35cc00a25c4471993fdaa5761952bd3afd4f1731/layout/generic/nsFlexContainerFrame.cpp#4294-4302

Changing the GECKO_REFLOW_MIN_NOINTERRUPT_DURATION environment variable to a shorter/longer time may affect whether it's reproducable. (It defaults to 100, which is a time in milliseconds.)

I can reproduce some minor hangs (a few seconds) if I reduce that env variable to 5 (effectively allowing reflow interruption right away, I think):
https://perfht.ml/2oxjAcU

I haven't been able to reproduce anything worse than that, though, using Alice0775's STR from comment 15.

Flags: needinfo?(dholbert)

I think the hangs get somewhat worse if I scroll down a bunch, e.g. loading FxSiteCompat tweets as far back as 2017. (Still using a value of 5 for the interruptible reflow env variable.)

Here's a profile after a window-resize after I've populated FxSiteCompat backscroll back through 2017: https://perfht.ml/2pqYmhm

It shows a ~7 second hang which includes two back-to-back ~3 second reflows. The first reflow is synchronously flushed inside of a SetWindowDimensions operation, and then after that there's a very short "resize" DOM event, and then there's a refresh driver tick with the second long reflow (2784ms).

Flags: needinfo?(mh+mozilla)

Sean, could you help us with this major perf bug? It seems to be related to the layout.interruptible-reflow.enabled pref and we don't know who to send this to. Thanks!

Flags: needinfo?(svoisen)

Emilio and dholbert have been investigating. At this point, we're only able to repro if the interrupt is set very low (like 5ms), but that's the default setting. There's likely something going on where interrupts are not properly handled, but still TBD. We will keep working on it.

Flags: needinfo?(svoisen)

Emilio, can you handle this bug while Sean is on PTO? I am a bit worried that the bug is unassigned and we are in beta now.

Flags: needinfo?(emilio)

(In reply to Pascal Chevrel:pascalc from comment #25)

Emilio, can you handle this bug while Sean is on PTO? I am a bit worried that the bug is unassigned and we are in beta now.

I can still reproduce the issue on Latest Nightly72.0a1(20191023094816),
However, I cannot reproduce on Firefox 71.0b3 (20191021164841).
(I tested with new profile both)

I'm taking a look.

Assignee: nobody → dholbert
Flags: needinfo?(emilio)

I'm starting to understand a thing that goes wrong with interruptible reflow + flexbox here (which may be what's at fault).

  • Twitter has tons of flex containers with abspos content
  • When we get a pending interrupt during reflow, our flex container doesn't (itself) call CheckForInterrupt()
  • However, it does call FinishReflowWithAbsoluteFrames which (several layers down) does check for an interrupt in nsAbsoluteContainingBlock::Reflow. From this point on (until our next top-level reflow task starts) aPresContext->HasPendingInterrupt() will return true.
  • When we finish reflowing a given nsFlexContainerFrame, it tests whether we've seen an interrupt and (if so) throws away all of its cached measurements:
void nsFlexContainerFrame::DidReflow(nsPresContext* aPresContext,
                                     const ReflowInput* aReflowInput) {
  // Remove the cached values if we got an interrupt because the values will be
  // the wrong ones for following reflows.
[...]
  if (aPresContext->HasPendingInterrupt()) {
    for (nsIFrame* frame : mFrames) {
      frame->DeleteProperty(CachedFlexMeasuringReflow());
    }
  }
  • This would be an appropriate thing to do if we were bailing out all the way. However, we are not necessarily bailing out all the way -- we may get reflowed multiple times due to being e.g. a flex item that needs a measuring reflow + a final reflow, or e.g. due to being in a scrollframe (which reflows its content with + without scrollbar spacing subtracted out). If we had a cached measurement for our child content measurements, then the subsequent reflows would be cheaper; however, we've just thrown all of those cached measurements away, which ensures that we'll do a more-expensive "measuring" reflow of all our children who need it. And if those children are flex containers, then this recursively applies to them, too (they'll re-measure their own children on each repeated reflow as well, because they won't have any cached measurements either).

So this all blows up quite painfully, because we're throwing away (potentially-invalid) cached measurements which makes the rest of our interrupted reflow much more expensive.

Perhaps we should use a reflow callback to do the throwing-away....

(In reply to Daniel Holbert [:dholbert] from comment #29)

[...]
So this all blows up quite painfully, because we're throwing away (potentially-invalid) cached measurements which makes the rest of our interrupted reflow much more expensive.

Perhaps we should use a reflow callback to do the throwing-away....

Better plan:

  • Create a flex-container-specific frame state bit, that indicates "my flex item cached measurements are potentially invalid due to reflow having been interrupted"
  • The nsFlexContainerFrame::DidReflow() "if HasPendingInterrupt()" clause should simply set this bit -- it shouldn't delete the cached measurement anymore.
  • When we look up a cached measurement: if HasPendingInterrupt() is true, then proceed as normal to let us finish up this reflow quickly. If it's false and the bit is set, then we delete the cached measurements for all flex items and clear the bit.

This ensures that one reflow-interruption will trigger a single cache-purging (per flex container). And the cache purging is "lazy" (happening after we've resumed layout when we touch the cached values), so that the interrupted layout can be interrupted ~quickly.

Status: NEW → ASSIGNED

[Tracking Requested - why for this release]:

Attachment #9109217 - Attachment description: Bug 1579929: When a reflow is interrupted, don't purge flex item measurements until the next time they're needed in a non-interrupted reflow. → Bug 1579929: When a reflow is interrupted, don't purge flex item measurements until the next time they're needed in a later non-interrupted reflow. r?emilio

To demonstrate the test's validity, here's a try run with the test but without the functional changes in this bug's patch:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee2529151d906768400f68f0327398486d3aeeab

...and here's a try run with the functional changes:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee2529151d906768400f68f0327398486d3aeeab

Pushed by dholbert@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/34f0f8db631a
When a reflow is interrupted, don't purge flex item measurements until the next time they're needed in a later non-interrupted reflow. r=emilio

I forgot to adjust this bug's new logging line to use the new nsFlexContainerFrame logging macro that was just introduced in bug 1596973.

I added a trivial followup to make that change.

Pushed by dholbert@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/75035491e69f
followup: change this bug's new logging to use new FLEX_LOG macro. (no review)
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72

Alice0775 White, would you mind testing the latest Nightly to confirm that the issue has gone away? (I'm hoping it's better, with the changes here & the simplifying followup bug 1597348.)

Flags: needinfo?(alice0775)

(In reply to Daniel Holbert [:dholbert] from comment #39)

Alice0775 White, would you mind testing the latest Nightly to confirm that the issue has gone away? (I'm hoping it's better, with the changes here & the simplifying followup bug 1597348.)

I can reproduce the issue on Nightly 20191118093852.
And I verified fix on Nightly 20191119043902. No longer reproduce the problem. Thanks.

Flags: needinfo?(alice0775)
Status: RESOLVED → VERIFIED

That's wonderful news. Thanks for your patience here (and of course for reporting and helping to diagnose this)!

(I still have no idea how/why the regression range points to bug 1572216... I'm guessing bug 1572216 somehow influenced the processing order of interruption-servicing, somehow, such that it became easier to trigger an interruption earlier in reflow & cause more/worse blowup.)

You need to log in before you can comment on or make changes to this bug.