Contents area seems to hang up after resize browser/navigate between other twitter on twitter.com
Categories
(Core :: Layout: Flexbox, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr68 | --- | unaffected |
firefox70 | --- | unaffected |
firefox71 | --- | unaffected |
firefox72 | --- | verified |
People
(Reporter: alice0775, Assigned: dholbert)
References
(Regression)
Details
(Keywords: hang, regression)
Attachments
(6 files)
Reproducible: always on Nightly71.0a1.
not reproduced on Firefox69.0.
Steps To Reproduce:
0. Start Nightly with newly created profile
- Open https://twitter.com/ and logged in
- Wait page loading
- Resize browser width, repeat step3 1-4 times
Actual Results:
Browser UI is working. But contents has the following problem.
Mouse left, right and middle click does not work.
Mouse hover effects and mouse cursor are not changed.
Back/Reload/Home button stops working.
However, mouse wheel scroll is working as expected though page would not be added.
Expected Results:
Page should be successfully resized without hang up
![]() |
Reporter | |
Comment 1•4 years ago
|
||
![]() |
Reporter | |
Comment 2•4 years ago
|
||
Gecko Profiler log: https://perfht.ml/2A0lxBf https://perfht.ml/2A58XAw
![]() |
Reporter | |
Comment 3•4 years ago
|
||
I try to find regression window using MozRegression.
Regression window:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=36c3240e5cafd7b57146bab3b177bfa47f42bcfa&tochange=2909b0a1eb06cc34ce0a11544e5e6826aba87c06
interesting thing: I got the same result as the regression range of Bug 1579596 comment#3.
![]() |
Reporter | |
Updated•4 years ago
|
![]() |
Reporter | |
Comment 4•4 years ago
|
||
Comment 5•4 years ago
|
||
Hi @Alice0775 White, tested in Windows 10 machine using nightly 71.0a1 - cannot reproduce it. Can you send please an about:support report?
I will set a component, if ins't the proper one please fell free to change it.
Regards,
Liviu
![]() |
Reporter | |
Comment 6•4 years ago
|
||
(In reply to Liviu Seplecan from comment #5)
Hi @Alice0775 White, tested in Windows 10 machine using nightly 71.0a1 - cannot reproduce it. Can you send please an about:support report?
I will set a component, if ins't the proper one please fell free to change it.
Regards,
Liviu
I already attached the about:support. See attachment above.
Comment 7•4 years ago
|
||
Hi, sorry for that, could be your GPU. I saw that you've reported another bug regarding the resize of the window's browser. None of those bugs can be reproduced in my end.
Regards,
Liviu
![]() |
Reporter | |
Comment 8•4 years ago
|
||
Different way of STR
- Open https://twitter.com/home and Logged in
- Wait for page loading
- Click one of twitter in "Trends for you"
- Wait for page loading
- Click Back button in Nav-Bar
--- abserve, contents hang up for a long period(more than 1 min)
Browser UI is working. But contents has the following problem.
Mouse left, right and middle click does not work.
Mouse hover effects and mouse cursor are not changed.
Back/Reload/Home button stops working.
However, mouse wheel scroll is working as expected though page would not be added. - if not reproduce repeat from step 3
I got a same regression window of comment#3
Comment 9•4 years ago
|
||
Hi @Alice0775 White, retest in nightly 71.0a1 - based on the STR from comment 8 => cannot reproduce.
Comment 10•4 years ago
|
||
(In reply to Liviu Seplecan from comment #9)
Hi @Alice0775 White, retest in nightly 71.0a1 - based on the STR from comment 8 => cannot reproduce.
Liviu -
I note Alice is not running webrender (device-too-old per about:support). Can you retry with webrender off (in win10)? Thanks
Comment 11•4 years ago
|
||
Hi @Randell Jesup, re-tested in latest nightly 71.0a1 (30-09-2019) using Direct3D 11 (Advanced Layers) instead of Webrender => same result - cannot reproduce. The hang issue in Alice's end couldn't be related to the Webrender, other behavior could be the cause of the issue.
Regards,
Liviu
Comment 12•4 years ago
|
||
The priority flag is not set for this bug.
:dbaron, could you have a look please?
For more information, please visit auto_nag documentation.
![]() |
Reporter | |
Comment 13•4 years ago
|
||
Nightly is unusable... hangup
Comment 14•4 years ago
|
||
This is totally not GPU related, this is just hanging up on layout during a resize reflow, which is weird because I can't repro either...
Alice, is there any chance to send me (via email or using the "Make attachment and comment private" in bugzilla) a version of the page that reproduces the issue for you?
Also, which viewport dimensions are you using when it happens?
I really don't understand how bug 1572216 could possibly cause this though, that's a build system change... :(
Any idea Mike (Nathan is away)?
![]() |
Reporter | |
Comment 15•4 years ago
|
||
Umm, I can reproduce the hang with the probability of almost 100%.
STR
- new profile
- Open https://twitter.com/FxSiteCompat
- Logged in
--- now 3 columns(menus/contents/search twitter) - Reduce browser width so that the right side pane will be out of viewport(i.e, menus/contents)
--- hang
(if not hang) - Enlarge browser window so that twitter has 3 columns
- go to step 4
![]() |
Reporter | |
Comment 16•4 years ago
•
|
||
(In reply to Alice0775 White from comment #15)
Umm, I can reproduce the hang with the probability of almost 100%.
STR
- new profile
- Open https://twitter.com/FxSiteCompat
- Logged in
--- now 3 columns(menus/contents/search twitter)- Reduce browser width so that the right side pane will be out of viewport(i.e, menus/contents)
--- hang
(if not hang)- Enlarge browser window so that twitter has 3 columns
- go to step 4
Mozregression shows (this is same as comment#3)
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=36c3240e5cafd7b57146bab3b177bfa47f42bcfa&tochange=2909b0a1eb06cc34ce0a11544e5e6826aba87c06
finally I got same regression window....
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=e07c0fe7189bd39dd52625006a29edbcac2e9750&tochange=c2141e08383873f70201cc84613ef5904cb27d07
Comment 17•4 years ago
|
||
Well, those STR still look pretty snappy here, no trace of hangs or whatever. But I'm not too surprised since the regression seems to point to a windows-only build change.
I still don't know why people can't reproduce it on Windows either though :(
![]() |
Reporter | |
Comment 18•4 years ago
|
||
Interesting,
setting layout.interruptible-reflow.enabled to false mitigates the hangup.
Comment 19•4 years ago
|
||
Being related to interruptible reflow makes it quite plausible that the bug might be reproducible on slower machines but not on faster ones. Changing the GECKO_REFLOW_MIN_NOINTERRUPT_DURATION
environment variable to a shorter/longer time may affect whether it's reproducable. (It defaults to 100
, which is a time in milliseconds.)
It seems possible that there might be something in the flex (or other) layout code that doesn't handle reflow interrupts correctly and thus leads to extreme performance problems in some cases. Is it normal for the bulk of flex layout to happen within nsFlexContainerFrame::MeasureAscentAndBSizeForFlexItem
as it does here?
(Does the hang end after some reasonable period of time, or does it hang forever?)
It's not clear to me why this would be a regression from cross-language LTO changes. But it's also plausible.
![]() |
Reporter | |
Comment 20•4 years ago
•
|
||
(Does the hang end after some reasonable period of time, or does it hang forever?)
The hang is for very long time(for 1-2min and more)
Assignee | ||
Comment 21•4 years ago
|
||
(In reply to David Baron :dbaron: 🏴 ⌚UTC-7 from comment #19)
It seems possible that there might be something in the flex (or other) layout code that doesn't handle reflow interrupts correctly and thus leads to extreme performance problems in some cases. Is it normal for the bulk of flex layout to happen within
nsFlexContainerFrame::MeasureAscentAndBSizeForFlexItem
as it does here?
I think that is often normal, yeah. That's our first attempt at reflowing a flex item, and often we don't need a second attempt (during the same reflow of its flex container), so it may show up as the bulk of the reflow time.
Note that we do try to cache the result of that function, but interrupted-reflow makes us get rid of the cached measurement, because we didn't finish measuring:
https://searchfox.org/mozilla-central/rev/35cc00a25c4471993fdaa5761952bd3afd4f1731/layout/generic/nsFlexContainerFrame.cpp#4294-4302
Changing the GECKO_REFLOW_MIN_NOINTERRUPT_DURATION environment variable to a shorter/longer time may affect whether it's reproducable. (It defaults to 100, which is a time in milliseconds.)
I can reproduce some minor hangs (a few seconds) if I reduce that env variable to 5 (effectively allowing reflow interruption right away, I think):
https://perfht.ml/2oxjAcU
I haven't been able to reproduce anything worse than that, though, using Alice0775's STR from comment 15.
Assignee | ||
Comment 22•4 years ago
•
|
||
I think the hangs get somewhat worse if I scroll down a bunch, e.g. loading FxSiteCompat tweets as far back as 2017. (Still using a value of 5 for the interruptible reflow env variable.)
Here's a profile after a window-resize after I've populated FxSiteCompat backscroll back through 2017: https://perfht.ml/2pqYmhm
It shows a ~7 second hang which includes two back-to-back ~3 second reflows. The first reflow is synchronously flushed inside of a SetWindowDimensions operation, and then after that there's a very short "resize" DOM event, and then there's a refresh driver tick with the second long reflow (2784ms).
Updated•4 years ago
|
Updated•4 years ago
|
Comment 23•4 years ago
|
||
Sean, could you help us with this major perf bug? It seems to be related to the layout.interruptible-reflow.enabled pref and we don't know who to send this to. Thanks!
Comment 24•4 years ago
|
||
Emilio and dholbert have been investigating. At this point, we're only able to repro if the interrupt is set very low (like 5ms), but that's the default setting. There's likely something going on where interrupts are not properly handled, but still TBD. We will keep working on it.
Comment 25•4 years ago
|
||
Emilio, can you handle this bug while Sean is on PTO? I am a bit worried that the bug is unassigned and we are in beta now.
![]() |
Reporter | |
Comment 26•4 years ago
|
||
(In reply to Pascal Chevrel:pascalc from comment #25)
Emilio, can you handle this bug while Sean is on PTO? I am a bit worried that the bug is unassigned and we are in beta now.
I can still reproduce the issue on Latest Nightly72.0a1(20191023094816),
However, I cannot reproduce on Firefox 71.0b3 (20191021164841).
(I tested with new profile both)
![]() |
Reporter | |
Updated•4 years ago
|
Comment hidden (offtopic) |
Assignee | ||
Comment 28•3 years ago
|
||
I'm taking a look.
Assignee | ||
Comment 29•3 years ago
•
|
||
I'm starting to understand a thing that goes wrong with interruptible reflow + flexbox here (which may be what's at fault).
- Twitter has tons of flex containers with abspos content
- When we get a pending interrupt during reflow, our flex container doesn't (itself) call
CheckForInterrupt()
- However, it does call FinishReflowWithAbsoluteFrames which (several layers down) does check for an interrupt in
nsAbsoluteContainingBlock::Reflow
. From this point on (until our next top-level reflow task starts)aPresContext->HasPendingInterrupt()
will return true. - When we finish reflowing a given nsFlexContainerFrame, it tests whether we've seen an interrupt and (if so) throws away all of its cached measurements:
void nsFlexContainerFrame::DidReflow(nsPresContext* aPresContext,
const ReflowInput* aReflowInput) {
// Remove the cached values if we got an interrupt because the values will be
// the wrong ones for following reflows.
[...]
if (aPresContext->HasPendingInterrupt()) {
for (nsIFrame* frame : mFrames) {
frame->DeleteProperty(CachedFlexMeasuringReflow());
}
}
- This would be an appropriate thing to do if we were bailing out all the way. However, we are not necessarily bailing out all the way -- we may get reflowed multiple times due to being e.g. a flex item that needs a measuring reflow + a final reflow, or e.g. due to being in a scrollframe (which reflows its content with + without scrollbar spacing subtracted out). If we had a cached measurement for our child content measurements, then the subsequent reflows would be cheaper; however, we've just thrown all of those cached measurements away, which ensures that we'll do a more-expensive "measuring" reflow of all our children who need it. And if those children are flex containers, then this recursively applies to them, too (they'll re-measure their own children on each repeated reflow as well, because they won't have any cached measurements either).
So this all blows up quite painfully, because we're throwing away (potentially-invalid) cached measurements which makes the rest of our interrupted reflow much more expensive.
Perhaps we should use a reflow callback to do the throwing-away....
Assignee | ||
Comment 30•3 years ago
|
||
(In reply to Daniel Holbert [:dholbert] from comment #29)
[...]
So this all blows up quite painfully, because we're throwing away (potentially-invalid) cached measurements which makes the rest of our interrupted reflow much more expensive.Perhaps we should use a reflow callback to do the throwing-away....
Better plan:
- Create a flex-container-specific frame state bit, that indicates "my flex item cached measurements are potentially invalid due to reflow having been interrupted"
- The
nsFlexContainerFrame::DidReflow()
"if HasPendingInterrupt()" clause should simply set this bit -- it shouldn't delete the cached measurement anymore. - When we look up a cached measurement: if
HasPendingInterrupt()
is true, then proceed as normal to let us finish up this reflow quickly. If it's false and the bit is set, then we delete the cached measurements for all flex items and clear the bit.
This ensures that one reflow-interruption will trigger a single cache-purging (per flex container). And the cache purging is "lazy" (happening after we've resumed layout when we touch the cached values), so that the interrupted layout can be interrupted ~quickly.
Assignee | ||
Updated•3 years ago
|
![]() |
||
Updated•3 years ago
|
Assignee | ||
Comment 32•3 years ago
|
||
Updated•3 years ago
|
Assignee | ||
Comment 33•3 years ago
|
||
To demonstrate the test's validity, here's a try run with the test but without the functional changes in this bug's patch:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee2529151d906768400f68f0327398486d3aeeab
...and here's a try run with the functional changes:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee2529151d906768400f68f0327398486d3aeeab
Comment 34•3 years ago
|
||
Pushed by dholbert@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/34f0f8db631a When a reflow is interrupted, don't purge flex item measurements until the next time they're needed in a later non-interrupted reflow. r=emilio
Assignee | ||
Comment 35•3 years ago
|
||
Assignee | ||
Comment 36•3 years ago
|
||
I forgot to adjust this bug's new logging line to use the new nsFlexContainerFrame logging macro that was just introduced in bug 1596973.
I added a trivial followup to make that change.
Comment 37•3 years ago
|
||
Pushed by dholbert@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/75035491e69f followup: change this bug's new logging to use new FLEX_LOG macro. (no review)
Comment 38•3 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/34f0f8db631a
https://hg.mozilla.org/mozilla-central/rev/75035491e69f
Assignee | ||
Comment 39•3 years ago
|
||
Alice0775 White, would you mind testing the latest Nightly to confirm that the issue has gone away? (I'm hoping it's better, with the changes here & the simplifying followup bug 1597348.)
Updated•3 years ago
|
![]() |
Reporter | |
Comment 40•3 years ago
|
||
(In reply to Daniel Holbert [:dholbert] from comment #39)
Alice0775 White, would you mind testing the latest Nightly to confirm that the issue has gone away? (I'm hoping it's better, with the changes here & the simplifying followup bug 1597348.)
I can reproduce the issue on Nightly 20191118093852.
And I verified fix on Nightly 20191119043902. No longer reproduce the problem. Thanks.
![]() |
Reporter | |
Updated•3 years ago
|
Assignee | ||
Comment 41•3 years ago
|
||
That's wonderful news. Thanks for your patience here (and of course for reporting and helping to diagnose this)!
(I still have no idea how/why the regression range points to bug 1572216... I'm guessing bug 1572216 somehow influenced the processing order of interruption-servicing, somehow, such that it became easier to trigger an interruption earlier in reflow & cause more/worse blowup.)
Description
•