Open Bug 1601950 Opened 4 years ago Updated 21 days ago

7.07 - 8.42% about_preferences_basic (windows10-64-shippable, windows10-64-shippable-qr) regression on push 2d39de2be1ae9fa8063a19b459d8712fa90bc851 (Wed December 4 2019)

Categories

(Firefox :: Settings UI, defect, P2)

defect

Tracking

()

Performance Impact low
Tracking Status
firefox-esr68 --- unaffected
firefox71 --- unaffected
firefox72 --- unaffected
firefox73 + wontfix
firefox74 - wontfix
firefox75 --- wontfix

People

(Reporter: Bebe, Unassigned)

References

(Depends on 1 open bug, Regression)

Details

(5 keywords)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=2d39de2be1ae9fa8063a19b459d8712fa90bc851

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

8% about_preferences_basic windows10-64-shippable opt e10s stylo 134.24 -> 145.55
8% about_preferences_basic windows10-64-shippable-qr opt e10s stylo 134.55 -> 145.08
7% about_preferences_basic windows10-64-shippable-qr opt e10s stylo 135.50 -> 145.08

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=24293

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/TestEngineering/Performance/Talos

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/TestEngineering/Performance/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/TestEngineering/Performance/Talos/RegressionBugsHandling

Component: Performance → Preferences
Product: Testing → Firefox
Version: Version 3 → unspecified

Thanks for the heads-up.

Unfortunately I'm going to have to pass this off to mats/Gijs/ntim, as I'm going on parental leave imminently and don't have time to do much here.

Note that backing out the offending patch on its own is not a sustainable/complete option in and of itself, since that patch is necessary to address a UI regression that was caused by a rewrite of some UI code. We might need to back out the changesets in bug 1593060 comment 34 as well (the rewrite), or come up with an alternate fix for the regression.

Would it be possible to get performance profiles of this test from before & after the change?

(The main difference in bug 1593060 is for nsGridContainerFrame to call GetXULPrefSize instead of MeasuringReflow on its XUL children. I would expect the new function-call to be faster, but perhaps it's not? That's what I would suggest investigating in profiles, if someone gets a chance to do so while I'm away.)

Flags: needinfo?(fstrugariu)
See Also: → 1596416

(Also: I would bet that the grid involved in the perf regression here is not the grid that we were concerned with fixing in bug 1593060... It's probably a different grid that gets excercised more thoroughly in our talos tests. I think ntim may've converted the top-level about:preferences layout to use CSS grid -- perhaps that's the one that's experiencing the perf regression here?)

I'm a bit confused because

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #0)

8% about_preferences_basic windows10-64-shippable opt e10s stylo 134.24 -> 145.55

this first regression, in the graph link is https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&selected=1921202,604967,200.1247345861836,236.5533199818633,993357074&series=autoland,1921202,1,1&timerange=1209600&zoom=1575416176011,1575428918848,113.73193839866607,176.0067376248982 (zoomed in). There's a lot of noise in the graph, but actually, the beginning and end of that interval get me this pushlog:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=1a5918cfe128992841682258cdc35c71c785bd01&tochange=ea818e25e302a2412f5fcf740727dbb42c4b3aa4

which was 1-2 hours before the regression that happened on the QR builds ( https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&selected=1924035,605061,770.2694962623855,146.88073430694192,993473599&series=autoland,1924035,1,1&series=mozilla-inbound,1948550,1,1&timerange=1209600&zoom=1575412311738,1575444931875,126.68653590463158,157.27599559914404 and https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&selected=1924035,605061,569.128138773506,77.58729211133762,993473599&series=autoland,1924035,1,1&series=mozilla-inbound,1948550,1,1&timerange=1209600&zoom=1575408583723,1575462639949,127.41026724257418,149.31541249972992 ). The QR builds definitely point to bug 1593060 and don't have nearly as much noise. But I suspect the non-QR regression is a different regression on the same test.

Florin, do you want to set up a separate bug for that?


Anyway... I triggered profiles on the before and after for the builds that definitely regressed here. Unfortunately, that wasn't very enlightening. The test cycles through different about:preferences loads (no hash, #sync, #privacy, and #search). A bunch of these are multi-modal - e.g. on the "before" run, without profiling, the cycles for about:preferences#privacy look like this:

Values: 205.0  268.0  170.0  153.0  275.0  174.0  173.0  270.0  175.0  274.0  175.0  152.0  273.0  153.0  267.0  272.0  156.0  153.0  175.0  170.0  268.0  152.0  156.0  177.0  174.0

afterwards, they look like this:

Values: 165.0  248.0  247.0  252.0  249.0  252.0  250.0  256.0  256.0  252.0  255.0  247.0  248.0  137.0  250.0  253.0  252.0  252.0  252.0  258.0  257.0  248.0  160.0  140.0  253.0

So the result is now more consistent - only it's consistently worse. I don't know why that is. For the profiling jobs, because of the profiling overhead and the multi-modal-ness, the results were unhelpful.

Before: Values: 240.0 207.0

After: Values: 182.0 178.0

Yes, the "before" results are significantly worse than the "after" ones.

So... I'm a bit lost, tbh. :-\

The profiles do provide ample opportunity to fix dumb things that we're doing (like, say, this scrollTop = 0 call that triggers a sync reflow) but that's not really related to the regression.

(In reply to :Gijs (he/him) from comment #4)

The profiles do provide ample opportunity to fix dumb things that we're doing (like, say, this scrollTop = 0 call that triggers a sync reflow) but that's not really related to the regression.

That's bug 1425539.

Whiteboard: [fxperf]
Flags: needinfo?(mconley)

FWIW, I'm planning to back out bug 1593060 (or at least limit its scope) in bug 1602939 if the reporter says that the issue is fixed by the patch series.

If that bug lands, the perf regression here should be fixed as well.

(In reply to Tim Nguyen :ntim from comment #6)

FWIW, I'm planning to back out bug 1593060 (or at least limit its scope) in bug 1602939 if the reporter says that the issue is fixed by the patch series.

If that bug lands, the perf regression here should be fixed as well.

Might be worth checking that assumption (ie that it'd fix the perf regression) on try given that in bug 1602939 the suggestion is that that combination doesn't fix the issue in 1593060...

Yeah, I went with mats' suggested approach, which no longer backs out bug 1593060.

Flags: needinfo?(fstrugariu)

(In reply to :Gijs (he/him) from comment #4)

Florin, do you want to set up a separate bug for that?

I didn't see a response to this, Florin, did this get missed?

Flags: needinfo?(fstrugariu)

I missed that part here is the bug:
Bug 1604119 - 8.42% about_preferences_basic (windows10-64-shippable) regression on push ea818e25e302a2412f5fcf740727dbb42c4b3aa4 (Wed December 4 2019)

Flags: needinfo?(fstrugariu)

From looking at the profiles (page_4_pagecycle in the profiles) for privacy loads in the before and after case, I think the issue is that the layout is now slightly faster (as comment #2 suggests). which means there's an increased chance that we end up doing what (ironically) happens in the "before" profile rather than the "after" profile (check the JS stacks for the second event processing delay), which is initialize all the panes as a result of bug 1533771 before we get a chance to paint something other than a blank window. That ends up delaying the first time we paint a non-blank window, which is what trips the perf regression.

Mike, can you check that I've read the above right re: the difference between the N-modal values outlined in comment #4 ?

If I'm right, fixing bug 1533771 by making initialization wait until an idle task after the full "load" event has happened will probably improve things.

The other thing that's obvious from that profile is that we spend a lot of time (~20ms) in querySelectorAll calls for all the preference elements. It might be worth thinking about ways of fixing that, too. bug 1526274 is relevant - we call updateAllElements a lot at the moment. See also Dave's comment at the bottom of bug 1582740 comment 2 .

I took a look at these profiles, and I originally had the same thought as you Gijs - that the main thread gets distracted in the "before" profile with the idle callback from findInPage.js and ends up doing more pane initialization and that this is where the regression comes from.

However, looking at how this test works, it has fnbpaint = True which means that what's being measured is the content's notion of how long it took to paint something meaningful. We can see this range in the Marker Chart as "FirstNonBlankPaint" for both profiles, and this ends before / cuts out the idleCallback / pane init stuff.

So focusing in on the FirstNonBlankPaint for our profiles, we get:

Before: https://perfht.ml/34pU3BK
After: https://perfht.ml/2r6KaLI

In the timeline, that's 207ms vs 178ms which corresponds to the second runs in the logs:

Before:

[task 2019-12-07T13:31:47.684Z] 13:31:47     INFO -  PID 1376 | [#3] preferences#privacy  Cycles:2  Average:223.50  Median:240.00  stddev:23.33 (9.7%)
[task 2019-12-07T13:31:47.684Z] 13:31:47     INFO -  PID 1376 | Values: 240.0  207.0

After:

[task 2019-12-07T13:31:25.924Z] 13:31:25     INFO -  PID 7268 | [#3] preferences#privacy  Cycles:2  Average:180.00  Median:182.00  stddev:2.83 (1.6%)
[task 2019-12-07T13:31:25.924Z] 13:31:25     INFO -  PID 7268 | Values: 182.0  178.0

So, so far so good, now we're looking at the thing that we're measuring, and we've discarded a red-herring (bug 1533771 is indeed worth fixing, but not the cause of the regression here).

So we've got a difference of 29ms to account for.

What does jump out immediately in the "before" profile is this chunk: https://perfht.ml/2PtX13P

Looks like the paint suppression ends, and then a microtask runs, and then we do another style / layout flush because of this stack:

(root) []
RtlUserThreadStart [ntdll.pdb]
BaseThreadInitThunk [kernel32.pdb]
static int __scrt_common_main_seh() [firefox.pdb]
wmain [firefox.pdb]
static int NS_internal_main(int, char * *, char * *) [firefox.pdb]
XRE_main(int,char * * const,mozilla::BootstrapConfig const &) [xul.pdb]
XREMain::XRE_main []
int XREMain::XRE_main(int, char * *, const struct mozilla::BootstrapConfig & const) [xul.pdb]
nsresult XREMain::XRE_mainRun() [xul.pdb]
nsresult nsAppStartup::Run() [xul.pdb]
nsAppShell::Run() [xul.pdb]
nsBaseAppShell::Run() [xul.pdb]
MessageLoop::Run() [xul.pdb]
void MessageLoop::RunHandler() [xul.pdb]
void mozilla::ipc::MessagePump::Run(class base::MessagePump::Delegate *) [xul.pdb]
NS_ProcessNextEvent(nsIThread *,bool) [xul.pdb]
nsThread::ProcessNextEvent(bool,bool *) [xul.pdb]
nsresult mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::ParentProcessVsyncNotifier::Run() [xul.pdb]
void mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::TickRefreshDriver(struct mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, class mozilla::TimeStamp) [xul.pdb]
void mozilla::RefreshDriverTimer::Tick(struct mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, class mozilla::TimeStamp) [xul.pdb]
void mozilla::RefreshDriverTimer::TickRefreshDrivers(struct mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, class mozilla::TimeStamp, class nsTArray<RefPtr<nsRefreshDriver> > & const) [xul.pdb]
nsRefreshDriver::Tick []
void nsRefreshDriver::Tick(struct mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, class mozilla::TimeStamp) [xul.pdb]
mozilla::CycleCollectedJSContext::PerformMicroTaskCheckPoint(bool) [xul.pdb]
promise callback []
void mozilla::PromiseJobRunnable::Run(class mozilla::AutoSlowOperation & const) [xul.pdb]
JS::Call(JSContext *,JS::Handle<JS::Value>,JS::Handle<JS::Value>,JS::HandleValueArray const &,JS::MutableHandle<JS::Value>) [xul.pdb]
js::InternalCallOrConstruct(JSContext *,JS::CallArgs const &,js::MaybeConstruct,js::CallReason) [xul.pdb]
static bool PromiseReactionJob(struct JSContext *, unsigned int, class JS::Value *) [xul.pdb]
js::Call(JSContext *,JS::Handle<JS::Value>,JS::Handle<JS::Value>,js::AnyInvokeArgs const &,JS::MutableHandle<JS::Value>,js::CallReason) [xul.pdb]
js::InternalCallOrConstruct(JSContext *,JS::CallArgs const &,js::MaybeConstruct,js::CallReason) [xul.pdb]
static bool mozilla::dom::NativeHandlerCallback(struct JSContext *, unsigned int, class JS::Value *) [xul.pdb]
void mozilla::dom::`anonymous namespace'::PromiseNativeHandlerShim::ResolvedCallback(struct JSContext *, class JS::Handle<JS::Value>) [xul.pdb]
void ElementTranslationHandler::ResolvedCallback(struct JSContext *, class JS::Handle<JS::Value>) [xul.pdb]
void mozilla::dom::DOMLocalization::ApplyTranslations(class nsTArray<nsCOMPtr<mozilla::dom::Element> > & const, class nsTArray<mozilla::dom::L10nMessage> & const, class nsXULPrototypeDocument *, class mozilla::ErrorResult & const) [xul.pdb]
static void mozilla::dom::L10nOverlays::TranslateElement(class mozilla::dom::Element & const, const struct mozilla::dom::L10nMessage & const, class nsTArray<mozilla::dom::L10nOverlaysError> & const, class mozilla::ErrorResult & const) [xul.pdb]
mozilla::dom::FragmentOrElement::SetTextContentInternal(nsTSubstring<char16_t> const &,nsIPrincipal *,mozilla::ErrorResult &) [xul.pdb]
nsContentUtils::SetNodeTextContent(nsIContent *,nsTSubstring<char16_t> const &,bool) [xul.pdb]
nsINode::InsertChildBefore(nsIContent *,nsIContent *,bool) [xul.pdb]
mozilla::dom::MutationObservers::NotifyContentAppended(nsIContent *,nsIContent *) [xul.pdb]
nsCSSFrameConstructor::ContentAppended []
nsCSSFrameConstructor::ContentAppended(nsIContent *,nsCSSFrameConstructor::InsertionKind) [xul.pdb]
bool nsCSSFrameConstructor::MaybeConstructLazily(nsCSSFrameConstructor::Operation, class nsIContent *) [xul.pdb]
mozilla::dom::NoteDirtyElement(mozilla::dom::Element *,unsigned int) [xul.pdb]
profiler_get_backtrace() [xul.pdb]

So it looks like some kind of Promise is resolved in that microtask and Fluent modifies the DOM, which causes us to do flush style and layout again, and that's when it's deemed that FirstNonBlankPaint has occurred.

So I suspect we've got a cliff where in some cases, FirstNonBlankPaint happens before Fluent happens, and sometimes after.

Does that help?

Flags: needinfo?(mconley) → needinfo?(gijskruitbosch+bugs)

(In reply to Mike Conley (:mconley) (:⚙️) (Wayyyy behind on needinfos) from comment #12)

So I suspect we've got a cliff where in some cases, FirstNonBlankPaint happens before Fluent happens, and sometimes after.

Does that help?

A little... I end up with more questions. :-)

  1. am I interpreting your comment correctly if I rephrase the quote above to "in some cases, we manage to paint before more translation/fluent work 'dirties' style/layout, and sometimes not" ? And is that just a race? (ni :mconley)

  2. we shouldn't be finishing layout (and, thus, I assume, not painting) before localization has finished, cf. https://bugzilla.mozilla.org/show_bug.cgi?id=1518252 . Ni :gandalf for this. Maybe this is a result of gathering strings at runtime from JS code that runs when panes initialize, and that therefore bypasses the layout blocking? At this point, the privacy pane will be initializing, at least... though I also don't know why this would have been impacted by the regressing patch, which is a layout patch...

  3. the stack in comment #12 (elided above for brevity) ends in https://searchfox.org/mozilla-central/rev/c61720a7d0c094d772059f9d6a7844eb7619f107/dom/base/Element.cpp#4027 which is what would mark things as dirty so that we'll need to flush style/layout again to paint. But AIUI it doesn't actually sync-layout-flush, right? Ni :mconley for this, too. :-)

Flags: needinfo?(mconley)
Flags: needinfo?(gijskruitbosch+bugs)
Flags: needinfo?(gandalf)

(In reply to :Gijs (he/him) from comment #13)

A little... I end up with more questions. :-)

  1. am I interpreting your comment correctly if I rephrase the quote above to "in some cases, we manage to paint before more translation/fluent work 'dirties' style/layout, and sometimes not" ? And is that just a race? (ni :mconley)

Yeah, my hypothesis is that bug 1593060 caused the timings to change sufficiently such that during the test we more often wait for Fluent to run and dirty the DOM and use its paint as FirstNonBlankPaint. In the "good case", an earlier paint is counted as the FirstNonBlankPaint.

I agree that it's strange that we're computing styles / layout (multiple times!) before the translation occurs.

  1. the stack in comment #12 (elided above for brevity) ends in https://searchfox.org/mozilla-central/rev/c61720a7d0c094d772059f9d6a7844eb7619f107/dom/base/Element.cpp#4027 which is what would mark things as dirty so that we'll need to flush style/layout again to paint. But AIUI it doesn't actually sync-layout-flush, right? Ni :mconley for this, too. :-)

Yeah, that stack is the "Marker Cause" for the reflow - the actual flush is due to a refresh driver tick: https://perfht.ml/35CrmmC

Flags: needinfo?(mconley)

I agree that it's strange that we're computing styles / layout (multiple times!) before the translation occurs.

I don't understand why would we. That should not happen unless we inject localization lazily after layout was already triggered.

It does appear as if the initial translation completes, and only then do we start painting:

Before: https://perfht.ml/34ADnHM
After: https://perfht.ml/2tsaBvX

So that's good - that part seems to work as expected.

I'm starting to think that maybe there's a problem with our paint suppression stuff. Here's the before profile again:

Before: https://perfht.ml/2rZW0Yn

Here, we're unsuppressing the painting of the about:preferences#privacy document (and invalidating, which will cause a paint) because the 5ms paint suppression timer has fired.

Great. But how come we're reflowing / calculating display lists before that point? In this profile, we seem to do it twice, before the suppression has come off: https://perfht.ml/34AcdAP

Why? tnikkel, do you know? Does it make sense for the RefreshDriver to cause a style / layout flush while we're still paint suppressed?

Flags: needinfo?(gandalf) → needinfo?(tnikkel)

(In reply to Mike Conley (:mconley) (:⚙️) (Wayyyy behind on needinfos) from comment #16)

Great. But how come we're reflowing / calculating display lists before that point? In this profile, we seem to do it twice, before the suppression has come off: https://perfht.ml/34AcdAP

Looks like about:preferences#privacy is in the parent process. So we still have to paint and reflow in it's ancestor document, the window chrome. So we would expect painting and reflowing in the ancestor documents that aren't paint suppressed. When we go to build the display list for a paint suppressed iframe we should be just generating a single background color item and nothing else for that document. But we still need to build the display list for window chrome if we are asked to paint. Reflowing in the paint suppressed document should be avoided too. Can we tell if the reflowing in the profile is from the paint suppressed document or not?

Flags: needinfo?(tnikkel)

(In reply to Timothy Nikkel (:tnikkel) from comment #17)

Can we tell if the reflowing in the profile is from the paint suppressed document or not?

Yep - flip to the Marker Chart, and hover your mouse over the Reflow markers (or hover the Reflow markers in the timeline, either works), and it'll say that about:preferences#privacy is the thing reflowing.

Thanks.

I took a look around, I don't see anything that would prevent reflowing in a paint suppressed document (although sometimes we might want to do that if a js call makes us flush, doing it from the refresh driver seems like a waste), maybe I'm missing something.

So we should probably explore if we can do that.

This is being tracked for this release, so I guess it should be P1, but we don't currently have an assignee. :tnikkel, do you or someone else on the layout team have cycles to look at the layout / paint suppression side of this? And, out of curiosity, do you know if this is likely to also affect web pages, or is it specific to the case where we've got a parent-process document in the tab?

Flags: needinfo?(tnikkel)
Priority: -- → P1

(In reply to :Gijs (he/him) from comment #20)

This is being tracked for this release, so I guess it should be P1, but we don't currently have an assignee. :tnikkel, do you or someone else on the layout team have cycles to look at the layout / paint suppression side of this? And, out of curiosity, do you know if this is likely to also affect web pages, or is it specific to the case where we've got a parent-process document in the tab?

This would affect any page AFAICT. The fact that this seems like a pretty obvious optimization that we haven't done makes me feel like I'm missing something, that either it's much harder to do this then I think, or that we can't do it due to correctness reasons or something else, or maybe it doesn't affect pages that are at the root of a process. AFAIK nothing has changed recently with our paint suppression so it's likely been this way for quite a while. So hoping that/depending on avoiding reflow during paint suppression to fix this for the next release is perhaps not the best idea.

I can maybe take a quick pass at a hacky patch to throw at try server to see how much breakage it would cause.

Flags: needinfo?(tnikkel)

Hey Sean - this is more of a Layout task - can someone on your team take a look?

Flags: needinfo?(svoisen)

We are pretty resource-constrained until January. Not sure we will be able to make much progress until then from the layout side, but will try. I'll keep the ni? on myself as a reminder.

Whiteboard: [fxperf] → [fxperf:p3]

Reminding myself that hopefully kamidphish can look into the paint suppression part once he's back and re-opens need-infos :)

Looking at the graphs, this is a pretty messy test suite and it's pretty hard to say whether this regression is even really there or not. I also don't see any clear change in the Beta graphs after 73 merged there in the beginning of January. Based on that, it doesn't seem like this bug is looking very actionable for 73 at this point. I'll hold off on resolving the bug outright pending comment 24.

(In reply to Sean Voisen (:svoisen) from comment #24)

Reminding myself that hopefully kamidphish can look into the paint suppression part once he's back and re-opens need-infos :)
Hi Sean, did kamidphish come back and is able to look into this bug?

Jared, do you think the priority of this bug should remain P1?

Flags: needinfo?(jaws)

(In reply to Ethan Tseng [:ethan] from comment #26)

Hi Sean, did kamidphish come back and is able to look into this bug?

Just now revisiting this because I somehow missed this reply. Anyway, kamidphish is no longer working on layout.

It would be good to know if this is still a P1 bug for front-end.

Flags: needinfo?(svoisen)

(In reply to Sean Voisen (:svoisen) from comment #28)

(In reply to Ethan Tseng [:ethan] from comment #26)

Hi Sean, did kamidphish come back and is able to look into this bug?

Just now revisiting this because I somehow missed this reply. Anyway, kamidphish is no longer working on layout.

It would be good to know if this is still a P1 bug for front-end.

I think realistically it's not a P1 in that we stopped tracking it for a release, but between comment #16 through comment #19 I think this should probably still be P2 in terms of seeing if we could improve performance (for webpages as well as about:preferences) by avoiding the layout work we're doing off vsync notifications while the document is being paint-suppressed (which is basically wasted time...). Does that help?

Flags: needinfo?(jaws) → needinfo?(svoisen)

Keeping this on our radar. Looks like we have an old bug filed for this work: bug 1320435.

Depends on: 1320435
Flags: needinfo?(svoisen)
Severity: normal → --
Priority: P1 → P2
Severity: -- → S4
Has Regression Range: --- → yes
Performance Impact: --- → low
Keywords: perf:frontend
Whiteboard: [fxperf:p3]
You need to log in before you can comment on or make changes to this bug.