2.51 - 2.54% raptor-speedometer-firefox (linux64-shippable, linux64-shippable-qr) regression on push dcf3a023d7d08907c6033a8c61c601a6b06e620a (Fri May 10 2019)

VERIFIED FIXED in Firefox 68

Status

defect
VERIFIED FIXED
Reported: Last month
Modified: Last month

People

(Reporter: Bebe, Assigned: Gijs)

Tracking

(Regression, {perf, regression})

Version: unspecified
Target Milestone: Firefox 68
Points: ---

Firefox Tracking Flags

(firefox-esr60 unaffected, firefox67.0.5 unaffected, firefox68 fixed)

Details

Attachments

(1 attachment)

Raptor has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=dcf3a023d7d08907c6033a8c61c601a6b06e620a

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

3% raptor-speedometer-firefox linux64-shippable opt 89.93 -> 87.65
3% raptor-speedometer-firefox linux64-shippable-qr opt 88.92 -> 86.68

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=20895

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

To learn more about the regressing test(s) or reproducing them, please see: https://wiki.mozilla.org/Performance_sheriffing/Raptor

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling

Product: Testing → Firefox
Regressed by: 1548941
Version: Version 3 → unspecified
Flags: needinfo?(gijskruitbosch+bugs)
Assignee

Comment 1

Last month

(In reply to Florin Strugariu [:Bebe] from comment #0)

> Raptor has detected a Firefox performance regression from push:
> 
> https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=dcf3a023d7d08907c6033a8c61c601a6b06e620a
> 
> As author of one of the patches included in that push, we need your help to address this regression.

But I'm not the author? Your push links to https://bugzilla.mozilla.org/show_bug.cgi?id=1548472 ...

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(fstrugariu)

Comment 2

Last month

:Gijs I know that.
Because of a bug in Perfherder we can't move alerts.

I created a new one with your push:

== Change summary for alert #20981 (as of Fri, 17 May 2019 06:55:05 GMT) ==

Regressions:

3% raptor-speedometer-firefox linux64-shippable opt 89.93 -> 87.65
3% raptor-speedometer-firefox linux64-shippable opt 90.02 -> 87.75
3% raptor-speedometer-firefox linux64-shippable-qr opt 88.92 -> 86.68

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=20981

Here is a graph link:
https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1924529,1,10&series=mozilla-inbound,1942689,1,10&zoom=1557440353882.692,1557441391884.7673,86.72305519817183,90.7136306702848&selected=autoland,1924529,472156,812043437,10

Flags: needinfo?(fstrugariu) → needinfo?(gijskruitbosch+bugs)
Assignee

Comment 3

Last month

(In reply to Florin Strugariu [:Bebe] from comment #2)

> :Gijs I know that.

OK, but that wasn't obvious... It just looked like a mistake had been made somewhere.

> Because of a bug in Perfherder we can't move alerts.

Then please mark the automated (?) comment obsolete and write a different one in future. As it is, the result is just confusing.

Anyway, as for this issue, I don't understand it. Is the speedometer test meant to run with e10s disabled or enabled? The "perf" profile used for talos enables e10s ( https://searchfox.org/mozilla-central/source/testing/profiles/perf/user.js#36 ) but I don't know where the raptor speedometer code lives and if it uses that or not. The logs don't give me any clue about this.

The changes in bug 1548941 literally do not affect anything except the decision to use / not use e10s. If the job is and was actually using e10s, then it makes no sense that there was any effect from my changes.

Are there before/after profiles or something like that, that would help investigate what happened?

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(fstrugariu)

Comment 4

Last month

The Raptor code which launches the desktop browser can be found here:
https://searchfox.org/mozilla-central/rev/94c6b5f06d2464f6780a52f32e917d25ddc30d6b/testing/raptor/raptor/raptor.py#581-601

There is nothing specific to e10s in that method. That e10s has to be used is set in the profile, which means e10s is always enabled, regardless of the --disable-e10s option:

https://searchfox.org/mozilla-central/rev/94c6b5f06d2464f6780a52f32e917d25ddc30d6b/testing/profiles/perf/user.js#36

Given that there is no indication whether e10s is enabled or not, we could add some additional logging in try builds of the code base before and after bug 1548941 landed. Maybe it really turns off e10s somehow.
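A minimal sketch of how one could check what that profile actually sets, using the same mozprofile pref reader that profileserver.py uses (the checkout path below is illustrative, not part of any harness):

    # Sketch: read the perf profile's user.js and report what
    # browser.tabs.remote.autostart ends up as. Adjust the checkout path.
    import os
    from mozprofile import Preferences

    MOZ_CENTRAL = os.path.expanduser("~/mozilla-central")  # illustrative path
    user_js = os.path.join(MOZ_CENTRAL, "testing", "profiles", "perf", "user.js")

    prefs = dict(Preferences.read_prefs(user_js))  # list of (name, value) pairs
    print("browser.tabs.remote.autostart =",
          prefs.get("browser.tabs.remote.autostart", "<not set>"))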

Assignee

Comment 5

Last month

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #4)

> Given that there is no indication whether e10s is enabled or not, we could add some additional logging in try builds of the code base before and after bug 1548941 landed. Maybe it really turns off e10s somehow.

It doesn't locally when run through mach, so I would be surprised if infra was different. Perhaps it turned off e10s before, but the quickest way to check that is to check profiles from one of these jobs - I thought we had profiles for all the raptor jobs?

Comment 6

Last month

We do, but SETA might have reduced the priority of those jobs, so we don't run them as often anymore. I triggered some profiling jobs for you:

https://treeherder.mozilla.org/#/jobs?repo=autoland&fromchange=7fbf58d80879cbf94f82bddddb747bdfb9e4caa4&tochange=dcf3a023d7d08907c6033a8c61c601a6b06e620a&searchStr=rap%2Cspeedometer%2Cprof

Assignee

Comment 7

Last month

Thanks, whimboo. So the profiles show both before/after are using e10s.

I don't really know what else to do here. I tried looking at the profiles, but they extend over the entire run of the test, and a 2% difference across the board is hard to pin down. Looking at the non-qr alert, none of the subtests really stand out (in fact, there are no significant differences and no confident differences...). The qr alert flags up a few things, notably in the "Elm" subtest for Elm-TodoMVC/CompletingAllItems/Async (whatever that means). But narrowing in on the Elm test in the profiles, I don't see significant differences off-hand. I see ~150ms setTimeout handler hangs in both... Looking at the "replicates" for that subtest, it seems that it is bimodal - sometimes it takes less than 10ms, sometimes between 45-80ms. In the "after" state we have more of the latter. I have no idea if that is a red herring or not.

Who normally works on these tests, and is there anything else I should investigate given that my change shouldn't have affected our JS implementation in any way?

Comment 8

Last month

Just an idea... what about reverting your change and pushing that to try with some speedometer profiling tests enabled? If the value doesn't jump back, your change might indeed not have caused that.

Assignee

Comment 9

Last month

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #8)

> Just an idea... what about reverting your change and pushing that to try with some speedometer profiling tests enabled? If the value doesn't jump back, your change might indeed not have caused that.

I think the data we have (from the non-profiled and profiled runs) is already pretty conclusive about my commit being correlated with the change in results. What's not clear is why. I don't think a trypush helps with that.

Comment 10

Last month

Try doing try pushes with the e10s decision result in that code hard-coded one way or the other. That will tell you whether the difference is based on the e10s decision in any way, or somehow related to the code that calculates it (maybe we're like checking env vars on a hot path or something).

Comment 11

Last month

(In reply to Bobby Holley (:bholley) from comment #10)

> Try doing try pushes with the e10s decision result in that code hard-coded one way or the other. That will tell you whether the difference is based on the e10s decision in any way, or somehow related to the code that calculates it (maybe we're like checking env vars on a hot path or something).

I mean, perhaps I'm being naive, but then searching for nsAppRunner in the profiler should turn up something, right? And it doesn't... Also, the code is meant to cache the result - ie https://searchfox.org/mozilla-central/rev/94c6b5f06d2464f6780a52f32e917d25ddc30d6b/toolkit/xre/nsAppRunner.cpp#4933-4936 - so the actual decision should only be made once...

I'm out for the next few hours but I guess I'll do some trypushes to be extra sure...

Comment 12

Last month

(In reply to :Gijs (he/him) from comment #11)

> (In reply to Bobby Holley (:bholley) from comment #10)
> 
> > Try doing try pushes with the e10s decision result in that code hard-coded one way or the other. That will tell you whether the difference is based on the e10s decision in any way, or somehow related to the code that calculates it (maybe we're like checking env vars on a hot path or something).
> 
> I mean, perhaps I'm being naive, but then searching for nsAppRunner in the profiler should turn up something, right? And it doesn't... Also, the code is meant to cache the result - ie https://searchfox.org/mozilla-central/rev/94c6b5f06d2464f6780a52f32e917d25ddc30d6b/toolkit/xre/nsAppRunner.cpp#4933-4936 - so the actual decision should only be made once...

Sure, but none of the explanations feel particularly plausible. When I have a situation where a seemingly-innocuous patch causes an unexpected result on CI, my first step is to bisect the patch until I find the minimum code delta that causes the difference.

Comment 13

Last month

How about this for a hypothesis: pgo tries to turn off e10s:

https://searchfox.org/mozilla-central/rev/0078b9e7d42c366b102d7aec918caf64fed1d574/build/pgo/profileserver.py#55-65

        profile_data_dir = os.path.join(build.topsrcdir, 'testing', 'profiles')
        with open(os.path.join(profile_data_dir, 'profiles.json'), 'r') as fh:
            base_profiles = json.load(fh)['profileserver']

        prefpaths = [os.path.join(profile_data_dir, profile, 'user.js')
                     for profile in base_profiles]

        prefs = {}
        for path in prefpaths:
            prefs.update(Preferences.read_prefs(path))

https://searchfox.org/mozilla-central/rev/0078b9e7d42c366b102d7aec918caf64fed1d574/testing/profiles/profiles.json#3

    "profileserver": ["base", "common", "unittest-required", "unittest-features", "profileserver"],

https://searchfox.org/mozilla-central/rev/0078b9e7d42c366b102d7aec918caf64fed1d574/testing/profiles/unittest-required/user.js#56

user_pref("browser.tabs.remote.autostart", false);

and the patch might have broken that (ie if the pgo build doesn't set the "no non-local connections" env var, we might now not let it turn off e10s). PGO changing in that way (can we even collect results from the child process, if we've never bothered to make pgo use e10s?) would explain seemingly unrelated perf tests changing on pgo builds...
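To make the hypothesis concrete, here is a rough Python rendering of the gating described above (the real check lives in C++ in nsAppRunner.cpp; the function name and shape here are illustrative only):

    import os

    def e10s_enabled(pref_autostart):
        # Hypothesised post-bug-1548941 behaviour: the
        # browser.tabs.remote.autostart pref is only honoured when the
        # "no non-local connections" test env var is present.
        if os.environ.get("MOZ_DISABLE_NONLOCAL_CONNECTIONS"):
            return pref_autostart  # automation: pref may turn e10s off
        return True                # otherwise e10s stays on (as for PGO builds)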

This is further backed up by the fact that, from visual inspection, I think the change might have regressed things on win64 and macosx64 as well -- but only on pgo, cf. https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-central,1706597,1,10&series=mozilla-central,1936616,1,10&zoom=1557235517603.7737,1557687177452.83,76.50093592954485,89.4784640194325 .

Assignee

Updated

Last month
See Also: → 1196094

Bug 1548941 restricted the circumstances under which we allow the
browser.tabs.remote.autostart pref to turn off e10s. The PGO profileserver.py
script relied on the unittest-required user.js prefs collection to turn off
e10s (see also bug 1196094) via this pref. For PGO builds, we do not set the
MOZ_DISABLE_NONLOCAL_CONNECTIONS env var, which meant that we stopped
honouring the pref to turn off e10s. Unfortunately, this meant that e10s was
now inadvertently switched on for the PGO profiling, which negatively
impacted Speedometer on PGO builds (and possibly other tests).

All this change does is re-disable e10s for PGO profiling. We should
investigate how to turn e10s on "properly" for PGO, but we can do that in
bug 1196094, without taking this temporary regression, especially as 68
branches.

Comment 17

Last month
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/3a6ea5b272d3
re-disable e10s when generating PGO profile data to fix speedometer perf regression r=froydnj
Status: ASSIGNED → RESOLVED
Closed: Last month
Resolution: --- → FIXED
Target Milestone: --- → Firefox 68

== Change summary for alert #21026 (as of Tue, 21 May 2019 11:30:38 GMT) ==

Improvements:

3% raptor-speedometer-firefox linux64-shippable-qr opt 86.52 -> 88.77
3% raptor-speedometer-firefox linux64-shippable-qr opt 86.59 -> 88.76
2% raptor-speedometer-firefox linux64-shippable opt 88.01 -> 89.94

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=21026

The regression was fixed.

Status: RESOLVED → VERIFIED