Closed Bug 1590783 Opened 5 years ago Closed 4 years ago

26.63% startup_about_home_paint_realworld_webextensions (macosx1014-64-shippable) regression on push 6faac02d6d1ffe5cf2023f70c38f90b50948eb65 (Mon October 21 2019)

Categories

(Infrastructure & Operations :: RelOps: Posix OS, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: marauder, Unassigned)

References

Details

(4 keywords)

Talos has detected a Firefox performance regression around October 21st.
I did several retriggers and backfills, and the suspected regression point kept moving backwards to earlier pushes.
When I saw that pattern, I decided to retrigger two datapoints from further in the past:
2233060e1f08a - October 15th
09f5cd302da54 - October 11th
and the regression showed up there as well.

Graph url:
https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=autoland,2131797,1,1&series=mozilla-inbound,2131854,1,1&timerange=1209600&zoom=1570689228000,1571889862000,1008.5555555555555,3324.4444444444443

Regressions:
27% startup_about_home_paint_realworld_webextensions macosx1014-64-shippable opt e10s stylo 1,833.46 -> 2,321.67
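As a quick sanity check using only the values reported in the alert above, the relative change works out to the 26.63% figure in the bug title; the 27% above is simply a rounded form:

```python
# Relative change computed from the alert's reported values.
baseline, regressed = 1833.46, 2321.67
pct = (regressed - baseline) / baseline * 100
print(f"{pct:.2f}%")  # prints 26.63%, which the alert summary rounds to 27%
```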

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=23519

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/TestEngineering/Performance/Talos

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/TestEngineering/Performance/Talos/Running

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/TestEngineering/Performance/Talos/RegressionBugsHandling

Assignee: nobody → infra
Blocks: 1578356
Component: Performance → Infrastructure: Other
Product: Testing → Infrastructure & Operations
QA Contact: cshields
Version: Version 3 → unspecified

I think this is an infra change; can someone confirm?
Thank you!

I've moved this over to the correct component for macOS infra, though from looking at the patch you cite it doesn't look like an infra issue. NI'ing the patch author to comment and potentially move this bug.

Assignee: infra → nobody
Component: Infrastructure: Other → RelOps: Posix OS
Flags: needinfo?(choller)
QA Contact: cshields

The patch in the push from comment 0 does not touch any release code. The code patched there is fuzzing-only and cannot be the source of your regression.

Flags: needinfo?(choller)

(In reply to Christian Holler (:decoder) from comment #3)

The patch in the push from comment 0 does not touch any release code. The code patched there is fuzzing-only and cannot be the source of your regression.

The patch from comment 0 isn't really relevant here. As perf sheriffs, we're interested in knowing whether any infra changes were made on the macOS platforms around October 20-21.

Marian, please update comment 0 by deleting that commit URL (it's indeed confusing) and replacing it with the approximate date when the infra changes were first noticed in our graphs.

Flags: needinfo?(marian.raiciof)
Flags: needinfo?(choller)

I updated the first comment to better explain what is happening.

Flags: needinfo?(marian.raiciof)
Flags: needinfo?(dhouse)
See Also: → 1591010

This is likely caused by my changes to the macOS systems on Oct 21st.

I changed the log forwarding configuration (forwarding fewer log entries: filtering out entries below error level for all processes except generic-worker, kernel, and sudo). I also added monitoring, which I disabled that afternoon (I checked across the machines this morning and confirmed the monitoring (telegraf) service is not running).

I'll try running these tests on a staging worker with the logging changed back to how it was previously.
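For readers unfamiliar with the change, here is a minimal sketch of the forwarding policy described above. It is illustrative only; it is not the actual Puppet/log configuration from the PRs linked below, and the level ordering and process names are assumptions:

```python
# Illustrative sketch only -- models the described policy: always forward
# entries from generic-worker, kernel, and sudo, but drop entries below
# error level for every other process. Level names/ordering are assumed.

LEVELS = ["debug", "info", "notice", "warning", "error", "critical"]
ALWAYS_FORWARD = {"generic-worker", "kernel", "sudo"}

def should_forward(process: str, level: str) -> bool:
    """Return True if a log entry should be forwarded to the aggregator."""
    if process in ALWAYS_FORWARD:
        return True
    return LEVELS.index(level) >= LEVELS.index("error")

# Example: an unrelated info-level message is now dropped,
# while generic-worker entries are forwarded at any level.
assert not should_forward("mdworker", "info")
assert should_forward("generic-worker", "debug")
```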

Dave, could you link the bug in question here? Or provide a link to the PR, so we can properly conclude this bug?

(In reply to Ionuț Goldan [:igoldan], Performance Sheriff from comment #7)

Dave, could you link the bug in question here? Or provide a link to the PR, so we can properly conclude this bug?

Ionut, thanks! Here is the PR: https://github.com/mozilla-platform-ops/ronin_puppet/pull/126
and the bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1585750

Flags: needinfo?(dhouse)

(In reply to Dave House [:dhouse] from comment #8)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriff from comment #7)

Dave, could you link the bug in question here? Or provide a link to the PR, so we can properly conclude this bug?

Ionut, thanks! Here is the PR: https://github.com/mozilla-platform-ops/ronin_puppet/pull/126
and the bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1585750

That PR was for turning off the monitoring to fix the test failures (jobs running over their allotted time) on the 21st.

These are the changes that were applied and likely caused the problem:
logging change: https://github.com/mozilla-platform-ops/ronin_puppet/pull/125 (still active in prod)
monitoring service: https://github.com/mozilla-platform-ops/ronin_puppet/pull/118 (disabled by the PR above)

Flags: needinfo?(choller)
Blocks: 1592626
No longer blocks: 1592626