Open Bug 1616236 Opened 2 months ago Updated 11 days ago

"tp5n time_to_session_store_window_restored_ms opt e10s stylo" is too variable to be useful in a single 5-run

Categories

(Testing :: Talos, defect, P3)

Version 3
defect

Tracking

(Not tracked)

People

(Reporter: standard8, Unassigned)

References

(Blocks 1 open bug)

Details

I've been doing various try runs with the aim of hunting down issues with a patch I've been trying to land.

I did test builds with 5 rebuilds a couple of weeks ago:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=49b9c64323badacb3e5aa4a12ff51c61f02321c3&newProject=try&newRevision=05bb331815307dbf0221bc22ecb5e1005744c0b6&framework=1#table-header-561038570

This shows a 14.41% regression (medium confidence) on windows7-32-shippable.

Today, I did another set of builds, using exactly the same m-c base and applied patches:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=87fd18a4071212bc00840c13378d82c816df974c&newProject=try&newRevision=30e6909acffed393b647baebc728890a447c5b8f&framework=1#table-header-561038570

This shows a 7.39% improvement (medium confidence) on windows7-32-shippable.

On windows10-64, the first comparison showed an 8.22% change (low confidence), which dropped to 0.06% (low confidence) in the second.

This variability implies that either 5 runs aren't enough, or something else is affecting the test machines and making the results inconsistent.

time_to_session_store_window_restored_ms is useful when run locally, but less useful in automation because it isn't run on dedicated hardware there. We should probably not fire alerts based on that metric.

:standard8, 5 runs is definitely not enough here. In this case, you should probably run at least 30 trials for the test on windows7-32. The improvement or regression in your change is probably very small if you are running into this issue. The variability of this metric is at least 10%.
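
As a rough illustration of why 5 runs falls short, here is a back-of-the-envelope sketch. It assumes the "at least 10%" variability can be read as a 10% coefficient of variation, that runs are independent and roughly normal, and that base and patch have the same number of runs. The function name `min_detectable_change` is made up for this example and is not part of Talos or Perfherder, and a bimodal metric will behave worse than this estimate.

```python
import math


def min_detectable_change(cv_percent, runs_per_side, z=1.96):
    """Approximate smallest difference in means (in percent) that stands
    out from noise at roughly 95% confidence.

    Assumptions (not taken from Perfherder): independent, roughly normal
    runs, with the same coefficient of variation on both sides.
    """
    # Standard error of the difference between two means, in percent.
    standard_error = cv_percent * math.sqrt(2.0 / runs_per_side)
    return z * standard_error


for runs in (5, 30):
    print(runs, round(min_detectable_change(10.0, runs), 2))
# 5 12.4
# 30 5.06
```

Under those assumptions, a 5-run comparison cannot separate anything smaller than about 12% from noise, which is in the same range as both the 14.41% regression and the 7.39% improvement reported above. With 30 runs the threshold drops to about 5%.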

Looking at the one with a 14% regression, there's an outlier in the data that is throwing it off.

That said, even without the outlier it's still a regression. I'm adding this issue to the fxperftest triage discussion topics because those results are contradictory, and I wonder if we have anything in the works to help with this.
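
To show how a single outlier can distort a 5-run comparison, here is a small sketch. The numbers are invented for illustration and are not the values from the try pushes above, and `percent_change` is a hypothetical helper, not a Perfherder function.

```python
import statistics


def percent_change(base, new, summarize):
    """Percent change from base to new, using the given summary function."""
    base_value = summarize(base)
    new_value = summarize(new)
    return 100.0 * (new_value - base_value) / base_value


# Invented timings in ms; the last "new" run is a deliberate outlier.
base_runs = [1000, 1010, 990, 1005, 995]
new_runs = [1020, 1030, 1010, 1025, 1600]

print(round(percent_change(base_runs, new_runs, statistics.mean), 2))    # 13.7
print(round(percent_change(base_runs, new_runs, statistics.median), 2))  # 2.5
```

With only 5 runs, one bad run moves the mean by more than 10%, while the median still shows a small regression. That matches the shape of what was described above: the outlier inflates the number, but a regression remains without it.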

I want to mention that I did notice a regression in the perfherder data starting around Feb. 6th. The metric's value increased and the variability also increased (it became more bi-modal) with a change that occurred around that point: https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&highlightedRevisions=49b9c64323ba&highlightedRevisions=05bb33181530&selected=1922259,1042059648&series=try,1915518,1,1&series=mozilla-central,1941169,1,1&series=autoland,1922259,1,1&timerange=31536000&zoom=1580262151795,1582217411881,537.8701858441731,1530.8610954716917

Priority: -- → P3
Whiteboard: [perftest:triage]

Thanks for the information. Unfortunately, this combined with bug 1614805 makes my results hard to analyse, but for now I think my patches have no major overall issues. I've added these two bugs to the xperf section on the wiki so that other people hitting issues here can hopefully find them more easily.

We will look into this more in our sheriffable/non-sheriffable efforts being done in bug 1573129.

Depends on: 1573129
Whiteboard: [perftest:triage]
Blocks: 1573129
No longer depends on: 1573129