Closed Bug 1465806 Opened 2 years ago Closed 2 years ago

Reduce DAMP variance by waiting for pending browser code to complete

Categories

(DevTools :: Framework, defect)


Tracking

(firefox62 fixed)

RESOLVED FIXED
Firefox 62

People

(Reporter: ochameau, Assigned: ochameau)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

While looking at the profiler stacks for fake empty tests, I realized that various Firefox code overlaps these fake tests.

I took two tests, removed everything in them, and just made them wait for 10s.
Looking at the profiler, I saw that the addon manager and session restore are often still running when we start executing the first test (cold.inspector.open).
testSetup is also still executing some code related to AsyncTabSwitcher and tab animations (this comes from the addTab call), and the preloaded content process gets created at a fairly random time.
And finally, testTeardown also spawns pending code whose completion we don't wait for; this time it is related to the tab closing animation as well as the destruction of the content process.
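
As a rough sketch of the direction for a fix (assumed harness placement, not the exact patch): before the first test, wait for the observer notifications that signal this startup work is done. "sessionstore-windows-restored" is a real topic fired by session restore; whichever signal the addon manager exposes would be awaited the same way.

const { Services } = ChromeUtils.import("resource://gre/modules/Services.jsm");

// Resolve once the given observer notification has fired.
function waitForObserverTopic(topic) {
  return new Promise(resolve => {
    function observer() {
      Services.obs.removeObserver(observer, topic);
      resolve();
    }
    Services.obs.addObserver(observer, topic);
  });
}

// In the async test setup, before cold.inspector.open runs:
// await waitForObserverTopic("sessionstore-windows-restored");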
Assignee: nobody → poirot.alex
Here is a DAMP comparison of two try pushes against mozilla-central tip:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=de1f320656896a1f56934ff351f7e7e2e08cc64a&newProject=try&newRevision=2c9dfc9ae092786723d1f9bf5ae712ecb63fbd76&originalSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&newSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&framework=1&showOnlyNoise=1

And a comparison between two try pushes of the same tip with this patch applied:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=1ee053ba0bb87a9407510ef5c97fc5b2de1b3a93&newProject=try&newRevision=a74ea772b3145067743a1a92de3a0cf4135a94f1&originalSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&newSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&framework=1&showOnlyNoise=1

The "noise" metric of PerfHerder speaks by itself.
We go from:
  48.64 ± 2.06% 	< 	66.58 ± 1.50% 	36.87%
to:
  51.12 ± 1.96% 	< 	51.14 ± 1.96% 	0.03%

I don't know exactly how to read this data, but it looks like the absolute noise value is reduced:
we go from an average of (48.64 + 66.58) / 2 ≈ 57.6 down to about 51.1.
But what is more striking is that the noise between two runs looks very stable.

Joel, do I interpret this data correctly?

Otherwise, a couple more stats about the standard deviation percentages.
We go from:
 an average of 7.87, a min of 0.13 and a max of 234.74
to:
 an average of 6.82, a min of 0.23 and a max of 148.7
It confirms that the typical deviation between two test runs is slightly reduced and that the maximum is also possibly lowered.

Otherwise, here is a comparison between m-c and this patch; it highlights many changes to test baselines:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=de1f320656896a1f56934ff351f7e7e2e08cc64a&newProject=try&newRevision=1ee053ba0bb8&originalSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&newSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&showOnlyConfident=1&framework=1

Finally, here are even more numbers, from a custom view of the talos data.
It isn't entirely clear what all these boxplots and numbers represent, but they help to see the trend:
* Diff between two m-c tip:
https://screenshots.firefox.com/IReUSqs0TrzftNER/localhost
average of all test averages (-0.14) average stddev(-1.66)

* Diff between two push with this patch:
https://screenshots.firefox.com/dCoHWnnb0f58LUu1/localhost
average of all test averages (0.09) average stddev(0.82)

* Diff between m-c and this patch:
https://screenshots.firefox.com/xpOgyGaYJrwoZFJI/localhost
average of all test averages (1) average stddev(-4.52)
Flags: needinfo?(jmaher)
Thanks for working on this; it is hard to fully understand what you have here. I typically mouse over in the compare view to see a distribution of the numbers and their raw values. That gives me a good idea of whether there are outliers and of the range the data falls in.

I think this reduces the noise a good bit. Keep in mind this is only Linux; we run this test on OSX and Windows as well.
Flags: needinfo?(jmaher)
Comment on attachment 8982283 [details]
Bug 1465806 - Wait for addon manager and session restore full initialization before starting DAMP tests.

https://reviewboard.mozilla.org/r/248224/#review254736

Personally I am not yet convinced about the effect of this change on the noise and deviation of DAMP.
If we take the last link you posted, and look at the noise:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=de1f320656896a1f56934ff351f7e7e2e08cc64a&newProject=try&newRevision=1ee053ba0bb8&originalSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&newSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&framework=1&showOnlyNoise=1

Without patch: 47.66
With patch: 50.17

This is the opposite of what your other runs seemed to highlight. In particular, the "close" tests seem to have a higher standard deviation than before.
I think we need more runs to know whether this improves the situation or not. So two solutions:
- do more runs and make sure the noise and deviations are reduced with the patch
- land and see the impact on the dashboard over time

Given that this impacts the baseline of the tests, it may be better to have a few more runs before landing.

I am setting r+ because I don't have particular comments about the implementation, but I would really like to see more data before landing this.

::: testing/talos/talos/tests/devtools/addon/content/damp.js:235
(Diff revision 1)
>    async testSetup(url) {
>      let tab = await this.addTab(url);
> -    await new Promise(resolve => {
> -      setTimeout(resolve, this._config.rest);
> -    });
>      return tab;

Even before there was no reason to have a dedicated testSetup method on top of addTab. The new version just makes it more obvious but we should either inline addTab in testSetup or file a follow up to get rid of testSetup.

::: testing/talos/talos/tests/devtools/addon/content/damp.js:263
(Diff revision 1)
>    // Everything below here are common pieces needed for the test runner to function,
>    // just copy and pasted from Tart with /s/TART/DAMP

Not directly related to your patch, but this is probably not anywhere close to the truth now :) we should remove this comment.
Attachment #8982283 - Flags: review?(jdescottes) → review+
Blocks: 1466479
(In reply to Julian Descottes [:jdescottes][:julian] from comment #4)
> Comment on attachment 8982283 [details]
> Bug 1465806 - Wait for browser code to complete before starting DAMP tests
> and wait for tab creation/destruction thoroughly.
> 
> https://reviewboard.mozilla.org/r/248224/#review254736
> 
> Personally I am not yet convinced about the effect of this change on the
> noise and deviation of DAMP.
> If we take the last link you posted, and look at the noise:
> https://treeherder.mozilla.org/perf.html#/
> comparesubtest?originalProject=try&originalRevision=de1f320656896a1f56934ff35
> 1f7e7e2e08cc64a&newProject=try&newRevision=1ee053ba0bb8&originalSignature=f79
> b6a4f1f8f53326d3056f6c8008c0ff4de0a94&newSignature=f79b6a4f1f8f53326d3056f6c8
> 008c0ff4de0a94&framework=1&showOnlyNoise=1
> 
> Without patch: 47.66
> With patch: 50.17

Yes, sometimes, the overall variance can still be as high as before.
Note that in my process, I only focused on cleaning up the noise of:
* cold.inspector.open, the very first test being run, by waiting for the addon manager and session restore,
* *.open tests, by cleaning up the noise after testTeardown/closeTab,
* *.open tests, by cleaning up the noise after addTab.

So I haven't really looked into any noise in the close tests.
But I may have increased the baseline of close as we are waiting more.

But I think you are getting confused by the dashboard data here.
According to this blog post:
https://elvis314.wordpress.com/2017/11/16/measuring-the-noise-in-performance-tests/
the 47 and 50 numbers are a "variance" computation, maybe the average of all subtest variances?
(If your statistics are rusty, this page helps with understanding variance and stddev, with dogs :)
 https://www.mathsisfun.com/data/standard-deviation.html)
The issue with this computation is that the tests don't all have the same baseline, so averaging all the subtest variances is subject to the same issues as the "summary" value of DAMP: each subtest has a slightly different impact on the final aggregated value.

A better read of the noise between two try pushes is to look at the stddev as a percentage, and at more than just one metric: boxplot data is a better way to fully understand what is going on.

This stddev is displayed next to each subtest result, for example for cold.inspector.open:
"703.94 ± 3.50%" ==> the average duration for this test is 703.94 ms and the stddev is 3.50% (if you hover this, the black popup shows the absolute deviation: 24.63)

Now, for this comparison page, run this script in the console:

function stddevStats(runs) {
  // Collect the non-zero stddev percentages of every subtest row for a run.
  let list = [...document.querySelectorAll(`[replicates="${runs}"]`)]
    .map(e => parseFloat(e.getAttribute("stddevpct")))
    .filter(e => !!e);
  let t = 0, min = Infinity, max = -Infinity;
  list.forEach(e => { t += e; min = Math.min(min, e); max = Math.max(max, e); });
  list.sort((a, b) => a - b);
  return "avg:" + t / list.length +
    " q1:" + list[Math.ceil(list.length / 4)] +
    " median:" + list[Math.ceil(list.length / 2)] +
    " q3:" + list[Math.ceil(3 * list.length / 4)] +
    " min:" + min + " max:" + max;
}
stddevStats("compareResult.originalRuns"); // base run
stddevStats("compareResult.newRuns");      // new run

You will get this data for the base run:
avg:7.288333333333335 q1:0.71 median:1.76 q3:4.08 min:0.13 max:234.74
And this for the new run:
avg:6.918695652173913 q1:0.63 median:1.47 q3:4.24 min:0.23 max:148.7

This is the boxplot data for the standard deviation in percentage, for the base and then the new runs.

Now it becomes clearer that, overall, the noise is reduced:
the standard deviation percentage of each subtest is globally reduced.

Here is another try run comparing m-c with this patch:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=a18826420dbd&newProject=try&newRevision=5e61e9463e1f&originalSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&newSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&framework=1

We go from:
  avg:8.282808988764046 q1:0.66 median:1.68 q3:4.19 min:0.24 max:225.71
to:
  avg:6.0690217391304335 q1:0.59 median:1.55 q3:5.3 min:0.16 max:151.18

And if I compare 4 new try pushes, two with m-c and two with this patch, we go from:
  avg:9.052166666666668 q1:1.41 median:4.62 q3:7.06 min:0.24 max:225.71
down to:
  avg:7.013478260869564 q1:0.65 median:1.48 q3:4.96 min:0.16 max:229.45

Finally, another try push with 30 rebuilds:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=83b293a300eef9c8ad2b6b56067a4c4017b6755c&newProject=try&newRevision=64e1fa2c133bd590de3b0d084b203bcda282262c&originalSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&newSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&framework=1

We go from:
  avg:7.437173913043478 q1:0.83 median:1.85 q3:4.49 min:0.29 max:188.63
to:
  avg:11.565 q1:0.77 median:1.81 q3:4.82 min:0.3 max:468.14
This tells the opposite of all the previous try pushes :/
But looking at the actual results, it matches what I observed locally:
* cold.inspector.open is significantly faster, as a lot of Firefox startup code no longer executes in the middle of it,
* testTeardown no longer overlaps the next tests. Its effect seems small enough that it only impacted tests with a short runtime:
  console.openwithcache.open, simple.inspector.open, simple.netmonitor.open and simple.styleeditor.open.

I would easily ignore the impact on close and settle for the wins on cold.inspector.open, console.openwithcache.open and simple.inspector.open. We are getting close to 1% noise or under, which is a very significant win for tracking the more subtle regressions.
Also, this work isn't only about the dashboard results: I verified every one of these changes locally with the profiler.
Each change helped prevent code external to the test from appearing in perf-html while the tests were executing.

I imagine another follow-up would be to do similar optimizations, but with profiles collected from try pushes...

> ::: testing/talos/talos/tests/devtools/addon/content/damp.js:235
> (Diff revision 1)
> >    async testSetup(url) {
> >      let tab = await this.addTab(url);
> > -    await new Promise(resolve => {
> > -      setTimeout(resolve, this._config.rest);
> > -    });
> >      return tab;
> 
> Even before there was no reason to have a dedicated testSetup method on top
> of addTab. The new version just makes it more obvious but we should either
> inline addTab in testSetup or file a follow up to get rid of testSetup.

I'll inline it for now and open a follow-up to clean up setup/teardown.
testTeardown calls the next test, whereas it should be up to damp.js to automatically call the next one once the subtest async function is done...
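
A minimal sketch of that suggested follow-up (hypothetical names, not the actual damp.js code): the harness drives the sequence itself, instead of testTeardown chaining to the next test.

async function runAllTests(tests) {
  for (let { url, run } of tests) {
    let tab = await addTab(url);  // setup, fully awaited
    await run(tab);               // the subtest async function
    await closeTab(tab);          // teardown, fully awaited before moving on
  }
}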

> ::: testing/talos/talos/tests/devtools/addon/content/damp.js:263
> (Diff revision 1)
> >    // Everything below here are common pieces needed for the test runner to function,
> >    // just copy and pasted from Tart with /s/TART/DAMP
> 
> Not directly related to your patch, but this is probably not anywhere close
> to the truth now :) we should remove this comment.

Oh, yes more cleanup is coming. I already have a followup patch to remove the useless frontend.
Thanks for the detailed feedback. I used the variance average because it was easier to point to, but I did compare the variance test by test. I prefer to be cautious here because the code will be complicated to maintain. I agree the goal is not the dashboard; the goal is to be able to detect smaller regressions. I don't doubt your patch reduced/shifted the noise, but I was trying to see whether the noise was reduced enough to give us better confidence to detect regressions.

Landing the patch and seeing the evolution of the results over time will be the best way to check the added value.
These results disturb me as well.

I'm splitting this patch in 4:
1) wait on startup
2) disable addons via the prefs
3) better wait on addTab
4) better wait on closeTab

And see if at least one reports a clear and expected result.
If none of them does, I would take it as: "everything is still very far from being trustworthy and a lot more work is necessary on the runtime and/or the dashboard to trust any of this!"
(In reply to Alexandre Poirot [:ochameau] from comment #8)
> I'm splitting this patch in 4:
> 1) wait on startup

\o/

Finally, something makes sense.
The first patch, which only waits for full AddonManager/SessionRestore startup, shows a noise reduction on every single test, except a couple of settle tests (which are very noisy by definition), simple.inspector.close (+4.45) and server.protocoljs (+0.11):
  https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=2c5ff1e5f4e14ec10ef62e68c0d4e4e35113236c&newProject=try&newRevision=6f88e4aa501c6989649ef2cf1414b22c823571af&originalSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&newSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&framework=1
This patch very clearly indicates a noise reduction.
We go from:
  avg:9.101847826086958 q1:2.67 median:3.63 q3:5.92 min:0.53 max:175.46
to:
  avg:8.140978260869566 q1:0.84 median:1.82 q3:3.7 min:0.26 max:224.43

> 2) disable addons via the prefs

The result of this second patch is really different.
It only impacts cold.inspector.open, the very first test to run (this comparison is between this patch and the previous one). It is surprising, as it doesn't really reduce the noise of this test but rather its baseline. It is as if something was always running during this test and no longer runs.
So maybe the first patch does not wait correctly for everything, or, for some reason, these extra addons compute things only during the first test...
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=6f88e4aa501c6989649ef2cf1414b22c823571af&newProject=try&newRevision=f048b568fad96a720a1d9e255a243c7f472d5d9d&originalSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&newSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&framework=1
Overall the stddev doesn't change significantly:
  avg:8.140978260869566 q1:0.84 median:1.82 q3:3.7 min:0.26 max:224.43
to:
  avg:8.34467391304348 q1:0.94 median:1.88 q3:4.25 min:0.24 max:253.39
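
For illustration, disabling built-in add-on features through prefs looks roughly like this in user.js syntax (these two pref names existed around that era, but the patch's actual pref list isn't quoted in this bug):

user_pref("extensions.screenshots.disabled", true);  // Firefox Screenshots
user_pref("browser.onboarding.enabled", false);      // onboarding overlay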

> 3) better wait on addTab
> 4) better wait on closeTab

Unfortunately these two patches can't be decoupled: applying only the first leads to timeouts.
So I have to compare both at the same time. Here I'm comparing against the previous patch, the one disabling addons via prefs.

https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=f048b568fad96a720a1d9e255a243c7f472d5d9d&newProject=try&newRevision=fc836b178f5a6583e2d143369274d6beb25bcf87&originalSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&newSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&framework=1
We go from:
  avg:8.34804347826087 q1:0.94 median:1.93 q3:4.25 min:0.24 max:253.39
to:
  avg:6.604782608695655 q1:0.61 median:1.48 q3:4.49 min:0.29 max:95.16
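
For context, patches 3 and 4 go roughly in this direction (a sketch with assumed helper shapes, not the actual patch): resolve addTab only once the tab has actually finished loading, instead of after a fixed delay.

function addTab(url) {
  return new Promise(resolve => {
    // Note: recent Firefox versions require a triggeringPrincipal here.
    let tab = gBrowser.selectedTab = gBrowser.addTab(url);
    let browser = tab.linkedBrowser;
    // Resolve on the document load instead of an arbitrary setTimeout.
    browser.addEventListener("load", () => resolve(tab),
                             { capture: true, once: true });
  });
}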

Here is the regression in stddev percentages in detail:
complicated.inspector.close +0.4
complicated.jsdebugger.close +0.94  <<<
complicated.jsdebugger.reload +0.06
complicated.styleeditor.close +1.64 <<<
complicated.styleeditor.reload +0.15
complicated.webconsole.close +0.36 <<<
complicated.webconsole.reload +0.03
custom.inspector.close +0.38 <<<
custom.inspector.collapseall.balanced +0.01
custom.inspector.expandall.manychildren +0.59  <<<
custom.inspector.manyrules.deselectnode +0.15
custom.webconsole.close +4.56 <<<
custom.webconsole.open +0.05
custom.webconsole.reload +0.28 <<<
panelsInBackground.reload +0.86 <<<
simple.inspector.close +3.04 <<<
simple.inspector.reload +0.20 <<<
simple.netmonitor.reload +0.17 <<<
simple.webconsole.reload +0.07
I ignored settle tests, but the stddev also increases for lots of them.

These patches are the ones having a negative impact on the close stddev, as well as on reload and expandall.
I would like to better understand these patches and figure out why the first one times out when applied individually.

So I'll move forward with the first one, waiting for the addon manager and session restore.
I don't know what to think about the second one disabling addons via prefs...
And I'll file a follow-up for the add/close tab patch.
Pushed by apoirot@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f2ae0027dfa4
Wait for addon manager and session restore full initialization before starting DAMP tests. r=jdescottes
https://hg.mozilla.org/mozilla-central/rev/f2ae0027dfa4
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 62
Product: Firefox → DevTools