Open Bug 1407826 Opened 2 years ago Updated 1 year ago

Try to reduce DAMP subtests variance

Categories

(DevTools :: General, defect, P2)


Tracking

(firefox57 fix-optional)

Tracking Status
firefox57 --- fix-optional

People

(Reporter: ochameau, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Today, the DAMP summary has low variance, under 2% and often around 1%.
But the subtests have higher variance, which makes them less reliable.
The higher the variance, the lower our confidence in smaller regressions.
It would be great if the most important subtests were all under 2%.
That would improve PerfHerder readability by giving better confidence and showing the red/green highlights more often, and it would catch more regressions.

Possible solutions: waiting between tests, doing GC, verifying that tests do not overlap with each other, ensuring that unmaintained tests run at the end so they don't pollute the others, ...
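
For illustration, here is a minimal sketch of the "wait between tests and do GC" idea, assuming damp.js runs as privileged chrome code with Cu and the test window available; the settle() helper name and the 2s delay are made up for this example, not the actual DAMP code:

  // Sketch only: force collections and give async work time to finish
  // before the next subtest starts measuring.
  async function settle(win) {
    // Force a GC and a cycle collection so that collections triggered by the
    // previous subtest do not land in the middle of the next measurement.
    Cu.forceGC();
    Cu.forceCC();
    // Give pending async work (promises, timers, paints) a chance to finish.
    await new Promise(resolve => win.setTimeout(resolve, 2000));
  }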

If you look at this run:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-central&newProject=try&newRevision=65c9b87aea96865e0614a793bee620418fc9f763&originalSignature=edaec66500db21d37602c99daa61ac983f21a6ac&newSignature=edaec66500db21d37602c99daa61ac983f21a6ac&framework=1&selectedTimeRange=172800
You can see that:
* cold.open is at 3.7%
* most close tests are over 2% and sometimes very high (which most likely means the reload test overlaps with close)
* most reload tests against the complicated document are pretty good, under 2%
* it is different against the simple document, where it is bad: debugger and netmonitor are above 4%
Priority: -- → P2
Here are a couple of experiments.

The reference run is this one, made from today's tip:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=6cc023198068bdbd247ce0a43233776c5ce0900b
Its average variance is 3.753%, max is 23% for styleeditor.close.
Variance sum is 202.
DAMP summary is 254.

1) Flush any async operations and stabilize the GC before running the tests.
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=6cc023198068bdbd247ce0a43233776c5ce0900b&newProject=try&newRevision=381a714eec7852325d32481395b73d371074bfa1&originalSignature=edaec66500db21d37602c99daa61ac983f21a6ac&newSignature=edaec66500db21d37602c99daa61ac983f21a6ac&framework=1
Average variance is 3.626%, max is 13% for readHeapSnapshot.
Variance sum is 199.
DAMP summary is 265 (+4%).

2) Wait two seconds between each "command" (that's different from a subtest: it also pauses after toolbox close, for example).
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=6cc023198068bdbd247ce0a43233776c5ce0900b&newProject=try&newRevision=e23cf77c09a05f065d49f84c4919b83d3a380be5&originalSignature=edaec66500db21d37602c99daa61ac983f21a6ac&newSignature=edaec66500db21d37602c99daa61ac983f21a6ac&framework=1
Average variance is 3.226%, max is 21% for styleeditor.close.
Variance sum is 174.
DAMP summary is 265 (+4%)

3) Wait two seconds between each "subtest" (after toolbox open, after reload, after toolbox close, ...)
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=6cc023198068&newProject=try&newRevision=2b23b72bcc42796baf021a5045213cbab5f516fa&originalSignature=edaec66500db21d37602c99daa61ac983f21a6ac&newSignature=edaec66500db21d37602c99daa61ac983f21a6ac&framework=1
Average variance is 2.153%, max is 7% for memory.close.
Variance sum is 116.
DAMP summary is 240 (-5.4%)


Conclusions:
* I can't really explain why the first two do not improve the variance, and I can explain even less why the overall DAMP score is 4% slower in both cases.
* The third patch, waiting between each subtest, makes a lot of sense and its overall results also make a ton of sense. Most of the subtests most likely still do not wait correctly for full completion: we may not wait for all promises to resolve, nor for a couple of reflows or a GC that happens after significant actions, ...
* The third patch's individual subtest changes still look weird. I would expect most of the tests to execute faster, as they would no longer overlap with each other. But if you look at PerfHerder, you will see that a lot of subtests seem to be slower...
* The third patch also hides weak tests, so we lose coverage; ideally each test would listen correctly for everything, so that DAMP covers this async code.
I mentioned this briefly in bug 1399779 comment 1, but what if we made each tool responsible for signaling when things like "reload" are done via custom events that DAMP could wait for?

Then each tool is responsible for signaling when all the work for "reload" etc. is done, since you might need a different approach given the different design of each tool.

Is that worth considering, or would it fall apart for some reason?
Flags: needinfo?(poirot.alex)
(In reply to J. Ryan Stinnett [:jryans] (use ni?) from comment #2)
> I mentioned this briefly in bug 1399779 comment 1, but what if we made each
> tool responsible for signaling when things like "reload" are done via custom
> events that DAMP could wait for?

Yes, we have to do something like that for bug 1405585.
And that event would have to be emitted in the production code path,
so it should also avoid introducing any overhead...

> Then each tool is responsible for signaling when all the work for "reload"
> etc. is done, since you might need a different approach given the different
> design of each tool.

Yes, I'm also expecting this to be very specific for each tool!


In the context of DAMP, it would help replace this waiting code:
http://searchfox.org/mozilla-central/source/testing/talos/talos/tests/devtools/addon/content/damp.js#387-393,402-414,422-436,438-450

But in the context of the current bug, it will be hard for generic waiting code to be
as good as these listeners, which hardcode the number of console messages/scripts to wait for.
So I'm not expecting this to really help reduce the variance of DAMP.
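
For illustration, the custom-event idea could look roughly like this; the event name and the wiring are hypothetical, not an existing DevTools API:

  // Tool side: emit a signal once everything triggered by the reload has settled.
  panel.panelWin.dispatchEvent(new CustomEvent("devtools-reload-settled"));

  // DAMP side: wait for that signal instead of counting console messages/scripts.
  function waitForReloadSettled(panelWin) {
    return new Promise(resolve => {
      panelWin.addEventListener("devtools-reload-settled", resolve, { once: true });
    });
  }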
Flags: needinfo?(poirot.alex)
Here are some results with patch #3 from comment 1 but with different durations:

# 2s timeout:
Average: 2.210%
Max: 10%
Sum: 119
DAMP summary: 241 (-5.5%)

# 1s timeout:
Average: 3.254%
Max: 25%
Sum: 175
DAMP summary: 238 (-6.5%)

# 0.5s timeout:
Average: 3.403%
Max: 21%
Sum: 183
DAMP summary: 253 (-0.7%)

# 0.250s timeout:
Average: 2.844%
Max: 9%
Sum: 153
DAMP summary: 258 (-0.9%)

# last 2 days m-c:
Average: 4.1%
Max: 22%
Sum: 225
DAMP summary: 255

# last 7 days m-c:
Average: 4.77%
Max: 23%
Sum: 262
DAMP summary: 256

# last 14 days m-c:
Average: 5.19%
Max: 23%
Sum: 285
DAMP summary: 259

It looks like we already get a substantial win with a 250ms wait delay, but 2s is still better.
I'll look into smaller timeouts, but also try flushing the event loop, to avoid delaying when it is not necessary.
# setTimeout(0)
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=ed2fd2d1e7ff98998638e9d6543e9e795c714191
Average: 3.641
Max: 15
Sum: 196
DAMP summary: 243.82 ± 0.76% (-4.4%)
Duration: 30 minutes

# flush the event loop before moving on
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=ea7c43051dbe7b8c3f281e3e86942d01e764ab2c
Average: 2.310
Max: 9%
Sum: 124
DAMP summary: 238.51 ± 0.34% (-6.5%)
Duration: 67 minutes

# GC *and* flush the event loop
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=076b7d4e6cd0919121b67700b7c87170393213b0
Average: 1.678
Max: 8%
Sum: 90
DAMP summary: 259.31 ± 0.15% (1.6%)
Duration: 72 minutes

2s duration: 68 minutes
1s duration: 49 minutes
0.5s duration: 40 minutes
0.25s duration: 35 minutes

So it looks like we have to double the time spent on DAMP to make it more accurate!
Waiting for an event loop flush seems to be equivalent to the 2s timeout.
Here are variations of "# GC *and* flush the event loop".
It was waiting for 5 consecutive empty event loops with a 250ms interval.
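
As a rough sketch of the "empty event loop" wait (not the exact patch): require N consecutive timer ticks that fire close to their scheduled time; a tick that fires late means the event loop was busy, so the counter resets. The helper name and the slack threshold are made up:

  function waitForIdleEventLoop(win, ticks = 5, interval = 250, slack = 50) {
    return new Promise(resolve => {
      let consecutive = 0;
      let tick = () => {
        let expected = win.performance.now() + interval;
        win.setTimeout(() => {
          // A late callback means other work ran on the event loop: start over.
          let late = win.performance.now() - expected > slack;
          consecutive = late ? 0 : consecutive + 1;
          if (consecutive >= ticks) {
            resolve();
          } else {
            tick();
          }
        }, interval);
      };
      tick();
    });
  }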

# 10 iterations, 50ms interval
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=3eac4addd8d72ea28627966c08694c276c3defd1&selectedJob=137876481
Average: 1.790%
Sum: 96
Max: 7%
DAMP summary: 256.54 ± 0.61% (+0.52%)
Duration: 59 minutes

# 20 iterations, 0ms interval
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=49679949feae2a30d4247446d744f0475b4dff06&selectedJob=137876419
Average: 1.842%
Sum: 99
Max: 9%
DAMP summary: 262.31 ± 0.34% (+2.78%)
Duration: 52 minutes.
I may be able to make DAMP half as noisy (variance divided by 2).
But DAMP will also be slower to run overall, about two times slower...

So, in terms of resources, is that an issue?
I imagine that, since the results are more stable, we should either be able to say that --rebuild-talos 6 is too much and we can now do 3, or maybe it is no longer relevant to run DAMP 25 times (cycles * tppagecycles)?

You told me you had some process for settling on "6" as the best choice for talos rebuilds.
Could we re-run this process with this additional fix?
Flags: needinfo?(jmaher)
I am excited to hear that damp could be run faster- a slower runtime is not ideal, but something we can work with.  --rebuild-talos 6 is a generic form to allow all tests and all platforms get a reasonable sample size- some data points need a lot less.  I would make sure you can get this stability across osx and windows as well.

To move forward with this, I would recommend splitting damp into a few different tests- this allows us to get more "metrics" at a summary level and allows us to parallelize it across various different jobs so we can return results in a reasonable time frame (keep in mind this will realistically be december before we can do this due to the need for buildbot changes and an infrastructure freeze going into effect after next week)

if we are just testing damp, then we could document that there are fewer retriggers needed- I would love to point at DAMP as a flagship perf test :)  

As a note, my process for defining '6' was to take a few hundred data points and look for stability over chunks of various sizes- so if we were to download 100 summaries of damp over time this should only show alerts on real changes.  

An easier way to do this is to retrigger damp 100 times on a single try push and then analyze the data.  

A less scientific but useful way to do this is to use the compare feature of perfherder with two pushes from the same revision- in this case just push to try twice- then retrigger the 'g2' job 2 times and see if 3 data points show any stddev or if it detects a regression.  keep adding a job and see how it tracks over different data sets.
Flags: needinfo?(jmaher)
Here are some tests to see if all the calls to settle are worthwhile:
# only settle after test teardown
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=742bf862a11c086a2dde14b8b15c14cb65575afa&selectedJob=138160580
Average: 2.182%
Max: 8%
Sum: 117
DAMP summary: 263.35 ± 0.75% (+3%)
Duration: 44m

# only settle after page reload
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=6fc48939964a6e6d94e1a372a41730dac25af37e&selectedJob=138160539
Average: 1.940%
Max: 8%
Sum: 104
DAMP summary: 238.75 ± 0.48% (-6%)
Duration: 42m

# only settle after toolbox open
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=c3380a3cfc4fe7063a437d04a57c2754f2ec3c26&selectedJob=138162991
Average: 2.417%
Max: 8%
Sum: 130
DAMP summary: 273.01 ± 0.26% (+6%)
Duration: 40m

So, settling after page reload is the most useful, but there is still an improvement when settle is called after all steps.


# settle after all steps, but only do Cycle Collect
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=ceb71253524a24d56dc574f34367b10a5544f194&selectedJob=138159930
Average: 2.250%
Max: 9%
Sum: 121
DAMP summary: 234.69 ± 0.53% (-8%)
Duration: 54m

This run was to see if window.cycleCollect() was enough. It is not really clear whether all these GC methods are worth it.
But it looks like calling only cycleCollect isn't equivalent to calling it along with all the others.


All these runs should be compared to comment 5:
> # GC *and* flush the event loop
>   https://treeherder.mozilla.org/#/jobs?repo=try&revision=076b7d4e6cd0919121b67700b7c87170393213b0
> Average: 1.678
> Max: 8%
> Sum: 90
> DAMP summary: 259.31 ± 0.15% (1.6%)
> Duration: 72 minutes

This one is settling after each step and calling all GC methods.


In the end, maybe "only settle after the reload step and call all GC methods" is a good compromise.
It brings DAMP variance under 2% and runs in 40 minutes.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #9)
> I am excited to hear that damp could be run faster- a slower runtime is not
> ideal, but something we can work with.  --rebuild-talos 6 is a generic form
> to allow all tests and all platforms get a reasonable sample size- some data
> points need a lot less.  I would make sure you can get this stability across
> osx and windows as well.

Just to be clear: if you take one run, it will be slower with any of these patches.

But given that the variance may be divided by two, we should be able to do fewer runs,
either by:
* spawning fewer jobs (lower --rebuild-talos)
* doing fewer individual test runs (lower cycles/tppagecycles)

And so, if we do that, we should be able to mitigate this slowness and maybe use the same resources, but with better/more stable results.

> To move forward with this, I would recommend splitting damp into a few
> different tests- this allows us to get more "metrics" at a summary level and
> allows us to parallelize it across various different jobs so we can return
> results in a reasonable time frame (keep in mind this will realistically be
> december before we can do this due to the need for buildbot changes and an
> infrastructure freeze going into effect after next week)

Please ping me whenever this is ready.
I think it would be helpful to be able to split it independently of this patch/issue.
Today it runs in 30 minutes; it would be great to have the results in 15 minutes with just two jobs!

But... this won't be obvious because of caches and cold versus warm runs.
We could easily break or pollute the results depending on how the tests are split.
We would have to be cautious about that.

> As a note, my process for defining '6' was to take a few hundred data points
> and look for stability over chunks of various sizes- so if we were to
> download 100 summaries of damp over time this should only show alerts on
> real changes.  
> 
> An easier way to do this is to retrigger damp 100 times on a single try push
> and then analyze the data.  

But I imagine you are looking for a hard limit. A factual criterion?
Like having the stddev (is that what I keep calling variance?) under X%.

Today, DAMP has an average variance between 4% and 5% on subtests,
but its summary is often under 1% with 6 runs.

So I imagine I should push 100 times and see how few of those runs I can keep
while staying under 1% stddev on the summary?
And ideally also look at the subtest stddev and keep it minimal.

What would be ideal is to run DAMP only once and tweak cycles/tppagecycles accordingly.
I think reducing the variance would allow us to reduce the cycles we run which will reduce the overall time.  This sounds exciting.

As for overall variance, unless we fix all the tests to have less variance, the recommendation of 6 before/after still holds- if someone is only investigating damp, then yeah, we could recommend fewer (like 2).  These also run on a pool of 100+ machines, and there is a chance that one machine posts erratic results- sometimes it is easy to spot that machine, and sometimes it is only a small window of time until the machine reboots or cleans itself up properly.  If we ran on the same machine, the variance for all tests would be much lower- I think we should see how low we can get the variance here and whether it sustains over time on all platforms- if so, I would love to repeat a similar process for other tests, maybe reducing a lot of noise and confusion.
About the various kinds of GC commands, here are some runs:

# Base run - call all GC commands and flush event loop after reload step
average = 1.986
max = 10%
sum = 107
summary = 238.97 ± 0.28% (-8.20%)
duration = 41m

# Only call window.cycleCollect
average=2.885
max = 8%
sum = 155
summary = 227.83 ± 0.75% (-12.48%)
duration = 43m

# Only call gMgr.minimizeMemoryUsage
average=2.162
max = 11%
sum = 116
summary = 238.35 ± 0.65% (-8.44%)
duration = 42m

# Only call Cu.forceGC
average=3.158
max = 13%
sum = 170
summary = 233.09 ± 0.59% (-10.46%)
duration = 42m

Conclusion:
gMgr.minimizeMemoryUsage is the most impactful by far, followed by cycleCollect; Cu.forceGC has only a limited impact.
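
For reference, here is a sketch of how these three calls can be issued from privileged code, assuming Cc/Ci/Cu are in scope as in chrome scripts; the exact wiring in damp.js may differ, and the nsIDOMWindowUtils dance stands in for the "window.cycleCollect()" mentioned above:

  function minimizeMemoryUsage(gMgr) {
    // The most impactful call above: ask the memory reporter manager to
    // release as much memory as possible, and wait for its callback.
    return new Promise(resolve => gMgr.minimizeMemoryUsage(resolve));
  }

  async function forceAllGC(win) {
    // Cycle collection via nsIDOMWindowUtils.
    win.QueryInterface(Ci.nsIInterfaceRequestor)
       .getInterface(Ci.nsIDOMWindowUtils)
       .cycleCollect();
    // Plain JS garbage collection.
    Cu.forceGC();
    let gMgr = Cc["@mozilla.org/memory-reporter-manager;1"]
                 .getService(Ci.nsIMemoryReporterManager);
    await minimizeMemoryUsage(gMgr);
  }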
A quick look at other OSes:

# OSX - minimizeMemory on reload and teardown
average = 2.715
max = 7%
sum = 152
summary = 246.55 ± 1.27% (-20.68%)
duration = 58m

# OSX - base
average = 5.992
max = 14%
sum = 335
summary = 310.82 ± 2.08% 
duration = 38m

# Win7 - minimizeMemory on reload and teardown
average = 1.529
max = 7%
sum = 85
summary = 263.03 ± 0.59% (-7.55%)
duration = 76m

# Win7 - base
average = 3.009
max = 9%
sum = 171
summary = 284.52 ± 0.44%
duration = 40m

Conclusion:
stddev is twice as good on all OSes! The time it takes to run DAMP is also almost twice as long.
could we run fewer pagecycles?  If the stddev is low, then I think we could run 10 total data points instead of 25
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #16)
> could we run fewer pagecycles?  If the stddev is low, then I think we could
> run 10 total data points instead of 25

Yes, I hope so. I'm waiting for the results; I tried various combinations.
If we always consider running more than one job, we should tweak cycles/pagecycles accordingly.
Today I suggest everyone on DevTools use --rebuild-talos 6.
well there are other tests and we continue to need --rebuild-talos 6, but possibly we can analyze which tests need <=3 retriggers and which need >5 and split them accordingly; ideally the work you are doing here could be done on other tests :)
today I landed a metric for measuring noise- I suspect it will require some tweaking as we make it more comprehensive and accurate- I would love to see it on a pre/post change here, can you link the final try push with retriggers so I can compare?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #19)
> today I landed a metric for measuring noise- I suspect it will require some
> tweaking as we make it more comprehensive and accurate- I would love to see
> it on a pre/post change here, can you link the final try push with
> retriggers so I can compare?

Here is a try build with one of my best patches to improve DAMP variance:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=c4b11af964e668518db368d289d0762538048434
And a try without it, with whatever HEAD was on m-c:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=f675a717a8ca02918b6d0541d49667a67b9cc9ec

(I just pushed them)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #21)
> The new noise metric takes the subtests from 43 -> 18.  That is for linux64
> specifically- I wanted to see if the noise metric would work for your case :)

Cool! Is there any unit for this metric? How do you compute it?
I created this as a start for testing new hardware; here are some notes on the noise metric:
https://elvis314.wordpress.com/2017/11/16/measuring-the-noise-in-performance-tests/
Depends on: 1420303
Opened bug 1420303 to start improving noise by doing a couple of GCs.
Keeping this bug open to try other ways to reduce the noise.
Depends on: 1465806
Product: Firefox → DevTools
Severity: normal → enhancement