Open Bug 1407826 Opened 8 years ago Updated 1 year ago

Try to reduce DAMP subtests variance

Categories

(DevTools :: General, task, P2)

Tracking

(firefox57 fix-optional)

People

(Reporter: ochameau, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Today, the DAMP summary has a good variance, under 2% and often around 1%. But subtests have a higher variance, which makes them less reliable. The higher the variance, the lower our confidence in smaller regressions. It would be great if the most important subtests were all under 2%. That would improve PerfHerder readability by giving better confidence and showing the red/green highlights more frequently, and it would catch more regressions.

Possible solutions: waiting between tests, doing a GC, verifying that tests do not overlap with each other, ensuring that unmaintained tests run at the end so they don't pollute the others, ...

If you look at this run:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-central&newProject=try&newRevision=65c9b87aea96865e0614a793bee620418fc9f763&originalSignature=edaec66500db21d37602c99daa61ac983f21a6ac&newSignature=edaec66500db21d37602c99daa61ac983f21a6ac&framework=1&selectedTimeRange=172800
you can see that:
* cold.open is at 3.7%
* most close tests are over 2% and sometimes very high (which most likely means the reload test overlaps close)
* most reload tests against "complicated" are pretty good, under 2%
* it is different against "simple", where things are bad: debugger and netmonitor are above 4%
Priority: -- → P2
Here are a couple of experiments. The reference run is this one, made from today's tip:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6cc023198068bdbd247ce0a43233776c5ce0900b
Its average variance is 3.753%, max is 23% for styleeditor.close. Variance sum is 202. DAMP summary is 254.

1) Flush any async operation and stabilize GC before running the tests.
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=6cc023198068bdbd247ce0a43233776c5ce0900b&newProject=try&newRevision=381a714eec7852325d32481395b73d371074bfa1&originalSignature=edaec66500db21d37602c99daa61ac983f21a6ac&newSignature=edaec66500db21d37602c99daa61ac983f21a6ac&framework=1
Average variance is 3.626%, max is 13% for readHeapSnapshot. Variance sum is 199. DAMP summary is 265 (+4%).

2) Wait two seconds between each "command" (unlike waiting between subtests, this also pauses after toolbox close, for example).
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=6cc023198068bdbd247ce0a43233776c5ce0900b&newProject=try&newRevision=e23cf77c09a05f065d49f84c4919b83d3a380be5&originalSignature=edaec66500db21d37602c99daa61ac983f21a6ac&newSignature=edaec66500db21d37602c99daa61ac983f21a6ac&framework=1
Average variance is 3.226%, max is 21% for styleeditor.close. Variance sum is 174. DAMP summary is 265 (+4%).

3) Wait two seconds between each "subtest" (after toolbox open, after reload, after toolbox close, ...)
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=6cc023198068&newProject=try&newRevision=2b23b72bcc42796baf021a5045213cbab5f516fa&originalSignature=edaec66500db21d37602c99daa61ac983f21a6ac&newSignature=edaec66500db21d37602c99daa61ac983f21a6ac&framework=1
Average variance is 2.153%, max is 7% for memory.close. Variance sum is 116. DAMP summary is 240 (-5.4%).

Conclusions:
* I can't really explain why the first two do not improve the variance, and I can explain even less why the overall DAMP score is 4% slower in both cases.
* The third patch, waiting between each subtest, makes a lot of sense, and its overall results also make a ton of sense. Most of the subtests most likely still do not wait correctly for full completion: we may not wait correctly for all promises to be resolved, for a couple of reflows, or for some GC to happen after significant actions, ...
* The third patch's individual subtest changes still look weird. I would expect most of the tests to execute faster, as they would no longer overlap with each other. But if you look at PerfHerder, a lot of subtests seem to be slower...
* The third patch is also hiding weak tests; we are missing coverage. Ideally the tests would listen correctly for everything, so that DAMP covers this async code.
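To make the third experiment concrete, here is a hypothetical sketch of pausing between subtests. The `run` method, the subtest list, and the result shape are stand-ins for illustration, not the real DAMP harness API; the idea is only that trailing async work from one subtest gets time to finish before the next measurement starts.

```javascript
// Resolve after `ms` milliseconds, yielding the event loop in between.
function pause(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Run subtests sequentially, inserting a fixed delay after each one so that
// pending timers/promises spawned by a subtest cannot leak into the next
// measurement. (Hypothetical harness shape, not DAMP's actual code.)
async function runWithPauses(subtests, delayMs) {
  const results = [];
  for (const subtest of subtests) {
    const start = Date.now();
    await subtest.run();
    results.push({ name: subtest.name, duration: Date.now() - start });
    await pause(delayMs);
  }
  return results;
}
```

With a 2000ms delay this mirrors experiment 3; the later comments explore shrinking that delay.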
I mentioned this briefly in bug 1399779 comment 1, but what if we made each tool responsible for signaling when things like "reload" are done via custom events that DAMP could wait for? Then each tool is responsible for signaling when all the work for "reload" etc. is done, since you might need a different approach given the different design of each tool. Is that worth considering, or would it fall apart for some reason?
Flags: needinfo?(poirot.alex)
(In reply to J. Ryan Stinnett [:jryans] (use ni?) from comment #2)
> I mentioned this briefly in bug 1399779 comment 1, but what if we made each
> tool responsible for signaling when things like "reload" are done via custom
> events that DAMP could wait for?

Yes, we have to do something like that for bug 1405585. And that event would have to be emitted in the production codepath, so it should also avoid introducing any overhead...

> Then each tool is responsible for signaling when all the work for "reload"
> etc. is done, since you might need a different approach given the different
> design of each tool.

Yes, I'm also expecting this to be very specific to each tool! In the context of DAMP, it would help replace these waiting codes:
http://searchfox.org/mozilla-central/source/testing/talos/talos/tests/devtools/addon/content/damp.js#387-393,402-414,422-436,438-450
But in the context of the current bug, it will be hard to have generic waiting code that is as good as these listeners with hardcoded numbers of console messages/scripts to wait for. So I'm not expecting them to really help in reducing the variance of DAMP.
Flags: needinfo?(poirot.alex)
Here are some results with patch #3 from comment 1, but with different durations:

# 2s timeout: Average: 2.210% Max: 10% Sum: 119 DAMP summary: 241 (-5.5%)
# 1s timeout: Average: 3.254% Max: 25% Sum: 175 DAMP summary: 238 (-6.5%)
# 0.5s timeout: Average: 3.403% Max: 21% Sum: 183 DAMP summary: 253 (-0.7%)
# 0.250s timeout: Average: 2.844% Max: 9% Sum: 153 DAMP summary: 258 (-0.9%)
# last 2 days m-c: Average: 4.1% Max: 22% Sum: 225 DAMP summary: 255
# last 7 days m-c: Average: 4.77% Max: 23% Sum: 262 DAMP summary: 256
# last 14 days m-c: Average: 5.19% Max: 23% Sum: 285 DAMP summary: 259

It looks like we already get a substantial win with a 250ms wait delay, but still, 2s is better. I'll look into smaller timeouts, but also try to flush the event loop, to avoid delaying when it is not necessary.
# setTimeout(0)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ed2fd2d1e7ff98998638e9d6543e9e795c714191
Average: 3.641 Max: 15 Sum: 196 DAMP summary: 243.82 ± 0.76% (-4.4%) Duration: 30 minutes

# flush the event loop before moving on
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ea7c43051dbe7b8c3f281e3e86942d01e764ab2c
Average: 2.310 Max: 9% Sum: 124 DAMP summary: 238.51 ± 0.34% (-6.5%) Duration: 67 minutes

# GC *and* flush the event loop
https://treeherder.mozilla.org/#/jobs?repo=try&revision=076b7d4e6cd0919121b67700b7c87170393213b0
Average: 1.678 Max: 8% Sum: 90 DAMP summary: 259.31 ± 0.15% (1.6%) Duration: 72 minutes

For reference, the durations of the timeout-based runs: 2s: 68 minutes, 1s: 49 minutes, 0.5s: 40 minutes, 0.25s: 35 minutes.

So it looks like we have to double the time spent on DAMP to make it more accurate! Waiting for the event loop to flush seems to be equivalent to the 2s timeout.
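A simplified sketch of the "flush the event loop before moving on" step, parameterized by the iteration count and interval that the next comment experiments with (5 × 250ms, 10 × 50ms, 20 × 0ms). This version only chains low-priority timeouts so earlier-queued work runs first; the real patch presumably also checks that the loop was actually empty between checks, which is an assumption here.

```javascript
// Resolve only after `iterations` consecutive timer callbacks, spaced
// `intervalMs` apart, have fired. Each setTimeout yields the event loop, so
// any already-queued work gets a chance to run before we move on.
function settleEventLoop(iterations, intervalMs) {
  return new Promise(resolve => {
    let remaining = iterations;
    function tick() {
      if (--remaining <= 0) {
        resolve();
      } else {
        setTimeout(tick, intervalMs);
      }
    }
    setTimeout(tick, intervalMs);
  });
}
```

The duration numbers above show the trade-off directly: more/longer intervals settle better but make the whole suite slower.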
Here are variations of "# GC *and* flush the event loop", which was waiting for 5 consecutive empty event loops with a 250ms interval.

# 10 iterations, 50ms interval
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3eac4addd8d72ea28627966c08694c276c3defd1&selectedJob=137876481
Average: 1.790% Sum: 96 Max: 7% DAMP summary: 256.54 ± 0.61% (+0.52%) Duration: 59 minutes

# 20 iterations, 0ms interval
https://treeherder.mozilla.org/#/jobs?repo=try&revision=49679949feae2a30d4247446d744f0475b4dff06&selectedJob=137876419
Average: 1.842% Sum: 99 Max: 9% DAMP summary: 262.31 ± 0.34% (+2.78%) Duration: 52 minutes.
I may be able to make DAMP two times less noisy (variance / 2). But DAMP will also be slower to run overall, about two times slower... So, in terms of resources, is that an issue? I imagine that, given that the results are more stable, we should either be able to say --rebuild-talos 6 is too much and we can now do 3, or maybe it is no longer relevant to run DAMP 25 times (cycles * tppagecycles)? You told me you had some process to conclude that "6" is the best choice for talos rebuilds. Could we re-run this process with this additional fix?
Flags: needinfo?(jmaher)
I am excited to hear that damp could be run faster- a slower runtime is not ideal, but something we can work with. --rebuild-talos 6 is a generic form to allow all tests and all platforms to get a reasonable sample size- some data points need a lot less. I would make sure you can get this stability across osx and windows as well.

To move forward with this, I would recommend splitting damp into a few different tests- this allows us to get more "metrics" at a summary level and allows us to parallelize it across various different jobs so we can return results in a reasonable time frame (keep in mind this will realistically be December before we can do this, due to the need for buildbot changes and an infrastructure freeze going into effect after next week). If we are just testing damp, then we could document that fewer retriggers are needed- I would love to point at DAMP as a flagship perf test :)

As a note, my process for defining '6' was to take a few hundred data points and look for stability over chunks of various sizes- so if we were to download 100 summaries of damp over time, this should only show alerts on real changes. An easier way to do this is to retrigger damp 100 times on a single try push and then analyze the data.

A less scientific but useful way is to use the compare feature of perfherder with two pushes from the same revision- in this case just push to try twice- then retrigger the 'g2' job 2 times and see if 3 data points show any stddev or if it detects a regression. Keep adding a job and see how it tracks over different data sets.
Flags: needinfo?(jmaher)
Here are some tests to see whether all the calls to settle are worthwhile:

# only settle after test teardown
https://treeherder.mozilla.org/#/jobs?repo=try&revision=742bf862a11c086a2dde14b8b15c14cb65575afa&selectedJob=138160580
Average: 2.182% Max: 8% Sum: 117 DAMP summary: 263.35 ± 0.75% (+3%) Duration: 44m

# only settle after page reload
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6fc48939964a6e6d94e1a372a41730dac25af37e&selectedJob=138160539
Average: 1.940% Max: 8% Sum: 104 DAMP summary: 238.75 ± 0.48% (-6%) Duration: 42m

# only settle after toolbox open
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c3380a3cfc4fe7063a437d04a57c2754f2ec3c26&selectedJob=138162991
Average: 2.417% Max: 8% Sum: 130 DAMP summary: 273.01 ± 0.26% (+6%) Duration: 40m

So, settling after page reload is the most useful, but there is still an improvement when settle is called after all steps.

# settle after all steps, but only do a Cycle Collect
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ceb71253524a24d56dc574f34367b10a5544f194&selectedJob=138159930
Average: 2.250% Max: 9% Sum: 121 DAMP summary: 234.69 ± 0.53% (-8%) Duration: 54m

This run was to see if window.cycleCollect() was enough. It is not really clear if all these GC methods are worth it, but it looks like calling only cycleCollect isn't equivalent to calling it plus all the others.

All these runs should be compared to comment 5:
> # GC *and* flush the event loop
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=076b7d4e6cd0919121b67700b7c87170393213b0
> Average: 1.678
> Max: 8%
> Sum: 90
> DAMP summary: 259.31 ± 0.15% (1.6%)
> Duration: 72 minutes
That one settles after each step and calls all GC methods.

In the end, maybe "only settle after toolbox reload and call all GC methods" is a good compromise. It brings DAMP variance under 2% and runs in 40 minutes.
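A sketch of what a "settle" step that calls all the GC methods might look like. In DAMP this would invoke privileged Gecko APIs (Cu.forceGC, window.cycleCollect, gMgr.minimizeMemoryUsage); here they are injected as plain functions so the control flow can be read and exercised outside of Gecko, which is purely an illustrative assumption.

```javascript
// Run the three collection mechanisms in sequence, resolving once
// minimizeMemoryUsage reports completion via its callback (mirroring the
// callback-style completion of the real nsIMemoryReporterManager API).
function settleMemory({ forceGC, cycleCollect, minimizeMemoryUsage }) {
  return new Promise(resolve => {
    forceGC(); // collect the JS heap
    cycleCollect(); // break JS <-> native reference cycles
    // Per the measurements in this bug, this call is the most impactful.
    minimizeMemoryUsage(resolve);
  });
}
```

The per-method runs in comment 13 suggest that if only one of these is kept, minimizeMemoryUsage is the one that matters.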
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #9)
> I am excited to hear that damp could be run faster- a slower runtime is not
> ideal, but something we can work with. --rebuild-talos 6 is a generic form
> to allow all tests and all platforms get a reasonable sample size- some data
> points need a lot less. I would make sure you can get this stability across
> osx and windows as well.

Just to be clear: if you take one run, it will be slower with any of these patches. But given that the variance may be divided by two, we should be able to do fewer runs, either by:
* spawning fewer jobs (lower --rebuild-talos)
* doing fewer individual test runs (lower cycles/tppagecycles)
And so, if we do that, we should be able to mitigate this slowness and maybe use the same resources, but with better/more stable results.

> To move forward with this, I would recommend splitting damp into a few
> different tests- this allows us to get more "metrics" at a summary level and
> allows us to parallelize it across various different jobs so we can return
> results in a reasonable time frame (keep in mind this will realistically be
> december before we can do this due to the need for buildbot changes and an
> infrastructure freeze going into effect after next week)

Please ping me whenever this is ready. I think it would be helpful to be able to split it, independently of this patch/issue. Today it runs in 30 min; it would be great to have the result in 15 min with just two jobs! But... this won't be obvious, because of caches and cold versus warm runs. We could easily break or pollute the results depending on how the tests are split. We would have to be cautious about that.

> As a note, my process for defining '6' was to take a few hundred data points
> and look for stability over chunks of various sizes- so if we were to
> download 100 summaries of damp over time this should only show alerts on
> real changes.
>
> An easier way to do this is to retrigger damp 100 times on a single try push
> and then analyze the data.

But I imagine you are looking for a hard limit, a factual criterion? Like having the stddev (is that what I keep calling variance?) under X%. Today, DAMP has an average variance between 4% and 5% on subtests, but its summary is often under 1% with 6 runs. So I imagine I should push 100 times and see how many "times" I can keep while staying under 1% stddev on the summary? And ideally also look at subtest stddev and keep it minimal. What would be ideal is to run DAMP only once and tweak cycles/tppagecycles accordingly.
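The "variance %" figures quoted throughout this bug read like the standard deviation expressed as a percentage of the mean (the coefficient of variation); that interpretation is my assumption here. A minimal sketch of that computation:

```javascript
// Population standard deviation as a percentage of the mean.
function stddevPercent(samples) {
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / samples.length;
  return (Math.sqrt(variance) / mean) * 100;
}
```

Expressing spread relative to the mean is what lets a "2%" threshold be compared across subtests whose absolute durations differ widely.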
I think reducing the variance would allow us to reduce the cycles we run, which will reduce the overall time. This sounds exciting.

As for overall variance: unless we fix all the tests to have less variance, the recommendation of 6 before/after still holds- if someone is only investigating damp, then yeah, we could recommend fewer (like 2).

These also run on a pool of 100+ machines, and there is a chance that one machine could post erratic results- sometimes it is easy to spot such a machine, and sometimes it is only a small window of time until the machine reboots or cleans itself up properly. If we ran on the same machine, the variance for all tests would be much lower.

I think we should see how low we can get the variance here and see if it sustains over time on all platforms- if so, I would love to repeat a similar process for other tests, maybe reducing a lot of noise and confusion.
About the various kinds of GC commands, here are some runs:

# Base run - call all GC commands and flush the event loop after the reload step
average = 1.986 max = 10% sum = 107 summary = 238.97 ± 0.28% -8.20% duration = 41m

# Only call window.cycleCollect
average = 2.885 max = 8% sum = 155 summary = 227.83 ± 0.75% -12.48% duration = 43m

# Only call gMgr.minimizeMemoryUsage
average = 2.162 max = 11% sum = 116 summary = 238.35 ± 0.65% -8.44% duration = 42m

# Only call Cu.forceGC
average = 3.158 max = 13% sum = 170 summary = 233.09 ± 0.59% -10.46% duration = 42m

Conclusion: gMgr.minimizeMemoryUsage is the most impactful by far, followed by cycleCollect; Cu.forceGC has only a limited impact.
A quick look at other OSes:

# OSX - minimizeMemory on reload and teardown
average = 2.715 max = 7% sum = 152 summary = 246.55 ± 1.27% -20.68% duration = 58m

# OSX - base
average = 5.992 max = 14% sum = 335 summary = 310.82 ± 2.08% duration = 38m

# Win7 - minimizeMemory on reload and teardown
average = 1.529 max = 7% sum = 85 summary = 263.03 ± 0.59% -7.55% duration = 76m

# Win7 - base
average = 3.009 max = 9% sum = 171 summary = 284.52 ± 0.44% duration = 40m

Conclusion: the stddev is twice as good on all OSes! The time it takes to run DAMP is also almost twice as long.
Could we run fewer pagecycles? If the stddev is low, then I think we could run 10 total data points instead of 25.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #16)
> could we run less pagecycles? If the stddev is low, then I think we could
> run 10 total data points instead of 25

Yes, I hope so. I'm waiting for the results; I tried various combinations. If we always consider running more than one job, we should tweak cycles/pagecycles accordingly. Today I suggest everyone on DevTools use --rebuild-talos 6.
Well, there are other tests, and we continue to need --rebuild-talos 6, but possibly we can analyze which tests need <=3 retriggers and which need >5 and split them accordingly; ideally the work you are doing here could be done on other tests :)
Today I landed a metric for measuring noise- I suspect it will require some tweaking as we make it more comprehensive and accurate. I would love to see it on a pre/post change here; can you link the final try push with retriggers so I can compare?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #19)
> today I landed a metric for measuring noise- I suspect it will require some
> tweaking as we make it more comprehensive and accurate- I would love to see
> it on a pre/post change here, can you link the final try push with
> retriggers so I can compare?

Here is a try build with one of my best patches to improve DAMP variance:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c4b11af964e668518db368d289d0762538048434
And a try push without it, with whatever HEAD was on m-c:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f675a717a8ca02918b6d0541d49667a67b9cc9ec
(I just pushed them.)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #21)
> The new noise metric takes the subtests from 43 -> 18. That is for linux64
> specifically- I wanted to see if the noise metric would work for your case :)

Cool! Is there any unit for this metric? How do you compute it?
I created this as a start for testing new hardware, here are some notes on the noise metric: https://elvis314.wordpress.com/2017/11/16/measuring-the-noise-in-performance-tests/
Opened bug 1420303 to start improving noise by doing a couple of GC. Keeping this bug open to try other ways to reduce the noise.
Product: Firefox → DevTools
Severity: normal → enhancement
Severity: normal → S3
Type: defect → task