1191019 - talos osx 10.10 shows a lot of false positives in compare view

Reporter

Description

•

10 years ago

As we have been switching to a mode where we point all talos regressions at a perfherder compare view, I keep seeing osx 10.10 regressions in every single one. After chatting with :avih on irc this morning, he suggested I push to try the same revision as an existing push on the tree.

Given a mozilla-central revision:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=5cf4d2f7f2f2&filter-searchStr=talos

and a corresponding try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b82a44ba8928
(did a hg update 909e4b1913a9, to get to the same code revision)

We end up with a compare view:
https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-central&originalRevision=5cf4d2f7f2f2&newProject=try&newRevision=b82a44ba8928&hideMinorChanges=1

Lots of osx 10.10 improvements/regressions.

Some thoughts:
1) collect more data
2) investigate why osx 10.10 is noisy
3) maybe see how all the tests compare, this would help identify problematic tests/platforms
4) adjust our calculations to account for this

I am sure there are other good ideas here. Please chime in and I would be happy to experiment a bit.

This should be understood better before we switch our policies; this noise is causing confusion for developers.

William Lachance (:wlach)

Comment 1

•

10 years ago

We're hiding MacOS X results in compareperf view now. I think we should just turn this off (at least on branches besides try) until/unless someone is willing to spend some time looking at why these results are so unreliable.

Comment 2

•

10 years ago

(In reply to Joel Maher (:jmaher) from comment #0)
> We end up with a compare view:
> https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-central&originalRevision=5cf4d2f7f2f2&newProject=try&newRevision=b82a44ba8928&hideMinorChanges=1
>
> Lots of osx 10.10 improvements/regressions.

This perfherder view ends up empty. How can we get a more recent compare view which includes OSX results?

I'd like to go over test by test and see if all of them are bad or if it's only bad for some tests, etc. Just, evaluate the results.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 3

•

10 years ago

Looking at the alerts generated, we have 90 alerts for osx 10.10; half are improvements. Excluding duplicates, we have 32 regression alerts.

To keep this useful, consider only the last 4 months: the majority of the alerts were due to landing the python webserver with no tp5n data, which means we had -100%+ regressions!! This was backed out. In fact, in the last 4 months all regressions have been talos related; there have been no regressions caught from code changes, not even jemalloc4.

Also, all alerts we have seen are 10%+, no small ones, probably because we have so much noise.

Avi Halachmi (:avih)

Comment 4

•

10 years ago

@Joel, was that a reply to my request? If yes, I don't understand how it answers it...

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 5

•

10 years ago

no, I was typing it at the same time and just submitted my comment- I will look over all the tests next.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 6

•

10 years ago

a11y [+e10s]:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=[mozilla-inbound,24979818b7ede5165a73d4111a954eaa89615577,1]&series=[mozilla-inbound,6578241abb694e67e275a73dd57fea57a7b38794,1]
* a range between 500 and 700
* a regression around sept 4th with no alert

cart [+e10s]:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=[mozilla-inbound,096c9660734e6ef31d9d328a9a535fa2baac9861,1]&series=[mozilla-inbound,d96137533e482698c2337d88324b3d2d2963c36e,1]
* a range between 36 and 52

damp [+e10s]:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=[mozilla-inbound,9ec3643233a5a6829720b9d4852b242273d197ca,1]&series=[mozilla-inbound,38b719af764eaad10d2d2486a10faa8974634f50,1]
* range between 255 and 368 a few data points

Avi Halachmi (:avih)

Comment 7

•

10 years ago

I'd think that a good starting point would be to compare the noise level (stddev between runs on the same changeset) between OS X 10.10 and other platforms - per test and possibly also per subtest. This should give us a good initial overview on the severity of the situation IMO.
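As a rough illustration of that comparison, here is a minimal sketch that computes the relative stddev (stddev as a percentage of the mean) per platform and test from replicate runs of one changeset. The platform names, the `noise_by_platform` helper, and the numbers are all made up for illustration; this is not how perfherder itself computes anything.

```python
from collections import defaultdict
from statistics import mean, stdev


def noise_by_platform(samples):
    """samples: iterable of (platform, test, value) replicates from ONE changeset.

    Returns {(platform, test): stddev as a percentage of the mean}.
    Hypothetical helper, not part of talos or perfherder.
    """
    groups = defaultdict(list)
    for platform, test, value in samples:
        groups[(platform, test)].append(value)

    result = {}
    for key, values in groups.items():
        # Sample stddev needs at least two runs; skip a zero mean to avoid dividing by 0.
        if len(values) < 2 or mean(values) == 0:
            continue
        result[key] = 100.0 * stdev(values) / mean(values)
    return result


# Made-up replicate values, for illustration only.
samples = [
    ("osx-10-10", "cart", 36.0), ("osx-10-10", "cart", 52.0), ("osx-10-10", "cart", 44.0),
    ("linux64", "cart", 40.0), ("linux64", "cart", 41.0), ("linux64", "cart", 42.0),
]

for (platform, test), pct in sorted(noise_by_platform(samples).items()):
    print(f"{platform:10s} {test:6s} {pct:5.1f}%")
```

Using a relative measure rather than the raw stddev should make tests with very different scales comparable, which matters if part of the problem is a scale issue. The same grouping would work per subtest by adding the subtest name to the key.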

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 8

•

10 years ago

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 9

•

10 years ago

15 noisy tests and 5 not so noisy ones. This could be a scale issue, but I suspect the 5 more stable tests have a chance of being actionable. Sadly, I cannot come up with any common thread between these tests. 3/4 of the tests are pure noise, so I am leaning towards turning stuff off except for try until someone has time to investigate.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 10

•

10 years ago

:avih, do you have further thoughts here?

Flags: needinfo?(avihpit)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Updated

•

10 years ago

Depends on: 1201230

Avi Halachmi (:avih)

Comment 11

•

10 years ago

w00t! thanks for the links!

So, I just looked at all of them. Some comments:

1. I think kraken and canvasmark were always noisy, but in order to notice it you have to zoom in - the same as with v8_7.

2. At the tests which you described as "not noisy", it still has similar noise to the other tests IMO.

3. Except for webgl (which is almost suspiciously stable for my taste), all the tests exhibit similar (but not identical) "noise patterns" on OS X 10.10.

4. At all(/most?) of the places where I visually noticed a change at the linux graphs, I could also clearly notice a similar change at the osx 10.10 graph - despite the noise.

So my assessment is that despite the clearly higher than desirable noise level in 10.10, performance changes are still clearly visible with enough retriggers/data-points.

The graphs are definitely not useless IMO, but also clearly pose a real challenge for automatic regression detection systems or for comparisons in general where not many data points are available.

Not sure how we should handle it further (other than the obvious investigation of the noise source - I'd guess that it's the same reason for the noise on all tests), but I'd feel bad to throw away graphs which IMO clearly have value, only because their noise level poses a challenge to automated detection systems...

Ideas? thoughts?

Flags: needinfo?(avihpit)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 12

•

10 years ago

William Lachance (:wlach)

Comment 13

•

10 years ago

(In reply to Joel Maher (:jmaher) from comment #12)
> This lines up perfectly with the switch to the new r7 hardware and the os
> upgrade from 10.10.2 to 10.10.5. Yay for upgrades. We should understand
> why this happened if possible.

I asked about this in m.r.engineering:
https://groups.google.com/d/msg/mozilla.release.engineering/e78tDkl-PCE/brHSPeriBgAJ

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 14

•

10 years ago

I would like to push to try with two identical changesets and see what noise comes out of this. Ideally with 6 data points each, we could analyze that and maybe another with 12 data points each.
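To make the idea concrete, here is a minimal sketch of how the two identical pushes could be analyzed once the data is in. It assumes Welch's t statistic with an arbitrary cutoff of 3.0 as the "changed" criterion; that is my assumption for illustration, not necessarily what the compare view uses, and the helper names and replicate values are made up.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (each needs >= 2 values)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    if va + vb == 0:
        # No spread at all: any difference in means is infinitely "significant".
        return 0.0 if mean(a) == mean(b) else float("inf")
    return (mean(b) - mean(a)) / sqrt(va + vb)


def false_positives(base, new, threshold=3.0):
    """base/new: {test: [replicates]} from two pushes of the SAME code.

    Any test flagged here is a false positive by construction, since no code
    changed. The threshold of 3.0 is an arbitrary assumption.
    """
    return sorted(
        test for test in base.keys() & new.keys()
        if abs(welch_t(base[test], new[test])) > threshold
    )


# Made-up replicates, 6 data points per push as proposed above.
base = {
    "cart": [40.0, 41.0, 42.0, 40.0, 41.0, 42.0],
    "damp": [255.0, 300.0, 368.0, 260.0, 350.0, 310.0],
}
new = {
    "cart": [44.0, 45.0, 46.0, 44.0, 45.0, 46.0],
    "damp": [262.0, 305.0, 360.0, 270.0, 340.0, 300.0],
}

flagged = false_positives(base, new)
print(f"{len(flagged)} of {len(base)} tests flagged: {flagged}")
```

Running the same analysis on the 6-point and 12-point pushes would show whether the fraction of flagged tests drops as more data points are collected.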

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 15

•

10 years ago

And I have done this:
https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=54f694f6b37e&newProject=try&newRevision=5c89475c28d3&showUnreliablePlatforms=1

Still waiting on g1/g2 results to finish up and post. So far we have 8 'noisy data points' out of 40, and 2 of them might go away when we get full data.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 16

•

10 years ago

Oh, the mozilla.org version:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=54f694f6b37e&newProject=try&newRevision=5c89475c28d3&showUnreliablePlatforms=1
shows 7 quirky data points.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Updated

•

10 years ago

Blocks: 1255582

Robert Wood [:rwood]

Updated

•

8 years ago

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → WONTFIX

Categories

(Testing :: Talos, defect)

Tracking

(Not tracked)

People

(Reporter: jmaher, Unassigned)

Security

(public)
