Bug 1191019 - talos osx 10.10 shows a lot of false positives in compare view
Status: NEW
Product: Testing
Classification: Components
Component: Talos
Assigned To: Nobody; OK to take it and work on it
Depends on: 1201230
Blocks: 1255582
Reported: 2015-08-04 12:35 PDT by Joel Maher (:jmaher)
Modified: 2016-03-10 13:34 PST


Description Joel Maher (:jmaher) 2015-08-04 12:35:33 PDT
As we have been switching to a mode where we point all talos regressions at a perfherder compare view, I keep seeing osx 10.10 regressions in every single one.  

After chatting with :avih on IRC this morning, he suggested I push to try the same revision as an existing push on the tree.

Given a mozilla-central revision:

and a corresponding try push (did an hg update to 909e4b1913a9 to get to the same code revision):

We end up with a compare view:

Lots of osx 10.10 improvements/regressions.  

Some thoughts:
1) collect more data
2) investigate why osx 10.10 is noisy
3) maybe see how all the tests compare, this would help identify problematic tests/platforms
4) adjust our calculations to account for this

I am sure there are other good ideas here - please chime in, and I would be happy to experiment a bit.

This should be understood more before we switch our policies - this noise is causing confusion for developers.
Comment 1 William Lachance (:wlach) (use needinfo!) 2015-09-09 08:51:32 PDT
We're hiding MacOS X results in compareperf view now.

I think we should just turn this off (at least on branches besides try) until/unless someone is willing to spend some time looking at why these results are so unreliable.
Comment 2 Avi Halachmi (:avih) 2015-09-09 09:26:58 PDT
(In reply to Joel Maher (:jmaher) from comment #0)
> We end up with a compare view:
> central&originalRevision=5cf4d2f7f2f2&newProject=try&newRevision=b82a44ba8928
> &hideMinorChanges=1
> Lots of osx 10.10 improvements/regressions.  

This perfherder view ends up empty.

How can we get a most recent compare view which includes OSX results? I'd like to go over test by test and see if all of them are bad or if it's only bad for some tests, etc.

Just to evaluate the results.
Comment 3 Joel Maher (:jmaher) 2015-09-09 09:28:45 PDT
Looking at the alerts generated, we have 90 alerts for osx 10.10; half are improvements. Excluding duplicates, we have 32 regression alerts. To keep this useful: over the last 4 months, the majority of the alerts were due to landing the python webserver with no tp5n data, which means we had -100%+ regressions!! That was backed out. In fact, in the last 4 months all regressions have been talos-related; no regressions have been caught from code changes, not even jemalloc4.

Also, all alerts we have seen are 10%+, no small ones - probably because we have so much noise.
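(Editorial sketch: the "-100%+" figure above is what percent-change math produces when a baseline collapses, e.g. when tp5n suddenly has no real page data to load. The numbers below are invented for illustration, not actual Talos results.)

```python
# Percent change as compare view computes it: (new - old) / old.
# A baseline that collapses to near zero produces a wild -100%-ish swing,
# dwarfing a plausible real regression of a few percent.
def pct_change(old, new):
    """Percent change from old to new."""
    return 100.0 * (new - old) / old

# Hypothetical page-load times (ms):
bogus = pct_change(620.0, 3.0)    # test no longer loads real data
real = pct_change(620.0, 650.0)   # a plausible genuine regression
print(f"bogus: {bogus:.1f}%  real: {real:.1f}%")
```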
Comment 4 Avi Halachmi (:avih) 2015-09-09 09:32:37 PDT
@Joel, was that a reply to my request? If yes, I don't understand how it answers it...
Comment 5 Joel Maher (:jmaher) 2015-09-09 09:37:52 PDT
No, I was typing it at the same time and just submitted my comment - I will look over all the tests next.
Comment 7 Avi Halachmi (:avih) 2015-09-09 11:51:13 PDT
I'd think that a good starting point would be to compare the noise level (stddev between runs on the same changeset) between OS X 10.10 and other platforms - per test and possibly also per subtest.

This should give us a good initial overview on the severity of the situation IMO.
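(Editorial sketch of the comparison :avih proposes: for each test, compute the standard deviation of repeated runs on the same changeset per platform and compare. Uses only the Python statistics module; the run values are made up, not real Talos data.)

```python
from statistics import mean, stdev

# Hypothetical replicate results (ms) for one test on one changeset:
# a tight linux64 cluster vs a widely scattered osx 10.10 one.
runs = {
    "linux64":   [250.1, 251.3, 249.8, 250.6, 250.9, 251.0],
    "osx-10.10": [255.2, 238.9, 261.7, 244.3, 259.1, 241.0],
}

for platform, values in runs.items():
    m, s = mean(values), stdev(values)
    # Report stddev both raw and relative to the mean, per test/platform.
    print(f"{platform}: mean={m:.1f} stddev={s:.1f} ({100 * s / m:.1f}% of mean)")
```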
Comment 8 Joel Maher (:jmaher) 2015-09-09 13:01:44 PDT
a11y osx vs linux64:[mozilla-inbound,24979818b7ede5165a73d4111a954eaa89615577,1]&series=[mozilla-inbound,6578241abb694e67e275a73dd57fea57a7b38794,1]&series=[mozilla-inbound,d94a1100216dba2bacfb063e5315fd25d875ea2c,1]&series=[mozilla-inbound,b332843380d02009805b33015f923af37c63a267,1]

cart osx vs linux64:[mozilla-inbound,096c9660734e6ef31d9d328a9a535fa2baac9861,1]&series=[mozilla-inbound,d96137533e482698c2337d88324b3d2d2963c36e,1]&series=[mozilla-inbound,d3e348b81f26ed4ab23597884139faa000e56a4b,1]&series=[mozilla-inbound,eb05ab320cf3b9d062e624276d42337a95bd24bf,1]

damp osx vs linux64:[mozilla-inbound,9ec3643233a5a6829720b9d4852b242273d197ca,1]&series=[mozilla-inbound,38b719af764eaad10d2d2486a10faa8974634f50,1]&series=[mozilla-inbound,25169736932930bc20f0437757e9399dadf68976,1]&series=[mozilla-inbound,8db4184f56e09eb0a34227f5270e345701798ba3,1]

dromaeo css linux64 vs osx 10.10:[mozilla-inbound,7433b9af4ad52cc5f2cbff94717a3d80fbe306d3,1]&series=[mozilla-inbound,73c31ca5f967b35041bd3fe80ea6a8f031aea600,1]

sessionrestore linux64 vs osx 10.10:[mozilla-inbound,31bd26f64638a428fd20cf5c6e4e41891e9f7847,1]&series=[mozilla-inbound,d302a19fe1f655de7f9db1928a97bda4b3274568,1]

session restore no auto restore linux64 vs osx 10.10:[mozilla-inbound,9a330c1c036fbf8f681b0a4f8b11747ab3101774,1]&series=[mozilla-inbound,d83aa21cf592c17bba37edab93a678b4b29acd53,1]

tart linux64 vs osx 10.10:[mozilla-inbound,4a03564179d82c4105af5facba8b3d6fc785415a,1]&series=[mozilla-inbound,65e173fb1a511a28979c769ad2c8a1b31dd358bc,1]&series=[mozilla-inbound,b1d144c18b37edd1852da21b96b47fa7fc35691f,1]&series=[mozilla-inbound,26f363f15d58cd426c6edeb593cbfc2fa76d17fe,1]

tp5o linux64 vs osx 10.10:[mozilla-inbound,bd72d04511c657c5c5040f1633fe73642fcdcb3b,1]&series=[mozilla-inbound,dc5dc84a309cfa3f89307a1d195307e333707a12,0]&series=[mozilla-inbound,06a2fc2ad84cc4c129b44d9b9f29b800c519e559,1]&series=[mozilla-inbound,3ec2958352e2cb98df247808db574209f4ab5eb8,1]

tp5o_scroll linux64 vs osx 10.10:[mozilla-inbound,ef368eea7a86c71180f5a6fc4af4672a595585d3,1]&series=[mozilla-inbound,4c4dfc86eefd577ec8f1a280c4b08d3f6f0f108a,1]

tresize linux64 vs osx10.10:[mozilla-inbound,374451d2ca27173d3525e341a86f7d999cc32b4c,1]&series=[mozilla-inbound,4df28bfcc957ad1f93f698984fde97b4401a7bfb,1]&series=[mozilla-inbound,72f4651f24362c87efb15d5f4113b9ca194d8e3f,1]&series=[mozilla-inbound,55776a2a6808c7c69af642f42e05d0589f4d10d9,1]



v8_7 linux64 vs osx 10.10:[mozilla-inbound,3480ad1b2f24ce38b509f3ae60e1478419c65d87,1]&series=[mozilla-inbound,e81533b82bb54362bdb9a24b08da2d09b79c8956,1]

changed to noisy in august

kraken linux64 vs osx 10.10 (went wild around august 8th):[mozilla-inbound,432906c168f6a9dc4ac192d1615189e64151bea5,1]&series=[mozilla-inbound,17ed6349edd0e71d0df2748dc78b1b3fcea9acc9,1]

tcanvasmark linux64 vs osx 10.10 (went wild around august 8th):[mozilla-inbound,df8939dc6e77c3a3c208294042ab6b50013d3966,1]&series=[mozilla-inbound,d534dff4bab8bdb94f4e065c3451ac5ccc45af42,1]&series=[mozilla-inbound,96f77827ad799f45f0ebee409f2036c5232e6244,1]&series=[mozilla-inbound,6d7ba064eedb822927c37ce2cda8d3164ee69604,1]

one or more tests are not so bad here

tpaint linux64 vs osx 10.10 (non e10s isn't so bad, but still much worse than linux64):[mozilla-inbound,d5654ef4f31d6127e4fd9b71eb6428292489a066,1]&series=[mozilla-inbound,1830d8da0f86ac3126c8e0c131ad67cd3d634dcb,1]&series=[mozilla-inbound,e06918cf794c1b73f831684896b6da10bea4af0b,1]&series=[mozilla-inbound,9cf6695cdf39d34ae8f89862dc60913468f9e426,1]

tps linux64 vs osx 10.10 (e10s isn't so bad):[mozilla-inbound,f6fb7dcb89c26e9c7e18722fa9846f60583d2ef2,1]&series=[mozilla-inbound,ba8ccda021618c02de072c68b0b56ba251f42abd,1]&series=[mozilla-inbound,21552af6220f8727499e86b152ff30c82c79612f,1]&series=[mozilla-inbound,637a7f061cf5e18c4a14cf10f342b19a345f8e3c,1]

a fairly stable test...glterrain linux 64 vs osx 10.10:[mozilla-inbound,72e07984983c51f486a3cbb36481ae53c9240d5c,1]&series=[mozilla-inbound,e3a940146efd0882135b78a56a9abc31f6bf6e97,1]

tscrollx linux64 vs osx 10.10:[mozilla-inbound,722ccaa15401d2d8169d9c743c139e96314e27e9,1]&series=[mozilla-inbound,a4041fd2ab76bc5d15000f741bfcd0db7a66651e,1]&series=[mozilla-inbound,f58e3f07b738cd3393275d04d37b3333622e0fb7,1]&series=[mozilla-inbound,d1d4c7dc4c8e34bf20e8b723aea64756823211ff,1]

tsvgr_opacity linux64 vs osx 10.10:[mozilla-inbound,fc1e3928bd298fb15c68aa5868030e3f66d77eb9,1]&series=[mozilla-inbound,81f7df5cebc967fe0bf62c7a33577fea5d970f0e,1]&series=[mozilla-inbound,6981e256ea8173cdb53dec0741ec05e8ace13f30,1]&series=[mozilla-inbound,505e97c4c50669524afd8d210471ceddf62cfe2a,1]
Comment 9 Joel Maher (:jmaher) 2015-09-09 13:06:41 PDT
15 noisy tests and 5 not so noisy ones.  This could be a scale issue, but I suspect the 5 more stable tests have a chance to be actionable.

Sadly, I cannot come up with any common thread between these tests. 3/4 are pure noise; I am leaning towards turning things off except for try until someone has time to investigate.
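(Editorial sketch: the "scale issue" mentioned above - raw stddevs are not comparable across tests whose results differ by orders of magnitude - is usually handled with the coefficient of variation, stddev divided by mean. The sample values below are invented for illustration.)

```python
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation: stddev as a fraction of the mean (scale-free)."""
    return stdev(values) / mean(values)

# A test with a large raw stddev but small relative noise, vs the reverse.
kraken  = [1500.0, 1520.0, 1490.0, 1510.0]  # big numbers, ~1% spread
tresize = [20.0, 24.0, 17.0, 23.0]          # small numbers, ~15% spread

print(f"kraken  stddev={stdev(kraken):.1f} cv={cv(kraken):.3f}")
print(f"tresize stddev={stdev(tresize):.1f} cv={cv(tresize):.3f}")
```

On this scale-free measure the "stable-looking" big-number test can turn out far less noisy than its raw stddev suggests.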
Comment 10 Joel Maher (:jmaher) 2015-09-09 13:08:01 PDT
:avih, do you have further thoughts here?
Comment 11 Avi Halachmi (:avih) 2015-09-09 13:51:26 PDT
w00t! thanks for the links!

So, I just looked at all of them. Some comments:

1. I think kraken and canvasmark were always noisy, but in order to notice it you have to zoom in - the same as with v8_7.

2. The tests which you described as "not noisy" still have similar noise to the other tests IMO.

3. Except for webgl (which is almost suspiciously stable for my taste), all the tests exhibit similar (but not identical) "noise patterns" on OS X 10.10.

4. At all (or most?) of the places where I visually noticed a change in the linux graphs, I could also clearly see a similar change in the osx 10.10 graph - despite the noise.

So my assessment is that despite the clearly higher than desirable noise level in 10.10, performance changes are still clearly visible with enough retriggers/data-points.

The graphs are definitely not useless IMO, but also clearly pose a real challenge for automatic regression detection systems or for comparisons in general where not many data points are available.

Not sure how we should handle it further (other than the obvious investigation of the noise source - I'd guess that it's the same reason for the noise on all tests), but I'd feel bad to throw away graphs which IMO clearly have value, only because their noise level poses a challenge to automated detection systems...

Ideas? thoughts?
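(Editorial sketch: the claim that changes stay visible "with enough retriggers/data-points" follows from the standard error of the mean shrinking as 1/sqrt(n). A rough signal-to-noise ratio on two sets of retriggered runs, with invented numbers, looks like this.)

```python
from math import sqrt
from statistics import mean, stdev

def mean_and_sem(values):
    """Sample mean and its standard error (stddev / sqrt(n))."""
    return mean(values), stdev(values) / sqrt(len(values))

base = [250, 260, 240, 255, 245, 252, 258, 243, 249, 256]  # noisy baseline runs
new  = [262, 270, 255, 268, 258, 264, 272, 257, 261, 269]  # ~5% slower runs

(mb, sb), (mn, sn) = mean_and_sem(base), mean_and_sem(new)
# Signal-to-noise: difference of means over the combined standard error.
z = (mn - mb) / sqrt(sb**2 + sn**2)
print(f"delta={mn - mb:.1f}ms z={z:.1f}")
```

With 10 points per side the ~5% shift stands well clear of the noise (z above 2), even though individual runs overlap heavily; with only 1-2 points per side the same shift would be indistinguishable from noise, which is exactly the compare-view problem.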
Comment 12 Joel Maher (:jmaher) 2015-11-27 03:45:57 PST
a huge improvement in osx numbers after upgrading to 10.10.5:

This lines up perfectly with the switch to the new r7 hardware and the os upgrade from 10.10.2 to 10.10.5.  Yay for upgrades.  We should understand why this happened if possible.
Comment 13 William Lachance (:wlach) (use needinfo!) 2015-11-30 12:01:40 PST
(In reply to Joel Maher (:jmaher) from comment #12)
> This lines up perfectly with the switch to the new r7 hardware and the os
> upgrade from 10.10.2 to 10.10.5.  Yay for upgrades.  We should understand
> why this happened if possible.

I asked about this in
Comment 14 Joel Maher (:jmaher) 2015-11-30 13:12:06 PST
I would like to push to try with two identical changesets and see what noise comes out of this.  Ideally with 6 data points each, we could analyze that and maybe another with 12 data points each.
Comment 15 Joel Maher (:jmaher) 2015-12-01 13:30:58 PST
and I have done this:

Still waiting on g1/g2 results to finish up and post. So far we have 8 'noisy data points' out of 40, and 2 of them might go away when we get full data.
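(Editorial sketch of one way to get an "N noisy points out of M" tally from two pushes of identical code: pair the results per test and flag pairs whose relative difference exceeds a threshold. Test names, values, and the 2% threshold are all invented for illustration.)

```python
def noisy_points(push_a, push_b, threshold=0.02):
    """Return tests whose two identical-code runs differ by more than threshold
    (relative to the midpoint of the two values)."""
    flagged = []
    for test in push_a:
        a, b = push_a[test], push_b[test]
        if abs(a - b) / ((a + b) / 2) > threshold:
            flagged.append(test)
    return flagged

# Two try pushes of the same changeset; any difference here is pure noise.
push_a = {"tart": 5.1, "tpaint": 280.0, "tps": 31.0, "glterrain": 2.30}
push_b = {"tart": 5.5, "tpaint": 281.0, "tps": 31.2, "glterrain": 2.31}
print(noisy_points(push_a, push_b))
```

Since the code is identical, every flagged test is by construction a false positive, which makes this a direct measurement of the false-positive rate the bug is about.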
