Bug 1253341 (Closed) - Opened 8 years ago, Closed 7 years ago

Run duplicate Talos jobs in AWS for Linux

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: selenamarie, Unassigned)

References

Details

Attachments

(7 files, 4 obsolete files)

47 bytes, text/x-github-pull-request
Details | Review
58 bytes, text/x-review-board-request
wlach
: review+
Details
1.94 KB, patch
kmoir
: review+
Details | Diff | Splinter Review
4.23 KB, patch
kmoir
: review+
Details | Diff | Splinter Review
58 bytes, text/x-review-board-request
rail
: review+
Details
58 bytes, text/x-review-board-request
rail
: review+
Details
1.62 KB, patch
Callek
: review+
Details | Diff | Splinter Review
We'd like to run a 2 week test to see if AWS results are "stable enough" for us to catch regressions. 

Suggesting Linux and Windows 7 if possible.

Please create additional bugs as needed for perfherder and other related changes.
:wlach, as we would be running jobs in parallel on a system which could be temporary, should we create a new platform, i.e. 'Linux64_aws', 'Windows7_aws' ?  Or would you rather we annotate the tests themselves i.e. 'tp5o summary opt [aws]', etc.
Flags: needinfo?(wlachance)
(In reply to Joel Maher (:jmaher) from comment #1)
> :wlach, as we would be running jobs in parallel on a system which could be
> temporary, should we create a new platform, i.e. 'Linux64_aws',
> 'Windows7_aws' ?  Or would you rather we annotate the tests themselves i.e.
> 'tp5o summary opt [aws]', etc.

A new platform would be ideal.

This probably goes without saying, but:

1. Please ask me for feedback (f?) on any patches related to this.
2. Please don't roll this out into production without consulting me first; we could easily corrupt our performance data if we're not careful.

Excited to see this go forward!
Flags: needinfo?(wlachance)
:catlee, as we do this in releng buildbot scheduling, I would like to make sure we can hack the platform.  In the past we did this for e10s and added a .e to the machine running the tests.  This was done at the talos level, so it would show up properly in graph server as e10s.

If we need to, I can land a talos change and we can inspect the machine name and change anything as needed.  For example we run a command like this:
/builds/slave/test/build/venv/bin/python /builds/slave/test/build/tests/talos/talos/run_tests.py --branchName Mozilla-Inbound-Non-PGO --suite chromez --executablePath /builds/slave/test/build/application/firefox/firefox --symbolsPath https://queue.taskcluster.net/v1/task/Rx4PufsSRWOMGDIXdronnw/artifacts/public/build/firefox-47.0a1.en-US.linux-x86_64.crashreporter-symbols.zip --title talos-linux64-ix-010 --webServer localhost --log-tbpl-level=debug --log-errorsummary=/builds/slave/test/build/blobber_upload_dir/chromez_errorsummary.log --log-raw=/builds/slave/test/build/blobber_upload_dir/chromez_raw.log

you can see '--title talos-linux64-ix-010', I could parse that for:
talos-linux64-ix.* == buildbot (leave alone)
talos-linux64-spot.* == aws (change what I need to)
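As a rough sketch of what that title-based branching could look like (a hypothetical helper, not the actual mozharness change; the prefixes are the ones listed above, everything else is illustrative):

    import re

    # Hypothetical helper: map the buildbot machine title to a platform suffix.
    # talos-linux64-ix-*   -> buildbot hardware, leave the platform alone
    # talos-linux64-spot-* -> AWS spot instance, annotate the platform
    def platform_suffix_from_title(title):
        if re.match(r'^talos-linux64-ix-', title):
            return ''
        if re.match(r'^talos-linux64-spot-', title):
            return '_aws'
        return ''

    # platform_suffix_from_title('talos-linux64-ix-010')   -> ''
    # platform_suffix_from_title('talos-linux64-spot-123') -> '_aws'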

What is interesting is that for talos data we parse the info from the job information- in that case we might need to adjust the buildbot properties file to change the buildername from:
"Ubuntu HW 12.04 x64 mozilla-inbound talos chromez"

to:
"Ubuntu HW 12.04 x64 AWS mozilla-inbound talos chromez"


that might be all that is needed for treeherder to associate this as a different platform.
What if it were "Ubuntu VM 12.04 x64 mozilla-inbound talos chromez"?

Do we need to munge the machine name now that graph server is dead?
I like Ubuntu VM; we don't need graph server bits to change. This is going to be a question of how treeherder/perfherder will handle it.

Unfortunately, I don't think that would translate into a different platform inside of perfherder; we magically transform "ubuntu hw 12.04 x64" -> linux64.

:wlach, what could we do to facilitate this experiment inside of treeherder/perfherder?
Flags: needinfo?(wlachance)
So if we're doing this through buildbot, I'm pretty sure we would just need to add a new translator from buildername -> platform name in treeherder's etl layer:

https://github.com/mozilla/treeherder/blob/master/treeherder/etl/buildbot.py#L66

Although I thought we were going to do this through taskcluster? In that case, I would guess we would configure the platform name on the taskcluster end of things.

If it does make sense to go this route, let's just **make sure** to test this out on try before deploying it to production (and also that both the treeherder production and stage instances are updated with the changes).
Flags: needinfo?(wlachance)
:catlee, which route are we going to take here: buildbot or taskcluster?  If buildbot, let's add a case to the treeherder ETL sooner rather than later.
Flags: needinfo?(catlee)
We decided earlier we were going to attempt buildbot.
Flags: needinfo?(catlee)
for the etl we have:
    {
        'regex': re.compile(r'^(?:Linux|Ubuntu).*64 Mulet', re.IGNORECASE),
        'attributes': {
            'os': 'linux',
            'os_platform': 'mulet-linux64',
            'arch': 'x86_64',
        }
    },
    {
        'regex': re.compile(r'(?:linux|ubuntu).*64.+|dxr', re.IGNORECASE),
        'attributes': {
            'os': 'linux',
            'os_platform': 'linux64',
            'arch': 'x86_64',
        }
    },


the problem is that if we match "ubuntu vm 12.04..." it will change the platform in treeherder for all the unit tests.  I would like to add something like:
    {
        'regex': re.compile(r'(?:linux|ubuntu) VM.*64.+|dxr', re.IGNORECASE),
        'attributes': {
            'os': 'linux',
            'os_platform': 'linux64-vm',
            'arch': 'x86_64',
        }
    },


maybe instead we special case the hardware:
    {
        'regex': re.compile(r'(?:linux|ubuntu) HW.*64.+|dxr', re.IGNORECASE),
        'attributes': {
            'os': 'linux',
            'os_platform': 'linux64-hw',
            'arch': 'x86_64',
        }
    },
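For reference, a simplified sketch of why the placement of the new entry matters, assuming the ETL walks this list top-down and uses the first regex that matches (illustrative only, not the actual treeherder module):

    import re

    # Simplified first-match-wins buildername -> platform lookup.
    # The more specific VM entry must come before the generic linux64
    # entry, or the generic regex will swallow everything.
    PLATFORMS = [
        {
            'regex': re.compile(r'(?:linux|ubuntu) VM.*64', re.IGNORECASE),
            'attributes': {'os': 'linux', 'os_platform': 'linux64-vm', 'arch': 'x86_64'},
        },
        {
            'regex': re.compile(r'(?:linux|ubuntu).*64.+|dxr', re.IGNORECASE),
            'attributes': {'os': 'linux', 'os_platform': 'linux64', 'arch': 'x86_64'},
        },
    ]

    def platform_for(buildername):
        for entry in PLATFORMS:
            if entry['regex'].search(buildername):
                return entry['attributes']['os_platform']
        return 'unknown'

    # platform_for("Ubuntu VM 12.04 x64 try talos chromez")             -> 'linux64-vm'
    # platform_for("Ubuntu HW 12.04 x64 mozilla-inbound talos chromez") -> 'linux64'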


:wlach, do you have any further thoughts here?
Flags: needinfo?(wlachance)
keeping this as Linux, as win7 isn't ready
Summary: Run duplicate Talos jobs in AWS for Linux (and maybe Windows 7) → Run duplicate Talos jobs in AWS for Linux
So my initial thought is that we might *actually* want to distinguish linux64-vm from linux64-hw for the unit tests as well, in which case your change may well be desirable.

CC'ing Ed Morley, who is basically the domain expert here, in case he wants to comment.
Flags: needinfo?(wlachance)
Attachment #8728093 - Flags: review?(emorley)
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

The tests are failing, have commented on the PR
Attachment #8728093 - Flags: review?(emorley) → review-
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

updated pull request after verifying tests run locally.
Attachment #8728093 - Flags: review- → review?(emorley)
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

it appears that ./runtests.sh doesn't run the slow tests; it failed in Travis, so I'm looking into it
Attachment #8728093 - Flags: review?(emorley)
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

ok, all checks pass now, and the tests account for the different types!
Attachment #8728093 - Flags: review?(emorley)
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

This changes the behaviour of the existing jobs; I don't think we want them to change platform/row (I believe this will break lots of things downstream).

I'd thought this new platform was just a temporary thing? If so, the above is particularly true.

I'm wondering if a new group would actually be preferable here?
Attachment #8728093 - Flags: review?(emorley) → review-
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

Happy to defer to Will here.

Please can you check this won't break the visibility profiles (or work with them to get them fixed), or the signature tables (given the bugs we have in them), and also update OrangeFactor for the new platform names.
Attachment #8728093 - Flags: review- → review?(wlachance)
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

we are not going to create a new treeherder platform, just a different framework for talos.
Attachment #8728093 - Flags: review?(wlachance)
Comment on attachment 8729113 [details]
MozReview Request: Bug 1253341 - support --framework for talos. r?wlach

https://reviewboard.mozilla.org/r/39247/#review35941

::: testing/mozharness/mozharness/mozilla/testing/talos.py:174
(Diff revision 1)
> -            junk, junk, opts = self.buildbot_config['sourcestamp']['changes'][-1]['comments'].partition('mozharness:')
> +                junk, junk, opts = self.buildbot_config['sourcestamp']['changes'][-1]['comments'].partition('mozharness:')

This looks unintentional.

BTW, you probably didn't write this but it's preferable to use "_" instead of a bogus variable like "junk"

::: testing/mozharness/mozharness/mozilla/testing/talos.py:245
(Diff revision 1)
> +            if kw_options['title'].startswith('tst-linux64-spot'):

This seems a little brittle (since it's specific to spot instances); is there no more definitive way of identifying that we're running on aws?
Attachment #8729113 - Flags: review?(wlachance)
the junk stuff is intentional; there was an error there and it was in the same general code I was editing, so I decided to fix it.

as for the tst-linux64-spot stuff, I am not sure of the best method to do this.  This method will work, but it is hacky and not scalable.  :catlee, can you come up with a better option here?  Maybe a buildbot_property from buildprops.json?
Flags: needinfo?(catlee)
maybe a property, or checking to see if we're in a VM, or seeing if the AWS metadata service answers (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html#instancedata-data-retrieval).
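As a rough illustration of the metadata-service option (the endpoint is the standard EC2 one from the linked docs; the helper itself is just a sketch, not an actual mozharness patch):

    import urllib2

    def running_on_aws(timeout=1):
        """Return True if the EC2 instance metadata service answers.

        On EC2 the link-local address 169.254.169.254 serves instance
        metadata; anywhere else the request fails quickly via the timeout.
        """
        try:
            urllib2.urlopen(
                'http://169.254.169.254/latest/meta-data/instance-id',
                timeout=timeout)
            return True
        except Exception:
            return False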

How much effort do you want to put in for this experiment?
Flags: needinfo?(catlee)
:catlee, if we do this experiment on win7/win10 in AWS in the future, we would have to hack this up again.  If the existing approach is by far the simpler option, I would vote for it; if making it more scalable with a property or config option is not much more work, then I would vote to go that route.

wlach, any thoughts?
can we get more info, opinions here?
Flags: needinfo?(wlachance)
Flags: needinfo?(catlee)
My main concern is the aws results getting mixed in with the non-aws results. This could happen with the current patch if the machine name changes to something else (but is still running on aws). So yeah, I'd definitely prefer a property or config option here.
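To make the property option concrete, something along these lines could work (purely hypothetical: the 'instance_type' property name is made up for illustration and would have to be set explicitly by releng on the AWS pools; only the usual buildprops.json shape with a 'properties' dict is assumed):

    import json

    def is_aws_from_buildprops(path='buildprops.json'):
        # buildprops.json carries the buildbot properties for the job;
        # 'instance_type' is an illustrative, made-up property name.
        with open(path) as f:
            props = json.load(f).get('properties', {})
        return bool(props.get('instance_type'))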
Flags: needinfo?(wlachance)
Attached patch talos-aws-buildbot-configs.diff (obsolete) — Splinter Review
Flags: needinfo?(catlee)
Attachment #8730751 - Flags: review?(kmoir)
These two patches add mozilla-inbound talos jobs to the ubuntu64_vm (AWS) machines.
catlee, is there a way to get a buildbot property set to indicate HW vs VM?
Attachment #8730752 - Flags: review?(kmoir) → review+
Attachment #8730751 - Flags: review?(kmoir) → review+
Comment on attachment 8729113 [details]
MozReview Request: Bug 1253341 - support --framework for talos. r?wlach

as per conversation on IRC, this temporary hack will work for this experiment, but any future versions of this experiment will involve taskcluster, which passes config variables much differently; let's do this with the hack on the machine name.  I can hack elsewhere as needed, just let me know of any concerns.
Attachment #8729113 - Flags: review?(wlachance)
sorry for the churn. I think it's probably better to enable on try first to make sure we're getting data submitted properly before enabling on inbound. When the time comes to enable on inbound, we can add 'mozilla-inbound' to the set of branches.
Attachment #8730751 - Attachment is obsolete: true
Attachment #8730803 - Flags: review?(kmoir)
Attachment #8730803 - Flags: review?(kmoir) → review+
if this is try only, that is sort of OK; we need the rest of the stuff hooked up, pending wlach's review.
Attachment #8729113 - Flags: review?(wlachance) → review+
Comment on attachment 8729113 [details]
MozReview Request: Bug 1253341 - support --framework for talos. r?wlach

https://reviewboard.mozilla.org/r/39247/#review36723

I guess this can land! I still find the method of determining whether we're on aws or not somewhat terrifying: I can just see someone changing a setting and unknowingly breaking the "aws" detection. So ok, with the proviso that this needs to be cleaned up in the near future.
Attachment #8730887 - Attachment is obsolete: true
(In reply to Release Engineering SlaveAPI Service from comment #41)
> In production: https://hg.mozilla.org/build/buildbot-configs/rev/2d6158d681e0

This got reverted by catlee.
http://hg.mozilla.org/build/buildbot-configs/rev/7c463958098d

That said, I think the real cause was found and fixed in the meantime, so I think it's OK to re-land now?
Attachment #8732334 - Attachment is obsolete: true
Attachment #8732334 - Flags: review?(wlachance)
Attachment #8732342 - Flags: review?(wlachance)
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/a90b7605df536f5383d9c15b8b18c3ca35f57fe1
Bug 1253341 - Run duplicate Talos jobs in AWS for Linux; add performance framework id to perfherder

https://github.com/mozilla/treeherder/commit/8b6bae1467ba9a93630853ff431e0e090e13cf0a
Merge pull request #1358 from jmaher/talos

Bug 1253341 - Run duplicate Talos jobs in AWS for Linux. r=wlach
Attachment #8732342 - Flags: review?(wlachance) → review+
this is live and we have results:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c68f24e3df3d

here is a simple breakdown of each test:
a11y: 100% regression (e10s: 200% regression)
canvasmark: 50% regression (opt|e10s)
cart: 100% regression (100% regression)
damp: 150% regression (e10s times out most of the time)
dromaeo_css: 50% regression (opt|e10s)
dromaeo_dom: 40% regression (opt|e10s)
glterrain: 200% regression (opt|e10s)
kraken: 100% regression (e10s: 125% regression)
sessionrestore: 150% regression (e10s: 200% regression)
sessionrestore_no_auto_restore: 150% regression (e10s: 200% regression)
tabpaint: 150% regression (e10s: 200% regression)
tart: 125% regression (e10s: 200% regression)
tpaint: 150% regression (e10s: 200% regression)
tp5o: 150% regression (e10s: didn't run at all)
tp5o_scroll: 50% regression (e10s: 200% regression)
tps: 200% regression (e10s: 250% regression)
tresize: 50% regression (e10s: 100% regression)
tscrollx: 50% regression (e10s: 200% regression)
tsvg_opacity: 40% regression (e10s: 60% regression)
tsvgx: 90% regression (e10s: 125% regression)
ts_paint: 150% regression (e10s: 300% regression)
xperf: win7 only


the main timeouts are:
cart (opt|e10s) - too much for production
damp e10s - almost perma
tp5 (opt|e10s) - e10s is perma


in addition we are not getting the framework set for these jobs, so we need to look at that again.
are the numbers stable? or is it too early to tell?
I think my push didn't include the latest tip and missed out on my framework change; I can assess the stability.  Although if we are looking for stability, we should pick the AMI type that we want and focus on that.  These are too slow right now; tp5 fails in e10s, and that isn't specific to graphics.  The runtime of the jobs is almost twice as long as on the hardware machines.
ok, my latest run is posting the framework correctly, but perfherder is not ingesting it properly:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=00dc870ba158

we are getting closer, just not there yet.
:wlach, can you help us figure out why this isn't being parsed properly in perfherder?

here is a link to an example log:
http://archive.mozilla.org/pub/firefox/try-builds/jmaher@mozilla.com-00dc870ba158f067ab80d89610eab5b74e1f08cd/try-linux64/try_ubuntu64_vm_test-chromez-bm52-tests1-linux64-build3.txt.gz
Flags: needinfo?(wlachance)
(In reply to Joel Maher (:jmaher) from comment #55)
> :wlach, can you help us figure out why this isn't being parsed properly in
> perfherder?

The changes to add the talos-aws fixture have not been deployed to production yet, I can see perf numbers fine on stage though:

https://treeherder.allizom.org/#/jobs?repo=try&revision=00dc870ba158&selectedJob=18885641
Flags: needinfo?(wlachance)
wrt slowness, I wonder whether a compute-optimized instance wouldn't be a better choice for talos jobs: https://aws.amazon.com/ec2/instance-types/
yeah, we need to come up with a plan for figuring out a faster AMI; let's see if stability and noise are reasonable.  Right now I am not clear on how to query just the spot instance data without a lot of manual work.

now that we are parsing this into a different framework, how do we see it on the graphs or query it with the perfherder api?
Flags: needinfo?(wlachance)
Depends on: 1258403
(In reply to Joel Maher (:jmaher) from comment #58)
> yeah, we need to come up with a plan for figuring out a faster ami, lets see
> if stability and noise is reasonable.  Right now I am not clear on how to
> query just the spot instance data without a lot of manual work.
> 
> now that we are parsing this into a different framework, how do we see it on
> the graphs or query it with the perfherder api?

You can pinpoint the AMI ones on try by looking for the slow datapoints LOL.

Everything is currently mashed together in the graph data chooser (see bug 1230652) but once we have this running regularly it should be possible to query the talos-aws data separately using its unique signature... actually, wait, it just occurred to me that this case isn't supported (multiple signatures that are the same in the repository except with differing frameworks). Filed bug 1258403 blocking this one (please don't deploy anything outside of try until that is fixed).
Flags: needinfo?(wlachance)
Chris, can you turn this off for now? I think we have enough data from try to consider going forward, and the double results are confusing developers, see bug 1260926 (part of that is probably Perfherder's fault, but still).
Flags: needinfo?(catlee)
Attachment #8737183 - Flags: review?(rail) → review+
Comment on attachment 8737183 [details]
MozReview Request: Bug 1253341: Disable duplicate talos jobs in AWS r=rail

https://reviewboard.mozilla.org/r/41081/#review40379
No longer depends on: 1258403
I forgot to mention this earlier, so it's my fault, but please don't turn this on again before bug 1260926 is fixed.
Depends on: 1260926
Depends on: 1230652
(In reply to Chris AtLee [:catlee] from comment #64)
> https://hg.mozilla.org/build/buildbot-configs/rev/
> c06fa02837e998eae6fff8c00e5edd503c7f7b59
> Bug 1253341: Disable duplicate talos jobs in AWS r=rail

Looks like we're still scheduling jobs despite this, and launching spot instances to handle them.
I am fine with us continuing to do so, we should have it sorted out on the treeherder side this coming week.
(In reply to Nick Thomas [:nthomas] from comment #65)
> (In reply to Chris AtLee [:catlee] from comment #64)
> > https://hg.mozilla.org/build/buildbot-configs/rev/
> > c06fa02837e998eae6fff8c00e5edd503c7f7b59
> > Bug 1253341: Disable duplicate talos jobs in AWS r=rail
> 
> Looks like we're still scheduling jobs despite this, and launching spot
> instances to handle them.

Where do you see these?
Flags: needinfo?(catlee)
Over the weekend there was a nagios backlog alert with jobs older than 24 hours. They were all on try, on at least 10 revisions, so it seemed likely they were scheduled by buildbot rather than by mozci or some other source. I don't see any now, though, so that was probably incorrect. Or restarting the schedulers ~24 hours ago may have fixed something up.
the m1.medium instance types are too slow and some talos tests are timing out.  We want to redo this experiment (ideally in the next week) with a large instance type.  I will need some guidance on instance types to choose from, ideally one that we have already used for other test-related jobs.
FTR, next week is the RC build week, but this shouldn't be a big problem.
thanks for the heads-up, rail.  Maybe we should re-enable this this week instead.

:rail, any chance you have a list of different ami types we currently use in automation for test jobs?
Flags: needinfo?(rail)
Comment on attachment 8740550 [details]
MozReview Request: Bug 1253341 - Run duplicate Talos jobs in AWS for Linux (ubuntu64_vm_lnx_large). r?rail

https://reviewboard.mozilla.org/r/45833/#review42433
Attachment #8740550 - Flags: review?(rail) → review+
this doesn't seem to be working; I wonder if ubuntu64_vm_large is not the right target to use?

:rail, do you have any ideas how I might be able to figure this out?
Flags: needinfo?(rail)
I HATE config.py! :) 

Not sure how the current approach worked before; I have vague memories that you shouldn't try to add platforms in loops. Instead you need to define them globally and then remove them from all branches except the desired ones.

This approach may be a bit different for tests config.py because we have slave_platforms and talos_slave_platforms...

Maybe something like https://gist.github.com/rail/cfebf537c4cee1ee0d0c281fd75ee867 helps? To verify, you may need to dump all builders before and after. I haven't done this for ages and I'm not sure I can reproduce it now. If you don't know how to dump the builders, ask the buildduty folks - I know they have this setup ready! :)
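A very rough illustration of the pattern rail describes (define the platform globally, then strip it from every branch that shouldn't get it); the structure below is heavily simplified and hypothetical, not the real mozilla-tests/config.py:

    # Hypothetical, heavily simplified shape of a tests config: every branch
    # starts with the full platform list, then unwanted platforms are removed.
    ALL_TALOS_SLAVE_PLATFORMS = ['ubuntu64_hw', 'ubuntu64_vm_lnx_large']

    BRANCHES = {
        'try': {'talos_slave_platforms': list(ALL_TALOS_SLAVE_PLATFORMS)},
        'mozilla-inbound': {'talos_slave_platforms': list(ALL_TALOS_SLAVE_PLATFORMS)},
    }

    # Keep the AWS platform on try only; strip it from every other branch.
    for name, branch in BRANCHES.items():
        if name != 'try' and 'ubuntu64_vm_lnx_large' in branch['talos_slave_platforms']:
            branch['talos_slave_platforms'].remove('ubuntu64_vm_lnx_large')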
Flags: needinfo?(rail)
Comment on attachment 8740550 [details]
MozReview Request: Bug 1253341 - Run duplicate Talos jobs in AWS for Linux (ubuntu64_vm_lnx_large). r?rail

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45833/diff/1-2/
Attachment #8740550 - Attachment description: MozReview Request: Bug 1253341 - Run duplicate Talos jobs in AWS for Linux (large instances). r?rail → MozReview Request: Bug 1253341 - Run duplicate Talos jobs in AWS for Linux (ubuntu64_vm_lnx_large). r?rail
builders added with ubuntu64_vm_lnx_large:
Builders added:
+ Ubuntu VM large 12.04 x64 try talos chromez
+ Ubuntu VM large 12.04 x64 try talos chromez-e10s
+ Ubuntu VM large 12.04 x64 try talos dromaeojs
+ Ubuntu VM large 12.04 x64 try talos dromaeojs-e10s
+ Ubuntu VM large 12.04 x64 try talos g1
+ Ubuntu VM large 12.04 x64 try talos g1-e10s
+ Ubuntu VM large 12.04 x64 try talos g2
+ Ubuntu VM large 12.04 x64 try talos g2-e10s
+ Ubuntu VM large 12.04 x64 try talos g3
+ Ubuntu VM large 12.04 x64 try talos g3-e10s
+ Ubuntu VM large 12.04 x64 try talos other
+ Ubuntu VM large 12.04 x64 try talos other-e10s
+ Ubuntu VM large 12.04 x64 try talos svgr
+ Ubuntu VM large 12.04 x64 try talos svgr-e10s
+ Ubuntu VM large 12.04 x64 try talos tp5o
+ Ubuntu VM large 12.04 x64 try talos tp5o-e10s
results are in-

ix: https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=e93ebef962dd&newProject=try&newRevision=0ad2d3cc82086352841c758010ff08f0e73714fa&framework=1&showOnlyImportant=0
** 4 tests >=1% variance with 12 data points before/after
** 50 total data points


aws: https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=e93ebef962dd&newProject=try&newRevision=0ad2d3cc82086352841c758010ff08f0e73714fa&framework=7&showOnlyImportant=0
** 25 tests with >=1% variance with 12 data points before/after
** 3 tests >= 5% variance
** 41 total datapoints, no data for tp5o_scroll*, tscrollx*, and tests in the same job (cart/tart/tsvgx/svg_opacity e10s flavor) and glterrain/tp5o_scroll both opt and opt+e10s.


The overall values and runtimes are closer to the hardware, but the data is quite noisy.  As it stands, I wouldn't want to use these specific machines for tracking performance.

:catlee, these are running on the linux_emulator64 instances, which are linux large.  I am not sure if we can easily try another instance type?
Flags: needinfo?(catlee)
It looks like tst-emulator64 is a mix of c3.xlarge and m3.xlarge, which may be contributing to the noisy results. We should try again with only a single instance type... I'm not sure how to do that easily, though.
Flags: needinfo?(catlee)
good point, who should I coordinate with sometime this week to figure that out?
we could restrict tst-emulator64 to run on only c3.xlarge as an experiment.
I couldn't find code in the normal repos to adjust this- with a pointer, I can write a patch.
Depends on: 1266439
oh no, I pushed to try and got emulators only, not duplicate jobs.

can we ensure that https://bugzilla.mozilla.org/show_bug.cgi?id=1253341 is landed?  I don't see it here:
https://dxr.mozilla.org/build-central/source/buildbot-configs/mozilla-tests/config.py#3160

but it is in the latest source (default head).

here is a try push where I expected duplicate jobs and ended up with only emulator vm jobs:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&selectedJob=19846466

:rail, any thoughts on this?
Flags: needinfo?(rail)
DXR may be outdated, http://hg.mozilla.org/build/buildbot-configs/file/production/mozilla-tests/config.py#l3150 is a better link to verify.
Flags: needinfo?(rail)
thanks :rail!  That does show what I would expect.  In addition, all the VM jobs ran to completion and the HW jobs are pending; once we start running the first job, pulse_actions does the retrigger magic.  I just need to be patient.
I have similar results for c3.xlarge only; in this case we have a bit more stability:
aws: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0ea5439590ad&newProject=try&newRevision=0d372519e22dceaf9e7eb0612eed9c3aaaa222d7&framework=7&showOnlyImportant=0

** 15 jobs >= 1% variance with 13 data points before/after
** 2 tests >= 5% variance (tp5o_responsiveness, tresize, both 10%+)
** 41 total datapoints, no data for tp5o_scroll*, tscrollx*, and tests in the same job (cart/tart/tsvgx/svg_opacity e10s flavor) and glterrain/tp5o_scroll both opt and opt+e10s.

While this is better than the first experiment, it is still noticeably worse than the HW machines.  Looking in detail at the individual tests and distributions, I see some better and some worse, but overall the pattern of noise is roughly the same between c3.xlarge+m3.xlarge and c3.xlarge only.

I need to sync up with wlach on validating the perfherder data before getting too much further.

We should also outline what is required to test on another instance type.  I am not sure of the work required or how to stand up a new instance type- so far we have been reusing existing instances/configs.
Chatting with catlee on IRC, it seems as though our experiment with buildbot has reached its limits.  Possibilities include running on a larger host, or something to ensure we are the only VM on the box; that would be considered a dedicated machine.

What I see as next steps are to:
* look at a dedicated host machine (single vm, or cloud hardware), this would need to be done outside of buildbot though- possibly via taskcluster
* look at running taskcluster on linux hardware in our colo (use the existing docker setup, etc.)
(In reply to Joel Maher (:jmaher) from comment #91)
> I have similar results for c3.xlarge only, in this case we have a bit more
> stability:

Just as a cross-check, did catlee or someone make sure all the m3.xlarge instances had gone when your tests ran ? Sometimes they live quite a while.
thanks for checking on that Nick!  I chatted with rail the day before and he forced a cycle on the instances- then I waited about 12 hours and ran the experiment.  I have pretty high confidence it was just the one instance type.
Glad to hear you've already handled that. In other news, there are ~880 pending test jobs with 'Windows 7 VM-GFX 32-bit' at the start of the name; they look to all be unit tests across several revisions on try. There are just 19 slaves enabled in slavealloc, so they're not keeping up. Wondering if we intended to schedule unit tests as well as talos.
that is an experiment I am doing with Q, hopefully resulting in many jobs being able to run on win7.  99% of the jobs on that platform are my try pushes; we don't have current plans for talos there, but it would be worth experimenting with in the near future.  Once we really validate win7-vm as a new platform, I imagine we will have a few hundred instances up and running.
Attachment #8744973 - Flags: review?(bugspam.Callek) → review+
Now that we have run talos jobs in taskcluster on both c4large and c3large instances, we have more data.

                Delta %    avg delta %    stddev    avg stddev
ix.bb           20.02      0.4            71.59     1.43
c3xlarge.bb     80.78      1.97           205.87    5.02
c3large.tc      35.77      0.89           127.01    3.17

* delta % - sum(abs(median(new) - median(old)))
* avg delta % - average(delta %)
* stddev - max(stddev(olddata), stddev(newdata))
* avg stddev - average(stddev)
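A small sketch of how those per-test numbers could be rolled up (illustrative only; it assumes the table's 'Delta %' and 'stddev' columns are sums of the per-test values and that both are expressed as percentages of the old data, which the comment doesn't state explicitly):

    import numpy as np

    def noise_summary(per_test_data):
        """per_test_data: dict of test name -> (old_points, new_points) lists."""
        deltas, stddevs = [], []
        for old, new in per_test_data.values():
            old = np.asarray(old, dtype=float)
            new = np.asarray(new, dtype=float)
            # per-test delta %: absolute change of the median, relative to old
            deltas.append(abs(np.median(new) - np.median(old)) / np.median(old) * 100)
            # per-test stddev: the noisier of the two sides, as % of its mean
            stddevs.append(100 * max(np.std(old) / np.mean(old),
                                     np.std(new) / np.mean(new)))
        return {
            'delta %': sum(deltas),
            'avg delta %': np.mean(deltas),
            'stddev': sum(stddevs),
            'avg stddev': np.mean(stddevs),
        }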


this indicates we have roughly twice the noise on c3large running in taskcluster as we do on the IX hardware machines in buildbot.  This is much better than we saw with the buildbot experiment.

A few things to note:
* this data was collected Monday May 2, 2016- maybe this is a stable point in time for the AWS machines
* looking at the noisiest tests, they seem CPU bound; we don't have a lot of choices in CPU.  Is docker or aws influencing this?  Maybe they are helping keep it stable.
* we are not running 2 tests on c3large.tc (tp5o_scroll, tscrollx - due to hanging), and all tests in the svg job on non-e10s are not collecting data either, due to tart hanging.  With these tests involved, I would imagine the noise being similar or worse, not better.

Backing up slightly, why do we care about noise:
* more noise == more false alerts, more missed alerts for smaller issues
* more noise == more data points to collect for confidence- right now we require 24 data points for an alert, that is a lot of data and we don't need to run more jobs
* more noise == harder for developers to trust the data and verify a fix; right now we do quite well on about 80% of our test/platform combinations.

While zero noise isn't a realistic goal, doubling our existing noise is not a path forward for success.  Adding 20-25% more noise is reasonable; that is a fuzzy line, although the more noise we introduce, the more scrutiny we get.

What is next:
* consider any tweaks to the AWS c3large environment.  I am not a fan of running in AWS in general as we don't have as much control over the machines/environment, and we are already at twice the noise- still worth ruling out
* look at an on the hardware solution
** bare metal in the cloud providers
** in-house IX solution with TC worker+docker
* figure out why the *scroll tests are failing, and why tart is perma-fail here vs. intermittent on bb.

When we have an experiment that yields an acceptable noise level, I would like to run the experiment daily for 2 weeks to measure the noise over time, as well as run the tests in true parallel and see if we catch the same issues we catch with our existing automation.

:garndt, given the few options above, could you weigh in on how realistic it is for me to dive forward on one of them?  Possibly you have ideas for tweaks to the AWS machines?  Do we have a way to run on some in-house IX machines or other bare-metal cloud providers?

* this shouldn't be a drop-everything priority, but getting a plan for next steps this week would be nice.
Flags: needinfo?(garndt)
As far as bare metal providers, I spoke with one of the employees at packet.net recently which gives an API to bare metal instances and they offer per hour pricing.  To start with we could look into what it would take to just spin up these machines and keep them running with docker-worker.

We could also use existing bare metal that we have, it would just require a similar environment to what we're running with now (ubuntu w/ aufs support, video/sound kernel modules, docker 1.10, node 0.12).
Flags: needinfo?(garndt)
(In reply to Greg Arndt [:garndt] from comment #100)
> As far as bare metal providers, I spoke with one of the employees at
> packet.net recently which gives an API to bare metal instances and they
> offer per hour pricing.  To start with we could look into what it would take
> to just spin up these machines and keep them running with docker-worker.
> 
> We could also use existing bare metal that we have, it would just require a
> similar environment to what we're running with now (ubuntu w/ aufs support,
> video/sound kernel modules, docker 1.10, node 0.12).

Packet sounds great! I'll bet we could get away with using their lowest cost instance, since predictability rather than speed is our primary goal.

https://www.packet.net/bare-metal/
I think the next step would be packet.net.

The type 0 might just work for us:
https://www.packet.net/bare-metal/servers/type-0/

and it is the cheapest per hour.  I am not sure if the CPU will be adequate, but the rest of the specs seem useful.
Re: https://github.com/mozilla/treeherder/pull/1338

Is the list of buildernames in comment 80 still accurate? (And are there any more?)

I can add a testcase for these to confirm everything is working and prevent regressions.
Flags: needinfo?(jmaher)
I grafted production-only c7e341b34124 backout to default to prevent accidental merges in the future: https://hg.mozilla.org/build/buildbot-configs/rev/2302dc47cb21
hey :emorley, those buildernames are not valid anymore; we have cancelled our experiment here and are moving to hardware.  I am not sure there is much we can do about adding extra test cases.
Flags: needinfo?(jmaher)
Attachment #8732342 - Attachment is obsolete: true
Attachment #8732342 - Flags: review+
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Component: General Automation → General