Bug 1253341 (Closed) - Opened 8 years ago, Closed 7 years ago

Run duplicate Talos jobs in AWS for Linux

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: selenamarie, Unassigned)

References

Details

Attachments

(7 files, 4 obsolete files)

47 bytes, text/x-github-pull-request
Details | Review
58 bytes, text/x-review-board-request
wlach
: review+
Details
1.94 KB, patch
kmoir
: review+
Details | Diff | Splinter Review
4.23 KB, patch
kmoir
: review+
Details | Diff | Splinter Review
58 bytes, text/x-review-board-request
rail
: review+
Details
58 bytes, text/x-review-board-request
rail
: review+
Details
1.62 KB, patch
Callek
: review+
Details | Diff | Splinter Review
We'd like to run a 2 week test to see if AWS results are "stable enough" for us to catch regressions. 

Suggesting Linux and Windows 7 if possible.

Please create additional bugs as needed for perfherder and other related changes.
:wlach, as we would be running jobs in parallel on a system which could be temporary, should we create a new platform, i.e. 'Linux64_aws', 'Windows7_aws' ?  Or would you rather we annotate the tests themselves i.e. 'tp5o summary opt [aws]', etc.
Flags: needinfo?(wlachance)
(In reply to Joel Maher (:jmaher) from comment #1)
> :wlach, as we would be running jobs in parallel on a system which could be
> temporary, should we create a new platform, i.e. 'Linux64_aws',
> 'Windows7_aws' ?  Or would you rather we annotate the tests themselves i.e.
> 'tp5o summary opt [aws]', etc.

A new platform would be ideal.

This probably goes without saying, but:

1. Please ask me for feedback (f?) on any patches related to this.
2. Please don't roll this out into production without consulting me first; we could easily corrupt our performance data if we're not careful.

Excited to see this go forward!
Flags: needinfo?(wlachance)
:catlee, as we do this in releng buildbot scheduling, I would like to make sure we can hack the platform.  In the past we did this for e10s and added a .e to the machine running the tests.  This was done at the talos level, so it would show up properly in graph server as e10s.

If we need to, I can land a talos change and we can inspect the machine name and change anything as needed.  For example we run a command like this:
/builds/slave/test/build/venv/bin/python /builds/slave/test/build/tests/talos/talos/run_tests.py --branchName Mozilla-Inbound-Non-PGO --suite chromez --executablePath /builds/slave/test/build/application/firefox/firefox --symbolsPath https://queue.taskcluster.net/v1/task/Rx4PufsSRWOMGDIXdronnw/artifacts/public/build/firefox-47.0a1.en-US.linux-x86_64.crashreporter-symbols.zip --title talos-linux64-ix-010 --webServer localhost --log-tbpl-level=debug --log-errorsummary=/builds/slave/test/build/blobber_upload_dir/chromez_errorsummary.log --log-raw=/builds/slave/test/build/blobber_upload_dir/chromez_raw.log

you can see '--title talos-linux64-ix-010', I could parse that for:
talos-linux64-ix.* == buildbot (leave alone)
talos-linux64-spot.* == aws (change what I need to)
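As a rough sketch of what that title-based branching could look like (a hypothetical helper, not the actual mozharness change; the prefixes are the ones listed above, everything else is illustrative):

    import re

    # Hypothetical helper: map the buildbot machine title to a platform suffix.
    # talos-linux64-ix-*   -> buildbot hardware, leave the platform alone
    # talos-linux64-spot-* -> AWS spot instance, annotate the platform
    def platform_suffix_from_title(title):
        if re.match(r'^talos-linux64-ix-', title):
            return ''
        if re.match(r'^talos-linux64-spot-', title):
            return '_aws'
        return ''

    # platform_suffix_from_title('talos-linux64-ix-010')   -> ''
    # platform_suffix_from_title('talos-linux64-spot-123') -> '_aws'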

What is interesting is that for talos data we parse the info from the job information- in that case we might need to adjust the buildbot properties file to change the buildername from:
"Ubuntu HW 12.04 x64 mozilla-inbound talos chromez"

to:
"Ubuntu HW 12.04 x64 AWS mozilla-inbound talos chromez"


that might be all that is needed for treeherder to associate this as a different platform.
What if it were "Ubuntu VM 12.04 x64 mozilla-inbound talos chromez"?

Do we need to munge the machine name now that graph server is dead?
I like Ubuntu VM; we don't need graph server bits to change. This is going to be a question of how treeherder/perfherder will handle it.

Unfortunately, I don't think that would translate into a different platform inside of perfherder; we magically transform "ubuntu hw 12.04 x64" -> linux64.

:wlach, what could we do to facilitate this experiment inside of treeherder/perfherder?
Flags: needinfo?(wlachance)
So if we're doing this through buildbot, I'm pretty sure we would just need to add a new translator from buildername -> platform name in treeherder's etl layer:

https://github.com/mozilla/treeherder/blob/master/treeherder/etl/buildbot.py#L66

Although I thought we were going to do this through taskcluster? In that case, I would guess we would configure the platform name on the taskcluster end of things.

If it does make sense to go this route, let's just **make sure** to test this out on try before deploying it to production (and also that both the treeherder production and stage instances are updated with the changes).
Flags: needinfo?(wlachance)
:catlee, which route are we going to take here: buildbot or taskcluster?  If buildbot, let's add a case to the treeherder ETL sooner rather than later.
Flags: needinfo?(catlee)
We decided earlier we were going to attempt buildbot.
Flags: needinfo?(catlee)
for the etl we have:
    {
        'regex': re.compile(r'^(?:Linux|Ubuntu).*64 Mulet', re.IGNORECASE),
        'attributes': {
            'os': 'linux',
            'os_platform': 'mulet-linux64',
            'arch': 'x86_64',
        }
    },
    {
        'regex': re.compile(r'(?:linux|ubuntu).*64.+|dxr', re.IGNORECASE),
        'attributes': {
            'os': 'linux',
            'os_platform': 'linux64',
            'arch': 'x86_64',
        }
    },


the problem is that if we match "ubuntu vm 12.04..." it will change the platform in treeherder for all the unit tests.  I would like to add something like:
    {
        'regex': re.compile(r'(?:linux|ubuntu) VM.*64.+|dxr', re.IGNORECASE),
        'attributes': {
            'os': 'linux',
            'os_platform': 'linux64-vm',
            'arch': 'x86_64',
        }
    },


maybe instead we special case the hardware:
    {
        'regex': re.compile(r'(?:linux|ubuntu) HW.*64.+|dxr', re.IGNORECASE),
        'attributes': {
            'os': 'linux',
            'os_platform': 'linux64-hw',
            'arch': 'x86_64',
        }
    },
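For reference, a simplified sketch of why the placement of the new entry matters, assuming the ETL walks this list top-down and uses the first regex that matches (illustrative only, not the actual treeherder module):

    import re

    # Simplified first-match-wins buildername -> platform lookup.
    # The more specific VM entry must come before the generic linux64
    # entry, or the generic regex will swallow everything.
    PLATFORMS = [
        {
            'regex': re.compile(r'(?:linux|ubuntu) VM.*64', re.IGNORECASE),
            'attributes': {'os': 'linux', 'os_platform': 'linux64-vm', 'arch': 'x86_64'},
        },
        {
            'regex': re.compile(r'(?:linux|ubuntu).*64.+|dxr', re.IGNORECASE),
            'attributes': {'os': 'linux', 'os_platform': 'linux64', 'arch': 'x86_64'},
        },
    ]

    def platform_for(buildername):
        for entry in PLATFORMS:
            if entry['regex'].search(buildername):
                return entry['attributes']['os_platform']
        return 'unknown'

    # platform_for("Ubuntu VM 12.04 x64 try talos chromez")             -> 'linux64-vm'
    # platform_for("Ubuntu HW 12.04 x64 mozilla-inbound talos chromez") -> 'linux64'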


:wlach, do you have any further thoughts here?
Flags: needinfo?(wlachance)
keeping this as Linux, as win7 isn't ready
Summary: Run duplicate Talos jobs in AWS for Linux (and maybe Windows 7) → Run duplicate Talos jobs in AWS for Linux
So my initial thought is that we might *actually* want to distinguish linux64-vm from linux64-hw for the unit tests as well, in which case your change may well be desirable.

CC'ing Ed Morley, who is basically the domain expert here, in case he wants to comment.
Flags: needinfo?(wlachance)
Attachment #8728093 - Flags: review?(emorley)
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

The tests are failing, have commented on the PR
Attachment #8728093 - Flags: review?(emorley) → review-
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

updated pull request after verifying tests run locally.
Attachment #8728093 - Flags: review- → review?(emorley)
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

it appears that ./runtests.sh doesn't run the slow tests; it failed in Travis, so I'm looking into it
Attachment #8728093 - Flags: review?(emorley)
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

ok, all checks pass now, and the tests account for the different types!
Attachment #8728093 - Flags: review?(emorley)
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

This changes the behaviour of the existing jobs; I don't think we want them to change platform/row (I believe this will break lots of things downstream).

I'd thought this new platform was just a temporary thing? If so, the above is particularly true.

I'm wondering if a new group would actually be preferable here?
Attachment #8728093 - Flags: review?(emorley) → review-
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

Happy to defer to Will here.

Please can you check this won't break the visibility profiles (or work with them to get them fixed), or the signature tables (given the bugs we have in them), and also update OrangeFactor for the new platform names.
Attachment #8728093 - Flags: review- → review?(wlachance)
Comment on attachment 8728093 [details] [review]
[treeherder] jmaher:master > mozilla:master

we are not going to create a new treeherder platform, just a different framework for talos.
Attachment #8728093 - Flags: review?(wlachance)
Comment on attachment 8729113 [details]
MozReview Request: Bug 1253341 - support --framework for talos. r?wlach

https://reviewboard.mozilla.org/r/39247/#review35941

::: testing/mozharness/mozharness/mozilla/testing/talos.py:174
(Diff revision 1)
> -            junk, junk, opts = self.buildbot_config['sourcestamp']['changes'][-1]['comments'].partition('mozharness:')
> +                junk, junk, opts = self.buildbot_config['sourcestamp']['changes'][-1]['comments'].partition('mozharness:')

This looks unintentional.

BTW, you probably didn't write this but it's preferable to use "_" instead of a bogus variable like "junk"

::: testing/mozharness/mozharness/mozilla/testing/talos.py:245
(Diff revision 1)
> +            if kw_options['title'].startswith('tst-linux64-spot'):

This seems a little brittle (since it's specific to spot instances); is there no more definitive way of identifying that we're running on aws?
Attachment #8729113 - Flags: review?(wlachance)
the junk stuff is intentional; there was an error there and it was in the same general code I was editing, so I decided to fix it.

as for the tst-linux64-spot stuff, I am not sure of the best method to do this.  This method will work, but it is hacky and not scalable.  :catlee, can you come up with a better option here?  Maybe a buildbot_property from buildprops.json?
Flags: needinfo?(catlee)
maybe a property, or checking to see if we're in a VM, or seeing if the AWS metadata service answers (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html#instancedata-data-retrieval).
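As a rough illustration of the metadata-service option (the endpoint is the standard EC2 one from the linked docs; the helper itself is just a sketch, not an actual mozharness patch):

    import urllib2

    def running_on_aws(timeout=1):
        """Return True if the EC2 instance metadata service answers.

        On EC2 the link-local address 169.254.169.254 serves instance
        metadata; anywhere else the request fails quickly via the timeout.
        """
        try:
            urllib2.urlopen(
                'http://169.254.169.254/latest/meta-data/instance-id',
                timeout=timeout)
            return True
        except Exception:
            return False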

How much effort do you want to put in for this experiment?
Flags: needinfo?(catlee)
:catlee, if we do this experiment on win7/win10 in AWS in the future, we would have to hack this up again.  If the existing approach is by far the simpler option, I would vote for it; if making it more scalable with a property or config option is not much more work, then I would vote to go that route.

wlach, any thoughts?
can we get more info, opinions here?
Flags: needinfo?(wlachance)
Flags: needinfo?(catlee)
My main concern is the aws results getting mixed in with the non-aws results. This could happen with the current patch if the machine name changes to something else (but is still running on aws). So yeah, I'd definitely prefer a property or config option here.
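To make the property option concrete, something along these lines could work (purely hypothetical: the 'instance_type' property name is made up for illustration and would have to be set explicitly by releng on the AWS pools; only the usual buildprops.json shape with a 'properties' dict is assumed):

    import json

    def is_aws_from_buildprops(path='buildprops.json'):
        # buildprops.json carries the buildbot properties for the job;
        # 'instance_type' is an illustrative, made-up property name.
        with open(path) as f:
            props = json.load(f).get('properties', {})
        return bool(props.get('instance_type'))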
Flags: needinfo?(wlachance)
Attached patch talos-aws-buildbot-configs.diff (obsolete) — Splinter Review
Flags: needinfo?(catlee)
Attachment #8730751 - Flags: review?(kmoir)
These two patches add mozilla-inbound talos jobs to the ubuntu64_vm (AWS) machines.
catlee, is there a way to get a buildbot property set to indicate HW vs VM?
Attachment #8730752 - Flags: review?(kmoir) → review+
Attachment #8730751 - Flags: review?(kmoir) → review+
Comment on attachment 8729113 [details]
MozReview Request: Bug 1253341 - support --framework for talos. r?wlach

as per conversation on IRC, this temporary hack will work for this experiment, but any future versions of this experiment will involve taskcluster, which passes config variables much differently; let's do this with the hack on the machine name.  I can hack elsewhere as needed, just let me know of any concerns.
Attachment #8729113 - Flags: review?(wlachance)
sorry for the churn. I think it's probably better to enable on try first to make sure we're getting data submitted properly before enabling on inbound. When the time comes to enable on inbound, we can add 'mozilla-inbound' to the set of branches.
Attachment #8730751 - Attachment is obsolete: true
Attachment #8730803 - Flags: review?(kmoir)
Attachment #8730803 - Flags: review?(kmoir) → review+
if this is try only, that is sort of OK; we need the rest of the stuff hooked up, pending wlach's review.
Attachment #8729113 - Flags: review?(wlachance) → review+
Comment on attachment 8729113 [details]
MozReview Request: Bug 1253341 - support --framework for talos. r?wlach

https://reviewboard.mozilla.org/r/39247/#review36723

I guess this can land! I still find the method of determining whether we're on aws or not somewhat terrifying: I can just see someone changing a setting and unknowingly breaking the "aws" detection. So ok, with the proviso that this needs to be cleaned up in the near future.
Attachment #8730887 - Attachment is obsolete: true
(In reply to Release Engineering SlaveAPI Service from comment #41)
> In production: https://hg.mozilla.org/build/buildbot-configs/rev/2d6158d681e0

This got reverted by catlee.
http://hg.mozilla.org/build/buildbot-configs/rev/7c463958098d

That said, I think the real cause was found and fixed in the meantime, so I think it's OK to re-land now?
Attachment #8732334 - Attachment is obsolete: true
Attachment #8732334 - Flags: review?(wlachance)
Attachment #8732342 - Flags: review?(wlachance)
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/a90b7605df536f5383d9c15b8b18c3ca35f57fe1
Bug 1253341 - Run duplicate Talos jobs in AWS for Linux; add performance framework id to perfherder

https://github.com/mozilla/treeherder/commit/8b6bae1467ba9a93630853ff431e0e090e13cf0a
Merge pull request #1358 from jmaher/talos

Bug 1253341 - Run duplicate Talos jobs in AWS for Linux. r=wlach
Attachment #8732342 - Flags: review?(wlachance) → review+
this is live and we have results:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c68f24e3df3d

here is a simple breakdown of each test:
a11y: 100% regression (e10s: 200% regression)
canvasmark: 50% regression (opt|e10s)
cart: 100% regression (100% regression)
damp: 150% regression (e10s times out most of the time)
dromaeo_css: 50% regression (opt|e10s)
dromaeo_dom: 40% regression (opt|e10s)
glterrain: 200% regression (opt|e10s)
kraken: 100% regression (e10s: 125% regression)
sessionrestore: 150% regression (e10s: 200% regression)
sessionrestore_no_auto_restore: 150% regression (e10s: 200% regression)
tabpaint: 150% regression (e10s: 200% regression)
tart: 125% regression (e10s: 200% regression)
tpaint: 150% regression (e10s: 200% regression)
tp5o: 150% regression (e10s: didn't run at all)
tp5o_scroll: 50% regression (e10s: 200% regression)
tps: 200% regression (e10s: 250% regression)
tresize: 50% regression (e10s: 100% regression)
tscrollx: 50% regression (e10s: 200% regression)
tsvg_opacity: 40% regression (e10s: 60% regression)
tsvgx: 90% regression (e10s: 125% regression)
ts_paint: 150% regression (e10s: 300% regression)
xperf: win7 only


the main timeouts are:
cart (opt|e10s) - too much for production
damp e10s - almost perma
tp5 (opt|e10s) - e10s is perma


in addition we are not getting the framework set for these jobs, so we need to look at that again.
are the numbers stable? or is it too early to tell?
I think my push didn't include the latest tip and missed out on my framework change; I can assess the stability.  Although if we are looking for stability, we should pick the AMI type that we want and focus on that.  These are too slow right now; tp5 fails in e10s, and that isn't specific to graphics.  The runtime of the jobs is almost twice as long as on the hardware machines.
ok, my latest run is posting the framework correctly, but perfherder is not ingesting it properly:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=00dc870ba158

we are getting closer, just not there yet.
:wlach, can you help us figure out why this isn't being parsed properly in perfherder?

here is a link to an example log:
http://archive.mozilla.org/pub/firefox/try-builds/jmaher@mozilla.com-00dc870ba158f067ab80d89610eab5b74e1f08cd/try-linux64/try_ubuntu64_vm_test-chromez-bm52-tests1-linux64-build3.txt.gz
Flags: needinfo?(wlachance)
(In reply to Joel Maher (:jmaher) from comment #55)
> :wlach, can you help us figure out why this isn't being parsed properly in
> perfherder?

The changes to add the talos-aws fixture have not been deployed to production yet, I can see perf numbers fine on stage though:

https://treeherder.allizom.org/#/jobs?repo=try&revision=00dc870ba158&selectedJob=18885641
Flags: needinfo?(wlachance)
wrt slowness, I wonder whether a compute-optimized instance wouldn't be a better choice for talos jobs: https://aws.amazon.com/ec2/instance-types/
yeah, we need to come up with a plan for figuring out a faster AMI; let's see if stability and noise are reasonable.  Right now I am not clear on how to query just the spot instance data without a lot of manual work.

now that we are parsing this into a different framework, how do we see it on the graphs or query it with the perfherder api?
Flags: needinfo?(wlachance)
Depends on: 1258403
(In reply to Joel Maher (:jmaher) from comment #58)
> yeah, we need to come up with a plan for figuring out a faster ami, lets see
> if stability and noise is reasonable.  Right now I am not clear on how to
> query just the spot instance data without a lot of manual work.
> 
> now that we are parsing this into a different framework, how do we see it on
> the graphs or query it with the perfherder api?

You can pinpoint the AMI ones on try by looking for the slow datapoints LOL.

Everything is currently mashed together in the graph data chooser (see bug 1230652) but once we have this running regularly it should be possible to query the talos-aws data separately using its unique signature... actually, wait, it just occurred to me that this case isn't supported (multiple signatures that are the same in the repository except with differing frameworks). Filed bug 1258403 blocking this one (please don't deploy anything outside of try until that is fixed).
Flags: needinfo?(wlachance)
Chris, can you turn this off for now? I think we have enough data from try to consider going forward, and the double results are confusing developers, see bug 1260926 (part of that is probably Perfherder's fault, but still).
Flags: needinfo?(catlee)
Attachment #8737183 - Flags: review?(rail) → review+
Comment on attachment 8737183 [details]
MozReview Request: Bug 1253341: Disable duplicate talos jobs in AWS r=rail

https://reviewboard.mozilla.org/r/41081/#review40379
No longer depends on: 1258403
I forgot to mention this earlier, so it's my fault, but please don't turn this on again before bug 1260926 is fixed.
Depends on: 1260926
Depends on: 1230652
(In reply to Chris AtLee [:catlee] from comment #64)
> https://hg.mozilla.org/build/buildbot-configs/rev/
> c06fa02837e998eae6fff8c00e5edd503c7f7b59
> Bug 1253341: Disable duplicate talos jobs in AWS r=rail

Looks like we're still scheduling jobs despite this, and launching spot instances to handle them.
I am fine with us continuing to do so, we should have it sorted out on the treeherder side this coming week.
(In reply to Nick Thomas [:nthomas] from comment #65)
> (In reply to Chris AtLee [:catlee] from comment #64)
> > https://hg.mozilla.org/build/buildbot-configs/rev/
> > c06fa02837e998eae6fff8c00e5edd503c7f7b59
> > Bug 1253341: Disable duplicate talos jobs in AWS r=rail
> 
> Looks like we're still scheduling jobs despite this, and launching spot
> instances to handle them.

Where do you see these?
Flags: needinfo?(catlee)
Over the weekend there was a nagios backlog alert with jobs older than 24 hours. They were all on try, on at least 10 revisions, so it seemed likely they were scheduled by buildbot rather than by mozci or some other source. I don't see any now, though, so that was probably incorrect. Or restarting the schedulers ~24 hours ago may have fixed something up.
the m1.medium instance types are too slow and some talos tests are timing out.  We want to redo this experiment (ideally in the next week) with a large instance type.  I will need some guidance on instance types to choose from, ideally one that we have already used for other test-related jobs.
FTR, next week is the RC build week, but this shouldn't be a big problem.
thanks for the heads-up, rail.  Maybe we should re-enable this this week instead.

:rail, any chance you have a list of different ami types we currently use in automation for test jobs?
Flags: needinfo?(rail)
Comment on attachment 8740550 [details]
MozReview Request: Bug 1253341 - Run duplicate Talos jobs in AWS for Linux (ubuntu64_vm_lnx_large). r?rail

https://reviewboard.mozilla.org/r/45833/#review42433
Attachment #8740550 - Flags: review?(rail) → review+
this doesn't seem to be working; I wonder if ubuntu64_vm_large is not the right target to use?

:rail, do you have any ideas how I might be able to figure this out?
Flags: needinfo?(rail)
I HATE config.py! :) 

Not sure how the current approach worked before; I have vague memories that you shouldn't try to add platforms in loops. Instead you need to define them globally and then remove them from all branches except the desired ones.

This approach may be a bit different for tests config.py because we have slave_platforms and talos_slave_platforms...

Maybe something like https://gist.github.com/rail/cfebf537c4cee1ee0d0c281fd75ee867 helps? To verify, you may need to dump all builders before and after. I haven't done this for ages and I'm not sure I can reproduce it now. If you don't know how to dump the builders, ask the buildduty folks - I know they have this setup ready! :)
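A very rough illustration of the pattern rail describes (define the platform globally, then strip it from every branch that shouldn't get it); the structure below is heavily simplified and hypothetical, not the real mozilla-tests/config.py:

    # Hypothetical, heavily simplified shape of a tests config: every branch
    # starts with the full platform list, then unwanted platforms are removed.
    ALL_TALOS_SLAVE_PLATFORMS = ['ubuntu64_hw', 'ubuntu64_vm_lnx_large']

    BRANCHES = {
        'try': {'talos_slave_platforms': list(ALL_TALOS_SLAVE_PLATFORMS)},
        'mozilla-inbound': {'talos_slave_platforms': list(ALL_TALOS_SLAVE_PLATFORMS)},
    }

    # Keep the AWS platform on try only; strip it from every other branch.
    for name, branch in BRANCHES.items():
        if name != 'try' and 'ubuntu64_vm_lnx_large' in branch['talos_slave_platforms']:
            branch['talos_slave_platforms'].remove('ubuntu64_vm_lnx_large')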
Flags: needinfo?(rail)
Comment on attachment 8740550 [details]
MozReview Request: Bug 1253341 - Run duplicate Talos jobs in AWS for Linux (ubuntu64_vm_lnx_large). r?rail

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45833/diff/1-2/
Attachment #8740550 - Attachment description: MozReview Request: Bug 1253341 - Run duplicate Talos jobs in AWS for Linux (large instances). r?rail → MozReview Request: Bug 1253341 - Run duplicate Talos jobs in AWS for Linux (ubuntu64_vm_lnx_large). r?rail
builders added with ubuntu64_vm_lnx_large:
Builders added:
+ Ubuntu VM large 12.04 x64 try talos chromez
+ Ubuntu VM large 12.04 x64 try talos chromez-e10s
+ Ubuntu VM large 12.04 x64 try talos dromaeojs
+ Ubuntu VM large 12.04 x64 try talos dromaeojs-e10s
+ Ubuntu VM large 12.04 x64 try talos g1
+ Ubuntu VM large 12.04 x64 try talos g1-e10s
+ Ubuntu VM large 12.04 x64 try talos g2
+ Ubuntu VM large 12.04 x64 try talos g2-e10s
+ Ubuntu VM large 12.04 x64 try talos g3
+ Ubuntu VM large 12.04 x64 try talos g3-e10s
+ Ubuntu VM large 12.04 x64 try talos other
+ Ubuntu VM large 12.04 x64 try talos other-e10s
+ Ubuntu VM large 12.04 x64 try talos svgr
+ Ubuntu VM large 12.04 x64 try talos svgr-e10s
+ Ubuntu VM large 12.04 x64 try talos tp5o
+ Ubuntu VM large 12.04 x64 try talos tp5o-e10s
results are in-

ix: https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=e93ebef962dd&newProject=try&newRevision=0ad2d3cc82086352841c758010ff08f0e73714fa&framework=1&showOnlyImportant=0
** 4 tests >=1% variance with 12 data points before/after
** 50 total data points


aws: https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=e93ebef962dd&newProject=try&newRevision=0ad2d3cc82086352841c758010ff08f0e73714fa&framework=7&showOnlyImportant=0
** 25 tests with >=1% variance with 12 data points before/after
** 3 tests >= 5% variance
** 41 total datapoints, no data for tp5o_scroll*, tscrollx*, and tests in the same job (cart/tart/tsvgx/svg_opacity e10s flavor) and glterrain/tp5o_scroll both opt and opt+e10s.


The overall values and runtimes are closer to the hardware, but the data is quite noisy.  As it stands, I wouldn't want to use these specific machines for tracking performance.

:catlee, these are running on the linux_emulator64 instances, which are linux large.  I am not sure if we can easily try another instance type?
Flags: needinfo?(catlee)
It looks like tst-emulator64 is a mix of c3.xlarge and m3.xlarge, which may be contributing to the noisy results. We should try again with only a single instance type... I'm not sure how to do that easily, though.
Flags: needinfo?(catlee)
good point, who should I coordinate with sometime this week to figure that out?
we could restrict tst-emulator64 to run on only c3.xlarge as an experiment.
I couldn't find code in the normal repos to adjust this- with a pointer, I can write a patch.
Depends on: 1266439
oh no, I pushed to try and got emulators only, not duplicate jobs.

can we ensure that https://bugzilla.mozilla.org/show_bug.cgi?id=1253341 is landed?  I don't see it here:
https://dxr.mozilla.org/build-central/source/buildbot-configs/mozilla-tests/config.py#3160

but it is in the latest source (default head).

here is a try push where I expected duplicate jobs and ended up with only emulator vm jobs:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&selectedJob=19846466

:rail, any thoughts on this?
Flags: needinfo?(rail)
DXR may be outdated, http://hg.mozilla.org/build/buildbot-configs/file/production/mozilla-tests/config.py#l3150 is a better link to verify.
Flags: needinfo?(rail)
thanks :rail!  That does show what I would expect.  In addition, all the VM jobs ran to completion and the HW jobs are pending; once we start running the first job, pulse_actions does the retrigger magic.  I just need to be patient.
I have similar results for c3.xlarge only; in this case we have a bit more stability:
aws: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0ea5439590ad&newProject=try&newRevision=0d372519e22dceaf9e7eb0612eed9c3aaaa222d7&framework=7&showOnlyImportant=0

** 15 jobs >= 1% variance with 13 data points before/after
** 2 tests >= 5% variance (tp5o_responsiveness, tresize, both 10%+)
** 41 total datapoints, no data for tp5o_scroll*, tscrollx*, and tests in the same job (cart/tart/tsvgx/svg_opacity e10s flavor) and glterrain/tp5o_scroll both opt and opt+e10s.

While this is better than the first experiment, it is still noticeably worse than the HW machines.  Looking in detail at the individual tests and distributions, I see some better and some worse, but overall the pattern of noise is roughly the same between c3.xlarge+m3.xlarge and c3.xlarge only.

I need to sync up with wlach on validating the perfherder data before getting too much further.

We should also outline what is required to test on another instance type.  I am not sure of the work required or how to stand up a new instance type- so far we have been reusing existing instances/configs.
Chatting with catlee on IRC, it seems as though our experiment with buildbot has reached its limits.  Possibilities include running on a larger host, or something to ensure we are the only VM on the box; that would be considered a dedicated machine.

What I see as next steps are to:
* look at a dedicated host machine (single vm, or cloud hardware), this would need to be done outside of buildbot though- possibly via taskcluster
* look at running taskcluster on linux hardware in our colo (use the existing docker setup, etc.)
(In reply to Joel Maher (:jmaher) from comment #91)
> I have similar results for c3.xlarge only, in this case we have a bit more
> stability:

Just as a cross-check, did catlee or someone make sure all the m3.xlarge instances had gone when your tests ran ? Sometimes they live quite a while.
thanks for checking on that Nick!  I chatted with rail the day before and he forced a cycle on the instances- then I waited about 12 hours and ran the experiment.  I have pretty high confidence it was just the one instance type.
Glad to hear you've already handled that. In other news, there are ~880 pending test jobs with 'Windows 7 VM-GFX 32-bit' at the start of the name; they look to all be unit tests across several revisions on try. There are just 19 slaves enabled in slavealloc, so they're not keeping up. Wondering if we intended to schedule unit tests as well as talos.
that is an experiment I am doing with Q, hopefully resulting in many jobs being able to run on win7.  99% of the jobs on that platform are my try pushes; we don't have current plans for talos there, but it would be worth experimenting with in the near future.  Once we really validate win7-vm as a new platform, I imagine we will have a few hundred instances up and running.
Attachment #8744973 - Flags: review?(bugspam.Callek) → review+
Now that we have run talos jobs in taskcluster on both c4large and c3large instances, we have more data.

                Delta %    avg delta %    stddev    avg stddev
ix.bb           20.02      0.4            71.59     1.43
c3xlarge.bb     80.78      1.97           205.87    5.02
c3large.tc      35.77      0.89           127.01    3.17

* delta % - sum(abs(median(new) - median(old)))
* avg delta % - average(delta %)
* stddev - max(stddev(olddata), stddev(newdata))
* avg stddev - average(stddev)
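A small sketch of how those per-test numbers could be rolled up (illustrative only; it assumes the table's 'Delta %' and 'stddev' columns are sums of the per-test values and that both are expressed as percentages of the old data, which the comment doesn't state explicitly):

    import numpy as np

    def noise_summary(per_test_data):
        """per_test_data: dict of test name -> (old_points, new_points) lists."""
        deltas, stddevs = [], []
        for old, new in per_test_data.values():
            old = np.asarray(old, dtype=float)
            new = np.asarray(new, dtype=float)
            # per-test delta %: absolute change of the median, relative to old
            deltas.append(abs(np.median(new) - np.median(old)) / np.median(old) * 100)
            # per-test stddev: the noisier of the two sides, as % of its mean
            stddevs.append(100 * max(np.std(old) / np.mean(old),
                                     np.std(new) / np.mean(new)))
        return {
            'delta %': sum(deltas),
            'avg delta %': np.mean(deltas),
            'stddev': sum(stddevs),
            'avg stddev': np.mean(stddevs),
        }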


this indicates we have roughly twice the noise on c3large running in taskcluster as we do on the IX hardware machines in buildbot.  This is much better than we saw with the buildbot experiment.

A few things to note:
* this data was collected Monday May 2, 2016- maybe this is a stable point in time for the AWS machines
* looking at the noisiest tests, they seem CPU bound; we don't have a lot of choices in CPU.  Is docker or aws influencing this?  Maybe they are helping keep it stable.
* we are not running 2 tests on c3large.tc (tp5o_scroll, tscrollx - due to hanging), and all tests in the svg job on non-e10s are not collecting data either, due to tart hanging.  With these tests involved, I would imagine the noise being similar or worse, not better.

Backing up slightly, why do we care about noise:
* more noise == more false alerts, more missed alerts for smaller issues
* more noise == more data points to collect for confidence- right now we require 24 data points for an alert, that is a lot of data and we don't need to run more jobs
* more noise == harder for developers to trust the data and verify a fix; right now we do quite well on about 80% of our test/platform combinations.

While zero noise isn't a realistic goal, doubling our existing noise is not a path forward for success.  Adding 20-25% more noise is reasonable; that is a fuzzy line, although the more noise we introduce, the more scrutiny we get.

What is next:
* consider any tweaks to the AWS c3large environment.  I am not a fan of running in AWS in general as we don't have as much control over the machines/environment, and we are already at twice the noise- still worth ruling out
* look at an on the hardware solution
** bare metal in the cloud providers
** in-house IX solution with TC worker+docker
* figure out why the *scroll tests are failing, and why tart is perma-fail here vs. intermittent on bb.

When we have an experiment that yields an acceptable noise level, I would like to run the experiment daily for 2 weeks to measure the noise over time, as well as run the tests in true parallel and see if we catch the same issues we catch with our existing automation.

:garndt, given the few options above, could you weigh in on how realistic it is for me to dive forward on one of them?  Possibly you have ideas for tweaks to the AWS machines?  Do we have a way to run on some in-house IX machines or other bare-metal cloud providers?

* this shouldn't be a drop-everything priority, but getting a plan for next steps this week would be nice.
Flags: needinfo?(garndt)
As far as bare metal providers, I spoke with one of the employees at packet.net recently which gives an API to bare metal instances and they offer per hour pricing.  To start with we could look into what it would take to just spin up these machines and keep them running with docker-worker.

We could also use existing bare metal that we have, it would just require a similar environment to what we're running with now (ubuntu w/ aufs support, video/sound kernel modules, docker 1.10, node 0.12).
Flags: needinfo?(garndt)
(In reply to Greg Arndt [:garndt] from comment #100)
> As far as bare metal providers, I spoke with one of the employees at
> packet.net recently which gives an API to bare metal instances and they
> offer per hour pricing.  To start with we could look into what it would take
> to just spin up these machines and keep them running with docker-worker.
> 
> We could also use existing bare metal that we have, it would just require a
> similar environment to what we're running with now (ubuntu w/ aufs support,
> video/sound kernel modules, docker 1.10, node 0.12).

Packet sounds great! I'll bet we could get away with using their lowest cost instance, since predictability rather than speed is our primary goal.

https://www.packet.net/bare-metal/
I think the next step would be packet.net.

The type 0 might just work for us:
https://www.packet.net/bare-metal/servers/type-0/

and it is the cheapest per hour.  I am not sure if the CPU will be adequate, but the rest of the specs seem useful.
Re: https://github.com/mozilla/treeherder/pull/1338

Is the list of buildernames in comment 80 still accurate? (And are there any more?)

I can add a testcase for these to confirm everything is working and prevent regressions.
Flags: needinfo?(jmaher)
I grafted production-only c7e341b34124 backout to default to prevent accidental merges in the future: https://hg.mozilla.org/build/buildbot-configs/rev/2302dc47cb21
hey :emorley, those buildernames are not valid anymore; we have cancelled our experiment here and are moving to hardware.  I am not sure there is much we can do about adding extra test cases.
Flags: needinfo?(jmaher)
Attachment #8732342 - Attachment is obsolete: true
Attachment #8732342 - Flags: review+
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Component: General Automation → General