Closed Bug 1572089 Opened 1 year ago Closed 1 year ago

gecko-t-win10-64-gpu could run on a cheaper instance

Categories

(Testing :: General, enhancement, P2)

Version 3
enhancement

Tracking

(firefox-esr68 fixed, firefox70 fixed, firefox71 fixed)

RESOLVED FIXED
mozilla71
Tracking Status
firefox-esr68 --- fixed
firefox70 --- fixed
firefox71 --- fixed

People

(Reporter: jrmuizel, Assigned: jmaher)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

It looks like gecko-t-win10-64-gpu is currently running on a g3.4xlarge. Instead it could be using a g3s.xlarge, which can be half the price.

:coop, I don't see this as an option in the in-tree code. Can you set up a pool of g3s.xlarge machines (10-20 of them) as t-win10-64-gpu-s so I can do a try push?

Flags: needinfo?(coop)

I'll get a pool setup for testing.

Assignee: nobody → coop
Status: NEW → ASSIGNED
Flags: needinfo?(coop)

I'm having trouble getting a pool set up for testing due to InsufficientInstanceCapacity, or at least that's the error I'm hitting in US regions. It seems that g3s.xlarge is quite popular, being the cheapest GPU-enabled option. The question becomes: how much do we need to increase our spot bid to get capacity, and is it still cost-effective at that point?

Note that we quite often get spot request failures for the existing g3.4xlarge too, but we do bid pretty high for them already and are able to power through.

I'll keep iterating here.

(In reply to Chris Cooper [:coop] pronoun: he from comment #3)

I'll keep iterating here.

I did manage to get some instances by bumping our minPrice up to $2. There's a small pool configured now, with a min of 2 instances running and a max of 20:

https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/

(In reply to Chris Cooper [:coop] pronoun: he from comment #4)

https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/

Joel: I think the ball's in your court here now.

Assignee: coop → jmaher

hmm, my try push isn't scheduling anything:
https://hg.mozilla.org/try/rev/45b0d1b0e12f9d7ce49edf7e920f9c713e4d80e1

I wonder if there is something else needed to use the new worker type?

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #6)

I wonder if there is something else needed to use the new worker type?

Oh, I thought you might simply subsume the existing gecko-t-win10-64-gpu worker type rather than creating a brand new type, e.g.:

 diff --git a/taskcluster/ci/config.yml b/taskcluster/ci/config.yml
 --- a/taskcluster/ci/config.yml
 +++ b/taskcluster/ci/config.yml
 @@ -375,17 +375,17 @@ workers:
              worker-type:
                  by-level:
                      '3': 'gecko-{level}-t-osx-1014'
                      default: 'gecko-t-osx-1014'
          t-win10-64(|-gpu):
              provisioner: aws-provisioner-v1
              implementation: generic-worker
              os: windows
 -            worker-type: 'gecko-{alias}'
 +            worker-type: 'gecko-{alias}-s'
          t-win10-64(-hw|-ref-hw):
              provisioner: releng-hardware
              implementation: generic-worker
              os: windows
              worker-type: 'gecko-{alias}'
          t-win7-32(|-gpu):
              provisioner: aws-provisioner-v1
              implementation: generic-worker

The alternative is updating https://hg.mozilla.org/ci/ci-configuration/file/1a202649a61e58dc32cffcef33cc941d84f0e9f6/grants.yml which we can also do, but should be cleaned up afterwards too.

I tried changing that with the same results:
https://hg.mozilla.org/try/rev/41fd6bd0d257242254ca19e6d11725a83cc31d7b

:coop, maybe we need to fix up the grants.yml?

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #8)

I tried changing that with the same results:
https://hg.mozilla.org/try/rev/41fd6bd0d257242254ca19e6d11725a83cc31d7b

:coop, maybe we need to fix up the grants.yml?

Needing to officially commit something to the ci-configuration repo seems like a hurdle we'd like to avoid for in-tree experimentation with configs, but maybe I'm overly optimistic about the state of the art here, at least for Windows.

I'm going to loop in :grenade who might have a better idea where I went wrong here and how best to go about setting up variants of existing Windows test workers.

Here's what I tried in this case:

To set up the new worker type (https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/view), I clicked on CreateWorkerType in the aws-provisioner and copied in the definition for gecko-t-win10-64-gpu (https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu/view), then changed the instanceType, minCapacity, maxCapacity, minPrice, maxPrice, and scopes (scopes are apparently ignored).

New workers are being created -- you can see instances spinning up here: https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/resources -- but they aren't claiming work. Both jmaher and I have tried to submit tasks to these workers via Try, but the provisioner claims there are no workers available: https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-win10-64-gpu-s

Speaking to Dustin in Slack: the instances are starting but likely never calling queue.claimWork, which is why they aren't getting added to the worker-types listing.

:grenade - did this approach (re-using existing worker definition with some tweaks) ever have any hope of working for Windows testers? How do you normally go about setting up variants for testing?

Flags: needinfo?(rthijssen)

(In reply to Chris Cooper [:coop] pronoun: he from comment #9)

:grenade - did this approach (re-using existing worker definition with some tweaks) ever have any hope of working for Windows testers?

yes, it might have worked. i see the instances start, run and log. then they eventually shut down with a message about gw not starting, which could be for any number of reasons. windows instances check the occ repo for a manifest relating to their worker type and of course we didn't have one, so maybe that's why they're missing something that would encourage productivity. i've added it to see if that helps.

How do you normally go about setting up variants for testing?

i probably would have just changed instance type in the aws provisioner config for the beta worker type, since that's already up and running and exists for this sort of thing. but there's no harm in the approach used and it almost worked. i'll monitor now that the manifest exists and see if gecko-t-win10-64-gpu-s starts taking work...

Flags: needinfo?(rthijssen)

(In reply to Rob Thijssen [:grenade (EET/UTC+0300)] from comment #10)

i probably would have just changed instance type in the aws provisioner config for the beta worker type, since that's already up and running and exists for this sort of thing. but there's no harm in the approach used and it almost worked. i'll monitor now that the manifest exists and see if gecko-t-win10-64-gpu-s starts taking work...

Thanks, Rob. Good to know for the future.

It looks like some of Joel's jobs are completing now (https://treeherder.mozilla.org/#/jobs?repo=try&revision=a13612275f0bdd7721f8ec5b918f8bdbaa700ffe) although most are hitting exceptions from not being claimed in time.

Mine was triggered more recently, so it succeeded: https://treeherder.mozilla.org/#/jobs?repo=try&author=coop%40mozilla.com&selectedJob=260598293

I did a before/after comparison and retriggered each job 5 times (some jobs might have a few extra). Overall the g3s.xlarge machines are 5% slower:
https://docs.google.com/spreadsheets/d/1uqeDOpaz_E49Ta7f7KiUZainmQ_irx4ux_xzTmrdgKs/edit#gid=0

^ links to try pushes at the top of the document

If these are half the price and have as much availability as the current g3.4xlarge machines, we could really see some good savings (not quite 50%).
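The "not quite 50%" figure can be sanity-checked with a quick back-of-the-envelope calculation. This is a sketch using the half-price and ~5% slowdown figures from the comments above; actual spot prices vary by region and time:

```python
# Rough cost comparison (illustrative ratios, not actual spot prices):
# g3s.xlarge is assumed to cost half as much per hour as g3.4xlarge,
# but runs the same workload ~5% slower, so it needs ~5% more hours.
relative_price = 0.5   # g3s.xlarge hourly price relative to g3.4xlarge
slowdown = 1.05        # g3s.xlarge takes ~5% longer per job

relative_cost = relative_price * slowdown
savings = 1 - relative_cost
print(f"relative cost: {relative_cost:.3f}, savings: {savings:.1%}")
# -> relative cost: 0.525, savings: 47.5%
```

So even with the slowdown factored in, the per-task cost drops by roughly 47%, which matches the "not quite 50%" estimate.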

As for failures:
current: 20
g3s.xlarge: 31 [1]

With some more work debugging the timing of tests/browser in a dozen or so cases, it would be realistic to expect the failure rates to be the same. That might take a few weeks.

[1] After fixing or disabling a few tests, the failures would be at 25.

:jrmuizel, do you have any other concerns? If you are OK with this, I will pop over to bholley for next steps.

Flags: needinfo?(jmuizelaar)

Looks good to me.

Flags: needinfo?(jmuizelaar)

:grenade, do you have any knowledge of the availability of the g3s.xlarge machines? I want to know if there would be enough of them to replace the existing g3.4xlarge pool

Flags: needinfo?(rthijssen)

i have no knowledge and as far as i'm aware, amazon don't publish capacity figures.

my suggestion would be to set multiple instance types for this worker, with a bias in favour of the cheaper instance. that way the provisioner can always fall back to the g3.4xlarge instances if there isn't enough g3s.xlarge capacity for our workload. if we run that way for a few weeks, we should be able to gauge if there is enough g3s.xlarge to remove the g3.4xlarge instance type from configuration.

eg, modify the config for now, to contain something like:

    "instanceTypes": [
      {
        "instanceType": "g3.4xlarge",
        "utility": 0.5,
        ...
      },
      {
        "instanceType": "g3s.xlarge",
        "utility": 1,
        ...
      }
    ],

Flags: needinfo?(rthijssen)
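For context, the utility values in the config above let the provisioner weigh price against throughput: roughly, it prefers the instance type with the lowest spot price per unit of utility. A hedged sketch of that comparison (the prices here are illustrative, and the real selection logic in aws-provisioner-v1 may differ in detail):

```python
# Sketch of utility-weighted instance selection, assuming the provisioner
# ranks candidate types by (spot price / utility). Prices are made up.
instance_types = [
    {"instanceType": "g3.4xlarge", "utility": 0.5, "spotPrice": 1.20},
    {"instanceType": "g3s.xlarge", "utility": 1.0, "spotPrice": 0.75},
]

def effective_price(it):
    # Lower is better: price paid per unit of useful capacity.
    return it["spotPrice"] / it["utility"]

best = min(instance_types, key=effective_price)
print(best["instanceType"])  # with these prices, g3s.xlarge wins
```

Biasing the utility in favour of g3s.xlarge means it is chosen whenever capacity is available at a reasonable price, with g3.4xlarge as the fallback.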

that is a good tip. We just need to fix up the few tests that are high-frequency or perma-fail:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a13612275f0bdd7721f8ec5b918f8bdbaa700ffe

:ahal, could you pick this up to green up the few tests? I assume we will need to look at win7 gpu instances as well. Not a P1, but something we could realistically do this month.

:grenade, how could we track how many devices of each we are using?

Flags: needinfo?(ahal)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #16)

:grenade, how could we track how many devices of each are we using?

perfherder records the instance type, so you can see a couple weeks of history there, but it might be enough just to look at the resources page on a busy friday afternoon or two and see how many of each type are running.
if you need more accurate data, dhouse may have ideas for a grafana report.

Sure, I'll try and take a look sometime this week or next.

Blocks: 1573872
Priority: -- → P2
Summary: gecko-t-win10-64-gpu could run on a cheaper intstance → gecko-t-win10-64-gpu could run on a cheaper instance

:dustin, coop is on PTO. He had set up these workers (aws-provisioner-v1/gecko-t-win10-64-gpu-s) and they seem to not be scheduling anymore. Can you help get this online?

Flags: needinfo?(dustin)

Did they work at some point? I see two instances running right now, and zero pending.

The jobs on Andrew's try push (comment 19) are for a different workerType. Were those supposed to be the same?

Flags: needinfo?(dustin)

it looks like Andrew's worker types should be aws-provisioner-v1/gecko-t-win10-64-gpu-s. How do you determine that the ones that did not run are not that? I could be overlooking something.

I clicked on the grayed-out pending jobs and clicked on the task, and saw gecko-t-win10-64-s.

Gotcha. Joel, does that mean that only the tasks that ran on my push use the new worker type? I.e., there are no problems and nothing needs to be triaged (other than the patch adding -s to non-gpu things)?

Flags: needinfo?(jmaher)

I honestly don't know. Did I do things wrong in the past? It would take me a while to figure this out; maybe tomorrow morning I will have time if this doesn't have a solution today.

Flags: needinfo?(jmaher)

I see what happened: your previous try push used !ship in the fuzzy syntax, so that's why the unclaimed tasks didn't run on your push. So yeah, while your patch does correctly change the -gpu tasks to -gpu-s, it also accidentally adds -s to non-gpu tasks. I'll see if I can come up with the correct patch.

I believe this is the patch we need to make the switch:
https://hg.mozilla.org/try/rev/d73e583974ee6e2fd1ce1d3ca0dca48b93686b18

One obvious question: do we want to switch all Windows 10-64 virtual-with-gpu tasks over at the same time, e.g. including things like nightly, ccov, shippable, etc.? Or should we just start with a subset of the configurations in that patch?

Everything, assuming it is green. These are presumably half the cost and only ~5% slower; that is a win we should make as complete as possible.

Just to note, the above patches can't land yet as we only have 20 instances allocated to the pool:
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types?search=win10-64-gpu

Given this pool is slower, we'll likely want slightly higher capacity than the old worker type (266). But I'm still verifying the tests, so no need to rush out and make the change quite yet.

I think the tests are more or less good to go:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7bd1abc2d2c7631649adeaa2a91955d307119d99

Dustin, how would you prefer we land this? Wait to increase the pool size before landing? Or land, see how big the backlog gets and set the pool size accordingly? If you aren't the one who will make the change feel free to redirect.

Thanks!

Flags: needinfo?(dustin)

Deferring to coop.

Flags: needinfo?(dustin) → needinfo?(coop)

(In reply to Andrew Halberstadt [:ahal] from comment #34)

I think the tests are more or less good to go:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7bd1abc2d2c7631649adeaa2a91955d307119d99

Dustin, how would you prefer we land this? Wait to increase the pool size before landing? Or land, see how big the backlog gets and set the pool size accordingly? If you aren't the one who will make the change feel free to redirect.

I would suggest bumping the pool size from 266->300 on initial landing. We can bump it further (or lower it) afterwards if required, but a small increase here seems prudent.

Flags: needinfo?(coop)

who can increase the pool size to 300?
how can we monitor the backlog/queue size?

It would be nice to have that information so anyone reading this bug can help out.

For clarification (so whoever ends up making the switch doesn't need to parse intent out of the comments above)...

We are switching tasks from:
gecko-t-win10-64-gpu

to:
gecko-t-win10-64-gpu-s

so the pool size of the former can be decreased proportionally to the increase in pool size of the latter. Note the latter will run ~5% slower, so the proportion isn't quite 1 to 1, but it's close enough that we can just do that and adjust later as needed.
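Assuming roughly one-to-one capacity with the ~5% slowdown penalty, the replacement pool size can be estimated from the numbers already in this bug (a sketch, not a sizing policy):

```python
import math

old_pool = 266     # existing gecko-t-win10-64-gpu capacity (comment #33)
slowdown = 1.05    # g3s.xlarge runs ~5% slower per job

# Each migrated worker delivers ~1/1.05 of the old throughput, so the new
# pool needs ~5% more instances to keep the pending backlog comparable.
new_pool = math.ceil(old_pool * slowdown)
print(new_pool)  # -> 280; the suggested 266->300 bump adds extra headroom
```

This is why the proposed bump to 300 (and later the maxCapacity of 512) errs on the generous side rather than matching the old pool exactly.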

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #37)

how can we monitor the backlog/queue size?

I think this link is the right one:
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types?search=win10-64-gpu&layout=grid

(Current backlog is from another try push I did to double check for more intermittents)

(In reply to Chris Cooper [:coop] pronoun: he from comment #36)

I would suggest bumping the pool size from 266->300 on initial landing. We can bump it further (or lower it) afterwards if required, but a small increase here seems prudent.

I think you might be looking at the wrong workertype. We need to increase the gpu-s pool which appears to have a size of 20 atm.

(In reply to Andrew Halberstadt [:ahal] from comment #40)

I think you might be looking at the wrong workertype. We need to increase the gpu-s pool which appears to have a size of 20 atm.

Sorry, I was just using the number from comment #33.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #37)

who can increase the pool size to 300?

AFAIK, anyone with scopes to modify the worker type definition:

https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu/view
https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/view

I've updated the maxCapacity on the gecko-t-win10-64-gpu-s worker type to 512:

https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/view

Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e5a0b04ebf3a
[taskgraph] Remove unused WINDOWS_WORKER_TYPES from tests transform, r=jmaher
https://hg.mozilla.org/integration/autoland/rev/cb1dbe32e155
[ci] Migrate Win10-64 virtual-with-gpu tasks to a cheaper workertype, r=jmaher
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71

This coincides with a 25% drop in the cost of mozilla-central pushes ($205->$154): https://sql.telemetry.mozilla.org/dashboard/ci-costs

Great job everyone.
