Closed Bug 1572089 Opened 1 year ago Closed 1 year ago

gecko-t-win10-64-gpu could run on a cheaper instance

Categories

(Testing :: General, enhancement, P2)

Version 3
enhancement

Tracking

(firefox-esr68 fixed, firefox70 fixed, firefox71 fixed)

RESOLVED FIXED
mozilla71
Tracking Status
firefox-esr68 --- fixed
firefox70 --- fixed
firefox71 --- fixed

People

(Reporter: jrmuizel, Assigned: jmaher)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

It looks like gecko-t-win10-64-gpu is currently running on a g3.4xlarge. Instead it could be using a g3s.xlarge, which can be half the price.

:coop, I don't see this as an option in the in-tree code. Can you set up a pool of g3s.xlarge machines (10-20 of them) as t-win10-64-gpu-s so I can do a try push?

Flags: needinfo?(coop)

I'll get a pool setup for testing.

Assignee: nobody → coop
Status: NEW → ASSIGNED
Flags: needinfo?(coop)

I'm having trouble getting a pool set up for testing due to InsufficientInstanceCapacity, or at least that's the error I'm hitting in US regions. It seems that g3s.xlarge is quite popular, being the cheapest GPU-enabled option. The question becomes: how much do we need to increase our spot bid to get capacity, and is it still cost-effective at that point?

Note that we quite often get spot request failures for the existing g3.4xlarge too, but we do bid pretty high for them already and are able to power through.

I'll keep iterating here.

(In reply to Chris Cooper [:coop] pronoun: he from comment #3)

I'll keep iterating here.

I did manage to get some instances by bumping our minPrice up to $2. There's a small pool configured now, with a min of 2 instances running and a max of 20:

https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/

(In reply to Chris Cooper [:coop] pronoun: he from comment #4)

https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/

Joel: I think the ball's in your court here now.

Assignee: coop → jmaher

hmm, my try push isn't scheduling anything:
https://hg.mozilla.org/try/rev/45b0d1b0e12f9d7ce49edf7e920f9c713e4d80e1

I wonder if there is something else needed to use the new worker type?

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #6)

I wonder if there is something else needed to use the new worker type?

Oh, I thought you might simply subsume the existing gecko-t-win10-64-gpu worker type rather than creating a brand new type, e.g.:

 diff --git a/taskcluster/ci/config.yml b/taskcluster/ci/config.yml
 --- a/taskcluster/ci/config.yml
 +++ b/taskcluster/ci/config.yml
 @@ -375,17 +375,17 @@ workers:
              worker-type:
                  by-level:
                      '3': 'gecko-{level}-t-osx-1014'
                      default: 'gecko-t-osx-1014'
          t-win10-64(|-gpu):
              provisioner: aws-provisioner-v1
              implementation: generic-worker
              os: windows
 -            worker-type: 'gecko-{alias}'
 +            worker-type: 'gecko-{alias}-s'
          t-win10-64(-hw|-ref-hw):
              provisioner: releng-hardware
              implementation: generic-worker
              os: windows
              worker-type: 'gecko-{alias}'
          t-win7-32(|-gpu):
              provisioner: aws-provisioner-v1
              implementation: generic-worker

The alternative is updating https://hg.mozilla.org/ci/ci-configuration/file/1a202649a61e58dc32cffcef33cc941d84f0e9f6/grants.yml which we can also do, but should be cleaned up afterwards too.

I tried changing that with the same results:
https://hg.mozilla.org/try/rev/41fd6bd0d257242254ca19e6d11725a83cc31d7b

:coop, maybe we need to fix up the grants.yml?

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #8)

I tried changing that with the same results:
https://hg.mozilla.org/try/rev/41fd6bd0d257242254ca19e6d11725a83cc31d7b

:coop, maybe we need to fix up the grants.yml?

Needing to officially commit something to the ci-configuration repo seems like a hurdle we'd like to avoid for in-tree experimentation with configs, but maybe I'm overly optimistic about the state of the art here, at least for Windows.

I'm going to loop in :grenade who might have a better idea where I went wrong here and how best to go about setting up variants of existing Windows test workers.

Here's what I tried in this case:

To set up the new worker type (https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/view), I clicked on CreateWorkerType in the aws-provisioner and copied in the definition for gecko-t-win10-64-gpu (https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu/view), then changed the instanceType, minCapacity, maxCapacity, minPrice, maxPrice, and scopes (scopes are apparently ignored).

New workers are being created -- you can see instances spinning up here: https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/resources -- but they aren't claiming work. Both jmaher and I have tried to submit tasks to these workers via Try, but the provisioner claims there are no workers available: https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-win10-64-gpu-s

Speaking to Dustin in Slack: the instances are starting but likely never calling queue.claimWork, which is why they aren't getting added to the worker-types listing.

:grenade - did this approach (re-using existing worker definition with some tweaks) ever have any hope of working for Windows testers? How do you normally go about setting up variants for testing?

Flags: needinfo?(rthijssen)

(In reply to Chris Cooper [:coop] pronoun: he from comment #9)

:grenade - did this approach (re-using existing worker definition with some tweaks) ever have any hope of working for Windows testers?

yes, it might have worked. i see the instances start, run and log. then they eventually shut down with a message about gw not starting, which could be for any number of reasons. windows instances check the occ repo for a manifest relating to their worker type and of course we didn't have one, so maybe that's why they're missing something that would encourage productivity. i've added it to see if that helps.

How do you normally go about setting up variants for testing?

i probably would have just changed instance type in the aws provisioner config for the beta worker type, since that's already up and running and exists for this sort of thing. but there's no harm in the approach used and it almost worked. i'll monitor now that the manifest exists and see if gecko-t-win10-64-gpu-s starts taking work...

Flags: needinfo?(rthijssen)

(In reply to Rob Thijssen [:grenade (EET/UTC+0300)] from comment #10)

i probably would have just changed instance type in the aws provisioner config for the beta worker type, since that's already up and running and exists for this sort of thing. but there's no harm in the approach used and it almost worked. i'll monitor now that the manifest exists and see if gecko-t-win10-64-gpu-s starts taking work...

Thanks, Rob. Good to know for the future.

It looks like some of Joel's jobs are completing now (https://treeherder.mozilla.org/#/jobs?repo=try&revision=a13612275f0bdd7721f8ec5b918f8bdbaa700ffe) although most are hitting exceptions from not being claimed in time.

Mine was triggered more recently, so it succeeded: https://treeherder.mozilla.org/#/jobs?repo=try&author=coop%40mozilla.com&selectedJob=260598293

I did a before/after comparison and retriggered each job 5 times (some jobs might have a few extra). Overall the g3s.xlarge machines are 5% slower:
https://docs.google.com/spreadsheets/d/1uqeDOpaz_E49Ta7f7KiUZainmQ_irx4ux_xzTmrdgKs/edit#gid=0

^ links to try pushes at the top of the document

If these are half the price and have as much availability as the current g3.4xlarge machines, we could really see some good savings (not quite 50%).
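The "not quite 50%" figure can be sanity-checked with a quick back-of-the-envelope calculation. This is a sketch using the half-price and ~5% slowdown figures from the comments above; actual spot prices vary by region and time:

```python
# Rough cost comparison (illustrative ratios, not actual spot prices):
# g3s.xlarge is assumed to cost half as much per hour as g3.4xlarge,
# but runs the same workload ~5% slower, so it needs ~5% more hours.
relative_price = 0.5   # g3s.xlarge hourly price relative to g3.4xlarge
slowdown = 1.05        # g3s.xlarge takes ~5% longer per job

relative_cost = relative_price * slowdown
savings = 1 - relative_cost
print(f"relative cost: {relative_cost:.3f}, savings: {savings:.1%}")
# -> relative cost: 0.525, savings: 47.5%
```

So even with the slowdown factored in, the per-task cost drops by roughly 47%, which matches the "not quite 50%" estimate.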

As for failures:
current: 20
g3s.xlarge: 31 [1]

With some more work debugging the timing of tests/browser in a dozen or so cases, it would be realistic to expect the failure rates to be the same. That might take a few weeks.

[1] After fixing or disabling a few tests, the failures would be at 25.

:jrmuizel, do you have any other concerns? If you are OK with this, I will pop over to bholley for next steps.

Flags: needinfo?(jmuizelaar)

Looks good to me.

Flags: needinfo?(jmuizelaar)

:grenade, do you have any knowledge of the availability of the g3s.xlarge machines? I want to know if there would be enough of them to replace the existing g3.4xlarge pool

Flags: needinfo?(rthijssen)

i have no knowledge and as far as i'm aware, amazon don't publish capacity figures.

my suggestion would be to set multiple instance types for this worker, with a bias in favour of the cheaper instance. that way the provisioner can always fall back to the g3.4xlarge instances if there isn't enough g3s.xlarge capacity for our workload. if we run that way for a few weeks, we should be able to gauge if there is enough g3s.xlarge to remove the g3.4xlarge instance type from configuration.

eg, modify the config for now, to contain something like:

    "instanceTypes": [
      {
        "instanceType": "g3.4xlarge",
        "utility": 0.5,
        ...
      },
      {
        "instanceType": "g3s.xlarge",
        "utility": 1,
        ...
      }
    ],

Flags: needinfo?(rthijssen)
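For context, the utility values in the config above let the provisioner weigh price against throughput: roughly, it prefers the instance type with the lowest spot price per unit of utility. A hedged sketch of that comparison (the prices here are illustrative, and the real selection logic in aws-provisioner-v1 may differ in detail):

```python
# Sketch of utility-weighted instance selection, assuming the provisioner
# ranks candidate types by (spot price / utility). Prices are made up.
instance_types = [
    {"instanceType": "g3.4xlarge", "utility": 0.5, "spotPrice": 1.20},
    {"instanceType": "g3s.xlarge", "utility": 1.0, "spotPrice": 0.75},
]

def effective_price(it):
    # Lower is better: price paid per unit of useful capacity.
    return it["spotPrice"] / it["utility"]

best = min(instance_types, key=effective_price)
print(best["instanceType"])  # with these prices, g3s.xlarge wins
```

Biasing the utility in favour of g3s.xlarge means it is chosen whenever capacity is available at a reasonable price, with g3.4xlarge as the fallback.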

that is a good tip. We just need to fix up the few tests that are high-frequency or perma-fail:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a13612275f0bdd7721f8ec5b918f8bdbaa700ffe

:ahal, could you pick this up to green up the few tests? I assume we will need to look at win7 gpu instances as well. Not a P1, but something we could realistically do this month.

:grenade, how could we track how many devices of each we are using?

Flags: needinfo?(ahal)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #16)

:grenade, how could we track how many devices of each are we using?

perfherder records the instance type, so you can see a couple weeks of history there, but it might be enough just to look at the resources page on a busy friday afternoon or two and see how many of each type are running.
if you need more accurate data, dhouse may have ideas for a grafana report.

Sure, I'll try and take a look sometime this week or next.

Blocks: 1573872
Priority: -- → P2
Summary: gecko-t-win10-64-gpu could run on a cheaper intstance → gecko-t-win10-64-gpu could run on a cheaper instance

:dustin, coop is on PTO. He had set up these workers (aws-provisioner-v1/gecko-t-win10-64-gpu-s) and they seem to not be scheduling anymore. Can you help get this online?

Flags: needinfo?(dustin)

Did they work at some point? I see two instances running right now, and zero pending.

The jobs on Andrew's try push (comment 19) are for a different workerType. Were those supposed to be the same?

Flags: needinfo?(dustin)

it looks like Andrew's worker types should be aws-provisioner-v1/gecko-t-win10-64-gpu-s. How do you determine that the ones that did not run are not that? I could be overlooking something.

I clicked on the grayed-out pending jobs and clicked on the task, and saw gecko-t-win10-64-s.

Gotcha. Joel, does that mean that only the tasks that ran on my push use the new worker type? I.e., there are no problems and nothing needs to be triaged (other than the patch adding -s to non-gpu things)?

Flags: needinfo?(jmaher)

I honestly don't know. Did I do things wrong in the past? It would take me a while to figure this out; maybe tomorrow morning I will have time if this doesn't have a solution today.

Flags: needinfo?(jmaher)

I see what happened: your previous try push used !ship in the fuzzy syntax, so that's why the unclaimed tasks didn't run on your push. So yeah, while your patch does correctly change the -gpu tasks to -gpu-s, it also accidentally adds -s to non-gpu tasks. I'll see if I can come up with the correct patch.

I believe this is the patch we need to make the switch:
https://hg.mozilla.org/try/rev/d73e583974ee6e2fd1ce1d3ca0dca48b93686b18

One obvious question: do we want to switch all Windows 10-64 virtual-with-gpu tasks over at the same time, e.g. including things like nightly, ccov, shippable, etc.? Or should we just start with a subset of the configurations in that patch?

Everything, assuming it is green. These are presumably half the cost and only ~5% slower; that is a win we should make as complete as possible.

Just to note, the above patches can't land yet as we only have 20 instances allocated to the pool:
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types?search=win10-64-gpu

Given this pool is slower, we'll likely want slightly higher capacity than the old worker type (266). But I'm still verifying the tests, so no need to rush out and make the change quite yet.

I think the tests are more or less good to go:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7bd1abc2d2c7631649adeaa2a91955d307119d99

Dustin, how would you prefer we land this? Wait to increase the pool size before landing? Or land, see how big the backlog gets and set the pool size accordingly? If you aren't the one who will make the change feel free to redirect.

Thanks!

Flags: needinfo?(dustin)

Deferring to coop.

Flags: needinfo?(dustin) → needinfo?(coop)

(In reply to Andrew Halberstadt [:ahal] from comment #34)

I think the tests are more or less good to go:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7bd1abc2d2c7631649adeaa2a91955d307119d99

Dustin, how would you prefer we land this? Wait to increase the pool size before landing? Or land, see how big the backlog gets and set the pool size accordingly? If you aren't the one who will make the change feel free to redirect.

I would suggest bumping the pool size from 266->300 on initial landing. We can bump it further (or lower it) afterwards if required, but a small increase here seems prudent.

Flags: needinfo?(coop)

who can increase the pool size to 300?
how can we monitor the backlog/queue size?

It would be nice to have that information so anyone reading this bug can help out.

For clarification (so whoever ends up making the switch doesn't need to parse intent out of the comments above)...

We are switching tasks from:
gecko-t-win10-64-gpu

to:
gecko-t-win10-64-gpu-s

so the pool size of the former can be decreased proportionally to the increase in pool size of the latter. Note the latter will run ~5% slower, so the proportion isn't quite 1 to 1, but it's close enough that we can just do that and adjust later as needed.
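Assuming roughly one-to-one capacity with the ~5% slowdown penalty, the replacement pool size can be estimated from the numbers already in this bug (a sketch, not a sizing policy):

```python
import math

old_pool = 266     # existing gecko-t-win10-64-gpu capacity (comment #33)
slowdown = 1.05    # g3s.xlarge runs ~5% slower per job

# Each migrated worker delivers ~1/1.05 of the old throughput, so the new
# pool needs ~5% more instances to keep the pending backlog comparable.
new_pool = math.ceil(old_pool * slowdown)
print(new_pool)  # -> 280; the suggested 266->300 bump adds extra headroom
```

This is why the proposed bump to 300 (and later the maxCapacity of 512) errs on the generous side rather than matching the old pool exactly.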

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #37)

how can we monitor the backlog/queue size?

I think this link is the right one:
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types?search=win10-64-gpu&layout=grid

(Current backlog is from another try push I did to double check for more intermittents)

(In reply to Chris Cooper [:coop] pronoun: he from comment #36)

I would suggest bumping the pool size from 266->300 on initial landing. We can bump it further (or lower it) afterwards if required, but a small increase here seems prudent.

I think you might be looking at the wrong workertype. We need to increase the gpu-s pool which appears to have a size of 20 atm.

(In reply to Andrew Halberstadt [:ahal] from comment #40)

I think you might be looking at the wrong workertype. We need to increase the gpu-s pool which appears to have a size of 20 atm.

Sorry, I was just using the number from comment #33.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #37)

who can increase the pool size to 300?

AFAIK, anyone with scopes to modify the worker type definition:

https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu/view
https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/view

I've updated the maxCapacity on the gecko-t-win10-64-gpu-s worker type to 512:

https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/view

Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e5a0b04ebf3a
[taskgraph] Remove unused WINDOWS_WORKER_TYPES from tests transform, r=jmaher
https://hg.mozilla.org/integration/autoland/rev/cb1dbe32e155
[ci] Migrate Win10-64 virtual-with-gpu tasks to a cheaper workertype, r=jmaher
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71

This coincides with a 25% drop in the cost of mozilla-central pushes ($205->$154): https://sql.telemetry.mozilla.org/dashboard/ci-costs

Great job everyone.
