gecko-t-win10-64-gpu could run on a cheaper instance
Categories
(Testing :: General, enhancement, P2)
Tracking
(firefox-esr68 fixed, firefox70 fixed, firefox71 fixed)
People
(Reporter: jrmuizel, Assigned: jmaher)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
It looks like gecko-t-win10-64-gpu is currently running on a g3.4xlarge. Instead it could be using a g3s.xlarge, which can be half the price.
Assignee
Comment 1•6 years ago
:coop, I don't see this as an option in the in-tree code. Can you set up a pool of g3s.xlarge machines (10-20 of them) as t-win10-64-gpu-s so I can do a try push?
Comment 2•6 years ago
I'll get a pool set up for testing.
Comment 3•6 years ago
I'm having trouble getting a pool set up for testing due to InsufficientInstanceCapacity, or at least that's the error I'm hitting in US regions. It seems that g3s.xlarge is quite popular, being the cheapest GPU-enabled option. The question becomes: how much do we need to increase our spot bid to get capacity, and is it still cost-effective at that point?
Note that we quite often get spot request failures for the existing g3.4xlarge too, but we do bid pretty high for them already and are able to power through.
I'll keep iterating here.
Comment 4•6 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #3)
> I'll keep iterating here.
I did manage to get some instances by bumping our minPrice up to $2. There's a small pool configured now, with a min of 2 instances running and a max of 20:
https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/
Comment 5•6 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #4)
> https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/
Joel: I think the ball's in your court here now.
Assignee
Comment 6•6 years ago
hmm, my try push isn't scheduling anything:
https://hg.mozilla.org/try/rev/45b0d1b0e12f9d7ce49edf7e920f9c713e4d80e1
I wonder if there is something else needed to use the new worker type?
Comment 7•6 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #6)
> I wonder if there is something else needed to use the new worker type?
Oh, I thought you might simply subsume the existing gecko-t-win10-64-gpu worker type rather than creating a brand new type, e.g.:
diff --git a/taskcluster/ci/config.yml b/taskcluster/ci/config.yml
--- a/taskcluster/ci/config.yml
+++ b/taskcluster/ci/config.yml
@@ -375,17 +375,17 @@ workers:
            worker-type:
                by-level:
                    '3': 'gecko-{level}-t-osx-1014'
                    default: 'gecko-t-osx-1014'
        t-win10-64(|-gpu):
            provisioner: aws-provisioner-v1
            implementation: generic-worker
            os: windows
-            worker-type: 'gecko-{alias}'
+            worker-type: 'gecko-{alias}-s'
        t-win10-64(-hw|-ref-hw):
            provisioner: releng-hardware
            implementation: generic-worker
            os: windows
            worker-type: 'gecko-{alias}'
        t-win7-32(|-gpu):
            provisioner: aws-provisioner-v1
            implementation: generic-worker
The alternative is updating https://hg.mozilla.org/ci/ci-configuration/file/1a202649a61e58dc32cffcef33cc941d84f0e9f6/grants.yml, which we can also do, but that should be cleaned up afterwards too.
Assignee
Comment 8•6 years ago
I tried changing that with the same results:
https://hg.mozilla.org/try/rev/41fd6bd0d257242254ca19e6d11725a83cc31d7b
:coop, maybe we need to fix up the grants.yml?
Comment 9•6 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #8)
> I tried changing that with the same results:
> https://hg.mozilla.org/try/rev/41fd6bd0d257242254ca19e6d11725a83cc31d7b
> :coop, maybe we need to fix up the grants.yml?
Needing to officially commit something to the ci-configuration repo seems like a hurdle we'd like to avoid for in-tree experimentation with configs, but maybe I'm overly optimistic about the state of the art here, at least for Windows.
I'm going to loop in :grenade who might have a better idea where I went wrong here and how best to go about setting up variants of existing Windows test workers.
Here's what I tried in this case:
To setup the new worker type (https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/view), I clicked on CreateWorkerType in the aws-provisioner and copied in the definition for gecko-t-win10-64-gpu (https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu/view), and changed the instanceType, minCapacity, maxCapacity, minPrice, maxPrice, and scopes (scopes are apparently ignored).
New workers are being created -- you can see instances spinning up here: https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/resources -- but they aren't claiming work. Both jmaher and I have tried to submit tasks to these workers via Try, but the provisioner claims there are no workers available: https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-win10-64-gpu-s
Speaking to Dustin in slack, the instances are starting but likely never calling queue.claimWork, which is why they aren't getting added to the worker-types listing.
:grenade - did this approach (re-using existing worker definition with some tweaks) ever have any hope of working for Windows testers? How do you normally go about setting up variants for testing?
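(For illustration only, a minimal sketch of the kind of tweaks described above, written as a Python dict rather than the provisioner's actual JSON schema: the field names come from this comment, the capacity and minPrice values from comment 4, and everything else is an assumption.)

# Sketch of the fields copied from gecko-t-win10-64-gpu and changed for the
# experimental gecko-t-win10-64-gpu-s worker type. Not a dump of the real
# definition; values not stated in the bug are illustrative assumptions.
gpu_s_tweaks = {
    "instanceType": "g3s.xlarge",  # was g3.4xlarge on the original worker type
    "minCapacity": 2,              # small test pool (comment 4)
    "maxCapacity": 20,             # small test pool (comment 4)
    "minPrice": 2.0,               # spot bid bumped to $2 to get capacity (comment 4)
    "maxPrice": 2.0,               # assumption: not stated in the bug
    # "scopes": [...],             # copied as well, but apparently ignored
}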
Comment 10•6 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #9)
> :grenade - did this approach (re-using existing worker definition with some tweaks) ever have any hope of working for Windows testers?
yes, it might have worked. i see the instances start, run and log. then they eventually shut down with a message about gw not starting, which could be for any number of reasons. windows instances check the occ repo for a manifest relating to their worker type and of course we didn't have one, so maybe that's why they're missing something that would encourage productivity. i've added it to see if that helps.
> How do you normally go about setting up variants for testing?
i probably would have just changed instance type in the aws provisioner config for the beta worker type, since that's already up and running and exists for this sort of thing. but there's no harm in the approach used and it almost worked. i'll monitor now that the manifest exists and see if gecko-t-win10-64-gpu-s starts taking work...
Comment 11•6 years ago
(In reply to Rob Thijssen [:grenade (EET/UTC+0300)] from comment #10)
> i probably would have just changed instance type in the aws provisioner config for the beta worker type, since that's already up and running and exists for this sort of thing. but there's no harm in the approach used and it almost worked. i'll monitor now that the manifest exists and see if gecko-t-win10-64-gpu-s starts taking work...
Thanks, Rob. Good to know for the future.
It looks like some of Joel's jobs are completing now (https://treeherder.mozilla.org/#/jobs?repo=try&revision=a13612275f0bdd7721f8ec5b918f8bdbaa700ffe) although most are hitting exceptions from not being claimed in time.
Mine was triggered more recently, so it succeeded: https://treeherder.mozilla.org/#/jobs?repo=try&author=coop%40mozilla.com&selectedJob=260598293
Assignee
Comment 12•6 years ago
I did a before/after comparison with 5 retriggers for each job (some jobs might have a few extra). Overall the g3s.xlarge machines are 5% slower:
https://docs.google.com/spreadsheets/d/1uqeDOpaz_E49Ta7f7KiUZainmQ_irx4ux_xzTmrdgKs/edit#gid=0
^ links to try pushes at the top of the document
If these are 1/2 price and have as much availability as the current g3.4xlarge machines, we could really see some good savings (not quite 50%).
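(A rough sanity check on that estimate, as a sketch: assume exactly half the hourly price and the ~5% longer runtimes measured above, and ignore spot-price variation.)

# Back-of-the-envelope: half the hourly price, ~5% more runtime per job.
old_cost = 1.00            # normalized cost of a job on g3.4xlarge
new_cost = 0.50 * 1.05     # g3s.xlarge: half the price, 5% slower
print(f"savings per job: {1 - new_cost / old_cost:.1%}")  # ~47.5%, i.e. not quite 50%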
As for failures:
current: 20
g3s.xlarge: 31 [1]
With some more work debugging the timing of tests/browser in a dozen or so cases, it would be realistic to expect the failure rates to be the same. That might take a few weeks.
[1] With a few tests fixed or disabled, the failures would be at 25.
:jrmuizel, do you have any other concerns? I think if you are ok, I will pop over to bholley for next steps.
Assignee
Comment 14•6 years ago
:grenade, do you have any knowledge of the availability of the g3s.xlarge machines? I want to know if there would be enough of them to replace the existing g3.4xlarge pool.
Comment 15•6 years ago
i have no knowledge and as far as i'm aware, amazon don't publish capacity figures.
my suggestion would be to set multiple instance types for this worker, with a bias in favour of the cheaper instance. that way the provisioner can always fall back to the g3.4xlarge instances if there isn't enough g3s.xlarge capacity for our workload. if we run that way for a few weeks, we should be able to gauge if there is enough g3s.xlarge to remove the g3.4xlarge instance type from configuration.
eg, modify the config for now, to contain something like:
"instanceTypes": [
{
"instanceType": "g3.4xlarge",
"utility": 0.5,
...
},
{
"instanceType": "g3s.xlarge",
"utility": 1,
...
}
],
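(As an aside on how that bias works: my understanding of aws-provisioner-v1, which may be simplified, is that it weighs each instance type's spot price against its utility factor and prefers the lowest price per unit of utility. A rough sketch of that comparison follows, with made-up spot prices; it is not the provisioner's actual code.)

# Sketch of a utility-weighted choice between the two instance types.
# Assumes the provisioner prefers the lowest spot price divided by utility;
# the spot prices below are hypothetical.
instance_types = [
    {"instanceType": "g3.4xlarge", "utility": 0.5},
    {"instanceType": "g3s.xlarge", "utility": 1.0},
]

def cheapest(spot_prices):
    # spot_prices maps instanceType -> current spot price in USD/hour
    return min(instance_types,
               key=lambda it: spot_prices[it["instanceType"]] / it["utility"])

# With these utilities, g3.4xlarge only wins if its spot price drops below
# twice the g3s.xlarge price.
print(cheapest({"g3.4xlarge": 1.14, "g3s.xlarge": 0.75})["instanceType"])  # g3s.xlarge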
Assignee
Comment 16•6 years ago
that is a good tip. we just need to fix up the few tests that are high frequency/perma fail:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a13612275f0bdd7721f8ec5b918f8bdbaa700ffe
:ahal, could you pick this up to green up the few tests? I assume we will need to look at win7 gpu instances as well. Not a p1, but something we could realistically do this month.
:grenade, how could we track how many devices of each we are using?
Comment 17•6 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #16)
> :grenade, how could we track how many devices of each we are using?
perfherder records the instance type, so you can see a couple weeks of history there, but it might be enough just to look at the resources page on a busy friday afternoon or two and see how many of each type is running.
if you need more accurate data, dhouse may have ideas for a grafana report.
Comment 18•6 years ago
Sure, I'll try and take a look sometime this week or next.
Updated•6 years ago
Comment 19•6 years ago
Jobs have been pending for 24hrs on my push:
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=d7b39265f4fd3107842ea623000964d51ea4589e
Assignee
Comment 20•6 years ago
:dustin, coop is on PTO. He had set up these workers (aws-provisioner-v1/gecko-t-win10-64-gpu-s) and they seem to not be scheduling anymore; can you help get this online?
Comment 21•6 years ago
Did they work at some point? I see two instances running right now, and zero pending.
The jobs on Andrew's try push (comment 19) are for a different workerType. Were those supposed to be the same?
Assignee
Comment 22•6 years ago
it looks like Andrew's worker types should be aws-provisioner-v1/gecko-t-win10-64-gpu-s; how do you determine that the ones that did not run are not that? I could be overlooking something.
Comment 23•6 years ago
I clicked on the grayed-out pending jobs and clicked on the task, and saw gecko-t-win10-64-s.
Comment 24•6 years ago
They were working on Joel's push here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a13612275f0bdd7721f8ec5b918f8bdbaa700ffe
For example:
https://tools.taskcluster.net/groups/BEnyAuP5S9a5ScNm_6KjiQ/tasks/XmKeh_7VTCCmXdAYqmrzAw/details
Fwiw my push uses the exact same patch as Joel's did.
Comment 25•6 years ago
Just to be clear, https://tools.taskcluster.net/groups/COMmQSW2QaKc1nH3sUC_OA/tasks/Ihs4ZEnnSTaKr2elwOYK8A/details has WorkerType: gecko-t-win10-64-s, and there is no such workerType in https://tools.taskcluster.net/aws-provisioner. https://tools.taskcluster.net/groups/BEnyAuP5S9a5ScNm_6KjiQ/tasks/XmKeh_7VTCCmXdAYqmrzAw/details, linked in comment 24, has WorkerType: gecko-t-win10-64-gpu-s, which does exist.
Comment 26•6 years ago
Gotcha. Joel, does that mean that only the tasks that ran on my push use the new worker type? Aka there are no problems and nothing needs to be triaged (other than the patch adding -s to non-gpu things)?
Assignee
Comment 27•6 years ago
I honestly don't know. Did I do things wrong in the past? It would take me a while to figure this out; maybe tomorrow morning I will have time if this doesn't have a solution today.
Comment 28•6 years ago
I see what happened. Your previous try push used !ship in the fuzzy syntax, so that's why the unclaimed tasks didn't run on your push. So yeah, while your patch does correctly change the -gpu tasks to -gpu-s, it also accidentally adds -s to non-gpu tasks. I'll see if I can come up with the correct patch.
Comment 29•6 years ago
I believe this is the patch we need to make the switch:
https://hg.mozilla.org/try/rev/d73e583974ee6e2fd1ce1d3ca0dca48b93686b18
One obvious question: do we want to switch all Windows 10-64 virtual-with-gpu tasks over at the same time, e.g. including things like nightly, ccov, shippable, etc., or should we just start with a subset of the configurations in that patch?
Assignee
Comment 30•6 years ago
Everything, assuming it is green. These are presumably half the cost and ~5% slower; that is a win we should make as complete as possible.
Comment 31•5 years ago
Comment 32•5 years ago
Depends on D43819
Comment 33•5 years ago
Just to note, the above patches can't land yet as we only have 20 instances allocated to the pool:
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types?search=win10-64-gpu
Given this pool is slower, we'll likely want slightly higher capacity than the old worker type (266). But I'm still verifying the tests, so no need to rush out and make the change quite yet.
Comment 34•5 years ago
I think the tests are more or less good to go:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7bd1abc2d2c7631649adeaa2a91955d307119d99
Dustin, how would you prefer we land this? Wait to increase the pool size before landing? Or land, see how big the backlog gets and set the pool size accordingly? If you aren't the one who will make the change feel free to redirect.
Thanks!
Comment 36•5 years ago
(In reply to Andrew Halberstadt [:ahal] from comment #34)
> I think the tests are more or less good to go:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=7bd1abc2d2c7631649adeaa2a91955d307119d99
> Dustin, how would you prefer we land this? Wait to increase the pool size before landing? Or land, see how big the backlog gets and set the pool size accordingly? If you aren't the one who will make the change feel free to redirect.
I would suggest bumping the pool size from 266->300 on initial landing. We can bump it further (or lower it) afterwards if required, but a small increase here seems prudent.
Assignee
Comment 37•5 years ago
who can increase the pool size to 300?
how can we monitor the backlog/queue size?
It would be nice to have that information so anyone reading this bug can help out.
Comment 38•5 years ago
For clarification (so whoever ends up making the switch doesn't need to parse intent out of the comments above)...
We are switching tasks from gecko-t-win10-64-gpu to gecko-t-win10-64-gpu-s, so the pool size of the former can be decreased proportionally to the increase in pool size of the latter. Note the latter will run ~5% slower so the proportion isn't quite 1 to 1, but it's close enough that we can just do that and adjust later as needed.
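(A quick back-of-the-envelope for that proportion, as a sketch using the 266 figure from comment 33 and the ~5% slowdown measured earlier in the bug:)

# Same throughput with jobs that run ~5% longer needs ~5% more instances.
old_pool = 266                     # gecko-t-win10-64-gpu capacity (comment 33)
slowdown = 1.05                    # g3s.xlarge jobs are ~5% slower
print(round(old_pool * slowdown))  # ~279; the suggested bump to ~300 adds headroom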
Comment 39•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #37)
> how can we monitor the backlog/queue size?
I think this link is the right one:
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types?search=win10-64-gpu&layout=grid
(Current backlog is from another try push I did to double check for more intermittents)
Comment 40•5 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #36)
> I would suggest bumping the pool size from 266->300 on initial landing. We can bump it further (or lower it) afterwards if required, but a small increase here seems prudent.
I think you might be looking at the wrong workertype. We need to increase the gpu-s pool, which appears to have a size of 20 atm.
Comment 41•5 years ago
Backlog / queue size is available for monitoring at https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?orgId=1&var-workerType=gecko-t-win10-64-gpu-s&refresh=1m
Comment 42•5 years ago
(In reply to Andrew Halberstadt [:ahal] from comment #40)
> I think you might be looking at the wrong workertype. We need to increase the gpu-s pool, which appears to have a size of 20 atm.
Sorry, I was just using the number from comment #33.
Comment 43•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #37)
> who can increase the pool size to 300?
AFAIK, anyone with scopes to modify the worker type definitions:
https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu/view
https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/view
Comment 44•5 years ago
I've updated the maxCapacity on the gecko-t-win10-64-gpu-s worker type to 512:
https://tools.taskcluster.net/aws-provisioner/gecko-t-win10-64-gpu-s/view
Comment 45•5 years ago
Comment 46•5 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/e5a0b04ebf3a
https://hg.mozilla.org/mozilla-central/rev/cb1dbe32e155
Comment 47•5 years ago
This coincides with a 25% drop in the cost of mozilla-central pushes ($205->$154): https://sql.telemetry.mozilla.org/dashboard/ci-costs
Great job everyone.
Comment 48•5 years ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-esr68/rev/17e6c22c16f8
https://hg.mozilla.org/releases/mozilla-esr68/rev/ab5181b89dab
Comment 49•5 years ago
bugherder uplift
Comment 50•5 years ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-release/rev/6cbc37c0edf2
https://hg.mozilla.org/releases/mozilla-release/rev/0100c63fd822