Closed Bug 1302530 Opened 8 years ago Closed 8 years ago

Not enough g-w732 instances running

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Unassigned)

References

Details

Attachments

(2 files)

      No description provided.
There are several people working on this behind the scenes. Prices for the g2.2xlarge instance type are currently very high in us-east-1 at $7.67, which is nearly 10 times our bid.
Prices have dropped somewhat, and we've boosted our bid. Waiting to see if instances are launched.
Pending is starting to drop as instances come online.
papertrail:

* we tried bumping price from 0.79 to 5.00 in use1:
** https://github.com/mozilla-releng/build-cloud-tools/commit/169ca3d70f2e3414847a1ca35c6192ae123270d8

* wasn't enough, so we 1) bumped it to $8.00 to be ~20% higher than today's use1 market peak of $6.50. 2) tried to add some ondemand instances so the cost of spot doesn't hurt so much
** https://github.com/mozilla-releng/build-cloud-tools/commit/4899277412298caf023d592157a05be7e7062607
** https://github.com/mozilla-releng/build-cloud-tools/commit/4899277412298caf023d592157a05be7e7062607

* ondemand never started. likely something to do with the way slavealloc + buildbot are configured. we can probably back out the ondemand patch lines from above

* we currently don't have any g-w732-spot instances in usw2. To help with load and failover, we filed a bug to add more g-w732-spot machines in bbot+slavealloc
** https://bugzilla.mozilla.org/show_bug.cgi?id=1302549
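For reference, the bid arithmetic above can be checked with a few lines. This is only a sketch: the helper names are made up for illustration and are not part of build-cloud-tools; the prices are the ones quoted in this bug.

```python
def bid_for_headroom(market_peak, headroom=0.20):
    """Return a bid that sits `headroom` (as a fraction) above the observed peak."""
    return round(market_peak * (1 + headroom), 2)


def headroom_over_peak(bid, market_peak):
    """Return how far a bid sits above the peak, as a fraction of the peak."""
    return bid / market_peak - 1


# A strict 20% margin over the $6.50 peak would be a bid of about $7.80.
print(bid_for_headroom(6.50))

# The $8.00 bid we actually set is roughly 23% above that peak.
print(round(headroom_over_peak(8.00, 6.50), 3))

# The earlier $7.67 spot price against the original $0.79 bid.
print(round(7.67 / 0.79, 1))
```

So $8.00 is slightly more generous than a strict 20% margin, and the original $0.79 bid was a little under a tenth of the $7.67 price.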



current status: we currently have 23 pending for g-w732-spot[1]. That's down from 469 pending when this bug was filed. we currently have 84 g2.2xlarge instances running[2]. That's up from 0 when this bug was filed.

based on that status, we can probably close this bug and reopen trees when sheriffs are ready and can confirm similar state.

[1] https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=g-w732-spot
[2] https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:instanceType=%27g2.2xlarge%27;instanceState=running;sort=tag:Name
Depends on: 1302549
(In reply to Jordan Lund (:jlund) from comment #4)
> papertrail:
> 
> * wasn't enough, so we 1) bumped it to $8.00 to be ~20% higher than today's
> use1 market peak of $6.5. 2) tried to add some ondemand instances so we the
> cost of spot doesn't hurt so much
> **
> https://github.com/mozilla-releng/build-cloud-tools/commit/
> 4899277412298caf023d592157a05be7e7062607
> **
> https://github.com/mozilla-releng/build-cloud-tools/commit/
> 4899277412298caf023d592157a05be7e7062607

apologies, second link should have been:
** https://github.com/mozilla-releng/build-cloud-tools/commit/d893ab899712973e9da4a5d3a2b81ad2c96ac0d8
resolving for now. please reopen if we regress again. will file clean up bugs to revert temp fixes
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Depends on: 1302578
11:46 AM <philor> KWierso: 10 Win7 GFX jobs running, 694 pending
11:47 AM <philor> !squirrel time to blow the budget again
11:48 AM <philor> or find out who it is that's outbidding us, and run them out of business
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 8791361 [details]
Bug 1302530 - Add ondemand g-w732 instances

https://reviewboard.mozilla.org/r/78784/#review77408
Attachment #8791361 - Flags: review?(rail) → review+
We're trying to get ondemand instances working for g-w732. Rail has added entries in slavealloc, and the above patches to buildbot have been deployed.
(In reply to Chris AtLee [:catlee] from comment #14)
> We're trying to get ondemand instances working for g-w732. Rail has added
> entries in slavealloc, and the above patches to buildbot have been deployed.

braindump of steps taken trying to figure out why ondemand g-w732 instances aren't being created...

16:14:54 <jlund> sanity check: does setting 'distro' make any difference in slavealloc? they are set to win32 while the spot are win7
16:15:10 <•catlee> hm, not sure
16:15:16 <•catlee> they should probably be the same
16:31:41 — jlund sees a lot of errors reporting we have run out of subnet addresses
16:35:08 <jlund> also hitting this a lot: https://github.com/mozilla-releng/build-cloud-tools/blob/32babcc491c84c792ef5e806a40b9c1de7714ab2/cloudtools/scripts/aws_watch_pending.py#L132
16:35:52 <jlund> which actually might be normal
16:52:32 <jlund> hm
16:52:36 <jlund> https://irccloud.mozilla.com/pastebin/ZIOPjo2I 
16:55:39 <jlund> aws_watch_pending should see them. they are in the slaves.json cache
16:55:41 <jlund> {"custom_tplid": 4, "poolid": 92, "speed": "g2.2xlarge", "bitlength": "32", "basedir": "c:\\slave", "dcid": 19, "environment": "prod", "trustid": 4, "distro": "win32", "trustlevel": "try", "envid": 2, "speedid": 30, "datacenter": "us-east-1", "locked_masterid": null, "locked_master": null, "purpose": "tests", "slaveid": 32008, "pool":
16:55:41 <jlund> "tests-use1-windows", "bitsid": 1, "current_masterid": null, "name": "g-w732-ec2-001", "distroid": 46, "enabled": true, "purposeid": 4, "current_master": null, "notes": null}
17:07:07 <jlund> oh, maybe it does need to be win7: https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/g-w732#L33
17:07:32 <•catlee> The distro still needs changing? 
17:08:00 <jlund> I changed 1 of the 2 a while back
17:09:56 <jlund> ah, yeah might as well switch them both given: https://github.com/mozilla-releng/build-cloud-tools/blob/e26be59bf63258786f65875da506dbf1f72eecbe/cloudtools/slavealloc.py#L129
17:16:56 <jlund> so I'm not sure why this isn't working: https://github.com/mozilla-releng/build-cloud-tools/blob/32babcc491c84c792ef5e806a40b9c1de7714ab2/cloudtools/scripts/aws_watch_pending.py#L489
17:18:23 <jlund> maybe it does and we don't have logic for 'create' ondemand. only start/resume? https://github.com/mozilla-releng/build-cloud-tools/blob/32babcc491c84c792ef5e806a40b9c1de7714ab2/cloudtools/scripts/aws_watch_pending.py#L511
17:21:07 <•catlee> That's kind of what I remembered 
17:29:59 <jlund> ah
17:30:01 <jlund> here we go:
17:30:05 <jlund> https://irccloud.mozilla.com/pastebin/XJu8hELv
17:30:52 <jlund> maybe this is blocking us from doing an ondemand request: https://github.com/mozilla-releng/build-cloud-tools/blob/32babcc491c84c792ef5e806a40b9c1de7714ab2/cloudtools/scripts/aws_watch_pending.py#L295
17:32:54 <jlund> why is availability_zone None
17:33:53 <jlund> oh, cause we tell it to be: https://github.com/mozilla-releng/build-cloud-tools/blob/32babcc491c84c792ef5e806a40b9c1de7714ab2/cloudtools/scripts/aws_watch_pending.py#L119
17:38:32 <jlund> maybe spot is competing and not letting ondemand have any ip's https://github.com/mozilla-releng/build-cloud-tools/blob/3a2158f84252629df05d2191dda574f9ba2dd4e0/cloudtools/aws/vpc.py#L48-L50
17:42:01 <jlund> nthomas: ^ any of this ringing a bell?
17:43:02 <•nthomas> how big are the subnets ? Some of them are only 125 IPs or so, but there’s 20 nets in the list there
17:46:37 <jlund> subnet-5bafe92c for example has 41 available ips
17:47:23 <jlund> e40e0786 has 120 available


at this point I would guess 1) there are issues with subnetting or 2) watch_pending is only trying to 'resume', not 'create'
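
A rough way to sanity-check the subnet theory is to filter the subnet list by free addresses. The sketch below is not the logic in cloudtools/aws/vpc.py: the `Subnet` tuple and `subnets_with_room` are hypothetical names, the two counts come from the IRC log above, and the third entry is invented to show the filtering.

```python
from collections import namedtuple

# Hypothetical stand-in for whatever the AWS API returns per subnet.
Subnet = namedtuple("Subnet", ["subnet_id", "available_ips"])


def subnets_with_room(subnets, needed=1):
    """Return subnets that can fit `needed` more instances, most free IPs first."""
    usable = [s for s in subnets if s.available_ips >= needed]
    return sorted(usable, key=lambda s: s.available_ips, reverse=True)


subnets = [
    Subnet("subnet-5bafe92c", 41),          # count from the IRC log
    Subnet("e40e0786", 120),                # count from the IRC log, id as pasted
    Subnet("subnet-hypothetical-full", 0),  # invented: an exhausted subnet
]

candidates = subnets_with_room(subnets)
print([s.subnet_id for s in candidates])
```

If at least some subnets still have room like this, address exhaustion alone would not explain why no ondemand instance gets created, which would point back at guess 2).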
Quick status update: we're now fighting with new test failures on WinXP and Win8. All the XP machines have been re-imaged in bug 1302863.
Just to clarify some things said in IRC in comment 15:

* Even though the function name contains "resume", it starts instances from scratch, see https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L116, similarly to how we start spot instances, except for one parameter (is_spot=False). If this function is called, it should start ondemand instances (but it's not called, see below)

* Ondemand instances are launched with instance_initiated_shutdown_behavior="terminate" https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L411, so they should be terminated when runner shuts them down. Similar to spots.

* the main reason why we don't call aws_resume_instances is this block:
https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L478-L489. "started" is always equal to "count" if we try to start spot instances.

It would be called if:
** we had no rules for this instance type: https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L147-L150
** we have no "spot_choices" in https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L155-L158, which is calculated in https://github.com/mozilla-releng/build-cloud-tools/blob/7f7e5c28e3fe086828f602f580afc6c08e8dbf85/cloudtools/aws/spot.py#L302-L332. Looks like if we decrease the bid prices enough, we should get the ondemand instances started.
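
To make the fallthrough above concrete, here is a heavily simplified sketch. It is not the code in aws_watch_pending.py: `plan_requests` and `optimistic_request_spot` are hypothetical names, and the only behaviour modelled is what is described in this comment (spot requests report "started" equal to "count", and ondemand only gets the leftover). The "no rules for this instance type" case is left out.

```python
def plan_requests(count, spot_choices, request_spot):
    """Return (spot_started, to_create_ondemand) for `count` pending jobs.

    Hypothetical simplification of the spot-then-ondemand fallthrough.
    """
    if not spot_choices:
        # No acceptable spot market at our bid: everything falls to ondemand.
        return 0, count
    started = request_spot(count, spot_choices)
    # Ondemand only sees whatever the spot step did not claim.
    return started, count - started


def optimistic_request_spot(count, spot_choices):
    # Assumption: a placed spot request counts as started, even if it is
    # outbid later, so this always reports the full count.
    return count


# With any spot choice available, nothing is left over for ondemand.
print(plan_requests(10, ["us-east-1a"], optimistic_request_spot))

# With no spot choices (e.g. bid below every market price), ondemand gets it all.
print(plan_requests(10, [], optimistic_request_spot))
```

Under these assumptions the ondemand leftover is zero whenever there is at least one spot choice, which matches what we saw; lowering the bid far enough to empty spot_choices is what would push the whole count to ondemand.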
Severity: blocker → major
Depends on: 1304158
(In reply to Rail Aliiev [:rail] from comment #17)
> Just to clarify some things said in IRC in comment 15:
> 

thanks for clarifying!

> * the main reason why we don't call aws_resume_instances is this block:
> https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/
> scripts/aws_watch_pending.py#L478-L489. "started" is always equals to
> "count" if we try to start spot instances.

I think we were hitting this block though: https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L492

e.g. https://irccloud.mozilla.com/pastebin/ZIOPjo2I

I thought subnets were an issue for a while: https://irccloud.mozilla.com/pastebin/XJu8hELv
(In reply to Jordan Lund (:jlund) from comment #18)
> (In reply to Rail Aliiev [:rail] from comment #17)
> > Just to clarify some things said in IRC in comment 15:
> > 
> 
> thanks for clarifying!
> 
> > * the main reason why we don't call aws_resume_instances is this block:
> > https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/
> > scripts/aws_watch_pending.py#L478-L489. "started" is always equals to
> > "count" if we try to start spot instances.
> 
> I think we were hitting this block though:
> https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/
> scripts/aws_watch_pending.py#L492
>
(rail explains over irc)
<rail-mtg> jlund|mtg: yup, which is set in https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L478-L489
11:41:00 <rail-mtg> it changes to_create_ondemand
11:41:25 <jlund|mtg> ooo
11:41:28 <rail-mtg> we never get any leftover there
11:41:58 <rail-mtg> outbids are out of band, not counted there for reals
Depends on: 1304831
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard