Closed
Bug 1302530
Opened 8 years ago
Closed 8 years ago
Not enough g-w732 instances running
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Unassigned)
References
Details
Attachments
(2 files)
No description provided.
Comment 1•8 years ago
There are several people working on this behind the scenes. Prices for the g2.2xlarge instance type are currently very high in us-east-1 at $7.67, nearly 10 times our bid.
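As a rough illustration (assuming boto3 and configured AWS credentials; this is not the code cloud-tools actually runs), comparing the current g2.2xlarge spot market price against a bid could look like this:

import boto3

# Illustrative only: query recent g2.2xlarge Windows spot prices in us-east-1
# and flag which availability zones are priced above a given bid.
ec2 = boto3.client("ec2", region_name="us-east-1")
history = ec2.describe_spot_price_history(
    InstanceTypes=["g2.2xlarge"],
    ProductDescriptions=["Windows"],
    MaxResults=10,
)
our_bid = 0.79  # the bid in place before it was bumped (see comment 4)
for entry in history["SpotPriceHistory"]:
    market = float(entry["SpotPrice"])
    status = "above our bid" if market > our_bid else "within our bid"
    print(entry["AvailabilityZone"], market, status)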
Comment 2•8 years ago
Prices have dropped somewhat, and we've boosted our bid. Waiting to see if instances are launched.
Comment 3•8 years ago
Pending is starting to drop as instances come online.
Comment 4•8 years ago
papertrail:
* we tried bumping the bid price from $0.79 to $5.00 in use1:
** https://github.com/mozilla-releng/build-cloud-tools/commit/169ca3d70f2e3414847a1ca35c6192ae123270d8
* that wasn't enough, so we 1) bumped it to $8.00 to be ~20% higher than today's use1 market peak of $6.50 (rough arithmetic sketched below), and 2) tried to add some ondemand instances so the cost of spot doesn't hurt so much
** https://github.com/mozilla-releng/build-cloud-tools/commit/4899277412298caf023d592157a05be7e7062607
** https://github.com/mozilla-releng/build-cloud-tools/commit/4899277412298caf023d592157a05be7e7062607
* ondemand never started. likely something to do with the way slavealloc + buildbot are configured. we can probably back out the ondemand patch lines from above
* we currently don't have any g-w732-spot instances in usw2. To help with load and failover, we filed a bug to add more g-w732-spot machines in bbot+slavealloc
** https://bugzilla.mozilla.org/show_bug.cgi?id=1302549
current status: we have 23 pending for g-w732-spot[1], down from 469 when this bug was filed, and 84 g2.2xlarge instances running[2], up from 0 when this bug was filed.
based on that status, we can probably close this bug and reopen trees once sheriffs are ready and can confirm a similar state.
[1] https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=g-w732-spot
[2] https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:instanceType=%27g2.2xlarge%27;instanceState=running;sort=tag:Name
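The arithmetic behind the $8.00 bump in step 1, sketched with the peak price quoted above (the ~20% margin is the stated intent; rounding up to a flat $8.00 is how the change landed, not something derived here):

# Rough reconstruction of the bid bump described above (illustrative numbers).
use1_market_peak = 6.50          # today's observed use1 peak for g2.2xlarge
target_margin = 0.20             # aim for ~20% headroom over that peak
suggested_bid = use1_market_peak * (1 + target_margin)   # 7.80
print(round(suggested_bid, 2))   # rounded up to a flat $8.00 in practice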
Depends on: 1302549
Comment 5•8 years ago
(In reply to Jordan Lund (:jlund) from comment #4)
> papertrail:
>
> * wasn't enough, so we 1) bumped it to $8.00 to be ~20% higher than today's
> use1 market peak of $6.5. 2) tried to add some ondemand instances so we the
> cost of spot doesn't hurt so much
> ** https://github.com/mozilla-releng/build-cloud-tools/commit/4899277412298caf023d592157a05be7e7062607
> ** https://github.com/mozilla-releng/build-cloud-tools/commit/4899277412298caf023d592157a05be7e7062607
apologies, second link should have been:
** https://github.com/mozilla-releng/build-cloud-tools/commit/d893ab899712973e9da4a5d3a2b81ad2c96ac0d8
Comment 6•8 years ago
resolving for now. please reopen if we regress again. will file clean-up bugs to revert the temporary fixes
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 7•8 years ago
11:46 AM <philor> KWierso: 10 Win7 GFX jobs running, 694 pending
11:47 AM <philor> !squirrel time to blow the budget again
11:48 AM <philor> or find out who it is that's outbidding us, and run them out of business
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment hidden (mozreview-request)
Comment 10•8 years ago
mozreview-review
Comment on attachment 8791361 [details]
Bug 1302530 - Add ondemand g-w732 instances
https://reviewboard.mozilla.org/r/78784/#review77408
Attachment #8791361 - Flags: review?(rail) → review+
Reporter
Comment 11•8 years ago
https://hg.mozilla.org/build/buildbot-configs/rev/ba50299bcd9fe86df5aa694a0a027ee2168041b0
Bug 1302530 - Add ondemand g-w732 instances r=rail
Comment 12•8 years ago
Comment 13•8 years ago
Reporter
Comment 14•8 years ago
We're trying to get ondemand instances working for g-w732. Rail has added entries in slavealloc, and the above patches to buildbot have been deployed.
Comment 15•8 years ago
(In reply to Chris AtLee [:catlee] from comment #14)
> We're trying to get ondemand instances working for g-w732. Rail has added
> entries in slavealloc, and the above patches to buildbot have been deployed.
braindump of steps taken trying to figure out why ondemand g-w732 instances aren't being created...
16:14:54 <jlund> sanity check: does setting 'distro' make any difference in slavealloc? they are set to win32 while the spot are win7
16:15:10 <•catlee> hm, not sure
16:15:16 <•catlee> they should probably be the same
16:31:41 — jlund sees a lot of errors reporting we have run out of subnet addresses
16:35:08 <jlund> also hitting this a lot: https://github.com/mozilla-releng/build-cloud-tools/blob/32babcc491c84c792ef5e806a40b9c1de7714ab2/cloudtools/scripts/aws_watch_pending.py#L132
16:35:52 <jlund> which actually might be normal
16:52:32 <jlund> hm
16:52:36 <jlund> https://irccloud.mozilla.com/pastebin/ZIOPjo2I
16:55:39 <jlund> aws_watch_pending should see them. they are in the slaves.json cache
16:55:41 <jlund> {"custom_tplid": 4, "poolid": 92, "speed": "g2.2xlarge", "bitlength": "32", "basedir": "c:\\slave", "dcid": 19, "environment": "prod", "trustid": 4, "distro": "win32", "trustlevel": "try", "envid": 2, "speedid": 30, "datacenter": "us-east-1", "locked_masterid": null, "locked_master": null, "purpose": "tests", "slaveid": 32008, "pool":
16:55:41 <jlund> "tests-use1-windows", "bitsid": 1, "current_masterid": null, "name": "g-w732-ec2-001", "distroid": 46, "enabled": true, "purposeid": 4, "current_master": null, "notes": null}
17:07:07 <jlund> oh, maybe it does need to be win7: https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/g-w732#L33
17:07:32 <•catlee> The distro still needs changing?
17:08:00 <jlund> I changed 1 of the 2 a while back
17:09:56 <jlund> ah, yeah might as well switch them both given: https://github.com/mozilla-releng/build-cloud-tools/blob/e26be59bf63258786f65875da506dbf1f72eecbe/cloudtools/slavealloc.py#L129
17:16:56 <jlund> so I'm not sure why this isn't working: https://github.com/mozilla-releng/build-cloud-tools/blob/32babcc491c84c792ef5e806a40b9c1de7714ab2/cloudtools/scripts/aws_watch_pending.py#L489
17:18:23 <jlund> maybe it does and we don't have logic for 'create' ondemand. only start/resume? https://github.com/mozilla-releng/build-cloud-tools/blob/32babcc491c84c792ef5e806a40b9c1de7714ab2/cloudtools/scripts/aws_watch_pending.py#L511
17:21:07 <•catlee> That's kind of what I remembered
17:29:59 <jlund> ah
17:30:01 <jlund> here we go:
17:30:05 <jlund> https://irccloud.mozilla.com/pastebin/XJu8hELv
17:30:52 <jlund> maybe this is blocking us from doing an ondemand request: https://github.com/mozilla-releng/build-cloud-tools/blob/32babcc491c84c792ef5e806a40b9c1de7714ab2/cloudtools/scripts/aws_watch_pending.py#L295
17:32:54 <jlund> why is availability_zone None
17:33:53 <jlund> oh, cause we tell it to be: https://github.com/mozilla-releng/build-cloud-tools/blob/32babcc491c84c792ef5e806a40b9c1de7714ab2/cloudtools/scripts/aws_watch_pending.py#L119
17:38:32 <jlund> maybe spot is competing and not letting ondemand have any ip's https://github.com/mozilla-releng/build-cloud-tools/blob/3a2158f84252629df05d2191dda574f9ba2dd4e0/cloudtools/aws/vpc.py#L48-L50
17:42:01 <jlund> nthomas: ^ any of this ringing a bell?
17:43:02 <•nthomas> how big are the subnets ? Some of them are only 125 IPs or so, but there’s 20 nets in the list there
17:46:37 <jlund> subnet-5bafe92c for example has 41 available ips
17:47:23 <jlund> e40e0786 has 120 available
at this point I would guess either 1) there are issues with subnetting, or 2) watch_pending is only trying to 'resume', not 'create' (a quick check for the first guess is sketched below)
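A quick, hypothetical way to check guess 1 (assuming boto3 and credentials; the subnet ID is the one mentioned in full in the log above; not part of cloud-tools itself):

import boto3

# Illustrative only: report how many free IPs the subnet mentioned above has.
ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_subnets(SubnetIds=["subnet-5bafe92c"])
for subnet in resp["Subnets"]:
    print(subnet["SubnetId"], subnet["AvailabilityZone"],
          subnet["AvailableIpAddressCount"], "available IPs")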
Reporter
Comment 16•8 years ago
Quick status update: we're now fighting with new test failures on WinXP and Win8. All the XP machines have been re-imaged in bug 1302863.
Comment 17•8 years ago
Just to clarify some things said in IRC in comment 15:
* Even though the function name contains "resume", it starts instances from scratch (see https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L116), similar to how we start spot instances except for one parameter (is_spot=False). If this function is called, it should start ondemand instances (but it's not called, see below)
* Ondemand instances are launched with instance_initiated_shutdown_behavior="terminate" https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L411, so they should be terminated when runner shuts them down. Similar to spots.
* the main reason why we don't call aws_resume_instances is this block:
https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L478-L489. "started" is always equal to "count" if we try to start spot instances.
It would be called if:
** we had no rules for this instance type: https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L147-L150
** we had no "spot_choices" in https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L155-L158, which is calculated in https://github.com/mozilla-releng/build-cloud-tools/blob/7f7e5c28e3fe086828f602f580afc6c08e8dbf85/cloudtools/aws/spot.py#L302-L332. Looks like if we decrease the bid prices enough, we should get the ondemand instances started (this flow is paraphrased in the sketch below).
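Paraphrasing that flow as a standalone sketch (simplified stand-ins, not excerpts from aws_watch_pending.py):

# Simplified, hypothetical paraphrase of the decision flow described above.
def launch_for_pending(count, rules, spot_choices,
                       request_spot, aws_resume_instances):
    """Try spot first; fall back to ondemand only when spot can't be used."""
    if not rules or not spot_choices:
        # No spot rules or no viable spot choices: go straight to ondemand.
        return aws_resume_instances(count)
    started = request_spot(spot_choices, count)
    # The catch described above: the spot path reports started == count even
    # if those requests are later outbid, so this fallback never fires.
    leftover = count - started
    if leftover > 0:
        aws_resume_instances(leftover)
    return started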
Updated•8 years ago
Severity: blocker → major
Comment 18•8 years ago
(In reply to Rail Aliiev [:rail] from comment #17)
> Just to clarify some things said in IRC in comment 15:
>
thanks for clarifying!
> * the main reason why we don't call aws_resume_instances is this block:
> https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/
> scripts/aws_watch_pending.py#L478-L489. "started" is always equals to
> "count" if we try to start spot instances.
I think we were hitting this block though: https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L492
e.g. https://irccloud.mozilla.com/pastebin/ZIOPjo2I
I thought subnets were an issue for a while: https://irccloud.mozilla.com/pastebin/XJu8hELv
Comment 19•8 years ago
(In reply to Jordan Lund (:jlund) from comment #18)
> (In reply to Rail Aliiev [:rail] from comment #17)
> > Just to clarify some things said in IRC in comment 15:
> >
>
> thanks for clarifying!
>
> > * the main reason why we don't call aws_resume_instances is this block:
> > https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/
> > scripts/aws_watch_pending.py#L478-L489. "started" is always equals to
> > "count" if we try to start spot instances.
>
> I think we were hitting this block though:
> https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/
> scripts/aws_watch_pending.py#L492
>
(rail explains over irc)
<rail-mtg> jlund|mtg: yup, which is set in https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py#L478-L489
11:41:00 <rail-mtg> it changes to_create_ondemand
11:41:25 <jlund|mtg> ooo
11:41:28 <rail-mtg> we never get any leftover there
11:41:58 <rail-mtg> outbids are out of band, not counted there for reals
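In other words, with toy numbers (not taken from any real run):

# Toy illustration of the "no leftover" behaviour described above: the spot
# path counts everything as started, even if the requests are outbid later,
# out of band.
count = 30                              # instances we wanted
started = 30                            # spot requests submitted and counted
to_create_ondemand = count - started    # always 0, so ondemand never launches
print(to_create_ondemand)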
Updated•8 years ago
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard