Closed Bug 952448 Opened 9 years ago Closed 9 years ago

Integration Trees closed, high number of pending linux compile jobs

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86 Linux
Type: task
Priority: Not set
Severity: blocker
Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: cbook, Unassigned)

Details

It seems we currently have a high number of pending Linux compile jobs; no Linux build has started since about 0:30am Pacific.

M-c seems fine, but the integration trees are hit hard and have therefore been closed.

Could someone look into this?
Btw, is this the same as bug 950702 again?
paged hwine at 4:46am via ircbot
Looks like aws_watch_pending can't acquire its local lock file...
killed this process:
2325      7990  7955  0 Dec19 ?        00:00:11 python aws_watch_pending.py -k secrets/aws-secrets.json -c configs/watch_pending.cfg -r us-west-2 -r us-east-1 -r us-west-1 --cached-cert-dir /home/buildduty/aws/cloud-tools/aws/secrets/certs
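(For context, this kind of single-instance guard is usually done with a non-blocking flock on a lock file. The snippet below is only a rough sketch of that pattern, not the actual aws_watch_pending.py code; the path and function name are made up.)

# Rough sketch of non-blocking lock-file acquisition (NOT the real
# aws_watch_pending.py implementation; path and names are hypothetical).
import fcntl
import os
import sys

LOCK_PATH = "/tmp/aws_watch_pending.lock"  # hypothetical path

def acquire_lock(path=LOCK_PATH):
    fh = open(path, "w")
    try:
        # LOCK_NB makes this fail immediately if another process holds it.
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        sys.stderr.write("another instance appears to be running; exiting\n")
        sys.exit(1)
    fh.write(str(os.getpid()))
    fh.flush()
    return fh  # keep the handle open for the lifetime of the process

if __name__ == "__main__":
    lock = acquire_lock()
    # ... main loop would run here while the lock is held ...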
The latest watch_pending log has tons of this:
2013-12-19 23:50:04,179 - bld-linux64 - started 0 spot instances; need 1
2013-12-19 23:50:28,017 - try-linux64 - started 13 spot instances; need 0
2013-12-19 23:51:49,513 - Cannot start bld-linux64-ec2-173
2013-12-19 23:53:24,402 - Cannot start bld-linux64-ec2-079
2013-12-19 23:54:11,005 - Cannot start bld-linux64-ec2-187
2013-12-19 23:55:32,304 - Cannot start bld-linux64-ec2-168
2013-12-19 23:56:28,579 - Cannot start bld-linux64-ec2-037
2013-12-19 23:57:21,426 - Cannot start bld-linux64-ec2-122
2013-12-19 23:58:16,726 - Cannot start bld-linux64-ec2-111
2013-12-19 23:59:32,618 - Cannot start bld-linux64-ec2-188
2013-12-20 00:00:07,584 - Cannot start bld-linux64-ec2-042
2013-12-20 00:00:32,574 - Cannot start bld-linux64-ec2-035
2013-12-20 00:02:04,013 - Cannot start bld-linux64-ec2-169
2013-12-20 00:03:33,982 - Cannot start bld-linux64-ec2-150
2013-12-20 00:04:06,664 - Cannot start bld-linux64-ec2-177
2013-12-20 00:05:51,769 - Cannot start bld-linux64-ec2-043
2013-12-20 00:06:54,840 - Cannot start bld-linux64-ec2-010
...
I tried to start instances by hand and got "insufficient capacity" for use1 a, b, c, and d and usw2 a, b, and c.
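(For reference, starting a stopped instance by hand and spotting the capacity error looks roughly like this with boto3. This is only a sketch, not the tooling actually used, and the instance ID is a placeholder.)

# Sketch: start a stopped EC2 instance and detect capacity errors.
# Not the actual releng tooling; the instance ID is a placeholder.
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def try_start(instance_id):
    try:
        ec2.start_instances(InstanceIds=[instance_id])
        print("started", instance_id)
    except ClientError as e:
        if e.response["Error"]["Code"] == "InsufficientInstanceCapacity":
            print("insufficient capacity for", instance_id)
        else:
            raise

try_start("i-0123456789abcdef0")  # placeholder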

Some ideas from IRC:
- Purchase reserved instances now -- if doing so will make machines immediately available
- Spin up usw1 capacity -- if there's availability there (see the sketch after this list)
- Maybe also create new instances of a different type -- i2.* were announced today!
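(As a rough aid for the usw1 idea above, recent spot prices per availability zone can be pulled via the EC2 API as a coarse availability signal. A boto3 sketch; the instance type is chosen arbitrarily.)

# Sketch: look at recent us-west-1 spot prices as a rough availability signal.
# boto3 is used here for illustration; the instance type is arbitrary.
import boto3
from datetime import datetime, timedelta

ec2 = boto3.client("ec2", region_name="us-west-1")
resp = ec2.describe_spot_price_history(
    InstanceTypes=["m3.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=1),
)
for price in resp["SpotPriceHistory"]:
    print(price["AvailabilityZone"], price["InstanceType"], price["SpotPrice"])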

Rail, your input here would be invaluable.
Flags: needinfo?(rail)
Per http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/i2-instances.html, i2 instances are HVM-based and all our instances are pv-grub-based (see https://bugzilla.mozilla.org/show_bug.cgi?id=944113#c19 for details).

As a result we can't easily switch to this type right now.
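(For anyone wanting to double-check, an AMI's virtualization type is exposed by the EC2 API. A boto3 sketch with a placeholder AMI ID.)

# Sketch: check whether an AMI is HVM or paravirtual (pv-grub).
# Placeholder AMI ID; only 'hvm' images can run on i2.* instances.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_images(ImageIds=["ami-00000000"])  # placeholder
for image in resp["Images"]:
    print(image["ImageId"], image["VirtualizationType"])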
Flags: needinfo?(rail)
Depends on: 952479
No longer depends on: 952479
In case it helps, try seems to be picking these jobs up, though with some delay.
So far we see no issues on our side. It looks like AWS is very busy due to the holidays: there was a blog post about a c3.* shortage a week ago (http://aws.typepad.com/aws/2013/12/c3-instance-update.html), and they are probably hitting an m3.* shortage now as well...

We'll continue to monitor and update by EoD.
I've converted bld-linux64-ec2-475 to m3.2xlarge (a faster but more expensive instance type) and successfully started it. We may be able to start more instances of this type to reduce load.
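(For the record, an instance's type can only be changed while it is stopped. Roughly, with boto3 and a placeholder instance ID -- not the actual steps or tooling used here.)

# Sketch: convert a stopped instance to m3.2xlarge, then start it again.
# Placeholder instance ID; not the actual tooling used for bld-linux64-ec2-475.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
instance_id = "i-0123456789abcdef0"  # placeholder

ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# The instance type attribute can only be modified while the instance is stopped.
ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "m3.2xlarge"},
)

ec2.start_instances(InstanceIds=[instance_id])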
Things look better now; we have all EC2 builders running.
Trees reopened at 0819 PT

Services has also seen some connectivity issues with use1, so we suspect some of this is due to AWS holiday load.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
just verified :) thanks to all involved!
Status: RESOLVED → VERIFIED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard