Trees closed, Windows builds not starting

RESOLVED FIXED

Status

Priority: --
Severity: blocker
Status: RESOLVED FIXED
Opened: 3 years ago
Updated: 6 months ago

People

(Reporter: philor, Unassigned)

(Reporter)

Description

3 years ago
Try seems to be starting its builds, but https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=d4bcfa95a869 (and every m-i push above it) has had Windows builds pending for almost four hours now. Nightlies and weekly updates (where they managed to work) got builds, so sometime after 3am things went south.

All trees other than try are closed.
(Reporter)

Updated

3 years ago
Depends on: 1246397
(Reporter)

Updated

3 years ago
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
9:12 AM <relengbot> [sns alert] Sat 06:02:43 PST aws-manager2.srv.releng.scl3.mozilla.com aws_sanity_checker.py: 0 b-2008-ec2-golden (i-bd114d38, us-east-1) Loaned to: /builds/aws_manager, in b-2008-ec2-golden, up for 0h:26m
9:12 AM <grenade> ^^ that's me. safe to ignore

...

and then:

10:11 AM <nagios-releng> Sat 07:11:28 PST [4022] aws-manager2.srv.releng.scl3.mozilla.com:File Age - /builds/aws_manager/aws_watch_pending.log is CRITICAL: FILE_AGE CRITICAL: /builds/aws_manager/aws_watch_pending.log is 640 seconds old and 4917689 bytes (http://m.mozilla.org/File+Age+-+/builds/aws_manager/aws_watch_pending.log)
10:13 AM <nagios-releng> Sat 07:13:28 PST [4023] aws-manager2.srv.releng.scl3.mozilla.com:File Age - /builds/aws_manager/aws_watch_pending.log is OK: FILE_AGE OK: /builds/aws_manager/aws_watch_pending.log is 16 seconds old and 5007243 bytes (http://m.mozilla.org/File+Age+-+/builds/aws_manager/aws_watch_pending.log)

[[[so n-i to amy and grenade for this (I have a prior *need* to be away for a few hours starting literally this minute)]]]

Turns out that I didn't actually end up n-i'ing anyone, because I got an "are you sure you meant amy" page just as I had to leave, so still CC'ing but not n-i'ing.
For posterity, from #releng (timestamps ET)

12:59 PM <philor> grenade: did that puppet failure leave us unable to create new b-2008-spot instances? we've got builds that have been pending for almost four hours
1:00 PM <grenade> philor, it shouldn't have. i'll take a look
1:03 PM <grenade> somethings not right. we've got 200 b-2008 spot instances running
1:03 PM <grenade> we normally only have a few dozen
1:04 PM <grenade> i suspect they're not taking jobs
1:07 PM <philor> !squirrel https://bugzilla.mozilla.org/show_bug.cgi?id=1246412 everything but Try is closed
1:21 PM <grenade> yes, the puppet failure caused a dud ami. i've deregistered it and am killing off spot instances that were spawned from it. new instances will take their places shortly, spawned from yesterdays known good ami.
1:37 PM <•catlee> philor: just windows?
1:38 PM <philor> catlee: yeah, https://bugzilla.mozilla.org/show_bug.cgi?id=1246397 gave us a bad b-2008 ami
1:39 PM <philor> everything else including y-2008 is happy
1:40 PM <•catlee> grenade: need any help?
1:40 PM <grenade> catlee: b-2008 should be back online shortly. the dud spots are all terminated now, just waiting for cloud tools to respawn new ones
1:41 PM <grenade> on the bright side, we know that check_ami is working ;)
1:42 PM <•catlee> :)
2:13 PM <grenade> catlee: if you're about, the aws_watch_pending log keeps saying "No slave name available for us-east-1, b-2008". do you know if it gets that info from a cache?
2:14 PM <grenade> because all the slave names should be available
2:15 PM <•catlee> it's cached per run
2:15 PM <•catlee> not between runs
2:15 PM <philor> there's also two masters which ought to be the ones for b-2008 alerting in #buildduty about not having buildbot running
2:16 PM <•catlee> grenade: do we have instances running?
2:16 PM — •catlee goes to start those masters
2:17 PM <grenade> we have only 4 spot instances running (b-2008)
2:17 PM <philor> which since the pending count just dropped by 4, must have just taken jobs
2:18 PM <grenade> i terminated everything spawned from the new ami (they were not running buildbot), normally cloud-tools would take that as a cue to respawn but it keeps repeating that theres no slave names available
2:19 PM <grenade> i'm wondering if the availability of names is cached somewhere that we can refresh
2:20 PM <•catlee> pretty sure it's not
2:21 PM <•catlee> could be because it thinks the instances are still running?
2:22 PM <grenade> perhaps. sorta expect something to realise the names are available and start allocating
2:22 PM <grenade> which i have just now seen happen
2:23 PM <grenade> \o/
2:23 PM <grenade> musta been related to the bb masters
2:23 PM <•catlee> don't think so...
2:24 PM <•catlee> it takes like 10 minutes to process
2:25 PM <•catlee> grenade: which 'No slave name available' lines are you looking at?
2:25 PM <•catlee> around 11:15?
2:26 PM <grenade> yes, i was tailing the log
2:26 PM <grenade> it went on for quite a while
2:26 PM <•catlee> because it was putting in some spot requests
2:26 PM <grenade> but its just full of requests now
2:27 PM <•catlee> it just wasn't able to get enough?
2:27 PM <•catlee> also seemed to have problems finding an IP address
2:27 PM <grenade> yeah
2:27 PM <•catlee> 2016-02-06 11:15:52,407 - No free IP available in None for subnets [u'subnet-d748dabe', u'subnet-a848dac1', u'subnet-ad48dac4', u'subnet-c74f48b3']
2:27 PM <•catlee> wonder if the old instances weren't fully gone yet
2:27 PM <grenade> maybe it took a while for the ips to be released after the instances were terminated?
2:28 PM <•catlee> could be
2:28 PM <grenade> the console only took seconds to show the instances as terminated
2:28 PM <•catlee> I suspect that's the root behind not finding a free name
2:28 PM <•catlee> names are tied to regions, and if there's no free IP in a region, it gives up
2:37 PM <philor> and we're good again
2:38 PM <grenade> :)
2:42 PM <•catlee> great!
2:42 PM <•catlee> thanks grenade!
2:42 PM <grenade> np, thank you!
2:42 PM — •catlee wanders off again
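
A rough sketch of the behaviour catlee describes above (names are tied to regions, and if no free IP can be found in the region, no name is handed out). This is not the cloud-tools code: the function names, the `b-2008-spot-NNN` names, and the data layout are all hypothetical. Only the subnet IDs are taken from the log line quoted in the chat.

```python
from typing import Dict, List, Optional, Set, Tuple


def pick_free_ip(subnet_free_ips: Dict[str, List[str]]) -> Optional[str]:
    """Return the first free IP found in any subnet, or None if none is free."""
    for subnet_id in sorted(subnet_free_ips):
        free = subnet_free_ips[subnet_id]
        if free:
            return free[0]
    return None


def allocate_name(
    all_names: List[str],
    names_in_use: Set[str],
    subnet_free_ips: Dict[str, List[str]],
) -> Optional[Tuple[str, str]]:
    """Hypothetical allocator for one region.

    Assumption: a name is only handed out together with an IP, so an
    exhausted subnet pool looks the same to the caller as "no name available".
    """
    ip = pick_free_ip(subnet_free_ips)
    if ip is None:
        # No free IP in the region: give up, even if names are unused.
        return None
    for name in all_names:
        if name not in names_in_use:
            return name, ip
    return None


# Hypothetical names; subnet IDs are the ones from the quoted log line.
names = ["b-2008-spot-001", "b-2008-spot-002"]

# Terminated instances whose IPs have not been released yet: nothing is free.
exhausted = {"subnet-d748dabe": [], "subnet-a848dac1": []}
blocked = allocate_name(names, set(), exhausted)

# Once an IP is released, the same call succeeds.
released = {"subnet-d748dabe": ["10.0.0.5"], "subnet-a848dac1": []}
allocated = allocate_name(names, set(), released)
```

Under these assumptions, `blocked` is `None` even though every name is unused, which would match the "No slave name available" messages seen while the terminated instances' IPs were still held.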

Updated

6 months ago
Product: Release Engineering → Infrastructure & Operations