Closed Bug 1305564 Opened 8 years ago Closed 7 years ago

Alert for stale AMI or failing AMI generation

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P5)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: nthomas, Assigned: aselagea)

Details

try-linux64 hasn't been refreshed since ami-3922452e (use1) on August 18. In papertrail we see a lot of entries like:

Sep 21 01:13:01 aws-manager2.srv.releng.scl3.mozilla.com try-linux64-ec2-golden: lockfile: Sorry, giving up on "/builds/aws_manager/try-linux64-ec2-golden.lock"

Lots of things we could do here:
* nagios alert on the age of lock files on aws-manager[12] (see the sketch below)
* SNS alert based on papertrail
* combine watch_pending.cfg and amis.json to find pools with a stale AMI
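
For the first option, a minimal nagios-style check could look something like the sketch below. The lockfile path pattern matches what papertrail shows; the threshold and the exact plugin wiring are hypothetical.

  #!/bin/bash
  # Hypothetical check: WARN if any golden-AMI lockfile on this host is
  # older than MAX_HOURS. Exit codes follow nagios plugin conventions
  # (1 = WARNING); the threshold is a placeholder.
  MAX_HOURS=24
  stale=$(find /builds/aws_manager -maxdepth 1 -name '*-golden.lock' \
            -mmin +$((MAX_HOURS * 60)) 2>/dev/null)
  if [ -n "$stale" ]; then
    echo "WARNING: stale golden lockfile(s): $stale"
    exit 1
  fi
  echo "OK: no stale golden lockfiles"
  exit 0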
Assignee: nobody → aselagea
It looks like papertrail is reporting 2 lockfile errors per night for each of the golden AMI generation crons. Not sure why, since there shouldn't be a lockfile kicking around at all when the jobs kick off.

https://papertrailapp.com/groups/1390904/events?q=%22lockfile%3A+Sorry%2C+giving+up+on%22+golden

I notice that we don't specify any locktimeout value to lockfile in these AMI generation cron scripts, though. Maybe we should add -l 172800 for a timeout of a couple of days? That would at least ensure that we'd wipe out old, stale lockfiles. If the script tried to create an instance while one was still up, we should see an SNS alert about the golden instance IP already being in use.
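
A minimal sketch of what that could look like in the golden wrapper scripts; only the -l flag is the actual proposal here, the surrounding invocation is illustrative:

  # illustrative wrapper logic; lockfile(1) force-removes a lock older than
  # the -l locktimeout, so a leftover lock can block the cron job for at
  # most ~2 days
  LOCK=/builds/aws_manager/try-linux64-ec2-golden.lock
  lockfile -l 172800 "$LOCK" || exit 1    # 172800 s = 48 h
  # ... golden AMI generation runs here ...
  rm -f "$LOCK"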
Looks like we are running the script three times, and two fail on the lockfile of the winner:

Oct 17 01:45:01 aws-manager2.srv.releng.scl3.mozilla.com CROND: (buildduty) CMD (/builds/aws_manager/bin/aws_manager-tst-emulator64-ec2-golden.sh 2>&1 | logger -t 'tst-emulator64-ec2-golden')
Oct 17 01:45:01 aws-manager2.srv.releng.scl3.mozilla.com CROND: (buildduty) CMD (/builds/aws_manager/bin/aws_manager-tst-emulator64-ec2-golden.sh 2>&1 | logger -t 'tst-emulator64-ec2-golden')
Oct 17 01:45:01 aws-manager2.srv.releng.scl3.mozilla.com CROND: (buildduty) CMD (/builds/aws_manager/bin/aws_manager-tst-emulator64-ec2-golden.sh 2>&1 | logger -t 'tst-emulator64-ec2-golden')
Oct 17 01:45:02 aws-manager2.srv.releng.scl3.mozilla.com tst-emulator64-ec2-golden: 2016-10-17 01:45:02,675 - INFO - Sanity checking DNS entries...
Oct 17 01:45:02 aws-manager2.srv.releng.scl3.mozilla.com tst-emulator64-ec2-golden: 2016-10-17 01:45:02,676 - INFO - Checking name conflicts for tst-emulator64-ec2-golden
...
Oct 17 01:46:03 aws-manager2.srv.releng.scl3.mozilla.com tst-emulator64-ec2-golden: 2016-10-17 01:46:04,056 - WARNING - cannot connect; instance may still be starting  tst-emulator64-ec2-golden.test.releng.use1.mozilla.com (i-0a16e2d5bc88cb10e, 10.134.48.124) - Timed out trying to connect to 10.134.48.124 (tried 1 time),retrying in 1200 sec ...
Oct 17 01:48:02 aws-manager2.srv.releng.scl3.mozilla.com tst-emulator64-ec2-golden: lockfile: Sorry, giving up on "/builds/aws_manager/tst-emulator64-ec2-golden.lock"
Oct 17 01:48:02 aws-manager2.srv.releng.scl3.mozilla.com tst-emulator64-ec2-golden: lockfile: Sorry, giving up on "/builds/aws_manager/tst-emulator64-ec2-golden.lock"

There's nothing relevant in aws-manager2:/var/spool/cron, and only /etc/cron.d/aws_manager-tst-emulator64-ec2-golden.cron exists, so crond itself seems confused?
... I have no idea how the machine got into this state, but there were 3 copies of crond running. I did note that running /etc/init.d/crond restart didn't actually kill off the old instances, so maybe someone had done that in the past. I've killed off all the old instances and restarted cron, so hopefully that will clear this up (and maybe some other flakiness this machine might have been exhibiting because of the multiple cron daemons).

root      1572     1  0 Jul15 ?        00:01:39 crond
root     28588     1  0 Aug16 ?        00:01:07 crond
root     29939     1  0 Oct07 ?        00:00:11 crond

Assuming that does fix things, we can add in a papertrail alert looking for golden lockfiles.
Looks like killing the multiple cronds cleared up the lockfile reporting issue.
nthomas: do you want to add more checks to this, or is looking for a stale lockfile sufficient? I suspect we should add more based on what watch_pending is actually deploying.
The SNS alert is a great start, but I'd like to do more. We could modify watch_pending to check the moz-created tag of the AMI it's launching, emit a log line if it's over some threshold, then SNS alert on that via papertrail.  Alternatively, we could check independently by finding the set of pools used in the buildermap of
  https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/watch_pending.cfg
and then check in 
  https://s3.amazonaws.com/mozilla-releng-amis/amis.json
for the most recent AMI of each type, e.g. by using moz-type and moz-created in the tags block.

Either way, it's probably worth setting the age threshold long enough that we don't get spammed if we need to delete a bad AMI and fall back to the previous one.
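
A rough sketch of the independent check; the layout of amis.json and the format of moz-created are assumptions here, and the cross-check against the pools in watch_pending.cfg's buildermap is left out for brevity:

  #!/bin/bash
  # Assumes amis.json is a JSON array where each entry exposes its
  # moz-type and moz-created tags; adjust the jq paths to the real layout.
  THRESHOLD_DAYS=14
  NOW=$(date +%s)
  curl -sf https://s3.amazonaws.com/mozilla-releng-amis/amis.json |
    jq -r '.[] | [.tags["moz-type"], .tags["moz-created"]] | @tsv' |
    while IFS=$'\t' read -r pool created; do
      # assumes moz-created is something GNU date can parse (e.g. ISO 8601)
      age_days=$(( (NOW - $(date -d "$created" +%s)) / 86400 ))
      printf '%s\t%s\n' "$pool" "$age_days"
    done |
    sort -k1,1 -k2,2n | awk '!seen[$1]++' |   # keep the newest AMI per pool
    awk -F'\t' -v t="$THRESHOLD_DAYS" \
      '$2 > t { print "stale AMI for " $1 ": " $2 " days old" }'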
A note on the priority level: P5 doesn't mean we've lowered the priority; quite the contrary. We're aligning these levels with the buildduty quarterly deliverables, where P1-P3 are taken by our daily waterline KTLO operational tasks.
Priority: -- → P5
The need for this goes away as we transition to taskcluster, which has different AMI generation mechanisms.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard