Closed Bug 1305564 Opened 8 years ago Closed 7 years ago

Alert for stale AMI or failing AMI generation

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P5)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: nthomas, Assigned: aselagea)

Details

try-linux64 hasn't been refreshed since ami-3922452e (use1) on August 18. In papertrail we see a lot of entries like:

Sep 21 01:13:01 aws-manager2.srv.releng.scl3.mozilla.com try-linux64-ec2-golden: lockfile: Sorry, giving up on "/builds/aws_manager/try-linux64-ec2-golden.lock"

Lots of things we could do here:
* nagios alert on the age of lock files on aws-manager[12] (see the sketch below)
* SNS alert based on papertrail
* combine watch_pending.cfg and amis.json to find pools with a stale AMI
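
For the first option, a minimal nagios-style check could look something like the sketch below. The lockfile path pattern matches what papertrail shows; the threshold and the exact plugin wiring are hypothetical.

  #!/bin/bash
  # Hypothetical check: WARN if any golden-AMI lockfile on this host is
  # older than MAX_HOURS. Exit codes follow nagios plugin conventions
  # (1 = WARNING); the threshold is a placeholder.
  MAX_HOURS=24
  stale=$(find /builds/aws_manager -maxdepth 1 -name '*-golden.lock' \
            -mmin +$((MAX_HOURS * 60)) 2>/dev/null)
  if [ -n "$stale" ]; then
    echo "WARNING: stale golden lockfile(s): $stale"
    exit 1
  fi
  echo "OK: no stale golden lockfiles"
  exit 0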
Assignee: nobody → aselagea
It looks like papertrail is reporting 2 lockfile errors per night for each of the golden AMI generation crons. Not sure why, since there shouldn't be a lockfile kicking around at all when the jobs kick off.

https://papertrailapp.com/groups/1390904/events?q=%22lockfile%3A+Sorry%2C+giving+up+on%22+golden

I notice that we don't specify any locktimeout value to lockfile in these AMI generation cron scripts, though. Maybe we should add -l 172800 for a timeout of a couple of days? That would at least ensure that we'd wipe out old, stale lockfiles. If the script tried to create an instance while one was still up, we should see an SNS alert about the golden instance IP already being in use.
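
A minimal sketch of what that could look like in the golden wrapper scripts; only the -l flag is the actual proposal here, the surrounding invocation is illustrative:

  # illustrative wrapper logic; lockfile(1) force-removes a lock older than
  # the -l locktimeout, so a leftover lock can block the cron job for at
  # most ~2 days
  LOCK=/builds/aws_manager/try-linux64-ec2-golden.lock
  lockfile -l 172800 "$LOCK" || exit 1    # 172800 s = 48 h
  # ... golden AMI generation runs here ...
  rm -f "$LOCK"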
Looks like we are running the script three times, and two fail on the lockfile of the winner:

Oct 17 01:45:01 aws-manager2.srv.releng.scl3.mozilla.com CROND: (buildduty) CMD (/builds/aws_manager/bin/aws_manager-tst-emulator64-ec2-golden.sh 2>&1 | logger -t 'tst-emulator64-ec2-golden')
Oct 17 01:45:01 aws-manager2.srv.releng.scl3.mozilla.com CROND: (buildduty) CMD (/builds/aws_manager/bin/aws_manager-tst-emulator64-ec2-golden.sh 2>&1 | logger -t 'tst-emulator64-ec2-golden')
Oct 17 01:45:01 aws-manager2.srv.releng.scl3.mozilla.com CROND: (buildduty) CMD (/builds/aws_manager/bin/aws_manager-tst-emulator64-ec2-golden.sh 2>&1 | logger -t 'tst-emulator64-ec2-golden')
Oct 17 01:45:02 aws-manager2.srv.releng.scl3.mozilla.com tst-emulator64-ec2-golden: 2016-10-17 01:45:02,675 - INFO - Sanity checking DNS entries...
Oct 17 01:45:02 aws-manager2.srv.releng.scl3.mozilla.com tst-emulator64-ec2-golden: 2016-10-17 01:45:02,676 - INFO - Checking name conflicts for tst-emulator64-ec2-golden
...
Oct 17 01:46:03 aws-manager2.srv.releng.scl3.mozilla.com tst-emulator64-ec2-golden: 2016-10-17 01:46:04,056 - WARNING - cannot connect; instance may still be starting  tst-emulator64-ec2-golden.test.releng.use1.mozilla.com (i-0a16e2d5bc88cb10e, 10.134.48.124) - Timed out trying to connect to 10.134.48.124 (tried 1 time),retrying in 1200 sec ...
Oct 17 01:48:02 aws-manager2.srv.releng.scl3.mozilla.com tst-emulator64-ec2-golden: lockfile: Sorry, giving up on "/builds/aws_manager/tst-emulator64-ec2-golden.lock"
Oct 17 01:48:02 aws-manager2.srv.releng.scl3.mozilla.com tst-emulator64-ec2-golden: lockfile: Sorry, giving up on "/builds/aws_manager/tst-emulator64-ec2-golden.lock"

There's nothing relevant in aws-manager2:/var/spool/cron, and only /etc/cron.d/aws_manager-tst-emulator64-ec2-golden.cron exists, so crond itself seems confused?
... I have no idea how the machine got into this state, but there were 3 copies of crond running. I did note that running /etc/init.d/crond restart didn't actually kill off the old instances, so maybe someone had done that in the past. I've killed off all the old instances and restarted cron, so hopefully that will clear this up (and maybe some other flakiness this machine might have been exhibiting because of the multiple cron daemons).

root      1572     1  0 Jul15 ?        00:01:39 crond
root     28588     1  0 Aug16 ?        00:01:07 crond
root     29939     1  0 Oct07 ?        00:00:11 crond

Assuming that does fix things, we can add in a papertrail alert looking for golden lockfiles.
Looks like killing the multiple cronds cleared up the lockfile reporting issue.
nthomas: do you want to add more checks to this, or is looking for a stale lockfile sufficient? I suspect we should add more based on what watch_pending is actually deploying.
The SNS alert is a great start, but I'd like to do more. We could modify watch_pending to check the moz-created tag of the AMI it's launching, emit a log line if it's over some threshold, then SNS alert on that via papertrail.  Alternatively, we could check independently by finding the set of pools used in the buildermap of
  https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/watch_pending.cfg
and then check in 
  https://s3.amazonaws.com/mozilla-releng-amis/amis.json
for the most recent AMI of each type, e.g. by using moz-type and moz-created in the tags block.

Either way, it's probably worth setting the age threshold long enough that we don't get spammed if we need to delete a bad AMI and fall back to the previous one.
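
A rough sketch of the independent check; the layout of amis.json and the format of moz-created are assumptions here, and the cross-check against the pools in watch_pending.cfg's buildermap is left out for brevity:

  #!/bin/bash
  # Assumes amis.json is a JSON array where each entry exposes its
  # moz-type and moz-created tags; adjust the jq paths to the real layout.
  THRESHOLD_DAYS=14
  NOW=$(date +%s)
  curl -sf https://s3.amazonaws.com/mozilla-releng-amis/amis.json |
    jq -r '.[] | [.tags["moz-type"], .tags["moz-created"]] | @tsv' |
    while IFS=$'\t' read -r pool created; do
      # assumes moz-created is something GNU date can parse (e.g. ISO 8601)
      age_days=$(( (NOW - $(date -d "$created" +%s)) / 86400 ))
      printf '%s\t%s\n' "$pool" "$age_days"
    done |
    sort -k1,1 -k2,2n | awk '!seen[$1]++' |   # keep the newest AMI per pool
    awk -F'\t' -v t="$THRESHOLD_DAYS" \
      '$2 > t { print "stale AMI for " $1 ": " $2 " days old" }'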
A note on the priority level: P5 doesn't mean we've lowered the priority; quite the contrary. We're aligning these levels with the buildduty quarterly deliverables, where P1-P3 are taken by our daily waterline KTLO operational tasks.
Priority: -- → P5
The need for this goes away as we transition to taskcluster, which has different AMI generation mechanisms.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard