Closed Bug 1428344 Opened 6 years ago Closed 6 years ago

ec2-golden Critical Warnings

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: u604971, Unassigned)

Details

While monitoring the #buildduty IRC channel, this alert came up. Note that Nagios has reported it a couple of times; here is the most recent occurrence.

Fri 12:32:33 UTC [7344] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 5 crit, 0 warn out of 5 processes with args 'ec2-golden' (http://m.mozilla.org/procs+age+-+golden+AMI)
Could you please take a look at this?
Thank you.
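
For reference, the alert text reads like a process-age check: it counts processes whose arguments contain 'ec2-golden' and flags those that have been running too long. A rough sketch of that logic follows. The helper name `count_old_procs` and the 3600-second limit are illustrative assumptions, not the actual Nagios configuration, and the input is expected as "PID ELAPSED_SECONDS ARGS" lines (for example from `ps -eo pid,etimes,args` on a procps version that supports `etimes`, which the procps 3.2.8 on this host may not).

```shell
#!/bin/sh
# Hypothetical helper (not part of the Nagios setup): count processes whose
# command line contains MARKER and whose elapsed time exceeds LIMIT seconds.
# Reads "PID ELAPSED_SECONDS ARGS..." lines on stdin.
count_old_procs() {
    marker=$1
    limit=$2
    awk -v m="$marker" -v limit="$limit" '
        # Only consider data rows (numeric PID) that mention the marker.
        $1 ~ /^[0-9]+$/ && index($0, m) {
            # Ignore grep/awk helpers that only mention the marker as an argument.
            n = split($3, parts, "/")
            if (parts[n] == "grep" || parts[n] == "awk") next
            total++
            if ($2 + 0 > limit + 0) old++
        }
        END { printf "%d over limit out of %d\n", old, total }
    '
}

# Example (the limit of 3600 seconds is an assumed value):
# ps -eo pid,etimes,args | count_old_procs ec2-golden 3600
```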
Flags: needinfo?(jlund)
Thanks Danut!

grenade terminated the bad golden instance.

I then did some cleanup. Let's see if the next AMI generation works.

[root@aws-manager2.srv.releng.scl3.mozilla.com ~]# rm /builds/aws_manager/b-2008-ec2-golden.lock
rm: remove regular file `/builds/aws_manager/b-2008-ec2-golden.lock'? y
[root@aws-manager2.srv.releng.scl3.mozilla.com ~]# ps aux | grep ami
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
501      14438  0.0  0.6 249680 27380 ?        S    Jan02   0:00 /builds/aws_manager/bin/python aws_create_instance.py -c /builds/aws_manager/cloud-tools/configs/b-2008 -r us-east-1 -s aws-releng -k /builds/aws_manager/secrets/aws-secrets.json --ssh-key /home/buildduty/.ssh/aws-ssh-key -i /builds/aws_manager/cloud-tools/instance_data/us-east-1.instance_data_prod.json --create-ami --ignore-subnet-check --copy-to-region us-west-2 b-2008-ec2-golden
501      14487  0.2  0.6 249952 24928 ?        S    Jan02  17:51 /builds/aws_manager/bin/python aws_create_instance.py -c /builds/aws_manager/cloud-tools/configs/b-2008 -r us-east-1 -s aws-releng -k /builds/aws_manager/secrets/aws-secrets.json --ssh-key /home/buildduty/.ssh/aws-ssh-key -i /builds/aws_manager/cloud-tools/instance_data/us-east-1.instance_data_prod.json --create-ami --ignore-subnet-check --copy-to-region us-west-2 b-2008-ec2-golden
root     24082  0.0  0.0 103252   840 pts/1    S+   09:38   0:00 grep ami
[root@aws-manager2.srv.releng.scl3.mozilla.com ~]# kill 14438
[root@aws-manager2.srv.releng.scl3.mozilla.com ~]# kill 14487
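
The manual steps above (list the leftover processes, then kill each PID) could be sketched as a small helper. This is only an illustration: the function name `pids_matching` is made up, and it reads "PID ARGS" lines on stdin so it can be checked against a saved listing before anything is signalled. Using `ps -eo pid,args` also avoids the "bad syntax" warning that comes from mixing option styles, and skipping `grep`/`awk` rows avoids matching the search command itself.

```shell
#!/bin/sh
# Hypothetical helper: print the PIDs of processes whose command line
# contains MARKER. Reads "PID ARGS..." lines on stdin.
pids_matching() {
    marker=$1
    awk -v m="$marker" '
        # Skip the ps header (non-numeric first column) and unrelated rows.
        $1 ~ /^[0-9]+$/ && index($0, m) {
            # Skip grep/awk rows, which only mention the marker as an argument.
            n = split($2, parts, "/")
            if (parts[n] != "grep" && parts[n] != "awk") print $1
        }
    '
}

# Dry run: show which PIDs would be affected.
# ps -eo pid,args | pids_matching ec2-golden
#
# Terminate them with the default SIGTERM, as plain `kill` does.
# `xargs -r` (skip when input is empty) is a GNU extension.
# ps -eo pid,args | pids_matching ec2-golden | xargs -r kill
```

Running the dry-run line first and comparing its output with the `ps` listing is a cheap way to confirm that only the stuck `aws_create_instance.py` processes are selected.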

We can close the bug if the Nagios alert in comment 0 goes green.
Flags: needinfo?(jlund)
nagios-releng> Sat 18:32:32 UTC [7480] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is OK: ELAPSED OK: 0 processes with args 'ec2-golden' (http://m.mozilla.org/procs+age+-+golden+AMI)

\o/
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
This alert came up again in the #buildduty channel today.
 
Fri 18:19:24 UTC [7980] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 3 crit, 0 warn out of 3 processes with args 'ec2-golden' (http://m.mozilla.org/procs+age+-+golden+AMI)

[root@aws-manager2.srv.releng.scl3.mozilla.com ~]# ps -aux | grep ami 
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
501       9363  0.0  0.7 254568 28092 pts/2    S+   04:40   0:00 /builds/aws_manager/bin/python aws_create_instance.py -c /builds/aws_manager/cloud-tools/configs/y-2008 -r us-east-1 -s aws-releng -k /builds/aws_manager/secrets/aws-secrets.json --ssh-key /home/buildduty/.ssh/aws-ssh-key -i /builds/aws_manager/cloud-tools/instance_data/us-east-1.instance_data_try.json --create-ami --ignore-subnet-check --copy-to-region us-west-2 y-2008-ec2-golden
501       9371  0.3  0.6 254312 25668 pts/2    S+   04:40   1:24 /builds/aws_manager/bin/python aws_create_instance.py -c /builds/aws_manager/cloud-tools/configs/y-2008 -r us-east-1 -s aws-releng -k /builds/aws_manager/secrets/aws-secrets.json --ssh-key /home/buildduty/.ssh/aws-ssh-key -i /builds/aws_manager/cloud-tools/instance_data/us-east-1.instance_data_try.json --create-ami --ignore-subnet-check --copy-to-region us-west-2 y-2008-ec2-golden
root     20170  0.0  0.0 103272   840 pts/0    S+   11:27   0:00 grep ami
[root@aws-manager2.srv.releng.scl3.mozilla.com ~]# kill 9363 
[root@aws-manager2.srv.releng.scl3.mozilla.com ~]# kill 9371

The .lock file was already removed.

[root@aws-manager2.srv.releng.scl3.mozilla.com ~]# cd /builds/aws_manager/
[root@aws-manager2.srv.releng.scl3.mozilla.com aws_manager]# ls | grep lock
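
As a variation on the `ls | grep lock` check, leftover lock files could be listed by age, so that a lock belonging to a run still in progress is not mistaken for a stale one. This is a sketch under assumptions: the helper name `stale_locks` is hypothetical, the `*.lock` pattern is inferred from the `b-2008-ec2-golden.lock` file seen earlier, and the 60-minute threshold in the example is arbitrary. `-maxdepth` and `-mmin` are find extensions available in GNU findutils.

```shell
#!/bin/sh
# Hypothetical helper: list *.lock files directly under DIR that were last
# modified more than MINUTES minutes ago.
stale_locks() {
    dir=$1
    minutes=$2
    find "$dir" -maxdepth 1 -type f -name '*.lock' -mmin +"$minutes"
}

# Example (the 60-minute threshold is an assumed value):
# stale_locks /builds/aws_manager 60
```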

I rescheduled the next check of this host from the Nagios UI, and the alert cleared:

Fri 18:32:53 UTC [7990] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is OK: ELAPSED OK: 0 processes with args 'ec2-golden' (http://m.mozilla.org/procs+age+-+golden+AMI)
The problem reappeared:

Sun 17:36:52 UTC [7028] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 5 crit, 0 warn out of 9 processes with args 'ec2-golden' (http://m.mozilla.org/procs+age+-+golden+AMI)

I repeated the same steps, and the problem seems to be fixed for now.

Sun 18:10:39 UTC [7030] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is OK: ELAPSED OK: 1 process with args 'ec2-golden' (http://m.mozilla.org/procs+age+-+golden+AMI)
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard