Closed
Bug 1295995
Opened 8 years ago
Closed 8 years ago
Golden AMI generation gets stuck often lately
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: aselagea, Assigned: dividehex)
Details
Attachments
(2 files)
This has been happening to 'try-linux64-ec2-golden' during the past two weeks, but other AMIs could encounter similar issues during the generation process. In #buildduty we received the following alert:

<nagios-releng> Wed 07:02:22 PDT [4083] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 5 crit, 0 warn out of 5 processes with args ec2-golden (http://m.mozilla.org/procs+age+-+golden+AMI)

The AMI generation process seems to get stuck at the puppetization step:

Certificate request for try-linux64-ec2-golden.try.releng.use1.mozilla.com
Got incorrect certificates (!?)
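For reference, the Nagios "procs age" check flags processes that have been running longer than a threshold. A minimal manual equivalent on the aws-manager host, assuming a 6-hour threshold (the real check_procs arguments may differ):

# List ec2-golden processes older than the assumed 6-hour threshold.
THRESHOLD=$((6 * 3600))
ps -eo etimes=,args= | awk -v max="$THRESHOLD" \
    '/ec2-golden/ && $1 > max { print "STUCK (" $1 "s): " $0 }'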
Comment 1•8 years ago
Hitting this now. I am assuming this is not related to the Tree Closure Window.

13:15:25 <nagios-releng> Sat 12:15:29 PDT [4747] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 5 crit, 0 warn out of 5 processes with args ec2-golden (http://m.mozilla.org/procs+age+-+golden+AMI)
Comment 2•8 years ago
Could be related; from aws_sanity_checker.log:

Aug 20 06:01:40 aws-manager2.srv.releng.scl3.mozilla.com aws_sanity_checker.py: 70 t-w732-ec2-golden (i-0ade5712e498ea036, us-east-1) Unknown type: 't-w732'
Aug 20 06:01:40 aws-manager2.srv.releng.scl3.mozilla.com aws_sanity_checker.py: 1 try-linux64-ec2-golden (i-04c7936f2be4329b5, us-east-1) Unknown state: 'pending'
Reporter
Comment 3•8 years ago
The latest occurrence of the issue for 'try-linux64-ec2-golden' is from this weekend (Oct 9). As mentioned in the description, the AMI generation fails at the puppetization step:

Certificate request for try-linux64-ec2-golden.try.releng.use1.mozilla.com
Got incorrect certificates (!?)

Attached are the e-mail contents generated for:
- try-linux64-ec2-golden: failed
- bld-linux64-ec2-golden: successful
Reporter
Comment 4•8 years ago
Reporter
Comment 5•8 years ago
Reporter
Comment 6•8 years ago
Taking a look at the end of the log for try-linux64-ec2-golden:

+ openssl genrsa -out /var/lib/puppetmaster/ssl/tmp/try-linux64-ec2-golden.try.releng.use1.mozilla.com-C1d0bg.key 2048
+ openssl req -subj /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com -new -key /var/lib/puppetmaster/ssl/tmp/try-linux64-ec2-golden.try.releng.use1.mozilla.com-C1d0bg.key
+ openssl ca -batch -config /var/lib/puppetmaster/ssl/ca/openssl.conf -extensions master_ca_exts -in /dev/stdin -notext -out /var/lib/puppetmaster/ssl/tmp/try-linux64-ec2-golden.try.releng.use1.mozilla.com.crt
+ rm -f /var/lib/puppetmaster/ssl/ca/lock
+ exit

While for bld-linux64-ec2-golden we have the following:

+ openssl genrsa -out /var/lib/puppetmaster/ssl/tmp/bld-linux64-ec2-golden.build.releng.use1.mozilla.com-S40dSn.key 2048
+ openssl req -subj /CN=bld-linux64-ec2-golden.build.releng.use1.mozilla.com -new -key /var/lib/puppetmaster/ssl/tmp/bld-linux64-ec2-golden.build.releng.use1.mozilla.com-S40dSn.key
+ openssl ca -batch -config /var/lib/puppetmaster/ssl/ca/openssl.conf -extensions master_ca_exts -in /dev/stdin -notext -out /var/lib/puppetmaster/ssl/tmp/bld-linux64-ec2-golden.build.releng.use1.mozilla.com.crt
+ add_file_to_git /var/lib/puppetmaster/ssl/tmp/bld-linux64-ec2-golden.build.releng.use1.mozilla.com.crt agent-certs/releng-puppet2.srv.releng.scl3.mozilla.com/bld-linux64-ec2-golden.build.releng.use1.mozilla.com.crt 'add agent cert for bld-linux64-ec2-golden.build.releng.use1.mozilla.com'
...
+ rm -f /var/lib/puppetmaster/ssl/ca/lock
+ exit

The openssl commands correspond to https://dxr.mozilla.org/build-central/source/puppet/modules/puppetmaster/templates/ssl_common.sh.erb#56

As can be seen, those commands are preceded by a "trap" statement, which removes the lockfile at exit. That being said, I think one of the last two openssl commands fails and the script jumps to the trap handler before exiting: in the failing run, the add_file_to_git step never executes.

@Dustin: I recall :catlee mentioning you've touched this code before, so I was wondering if you have any thoughts on this.
Flags: needinfo?(dustin)
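For context, here is a minimal sketch of the lock/trap pattern described in comment 6. The paths and openssl invocations are taken from the log above; set -e, the variable names, and the exact trap body are assumptions, not copied from ssl_common.sh.erb:

#!/bin/bash
set -e
FQDN=try-linux64-ec2-golden.try.releng.use1.mozilla.com
TMPKEY=/var/lib/puppetmaster/ssl/tmp/$FQDN.key
TMPCRT=/var/lib/puppetmaster/ssl/tmp/$FQDN.crt
CACONF=/var/lib/puppetmaster/ssl/ca/openssl.conf

# Remove the CA lockfile on ANY exit, successful or not.
trap 'rm -f /var/lib/puppetmaster/ssl/ca/lock; exit' EXIT

openssl genrsa -out "$TMPKEY" 2048
openssl req -subj "/CN=$FQDN" -new -key "$TMPKEY" |
    openssl ca -batch -config "$CACONF" -extensions master_ca_exts \
        -in /dev/stdin -notext -out "$TMPCRT"
# add_file_to_git "$TMPCRT" ...   (never reached in the failing run)

# Under set -e, a failing `openssl ca` aborts the script right there;
# the EXIT trap then fires, which is why the failing log ends with only
# "+ rm -f /var/lib/puppetmaster/ssl/ca/lock" and "+ exit".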
Comment 7•8 years ago
Yes, that sounds about right. Typically these kinds of errors have occurred because there's some other issue with the certificate store, which is why the `openssl ca` command refused to sign the certificate. The lack of output is frustrating, but that's OpenSSL. Jake or possibly Rob should be able to help out.
Flags: needinfo?(rthijssen)
Flags: needinfo?(jwatkins)
Flags: needinfo?(dustin)
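One concrete way `openssl ca` refuses to sign with essentially no output, consistent with what comment 10 later found: if the CA database enforces unique_subject = yes (an assumption here), a leftover valid entry for the same CN makes signing fail. A quick check against the CA database, treating inventory.txt as the OpenSSL index file:

# Print any still-valid ("V") entries for the host; such an entry can
# block `openssl ca` from signing a new cert for the same subject.
grep try-linux64-ec2-golden /var/lib/puppetmaster/ssl/ca/inventory.txt |
    awk '$1 == "V"'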
Assignee
Comment 9•8 years ago
It looks like the failure only takes place when the certificate is requested from releng-puppet2.srv.releng.scl3.mozilla.com. All other puppetmasters generate and deliver successfully. I'll look further into this and see if I can dig up some errors in the logs.
Flags: needinfo?(jwatkins)
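A possible way to narrow this down, sketched under the assumption that each master keeps its issued agent certs under the git mirror path that comes up in comment 10 (the CA cert path below is a placeholder):

# Inspect the cert releng-puppet2 holds for the instance.
CERT=/var/lib/puppetmaster/ssl/git/agent-certs/releng-puppet2.srv.releng.scl3.mozilla.com/try-linux64-ec2-golden.try.releng.use1.mozilla.com.crt
openssl x509 -in "$CERT" -noout -subject -issuer -dates -fingerprint

# Confirm it actually chains to the CA in use (CA cert path assumed).
openssl verify -CAfile /var/lib/puppetmaster/ssl/ca/ca.crt "$CERT"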
Assignee
Updated•8 years ago
Assignee: nobody → jwatkins
Assignee
Comment 10•8 years ago
It looks like this was caused by a certificate that went missing. I'm not really sure why, but maybe there is a race condition in the revocation or generation process.

[root@releng-puppet2.srv.releng.scl3.mozilla.com ca]# grep "try-linux64-ec2-golden" inventory.txt
<trimmed>
R 210723083356Z 160724083629Z 3455 unknown /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com
V 210723083755Z 3456 unknown /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com

Since the cert is missing from /var/lib/puppetmaster/ssl/git/agent-certs/releng-puppet2.srv.releng.scl3.mozilla.com, I was able to simply revoke it by its .pem file in /var/lib/puppetmaster/ssl/ca/certs:

[root@releng-puppet2.srv.releng.scl3.mozilla.com ca]# openssl ca -config /var/lib/puppetmaster/ssl/ca/openssl.conf -revoke certs/3456.pem
Using configuration from /var/lib/puppetmaster/ssl/ca/openssl.conf
Revoking Certificate 3456.
Data Base Updated

[root@releng-puppet2.srv.releng.scl3.mozilla.com ca]# grep "try-linux64-ec2-golden" inventory.txt
<trimmed>
R 210723083356Z 160724083629Z 3455 unknown /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com
R 210723083755Z 161017200147Z 3456 unknown /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com
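For readers unfamiliar with the layout: the inventory.txt lines above follow OpenSSL's CA index format. The columns are status (V=valid, R=revoked, E=expired), expiry date, revocation date (present only on revoked entries), serial number, certificate filename ("unknown" here), and subject DN. A hedged sketch for spotting this "valid in the database but missing on disk" situation again, using the mirror path from this comment:

MIRROR=/var/lib/puppetmaster/ssl/git/agent-certs/releng-puppet2.srv.releng.scl3.mozilla.com
# For every entry still marked valid, check that the mirrored .crt exists.
awk '$1 == "V"' /var/lib/puppetmaster/ssl/ca/inventory.txt |
while read -r _ _ serial _ subj; do
    cn=${subj#/CN=}
    [ -e "$MIRROR/$cn.crt" ] || echo "valid in DB, missing on disk: $serial $cn"
done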
Assignee
Comment 11•8 years ago
Looking at last night's cron mail, a cert was successfully generated on releng-puppet2 for try-linux64-ec2-golden.try.releng.use1.mozilla.com. I still don't know what the root cause is, but at least I know what to look for in future cases. At the very least, changing the cert generation email (and the underlying script) to indicate failure could be a quick and easy fix.
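A minimal sketch of that "indicate failure" idea; the variable names and mail wiring are illustrative assumptions, not the actual change:

#!/bin/bash
set -euo pipefail
MAILTO=buildduty@example.com   # placeholder address
FQDN=try-linux64-ec2-golden.try.releng.use1.mozilla.com

# Put an unmissable FAILED marker in the cron mail subject instead of
# burying the error somewhere in the xtrace output.
trap 'echo "cert generation for $FQDN failed near line $LINENO" |
      mail -s "FAILED: golden AMI cert generation ($FQDN)" "$MAILTO"' ERR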
Assignee
Comment 12•8 years ago
Filed bug 1312851 to get better visibility for the next time this happens.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard