This has been happening for 'try-linux64-ec2-golden' during the past two weeks, but other AMIs could encounter similar issues during the generation process. In #buildduty we received the following alert:

<nagios-releng> Wed 07:02:22 PDT  aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 5 crit, 0 warn out of 5 processes with args ec2-golden (http://m.mozilla.org/procs+age+-+golden+AMI)

The AMI generation process seems to get stuck at the puppetization step:

Certificate request for try-linux64-ec2-golden.try.releng.use1.mozilla.com
Got incorrect certificates (!?)
Hitting this now. I am assuming this is not Tree Closure Window related.

13:15:25 <nagios-releng> Sat 12:15:29 PDT  aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 5 crit, 0 warn out of 5 processes with args ec2-golden (http://m.mozilla.org/procs+age+-+golden+AMI)
Could be related, from aws_sanity_checker.log:

Aug 20 06:01:40 aws-manager2.srv.releng.scl3.mozilla.com aws_sanity_checker.py: 70 t-w732-ec2-golden (i-0ade5712e498ea036, us-east-1) Unknown type: 't-w732'
Aug 20 06:01:40 aws-manager2.srv.releng.scl3.mozilla.com aws_sanity_checker.py: 1 try-linux64-ec2-golden (i-04c7936f2be4329b5, us-east-1) Unknown state: 'pending'
The latest occurrence of the issue for 'try-linux64-ec2-golden' is from this weekend (Oct 9). As mentioned in the description, the AMI generation fails at the puppetization step:

"Certificate request for try-linux64-ec2-golden.try.releng.use1.mozilla.com
Got incorrect certificates (!?"

Attached are the e-mail contents generated for:
- try-linux64-ec2-golden: failed
- bld-linux64-ec2-golden: successful
Taking a look at the end of the log for try-linux64-ec2-golden:

+ openssl genrsa -out /var/lib/puppetmaster/ssl/tmp/try-linux64-ec2-golden.try.releng.use1.mozilla.com-C1d0bg.key 2048
+ openssl req -subj /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com -new -key /var/lib/puppetmaster/ssl/tmp/try-linux64-ec2-golden.try.releng.use1.mozilla.com-C1d0bg.key
+ openssl ca -batch -config /var/lib/puppetmaster/ssl/ca/openssl.conf -extensions master_ca_exts -in /dev/stdin -notext -out /var/lib/puppetmaster/ssl/tmp/try-linux64-ec2-golden.try.releng.use1.mozilla.com.crt
+ rm -f /var/lib/puppetmaster/ssl/ca/lock
+ exit

While for bld-linux64-ec2-golden we have the following:

+ openssl genrsa -out /var/lib/puppetmaster/ssl/tmp/bld-linux64-ec2-golden.build.releng.use1.mozilla.com-S40dSn.key 2048
+ openssl req -subj /CN=bld-linux64-ec2-golden.build.releng.use1.mozilla.com -new -key /var/lib/puppetmaster/ssl/tmp/bld-linux64-ec2-golden.build.releng.use1.mozilla.com-S40dSn.key
+ openssl ca -batch -config /var/lib/puppetmaster/ssl/ca/openssl.conf -extensions master_ca_exts -in /dev/stdin -notext -out /var/lib/puppetmaster/ssl/tmp/bld-linux64-ec2-golden.build.releng.use1.mozilla.com.crt
+ add_file_to_git /var/lib/puppetmaster/ssl/tmp/bld-linux64-ec2-golden.build.releng.use1.mozilla.com.crt agent-certs/releng-puppet2.srv.releng.scl3.mozilla.com/bld-linux64-ec2-golden.build.releng.use1.mozilla.com.crt 'add agent cert for bld-linux64-ec2-golden.build.releng.use1.mozilla.com'
...
+ rm -f /var/lib/puppetmaster/ssl/ca/lock
+ exit

The openssl commands correspond to https://dxr.mozilla.org/build-central/source/puppet/modules/puppetmaster/templates/ssl_common.sh.erb#56

As can be seen there, those commands are preceded by a "trap" statement, which removes the lockfile at exit. Given that, I think one of the last two openssl commands fails and the script jumps to the trap handler before reaching the add_file_to_git step, which is why the failing trace still ends with the same rm/exit lines as the successful one.
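To illustrate why the two traces end identically, here is a minimal, hedged sketch of the trap-plus-lockfile pattern described above (the names and the failing step are illustrative stand-ins, not the actual ssl_common.sh.erb code):

```shell
LOCK=/tmp/demo-ca.lock

# Run the critical section in a subshell so this demo keeps going afterwards:
(
  set -e                          # any failing command aborts the section
  trap 'rm -f "$LOCK"' EXIT       # lock is removed on ANY exit, success or failure
  touch "$LOCK"
  false                           # stand-in for a failing `openssl ca` call
  echo "never reached"            # the add_file_to_git step would never run
)
echo "section exit status: $?"    # non-zero, but nothing in the trace says why
[ ! -e "$LOCK" ] && echo "lock removed by trap"
```

Because the EXIT trap fires on failure as well as success, the `rm -f .../lock` and `exit` lines appear in the trace either way, masking the point of failure.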
@Dustin: I recall :catlee mentioning you touching this code before, so I was wondering if you have any thoughts on this.
Yes, that sounds about right. Typically these kinds of errors have occurred because there is some other issue with the certificate store, which is why the `openssl ca` command refused to sign the certificate. The lack of output is frustrating, but that's OpenSSL. Jake or possibly Rob should be able to help out.
I apologise that I have no insights to add here.
It looks like the failure only takes place when the certificate is requested from releng-puppet2.srv.releng.scl3.mozilla.com. All other puppetmasters generate and deliver successfully. I'll look further into this and see if I can dig up some errors in the logs.
It looks like this was caused by a certificate that went missing. I'm not really sure why, but maybe there is a race condition in the revocation or generation process.

[email@example.com ca]# grep "try-linux64-ec2-golden" inventory.txt <trimmed>
R 210723083356Z 160724083629Z 3455 unknown /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com
V 210723083755Z 3456 unknown /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com

Since the cert is missing from /var/lib/puppetmaster/ssl/git/agent-certs/releng-puppet2.srv.releng.scl3.mozilla.com, I was able to simply revoke it by its .pem file in /var/lib/puppetmaster/ssl/ca/certs:

[firstname.lastname@example.org ca]# openssl ca -config /var/lib/puppetmaster/ssl/ca/openssl.conf -revoke certs/3456.pem
Using configuration from /var/lib/puppetmaster/ssl/ca/openssl.conf
Revoking Certificate 3456.
Data Base Updated

[email@example.com ca]# grep "try-linux64-ec2-golden" inventory.txt <trimmed>
R 210723083356Z 160724083629Z 3455 unknown /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com
R 210723083755Z 161017200147Z 3456 unknown /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com
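For reference, the inventory.txt shown above follows OpenSSL's tab-separated CA database layout: status (V valid, R revoked, E expired), expiry time, revocation time (empty unless revoked), serial number, cert filename, and subject DN. A small sketch, using the two entries above as sample data, for pulling out the still-valid entries for a CN (paths here are temporary sample files, not the real CA database):

```shell
# Recreate the two pre-revoke inventory.txt entries as a tab-separated sample:
printf 'R\t210723083356Z\t160724083629Z\t3455\tunknown\t/CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com\n' >  /tmp/inventory.sample
printf 'V\t210723083755Z\t\t3456\tunknown\t/CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com\n'             >> /tmp/inventory.sample

# Print serial and subject of entries still marked valid (V); the dangling
# cert revoked above shows up here:
awk -F'\t' '$1 == "V" { print $4, $6 }' /tmp/inventory.sample
# → 3456 /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com
```

Running the same awk against the real database would give a quick way to spot a valid index entry whose cert file has gone missing from the agent-certs directory.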
Looking at last night's cron mail, a cert was successfully generated on releng-puppet2 for try-linux64-ec2-golden.try.releng.use1.mozilla.com. I still don't know what the root cause is, but at least I know what to look for in future cases. At the very least, changing the cert generation email (and underlying script) to indicate failure could be an easy and quick fix.
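A hedged sketch of what "indicate failure" could look like (the step, log path, and subject wording are illustrative, not the real cron script): capture the generation step's exit status and surface it, so a failed run no longer produces a mail identical to a successful one.

```shell
run_step() { false; }               # stand-in for the real cert-generation step

if run_step > /tmp/certgen.log 2>&1; then
    subject="golden AMI cert generation: OK"
else
    subject="golden AMI cert generation: FAILED (exit $?)"
fi
echo "$subject"                     # the real script would pass this to mail -s
# → golden AMI cert generation: FAILED (exit 1)
```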
Filed bug 1312851 to get better visibility for the next time this happens.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED