Closed Bug 1295995 Opened 8 years ago Closed 8 years ago

Golden AMI generation gets stuck often lately

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aselagea, Assigned: dividehex)

Details

Attachments

(2 files)

This has been happening to 'try-linux64-ec2-golden' during the past two weeks, but other AMIs could encounter similar issues during the generation process.

In #buildduty we received the following alert:
<nagios-releng> Wed 07:02:22 PDT [4083] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 5 crit, 0 warn out of 5 processes with args ec2-golden (http://m.mozilla.org/procs+age+-+golden+AMI)

The AMI generation process seems to get stuck at the puppetization step:

    Certificate request for try-linux64-ec2-golden.try.releng.use1.mozilla.com
    Got incorrect certificates (!?)
Hitting this now. I am assuming this is not related to the Tree Closure Window.

13:15:25 <nagios-releng> Sat 12:15:29 PDT [4747] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 5 crit, 0 warn out of 5 processes with args ec2-golden (http://m.mozilla.org/procs+age+-+golden+AMI)
Could be related; from aws_sanity_checker.log:

Aug 20 06:01:40 aws-manager2.srv.releng.scl3.mozilla.com aws_sanity_checker.py: 70 t-w732-ec2-golden (i-0ade5712e498ea036, us-east-1) Unknown type: 't-w732'
Aug 20 06:01:40 aws-manager2.srv.releng.scl3.mozilla.com aws_sanity_checker.py: 1 try-linux64-ec2-golden (i-04c7936f2be4329b5, us-east-1) Unknown state: 'pending'
The latest occurrence of the issue for 'try-linux64-ec2-golden' is from this weekend (Oct 9). As mentioned in the description, the AMI generation fails at the puppetization step:

"Certificate request for try-linux64-ec2-golden.try.releng.use1.mozilla.com
Got incorrect certificates (!?)"

Attached are the e-mail contents generated for:
    - try-linux64-ec2-golden: failed
    - bld-linux64-ec2-golden: successful
Taking a look at the end of the log for try-linux64-ec2-golden:

+ openssl genrsa -out /var/lib/puppetmaster/ssl/tmp/try-linux64-ec2-golden.try.releng.use1.mozilla.com-C1d0bg.key 2048
+ openssl req -subj /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com -new -key /var/lib/puppetmaster/ssl/tmp/try-linux64-ec2-golden.try.releng.use1.mozilla.com-C1d0bg.key
+ openssl ca -batch -config /var/lib/puppetmaster/ssl/ca/openssl.conf -extensions master_ca_exts -in /dev/stdin -notext -out /var/lib/puppetmaster/ssl/tmp/try-linux64-ec2-golden.try.releng.use1.mozilla.com.crt
+ rm -f /var/lib/puppetmaster/ssl/ca/lock
+ exit

While for bld-linux64-ec2-golden we have the following:

+ openssl genrsa -out /var/lib/puppetmaster/ssl/tmp/bld-linux64-ec2-golden.build.releng.use1.mozilla.com-S40dSn.key 2048
+ openssl req -subj /CN=bld-linux64-ec2-golden.build.releng.use1.mozilla.com -new -key /var/lib/puppetmaster/ssl/tmp/bld-linux64-ec2-golden.build.releng.use1.mozilla.com-S40dSn.key
+ openssl ca -batch -config /var/lib/puppetmaster/ssl/ca/openssl.conf -extensions master_ca_exts -in /dev/stdin -notext -out /var/lib/puppetmaster/ssl/tmp/bld-linux64-ec2-golden.build.releng.use1.mozilla.com.crt
+ add_file_to_git /var/lib/puppetmaster/ssl/tmp/bld-linux64-ec2-golden.build.releng.use1.mozilla.com.crt agent-certs/releng-puppet2.srv.releng.scl3.mozilla.com/bld-linux64-ec2-golden.build.releng.use1.mozilla.com.crt 'add agent cert for bld-linux64-ec2-golden.build.releng.use1.mozilla.com'
...
+ rm -f /var/lib/puppetmaster/ssl/ca/lock
+ exit

The openssl commands correspond to https://dxr.mozilla.org/build-central/source/puppet/modules/puppetmaster/templates/ssl_common.sh.erb#56
As can be seen, those commands are preceded by a "trap" statement, which removes the lockfile on exit. Given that, I think one of the last two openssl commands fails, so the script hits the trap and exits before reaching the add_file_to_git step (a rough sketch of the pattern is below).
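
For reference, a minimal sketch of that lock/trap/sign pattern, assuming (not quoting) the structure of ssl_common.sh.erb; the variable names and locking details here are placeholders:

#!/bin/bash
# Simplified illustration of the pattern described above, not the verbatim script.
fqdn="try-linux64-ec2-golden.try.releng.use1.mozilla.com"
tmpkey="/var/lib/puppetmaster/ssl/tmp/${fqdn}.key"
tmpcrt="/var/lib/puppetmaster/ssl/tmp/${fqdn}.crt"
lock="/var/lib/puppetmaster/ssl/ca/lock"

# Take the CA lock (the real script presumably checks for an existing lock first)
# and register a trap so the lock is removed however the script exits.
touch "$lock"
trap 'rm -f "$lock"' EXIT

set -e   # a failing openssl command aborts the script; the EXIT trap still runs
openssl genrsa -out "$tmpkey" 2048
openssl req -subj "/CN=${fqdn}" -new -key "$tmpkey" |
    openssl ca -batch -config /var/lib/puppetmaster/ssl/ca/openssl.conf \
        -extensions master_ca_exts -in /dev/stdin -notext -out "$tmpcrt"
# ... add_file_to_git and cert delivery would follow here ...

If `openssl ca` refuses to sign, the script aborts at that point, the trap removes the lock, and the trace never reaches the add_file_to_git step, which matches the failed try-linux64 log above.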

@Dustin: I recall :catlee mentioning that you've touched this code before, so I was wondering whether you have any thoughts on this.
Flags: needinfo?(dustin)
Yes, that sounds about right.

Typically these kinds of errors have occurred because there's some other issue with the certificate store, which is why the `openssl ca` command refused to sign the certificate.  The lack of output is frustrating, but that's OpenSSL.
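
If anyone wants to dig further, one way to get past the silence is to re-run the signing step by hand on the puppetmaster and capture the exit code and stderr. This is a hypothetical diagnostic sketch, not something taken from the bug; the CSR path is a placeholder:

# Run manually on the affected puppetmaster.
conf=/var/lib/puppetmaster/ssl/ca/openssl.conf
csr=/tmp/try-linux64-ec2-golden.csr   # a CSR generated as in the traces above

openssl ca -batch -config "$conf" -extensions master_ca_exts \
    -in "$csr" -notext -out /tmp/test.crt 2> /tmp/openssl-ca.err
echo "openssl ca exited with $?"
cat /tmp/openssl-ca.err

# A refusal to sign can also come from the CA database itself, for example an
# existing still-valid entry for the same CN when the config enforces unique
# subjects, so it is worth checking for duplicates:
grep "try-linux64-ec2-golden" /var/lib/puppetmaster/ssl/ca/inventory.txt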

Jake or possibly Rob should be able to help out.
Flags: needinfo?(rthijssen)
Flags: needinfo?(jwatkins)
Flags: needinfo?(dustin)
I apologise that I have no insights to add here.
Flags: needinfo?(rthijssen)
It looks like the failure only takes place when the certificate is requested from releng-puppet2.srv.releng.scl3.mozilla.com.  All other puppetmasters generate and deliver successfully.

I'll look further into this and see if I can dig up some errors in the logs.
Flags: needinfo?(jwatkins)
Assignee: nobody → jwatkins
It looks like this was caused by a certificate that went missing.  I'm not really sure why, but maybe there is a race condition in the revocation or generation process.

[root@releng-puppet2.srv.releng.scl3.mozilla.com ca]# grep "try-linux64-ec2-golden" inventory.txt
<trimmed>
R    210723083356Z    160724083629Z    3455    unknown    /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com
V    210723083755Z        3456    unknown    /CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com

Since the cert is missing from /var/lib/puppetmaster/ssl/git/agent-certs/releng-puppet2.srv.releng.scl3.mozilla.com, I was able to simply revoke it by its .pem file in /var/lib/puppetmaster/ssl/ca/certs

[root@releng-puppet2.srv.releng.scl3.mozilla.com ca]# openssl ca -config /var/lib/puppetmaster/ssl/ca/openssl.conf -revoke certs/3456.pem
Using configuration from /var/lib/puppetmaster/ssl/ca/openssl.conf
Revoking Certificate 3456.
Data Base Updated

[root@releng-puppet2.srv.releng.scl3.mozilla.com ca]# grep "try-linux64-ec2-golden" inventory.txt
<trimmed>
R	210723083356Z	160724083629Z	3455	unknown	/CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com
R	210723083755Z	161017200147Z	3456	unknown	/CN=try-linux64-ec2-golden.try.releng.use1.mozilla.com
Looking at last night's cron mail, a cert was successfully generated on releng-puppet2 for try-linux64-ec2-golden.try.releng.use1.mozilla.com.

I still don't know what the root cause is, but at least I know what to look for in future cases.  At the very least, changing the cert generation email (and the underlying script) to indicate failure could be an easy and quick fix.
Filed bug 1312851 to get better visibility for the next time this happens.
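
As a rough illustration of the quick fix suggested above, a hypothetical failure-aware wrapper around the cert generation step might look like the sketch below. This is not the actual change tracked in bug 1312851; the script name and mail address are made up:

#!/bin/bash
# Hypothetical wrapper: run the signing step, capture its output, and make the
# result visible in the cron mail subject instead of silently looking "done".
fqdn="try-linux64-ec2-golden.try.releng.use1.mozilla.com"
log=$(mktemp)

# sign_agent_cert.sh stands in for whatever script the cron job actually runs.
if /var/lib/puppetmaster/ssl/sign_agent_cert.sh "$fqdn" > "$log" 2>&1; then
    status="OK"
else
    status="FAILED"
fi

# Put the outcome in the subject line so a failed run is obvious at a glance.
mail -s "golden AMI cert generation ${status}: ${fqdn}" release@example.com < "$log"
rm -f "$log"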
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard