Closed Bug 1167182 - Opened 9 years ago, Closed 9 years ago

ec2 AMI generation causes git conflicts on puppetmasters

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

Attachments

(3 files, 1 obsolete file)

==== revocation of bld-linux64-ec2-golden.build.releng.use1.mozilla.com-for-releng-puppet1.srv.releng.use1.mozilla.com.crt failed:
Using configuration from /var/lib/puppetmaster/ssl/ca/openssl.conf
ERROR:Already revoked, serial number 03C0
==== revocation of try-linux64-ec2-golden.try.releng.use1.mozilla.com-for-releng-puppet1.srv.releng.use1.mozilla.com.crt failed:
Using configuration from /var/lib/puppetmaster/ssl/ca/openssl.conf
ERROR:Already revoked, serial number 03BF

and whatnot.

I suspect that certs are being issued and revoked fairly rapidly across a variety of puppetmasters.
It looks like this might be due to re-issuing certs every time "assimilation fails".

So the puppet-side fix may be to just refuse to issue a cert when there's a revocation outstanding for it.
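
One hypothetical reading of that guard, sketched below: before signing a new cert for a host, scan the CA database for a revoked entry with the same CN and refuse (or alert) if one is found. The index path and column handling here are assumptions based on the standard openssl ca layout, not the actual puppetmaster code.

import csv

CA_INDEX = "/var/lib/puppetmaster/ssl/ca/index.txt"  # path is an assumption

def has_outstanding_revocation(common_name, index_path=CA_INDEX):
    # The openssl ca database is tab-separated: status, expiry,
    # [revocation date], serial, filename, subject DN.  'R' means revoked.
    with open(index_path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if not row:
                continue
            status, subject = row[0], row[-1]
            if status == "R" and "/CN=%s" % common_name in subject:
                return True
    return False

The issuing path would call this before signing and refuse to hand out a new cert while a revocation for that CN is still outstanding.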
I think there are probably two issues wrapped up in this. 

The first is that the revocation scripts don't handle new certs being generated and old certs being revoked in quick succession, since there's a delay in syncing those revocations across puppetmasters. At the moment, this requires some manual cleanup.

Related to that, if the revocation script fails with an error saying the cert has already been revoked, it should check the exit code/message, handle that case gracefully, and just remove the cert from git.
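
As a rough illustration of that handling (the function, the git cleanup step, and the exact openssl invocation are assumptions, not the real revocation script):

import subprocess

OPENSSL_CONF = "/var/lib/puppetmaster/ssl/ca/openssl.conf"

def revoke_cert(cert_path):
    # Treat openssl's "Already revoked" error as success and fall through
    # to the cleanup, instead of bailing out and leaving the cert in git.
    proc = subprocess.Popen(
        ["openssl", "ca", "-config", OPENSSL_CONF, "-revoke", cert_path],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    if proc.returncode != 0 and b"Already revoked" not in out:
        raise RuntimeError("revocation of %s failed:\n%s" % (cert_path, out))
    # Either way the cert is now revoked; remove it from the git checkout
    # (hypothetical cleanup step -- the real script's git handling may differ).
    subprocess.check_call(["git", "rm", "--quiet", cert_path])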


The second issue (and likely root cause) is that the golden AMI generation scripts quickly loop on generating new puppet certs whenever they hit the error "WARNING - problem assimilating <name>." According to the log output, they retry in 10 seconds.

 May 21 01:15:01 aws-manager2.srv.releng.scl3.mozilla.com CROND: (buildduty) CMD (/builds/aws_manager/bin/aws_manager-bld-linux64-ec2-golden.sh 2>&1 | logger -t 'bld-linux64-ec2-golden')

...

May 21 01:17:20 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:17:19,857 - INFO - Puppetizing bld-linux64-ec2-golden.build.releng.use1.mozilla.com, it may take a while...
May 21 01:17:20 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:17:20,032 - INFO - Secsh channel 12 opened.
May 21 01:26:21 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:26:21,471 - INFO - Unpacking tarballs
May 21 01:26:21 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:26:21,682 - INFO - Secsh channel 13 opened.
May 21 01:26:21 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:26:21,862 - INFO - [chan 13] Opened sftp connection (server version 3)
May 21 01:26:22 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:26:22,475 - INFO - [chan 13] sftp session closed.
May 21 01:26:22 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:26:22,649 - INFO - Secsh channel 14 opened.
May 21 01:26:23 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:26:23,118 - INFO - Secsh channel 15 opened.
May 21 01:27:13 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:27:13,103 - INFO - Secsh channel 16 opened.
May 21 01:27:13 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:27:13,513 - INFO - Secsh channel 17 opened.
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:48:19,428 - WARNING - problem assimilating bld-linux64-ec2-golden.build.releng.use1.mozilla.com (i-0bb08add, 10.134.49.64), retrying in 10 sec ...
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: Traceback (most recent call last):
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:   File "aws_create_instance.py", line 166, in create_instance
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:     deploypass=deploypass, reboot=reboot)
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:   File "/builds/aws_manager/cloud-tools/cloudtools/aws/instance.py", line 156, in assimilate_instance
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:     unpack_tarballs(instance_data["s3_tarballs"])
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:   File "/builds/aws_manager/cloud-tools/cloudtools/aws/instance.py", line 208, in unpack_tarballs
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:     user="cltbld")
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:   File "/builds/aws_manager/lib/python2.7/site-packages/fabric/network.py", line 578, in host_prompting_wrapper
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:     return func(*args, **kwargs)
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:   File "/builds/aws_manager/lib/python2.7/site-packages/fabric/operations.py", line 1095, in sudo
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:     stderr=stderr, timeout=timeout, shell_escape=shell_escape,
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:   File "/builds/aws_manager/lib/python2.7/site-packages/fabric/operations.py", line 932, in _run_command
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:     error(message=msg, stdout=out, stderr=err)
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:   File "/builds/aws_manager/lib/python2.7/site-packages/fabric/utils.py", line 321, in error
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:     return func(message)
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:   File "/builds/aws_manager/lib/python2.7/site-packages/fabric/utils.py", line 34, in abort
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden:     sys.exit(1)
May 21 01:48:19 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: SystemExit: 1

Then we get retries which seem to get closer and closer together:

May 21 01:17:20 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:17:19,857 - INFO - Puppetizing bld-linux64-ec2-golden.build.releng.use1.mozilla.com, it may take a while...
May 21 01:48:51 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:48:51,770 - INFO - Puppetizing bld-linux64-ec2-golden.build.releng.use1.mozilla.com, it may take a while...
May 21 01:55:57 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 01:55:57,773 - INFO - Puppetizing bld-linux64-ec2-golden.build.releng.use1.mozilla.com, it may take a while...
May 21 02:02:26 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 02:02:26,437 - INFO - Puppetizing bld-linux64-ec2-golden.build.releng.use1.mozilla.com, it may take a while...
May 21 02:07:33 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 02:07:33,903 - INFO - Puppetizing bld-linux64-ec2-golden.build.releng.use1.mozilla.com, it may take a while...
May 21 02:13:22 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 02:13:22,781 - INFO - Puppetizing bld-linux64-ec2-golden.build.releng.use1.mozilla.com, it may take a while...
May 21 02:18:08 aws-manager2.srv.releng.scl3.mozilla.com bld-linux64-ec2-golden: 2015-05-21 02:18:08,241 - INFO - Puppetizing bld-linux64-ec2-golden.build.releng.use1.mozilla.com, it may take a while...
I'd rather have it fail and notify us - as it has done - when something's unusual with the certs.  Hiding errors here could lead to hiding indicators of compromise.
Attached file git.log
This is the git log for the last day
Yikes, that's worse than I thought!  I'm impressed that git is doing as well as it is, tbh.

There's not a whole lot we can do here from the puppet side -- there are lots of cases of the same cert being requested from different masters within just a few minutes of one another.  Without tying the masters together using some more active means (backend database, etcd, etc.) there's not much puppet can do to prevent this sort of abuse.
The tricky bit here is that cloud-tools is invoking /root/puppetize.sh on the destination host, which is what's requesting the cert.  We actually switched *to* this mode from the earlier mode of grabbing the cert on aws-manager1 and copying it to the target host.

The error is occurring, per the traceback above, in unpack_tarballs -- the get from S3 is failing for some reason (helpfully obscured by fabric).  The result is to go back to the try/except in aws_create_instance.py, sleep for 10 seconds, and try again.
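
To paraphrase that control flow (a simplified reading of the log and traceback above, not the literal cloud-tools code; the signature here is trimmed down):

import logging
import time

from cloudtools.aws.instance import assimilate_instance  # module named in the traceback above

log = logging.getLogger(__name__)

def create_instance(name, instance, instance_data, deploypass, reboot=True):
    # Simplified paraphrase: any failure inside assimilate_instance(),
    # including fabric's sys.exit(1), lands back here; the next attempt
    # re-puppetizes the host and requests a fresh cert from a randomly
    # chosen puppetmaster.
    while True:
        try:
            assimilate_instance(instance, instance_data,
                                deploypass=deploypass, reboot=reboot)
            return
        except BaseException:
            log.warning("problem assimilating %s, retrying in 10 sec ...", name,
                        exc_info=True)
            time.sleep(10)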

I can think of a few options here:

 1. Fix aws_create_instance to sleep for 20 minutes on failure, and adjust the puppet git synchronization to happen every 5 minutes (it can take two hops for a change to propagate, hence the factor of 4).  That will give puppet time to settle between retries.

 2. Stick with a single puppetmaster when assimilating instances.  We currently use random.choice, which is why each new puppetize.sh run hits a different master.  A single puppetmaster can issue and revoke as quickly as you'd like -- it's just synchronizing across multiple masters that causes issues.

 3. Retry the `unpack_tarballs` operation (and, why not, `unbundle_hg` and `share_repos` too) separately -- no need to re-puppetize the whole host just because S3 is horked.  We already have `redo` as a requirement; a sketch follows below.
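
For option 3, here's a rough sketch of retrying just the flaky step. It assumes redo's retry() helper with its usual keyword arguments (attempts, sleeptime, args, retry_exceptions), which is worth double-checking against the pinned version; the wrapper name is made up.

from redo import retry

from cloudtools.aws.instance import unpack_tarballs  # existing helper, per the traceback above

def unpack_tarballs_with_retries(tarballs):
    # fabric aborts via sys.exit(1), i.e. SystemExit, so retrying on plain
    # Exception alone would not catch the failure shown in the log above.
    retry(unpack_tarballs,
          args=(tarballs,),
          attempts=5,
          sleeptime=10,
          retry_exceptions=(Exception, SystemExit))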

Rail, what do you think?
Flags: needinfo?(rail)
I think all 3 options are worth applying. I'm also fixing the root clause of this in bug 1167213.
Flags: needinfo?(rail)
OK, well, I'll try to fix the subordinate clauses ;)
Attached file MozReview Request: bz://1167182/dustin (obsolete) —
/r/9251 - Bug 1167182: sync puppetmasters' ssl stuff every 5 minutes

Pull down this commit:

hg pull -r f15a59bb0d7395874cd2a7f5d681a5541e164187 https://reviewboard-hg.mozilla.org/build-puppet
Attachment #8609384 - Flags: review?(rail)
Comment on attachment 8609384 [details]
MozReview Request: bz://1167182/dustin

https://reviewboard.mozilla.org/r/9249/#review7921

lgtm
Attachment #8609384 - Flags: review?(rail) → review+
Attachment #8609429 - Flags: review?(rail) → review+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Attachment #8609384 - Attachment is obsolete: true
Attachment #8620347 - Flags: review+
See Also: → 1312040