Closed Bug 1246397 Opened 8 years ago Closed 8 years ago

puppettize.vbs partial failure (failed to extract 2 of 3 keys from certs file)

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

Hardware: x86_64
OS: Windows Server 2008
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: grenade, Assigned: grenade)

References

Details

(Whiteboard: [aws][windows])

Attachments

(2 files)

This morning's cron-scheduled puppet run on b-2008-ec2-golden failed. Both y-2008 and t-w732 succeeded and ran as normal.

The error was captured in the agent run report as "2016-02-06 02:45:00 -0800 Puppet (err): Could not request certificate: Error 400 on SERVER: this master is not a CA".

This normally indicates a missing cert file on the agent. On the instance, I found that the output from the cscript run of puppettize.vbs (c:\log\puppettize-stderr.log) contained this:

C:\ProgramData\PuppetLabs\puppet\var\puppettize_TEMP.vbs(88, 1) Microsoft VBScript runtime error: Input past end of file

This is new; I've never seen any output in this file before. It's normally empty. It's created when we run puppettize.vbs from userdata PowerShell with a stderr redirect.
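
For reference, "Input past end of file" is the VBScript runtime error raised when ReadLine (or Read) is called on a TextStream that has already been fully consumed. A minimal sketch of that pattern, with the AtEndOfStream guard that avoids it; this is not the real puppettize.vbs, and the filename and parsing comment are placeholders:

Const ForReading = 1
Set fso = CreateObject("Scripting.FileSystemObject")
Set f = fso.OpenTextFile("certs.sh", ForReading)   ' placeholder path, not the real location
Do While Not f.AtEndOfStream   ' without this guard, a truncated or oddly
    line = f.ReadLine          ' terminated file raises "Input past end of file"
    ' ... accumulate lines between BEGIN/END markers into key/cert files ...
Loop
f.Close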

- The certs.sh file (in programdata/puppetlabs/...) contained all of the certs we would expect.
- The private_keys folder contained the newly created, agent-specific .pem file that we expect (file timestamp checked).
- The certs folder was missing both of the .pem files that we expect.

So, for some reason, puppettize.vbs (which always succeeds and never gives us problems) failed today in a new and wonderful way, after partially succeeding by extracting 1 of the 3 keys from the certs file. I have no idea why, but I kept a copy of the certs.sh file it choked on and can forward it on request to anyone who wants to debug what happened.
Blocks: 1246412
The failed puppet run yesterday had unexpected but logical and rational consequences:

- spot instances spawned from the golden ami with the failed puppet run did not start runner/buildbot (I did not expect this; since they were based off an ami that could run builds without puppet, I was sure the failed puppet run would have no impact. I was wrong.)
- check_ami worked beautifully and killed off (good) spot instances spawned from earlier (good) amis, replacing them with (bad) spot instances spawned from the new (bad) ami. Yay for check_ami!
- cloud tools kept seeing demand for b-2008 spot instances because the pending queue stayed full no matter how many b-2008 instances were spawned. We ended up instantiating the full allocation of 200 b-2008 spot instances, which all just sat around twiddling their thumbs and not talking to buildbot.
- when we noticed (because philor pointed out he had closed the trees), I manually de-registered the bad amis (use1, usw2) and terminated all of the idle b-2008 spot instances using the AWS console. Cloud tools immediately tried to replace all the instances, but took a full hour to do so, because it took that long for the IP address leases to expire and become available for re-use by new instances.

Today's cron run (Sunday) had no problem running puppettize.vbs. Nothing changed; it just worked the way it should (and always has, excluding yesterday), puppet ran correctly, and the golden ami was created.
The new amis to watch are ami-47c1e92d (use1) and ami-f5fd1c95 (usw2).
Attachment #8716669 - Flags: review?(mcornmesser) → review+
Assignee: relops → rthijssen
Status: NEW → ASSIGNED
Attachment #8720786 - Flags: review?(mcornmesser)
Attachment #8720786 - Flags: review?(mcornmesser) → review+
Fixed by looping puppet runs, so a failed run is retried until it succeeds or someone manually fixes the underlying problem.
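
For context, here is a minimal sketch of that retry approach; the command line, exit-code handling, and sleep interval are assumptions for illustration, not the contents of the reviewed patch:

Set shell = CreateObject("WScript.Shell")
Do
    ' run a single puppet agent pass and capture its exit code
    ' (assumes puppet is on PATH; --detailed-exitcodes: 0 = clean, 2 = changes applied)
    rc = shell.Run("cmd /c puppet agent --onetime --no-daemonize --detailed-exitcodes", 0, True)
    If rc = 0 Or rc = 2 Then Exit Do
    WScript.Sleep 60000   ' wait a minute, then retry the run
Loop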
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED