puppettize.vbs partial failure (failed to extract 2 of 3 keys from certs file)

RESOLVED FIXED

Status

Infrastructure & Operations
RelOps: Puppet
RESOLVED FIXED
2 years ago

People

(Reporter: grenade, Assigned: grenade)

Tracking

Details

(Whiteboard: [aws][windows])

Attachments

(2 attachments)

(Assignee)

Description

2 years ago
This morning's cron-scheduled puppet run on b-2008-ec2-golden failed. Both y-2008 and t-w732 succeeded and ran as normal.

The error was captured in the agent run report as "2016-02-06 02:45:00 -0800 Puppet (err): Could not request certificate: Error 400 on SERVER: this master is not a CA".

This normally indicates a missing cert file on the agent. On the instance, I found that the output from the cscript run of puppettize.vbs (c:\log\puppettize-stderr.log) contained this:

C:\ProgramData\PuppetLabs\puppet\var\puppettize_TEMP.vbs(88, 1) Microsoft VBScript runtime error: Input past end of file

This is new; I've never seen any output in this file before. It's normally empty. It's created when we run puppettize.vbs from userdata powershell with a stderr redirect.

The certs.sh file (in programdata/puppetlabs/...) contained all of the certs we would expect.
The private_keys folder contained the newly created, agent specific, pem file that we expect (file timestamp checked).
The certs folder was missing both of the pem files that we expect.

So, for some reason, puppettize.vbs, which always succeeds and never gives us problems, failed today in a new and wonderful way, after partially succeeding by extracting 1 of the 3 keys in the certs file. I have no idea why, but if anyone wants a copy of the certs.sh file that it choked on in order to debug what happened, I kept a copy and can forward it on request.
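
A check along these lines, run after puppettize.vbs and before the agent contacts the master, would surface a partial extraction like the one described above. This is only a sketch in Python: the directory layout (certs/ca.pem, certs/<agent>.pem, private_keys/<agent>.pem) and the name missing_pem_files are assumptions for illustration, not taken from puppettize.vbs.

```python
from pathlib import Path


def missing_pem_files(ssl_dir, agent_name):
    """Return the expected pem files that are absent or empty under ssl_dir.

    The three paths below are an assumed layout, not read from puppettize.vbs.
    """
    root = Path(ssl_dir)
    expected = [
        root / "certs" / "ca.pem",
        root / "certs" / f"{agent_name}.pem",
        root / "private_keys" / f"{agent_name}.pem",
    ]
    # A zero-byte file counts as missing: a truncated write is as bad as none.
    return [str(p) for p in expected if not p.is_file() or p.stat().st_size == 0]
```

In the situation reported here, with only the private key present, such a check would list the two files under certs, and the caller could stop before puppet runs.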
Blocks: 1246412
(Assignee)

Comment 1

2 years ago
Created attachment 8716669 [details] [review]
https://github.com/mozilla/build-cloud-tools/pull/177
Attachment #8716669 - Flags: review?(mcornmesser)
(Assignee)

Comment 2

2 years ago
The failed puppet run yesterday had unexpected but logical and rational consequences:

- spot instances spawned from the golden ami with the failed puppet run did not start runner/buildbot (I did not expect this: since the golden ami is based off an ami that could run builds without puppet, I was sure the failed puppet run would have no impact. I was wrong.)
- check_ami worked beautifully and killed off (good) spot instances spawned from earlier (good) amis, replacing them with (bad) spot instances spawned from the new (bad) ami. yay for check_ami!
- cloud tools kept seeing demand for b-2008 spot instances because the pending queue was full no matter how many b-2008 instances were spawned. we ended up instantiating the full allocation of 200 b-2008 spot instances, which all just sat around twiddling their thumbs and not talking to buildbot.
- when we noticed (because philor pointed out he had closed the trees), I manually de-registered the bad amis (use1, usw2) and terminated all of the idle b-2008 spot instances using the aws console. cloud tools immediately tried to replace all the instances, but took a full hour to do so, because it took that long for the IP address leases to expire and become available for re-use by new instances.

Today's cron run (Sunday) had no problem running puppettize.vbs. Nothing changed; it just worked the way it should (and always has, excluding yesterday): puppet ran correctly and the golden ami was created.
(Assignee)

Comment 3

2 years ago
the new amis to watch are ami-47c1e92d (use1), ami-f5fd1c95 (usw2)
Attachment #8716669 - Flags: review?(mcornmesser) → review+
(Assignee)

Comment 4

2 years ago
Created attachment 8720786 [details] [review]
https://github.com/mozilla/build-cloud-tools/pull/178
Assignee: relops → rthijssen
Status: NEW → ASSIGNED
Attachment #8720786 - Flags: review?(mcornmesser)
Attachment #8720786 - Flags: review?(mcornmesser) → review+
(Assignee)

Comment 5

2 years ago
fixed by looping puppet runs until someone manually fixes problems.
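
The idea of retrying until a run succeeds can be sketched as below. This is a hypothetical Python illustration, not the change merged in the pull requests above: run_until_success, the run_once callable, and the delay value are all assumed names and parameters.

```python
import time


def run_until_success(run_once, delay_seconds=60, sleep=time.sleep):
    """Call run_once() repeatedly until it returns True; return the attempt count.

    run_once is a hypothetical callable wrapping one puppet run and
    reporting whether it succeeded. There is deliberately no retry limit:
    the loop only ends once a run succeeds.
    """
    attempts = 0
    while True:
        attempts += 1
        if run_once():
            return attempts
        # Wait before retrying so a persistent failure does not spin the CPU.
        sleep(delay_seconds)
```

With no retry limit, a persistent failure keeps the loop running until someone intervenes, which matches the behaviour described in this comment.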
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED