Closed Bug 733110 Opened 12 years ago Closed 12 years ago

PuppetAgain puppet::atboot systems use a list of masters

Categories

(Release Engineering :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

Attachments

(2 files)

When slaves start up, they use puppet::atboot (see https://wiki.mozilla.org/ReleaseEngineering/Puppet/Modules/puppet).  This is currently configured to act like the old puppet did: block startup until puppet runs successfully.

However, this makes puppet a hard requirement for operations to continue.  Hard requirements are best avoided; only where that's impossible should they be worked around with (much more expensive and difficult-to-get-right) HA systems.

My rough proposal would be to have slaves try several times to run puppet, and failing that, continue startup, while complaining via email or briarpatch.
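
As a rough sketch (not the actual atboot script -- the retry count, master name, and notification address are placeholders), the logic would be something like:

    #! /bin/bash
    # hypothetical retry-then-continue logic for puppet::atboot
    MASTER="puppet"                 # placeholder master hostname
    MAX_TRIES=5
    NOTIFY="release@mozilla.com"    # placeholder notification address

    for try in $(seq 1 $MAX_TRIES); do
        # --onetime/--no-daemonize: run puppet once in the foreground, then return
        puppet agent --onetime --no-daemonize --server "$MASTER" && exit 0
        sleep 60
    done

    # puppet never succeeded: complain loudly, but let startup continue anyway
    echo "puppet failed after $MAX_TRIES attempts on $(hostname)" \
        | mail -s "puppet failure: $(hostname)" "$NOTIFY"
    exit 0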

The expected failure modes for puppet are (a) incorrect manifests and (b) infrastructure failures.  It would be best to differentiate the two, as (a) will happen now and then as people commit bogus manifests and correct them, while (b) is cause for a bug and contacting IT (we will hopefully get concurrent alerts from nagios).

During "normal operations", puppet only serves to verify that a slave looks the way it should - in the vast majority of cases, the configuration will already be correct.  The times when this is not the case are when manifest changes land that have developer visibility (e.g., a perf-changing upgrade).  When that happens, there is generally a scheduled tree closure, or the change is being rolled out "early" before a buildbot or mozconfig change is made to use it -- in either case, the change is monitored by whoever landed it.  If the puppet masters are functioning correctly, no problems occur under my proposal.

We would only see problems when:
1. a developer-visible change lands
2. a puppet master fails
3. the releng person monitoring the change misses the alerts

I think that the confluence of 1 and 2 is unlikely, and that 3 is human error and can be avoided.

IMHO, this slight risk is not enough to justify adding a hard requirement to this new deployment.
(In reply to Dustin J. Mitchell [:dustin] from comment #0)
> My rough proposal would be to have slaves try several times to run puppet,
> and failing that, continue startup, while complaining via email or
> briarpatch.

As long as we use puppet as our guarantee that the machines are in the correct state needed for builds and tests, we must run puppet before the machine goes into buildbot.  As soon as we stop using puppet to establish the guarantee, we stop benefiting from puppet.  We could be booting computers into completely invalid states due to changes in the manifest or a dependency on commands run on the slave as part of the puppet job.

If we were to cache the catalogue on each client for up to X hours and attempt to apply that catalogue if the puppet master was unreachable, we might be able to survive puppet downtime safely.  I don't think that is going to be particularly easy to implement or keep working with each new version of puppet, further burdening an upgrade of our puppet system.  I think the priority should be placed on setting up our puppet infrastructure to be redundant and resilient while still maintaining the guarantee that a successful puppet run means a slave is in a good state to start buildbot.
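
(For what it's worth, stock puppet does have a setting in this direction -- usecacheonfailure, which re-applies the last good catalog when a fresh one can't be fetched or compiled -- though it doesn't answer the larger objections above.  A minimal puppet.conf sketch, assuming we enabled it on the clients:)

    # puppet.conf on the slave -- illustrative only
    [agent]
        # fall back to the last successfully retrieved catalog if the
        # master can't supply a new one
        usecacheonfailure = true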

The list of failure modes presented is not exhaustive.  I can think of a case we recently dealt with for 4.5 months with the dongles, where puppet runs failed because of a legitimate, intermittent machine configuration issue.  Under the proposal above, that would have been a very major problem: we would've been starting buildbot on slaves with the incorrect resolution, leading to useless performance data and failing tests.

The naive suggestion for avoiding this would be "move that check to a script we run before starting buildbot".  The problem there is that we'd need to figure out which checks belong in puppet, which in a script, and which in both.  We'd end up having to maintain a platform- and slave-type-specific set of scripts that roughly approximates what puppet does.  I'd rather not reinvent the wheel.

Instead of increasing complexity, random behaviour and unneeded human intervention, why not build puppet infrastructure in a highly available, scalable, fault tolerant way?  Has using a single CA and having multiple puppet masters that can each sync any slave been investigated?  Has Reductive Labs been contacted about helping to set up highly available puppet?  Has there been any investigation of caching catalogues?

Consistency and correctness are critical in a CI system.  Instead of letting our machines fail-dangerous, let's make sure we've done everything we can to have them never fail, and to have them fail-safe when they do.  There is a serious risk in allowing a failing puppet run to pass into buildbot, and we have a concrete example where doing so had major implications and caused a lot of work.

It is also worth mentioning that with the current behaviour, we can survive momentary puppet master outages.  The failure mode is that machines start to drop out of the pool as they finish their jobs, reboot and fail to puppet.
(In reply to John Ford [:jhford] from comment #1)
> It is also worth mentioning that with the current behaviour, we can survive
> momentary puppet master outages.  The failure mode is that machines start to
> drop out of the pool as they finish their jobs, reboot and fail to puppet.

This is a key point, and, coupled with the isolation from hardware errors that we get with VMs, suggests that we already have some flexibility.  Most typical puppet HA solutions don't gain much resiliency over this configuration.

Your point also suggests a solution, similar to how we'll handle repo mirrors:

The puppet-running scripts currently run puppet in a loop until it succeeds, falling back to a reboot after a few tries.  However, they do so against the same puppet master every time.  The change would be to add a list of all releng puppet masters, and have these scripts cycle through the known puppet masters to find one that's working.  We'll need to figure out how to handle certificates signed on one master and presented to another, but that can be done.
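
A rough sketch of that loop, assuming a hard-coded master list (the hostnames and retry counts here are illustrative, not the real configuration):

    #! /bin/bash
    # illustrative only: cycle through the known masters until one works
    MASTERS="puppet1.example.com puppet2.example.com puppet3.example.com"

    for attempt in 1 2 3; do
        for master in $MASTERS; do
            if puppet agent --onetime --no-daemonize --server "$master"; then
                exit 0          # found a working master; continue startup
            fi
        done
        sleep 60                # none worked; wait and walk the list again
    done

    # still nothing -- fall back to the current behaviour (reboot)
    reboot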

Incidentally, I do want to request an adjustment to how you're conceiving of puppet: puppet is designed to describe a desired state and make the machine look that way.

In cases where puppet cannot alter the state (e.g., dongles), puppet is the wrong tool -- and I think we should build a simple mechanism to add tests like this (with the mechanism maintained by puppet, so we still have a single source of configuration).  The trick for the dongles was an effective hack, but conflated puppet failures with hardware failures, which is not something we'll want when we have more monitoring and reporting from puppet and briarpatch.

Looked at this way, the machine's state *is* a cached version of the catalog.  Deviations from this (e.g., using exec{} as a replacement for @reboot crontab or rc.local or initscripts) are misuses of Puppet.  Again, there are a few of those in the manifests, and they have served us well, but there are better ways.

Do let me know what you think of the proposed change in my third paragraph.  The rest is just food for thought, and I'd be happy to continue that discussion in email.  Note that this change does not need to be made before B2G is up and running (the puppet::atboot class will work fine for now).
(In reply to Dustin J. Mitchell [:dustin] from comment #2)
> The puppet-running scripts currently run puppet in a loop until it succeeds,
> falling back to a reboot after a few tries.  However, they do so against the
> same puppet master every time.  The change would be to add a list of all
> releng puppet masters, and have these scripts cycle through the known puppet
> masters to find one that's working.  We'll need to figure out how to handle
> certificates signed on one master and presented to another, but that can be
> done.

I'm re-purposing this bug to the end described above.
Summary: soft failures in slave startup process for PuppetAgain → PuppetAgain puppet::atboot systems use a list of masters
http://projects.puppetlabs.com/projects/puppet/wiki/Puppet_Scalability#Centralised-Puppet-Infrastructure
says that a CA hierarchy will work with puppet -- except in 0.25.0 and higher.  The bugs it references are still open, despite a note from a year ago that they'd be worked on "in a few weeks".  Yay, puppetlabs.

I suspect that the doesn't-work is based on either failures in SSL verification by the puppetmaster (which we're not using -- Apache does the SSL termination and verification) or failures in the puppet CA infrastructure (which we don't have to use, since we're doing out-of-band signing).  So I got a copy of O'Reilly's "OpenSSL" on my tablet, and I'm goin' in.

Kim, do you want to come along for the ride?

Note that, even if we use a CA hierarchy, the top cert in that hierarchy will still be self-signed -- there's no need to get any external CAs involved here.
Actually, it looks like
  http://projects.puppetlabs.com/projects/puppet/wiki/Multiple_Certificate_Authorities
is exactly what I was hoping would work.  I'll give it a try in relabs.
Sure, I'd be interested in learning more about this setup.
So I'm using this script to generate a set of chained certs:

root
|
+ relabs-puptest1 CA
| |
| + relabs-puptest1 SSL cert
| |
| + relabs08 SSL cert
|
+ relabs-puptest2 CA
  |
  + relabs-puptest2 SSL cert

Now to verify that they actually work, using openssl s_client and s_server.


(and yes, these passwords are stupid - it's a test)


#! /bin/bash

set -x

cd /root
rm -rf puptest-certs || exit 1
mkdir puptest-certs || exit 1
cd puptest-certs

####
# make a self-signed top-level cert

cat <<EOF > openssl.conf
[req]
prompt = no
distinguished_name = root-ca_dn
x509_extensions = root-ca_extns
 
[root-ca_dn]
commonName = PuppetAgain Root CA
emailAddress = release@mozilla.com
organizationalUnitName = Release Engineering
organizationName = Mozilla, Inc.
 
[root-ca_extns]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints = critical,CA:true
keyUsage = keyCertSign, cRLSign
EOF

openssl req -passout pass:rootcapass -x509 -newkey rsa -config openssl.conf -new -keyout root-ca.key -out root-ca.crt -outform PEM || exit 1
openssl x509 -text -in root-ca.crt || exit 1

####
# set up for signing certs with it

mkdir root-ca-certs
touch root-ca-database
echo '01' > root-ca-serial

cat <<EOF > root-ca-openssl.conf
[ca]
default_ca = root-ca
 
[root-ca]
certificate = /root/puptest-certs/root-ca.crt
private_key = /root/puptest-certs/root-ca.key
database = /root/puptest-certs/root-ca-database
new_certs_dir = /root/puptest-certs/root-ca-certs
serial = /root/puptest-certs/root-ca-serial
 
default_crl_days = 7
default_days = 1825
default_md = sha1
 
policy = general_policy
x509_extensions = general_exts
 
[general_policy]
commonName = supplied
emailAddress = supplied
organizationName = supplied
organizationalUnitName = supplied
 
[general_exts]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints = critical,CA:true
keyUsage = keyCertSign, cRLSign
EOF

####
# create CA certs for each puppet master
PUPPETMASTERS="relabs-puptest1.build.mtv1.mozilla.com relabs-puptest2.build.mtv1.mozilla.com"
for pupserv in $PUPPETMASTERS; do
	# make a CA cert
	cat <<EOF > openssl.conf
[req]
prompt = no
distinguished_name = puppetmaster_ca_dn
x509_extensions = puppetmaster_ca_extns
 
[puppetmaster_ca_dn]
commonName = CA on $pupserv
emailAddress = release@mozilla.com
organizationalUnitName = Release Engineering
organizationName = Mozilla, Inc.
 
[puppetmaster_ca_extns]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints = critical,CA:true
keyUsage = keyCertSign, cRLSign
EOF
	openssl genrsa -des3 -out ${pupserv}-ca.key -passout pass:pupcapass  2048 || exit 1
	openssl req -config openssl.conf -new -key ${pupserv}-ca.key -out ${pupserv}-ca.csr -passin pass:pupcapass || exit 1
	openssl req -text -in ${pupserv}-ca.csr || exit 1

	# sign it with the root ca
	openssl ca -config root-ca-openssl.conf -in ${pupserv}-ca.csr -notext -out ${pupserv}-ca.crt -batch -passin pass:rootcapass || exit 1
	openssl x509 -text -in ${pupserv}-ca.crt || exit 1

	# set up for signing certs with it
	mkdir ${pupserv}-ca-certs
	touch ${pupserv}-ca-database
	echo '01' > ${pupserv}-ca-serial

	cat <<EOF > ${pupserv}-ca-openssl.conf
[ca]
default_ca = server_ca
 
[server_ca]
certificate = /root/puptest-certs/${pupserv}-ca.crt
private_key = /root/puptest-certs/${pupserv}-ca.key
database = /root/puptest-certs/${pupserv}-ca-database
new_certs_dir = /root/puptest-certs/${pupserv}-ca-certs
serial = /root/puptest-certs/${pupserv}-ca-serial
 
default_crl_days = 7
default_days = 1825
default_md = sha1
 
policy = general_policy
x509_extensions = general_exts
 
[general_policy]
commonName = supplied
emailAddress = supplied
organizationName = supplied
organizationalUnitName = supplied
 
[general_exts]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints = critical,CA:false
keyUsage = keyEncipherment, digitalSignature
EOF

	# make a server cert
	cat <<EOF > openssl.conf
[req]
prompt = no
distinguished_name = puppetmaster_ca_dn
x509_extensions = puppetmaster_ca_extns
 
[puppetmaster_ca_dn]
commonName = $pupserv
emailAddress = release@mozilla.com
organizationalUnitName = Release Engineering
organizationName = Mozilla, Inc.
 
[puppetmaster_ca_extns]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints = critical,CA:false
keyUsage = keyEncipherment, digitalSignature
EOF
	openssl genrsa -des3 -out ${pupserv}.key -passout pass:pupservpass  2048 || exit 1
	openssl req -config openssl.conf -new -key ${pupserv}.key -out ${pupserv}.csr -passin pass:pupservpass || exit 1
	openssl req -text -in ${pupserv}.csr || exit 1

	# and sign it with the puppetmaster ca
	openssl ca -config ${pupserv}-ca-openssl.conf -in ${pupserv}.csr -notext -out ${pupserv}.crt -batch -passin pass:pupcapass || exit 1
	openssl x509 -text -in ${pupserv}.crt || exit 1
done

# generate a certificate bundle
(
	cat root-ca.crt
	for pupserv in $PUPPETMASTERS; do
		cat ${pupserv}-ca.crt
	done
) > ca-bundle.crt

# and sign a client key with one of the puppet masters' CA key
client=relabs08.build.mtv1.mozilla.com
for pupserv in $PUPPETMASTERS; do
	# generate a client cert
	cat <<EOF > openssl.conf
[req]
prompt = no
distinguished_name = puppetmaster_ca_dn
x509_extensions = puppetmaster_ca_extns
 
[puppetmaster_ca_dn]
commonName = $client
emailAddress = release@mozilla.com
organizationalUnitName = Release Engineering
organizationName = Mozilla, Inc.
 
[puppetmaster_ca_extns]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints = critical,CA:false
keyUsage = keyEncipherment, digitalSignature
EOF
	openssl genrsa -des3 -out ${client}.key -passout pass:clientpass  2048 || exit 1
	openssl req -config openssl.conf -new -key ${client}.key -out ${client}.csr -passin pass:clientpass || exit 1
	openssl req -text -in ${client}.csr || exit 1

	# and sign it with the puppetmaster ca
	openssl ca -config ${pupserv}-ca-openssl.conf -in ${client}.csr -notext -out ${client}.crt -batch -passin pass:pupcapass || exit 1
	openssl x509 -text -in ${client}.crt || exit 1

	# just do this once
	break
done
Well, openssl verify, s_client, and s_server don't deal well with certificates that are bundled with their intermediaries, so I need to test this with Apache.
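
For what it's worth, openssl verify can handle the intermediate if it's passed explicitly rather than bundled -- roughly, using the filenames from the script above:

    # verify the server cert, supplying the intermediate CA explicitly
    openssl verify -CAfile root-ca.crt \
        -untrusted relabs-puptest1.build.mtv1.mozilla.com-ca.crt \
        relabs-puptest1.build.mtv1.mozilla.com.crt

But that's not what Apache will do in production, so testing against Apache is still the real check.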
I blogged what I've learned so far here:
  http://code.v.igoro.us/archives/72-TIL-about-SSL-certificate-chains.html

My intent is to distribute the SSLCACertificatePath directory among the puppetmasters using csync2, so that all certs will be available on all puppetmasters, both for validating client certs, and for sending to clients to provide an intermediate to validate the server cert.

Next up:
 - verify that the puppet client allows a verify depth of 2
 - get CRLs working - hopefully using the same technique of csync2'ing the CRLs around
(In reply to Dustin J. Mitchell [:dustin] from comment #10)
>  - verify that the puppet client allows a verify depth of 2

Success!  Copying into /var/lib/ssl:

./certs/relabs08.build.mtv1.mozilla.com.pem
  the relabs08 cert
./certs/ca.pem
  root-ca.crt
./private_keys/relabs08.build.mtv1.mozilla.com.pem
  the relabs08 key

gets puppet agent --test past the SSL verification stage
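
(For reference, a test invocation along those lines -- pointing puppet at that ssldir and at one of the test masters -- would be roughly:)

    puppet agent --test --ssldir=/var/lib/ssl \
        --server=relabs-puptest1.build.mtv1.mozilla.com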
With the certs I generated earlier, I was able to get a successful SSL connection to :8140.

I put all of the ca certs (root + each puppetmaster) in a cert dir, and added the necessary symlinks.  Then I pointed Apache to this dir with SSLCACertificatePath, and set SSLVerifyDepth to 2.

puppet --test is still erroring out, though.

Once this is done, the next task will be revoking certificates when new certs are issued, and if possible revoking certificates from puppetmasters different from the one that issued the certs.
OK, I've now run puppet agent --test against both test puppet masters, and gotten the errors I expect due to a missing CSV file (which means all of the SSL and auth stuff is right).

I have full notes in my notebook, and will write them up somewhere, but for posterity, the surprising things of note:
 * need to remove the password from the puppet server's private key
 * need to set the puppet server's CA's private key's password to the contents of $ssldir/ca/private/ca.pass
 * passenger config needs to be

        SSLCertificateFile /var/lib/puppet/ssl-master/certs/puppet.pem
        SSLCertificateKeyFile /var/lib/puppet/ssl-master/private_keys/puppet.pem
        SSLCACertificatePath /etc/httpd/ssl/certs

        # If Apache complains about invalid signatures on the CRL, you can try disabling
        # CRL checking by commenting the next line, but this is not recommended.
        #SSLCARevocationFile     /etc/puppet/ssl/ca/ca_crl.pem
        SSLVerifyClient optional
        SSLVerifyDepth  2
        SSLOptions +StdEnvVars

specifically, verify depth 2, and using a cert path that contains all of the CA certs, hashed as OpenSSL likes (a sketch of the hashing step follows this list).
 * set certificate_revocation = false on the client
 * sign client certs with *just* a CN -- no email, OU, etc.

All of this adds up to not using CRLs in either direction.
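
The hashing step mentioned above is the usual OpenSSL cert-dir dance -- assuming the CA certs have already been copied into the SSLCACertificatePath directory, either c_rehash or a manual symlink does the job:

    cd /etc/httpd/ssl/certs
    c_rehash .        # creates <hash>.0 symlinks for every cert in the dir

    # or, by hand, for a single cert:
    ln -s root-ca.crt "$(openssl x509 -hash -noout -in root-ca.crt).0"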

I need to figure out how puppet's certificate_revocation setting works, and see if I can convince it to distribute a combined CRL for all puppet masters.  Checking CRLs on the clients helps us avoid MITM attacks if a master's key is disclosed.  Evidence points to puppet not supporting this, and if that's the case I think we should file a bug with puppetlabs and move on.

I also need to figure out how to do CRL checking in Apache -- SSLCARevocationFile is, as you see above, commented out.  This kind of revocation is much more important, as it's how we mark old client certs as no longer valid, e.g., when retiring or re-imaging a machine.
No, puppet agent only handles one CRL, which conflicts directly with having multiple CAs (since there's one CRL per CA).  So I think we need to punt on the client-side CRL checking.  I filed https://projects.puppetlabs.com/issues/14550 to fix that.

On to the last step, CRL checking in Apache.
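
A sketch of what that will probably look like on the Apache side -- mod_ssl accepts either one concatenated CRL file or a hashed directory, the latter fitting the csync2 approach (paths are illustrative):

    # a single file containing all the CAs' CRLs concatenated together...
    SSLCARevocationFile /etc/httpd/ssl/crls/ca-crls.pem

    # ...or a directory of CRLs, hashed like the cert dir (symlinks end in .r0)
    #SSLCARevocationPath /etc/httpd/ssl/crls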
Success.  Concept proven.  Time to implement.
Attached patch bug733110.patch
https://github.com/djmitche/releng-puppet/commit/bug733110.patch

This disables client-side CRL checking, which doesn't work as described in previous comments.
Attachment #625304 - Flags: review?(kmoir)
Attachment #625304 - Flags: review?(kmoir) → review+
checked-in
Depends on: 757997
I installed this new stuff on relabs-puppet.build.mtv1, and successfully sent a client there that had already been signed on relabs-puptest1.

I'm also kickstarting relabs07 to see if the new puppetize.sh in bug 757997 works correctly, but I don't expect problems there.

My plan for deploying this is to land the change, manually set up releng-puppet1.build.scl1, and then manually re-sign all of the B2G nodes, using the new puppetize.sh from bug 757997.  This would entail probably 30m or so of partial downtime for the B2G builders, where they will fail to connect to puppet until everything is complete.  Given the low load, I don't expect this to be a significant problem.

I'd like to do this Thursday.  Can I get sign-off from someone in releng for this?
(In reply to Dustin J. Mitchell [:dustin] from comment #18) 
> I'd like to do this Thursday.  Can I get sign-off from someone in releng for
> this?

Dustin: works for me as buildduty.
Excellent!  This is landed, and all of the B2G hosts are re-puppetized (by just running ./puppetize.sh in /root, and entering the deploypass when prompted).

There's a bit more to do:
 - puppet::atboot's initscript needs to try other puppetmasters if 'puppet' fails
 - I need to automatically reload apache via csync2 when the certdir changes
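
For the second item, a rough csync2.cfg fragment (hostnames, key path, and the reload command are placeholders that would need to match the real deployment):

    group puppetcerts {
        host releng-puppet1.build.scl1.mozilla.com releng-puppet2.build.scl1.mozilla.com;
        key /etc/csync2/csync2.key;
        include /etc/httpd/ssl/certs;
        action {
            pattern /etc/httpd/ssl/certs/*;
            exec "/sbin/service httpd reload";
            logfile "/var/log/csync2-actions.log";
        }
    }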
Waiter, there's a fly in my ointment:
  https://bugs.ruby-lang.org/issues/6493

so server certs need to list the FQDN in the subjectAltName, too -- which is made more annoying by OpenSSL's insistence that such a value live in the openssl.conf file.
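
Concretely, that means the server-cert extensions in the generated openssl.conf grow a subjectAltName line, roughly like this (matching the heredoc in the script above, where $pupserv is expanded by the shell):

    [puppetmaster_ca_extns]
    authorityKeyIdentifier=keyid,issuer:always
    basicConstraints = critical,CA:false
    keyUsage = keyEncipherment, digitalSignature
    subjectAltName = DNS:$pupserv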
Attached patch bug733110.patch
https://github.com/djmitche/releng-puppet/commit/bug733110.patch

The problem in the previous comment was fixed by re-generating the server certificates on each master using the proper openssl.conf.  The wiki and infra puppet are updated accordingly.
Attachment #627047 - Flags: review?(jwatkins)
Comment on attachment 627047 [details] [diff] [review]
bug733110.patch

(not sure who's best to review this, so whoever gets there first wins!)
Attachment #627047 - Flags: review?(kmoir)
Apache is automatically reloading on certdir changes now.
Attachment #627047 - Flags: review?(kmoir) → review+
Comment on attachment 627047 [details] [diff] [review]
bug733110.patch

committed
Attachment #627047 - Flags: review?(jwatkins)
That about takes care of it, then - we're resilient again!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering