Improve puppet::atboot error handling and reporting

Status: NEW
Product: Infrastructure & Operations
Component: RelOps: Puppet
9 months ago
9 months ago

People

(Reporter: dhouse, Assigned: dhouse)

Description

9 months ago
1. Improve error reporting. Should the script show what the failure was on stderr/stdout? Should it return an error exit code?
a. Does the script log the failure clearly when it is executed manually or when its logs are reviewed? Resolution of the problem in bug 1393524 was delayed because the script did not report the failure (the failure output is redirected inside the script). :nthomas noted this problem in http://logs.glob.uno/?c=mozilla%23releng&s=7+Sep+2017&e=7+Sep+2017#c312298

2. Propose a patch that removes the infinite loop of puppet runs and reboots. Fail over to running tasks after N retries?
Modify the puppet::atboot script's error checking so that a failure case cannot loop forever; instead, report the failure while still allowing the slave to perform work.
a. If the same problem recurs, how will the script prevent infinite reboots?
b. Is there a puppet failure case that would demand that workers not run tasks?
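
To make items 1 and 2 concrete, here is a minimal sketch of a bounded retry loop that reports each failure on stderr and returns a non-zero exit code after N attempts instead of looping forever. It is written in Python for illustration only and is not the puppet::atboot script. The names `run_with_retries`, `run_once`, and `max_attempts` are hypothetical, and `run_once` stands in for whatever actually invokes puppet.

```python
import sys
from typing import Callable, Tuple


def run_with_retries(run_once: Callable[[], Tuple[int, str]],
                     max_attempts: int = 3,
                     log=sys.stderr) -> int:
    """Call run_once up to max_attempts times.

    run_once is a hypothetical callable returning (exit_code, output).
    Returns 0 on the first success, otherwise the last non-zero exit code.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_code = 1
    for attempt in range(1, max_attempts + 1):
        code, output = run_once()
        if code == 0:
            return 0
        last_code = code
        # Report the failure where a person running the script will see it,
        # rather than hiding it behind a redirect.
        print(f"puppet run failed (attempt {attempt}/{max_attempts}, "
              f"exit code {code})", file=log)
        if output:
            print(output, file=log)
    # Bounded: stop retrying and let the caller decide whether to take work.
    print(f"giving up after {max_attempts} attempts", file=log)
    return last_code
```

The caller would decide what a non-zero return means: start tasks anyway (the fail-over in item 2) or refuse work (the answer to 2b for failures that must block tasks).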
(Assignee)

Comment 1

9 months ago
Pros and cons so far for #2 (TL;DR: I've talked myself out of it at this point. If puppet is borked, stopping work is the best alert to prompt us to fix it quickly.)

Current (grep over stdout; looping on puppet and reboots):
positive:
1. forces puppet runs to be clean of errors (or workers do not run)
2. escalates attention to problems quickly (no work is getting done)
negative:
1. prevents work from getting done on any puppet error (the error may be minor and not affect the system)
2. fragile, because it depends on a grep over stdout from the puppet agent run (may be inaccurate)
3. infinite retry only succeeds if state changes on the host or the puppet code changes
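
One possible alternative to the grep in negative 2 is to read the agent's exit status. With `--detailed-exitcodes`, `puppet agent` exits 0 (no changes), 2 (changes applied), 4 (some resources failed), 6 (changes applied and some resources failed), or 1 (the run itself failed). The sketch below is in Python for illustration only. The function names and category labels are hypothetical, and whether the flag fits the puppet version and invocation used by puppet::atboot is an assumption that would need checking.

```python
import subprocess


def classify_puppet_exit(code: int) -> str:
    """Map a `puppet agent --detailed-exitcodes` status to a coarse category.

    The category labels are made up for this sketch.
    """
    if code == 0:
        return "ok-no-changes"
    if code == 2:
        return "ok-changes-applied"
    if code in (4, 6):
        return "resource-failures"
    # 1 (run failed) or anything unexpected, e.g. killed by a signal.
    return "run-failed"


def run_puppet_agent() -> str:
    """Run the agent once and classify the result (assumed invocation)."""
    proc = subprocess.run(
        ["puppet", "agent", "--onetime", "--no-daemonize",
         "--detailed-exitcodes"],
    )
    return classify_puppet_exit(proc.returncode)
```

Separating "resource-failures" from "run-failed" would also give a place to decide which failures should block work and which are minor enough to allow it.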

Proposed (fail over to taking work after N failures):
positive:
1. allows work to be completed while puppet is failing
2. tries running tasks instead of assuming that a puppet failure means tasks will not run
3. expects puppet not to leave things in a bad state unless it explicitly needs a reboot and re-run
4. expects puppet to be stable and to fail only for minor things that will be fixed
negative:
1. could produce invalid test failures when the system is broken by a puppet failure
2. will not apply the fixed puppet state as quickly (the current behavior re-runs puppet indefinitely until it succeeds)