puppet capable machines should have an exponential backoff before rebooting to reduce puppet master load

RESOLVED FIXED

Status

Release Engineering
General
P2
normal
6 years ago
4 years ago

People

(Reporter: dustin, Assigned: jhford)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [puppet])

Attachments

(1 attachment)

(Reporter)

Description

6 years ago
scl-production-puppet is getting repeated connections from talos-r4-snow-* and talos-r4-lion-*.  talos-r4-lion-075 is a decent example:

Dec 11 09:20:36 scl-production-puppet-new puppetmasterd[14850]: Compiled catalog for talos-r4-lion-075.build.scl1.mozilla.com in 35.07 seconds
Dec 11 09:23:10 scl-production-puppet-new puppetmasterd[14701]: Compiled catalog for talos-r4-lion-075.build.scl1.mozilla.com in 14.70 seconds
Dec 11 09:25:31 scl-production-puppet-new puppetmasterd[14701]: Compiled catalog for talos-r4-lion-075.build.scl1.mozilla.com in 2.50 seconds
Dec 11 09:27:42 scl-production-puppet-new puppetmasterd[14701]: Compiled catalog for talos-r4-lion-075.build.scl1.mozilla.com in 9.45 seconds
Dec 11 09:29:25 scl-production-puppet-new puppetmasterd[14850]: Compiled catalog for talos-r4-lion-075.build.scl1.mozilla.com in 7.49 seconds
Dec 11 09:31:27 scl-production-puppet-new puppetmasterd[14701]: Compiled catalog for talos-r4-lion-075.build.scl1.mozilla.com in 5.78 seconds
Dec 11 09:33:43 scl-production-puppet-new puppetmasterd[14756]: Compiled catalog for talos-r4-lion-075.build.scl1.mozilla.com in 12.68 seconds
Dec 11 09:35:58 scl-production-puppet-new puppetmasterd[14850]: Compiled catalog for talos-r4-lion-075.build.scl1.mozilla.com in 10.69 seconds
Dec 11 09:37:57 scl-production-puppet-new puppetmasterd[14756]: Compiled catalog for talos-r4-lion-075.build.scl1.mozilla.com in 11.86 seconds

Looking on this host, I see

err: //Node[talos-r4-lion-075]/talos_osx_rev4/Exec[verify-resolution]/returns: change from notrun to 0 failed: /usr/local/bin/screenresolution set 1600x1200x32 returned 1 instead of 0 at /etc/puppet/manifests/os/talos_osx_rev4.pp:113

but it only loops and re-runs puppet -- the host has been up for four days, pinging the master every 2 minutes or so.  Either a reboot, or a bail-out with error email or something, would probably be a better choice here.
(Reporter)

Comment 1

6 years ago
VNC'ing into that machine shows iCal onscreen, too, fyi.  Its resolutions list in "Displays" only shows four options, and has 1280x1024 selected.
We discussed this at our meeting today.  We will do an exponential backoff.  We will do this for all puppet failures, instead of just screen resolution failures.
Summary: snow- and lion- minis do not reboot if resolution is incorrect → puppet capable machines should have an exponential backoff before rebooting to reduce puppet master load
Created attachment 581127 [details] [diff] [review]
exponential backoff idea 1
Attachment #581127 - Flags: review?(bear)
(Reporter)

Comment 4

6 years ago
Comment on attachment 581127 [details] [diff] [review]
exponential backoff idea 1

The iterator for an exponential backoff would be
 next_sleep=$((next_sleep * BACKOFF))
leading to 1, 2, 4, 8, ... (with BACKOFF=2), rather than 1, 2, 6, 42, 1806, 3263442.
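
For reference, a minimal sketch of a retry loop built around that iterator. This is not the attached patch: BACKOFF's default, MAX_RETRIES, SLEEP_CMD and both function names are assumptions made up for illustration.

```sh
#!/bin/sh
# Sketch only: BACKOFF, MAX_RETRIES, SLEEP_CMD and both function names are
# illustrative assumptions, not taken from the attached patch.
BACKOFF=${BACKOFF:-2}          # multiplier applied after each failure
MAX_RETRIES=${MAX_RETRIES:-5}  # stop retrying after this many failures
SLEEP_CMD=${SLEEP_CMD:-sleep}  # overridable so the loop can be dry-run

# Print the delays that would be used, one per line: 1, 2, 4, 8, ...
backoff_delays() {
    next_sleep=1
    attempt=0
    while [ "$attempt" -lt "$MAX_RETRIES" ]; do
        echo "$next_sleep"
        next_sleep=$((next_sleep * BACKOFF))
        attempt=$((attempt + 1))
    done
}

# Run "$@" until it succeeds, sleeping longer after each failure.
# Returns 0 on success, 1 once MAX_RETRIES attempts have failed.
retry_with_backoff() {
    next_sleep=1
    attempt=0
    while [ "$attempt" -lt "$MAX_RETRIES" ]; do
        if "$@"; then
            return 0
        fi
        "$SLEEP_CMD" "$next_sleep"
        next_sleep=$((next_sleep * BACKOFF))
        attempt=$((attempt + 1))
    done
    return 1
}
```

A boot script could call something like `retry_with_backoff run_puppet_once || reboot_host`, where `run_puppet_once` and `reboot_host` are hypothetical placeholders for whatever the real script uses to run puppet and to reboot.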

I like this idea, though.  You should probably change the CentOS boot scripts to do the same, and merge it to http://hg.mozilla.org/puppet, too (or I can do that if you prefer).

Comment 5

6 years ago
Comment on attachment 581127 [details] [diff] [review]
exponential backoff idea 1

Question (if I'm doing the math right):

2^12 = 4096 seconds, and 4096 / 60 ≈ 68 minutes

Do we want the backoff to reach an hour?  Maybe we can set the retry to 11 (a small change, but it puts the max at about 17 minutes).
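
To check the arithmetic, here is a small sketch that prints the delay after n doublings, assuming the sleep is counted in seconds and starts at 1. BASE and backoff_after are illustrative names; how a given retry count maps to n depends on how the patch counts attempts, which is not shown in this thread.

```sh
#!/bin/sh
# Sketch only: BASE and backoff_after are illustrative names.
BASE=2

# Print the sleep (in seconds) after $1 multiplications, starting from 1.
backoff_after() {
    sleep_secs=1
    i=0
    while [ "$i" -lt "$1" ]; do
        sleep_secs=$((sleep_secs * BASE))
        i=$((i + 1))
    done
    echo "$sleep_secs"
}

for n in 10 11 12; do
    secs=$(backoff_after "$n")
    # Integer division, so minutes are rounded down.
    echo "2^$n = ${secs}s, about $((secs / 60)) min"
done
# prints:
# 2^10 = 1024s, about 17 min
# 2^11 = 2048s, about 34 min
# 2^12 = 4096s, about 68 min
```

So a 17 minute ceiling corresponds to ten doublings, and 68 minutes to twelve.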
Attachment #581127 - Flags: review?(bear) → review+

Updated

6 years ago
Assignee: nobody → jhford
Status: NEW → ASSIGNED
Priority: -- → P2
Whiteboard: [puppet]
deployed
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering