Bug 748906
Opened 13 years ago
Closed 13 years ago
scl1 puppet unable to handle the load post 25Apr2012 downtime
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bear, Assigned: dustin)
Details
(Whiteboard: [buildduty])
During the downtime we noticed that linux* hosts were not starting buildbot and were triggering nagios alerts. On checking, we found that the puppetd run was timing out while making requests.
The scl1 puppet master was rebooted to see if that would help; it did not.
iptables rules were added to reduce the load, and that worked, somewhat.
(needs more info)
Assignee
Comment 1•13 years ago
This is in the process of being fixed as follows (we've done this once before, over a year ago):
Block off the entire subnet in small increments (I used /27):
for j in 48 49 50 51; do for i in 0 32 64 96 128 160 192 224; do iptables -I INPUT -j REJECT -p tcp --destination-port 8140 -s 10.12.$j.$i/27; done; done
Kill any lingering processes.
Start opening up one /27 at a time (using iptables -L -v -n to see which have any traffic). Wait at least 5-10 minutes between each, and watch the logfiles for failures.
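The reopening step can be sketched as follows. This is a dry-run helper, not the commands that were actually run: the function name list_unblock_cmds is hypothetical, and it assumes each block is reopened by deleting its REJECT rule with iptables -D and the same rule specification used when blocking. It only prints the 32 commands (4 subnets x 8 blocks), so one can be run by hand at a time with the 5-10 minute wait in between.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print, but do not execute, the command that
# removes the REJECT rule for each /27, in the order the blocking loop
# inserted them. Assumes the rules were added exactly as in the loop above.
list_unblock_cmds() {
    for j in 48 49 50 51; do
        for i in 0 32 64 96 128 160 192 224; do
            echo "iptables -D INPUT -j REJECT -p tcp --destination-port 8140 -s 10.12.$j.$i/27"
        done
    done
}
```

At 5-10 minutes per block, working through all 32 blocks takes somewhere between about 2.5 and 5.5 hours.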
The slaves all run puppet on boot, and if puppet fails, they retry ten times at one-minute intervals, then reboot and do it all again. Puppet can fail because of manifest errors, or because the puppetmaster is busy and requests time out. Because of the retry logic, errors of the first sort lead quickly to errors of the second sort. The fix for the second sort is the iptables trick.
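To make the retry behaviour concrete, here is a minimal sketch of that kind of loop. It is not the slaves' actual boot script: run_with_retries is a hypothetical name, and the attempt count and delay are passed in as parameters rather than taken from any real configuration.

```shell
#!/bin/sh
# Sketch only: retry a command a fixed number of times with a fixed delay.
# Usage: run_with_retries ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
run_with_retries() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # No point sleeping after the final failed attempt.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}
```

With ten attempts, a 60-second delay, and a reboot on failure, every slave that cannot complete a run keeps sending requests to the puppetmaster indefinitely, which is how a handful of manifest failures can turn into sustained load.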
Now, we don't know what pushed this over the edge. There were manifest-related errors in the logs for the ix boxes. I strongly suspect that these errors were due to nondeterminism in the manifests, but it's hard to say. They could just be bugs in puppet.
At any rate, things seem functional now.
RESO/UPGRADETOPUPPETAGAIN
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Assignee
Updated•13 years ago
Assignee: nobody → dustin
Reporter
Updated•13 years ago
Whiteboard: [buildduty][downtime][outage] → [buildduty]
Updated•12 years ago
Product: mozilla.org → Release Engineering
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard