Closed Bug 748906 Opened 13 years ago Closed 13 years ago

scl1 puppet unable to handle the load post 25Apr2012 downtime

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bear, Assigned: dustin)

Details

(Whiteboard: [buildduty])

During the downtime we noticed that linux* hosts were not loading buildbot and were generating nagios alerts. On checking, we found that the puppetd run was timing out while making requests. The scl1 puppet master was rebooted to see if that would help; it did not. iptables rules were added to reduce the load, and that worked, somewhat. (needs more info)
This is in the process of being fixed as follows (and we've done this once before, over a year ago):

block off the entire subnet in small increments (I used /27):
    for j in 48 49 50 51; do for i in 0 32 64 96 128 160 192 224; do iptables -I INPUT -j REJECT -p tcp --destination-port 8140 -s 10.12.$j.$i/27; done; done
kill any lingering processes
start opening up one /27 at a time (using iptables -L -v -n to see which have any traffic). Wait at least 5-10 minutes between each, and watch the logfiles for failures.

The slaves all run puppet on boot, and if puppet fails, they retry ten times at one-minute intervals, then reboot and do it all again. Puppet can fail because there are manifest errors, or because the puppetmaster is busy and things time out. Because of the retry logic, errors of the first sort lead quickly to errors of the second sort: every failing slave keeps hitting the master, which adds to its load. The solution to the second sort is the iptables trick.

Now, we don't know what pushed this over the edge. There were manifest-related errors in the logs for the ix boxes. I strongly suspect that these errors were due to nondeterminism in the manifests, but it's hard to say. They could just be bugs in puppet.

At any rate, things seem functional now.

RESO/UPGRADETOPUPPETAGAIN
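
For anyone repeating the iptables trick, here is a rough dry-run sketch of the block and unblock steps. It only prints the commands so they can be reviewed before running them as root. The subnets (10.12.48-51) and port 8140 are taken from the comment above; the function names block_rules and unblock_rule, and the idea of deleting a rule with -D using the same rule spec, are my own illustrative assumptions, not what was actually typed on the puppet master.

```shell
#!/bin/sh
# Dry-run sketch: prints iptables commands instead of executing them.
# PORT and the 10.12.48-51 range come from the bug comment; the function
# names are hypothetical helpers for illustration only.

PORT=8140

# Print one REJECT rule per /27 across 10.12.48.0 - 10.12.51.255 (32 rules).
block_rules() {
    for j in 48 49 50 51; do
        for i in 0 32 64 96 128 160 192 224; do
            echo "iptables -I INPUT -j REJECT -p tcp --destination-port $PORT -s 10.12.$j.$i/27"
        done
    done
}

# Print the command that removes the REJECT rule for a single /27,
# e.g. unblock_rule 10.12.48.0/27. The rule spec must match the one
# used when the rule was inserted.
unblock_rule() {
    echo "iptables -D INPUT -j REJECT -p tcp --destination-port $PORT -s $1"
}
```

Something like `block_rules | sh` (as root) would apply the blocks; then run the output of `unblock_rule` for one /27 at a time, checking `iptables -L -v -n` and the logfiles and waiting the 5-10 minutes between each.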
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Assignee: nobody → dustin
Whiteboard: [buildduty][downtime][outage] → [buildduty]
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard