Closed Bug 1125159 Opened 10 years ago Closed 10 years ago

zlb[38] complaining about nf_conntrack

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Usul, Assigned: gozer)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/337] )

Fri 06:22:15 PST [1410] zlb8.ops.phx1.mozilla.com:nf_conntrack_count is WARNING: WARN: nf_conntrack table size 948496/1048576, 90% (http://m.mozilla.org/nf_conntrack_count)

<Usul> gozer, it could be fallout from bug 1119899
<gozer> Yes, too many people out sick here for a while now....
<gozer> Usul: yeah, could very well be
<Usul> anything we can do before we run out of conntracks
<gozer> Usul: yeah, working on it
<linda> :)
<gozer> Usul: the issue is that we are counting wrong, so we are not really running out atm
<Usul> gozer <3
<gozer> Usul: nothing much I can do here, unfortunately
<gozer> not sure the hotfix is related, shouldn't be touching aus4
<Usul> :(
<gozer> I suspect way more AUS4, as this puppy went live relatively recently
<gozer> Usul: nothing I can do, but doesn't mean there is a problem ATM, like I said, that check is somewhat misleading
<gozer> it counts used slots in the conntrack tables vs. maximum
<gozer> but lots of these entries are effectively empty placeholders, waiting to be replaced, an optimization of sorts
<gozer> Usul: 609484 such spare entries in there
<Usul> ok
<Usul> so I'm going to ack the alert
<gozer> Usul: yes, for now, nothing more can be done, unfortunately
<Usul> kk
<gozer> Usul: but I suggest bringing someone from releng in on this, bhearsum, actually, since he's the AUS4 dude
<Usul> of course he's not around now :(
<gozer> Usul: like I said, it's not a serious problem ATM, remember, there are an extra 600,000 spare entries in that table
<gozer> Usul: I'll follow up on that later today too, cc me on the zlb bug please
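For anyone following along, the raw numbers the check compares can be read straight from procfs; a minimal sketch (assuming standard paths, and that conntrack-tools is installed for the last two commands, which this bug doesn't confirm):

# cat /proc/sys/net/netfilter/nf_conntrack_count    <- current entries, what the check alerts on
# cat /proc/sys/net/netfilter/nf_conntrack_max      <- table limit
# conntrack -C                                      <- same count via conntrack-tools
# conntrack -L 2>/dev/null | grep -c TIME_WAIT      <- rough count of near-dead entries; just an illustration, not necessarily how the 609484 spares were counted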
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/337]
Thanks for the cc. I don't really understand this bug or what (if anything) you'd like from me. Please let me know if we should talk about this or I need to do anything!
Assignee: server-ops-webops → gozer
<nagios-phx1:#sysadmins> Sun 05:26:06 PST [1035] zlb8.ops.phx1.mozilla.com:nf_conntrack_count is WARNING: WARN: nf_conntrack table size 945525/1048576, 90% (http://m.mozilla.org/nf_conntrack_count)

This alerted again; AUS4 is indeed taking a lot of connections at the moment. I've increased nf_conntrack_max and decreased nf_conntrack_tcp_timeout_established to stop it alerting until a better solution can be worked out.

# echo 2097152 > /proc/sys/net/netfilter/nf_conntrack_max
# sysctl net.netfilter.nf_conntrack_tcp_timeout_established=43200
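One caveat on the workaround: both commands change only the running kernel, so a reboot reverts them. If the bump should persist, the equivalent settings could go into /etc/sysctl.conf (or however sysctls are managed on these hosts, e.g. via puppet; that part is an assumption) and be loaded with sysctl -p:

net.netfilter.nf_conntrack_max = 2097152
net.netfilter.nf_conntrack_tcp_timeout_established = 43200

# sysctl -p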
zlb3 alerted today:

nagios-scl3 Mon 05:52:01 PST [5164] zlb3.ops.scl3.mozilla.com:nf_conntrack_count is WARNING: WARN: nf_conntrack table size 983694/1048576, 93%
Do we need to bump it up to the levels that aus3 is at, maybe? It's serving pretty much all of the traffic that aus3 used to.
As far as I can see AUS4 is being served from the same ZLBs as AUS3. These are machine-level settings that haven't changed for either until now so I'm unclear what you mean by bumping it up to other levels?
(In reply to Peter Radcliffe [:pir] from comment #6) > As far as I can see AUS4 is being served from the same ZLBs as AUS3. These > are machine-level settings that haven't changed for either until now so I'm > unclear what you mean by bumping it up to other levels? Based on your previous comment I assumed that nf_conntrack_max was something that was set per domain (eg, aus3 and aus4 could have different settings for it). Looks like I'm wrong, sorry for the fly by!
I have also applied the same workaround pir did on zlb8 to zlb3:

# echo 2097152 > /proc/sys/net/netfilter/nf_conntrack_max
# sysctl net.netfilter.nf_conntrack_tcp_timeout_established=43200
Might this be related to the aus3.m.o --> aus4.m.o redirects? There is also aus2.m.o --> aus3.m.o, but that's lower traffic and has been there for a long time.
This is also being reported in bug 1069798. Put back the workaround from above on zlb8. Situation doesn't seem to have changed.
and workaround back on zlb3
Summary: zlb8 complaining about nf_conntrack → zlb[38] complaining about nf_conntrack
<digi:#systems> if we don't have any stateful rules we should disable conntrack
<digi:#systems> /etc/modprobe.d/blacklist.conf
<digi:#systems> iptables -L will tell you
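Spelling out digi's suggestion as a sketch (assuming these boxes do no NAT, which also depends on conntrack, and the grep pattern below is only illustrative): first confirm that no rule matches on connection state, then blacklist the module so it isn't loaded at the next boot.

# iptables -L -n -v | grep -i -E 'state|ctstate'    <- no output suggests no stateful rules
# echo 'blacklist nf_conntrack' >> /etc/modprobe.d/blacklist.conf

A reboot (or unloading the conntrack modules) would be needed for this to take effect.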
We've altered how AUS3/4 behave since this last occurred, and the conntrack usage hasn't been a problem since those fixes. :pir, can we downgrade the conntrack check to IRC only? It's very helpful *if* there's an incident, but otherwise it's not the sort of thing that should page and escalate on its own.
Flags: needinfo?(pradcliffe+bugzilla)
Should be done.

pir@wedge> svn diff
Index: puppet/trunk/modules/nagios/manifests/mozilla/services.pp
===================================================================
--- puppet/trunk/modules/nagios/manifests/mozilla/services.pp (revision 103408)
+++ puppet/trunk/modules/nagios/manifests/mozilla/services.pp (working copy)
@@ -5068,6 +5068,7 @@
         service_description => "nf_conntrack_count",
         check_command => 'check_iptables',
         normal_check_interval => 30,
+        contact_groups => 'sysalertsonly',
         hostgroups => $::fqdn ? {
             'nagios1.private.scl3.mozilla.com' => [ 'external-zeus'

pir@wedge> svn ci -m "make nf_conntrack_count an IRC only alert, bug 1125159"
Sending        puppet/trunk/modules/nagios/manifests/mozilla/services.pp
Transmitting file data .
Committed revision 103409.
Flags: needinfo?(pradcliffe+bugzilla)
Thanks!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Thanks for all the time and effort on this folks!
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard