Closed Bug 1125159 Opened 10 years ago Closed 10 years ago

zlb[38] complaining about nf_conntrack

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Usul, Assigned: gozer)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/337] )

Fri 06:22:15 PST [1410] zlb8.ops.phx1.mozilla.com:nf_conntrack_count is WARNING: WARN: nf_conntrack table size 948496/1048576, 90% (http://m.mozilla.org/nf_conntrack_count)

<Usul> gozer, it could be fallout from bug 1119899
<gozer> Yes, too many people out sick here for a while now....
<gozer> Usul: yeah, could very well be
<Usul> anything we can do before we run out of conntracks
<gozer> Usul: yeah, working on it
<linda> :)
<gozer> Usul: the issue is that we are counting wrong, so we are not really running out atm
<Usul> gozer <3
<gozer> Usul: nothing much I can do here, unfortunately
<gozer> not sure the hotfix is related, shouldn't be touching aus4
<Usul> :(
<gozer> I suspect way more AUS4, as this puppy went live relatively recently
<gozer> Usul: nothing I can do, but doesn't mean there is a problem ATM, like I said, that check is somewhat misleading
<gozer> it counts used slots in the conntrack tables vs. maximum
<gozer> but lots of these entries are effectively empty placeholders, waiting to be replaced, an optimization of sorts
<gozer> Usul: 609484 such spare entries in there
<Usul> ok
<Usul> so I'm going to ack the alert
<gozer> Usul: yes, for now, nothing more can be done, unfortunately
<Usul> kk
<gozer> Usul: but I suggest bringing someone from releng in on this, bhearsum, actually, since he's the AUS4 dude
<Usul> of course he's not around now :(
<gozer> Usul: like I said, it's not a serious problem ATM, remember, there are an extra 600,000 spare entries in that table
<gozer> Usul: I'll follow up on that later today too, cc me on the zlb bug please
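For anyone following along, the raw numbers the check compares can be read straight from procfs; a minimal sketch (assuming standard paths, and that conntrack-tools is installed for the last two commands, which this bug doesn't confirm):

# cat /proc/sys/net/netfilter/nf_conntrack_count    <- current entries, what the check alerts on
# cat /proc/sys/net/netfilter/nf_conntrack_max      <- table limit
# conntrack -C                                      <- same count via conntrack-tools
# conntrack -L 2>/dev/null | grep -c TIME_WAIT      <- rough count of near-dead entries; just an illustration, not necessarily how the 609484 spares were counted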
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/337]
Thanks for the cc. I don't really understand this bug or what (if anything) you'd like from me. Please let me know if we should talk about this or I need to do anything!
Assignee: server-ops-webops → gozer
<nagios-phx1:#sysadmins> Sun 05:26:06 PST [1035] zlb8.ops.phx1.mozilla.com:nf_conntrack_count is WARNING: WARN: nf_conntrack table size 945525/1048576, 90% (http://m.mozilla.org/nf_conntrack_count)

This alerted again; AUS4 is indeed taking a lot of connections at the moment. I've increased nf_conntrack_max and decreased nf_conntrack_tcp_timeout_established to stop it alerting until a better solution can be worked out.

# echo 2097152 > /proc/sys/net/netfilter/nf_conntrack_max
# sysctl net.netfilter.nf_conntrack_tcp_timeout_established=43200
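One caveat on the workaround: both commands change only the running kernel, so a reboot reverts them. If the bump should persist, the equivalent settings could go into /etc/sysctl.conf (or however sysctls are managed on these hosts, e.g. via puppet; that part is an assumption) and be loaded with sysctl -p:

net.netfilter.nf_conntrack_max = 2097152
net.netfilter.nf_conntrack_tcp_timeout_established = 43200

# sysctl -p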
zlb3 alerted today:

nagios-scl3 Mon 05:52:01 PST [5164] zlb3.ops.scl3.mozilla.com:nf_conntrack_count is WARNING: WARN: nf_conntrack table size 983694/1048576, 93%
Do we need to bump it up to the levels that aus3 is at, maybe? It's serving pretty much all of the traffic that aus3 used to.
As far as I can see AUS4 is being served from the same ZLBs as AUS3. These are machine-level settings that haven't changed for either until now so I'm unclear what you mean by bumping it up to other levels?
(In reply to Peter Radcliffe [:pir] from comment #6) > As far as I can see AUS4 is being served from the same ZLBs as AUS3. These > are machine-level settings that haven't changed for either until now so I'm > unclear what you mean by bumping it up to other levels? Based on your previous comment I assumed that nf_conntrack_max was something that was set per domain (eg, aus3 and aus4 could have different settings for it). Looks like I'm wrong, sorry for the fly by!
I have also applied the same workaround pir did on zlb8 to zlb3:

# echo 2097152 > /proc/sys/net/netfilter/nf_conntrack_max
# sysctl net.netfilter.nf_conntrack_tcp_timeout_established=43200
Might this be related to the aus3.m.o --> aus4.m.o redirects? There is also aus2.m.o --> aus3.m.o, but that's lower traffic and has been there for a long time.
This is also being reported in bug 1069798. Put back the workaround from above on zlb8. Situation doesn't seem to have changed.
and workaround back on zlb3
Summary: zlb8 complaining about nf_conntrack → zlb[38] complaining about nf_conntrack
<digi:#systems> if we don't have any stateful rules we should disable conntrack
<digi:#systems> /etc/modprobe.d/blacklist.conf
<digi:#systems> iptables -L will tell you
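Spelling out digi's suggestion as a sketch (assuming these boxes do no NAT, which also depends on conntrack, and the grep pattern below is only illustrative): first confirm that no rule matches on connection state, then blacklist the module so it isn't loaded at the next boot.

# iptables -L -n -v | grep -i -E 'state|ctstate'    <- no output suggests no stateful rules
# echo 'blacklist nf_conntrack' >> /etc/modprobe.d/blacklist.conf

A reboot (or unloading the conntrack modules) would be needed for this to take effect.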
We've altered how AUS3/4 behave since this last occurred, and the conntrack usage hasn't been a problem since those fixes. :pir, can we downgrade the conntrack check to IRC only? It's very helpful *if* there's an incident, but otherwise it's not the sort of thing that should page and escalate on its own.
Flags: needinfo?(pradcliffe+bugzilla)
Should be done.

pir@wedge> svn diff
Index: puppet/trunk/modules/nagios/manifests/mozilla/services.pp
===================================================================
--- puppet/trunk/modules/nagios/manifests/mozilla/services.pp (revision 103408)
+++ puppet/trunk/modules/nagios/manifests/mozilla/services.pp (working copy)
@@ -5068,6 +5068,7 @@
         service_description => "nf_conntrack_count",
         check_command => 'check_iptables',
         normal_check_interval => 30,
+        contact_groups => 'sysalertsonly',
         hostgroups => $::fqdn ? {
             'nagios1.private.scl3.mozilla.com' => [ 'external-zeus'

pir@wedge> svn ci -m "make nf_conntrack_count an IRC only alert, bug 1125159"
Sending        puppet/trunk/modules/nagios/manifests/mozilla/services.pp
Transmitting file data .
Committed revision 103409.
Flags: needinfo?(pradcliffe+bugzilla)
Thanks!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Thanks for all the time and effort on this folks!
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard