zlb is causing connections to timeout on graphite-relay.private.scl3

RESOLVED FIXED

Status

Product: mozilla.org Graveyard
Component: Server Operations
Status: RESOLVED FIXED
Reported: 4 years ago
Modified: 3 years ago

People

(Reporter: dividehex, Assigned: ericz)

(Reporter)

Description

4 years ago
ericz and I noticed this last night while debugging the graphite6 load issue (noted in bug 921789). When collectd polling is dialed back to 60s, connections through graphite-relay.private.scl3 time out. This causes all sorts of havoc on the collectd client. I don't know enough about zlb to fix this myself.

If zlb can be adjusted to hold the sockets open, we can bring collectd back to 60s poll times and see the load on graphite6 go way down. Currently, infra collectd is set to a 30s poll interval and releng to 25s.
(Reporter)

Updated

4 years ago
Blocks: 921789, 921796
(Assignee)

Comment 1

4 years ago
In my testing, 30 seconds works fine, but I'll test 25s as well if that worked better on the hosts you were looking at.
(Assignee)

Updated

4 years ago
Assignee: server-ops → eziegenhorn
(Assignee)

Comment 2

4 years ago
So I had set the virtual server connection timeouts to 300 seconds from the default of 30 last night.  I've now found an additional timeout on the backend node side that was set to 30 seconds.  I've increased this to 300 seconds as well.  I've tested on a couple servers and don't see any collectd complaints or timeouts.  :dividehex, can you try 60 seconds again on a few of your collectd boxes and see if the errors are gone for you as well?
(Assignee)

Comment 3

4 years ago
For the record, the back-end node timeout is set on the pool in Zeus, not on the virtual server itself.
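
To illustrate why both settings had to change, here is a minimal sketch in Python. It assumes a collectd connection sits idle between polls and is dropped by whichever layer (virtual server or pool) has the shorter timeout. The function names are hypothetical and are not part of Zeus or collectd. Treating an interval equal to the timeout as unsafe is a conservative assumption; comment 1 reports 30s working in practice.

```python
def effective_idle_timeout(virtual_server_s: float, pool_s: float) -> float:
    """Return the idle time after which the connection is dropped.

    Assumption: the connection is closed by whichever layer gives up
    first, so the shorter of the two timeouts wins.
    """
    return min(virtual_server_s, pool_s)


def interval_is_safe(poll_interval_s: float,
                     virtual_server_s: float,
                     pool_s: float) -> bool:
    """True if a poll interval fits strictly inside the idle timeout.

    An interval equal to the timeout is treated as unsafe here because
    the write and the timeout would race (a conservative choice).
    """
    return poll_interval_s < effective_idle_timeout(virtual_server_s, pool_s)


if __name__ == "__main__":
    # Timeout values (seconds) taken from the comments in this bug.
    configs = [
        ("defaults", 30, 30),
        ("virtual server raised only", 300, 30),
        ("virtual server and pool raised", 300, 300),
    ]
    for label, vs_timeout, pool_timeout in configs:
        for interval in (25, 30, 60):
            verdict = "ok" if interval_is_safe(interval, vs_timeout, pool_timeout) else "at risk"
            print(f"{label}: {interval}s interval -> {verdict}")
```

Under these assumptions, raising only the virtual server timeout leaves a 60s interval at risk, which matches the need for the second change on the pool.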
(Reporter)

Comment 4

4 years ago
(In reply to Eric Ziegenhorn :ericz from comment #2)
> So I had set the virtual server connection timeouts to 300 seconds from the
> default of 30 last night.  I've now found an additional timeout on the
> backend node side that was set to 30 seconds.  I've increased this to 300
> seconds as well.  I've tested on a couple servers and don't see any collectd
> complaints or timeouts.  :dividehex, can you try 60 seconds again on a few
> of your collectd boxes and see if the errors are gone for you as well?

I tested the 60s poll time and it looks good; I don't see any broken pipe errors.
Releng nodes have been shifted back to 60s.

see https://bugzilla.mozilla.org/show_bug.cgi?id=921796#c2
(Assignee)

Updated

4 years ago
Duplicate of this bug: 921806

Updated

4 years ago
Summary: zlb is cause connections to timeout on graphite-relay.private.scl3 → zlb is causing connections to timeout on graphite-relay.private.scl3
(Assignee)

Comment 6

4 years ago
Infra has been running with a 60s collectd interval for a day now and it looks good.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard