Closed Bug 921862 Opened 11 years ago Closed 11 years ago

zlb is causing connections to timeout on graphite-relay.private.scl3

Categories

(mozilla.org Graveyard :: Server Operations, task)

Hardware: x86_64
OS: Windows 7
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dividehex, Assigned: ericz)

ericz and I noticed this last night while debugging the graphite6 load issue (noted in bug 921789). When collectd polling is dialed back to 60s, the connections through graphite-relay.private.scl3 time out. This causes all sorts of havoc on the collectd client. I don't know enough about zlb to fix this myself.

If zlb can be adjusted to hold the sockets open, we can bring collectd back to 60s poll times and see the load on graphite6 go way down. Currently, infra collectd is set to a 30s poll interval and releng to 25s.
Blocks: 921789, 921796
In my testing, 30 seconds works fine, but I'll do 25s as well if that worked better on the hosts you were looking at.
Assignee: server-ops → eziegenhorn
So I had set the virtual server connection timeouts to 300 seconds from the default of 30 last night.  I've now found an additional timeout on the backend node side that was set to 30 seconds.  I've increased this to 300 seconds as well.  I've tested on a couple servers and don't see any collectd complaints or timeouts.  :dividehex, can you try 60 seconds again on a few of your collectd boxes and see if the errors are gone for you as well?
For the record, the back-end node timeout is set on the pool in Zeus, not on the virtual server itself.
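To illustrate why both timeouts had to be raised, here is a minimal Python sketch of the reasoning. It assumes each setting behaves as an idle timeout and that a collectd connection sits completely idle between writes; both are simplifications, not confirmed Zeus behavior. The names `IdleTimeout` and `hops_that_drop` are hypothetical and exist only for this example. A timeout exactly equal to the write interval is treated as surviving, since 30s polling was reported to work against the 30s default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleTimeout:
    name: str       # label for where the timeout is configured (illustrative only)
    seconds: float  # idle time after which that hop is assumed to close the connection


def hops_that_drop(write_interval_s, timeouts):
    """Return the names of hops whose idle timeout is shorter than the gap between writes.

    Assumption: the connection carries no traffic between two collectd writes,
    so any hop with a shorter idle timeout closes it before the next write.
    """
    return [t.name for t in timeouts if t.seconds < write_interval_s]


# Values taken from the comments above: both settings at 30s originally,
# then the virtual server raised to 300s, then the pool raised to 300s as well.
original = [IdleTimeout("virtual server", 30), IdleTimeout("pool", 30)]
partial_fix = [IdleTimeout("virtual server", 300), IdleTimeout("pool", 30)]
full_fix = [IdleTimeout("virtual server", 300), IdleTimeout("pool", 300)]

if __name__ == "__main__":
    for label, chain in (("original", original),
                         ("partial fix", partial_fix),
                         ("full fix", full_fix)):
        for interval in (25, 30, 60):
            dropped = hops_that_drop(interval, chain)
            status = "dropped by " + ", ".join(dropped) if dropped else "ok"
            print(f"{label:12s} interval={interval:>2}s -> {status}")
```

Under these assumptions, a 60s interval is still cut off after raising only the virtual server timeout, because the pool timeout remains at 30s. That matches needing the second change before 60s polling could be retried.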
(In reply to Eric Ziegenhorn :ericz from comment #2)
> So I had set the virtual server connection timeouts to 300 seconds from the
> default of 30 last night.  I've now found an additional timeout on the
> backend node side that was set to 30 seconds.  I've increased this to 300
> seconds as well.  I've tested on a couple servers and don't see any collectd
> complaints or timeouts.  :dividehex, can you try 60 seconds again on a few
> of your collectd boxes and see if the errors are gone for you as well?

I tested the 60s poll time and it looks good; I don't see any broken pipe errors.
Releng nodes have been shifted back to 60s.

see https://bugzilla.mozilla.org/show_bug.cgi?id=921796#c2
Summary: zlb is cause connections to timeout on graphite-relay.private.scl3 → zlb is causing connections to timeout on graphite-relay.private.scl3
Infra has been running with 60s collectd interval for a day now and it looks good.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard