Closed
Bug 921862
Opened 11 years ago
Closed 11 years ago
zlb is causing connections to timeout on graphite-relay.private.scl3
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dividehex, Assigned: ericz)
ericz and I noticed this last night while debugging the graphite6 load issue (noted in bug 921789). When collectd polling is dialed back to 60s, the connections through graphite-relay.private.scl3 time out. This causes all sorts of havoc on the collectd client. I don't know enough about zlb to fix this myself. If zlb can be adjusted to hold the sockets open, we can bring collectd back to 60s poll times and see the load on graphite6 go way down. Currently, infra collectd is set to a 30s poll interval and releng to 25s.
Assignee
Comment 1•11 years ago
In my testing, 30 seconds works fine, but I'll do 25s as well if that worked better on the hosts you were looking at.
Assignee
Updated•11 years ago
Assignee: server-ops → eziegenhorn
Assignee
Comment 2•11 years ago
So I had set the virtual server connection timeouts to 300 seconds from the default of 30 last night. I've now found an additional timeout on the backend node side that was set to 30 seconds. I've increased this to 300 seconds as well. I've tested on a couple servers and don't see any collectd complaints or timeouts. :dividehex, can you try 60 seconds again on a few of your collectd boxes and see if the errors are gone for you as well?
Assignee
Comment 3•11 years ago
For the record, the back-end node timeout is set on the pool in Zeus, not on the virtual server itself.
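As a rough illustration of why raising only the virtual server timeout was not enough, the sketch below models the two settings mentioned in comments 2 and 3. It assumes the connection is dropped by whichever idle timeout is shorter, which matches what was observed here but is not taken from Zeus documentation. The function names and the `intervals_s` mapping are hypothetical and are not part of any Zeus or collectd API.

```python
def effective_idle_timeout(virtual_server_timeout_s, pool_timeout_s):
    """Assumed model: the shorter of the two idle timeouts wins."""
    return min(virtual_server_timeout_s, pool_timeout_s)


def intervals_at_risk(intervals_s, virtual_server_timeout_s, pool_timeout_s):
    """Return the poll intervals that are longer than the effective timeout.

    An interval exactly equal to the timeout is not flagged here, although
    in practice that case is a race and deserves some safety margin.
    """
    limit = effective_idle_timeout(virtual_server_timeout_s, pool_timeout_s)
    return {name: s for name, s in intervals_s.items() if s > limit}


# Hypothetical labels; the interval values are the ones discussed in this bug.
intervals = {"infra_current": 30, "releng_current": 25, "desired": 60}

# Virtual server raised to 300s, pool still at 30s (state before comment 2).
before = intervals_at_risk(intervals, virtual_server_timeout_s=300, pool_timeout_s=30)

# Both raised to 300s (state after comment 2).
after = intervals_at_risk(intervals, virtual_server_timeout_s=300, pool_timeout_s=300)
```

Under this model, only the 60s interval is at risk while the pool timeout is still 30s, and nothing is at risk once both timeouts are 300s.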
Reporter
Comment 4•11 years ago
(In reply to Eric Ziegenhorn :ericz from comment #2)
> So I had set the virtual server connection timeouts to 300 seconds from the
> default of 30 last night. I've now found an additional timeout on the
> backend node side that was set to 30 seconds. I've increased this to 300
> seconds as well. I've tested on a couple servers and don't see any collectd
> complaints or timeouts. :dividehex, can you try 60 seconds again on a few
> of your collectd boxes and see if the errors are gone for you as well?

I tested the 60s poll time and it looks good: I don't see any broken pipe errors. Releng nodes have been shifted back to 60s. See https://bugzilla.mozilla.org/show_bug.cgi?id=921796#c2
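One way to check an idle timeout without waiting for collectd to log broken pipe errors is to hold a connection open and see whether it is still alive after a quiet period. The sketch below is not what was run for this bug. It assumes the listener behind the load balancer sends nothing back on its own, so that a readable socket after the idle period means the connection was closed. The function name `connection_survives_idle` is hypothetical, and the host and port are supplied by the caller.

```python
import select
import socket
import time


def connection_survives_idle(host, port, idle_seconds, connect_timeout=5.0):
    """Connect, stay silent for idle_seconds, then report whether the
    peer (or a proxy in between) has closed the connection."""
    with socket.create_connection((host, port), timeout=connect_timeout) as sock:
        time.sleep(idle_seconds)
        # Assumption: the server never sends unsolicited data, so the
        # socket only becomes readable on EOF or on a connection reset.
        readable, _, _ = select.select([sock], [], [], 0)
        if not readable:
            return True
        try:
            # Peek so that any real data is left in the buffer; an empty
            # result means the other side closed the connection.
            return sock.recv(1, socket.MSG_PEEK) != b""
        except ConnectionError:
            return False
```

Calling it with an idle period slightly longer than the poll interval you want to use (for example 65 seconds for a 60s interval) shows whether the path keeps the socket open that long.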
Updated•11 years ago
Summary: zlb is cause connections to timeout on graphite-relay.private.scl3 → zlb is causing connections to timeout on graphite-relay.private.scl3
Assignee
Comment 6•11 years ago
Infra has been running with a 60s collectd interval for a day now and it looks good.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard