Closed
Bug 1025145
Opened 11 years ago
Closed 9 years ago
Zeus loses graphite data for new connections
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: cknowles, Unassigned)
References
Details
(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/90] )
I've recently been working to get some data into graphite, and have been having a boatload of trouble.
(I know, you're asking why this is labeled as a zeus problem... hang on)
The upshot is this: I'm using Python code to open a socket, connect, send data, and close. The problem is that graphite would only actually notice the incoming data a few times in a hundred.
I've found two ways to make it work: 1) put a one-second delay between opening the socket and sending the data, or 2) connect *not* to graphite-relay.private.scl3.mozilla.com, but instead to the actual non-zeus address, graphite6.private.scl3.mozilla.com.
<speculation>
So, it feels like zeus accepts the connection from the client and then tries to open the connection to the server, but drops some or all of the data the client sends while zeus is still negotiating with the server.
</speculation>
Waiting 1 second certainly makes the connection reliable. In my previous days' attempts, I'd be lucky to get 5 of the 10-minute data points in a 4-hour period; with this method I see all 24 since I started using it this morning.
I now have a perfectly acceptable workaround; I mainly wanted to mention this here in case there are other similar complaints rolling around.
If you have any questions/comments/concerns, please let me know.
---
Non-working code, where message was a \n-terminated list of strings:
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((socket.gethostbyname('graphite-relay.private.scl3.mozilla.com'), 2003))
sock.sendall(message)
sock.close()  # parentheses needed; a bare sock.close never calls close
Working code, where message was a \n-terminated list of strings:
import socket
import time
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((socket.gethostbyname('graphite-relay.private.scl3.mozilla.com'), 2003))
time.sleep(1) # That's the difference, 1 second
sock.sendall(message)
sock.close()  # parentheses needed; a bare sock.close never calls close
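For anyone hitting the same thing, the workaround could be wrapped in a small helper along these lines. This is only a sketch under assumptions: `build_payload`, `send_to_graphite`, and the `delay` parameter are names made up for illustration, the message is assumed to be in graphite's plaintext format (`path value timestamp`, one metric per line), and the code targets Python 3, where the payload has to be bytes.

```python
import socket
import time


def build_payload(lines):
    """Join metric lines into one newline-terminated ASCII payload."""
    return "".join(line + "\n" for line in lines).encode("ascii")


def send_to_graphite(host, port, lines, delay=1.0):
    """Connect, optionally wait `delay` seconds, then send all lines.

    Hypothetical helper: the wait mirrors the time.sleep(1) workaround
    above. Why the wait helps is unconfirmed.
    """
    payload = build_payload(lines)
    # create_connection resolves the host name and connects in one step.
    with socket.create_connection((host, port), timeout=10) as sock:
        if delay > 0:
            time.sleep(delay)
        sock.sendall(payload)
    # Leaving the with-block closes the socket.
    return payload
```

Passing `delay=0` gives the original no-wait behaviour, which makes it easy to compare the two cases against the same endpoint.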
Updated•11 years ago
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/185]
Updated•10 years ago
Summary: Zeus and data into graphite → Zeus loses graphite data for new connections
Updated•10 years ago
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/185] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2441] [kanban:https://kanbanize.com/ctrl_board/4/185]
Updated•10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2441] [kanban:https://kanbanize.com/ctrl_board/4/185] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2448] [kanban:https://kanbanize.com/ctrl_board/4/185]
Updated•10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2448] [kanban:https://kanbanize.com/ctrl_board/4/185] → [kanban:https://kanbanize.com/ctrl_board/4/185]
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/185] → [kanban:https://webops.kanbanize.com/ctrl_board/2/90]
:cknowles, :ericz, is this still an issue? If not, I'd like to RESO WFM this.
Flags: needinfo?(eziegenhorn)
Flags: needinfo?(cknowles)
Comment 2•10 years ago
I have no reason to think this has improved but haven't tested recently.
Flags: needinfo?(eziegenhorn)
Reporter
Comment 3•10 years ago
I'm testing now. Will clear the needinfo when I have a determination
(In reply to Eric Ziegenhorn :ericz from comment #2)
> I have no reason to think this has improved but haven't tested recently.
AIUI there have been Zeus platform upgrades (we're up to 9.8 now) in the intervening time that *may* have improved the quality of its handling of your connections. So I'm interested to see if we can still repro the issue, because if we can, this may well be a bug for our vendor to address.
Reporter
Comment 5•10 years ago
Verified.
Ran the code without the 1 second sleep - no data was updated.
Ran it again with the 1 second sleep - data immediately appeared.
Still a problem.
Flags: needinfo?(cknowles)
Is there a script I can safely run to reproduce this issue, so I can do wire tracing of all involved moving pieces?
Reporter
Comment 7•10 years ago
As promised, I wrote up a quick script.
It's on admin1a.private.scl3.mozilla.com at /home/cknowles/bin/graphitetest.py
By default it sends data to the graphite relay in scl3, without a delay.
./graphitetest.py
Connected: ('10.22.75.39', 2003)
Sent: test.virtualization.esx.RanVal0 3 1426516960
test.virtualization.esx.RanVal1 5 1426516960
test.virtualization.esx.RanVal2 8 1426516960
Using the -d option, you get the 1-second delay.
./graphitetest.py -d
Connected: ('10.22.75.39', 2003)
Delayed 1 second
Sent: test.virtualization.esx.RanVal0 7 1426516984
test.virtualization.esx.RanVal1 3 1426516984
test.virtualization.esx.RanVal2 8 1426516984
I've created a user graph for these data points in SCL3, under cknowles@mozilla.com "testdata".
Given the way the values are displayed, it should be pretty clear when they're updating.
Let me know if you need anything else.
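Since the script lives on an internal host, here is a rough guess at what a reproduction script of this shape might look like, for anyone who wants to run the same test elsewhere. It is not the contents of graphitetest.py: the function names, the 0-9 value range, the metric count, and the `--delay`, `--host` and `--port` options are assumptions inferred from the output shown above (only `-d` is mentioned in this bug). Python 3.

```python
import argparse
import random
import socket
import time

# Assumed defaults, taken from the host and port mentioned in this bug.
DEFAULT_HOST = "graphite-relay.private.scl3.mozilla.com"
DEFAULT_PORT = 2003


def build_message(prefix="test.virtualization.esx", count=3, now=None):
    """Return `count` lines of the form '<prefix>.RanVal<i> <value> <timestamp>'."""
    timestamp = int(time.time() if now is None else now)
    lines = [
        "%s.RanVal%d %d %d" % (prefix, i, random.randint(0, 9), timestamp)
        for i in range(count)
    ]
    return "\n".join(lines) + "\n"


def parse_args(argv=None):
    """Parse the command line; -d turns on the 1-second delay."""
    parser = argparse.ArgumentParser(description="Send test metrics to graphite.")
    parser.add_argument("-d", "--delay", action="store_true",
                        help="sleep 1 second between connect and send")
    parser.add_argument("--host", default=DEFAULT_HOST)
    parser.add_argument("--port", type=int, default=DEFAULT_PORT)
    return parser.parse_args(argv)


def main(argv=None):
    """Connect, optionally delay, send one batch of random test values."""
    args = parse_args(argv)
    message = build_message()
    with socket.create_connection((args.host, args.port), timeout=10) as sock:
        print("Connected:", sock.getpeername())
        if args.delay:
            time.sleep(1)
            print("Delayed 1 second")
        sock.sendall(message.encode("ascii"))
    print("Sent:", message, end="")
```

Calling `main()` with no arguments sends without a delay, and `main(["-d"])` adds the 1-second sleep, so the two cases can be compared while a wire trace is running.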
Reporter
Comment 8•10 years ago
Just checked: the problem still persists, even after the upgrade.
Updated•9 years ago
QA Contact: nmaul → smani
Comment 9•9 years ago
Chris,
Let's leave this as is for now (you can keep the sleep in your code). We'll keep an eye on it, since we haven't seen this with anything else, and we'll take it up with Brocade if we see it elsewhere.
Thanks for the report :)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard