Closed Bug 1025145 Opened 11 years ago Closed 9 years ago

Zeus loses graphite data for new connections

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: cknowles, Unassigned)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/90] )

I've recently been working to get some data into graphite, and have been having a boatload of trouble. (I know, you're asking why this is labeled as a Zeus problem... hang on.)

The upshot of the difficulty: my Python code opens a socket, connects, sends data and closes. The concern is that graphite would only actually notice the incoming data a few times in a hundred. I've found two things that work:

1) putting a one-second delay between opening the socket and sending the data, and
2) connecting *not* to graphite-relay.private.scl3.mozilla.com, but instead to the actual non-Zeus address, graphite6.private.scl3.mozilla.com.

<speculation> So, it feels like Zeus accepts the connection from the client, then tries to open the connection to the server, and in the meantime drops some or all of the data that was sent while it was negotiating with the server. </speculation>

Certainly waiting 1 second makes the connection foolproof. With the previous day's attempts I'd be lucky to get 5 ten-minute data points in a 4-hour period, whereas I currently see my full 24 since I started this method this morning.

Now, I've got a perfectly acceptable workaround; I mainly wanted to mention this here in case there are other similar complaints rolling around. If you have any questions/comments/concerns, please let me know.

---

Non-working code, where message was a \n-terminated list of strings:

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((socket.gethostbyname('graphite-relay.private.scl3.mozilla.com'), 2003))
    sock.sendall(message)
    sock.close()

Working code, where message was a \n-terminated list of strings:

    import socket
    import time

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((socket.gethostbyname('graphite-relay.private.scl3.mozilla.com'), 2003))
    time.sleep(1)  # That's the difference, 1 second
    sock.sendall(message)
    sock.close()
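For anyone wanting to try both workarounds from one place, here is a minimal sketch of the same connect/send/close sequence with the delay and the endpoint as parameters. The function name `send_plaintext`, the 10-second timeout, and the sample metric path are assumptions made for illustration; only the hostnames, port 2003, and the one-second sleep come from the report above. It does not explain or fix the data loss, it only packages the two workarounds.

```python
import socket
import time


def send_plaintext(address, message, delay=0.0):
    """Open a TCP connection, optionally pause, then send the whole message.

    address: (host, port) tuple
    message: bytes, one newline-terminated line per data point
    delay:   seconds to wait between connect and send (workaround 1)
    """
    with socket.create_connection(address, timeout=10) as sock:
        if delay > 0:
            time.sleep(delay)
        sock.sendall(message)
    # Leaving the with-block closes the socket.


# Hypothetical data point in the "<path> <value> <timestamp>" line format
# used by the messages in this bug.
message = f"test.example.metric 42 {int(time.time())}\n".encode("ascii")

# Workaround 1: via the relay, with the one-second pause.
# send_plaintext(("graphite-relay.private.scl3.mozilla.com", 2003), message, delay=1.0)
# Workaround 2: straight to the non-Zeus address, no pause.
# send_plaintext(("graphite6.private.scl3.mozilla.com", 2003), message)
```

The two calls are left commented out because they only make sense from a host that can reach those addresses.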
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/185]
Summary: Zeus and data into graphite → Zeus loses graphite data for new connections
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/185] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2441] [kanban:https://kanbanize.com/ctrl_board/4/185]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2441] [kanban:https://kanbanize.com/ctrl_board/4/185] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2448] [kanban:https://kanbanize.com/ctrl_board/4/185]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2448] [kanban:https://kanbanize.com/ctrl_board/4/185] → [kanban:https://kanbanize.com/ctrl_board/4/185]
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/185] → [kanban:https://webops.kanbanize.com/ctrl_board/2/90]
:cknowles, :ericz, is this still an issue? If not, I'd like to RESO WFM this.
Flags: needinfo?(eziegenhorn)
Flags: needinfo?(cknowles)
I have no reason to think this has improved but haven't tested recently.
Flags: needinfo?(eziegenhorn)
I'm testing now. Will clear the needinfo when I have a determination.
(In reply to Eric Ziegenhorn :ericz from comment #2)
> I have no reason to think this has improved but haven't tested recently.

AIUI there have been Zeus platform upgrades (we're up to 9.8 now) in the intervening time that *may* have improved the quality of its handling of your connections. So I'm interested to see if we can still repro the issue, because if we can, this may well be a bug for our vendor to address.
Verified. Ran the code without the 1 second sleep - no data was updated. Ran it again with the 1 second sleep - data immediately appeared. Still a problem.
Flags: needinfo?(cknowles)
Is there a script I can safely run to reproduce this issue, so I can do wire tracing of all involved moving pieces?
As promised, I wrote up a quick script. It's on admin1a.private.scl3.mozilla.com at /home/cknowles/bin/graphitetest.py.

By default it sends data to the graphite relay in scl3, without a delay:

    ./graphitetest.py
    Connected: ('10.22.75.39', 2003)
    Sent: test.virtualization.esx.RanVal0 3 1426516960
    test.virtualization.esx.RanVal1 5 1426516960
    test.virtualization.esx.RanVal2 8 1426516960

With the -d option, you'll get the 1-second delay:

    ./graphitetest.py -d
    Connected: ('10.22.75.39', 2003)
    Delayed 1 second
    Sent: test.virtualization.esx.RanVal0 7 1426516984
    test.virtualization.esx.RanVal1 3 1426516984
    test.virtualization.esx.RanVal2 8 1426516984

I've created a user graph for these data points in SCL3, under cknowles@mozilla.com "testdata". Given the way the values are displayed, it should be pretty clear when they're updating. Let me know if you need anything else.
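The source of graphitetest.py is not attached to this bug, so the following is only a hypothetical reconstruction for readers without access to admin1a. It is a sketch that mimics the output shown above, assuming three metrics named test.virtualization.esx.RanVal0 through RanVal2, random single-digit values, and a -d flag that sleeps for one second. The function names, the 0 to 9 value range, and the 10-second timeout are assumptions, not taken from the real script.

```python
import argparse
import random
import socket
import time

# Relay host and port as given in the original report.
RELAY = ("graphite-relay.private.scl3.mozilla.com", 2003)


def build_message(now=None, count=3):
    """Build newline-terminated lines of the form '<path> <value> <timestamp>'."""
    ts = int(time.time() if now is None else now)
    return "".join(
        # Assumed value range: the sample output only shows single digits.
        f"test.virtualization.esx.RanVal{i} {random.randint(0, 9)} {ts}\n"
        for i in range(count)
    )


def run(address=RELAY, delay=False):
    """Connect, optionally wait one second, send the message, and report it."""
    message = build_message()
    with socket.create_connection(address, timeout=10) as sock:
        print("Connected:", sock.getpeername())
        if delay:
            time.sleep(1)
            print("Delayed 1 second")
        sock.sendall(message.encode("ascii"))
    print("Sent:", message, end="")
    return message


def main(argv=None):
    parser = argparse.ArgumentParser(description="Send test data points to graphite.")
    parser.add_argument("-d", "--delay", action="store_true",
                        help="sleep 1 second between connect and send")
    args = parser.parse_args(argv)
    run(delay=args.delay)
```

To use this as a script, call `main()` at the bottom of the file; `main(["-d"])` corresponds to the delayed run shown above.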
Depends on: 1164509
Just checked: the problem still persists, even after the upgrade.
QA Contact: nmaul → smani
Chris, let's leave this as is for now (you can keep the sleep in your code). We haven't seen this elsewhere, with anything else, so we'll keep an eye on it and take it up with Brocade if we do see it elsewhere. Thanks for the report :)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard