web5.bugs.scl3 issues reporting to graphite

Status

VERIFIED FIXED
3 years ago
3 years ago

People

(Reporter: glob, Assigned: fubar)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1581])

(Reporter)

Description

3 years ago
web5.bugs.scl3 is currently idle: http://glob.uno/graphite/?webheads_memory

at first i thought it was a system impacted by The Great SCL3 Filter Incident™ (bug 1172666), however it isn't in the list of impacted systems.

it doesn't appear to be down either -- i had no issues during today's push.
(Assignee)

Comment 1

3 years ago
According to the apache log, it's been serving requests all morning. Your graph just now shows a bump of memory usage in the last hour or so; I'm wondering if something went odd with collectd/graphite - I can't get into graphite-relay to check, though.

top - 05:56:30 up 88 days, 13:23,  1 user,  load average: 0.13, 0.28, 0.30
Tasks: 182 total,   1 running, 181 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.8%us,  0.3%sy,  0.2%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    15.578G total, 7171.812M used, 8779.762M free,   98.465M buffers
Swap: 2047.996M total,   33.020M used, 2014.977M free, 1675.277M cached
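
(One way to separate "the host is idle" from "graphite isn't receiving data" is to ask graphite-web for the raw datapoints directly. A rough sketch in python follows; the graphite base URL and the metric path are assumptions, not values taken from this bug.)

# sketch: does graphite have any recent datapoints for web5's memory metric?
import json
import urllib.request

GRAPHITE = "http://graphite.example.com"                  # assumed graphite-web URL
METRIC = "collectd.web5_bugs_scl3.memory.memory-used"     # assumed metric path

url = "%s/render?target=%s&from=-2hours&format=json" % (GRAPHITE, METRIC)
with urllib.request.urlopen(url) as resp:
    series = json.load(resp)

for s in series:
    # datapoints are [value, timestamp] pairs; value is None when nothing arrived
    real = [(v, ts) for v, ts in s["datapoints"] if v is not None]
    if real:
        print("%s: last point %s at %s" % (s["target"], real[-1][0], real[-1][1]))
    else:
        print("%s: no datapoints in the last two hours" % s["target"])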
Assignee: nobody → klibby
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → WORKSFORME
(Reporter)

Comment 2

3 years ago
graphite still appears to be unhappy. this appears to have started some time on the 5th of june.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
(Assignee)

Comment 3

3 years ago
changing summary to reflect what appears to be the actual issue. related, there have been infra issues around graphite/collectd for the past week or three.
Summary: web5.bugs.scl3 doesn't appear to be handling any requests → web5.bugs.scl3 issues reporting to graphite
(Assignee)

Comment 4

3 years ago
I'm still convinced that graphite isn't getting the correct data, but I can't find anything in the logs that indicates a problem (not that graphite/collectd/zeus are actually logging much).

restarting collectd on web5.bugs.scl3 caused a jump in the graph that puts it in the same ballpark as web4, but collectd is restarted every morning on its own. graphite shows data for it for about 6 hrs this morning - but it's very choppy unlike the other web heads - and then it drops to zero again.

... and just now, the jump in mem usage for web5 disappeared in graphite. poking at it a bit (e.g. removing web5 and re-adding it) caused it to reappear, and then it disappeared again. >.<

the host is clearly handling traffic and dstat shows correct data. moving this over to webops to poke graphite with sharp sticks.
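
(The "choppy, then drops to zero" behaviour described above can be pinned down the same way: pull web4 and web5 side by side from the render API and see how many points each series actually has. Another rough sketch, with the same assumed graphite URL and metric naming as before.)

# sketch: compare how well-populated the web4 and web5 series are over the last two weeks
import json
import urllib.parse
import urllib.request

GRAPHITE = "http://graphite.example.com"                      # assumed graphite-web URL
TARGETS = "collectd.web{4,5}_bugs_scl3.memory.memory-used"    # assumed metric naming

url = "%s/render?target=%s&from=-14days&format=json" % (GRAPHITE, urllib.parse.quote(TARGETS))
with urllib.request.urlopen(url) as resp:
    series = json.load(resp)

for s in series:
    points = s["datapoints"]
    real = [ts for v, ts in points if v is not None]
    last = real[-1] if real else "never"
    print("%s: %d of %d points populated, last real point at %s"
          % (s["target"], len(real), len(points), last))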
Component: Infrastructure → WebOps: Other
Flags: needinfo?(eziegenhorn)
Product: bugzilla.mozilla.org → Infrastructure & Operations
QA Contact: mcote → smani
Version: Production → other

Updated

3 years ago
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1581]

Comment 5

3 years ago
I believe this was caused by the old web5 blade (pre-p2v) powering back on and sending junk data to Graphite.  I've shut down the offending blade and Graphite's data looks much nicer to me now.  I've made bug 1195852 to permanently disable that blade.  Let me know how the data looks to you in the last half hour or so and then we can hopefully close this.
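
(Context on how a stray blade can pollute the graphs: carbon's plaintext protocol identifies a series only by its metric path, one "path value timestamp" line per datapoint, so any host that can reach the relay can write points for web5. A minimal sketch of such a write; the relay hostname, port and metric path are assumptions.)

# sketch: a carbon plaintext write; nothing ties the metric path to the sending host
import socket
import time

RELAY = ("graphite-relay.example.com", 2003)   # assumed relay host; 2003 is carbon's default plaintext port
line = "collectd.web5_bugs_scl3.memory.memory-used 123456.0 %d\n" % int(time.time())

with socket.create_connection(RELAY, timeout=5) as sock:
    sock.sendall(line.encode("ascii"))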
Status: REOPENED → ASSIGNED
Flags: needinfo?(eziegenhorn)

Updated

3 years ago
See Also: → bug 1195852
(Assignee)

Comment 6

3 years ago
(In reply to Eric Ziegenhorn :ericz from comment #5)
> I believe this was caused by the old web5 blade (pre-p2v) powering back on
> and sending junk data to Graphite.  

waaaaaat

> Let me know how the data looks to you in the last half hour or so and then we can hopefully close this.

yep, looks much better. thanks!
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(Reporter)

Updated

3 years ago
Status: RESOLVED → VERIFIED
Just so we won't lose more sleep, I checked web[1-4] too. Those nodes were Seamicros and that whole chassis has been decommissioned and *unracked*. web5 was a lone HP blade...