Closed Bug 1211133 Opened 10 years ago Closed 10 years ago

Sat 05:00:07 PDT [5073] vc1.ops.scl3.mozilla.com:vmware_vcenter is WARNING: [UCS1] vSphere HA failover in progress is yellow (http://m.mozilla.org/vmware_vcenter)

Categories

(Infrastructure & Operations :: Virtualization, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Usul, Assigned: cknowles)

Details

Filing a bug as per run book: Sat 05:00:07 PDT [5073] vc1.ops.scl3.mozilla.com:vmware_vcenter is WARNING: [UCS1] vSphere HA failover in progress is yellow (http://m.mozilla.org First alert was Sat 04:57:16 PDT [5042] graphite2.dev.private.scl3.mozilla.com (10.22.75.111) is DOWN :CRITICAL - Host Unreachable
web{1-3}.releng/webapp.scl3 had memory pegged after the failover was finished.
Make that {1-2}, and I restarted the hosts from VMware.
Looking at this now ... looks like node37.esxc1.ops.scl3.mozilla.com went nonresponsive for a bit. First record I see in the VC logs is that node37 went unreachable around 0757 Pacific. It seems to have come right back - but the VC did the right thing and restarted hosts on new blades. And looking at 37's uptime of 2:15 shows that it did indeed fall over at around that time. Checking the UCS side, no registered faults there. I'm putting 37 into maintenance mode so that no one is running on it, and I'm starting stats collection. I will open a case with VMware after that is complete.
Looking at the logs - web{1-2}.releng.webapp were not running on node37 at the time, and have no events around the time of the outage - web1 migrated to node48 last night and then 30 minutes after the outage I see you doing things ... web2 has been on node36 for a week - and again, no action until I see you doing things 30 minutes after the outage. I'm still collecting data, will open a case with VMware on the outage.
Case 15770852510 opened with VMware. Logs uploading now. Because I couldn't say the words "We're down and dead in the water" the response time is set to be first thing Monday morning.
Assignee: server-ops-virtualization → cknowles
Might have affected people doing a hackathon in Sri Lanka
:Usul, that's a month old bug, this should be having only sharp, acute effects, not a lingering issue.
And on reread I would like to retract ... generic1.db.scl3 dropped due to this. Which would totally cause the same symptom - but unlikely to be the same cause as that bug.
and mistype - generic2 is the one affected.
Timeline is off slightly in comment 3: first nagios alerts were 0457 PDT / 0757 EDT / 1157 UTC.
So, full timeline ... Per the VC and host logs. The timeline error was purely timezone related. :)
HA detected a potential issue with the host at 0455 PDT / 0755 EDT / 1155 UTC.
Host went non-responding at 0457 PDT / 0757 EDT / 1157 UTC. (At this point, HA swung into action and started booting affected VMs on other hosts.)
Host finished booting at 0501 PDT / 0801 EDT / 1201 UTC. At this point it was ready to take load again - and did.
Until I came along and put it into maintenance mode until we can get VMware to look at it. (~ 0719 PDT / 1019 EDT / 1419 UTC)
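Since the earlier confusion was purely a timezone mix-up, here is a small sketch of how the three-zone timestamps in the timeline can be cross-checked. It is only an illustration: the calendar date is a placeholder assumption (the comments only say "Sat"), the function name `render` is made up for this example, and the offsets are hard-coded to daylight time (PDT = UTC-7, EDT = UTC-4) rather than looked up from a timezone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed daylight-time offsets (assumption: DST is in effect, as the PDT/EDT labels imply).
PDT = timezone(timedelta(hours=-7), "PDT")
EDT = timezone(timedelta(hours=-4), "EDT")
UTC = timezone.utc

# Placeholder date, not taken from the logs; only the time of day matters here.
ASSUMED_DATE = (2015, 10, 3)


def render(hhmm_pdt: str) -> str:
    """Render an HHMM Pacific daylight time as 'HHMM PDT / HHMM EDT / HHMM UTC'."""
    hour, minute = int(hhmm_pdt[:2]), int(hhmm_pdt[2:])
    local = datetime(*ASSUMED_DATE, hour, minute, tzinfo=PDT)
    return " / ".join(local.astimezone(tz).strftime("%H%M %Z") for tz in (PDT, EDT, UTC))


# Events and Pacific times as stated in the timeline comment.
events = [
    ("HA detected a potential issue", "0455"),
    ("Host went non-responding", "0457"),
    ("Host finished booting", "0501"),
    ("Host put into maintenance mode (approx.)", "0719"),
]

for label, pacific in events:
    print(f"{render(pacific)}  {label}")
```

Feeding in the "0757" from the first look at the VC logs as if it were Pacific gives 1457 UTC, three hours after the first nagios alert, which is how the mix-up shows itself.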
Reiterate timeline in comment 11 - which is correct on check and recheck. I've updated the timeline with VMware's case, and confirmed that we're due a response Monday morning.
Day 1 - they responded with "we're looking at the logs"
Day 3 - early - still no word - poked at them for an update on day 2.
Got a response ... "there is no definitive trace of why the host went unresponsive" There are some log entries that they want to talk about some more - I'm asking for more information - but in general they've given the green light to returning it to service. I'm going to wait to get that further information from them. It's worth noting that the host has been up since that reboot (albeit without load for most of that time)
Alright, working with VMware - due to the lack of definitive anything, and the lack of signs of hardware issues, they've asked we return it to service. I have done so, and it is taking load. I'll close this out when I'm happier with some uptime.
Alright, it's Wednesday - things continue well with no sign of issues - closing this out.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED