Closed Bug 1069119 Opened 11 years ago Closed 11 years ago

sea-vm-win32-{1-4} are not responding to rdp connections

Categories: mozilla.org Graveyard :: Server Operations (task)
Platform: x86, Windows Server 2003
Type: task
Priority: Not set
Severity: normal

Tracking: Not tracked

RESOLVED FIXED

People

(Reporter: ewong, Assigned: cknowles)

References

Details

+++ This bug was initially created as a clone of Bug #1069117 +++

I can ping and ssh into sea-vm-win32-{1-4} from jump1. However, RDPing into them via the jump1 tunnel doesn't connect.
All four of them were complaining that they'd exhausted memory, even swap. Rebooted, and RDP started responding on all of them.
Assignee: server-ops-virtualization → cknowles
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
(In reply to Chris Knowles [:cknowles] from comment #1)
> All four of them were complaining that they'd exhausted memory, even swap -
> rebooted and RDP started responding on all of them.

Hi Chris, I'm sorry to report but they're back down. :(
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Moving to the general queue (just because it's a VM doesn't mean it's a virtualization bug). Copying :callek (bless 'im, he seems to know if anyone else needs copying on seamonkey things). Leaving :cknowles as assigned... no real reason, probably not the right choice.

I saw swap exhaustion on one node, and based on Chris' earlier diagnosis of memory exhaustion I gave all 4 a reboot. That this is starting to look like a pattern suggests more triage is needed as to what's eating all the RAM.
Status: REOPENED → NEW
Component: Server Operations: Virtualization → Server Operations
QA Contact: cshields → shyam
(In reply to Greg Cox [:gcox] (plz don't needinfo me) from comment #3)
> Moving to the general queue (just because it's a VM doesn't mean it's a
> virtualization bug).

Thanks Greg. I'll keep that in mind.

> Copying :callek (bless 'im, he seems to know if anyone else needs copying on
> seamonkey things).
> Leaving :cknowles as assigned... no real reason, probably not the right
> choice.
>
> I saw swap exhaustion on one node, and based on Chris' earlier diagnosis of
> memory exhaustion I gave all 4 a reboot. That this is starting to be
> patternistic suggests more triage is needed as to what's eating all the RAM.

I'm wondering if a run-away process is horking the system.
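The runaway-process hunch could be checked from an ssh session before the next reboot. A minimal sketch (not from the bug; the helper name and the assumption that a Python interpreter is available on or near the w2k3 hosts are mine) that ranks entries from Windows' `tasklist /FO CSV` output by memory usage:

```python
import csv
import io

def top_memory_processes(csv_text, n=5):
    """Parse `tasklist /FO CSV` output and return the n largest
    processes as (image_name, mem_kb) pairs, descending by memory."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    procs = []
    for row in rows[1:]:  # skip the header row
        if len(row) < 5:
            continue
        name, mem = row[0], row[4]
        # "Mem Usage" looks like "12,345 K"; strip separators and units.
        kb = int(mem.replace(",", "").replace("K", "").strip())
        procs.append((name, kb))
    return sorted(procs, key=lambda p: p[1], reverse=True)[:n]

# Usage sketch: on the host, run `tasklist /FO CSV > tasks.csv`, then feed
# the file contents to top_memory_processes() and eyeball the top entries.
```

The "Mem Usage" column format is locale-dependent, so the parsing here is an assumption for an English-locale w2k3 box.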
Hey Greg, can you/someone help us get these hosts back online? Sadly it's a hard-to-diagnose issue on our end (we just switched to "mozmake" instead of MSYS-based GNU Make). The setup is running fine on moco systems (w2k8, while we have w2k3) and seems to run fine on non-debug SeaMonkey builds; it's just the debug builds that seem to cause this issue. We are also running tests on these systems, unlike moco jobs.

Diagnosing and actually fixing it will take some elbow grease on our part; I'm not aware of any smoking gun at this point. Unfortunately that time and elbow grease may mean a few repeated bugs for these being down (or at least reopens of this one). I ask for your patience while we do so, and am particularly open to screams/concerns from your end if this even comes close to destabilizing the ESX cluster, which AIUI it's not currently doing.

I'm also *willing* to get the access and a brief training on how to do this very work so we don't bother you guys, but I'm unsure if there are political/procedural hurdles to cross that make it an untenable option, since we hope this is a short-term solution.

TL;DR: please reboot these from the admin console, and don't worry about diagnosing any current state; we'll try to handle that on our end for now.
Flags: needinfo?(gcox)
Couldn't give them a graceful reboot; they'd gone unresponsive. Kicked harder, and they seem back. From what I've seen on the console / VM resources, it looks like they're running out of RAM. It doesn't seem to be adversely affecting the cluster. It's more that they're in that "community, so it's not really supported" vs. "building product, so it kinda is" limbo that is going to have to be sorted out someday, for exactly cases like this.
Flags: needinfo?(gcox)
They've been back up for a few days now.
Any word on these, or do you need anything so that we can close this out?
Flags: needinfo?(ewong)
(In reply to Chris Knowles [:cknowles] from comment #8)
> Any word on these, or do you need anything so that we can close this out?

Ah yes, sorry. This bug is good for resolution. Thanks.
Flags: needinfo?(ewong)
Alright - let us know if you need anything further.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard