Closed Bug 690051 Opened 13 years ago Closed 12 years ago

rust-mac1.mv.mozilla.com is unresponsive

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: andersrb, Unassigned)

References

Details

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:9.0a1) Gecko/20110926 Firefox/9.0a1
Build ID: 20110926030901

Steps to reproduce:

This machine hasn't been performing its duties for a few days. I can ping it but can't shell in. Can we get a reboot?
Depends on: 688540
Assignee: server-ops-labs → gozer
Assignee: gozer → server-ops-labs
I'll smack it on Monday. This machine is falling over a lot; are you doing something particularly evil to it?
Assignee: server-ops-labs → zandr
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
I'm not aware of any evil being committed against this machine.
Kicking over to the Relops queue since that lets me set colo-trip: mtv1
Assignee: zandr → server-ops-releng
Component: Server Operations: Labs → Server Operations: RelEng
colo-trip: --- → mtv1
Assignee: server-ops-releng → jwatkins
This machine remains offline. Can we get someone to poke at it? Might not take much more than a power-cycle.
Machine has been poked (power-cycled)
Thanks, back online.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
This machine keeps going "unresponsive". I am able to ssh in, and everything appears normal, but it's dead until I ssh in. I've dug further into its syslogs and it appears to be ... powering off? Idling? I think it's waking up only when I ssh to it. It's a build slave, so it actually has to _not sleep_, even if it's not busy at the moment. I would run 'pmset sleep 0' on it, but I do not have root access, only administrator access.

I compared configuration to the other mac we're using (rust-mac2), which does _not_ randomly fall off the net, and it seems like they have the same power settings, but mac2 is constantly (every 5 seconds) associating and disassociating with a bluetooth mouse someone has left near the machine. So I think mac2 is only staying awake due to its mistaken belief that someone is "eternally wiggling its mouse". This can't be healthy.

Can someone do the following:

- Move the 'moco mouse' bluetooth mouse away from mac2
- log in to both mac1 and mac2 and 'pmset sleep 0' the machines (or whatever else is "normal make-the-mac-not-sleep configuration" for build slaves in mozilla clusters)
- Tell me what was done
- Possibly: email me the root password or put my ssh pubkey on root as well so I can do this sort of thing myself?

Thanks
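For reference, a minimal sketch of the 'pmset sleep 0' request above. It is not a confirmed IT procedure for these machines. The `-a` flag (apply to all power sources), the extra `disksleep 0` setting, the `APPLY` switch, and the helper name `no_sleep_command` are assumptions added for illustration, and the script assumes the account is allowed to use `sudo`.

```shell
#!/bin/sh
# Sketch only: build, and optionally apply, power settings that keep a Mac
# build slave from sleeping. Nothing is changed unless APPLY=1 is set and
# the script runs on macOS.

# Hypothetical helper: prints the pmset invocation.
# "sleep 0" disables system sleep; "disksleep 0" disables disk spin-down.
no_sleep_command() {
    echo "pmset -a sleep 0 disksleep 0"
}

if [ "$(uname -s)" = "Darwin" ] && [ "${APPLY:-0}" = "1" ]; then
    # Changing power settings needs root, hence sudo.
    sudo $(no_sleep_command)
    # Show the resulting settings so they can be reported back in the bug.
    pmset -g
else
    echo "would run: sudo $(no_sleep_command)"
fi
```

Run without `APPLY=1`, the script only prints the command it would run, which makes it safe to review before anything is changed on the machine.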
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Hi, Graydon, you should be able to change the power settings as Administrator (via the GUI at the very least) and change the root password as well. We don't have the password to these machines since they were built and handed off.
I forgot to mention that you should also be able to turn off bluetooth as well.
So it turns out that you never changed the Administrator password since these machines were built. I VNCed in and turned off sleep and bluetooth for you.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
This has gone from "randomly sleeping but available to wake-on-LAN" to "totally unreachable". Can someone take another look at it? Whatever was attempted last time made matters worse.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
FYI, there are lots of network problems w/ the entire labs cluster as part of the move. It's nothing rust specific in other words =)
(In reply to David Ascher (:davida) from comment #12)
> FYI, there are lots of network problems w/ the entire labs cluster as part
> of the move. It's nothing rust specific in other words =)

I think it is. Unless by "move" you mean a longer-term move (in MV). I've been meaning to point this out for a week or two; it's been unreachable for some time.
This isn't a labs box, afaik. It's a one-off, basically. We should get you admin on this mini, and turn it back on. Luckily, all of the flags are still set for that from the last time it was reopened.
I'm pretty sure this mini is toast. It powers up but doesn't get past the grey boot screen. Sending it to Desktop for repairs.
Depends on: 750549
rust-mac2.mv.mozilla.com has also become unreachable. Just now. That's all our macs, so rust development is now at "no test coverage" on macs. Can we get some loaner macs? Or can someone take a look at what happened to mac2?
rust-mac2 was asleep. I woke it up and it seems to be running now.

Connection to 10.250.1.239 22 port [tcp/ssh] succeeded!
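The "Connection to ... succeeded!" line has the shape of netcat's verbose port-check output; that the check was done with `nc` is an assumption. A small sketch of the same check as a reusable function follows. The function name `ssh_reachable` and the 5-second timeout are illustrative choices.

```shell
#!/bin/sh
# Sketch: succeed if a TCP connection to the host's ssh port can be opened.
# Usage: ssh_reachable HOST [PORT]   (PORT defaults to 22)
ssh_reachable() {
    host=$1
    port=${2:-22}
    # -z: only probe the port, send no data; -w 5: give up after 5 seconds.
    nc -z -w 5 "$host" "$port"
}

# Example, using the address reported in this bug:
#   if ssh_reachable 10.250.1.239; then echo "ssh port open"; else echo "unreachable"; fi
```

A machine that answers ping but fails this check matches the original report of "I can ping it but can't shell in".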
Ok. In https://bugzilla.mozilla.org/show_bug.cgi?id=690051#c10 the sleep function on mac2 was supposedly turned off, yet it slept here. Is there a normal routine of setting macs so they don't fall asleep? (I realize this stuff is fussy, I read the instructions concerning sleep-inhibition a dozen times and couldn't make heads or tails of them. There appear to be multiple overlapping tools and settings. Just wondering if IT has a standard incantation...)
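One way to check whether the sleep setting actually stuck is to read it back. The sketch below is not an official IT routine. It assumes `pmset -g` prints one setting per line as a name followed by its value in minutes; the helper names `sleep_minutes` and `sleep_disabled` are hypothetical.

```shell
#!/bin/sh
# Sketch: read `pmset -g` style output on stdin and report the system
# sleep timer. A value of 0 means the machine should never idle-sleep.

# Prints the value from the first line whose setting name is exactly "sleep"
# (so "displaysleep" and "disksleep" are ignored).
sleep_minutes() {
    awk '$1 == "sleep" { print $2; exit }'
}

# Succeeds only when the sleep timer read from stdin is 0.
sleep_disabled() {
    [ "$(sleep_minutes)" = "0" ]
}

# Only query the real settings where pmset exists (macOS).
if command -v pmset >/dev/null 2>&1; then
    if pmset -g | sleep_disabled; then
        echo "system sleep is disabled"
    else
        echo "system sleep is still enabled"
    fi
fi
```

This covers only the idle sleep timer. It would not detect a machine put to sleep some other way, such as a physical press of the power button.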
More than likely the power button on rust-mac2 got knocked while extracting rust-mac1. Both minis were mounted in the same sonnet chassis and it was close to impossible to remove one without disturbing the other. Especially since the power and network cables are so taut. I did my best, my apologies.
No worries. I mostly just want to stop bugging y'all. Appreciate the ongoing babysitting. I have a mini here as well I'll probably put back into service once the vancouver move completes, should add a little redundancy to our party.
rust-mac1 has returned from repair, where the hard drive was replaced. Can someone provide some insight as to what needs to be installed (e.g. the OS)?
Per IRC with :graydon, I've installed OS X Lion and set the username/password.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → FIXED
Apologies for the run-around. While attempting to put the machine back in service, I appear to have disabled the machine again. This was due to my mistaken belief that the GCC packages made by 3rd party packagers were still usable on Lion. This is no longer true. I need to use Xcode, and (sadly) install it via the app store, remotely operated by VNC. Unfortunately, in the process of discovering and attempting to repair (un-install) the broken GCC package, I wound up deleting substantial portions of the OS. I suspect it won't even boot anymore, due to my own damage. It needs to be re-imaged. I'm sorry. I would prefer to be doing all this manually, but at a distance I have to keep asking for help.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
With the reorg and relops no longer being part of specops, I'm going to pass this bug off to the SRE group to see who should be handling things for these machines going forward.
Assignee: jwatkins → server-ops
Component: Server Operations: RelEng → Server Operations
QA Contact: zandr → phong
This got taken to SCL1 yesterday to have a new image put on it again. It is now up and running back in 3.mdf.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard