Closed Bug 1379881 Opened 7 years ago Closed 7 years ago

sea-mini-osx64-1 is awol

Categories

(Infrastructure & Operations :: DCOps, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ewong, Assigned: van)

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1376192 +++

Sea-mini-osx64-1 is back to being AWOL.

Since this system has been AWOL far longer than it has worked, I'm
starting to suspect some sort of imminent hardware failure.

Can someone run a check disk (or whatever the equivalent of a
check disk is on an OS X system)?

I have removed it from the build pool so it can be operated on without
any concern of it getting jobs.

Thanks!
running verify disk on the mini, but its estimated time to complete keeps fluctuating between 15 minutes and 5 hours. will check again in a few hours and hopefully it'll allow us to run repair disk.
Assignee: server-ops-dcops → vle
verify disk completed, running repair disk. will check back on this tomorrow.
repair disk finished, it didn't report any issues and the drive is OK. i tried running diagnostics on this mini as well but it won't read any of our "Applications Install Disc 2" (tried 4 discs) in its DVD drive. the host is back online and i don't have a login.

is it a script or something locking up the host? are there any available logs to help you out? it didn't crash at all when i left it running overnight, running its repair.


[vle@jump1.community.scl3 ~]$ fping sea-mini-osx64-1.community.scl3.mozilla.com
sea-mini-osx64-1.community.scl3.mozilla.com is alive
[vle@jump1.community.scl3 ~]$ ssh sea-mini-osx64-1.community.scl3.mozilla.com
The authenticity of host 'sea-mini-osx64-1.community.scl3.mozilla.com (63.245.223.80)' can't be established.
RSA key fingerprint is e7:19:ba:03:b7:5b:02:8a:7a:0d:e5:7d:a6:8b:2a:35.
Are you sure you want to continue connecting (yes/no)?
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
(In reply to Van Le [:van] from comment #3)
> repair disk finished, it didn't report any issues and the drive is OK. i
> tried running diagnostics on this mini as well but it won't read any of our
> "Applications Install Disc 2" (tried 4 discs) in its DVD drive. the host is
> back online and i don't have a login. 
> 
> is it a script or something locking up the host? are there any available
> logs to help you out? it didn't crash at all when i left it running
> overnight, running its repair.
> 

It seems to happen sporadically (if that's even the right word to use). It seems to
happen during cloning, though I'm not sure whether cloning has anything to do with it.

Right now, I still cannot ssh into it even though ping works.

Callek, any idea?
Status: RESOLVED → REOPENED
Flags: needinfo?(bugspam.Callek)
Resolution: FIXED → ---
Just a thought: did you run an HD surface scan and a full memory test?
I'm a Mac sysadmin at work and this would be the first thing I'd try with a system freezing or crashing randomly.
However, I'm not aware of any free tools that perform hard disk surface scans; I usually use Micromat's TechTool Pro, which checks RAM and internal sensors too. Please tell me if I can help in any way.
(In reply to Andrea Govoni from comment #6)
> Just a thought: did you run an HD surface scan and a full memory test?
> I'm a Mac sysadmin at work and this would be the first thing I'd try with a
> system freezing or crashing randomly.
> However, I'm not aware of any free tools that perform hard disk surface scans; I
> usually use Micromat's TechTool Pro, which checks RAM and internal sensors
> too. Please tell me if I can help in any way.

I'm not sure as I don't have physical access to this system. 

But judging from comment #3, it seems to be ok.  

:van, can you reboot this system again?

Thanks
yah we did a surface scan, no issues. i also did the apple diagnostics which tests sensors/memory and it didn't come up with any issues. there could be underlying problems that the diagnostics can't pick up though.

host's been kicked and is back online.

[vle@jump1.community.scl3 ~]$ ping sea-mini-osx64-1.community.scl3.mozilla.com
PING sea-mini-osx64-1.community.scl3.mozilla.com (63.245.223.80) 56(84) bytes of data.
64 bytes from sea-mini-osx64-1.community.scl3.mozilla.com (63.245.223.80): icmp_seq=1 ttl=64 time=0.752 ms
^C
--- sea-mini-osx64-1.community.scl3.mozilla.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.752/0.752/0.752/0.000 ms
[vle@jump1.community.scl3 ~]$ ssh !$
ssh sea-mini-osx64-1.community.scl3.mozilla.com
The authenticity of host 'sea-mini-osx64-1.community.scl3.mozilla.com (63.245.223.80)' can't be established.
RSA key fingerprint is e7:19:ba:03:b7:5b:02:8a:7a:0d:e5:7d:a6:8b:2a:35.
Are you sure you want to continue connecting (yes/no)?
Thanks Van!  It's back up!
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Flags: needinfo?(bugspam.Callek)
Resolution: --- → FIXED
and it's back to being frozen..   

You know. I think it's something to do with either memory or hg or harddisk.

Right now, it's been at hg update for the past 5 hrs... and I'm guessing it's stuck on this
line:

/usr/local/bin/hg clone --verbose --noupdate https://hg.mozilla.org/releases/comm-esr52 build

And I know for a fact that it shouldn't take 4 hrs+ to clone comm-esr52, so it's pretty much
hung during cloning.
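
One way to stop a hung clone from tying up the builder for hours is to run it under a time limit. The following is only a sketch, assuming GNU coreutils `timeout` is on the PATH (as far as I know it is not part of a stock OS X install, so it would have to be added to the mini). `run_bounded` is a hypothetical helper name and the 2-hour limit in the usage comment is an arbitrary illustration, not a measured clone time.

```shell
#!/usr/bin/env bash
# run_bounded SECONDS CMD...: run CMD, but give up after SECONDS.
# Hypothetical helper; relies on GNU coreutils `timeout` being installed.
run_bounded() {
  local secs=$1
  shift
  timeout "$secs" "$@"
  local rc=$?
  # GNU timeout exits with status 124 when it had to stop the command.
  if [ "$rc" -eq 124 ]; then
    echo "timed out after ${secs}s: $*" >&2
  fi
  return "$rc"
}

# Usage (the 7200-second limit is an assumption, pick whatever fits):
# run_bounded 7200 /usr/local/bin/hg clone --verbose --noupdate \
#     https://hg.mozilla.org/releases/comm-esr52 build
```

An exit status of 124 then separates "the clone hung" from "the clone failed on its own", which might make the freezes easier to correlate with the cloning step.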

hd failure or memory screw up or something else.

So I'm gonna take it back off the builder list.

Callek: my original question still stands.. what should we do?
Status: RESOLVED → REOPENED
Flags: needinfo?(bugspam.Callek)
Resolution: FIXED → ---
and now.. I can't even ssh into it but it does respond to pings.
(In reply to Edmund Wong (:ewong) from comment #12)
> and now.. I can't even ssh into it but it does respond to pings.

or it's taking an inordinate amount of time to allow me to log in. I'm going to leave this
ssh seabld@sea-mini-osx64-1 session running and see if it connects.
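
For what it's worth, a ping reply only shows that the kernel is answering ICMP, while an ssh login also needs sshd and the rest of the system to respond, so a host can look alive to ping and still be unusable. Below is a minimal sketch for checking the ssh port with a bounded wait instead of leaving a session hanging. It assumes a Linux box such as the jump host with bash and GNU `timeout` available; `probe_tcp` is a hypothetical helper name, not an existing tool.

```shell
#!/usr/bin/env bash
# probe_tcp HOST PORT [SECONDS]: succeed only if a TCP connection to
# HOST:PORT can be opened within SECONDS (default 5).
# Hypothetical helper; needs bash (for /dev/tcp) and GNU `timeout`.
probe_tcp() {
  local host=$1 port=$2 secs=${3:-5}
  # bash treats /dev/tcp/HOST/PORT in a redirection as "open a TCP socket";
  # timeout(1) bounds how long we wait for that to happen.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Usage (hostname taken from this bug, the 10-second limit is an assumption):
# if probe_tcp sea-mini-osx64-1.community.scl3.mozilla.com 22 10; then
#   echo "ssh port is accepting connections"
# else
#   echo "ssh port is not answering"
# fi
```

Note that an open port only proves the connection was accepted; the login itself can still stall afterwards if the machine is wedged on disk or memory.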
(In reply to Edmund Wong (:ewong) from comment #11)
> and it's back to being frozen..   
> 
> Callek: my original question still stands.. what should we do?

If this were a MoCo machine I'd say decomm (after hardware diags don't turn up anything). I suspect either the disk is literally dying, the memory is dying, or both (it's also not unheard of for the entire motherboard to go in this hardware after this long).

I don't have good advice on "what to do" -- I might sometimes recommend reimage, but iirc we don't have a good way to do so here, nor a good way to bring it back up from a fresh OS install...
Flags: needinfo?(bugspam.Callek)
(In reply to Justin Wood (:Callek) from comment #14)
> (In reply to Edmund Wong (:ewong) from comment #11)
> > and it's back to being frozen..   
> > 
> > Callek: my original question still stands.. what should we do?
> 
> > If this were a MoCo machine I'd say decomm (after hardware diags don't turn
> > up anything). I suspect either the disk is literally dying, the memory is
> > dying, or both (it's also not unheard of for the entire motherboard to go in
> > this hardware after this long)
> 
> I don't have good advice on "what to do" -- I might sometimes recommend
> reimage, but iirc we don't have a good way to do so here, nor a good way to
> bring it back up from a fresh OS install...

Thanks for the advice. Since -1 isn't helping at all, there's little
point in keeping it around. Unfortunately, we don't have any replacements
and we're one osx64 short of a complete platform miss since -2 and -4
are, in essence, decommissioned (though not physically I think, right
:van? sea-mini-osx64-2 and 4 are still in production, right?)
Flags: needinfo?(vle)
they are still in the rack and have not been unplugged/removed.
Flags: needinfo?(vle)
:ewong, what are the plans for these minis after SCL3? do you plan to buy/install new ones into the new data centers or are you guys using a 3rd party like Mac Stadium?
(In reply to Van Le [:van] from comment #17)
> :ewong, what are the plans for these minis after SCL3? do you plan to
> buy/install new ones into the new data centers or are you guys using a 3rd
> party like Mac Stadium?

We're still considering both options, though we're not sure whether
the higher-ups want to have 'community'-based stuff in the new
data centre.

Thanks!
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED