Bug 888674 (foopy87)

foopy87 problem tracking

RESOLVED FIXED

Status

P3
normal
RESOLVED FIXED
5 years ago
5 months ago

People

(Reporter: nthomas, Unassigned)

Tracking

Details

(Whiteboard: [buildduty][buildslaves][capacity])

Attachments

(1 attachment)

(Reporter)

Description

5 years ago
foopy87.p7.releng.scl1.mozilla.com didn't come back from the scl1 power outage (bug 888656)

<vinh>	dustin: Looks like foopy-87 might be hosed.
Can you take down foopy85.p7.releng.scl1.mozilla.com so I can pop foopy-87 into the slot to test?

We opted to defer that for a few days because we can tolerate a missing foopy.

Updated

5 years ago
Depends on: 889106
Current status: waiting for RMA.
:arr, in order to return capacity and stop releng deploys for foopies from alerting that the scripts can't connect here, I wonder if we have a spare identical system that we can hot-swap in here (even with a different DNS entry). When this RMA is done, we would just put the repaired machine back in place of whatever spare we end up putting here.

Final setup of the replacement foopy shouldn't be hard.
Flags: needinfo?(arich)
I presume you aren't using mozpool to request devices yet, which would allow you to use pandas from any rack (we built this into the management software so that the loss of any one machine wouldn't impact capacity significantly)?

As far as I know, iX does not provide spare hardware while we're RMAing machines.  I've followed up to ask DCOps in the RMA bug.

As a stopgap (if you're hard coding things), you could switch to a different set of pandas and their respective foopy, since I know we have an abundance of panda capacity.
Flags: needinfo?(arich)
Yeah, overall we're not starving for capacity. My primary reason for asking was "is there a spare machine we can get in place to avoid needing to care about this bug anymore" ;-)

Additionally, it would get this off my radar when doing code deploys, since we get errors about being unable to update this foopy (it exists in our devices.json allocation information).

There are ways around it if there is not a spare though.
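
One such workaround, as a minimal sketch only: have the deploy step skip foopies that are known to be down rather than erroring on them. This assumes devices.json maps each device name to an object with a "foopy" key; the helper name `foopies_to_update`, the `skip` parameter, and the sample data are hypothetical and are not taken from the actual releng tooling.

```python
import json


def foopies_to_update(devices_json_text, skip=frozenset()):
    """Return the sorted foopy hostnames referenced in devices.json,
    leaving out any host listed in `skip` (e.g. one awaiting RMA)."""
    devices = json.loads(devices_json_text)
    foopies = {
        info["foopy"]
        for info in devices.values()
        if isinstance(info, dict) and info.get("foopy")
    }
    return sorted(foopies - set(skip))


# Hypothetical sample; the real devices.json layout may differ.
SAMPLE = json.dumps({
    "panda-0685": {"foopy": "foopy87"},
    "panda-0686": {"foopy": "foopy87"},
    "panda-0672": {"foopy": "foopy86"},
})

if __name__ == "__main__":
    # foopy87 is excluded, so only foopy86 would be deployed to.
    print(foopies_to_update(SAMPLE, skip={"foopy87"}))
```

Keeping the skip list outside devices.json would leave the allocation data untouched, so nothing would need to be reverted once the machine returns from RMA.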
(Assignee)

Updated

5 years ago
Product: mozilla.org → Release Engineering
Host is down again. Can we try to see what is wrong, and specifically what iX did in the previous RMA:

https://bugzilla.mozilla.org/show_bug.cgi?id=889106#c12
Flags: needinfo?(arich)
Please coordinate with DCOps for triage/replacement.
Flags: needinfo?(arich)

Updated

5 years ago
Depends on: 945778
There's a thing running that's called foopy87, but it doesn't appear to have any devices running successfully on it. Anyone know the status here?
Flags: needinfo?(bugspam.Callek)
Unsure off the top of my head. DCOps?
Flags: needinfo?(bugspam.Callek) → needinfo?(vhua)

Comment 9

5 years ago
This node was replaced by iX Systems from bug 889106.  Right now I can ping and ssh to it.  Did you want me to run a hard disk and memtest on it?
Flags: needinfo?(vhua)
(In reply to Vinh Hua [:vinh] from comment #9)
> This node was replaced by iX Systems from bug 889106.  Right now I can ping
> and ssh to it.  Did you want me to run a hard disk and memtest on it?

Yes please, and then please reimage it for extra sanity.
Flags: needinfo?(vhua)

Comment 11

5 years ago
Running memtest.
Flags: needinfo?(vhua)

Comment 12

5 years ago
Memtest and hard disk diagnostics came back clean. I tried to reimage, but it keeps giving me an error message indicating not enough disk space.

Comment 13

5 years ago
Created attachment 8370856 [details]
Screenshot from 2014-02-05 12:41:17.png

Rebooting this machine always brings it back to this screen.

Comment 14

5 years ago
[vle@boris ~]$ /usr/sbin/fping foopy87.p7.releng.scl1.mozilla.com
foopy87.p7.releng.scl1.mozilla.com is alive
[vle@boris ~]$ ssh !$
ssh foopy87.p7.releng.scl1.mozilla.com
The authenticity of host 'foopy87.p7.releng.scl1.mozilla.com (10.12.134.23)' can't be established.
RSA key fingerprint is ee:4d:22:23:1d:b3:3d:78:bf:8d:7e:d1:03:0b:b5:6e.
Are you sure you want to continue connecting (yes/no)?
(Reporter)

Comment 15

5 years ago
Nagios is alerting about this box since:
Thu 14:54:59 PDT [4857] foopy87.p7.releng.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

Doesn't seem to be related to bug 985129.
(In reply to Nick Thomas [:nthomas] from comment #15)
> Nagios is alerting about this box since:
> Thu 14:54:59 PDT [4857] foopy87.p7.releng.scl1.mozilla.com is DOWN :PING
> CRITICAL - Packet loss = 100%
> 
> Doesn't seem to be related to bug 985129.

Certainly not from Bug 985129. I suspect this is a bad machine that either has a boot CD in it (per :armen's c#13) or has other hardware problems.

Updated

5 years ago
Assignee: nobody → armenzg

Updated

5 years ago
Depends on: 992119

Updated

5 years ago
Assignee: armenzg → nobody
Host is online, responding to pings, and SSH'able.  Current uptime:  06:47:06 up 8 days, 16:20,  1 user,  load average: 0.00, 0.00, 0.00
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
and back offline again :(

re-opening IT bug 992119
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
QA Contact: armenzg → bugspam.Callek
using devices.json as my guide:

[root@foopy87.p7.releng.scl1.mozilla.com builds]# for i in {0685..0697}; do mkdir panda-$i; chown cltbld.cltbld panda-$i; done
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago → 4 years ago
Resolution: --- → FIXED
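
The directory setup above hard-codes the panda range. As a hedged sketch, the same list could instead be derived from devices.json, assuming each device entry carries a "foopy" field. The function names and sample data below are hypothetical, and the ownership change to cltbld is left out because it needs root.

```python
import json
import tempfile
from pathlib import Path


def devices_on_foopy(devices_json_text, foopy):
    """Return sorted device names whose "foopy" field equals `foopy`."""
    devices = json.loads(devices_json_text)
    return sorted(
        name
        for name, info in devices.items()
        if isinstance(info, dict) and info.get("foopy") == foopy
    )


def create_device_dirs(base_dir, device_names):
    """Create one directory per device under base_dir; return the paths."""
    base = Path(base_dir)
    created = []
    for name in device_names:
        path = base / name
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created


# Hypothetical sample; the real devices.json layout may differ.
SAMPLE_DEVICES = json.dumps({
    "panda-0685": {"foopy": "foopy87"},
    "panda-0686": {"foopy": "foopy87"},
    "panda-0672": {"foopy": "foopy86"},
})

if __name__ == "__main__":
    names = devices_on_foopy(SAMPLE_DEVICES, "foopy87")
    # A temporary directory stands in for the builds directory here.
    with tempfile.TemporaryDirectory() as builds:
        for path in create_device_dirs(builds, names):
            print(path.name)
```

Deriving the names from the allocation file would avoid drift if the panda-to-foopy mapping changes; a call such as `shutil.chown(path, "cltbld", "cltbld")` could be added per directory when running as root.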

Updated

5 months ago
Product: Release Engineering → Infrastructure & Operations