Closed Bug 888674 (foopy87) Opened 12 years ago Closed 11 years ago

foopy87 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)

References

Details

(Whiteboard: [buildduty][buildslaves][capacity])

Attachments

(1 file)

foopy87.p7.releng.scl1.mozilla.com didn't come back from the scl1 power outage (bug 888656).
<vinh> dustin: Looks like foopy-87 might be hosed. Can you take down foopy85.p7.releng.scl1.mozilla.com so I can pop foopy-87 into the slot to test?
We opted to defer that for a few days because we can tolerate a missing foopy.
Depends on: 889106
Current status: waiting for RMA.
:arr, in order to return capacity and stop releng foopy deploys from alerting that the scripts can't connect here, I wonder if we have a spare identical system that we can hot-swap in (even with a different DNS entry). Then, when this RMA is done, we just put the repaired machine back in place of whatever spare we end up using. Final setup of the replacement foopy shouldn't be hard.
Flags: needinfo?(arich)
I presume you aren't using mozpool to request devices yet, which would allow you to use pandas from any rack (we built this into the management software so that the loss of any one machine wouldn't impact capacity significantly)? As far as I know, iX does not provide spare hardware while we're RMAing machines. I've followed up to ask DCOps in the RMA bug. As a stopgap, (if you're hard coding things), you could switch to a different set of pandas and their respective foopy, since I know we have an abundance of panda capacity.
Flags: needinfo?(arich)
Yeah, overall we're not starving for capacity. My primary reason for asking was "is there a spare machine we can get in place to avoid needing to care about this bug anymore" ;-) And additionally to get it off my radar when doing code deploys, since we get errors about being unable to update this foopy (it still exists in our devices.json allocation information). There are ways around it if there is no spare, though.
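One of the "ways around it" could be to have the deploy step skip foopies that are known to be down. The sketch below is only an illustration: it assumes devices.json maps each device name to an object with a "foopy" key, and the function names and sample data are hypothetical rather than taken from the actual deploy scripts.

```python
import json

# Hypothetical sample; the real devices.json layout is an assumption here.
SAMPLE_DEVICES = json.loads("""
{
  "panda-0685": {"foopy": "foopy87"},
  "panda-0686": {"foopy": "foopy87"},
  "panda-0660": {"foopy": "foopy85"}
}
""")


def devices_by_foopy(devices):
    """Group device names under the foopy that hosts them."""
    grouped = {}
    for name, info in devices.items():
        foopy = info.get("foopy")
        if foopy:
            grouped.setdefault(foopy, []).append(name)
    return grouped


def split_deploy_targets(devices, known_down):
    """Return (foopies to deploy to, foopies skipped as known down)."""
    grouped = devices_by_foopy(devices)
    targets = sorted(f for f in grouped if f not in known_down)
    skipped = sorted(f for f in grouped if f in known_down)
    return targets, skipped


if __name__ == "__main__":
    targets, skipped = split_deploy_targets(SAMPLE_DEVICES, {"foopy87"})
    print("deploy to:", targets)
    print("skipped:", skipped)
```

Keeping the skip list outside devices.json means the allocation data stays untouched while the host is out for RMA.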
Product: mozilla.org → Release Engineering
Host is down again, can we try and see what is wrong -- and specifically what iX did in the previous RMA: https://bugzilla.mozilla.org/show_bug.cgi?id=889106#c12
Flags: needinfo?(arich)
Please coordinate with DCOps for triage/replacement.
Flags: needinfo?(arich)
Depends on: 945778
There's a thing running that's called foopy87, but it doesn't appear to have any devices running successfully on it. Anyone know the status here?
Flags: needinfo?(bugspam.Callek)
Unsure off the top of my head. DCOps?
Flags: needinfo?(bugspam.Callek) → needinfo?(vhua)
This node was replaced by iX Systems from bug 889106. Right now I can ping and ssh to it. Did you want me to run a hard disk and memtest on it?
Flags: needinfo?(vhua)
(In reply to Vinh Hua [:vinh] from comment #9)
> This node was replaced by iX Systems from bug 889106. Right now I can ping
> and ssh to it. Did you want me to run a hard disk and memtest on it?

Please, then please reimage it for extra sanity.
Flags: needinfo?(vhua)
Running memtest.
Flags: needinfo?(vhua)
Memtest and hard disk diagnostics came back clean. I tried to reimage, but it keeps giving me an error message indicating not enough disk space.
Rebooting this machine always brings it back to this screen.
[vle@boris ~]$ /usr/sbin/fping foopy87.p7.releng.scl1.mozilla.com
foopy87.p7.releng.scl1.mozilla.com is alive
[vle@boris ~]$ ssh !$
ssh foopy87.p7.releng.scl1.mozilla.com
The authenticity of host 'foopy87.p7.releng.scl1.mozilla.com (10.12.134.23)' can't be established.
RSA key fingerprint is ee:4d:22:23:1d:b3:3d:78:bf:8d:7e:d1:03:0b:b5:6e.
Are you sure you want to continue connecting (yes/no)?
Nagios is alerting about this box since:
Thu 14:54:59 PDT [4857] foopy87.p7.releng.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

Doesn't seem to be related to bug 985129.
(In reply to Nick Thomas [:nthomas] from comment #15)
> Nagios is alerting about this box since:
> Thu 14:54:59 PDT [4857] foopy87.p7.releng.scl1.mozilla.com is DOWN :PING
> CRITICAL - Packet loss = 100%
>
> Doesn't seem to be related to bug 985129.

Certainly not from bug 985129. I suspect this is a bad machine that either has a boot CD in it (per :armen's c#13) or has other hardware problems.
Assignee: nobody → armenzg
Depends on: 992119
Assignee: armenzg → nobody
Host is online, responding to pings, and SSH'able. Current uptime:
06:47:06 up 8 days, 16:20, 1 user, load average: 0.00, 0.00, 0.00
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
and back offline again :( re-opening IT bug 992119
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
QA Contact: armenzg → bugspam.Callek
Using devices.json as my guide:
[root@foopy87.p7.releng.scl1.mozilla.com builds]# for i in {0685..0697}; do mkdir panda-$i; chown cltbld.cltbld panda-$i; done
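The loop above hard-codes the range 0685..0697 after reading devices.json by eye. As a rough sketch, the same list of directory names could be derived programmatically. Everything here is illustrative: the helper names are hypothetical, and the assumed devices.json shape (device name mapped to an object with a "foopy" key) is not confirmed by this bug. Ownership changes (the chown to cltbld) are deliberately left out, since they need root on the foopy.

```python
import os


def panda_names(first, last, width=4):
    """Zero-padded names, e.g. panda-0685 .. panda-0697 (inclusive)."""
    return [f"panda-{n:0{width}d}" for n in range(first, last + 1)]


def pandas_for_foopy(devices, foopy):
    """Pick the panda entries assigned to one foopy.

    Assumes a mapping shaped like {"panda-0685": {"foopy": "foopy87"}}.
    """
    return sorted(
        name
        for name, info in devices.items()
        if name.startswith("panda-") and info.get("foopy") == foopy
    )


def make_device_dirs(base, names):
    """Create one directory per device under base; safe to re-run."""
    created = []
    for name in names:
        path = os.path.join(base, name)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


if __name__ == "__main__":
    # Same 13 names as the shell brace expansion {0685..0697}.
    for name in panda_names(685, 697):
        print(name)
```

Using `exist_ok=True` makes the step repeatable after a reimage, where some directories may already be present.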
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard